Comments (16)
Thanks for your interest and these optimization suggestions. If I recall, division counts the number of leading zero limbs for optimization. So we can discuss this also perhaps for multiplication. My first fear would be that it would slow down all multiplication for the sake of a few selected operations that happen to use small numbers. But we could selectively enable/disable as suggested. Iām traveling now and will look into this next week
from wide-integer.
Hi Maksym (@maksymbevza), the OP is preliminarily handled by #363, which is now merged to master.
I will, for the moment, keep this issue open. Moving forward, we can open secondary issue(s) for additional points discovered herein, as needed.
from wide-integer.
from wide-integer.
@johnmcfarlane Indeed it can, thanks for pointing it out. We can guard it with -D at compile time or so. Also, I think it's already susceptible to this attack, due to the check if the first number's limb value is not zero.
from wide-integer.
Hi Maksym (@maksymbevza) I've started looking into these kinds of optimizations in the clz_limb_optimizations_issue362
branch.
I'm considering using -DWIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
for these optimizations
In other words:
#define WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
I started off a full batch of CI with WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
defined. Local tests were OK on the first-tried optimization. Let's see how it does with full sanitizers and all of CI.
from wide-integer.
Hi Maksym (@maksymbevza) it's good to read of your progress.
I have now tested counting leading zero limbs. I actually think I can/will check that in to the main-branch soon with the compiler switch WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
.
For the 8-by-8 case, believe it or not, there is already an unrolled multiplication. To get this, you need to define the preprocessor switch WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL
.
You can try the 8-by-8 switch and see if it helps your use-case.
In general, I'm reluctant-to-refusing to implement specific architecture assembly. I use this library with about 15-20 different architectures and I'm happy with its portability and overall performance. So it is unlikely that I will support an individual architectural optimization any time soon. You can do this in your fork, and, especially with the unrolled 8-by-8 multiplication, you might be able to see a simple way to put your own assembler in there.
from wide-integer.
I can create another issue on the
avx2
specifically to discuss it...
Hi Maksym (@maksymbevza) I am happy to discuss this. As mentioned above, please try first WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL
.
As mentioned above, however, I'm unlikely to support machine-specific details. But I could discuss with you how you might be able to get these in something like a fork.
from wide-integer.
Sure, totally understand your point on architectural assembly.
Yes, I've seen WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL
and it's implementation. Will do more of the benchmarking soon.
Thanks a lot for the help in optimizing this very point, will try to investigate the actual performance gain.
from wide-integer.
Hi Maksym, in addition, there is a static member function called from_rep()
that can be used directly for limb manipulations. But you would be responsible for shifting/halving/doubling the to-from limb types in the representation(s).
Also, I must admit, from_rep()
is not quite as well tested as some other parts of the library.
from wide-integer.
The results of the speed-up are incredible, thanks a lot.
Great! I am happy this helped. This change was a great experience in the domain of local, application-specific optimizatoin. We reachad, I believe, a good point.
Thanks Maksym (@maksymbevza) for your feedback and for your original, good ideas.
If anything else comes up or you have other ideas, please contact any time.
Kind regards, Chris
I'll close this issue now with: Fixed by 6a3777b
from wide-integer.
Hello, @ckormanyos. Thanks a lot.
I spent time again trying to understand the biggest bottleneck and use case. Turns out we have the biggest use by the function which performs multiplication of who uint256
values a
and b
and division by another uint256
value c
. It has logic guarantees that the result of division is uint256
and it converts back.
In the a*b/c
formula I conditionally swapped a, b
to make sure a < b
, just in case to reuse the optimization of skipping the whole inner loop cycle on short a
values provided by the library.
However, most of the time both a
and b
use 120-200 bits, which means this is not the main optimization point. Luckily since multiplication is performed under uint512
to hold the multiplied value, I know that 50% of higher limbs will be always empty for both a
and b
, which pretty much guarantees the use of the suggested optimization.
Also, I see that division can be yet another bottleneck, but I have not investigated it yet that much.
--
Another thing I thought about is using uint512
with 64-bit limb, unrolling 8x8 multiplication, and using avx2
which is widely adopted already in cloud platforms. Not sure if the compiler is clever enough to automatically construct relevant assembly to use avx2
just by enabling the -march=haswell
/-mavx2
flag and no changes in the code itself.
By the way, on the topic of 64-bit limbs, I see that using 64-bit limbs hurt the performance bit of our function, and not it's not clear to me yet as to why. I suspect that conversion from 32-bit limbs to 64 and back is more expensive than the gain from it.
If you prefer, I can create another issue on the avx2
specifically to discuss it in more detail.
from wide-integer.
@ckormanyos Could you please hint if there's a quick way to cast from/to 64-bit limb type to 32-bit limb type integers? At the moment we use string-based conversion which is obviously very slow. I've seen import/export bits that probably can be used for this, but maybe there is another quick way?
from wide-integer.
Regarding from_rep()
, basically you need to get the representation of the source type, then manipulate its limbs manually. These manipulated limbs are then used in a specific call to from_rep()
which acts like a factory to create the new wide-integer having different limb width.
I have not (yet) and might never support mixed limb-width conversions.
from wide-integer.
Hi Maksym (@maksymbevza) the optimizations we've discussed so far and agreed to attempt, whether they turn out to be needed or not, are cycling for inclusion in master in #363.
As soon as the PR gets merged, you can use the compile-time definition WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
to activate these.
These optimizations are disabled by default and the library has not significantly been changed in its default state.
from wide-integer.
As soon as the PR gets merged, you can use the compile-time definition
WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS
to activate these.
The first round of optimizations is now in master.
from wide-integer.
Hello
The results of the speed-up are incredible, thanks a lot. Measured under google benchmark is the following.
All measurements below are under uint512
32-bit limb type. All test cases have only leading zero limbs, but never non-leading zero limbs (e.g. 0x0000...123456
, but not 0x000..000123000...000
). Tested on Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
- 90-bit by 90-bit : 37ns -> 20ns (+85% op/s)
- 256-bit by 256-bit : 66ns -> 58ns (+13% op/s)
- 512-bit by 512-bit : 86ns -> 104ns (-17% op/s)
In our case we are mostly between 90 and 256 bits, and never over 256 bits for operands, yielding considerable, statistically significant improvement in multiplication.
Thanks again, @ckormanyos, for both building the library and for building the effective optimization.
from wide-integer.
Related Issues (20)
- equivalent to mp::import_bits, mp::export_bits? HOT 5
- Upgrade docs to use Markdown's own LaTeX interpreter
- Handle Ubuntu 18.04 Deprecation on GHA
- about import_bits/export_bits HOT 9
- Replace (soon to be deprectated) LGTM with GitHub's CodeQL
- Cannot use integers in template parameters due to private data members. HOT 6
- Support signed/unsigned floored division (like Python // operator or divmod) HOT 3
- Adapt for GCC 13 HOT 2
- valgrind investigations HOT 6
- Repair GCD reduction which is broken for certain small values
- Refactoring has lead to slight performance drop and needs repair HOT 2
- Add cygwin run in CI HOT 2
- Update and modernize CI HOT 2
- Client request C++14 constexpr-ness HOT 1
- Construction from Constexpr StringView HOT 11
- Clean up constexpr-ness HOT 1
- Coverity scan for `uintwide_t`, example and test software HOT 1
- Repair cygwin in CI and upgrade all CI HOT 1
- wide_integer.h header erroneously uses std::lexicographical compare
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
š Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ššš
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ā¤ļø Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from wide-integer.