Code Monkey home page Code Monkey logo

Comments (16)

ckormanyos avatar ckormanyos commented on July 18, 2024 2

Thanks for your interest and these optimization suggestions. If I recall, division counts the number of leading zero limbs for optimization. So we can discuss this also perhaps for multiplication. My first fear would be that it would slow down all multiplication for the sake of a few selected operations that happen to use small numbers. But we could selectively enable/disable as suggested. Iā€™m traveling now and will look into this next week

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 2

Hi Maksym (@maksymbevza), the OP is preliminarily handled by #363, which is now merged to master.

I will, for the moment, keep this issue open. Moving forward, we can open secondary issue(s) for additional points discovered herein, as needed.

from wide-integer.

johnmcfarlane avatar johnmcfarlane commented on July 18, 2024 1

from wide-integer.

maksymbevza avatar maksymbevza commented on July 18, 2024 1

@johnmcfarlane Indeed it can, thanks for pointing it out. We can guard it with -D at compile time or so. Also, I think it's already susceptible to this attack, due to the check if the first number's limb value is not zero.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 1

Hi Maksym (@maksymbevza) I've started looking into these kinds of optimizations in the clz_limb_optimizations_issue362 branch.

I'm considering using -DWIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS for these optimizations

In other words:

#define WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS

I started off a full batch of CI with WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS defined. Local tests were OK on the first-tried optimization. Let's see how it does with full sanitizers and all of CI.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 1

Hi Maksym (@maksymbevza) it's good to read of your progress.

I have now tested counting leading zero limbs. I actually think I can/will check that in to the main-branch soon with the compiler switch WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS.

For the 8-by-8 case, believe it or not, there is already an unrolled multiplication. To get this, you need to define the preprocessor switch WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL.

You can try the 8-by-8 switch and see if it helps your use-case.

In general, I'm reluctant-to-refusing to implement specific architecture assembly. I use this library with about 15-20 different architectures and I'm happy with its portability and overall performance. So it is unlikely that I will support an individual architectural optimization any time soon. You can do this in your fork, and, especially with the unrolled 8-by-8 multiplication, you might be able to see a simple way to put your own assembler in there.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 1

I can create another issue on the avx2 specifically to discuss it...

Hi Maksym (@maksymbevza) I am happy to discuss this. As mentioned above, please try first WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL.

As mentioned above, however, I'm unlikely to support machine-specific details. But I could discuss with you how you might be able to get these in something like a fork.

from wide-integer.

maksymbevza avatar maksymbevza commented on July 18, 2024 1

Sure, totally understand your point on architectural assembly.

Yes, I've seen WIDE_INTEGER_HAS_MUL_8_BY_8_UNROLL and it's implementation. Will do more of the benchmarking soon.

Thanks a lot for the help in optimizing this very point, will try to investigate the actual performance gain.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 1

Hi Maksym, in addition, there is a static member function called from_rep() that can be used directly for limb manipulations. But you would be responsible for shifting/halving/doubling the to-from limb types in the representation(s).

Also, I must admit, from_rep() is not quite as well tested as some other parts of the library.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024 1

The results of the speed-up are incredible, thanks a lot.

Great! I am happy this helped. This change was a great experience in the domain of local, application-specific optimizatoin. We reachad, I believe, a good point.

Thanks Maksym (@maksymbevza) for your feedback and for your original, good ideas.

If anything else comes up or you have other ideas, please contact any time.

Kind regards, Chris

I'll close this issue now with: Fixed by 6a3777b

from wide-integer.

maksymbevza avatar maksymbevza commented on July 18, 2024

Hello, @ckormanyos. Thanks a lot.

I spent time again trying to understand the biggest bottleneck and use case. Turns out we have the biggest use by the function which performs multiplication of who uint256 values a and b and division by another uint256 value c. It has logic guarantees that the result of division is uint256 and it converts back.

In the a*b/c formula I conditionally swapped a, b to make sure a < b, just in case to reuse the optimization of skipping the whole inner loop cycle on short a values provided by the library.

However, most of the time both a and b use 120-200 bits, which means this is not the main optimization point. Luckily since multiplication is performed under uint512 to hold the multiplied value, I know that 50% of higher limbs will be always empty for both a and b, which pretty much guarantees the use of the suggested optimization.

Also, I see that division can be yet another bottleneck, but I have not investigated it yet that much.

--

Another thing I thought about is using uint512 with 64-bit limb, unrolling 8x8 multiplication, and using avx2 which is widely adopted already in cloud platforms. Not sure if the compiler is clever enough to automatically construct relevant assembly to use avx2 just by enabling the -march=haswell/-mavx2 flag and no changes in the code itself.

By the way, on the topic of 64-bit limbs, I see that using 64-bit limbs hurt the performance bit of our function, and not it's not clear to me yet as to why. I suspect that conversion from 32-bit limbs to 64 and back is more expensive than the gain from it.

If you prefer, I can create another issue on the avx2 specifically to discuss it in more detail.

from wide-integer.

maksymbevza avatar maksymbevza commented on July 18, 2024

@ckormanyos Could you please hint if there's a quick way to cast from/to 64-bit limb type to 32-bit limb type integers? At the moment we use string-based conversion which is obviously very slow. I've seen import/export bits that probably can be used for this, but maybe there is another quick way?

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024

Regarding from_rep(), basically you need to get the representation of the source type, then manipulate its limbs manually. These manipulated limbs are then used in a specific call to from_rep() which acts like a factory to create the new wide-integer having different limb width.

I have not (yet) and might never support mixed limb-width conversions.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024

Hi Maksym (@maksymbevza) the optimizations we've discussed so far and agreed to attempt, whether they turn out to be needed or not, are cycling for inclusion in master in #363.

As soon as the PR gets merged, you can use the compile-time definition WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS to activate these.

These optimizations are disabled by default and the library has not significantly been changed in its default state.

from wide-integer.

ckormanyos avatar ckormanyos commented on July 18, 2024

As soon as the PR gets merged, you can use the compile-time definition WIDE_INTEGER_HAS_CLZ_LIMB_OPTIMIZATIONS to activate these.

The first round of optimizations is now in master.

from wide-integer.

maksymbevza avatar maksymbevza commented on July 18, 2024

Hello

The results of the speed-up are incredible, thanks a lot. Measured under google benchmark is the following.
All measurements below are under uint512 32-bit limb type. All test cases have only leading zero limbs, but never non-leading zero limbs (e.g. 0x0000...123456, but not 0x000..000123000...000). Tested on Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz

-  90-bit by  90-bit :  37ns ->  20ns (+85% op/s)
- 256-bit by 256-bit :  66ns ->  58ns (+13% op/s)
- 512-bit by 512-bit :  86ns -> 104ns (-17% op/s)

In our case we are mostly between 90 and 256 bits, and never over 256 bits for operands, yielding considerable, statistically significant improvement in multiplication.

Thanks again, @ckormanyos, for both building the library and for building the effective optimization.

from wide-integer.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    šŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. šŸ“ŠšŸ“ˆšŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ā¤ļø Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.