pospop's Issues

Bit matrix transpositions slow down counting

Hi Robert,

Thanks for persistently optimizing this great package.

I found that the bit matrix transpositions have slowed down counting (for the 100K buffer and other sizes) since d24d616b (Add new count8 variant using bit matrix transpositions). I know you're creating a general framework that can be ported to other platforms, so a small performance reduction may be tolerable.

I added some benchmark cases with small buffers (a few dozen to a few hundred bytes), which is what I use in my own cases.
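For readers who want to reproduce this kind of per-size measurement, here is a minimal sketch of how such sub-benchmarks might be laid out. It is not the package's actual benchmark code: `count8Scalar`, `benchSizes`, and `benchCount8` are hypothetical names, and the scalar loop merely stands in for the vectorised kernel being discussed.

```go
package main

import (
	"fmt"
	"testing"
)

// count8Scalar is a hypothetical scalar stand-in for a positional
// population count: counts[i] grows by one for every byte in buf
// that has bit i set.
func count8Scalar(counts *[8]int, buf []uint8) {
	for _, b := range buf {
		for i := 0; i < 8; i++ {
			counts[i] += int(b >> uint(i) & 1)
		}
	}
}

// benchSizes lists the buffer lengths that appear in the results in this issue.
var benchSizes = []int{32, 64, 128, 256, 512, 1000, 10000, 100000}

// benchCount8 runs one sub-benchmark per buffer size (assumed layout).
func benchCount8(b *testing.B) {
	for _, n := range benchSizes {
		n := n
		buf := make([]uint8, n)
		for i := range buf {
			buf[i] = uint8(i * 37) // arbitrary but deterministic fill
		}
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			var counts [8]int
			b.SetBytes(int64(n)) // makes the framework report MB/s
			for i := 0; i < b.N; i++ {
				count8Scalar(&counts, buf)
			}
		})
	}
}

func main() {
	// Quick sanity check of the stand-in on three bytes.
	var counts [8]int
	count8Scalar(&counts, []uint8{0x01, 0x03, 0x80})
	fmt.Println(counts)
}
```

Placed in a `_test.go` file under a name starting with `Benchmark`, a function shaped like `benchCount8` would run under `go test -bench`, which is what produces lines in the format shown here.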

Current version: d6e39e5

BenchmarkCount8/avx2/32-16              74935256                15.5 ns/op      2066.51 MB/s
BenchmarkCount8/avx2/64-16              55956423                20.3 ns/op      3153.58 MB/s
BenchmarkCount8/avx2/128-16             37906530                29.8 ns/op      4302.12 MB/s
BenchmarkCount8/avx2/256-16             24502731                50.8 ns/op      5038.58 MB/s
BenchmarkCount8/avx2/512-16             31988312                37.8 ns/op      13560.77 MB/s
BenchmarkCount8/avx2/1000-16            20510130                61.9 ns/op      16148.71 MB/s
BenchmarkCount8/avx2/10000-16            2245747               524 ns/op        19087.36 MB/s
BenchmarkCount8/avx2/100000-16            248845              4595 ns/op        21761.87 MB/s

With bit matrix transpositions: 41dbbc5 (includes a speedup made after d24d616b)

BenchmarkCount8/avx2/32-16              100472971               13.2 ns/op      2431.84 MB/s
BenchmarkCount8/avx2/64-16              66744648                17.9 ns/op      3568.20 MB/s
BenchmarkCount8/avx2/128-16             42810946                28.3 ns/op      4530.39 MB/s
BenchmarkCount8/avx2/256-16             20535319                56.7 ns/op      4516.74 MB/s
BenchmarkCount8/avx2/512-16             33010789                37.1 ns/op      13811.14 MB/s
BenchmarkCount8/avx2/1000-16            21271256                56.3 ns/op      17774.12 MB/s
BenchmarkCount8/avx2/10000-16            2517070               447 ns/op        22377.79 MB/s
BenchmarkCount8/avx2/100000-16            296733              4059 ns/op        24637.18 MB/s

Old but fast way: 677120e

BenchmarkCount8/avx2/32-16              181525946                7.16 ns/op     4466.72 MB/s
BenchmarkCount8/avx2/64-16              112528216               10.5 ns/op      6069.95 MB/s
BenchmarkCount8/avx2/128-16             63801217                18.7 ns/op      6836.36 MB/s
BenchmarkCount8/avx2/256-16             40247318                29.1 ns/op      8795.27 MB/s
BenchmarkCount8/avx2/512-16             38962676                28.7 ns/op      17869.65 MB/s
BenchmarkCount8/avx2/1000-16            20517376                57.8 ns/op      17289.99 MB/s
BenchmarkCount8/avx2/10000-16            2644093               432 ns/op        23135.55 MB/s
BenchmarkCount8/avx2/100000-16            295675              3913 ns/op        25554.00 MB/s

Eliminate temporary variables from the CSA operation

Consider:

// B:A = A+B+C (D is a scratch register, C is preserved)
#define CSA(A, B, C, D) \
	MOVOA A, D \
	PAND B, D \
	PXOR B, A \
	MOVOA A, B \
	PAND C, B \
	PXOR C, A \
	POR D, B

vs

// B:A = A+B+C (no scratch register, but C is clobbered)
#define CSA(A, B, C) \
	PXOR C, B \
	PXOR A, C \
	PXOR B, A \
	POR  C, B \
	PXOR A, B

In the second version, C is consumed by the very first instruction, so the C input must be ready 1 cycle earlier.

This is mainly for SSE2 platforms. AVX2/NEON instructions have non-destructive 3-operand forms.
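As a sanity check that the two instruction sequences compute the same result, each one can be modelled line by line on plain 64-bit words. This is only a sketch in Go rather than assembly, and `csaTemp` and `csaNoTemp` are names made up for the illustration.

```go
package main

import "fmt"

// csaTemp models the 7-instruction macro. It uses a scratch value d and
// never writes to c. It returns the new A (sum) and new B (carry).
func csaTemp(a, b, c uint64) (sum, carry uint64) {
	d := a // MOVOA A, D
	d &= b // PAND B, D
	a ^= b // PXOR B, A
	b = a  // MOVOA A, B
	b &= c // PAND C, B
	a ^= c // PXOR C, A
	b |= d // POR D, B
	return a, b
}

// csaNoTemp models the 5-instruction macro. It needs no scratch value,
// but overwrites its local copy of c along the way.
func csaNoTemp(a, b, c uint64) (sum, carry uint64) {
	b ^= c // PXOR C, B
	c ^= a // PXOR A, C
	a ^= b // PXOR B, A
	b |= c // POR  C, B
	b ^= a // PXOR A, B
	return a, b
}

func main() {
	// Bit lanes are independent, so these three constants cover all
	// eight combinations of one input bit per operand in a single call.
	var a, b, c uint64 = 0xF0, 0xCC, 0xAA

	s1, c1 := csaTemp(a, b, c)
	s2, c2 := csaNoTemp(a, b, c)

	// Reference full adder: sum is the parity, carry is the majority.
	wantSum := a ^ b ^ c
	wantCarry := a&b | a&c | b&c

	fmt.Printf("with temp:    sum=%#x carry=%#x\n", s1, c1)
	fmt.Printf("without temp: sum=%#x carry=%#x\n", s2, c2)
	fmt.Println("match:", s1 == wantSum && c1 == wantCarry && s2 == wantSum && c2 == wantCarry)
}
```

The model only checks the logic. It says nothing about latency or register pressure, which is where the two variants actually differ.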

Some microarchitectures perform "mov elimination", which makes register-to-register copies effectively free and so makes this change hard to benchmark.

IIRC, a problem I was having before was about how the compiler was merging a load with an xor instruction...

xor r1, [mem]

vs

load r2, [mem]
xor r1, r2

Not sure if this would be an issue with Go assembly.

======

This issue is not important, so feel free to close it; it's just one of my pet projects.
