
md5-simd's Issues

Quickly slows down

Hi
We tried using your library in our server and found that MD5 calculation quickly slows down.
I was able to reproduce the slowdown with the following example:

package main

import (
    "fmt"
    "time"
    "math/rand"
    "github.com/minio/md5-simd"
)

func main() {
    md5Server := md5simd.NewServer()
    n := 200
    sem := make(chan int)
    for i := 0; i < n; i++ {
        go func() {
            buf := make([]byte, 1048569)
            for i := 0; i < len(buf); i++ {
                buf[i] = byte(i * 53)
            }
            for i := 0; i < 1000; i++ {
                time.Sleep(time.Duration(1500 * rand.Float32()) * time.Millisecond)
                start := time.Now()
                c := md5Server.NewHash()
                for j := 0; j < 15; j++ {
                    c.Write(buf)
                }
                c.Sum(nil)
                c.Close()
                dur := time.Since(start)
                fmt.Printf("%.02f s\n", dur.Seconds())
            }
            sem <- 1
        }()
    }
    for i := 0; i < n; i++ {
        <-sem
    }
}

This isn't exactly our case: we actually keep the hashers from md5Server.NewHash() in a sync.Pool, so we don't Close() a hasher but Reset() it after each use and put it back in the pool. I couldn't easily reproduce the slowdown when using the pool.

I'm reporting the bug anyway; maybe you'll have an idea why this happens.

I also looked at the synchronisation/channel code and thought it might be better to use reflect.Select (https://golang.org/pkg/reflect/#Select) over all client channels instead of the nontrivial synchronisation based on N+1 channels. Of course, I'm not sure that this is what causes the slowdown.
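For reference, receiving from a dynamic set of channels with reflect.Select could be sketched as follows. This only illustrates the standard-library call; recvAny and the []byte channel type are assumptions made for the example and say nothing about how md5-simd's server is actually structured.

```go
package main

import (
	"fmt"
	"reflect"
)

// recvAny blocks until one of chans delivers a value or is closed. It returns
// the index of the channel that fired, the received data, and whether the
// value came from a real send (false means that channel was closed).
// recvAny is a hypothetical helper, not part of md5-simd.
func recvAny(chans []chan []byte) (int, []byte, bool) {
	cases := make([]reflect.SelectCase, len(chans))
	for i, ch := range chans {
		cases[i] = reflect.SelectCase{
			Dir:  reflect.SelectRecv,
			Chan: reflect.ValueOf(ch),
		}
	}
	chosen, val, ok := reflect.Select(cases)
	if !ok {
		return chosen, nil, false
	}
	return chosen, val.Bytes(), true
}

func main() {
	chans := []chan []byte{
		make(chan []byte, 1),
		make(chan []byte, 1),
		make(chan []byte, 1),
	}
	chans[1] <- []byte("block")
	idx, data, ok := recvAny(chans)
	fmt.Println(idx, string(data), ok)
}
```

Building the case slice on every call allocates, and reflect.Select is generally slower than a static select statement, so whether it would beat the existing scheme would need measuring.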

AVX512 Performance suggestions

Whilst doing research for my article on MD5 optimisation, I came across your blog post.

I had a quick skim of the assembly code here, and thought I'd offer up some suggestions if you're interested.

  • avoid unnecessarily mixing floating-point and integer instructions to avoid potential bypass delays (e.g. use VPXORD instead of VXORPS)
  • all bitwise logic should be handled by VPTERNLOG (i.e. don't do this, where a XOR operation that the ternary-logic instruction could handle is issued separately; use a move instruction to preserve the original value instead. Note that modern processors support move-elimination, so moves will be more efficient than logic)
  • avoid using gathers - do regular loads and unpack/permute everything into place (Intel CPUs only have one shuffle port, so I can see some appeal with avoiding shuffle-ops, but 32-bit gathers place so much load on the LSU that I doubt it's ever worth it)
  • consider interleaving two instruction streams to make better use of ILP (i.e. compute 32 hashes at a time, instead of 16, for AVX512)
  • consider using EVEX embedded broadcast for loading constants, rather than duplicating them in memory. If you're interleaving two streams, it may be better to use VPBROADCASTD for loading.

Note: for a good reference, check out Intel's multi-buffer MD5 implementation, which incorporates the suggestions above.

Hope you found that useful!

avx512: use VPTERNLOGQ for ternary operations

http://www.0x80.pl/articles/avx512-ternary-functions.html

Made this calculator: https://play.golang.org/p/JkeaBPpu2b-

ROUND1:

	VXORPS  c, tmp, tmp            \
[...]
	VANDPS  b, tmp, tmp            \
	VXORPS  d, tmp, tmp            \

This looks to be tmp = (((c ^ tmp) & b) ^ d). It should be possible to reduce this by one instruction.

Replacing the last two instructions gives tmp = ((tmp & b) ^ d). With the calculator's parameters (c=tmp, b=b, a=d), that means c = ((c & b) ^ a)

-> VPTERNLOG $120, d, b, tmp - https://play.golang.org/p/4A9Ex1-q9ft

ROUND3:

	VXORPS  d, tmp, tmp            \
	VXORPS  b, tmp, tmp            \

Easy, VPTERNLOG $150, b, d, tmp - https://play.golang.org/p/hNUMvRjSwQN

ROUND4:

	VORPS  b, tmp, tmp            \
	VXORPS c, tmp, tmp            \

This looks like tmp = (b | tmp) ^ c. With the calculator's parameters, c = (b | c) ^ a, so

-> VPTERNLOG $30, c, b, tmp - https://play.golang.org/p/2NqJElhLfSH
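The immediates can be cross-checked with a small truth-table calculator. The sketch below assumes the convention from the Intel manual, where the destination register supplies bit 2 of the table index, the second source bit 1, and the third source bit 0, and it assumes Go's assembler operand order (immediate, third source, second source, destination). Under those assumptions, and with the operand order written above, the three expressions come out as 0x6A (106), 0x96 (150) and 0x56 (86). The first and last differ from the $120 and $30 quoted above, which is what you get if the inputs are numbered in the opposite order, so it is worth verifying the convention before using either set of values. All names in the sketch are hypothetical.

```go
package main

import "fmt"

// Truth-table columns for the three inputs, assuming the destination register
// supplies the most significant index bit (an assumption; check the manual).
const (
	colDst  uint8 = 0xF0 // first source, also the destination (tmp)
	colSrc2 uint8 = 0xCC // second source
	colSrc3 uint8 = 0xAA // third source
)

// Immediates for the operand orders used in the text above.
const (
	round1Imm = (colDst & colSrc2) ^ colSrc3 // $imm, d, b, tmp: (tmp & b) ^ d
	round3Imm = colDst ^ colSrc2 ^ colSrc3   // $imm, b, d, tmp: tmp ^ d ^ b
	round4Imm = (colSrc2 | colDst) ^ colSrc3 // $imm, c, b, tmp: (b | tmp) ^ c
)

// ternlog emulates the ternary-logic lookup on a single 32-bit lane: each
// result bit is the bit of imm selected by the three corresponding input bits.
func ternlog(imm uint8, dst, src2, src3 uint32) uint32 {
	var out uint32
	for bit := uint(0); bit < 32; bit++ {
		idx := (dst>>bit&1)<<2 | (src2>>bit&1)<<1 | (src3 >> bit & 1)
		out |= uint32(imm>>idx&1) << bit
	}
	return out
}

func main() {
	fmt.Printf("round1=%#02x round3=%#02x round4=%#02x\n", round1Imm, round3Imm, round4Imm)

	// Compare the emulated instruction with the two-instruction sequences.
	var tmp, b, c, d uint32 = 0x12345678, 0x9ABCDEF0, 0x0F1E2D3C, 0xDEADBEEF
	fmt.Println(ternlog(round1Imm, tmp, b, d) == (tmp&b)^d)
	fmt.Println(ternlog(round3Imm, tmp, d, b) == tmp^d^b)
	fmt.Println(ternlog(round4Imm, tmp, b, c) == (b|tmp)^c)
}
```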

cpuid.CPU.AVX512F undefined (type cpuid.CPUInfo has no field or method AVX512F)

I'm trying to run minio-go on a virtual machine (and in a Docker environment).

When I try to build or go get, I get this error message:

../../../go/src/github.com/minio/md5-simd/block_amd64.go:86:23: cpuid.CPU.AVX512F undefined (type cpuid.CPUInfo has no field or method AVX512F)
../../../go/src/github.com/minio/md5-simd/md5-server_amd64.go:63:15: cpuid.CPU.AVX2 undefined (type cpuid.CPUInfo has no field or method AVX2)

I attached a quick example project.
minio-example.zip

Run make run, or manually run docker build -t minio-example . and you should see the same error.

license in block-generic.go

I'm packaging md5-simd in Debian. I noticed that the complete LICENSE file for block-generic.go is not included, and that the gen.go source file is also missing. Could you please add them to the repo?

// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

// Code generated by go run gen.go -output md5block.go; DO NOT EDIT.

As more hashers are created, the speed slows down

Hello, I encountered the same problem as in issue #33 (comment) in our production environment. We create a large number of Hasher instances in production, but not many exist at the same time. Hashers are created and destroyed very frequently, and every one is guaranteed to be closed eventually, yet throughput keeps decreasing, from 3GiB/s to 100MiB/s.
I used the test case from issue #33 (comment):

func TestSomething(t *testing.T) {
	var total int64
	var finish int64

	md5Server := md5simd.NewServer()
	n := 2000
	go func() {
		t := time.NewTicker(1 * time.Second)
		lastTime := time.Now()
		for range t.C {
			elapsed := time.Since(lastTime)
			processed := atomic.SwapInt64(&total, 0)
			finished := atomic.SwapInt64(&finish, 0)
			lastTime = time.Now()
			fmt.Printf("%0.2fGB/s finish:%d\n", float64(processed)/elapsed.Seconds()/float64(1<<30), finished)
		}
	}()

	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			buf := make([]byte, 1048569)
			for i := 0; i < len(buf); i++ {
				buf[i] = byte(i * 53)
			}
			for i := 0; i < 1000; i++ {
				time.Sleep(time.Duration(1500*rand.Float32()) * time.Millisecond)
				c := md5Server.NewHash()
				for j := 0; j < 15; j++ {
					c.Write(buf)
					atomic.AddInt64(&total, int64(len(buf)))
				}
				c.Sum(nil)
				c.Close()
				atomic.AddInt64(&finish, 1)
			}
		}()
	}
	wg.Wait()
}

and the final results are as follows:


=== RUN   TestSomething
2.62GB/s finish:0
3.05GB/s finish:0
3.49GB/s finish:0
3.48GB/s finish:0
2.88GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0
0.00GB/s finish:0

I suspect this is due to contention: when I use Reset instead of Close wherever possible, the speed declines much more slowly in production, but it still eventually drops to a very low level.

Data race

Write operations keep a reference to the data after they return.
