CUDASieve

A GPU-accelerated C++/CUDA C implementation of the segmented sieve of Eratosthenes

Update April 2021: The code compiles again with the latest version of CUDA on my computer. This required getting rid of some legacy GPU support. If you have a GPU of compute capability 3.2 to 5.0, you can make the required changes in the makefile, but I have removed those architectures from the main branch; that amounts to about six years of backwards compatibility. If this means your hardware won't work, your beef is with tech companies' embrace of planned obsolescence, friend.

CUDASieve is a high-performance segmented sieve of Eratosthenes for counting and generating prime numbers on Nvidia GPUs. This work contains some optimizations found in Ben Buhrow's CUDA Sieve of Eratosthenes, as well as an attempt at implementing Tomás Oliveira e Silva's bucket algorithm on the GPU. While this code is in no way as elegant as Kim Walisch's primesieve, the use of GPU acceleration allows a significant speedup. For those interested in building CUDASieve from source, the makefile will require changing CUDA_DIR. A smaller number of device architectures can be specified depending on your use case, though this has no effect on performance. A couple of hints are provided at the top of the makefile. Let me know if you have any hangups.

Binaries

The available binaries have been compiled for x86-64 Linux and Nvidia GPUs with compute capability 3.0 or greater (everything from the past four years).

libcudasieve.a - the CUDASieve library (see below)
cudasieve - the CUDASieve command line interface (type ./cudasieve --help for commands)
cstest - a command line utility for testing the correctness of CUDASieve

       A screenshot of the CUDASieve CLI

Benchmarks

With GTX 1080:

| Range | Time to generate list of sieving primes | Time to sieve this range | Total running time | Count |
| --- | --- | --- | --- | --- |
| 0 to 10^6 | 0.062 ms | 0.071 ms | 0.164 s | 78 498 |
| 0 to 10^7 | 0.063 ms | 0.198 ms | 0.127 s | 664 579 |
| 0 to 10^8 | 0.062 ms | 0.790 ms | 0.156 s | 5 761 455 |
| 0 to 10^9 | 0.063 ms | 5.65 ms | 0.161 s | 50 847 534 |
| 0 to 10^10 | 0.247 ms | 63.1 ms | 0.236 s | 455 052 511 |
| 0 to 10^12 | 0.258 ms | 12.3 s | 12.5 s | 37 607 912 018 |
| 0 to 2^50 | 0.768 ms | * | 28 653 s | 33 483 379 603 407 |
| 2^40 to 2^40 + 2^30 | 0.247 ms | 32.7 ms | 0.169 s | 38 726 266 |
| 2^50 to 2^50 + 2^30 | 0.806 ms | 34.5 ms | 0.206 s | 30 984 665 |
| 2^60 to 2^60 + 2^30 | 18.9 ms | 126 ms | 0.314 s | 25 818 737 |
| 2^64 - 2^36 to 2^64 - 2^36 + 2^30 | 49.0 ms | 225 ms | 0.446 s | 24 201 154 |

*Separate sieves for <2^40 and >=2^40


Hardware scaling for sieving 0 to 10^9:

| GPU | Time to generate list of sieving primes | Time to sieve this range | Total running time |
| --- | --- | --- | --- |
| GTX 750 | 0.100 ms | 67.5 ms | 0.128 s |
| GTX 950 | 0.106 ms | 36.4 ms | 0.105 s |
| GTX 1070 | 0.075 ms | 8.81 ms | 0.140 s |
| GTX 1080 | 0.069 ms | 6.03 ms | 0.121 s |

Additionally, this code contains a way of generating a list of sieving primes, in order, on the device that is much faster than the bottleneck of ordering them on the host. Generating the list of 189 961 800 primes from 38 to 4×10^9 takes just 50 ms. This is about 15.2 GB of primes per second (about the maximum speed of PCIe 3.0 x16)! Primes are prepared for printing in the same way: for example, the kernel time for preparing an array of all 25 818 737 primes from 2^60 to 2^60 + 2^30 and getting this array to the host is 157 ms on the GTX 1080.
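
A rough way to reproduce this kind of measurement from the host (a sketch; std::chrono gives wall-clock time, which includes allocation and transfer overhead on top of the kernel time quoted above):

#include <chrono>
#include <stdint.h>
#include <iostream>
#include <cuda_runtime.h>           // cudaFreeHost()
#include "CUDASieve/cudasieve.hpp"

int main()
{
    size_t len;
    uint64_t bottom = 1ull << 60;
    uint64_t top    = bottom + (1ull << 30);

    auto t0 = std::chrono::steady_clock::now();
    uint64_t * primes = CudaSieve::getHostPrimes(bottom, top, len);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << len << " primes in "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms" << std::endl;

    cudaFreeHost(primes);
    return 0;
}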

This implementation of Oliveira's bucket method requires a fixed 10 bytes of DRAM per sieving prime, which equates to just over 2 GB for sieving up to 2^64. Because large primes are handled in global memory rather than on-chip, increasing the number of blocks working on sieving these large primes does not increase the amount of memory used, since the data set is not duplicated.
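
As a back-of-the-envelope check of that figure (a sketch; it assumes one 10-byte bucket entry for each sieving prime below 2^32 = sqrt(2^64)):

#include <stdint.h>
#include <stdio.h>

int main()
{
    // Sieving up to 2^64 needs the sieving primes below sqrt(2^64) = 2^32.
    const uint64_t sievingPrimes = 203280221;   // pi(2^32)
    const uint64_t bytesPerPrime = 10;          // fixed cost stated above
    printf("%.2f GB\n", sievingPrimes * bytesPerPrime / 1e9);   // ~2.03 GB
    return 0;
}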

Correctness

CUDASieve has been checked against primesieve, both in counts and with Miller-Rabin primality tests of the 64k primes on each end of the output, using exponentially-distributed ranges of random length. At the moment, it has passed about 400 000 consecutive tests without error. When run concurrently with another application on the same device (e.g. folding@home), there appears to be about a 1 in 20 000 chance that the count will be off by one. These errors, however, are not repeatable when the program runs alone, even over 2 million trials. One million of those trials were on a device driving a display, so that does not impose a similar liability for error. These tests can be performed with the cstest binary.
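
A minimal version of such a count cross-check can be written against primesieve's C++ API (a sketch: the fixed range stands in for the random, exponentially-distributed ranges cstest uses, and it assumes both calls count the inclusive interval [bottom, top]):

#include <stdint.h>
#include <iostream>
#include <primesieve.hpp>           // primesieve::count_primes()
#include "CUDASieve/cudasieve.hpp"  // CudaSieve::countPrimes()

int main()
{
    uint64_t bottom = 1ull << 40;
    uint64_t top    = bottom + (1ull << 30);

    uint64_t gpuCount = CudaSieve::countPrimes(bottom, top);
    uint64_t cpuCount = primesieve::count_primes(bottom, top);

    std::cout << (gpuCount == cpuCount ? "counts match" : "MISMATCH") << std::endl;
    return gpuCount == cpuCount ? 0 : 1;
}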

      Example run of cstest

CUDASieve Library

If the CUDASieve/cudasieve.hpp header is #included, several public member functions of the CudaSieve class can be used, e.g. for creating host or device arrays of primes, by linking against the libcudasieve.a binary (with nvcc). For example:

/* main.cpp */

#include <iostream>
#include <stdint.h>
#include <math.h>                   // pow()
#include <cuda_runtime.h>           // cudaFreeHost()
#include "CUDASieve/cudasieve.hpp"  // CudaSieve::getHostPrimes()

int main()
{
    uint64_t bottom = pow(2,63);
    uint64_t top = pow(2,63)+pow(2,30);
    size_t len;

    uint64_t * primes = CudaSieve::getHostPrimes(bottom, top, len);

    for(uint32_t i = 0; i < len; i++)
        std::cout << primes[i] << std::endl;

    cudaFreeHost(primes);            // must be freed with this call b/c page-locked memory is used.
    return 0;
}

placed in the CUDASieve directory compiles with the command

nvcc -I include -lcudasieve -std=c++11 -arch=compute_30 main.cpp

and prints out the primes in the range 2^63 to 2^63 + 2^30.

                /* CudaSieve static member functions */
/* Returns count from 0 to top */
uint64_t countPrimes(uint64_t top);

/* Returns count from bottom to top, caveats mentioned below apply */
uint64_t countPrimes(uint64_t bottom, uint64_t top);

/* Returns pointer to a page-locked host array of the primes in [bottom, top] of length count.
   Memory must be freed with cudaFreeHost() */
uint64_t * getHostPrimes(uint64_t bottom, uint64_t top, size_t & count);

/* Returns a std::vector of the primes in the interval [bottom, top] */
std::vector<uint64_t> getHostPrimesVector(uint64_t bottom, uint64_t top, size_t count);

/* Returns pointer to a device array of primes in [bottom, top] of length count */
uint64_t * getDevicePrimes(uint64_t bottom, uint64_t top, size_t & count);

The above functions all have an optional last parameter that allows the user to specify the CUDA-enabled device used for the sieve. If this is not specified, the value defaults to 0. The thrust branch adds the ability to get a thrust::device_vector of primes at the cost of much larger binaries.
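
For example, running a count on the second CUDA device might look like this (a sketch; only the position and default value of the device parameter are taken from the description above):

#include <stdint.h>
#include <iostream>
#include "CUDASieve/cudasieve.hpp"

int main()
{
    // The trailing argument selects the CUDA device; it defaults to 0.
    uint64_t count = CudaSieve::countPrimes(0, 1ull << 32, 1);  // use device 1
    std::cout << count << std::endl;
    return 0;
}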

For many iterations, it is preferable to avoid some of the overhead associated with memory allocation and creating the list of sieving primes repeatedly. Fortunately, it is possible to do most of this work once by calling the CudaSieve constructor with two or three arguments (as shown below) and then calling the suitable non-static member function.

#include <math.h>                   // pow()
#include <stdint.h>
#include "CUDASieve/cudasieve.hpp"

int main()
{
  size_t len;
  uint64_t bottom = pow(2,60);
  uint64_t top = bottom + pow(2,50);
  uint64_t range = pow(2,30);

  CudaSieve * sieve = new CudaSieve(bottom, top, range);  // allocates memory and generates
                                                          // the sieving primes just once

  for(; bottom <= top; bottom += range){
    uint64_t * primes = sieve->getHostPrimesSegment(bottom, len);
    /* something productive */
  }

  delete sieve;
  return 0;
}

The above code creates a CudaSieve object with an appropriate list of sieving primes for ranges up to top, and with memory allocated for copying out arrays of primes covering range range, as long as they are above bottom. However, I have been having some issues with these functions and depend mostly on the static member functions, which are very reliable. The relevant non-static member functions are:

  CudaSieve(uint64_t bottom, uint64_t top, uint64_t range);

  uint64_t * getHostPrimesSegment(uint64_t bottom, size_t & count);
  uint64_t * getDevicePrimesSegment(uint64_t bottom, size_t & count);
  // Each returns a pointer to an array of primes in the range [bottom, bottom+range]
  // (the latter as specified when calling the constructor). An invalid bottom
  // returns NULL and sets count = 0.
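
A sketch of handling that failure mode inside the loop of the previous example (the check itself is illustrative, not taken from the library's documentation):

  size_t len;
  uint64_t * primes = sieve->getHostPrimesSegment(bottom, len);
  if (primes == NULL || len == 0) {
    /* bottom was outside the range given to the constructor */
  }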

Issues

There is a device memory leak of ~150 kB per iteration that can become problematic after several thousand iterations when using the functions described above, but it is of no consequence for the CLI. Strangely, it doesn't seem to be a result of cudaMalloc()/cudaFree() asymmetry, nor is it detected with cuda-memcheck... Anyways, if you need to run several thousand or more iterations of one of the above functions, it seems best to call cudaDeviceReset() after every 1000 or so. When called this infrequently, cudaDeviceReset() adds negligible time.
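
A sketch of that workaround with the static interface (note that cudaDeviceReset() also destroys any live CudaSieve object's device allocations, so an object-based loop would have to reconstruct the sieve afterwards):

#include <stdint.h>
#include <cuda_runtime.h>           // cudaFreeHost(), cudaDeviceReset()
#include "CUDASieve/cudasieve.hpp"

int main()
{
    size_t len;
    uint64_t bottom = 1ull << 40;
    uint64_t range  = 1ull << 30;

    for (uint64_t i = 0; i < 10000; i++, bottom += range) {
        uint64_t * primes = CudaSieve::getHostPrimes(bottom, bottom + range, len);
        /* something productive */
        cudaFreeHost(primes);

        if (i % 1000 == 999)
            cudaDeviceReset();      // reclaim the leaked device memory
    }
    return 0;
}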

There is an issue with selecting the non-default GPU in the CLI that causes timing to fail and counts to sometimes be off. However, this limitation does not carry over to the API, where selecting the non-default GPU does not cause problems.

State of the Project

Currently working on expanding the range of available functions, specifically the non-static ones. Let me know if you have any requests for features, and I'll see what I can do.


Issues from the GitHub Tracker

-bs Block size command line option

I am using Ubuntu 18.04 LTS as my operating system and I downloaded the binary provided for cudasieve. Even though I give -bs 1024 as an option, it keeps using 16 kB as the sieve size. Is this a bug or is it intended? I have a GTX 1080 as my GPU.

Binary hosting

Hi Curtis,

It took me quite some time to figure out that you have committed your binary and static library to your GitHub repo!? You are in fact the first guy I am aware of who has done this...

Personally I suggest that you create a dedicated Binary section in your README.md (with some description) where you offer a download link to a binary hosting website such as https://bintray.com (which is free for open source projects). I have done it the same way for my primecount project (https://github.com/kimwalisch/primecount#binaries).

This way your users will not waste 10 minutes to find out how to download your binary!

Greetings,
Kim

Example fails

After a successful compile I found the following not working:

Screenshot from 2022-05-25 22-14-58

Examples with exponents do not seem to work when invoked through sh inside the binary [?]

I would like to set bottom and top ranges for the sieve to measure output and performance in multi-GPU/node environments.

Any ideas toward solving this would be greatly appreciated :)

Windows version

Hello, I like this program, but I would like to use it under Windows. So I wanted to ask if it is possible to compile it somehow under Windows.

Verifying memory leaks

Hi!

I have been porting CUDASieve to HIP, and now I'm wondering if the memory leak mentioned in the README also occurs there (and if so, perhaps it's a good chance to debug it).
How do I verify that the memory leak occurs for me? How were you able to see it with CUDA?

Thanks!
Tuxifan

GPU Sieve make fail / binary links

Thank you for creating and sharing this project. I am currently working on a prime generation/storage related project and would like to use your project to generate said primes. Progress and code will also be shared under the same GitHub roof.

However, make gives me grief :)

Screenshot from 2022-05-25 20-44-02

This was compiled in a Docker build environment based on an official NVIDIA Docker image.
I get a similar error when building on my dev desktop with the NVIDIA driver, because of the default make behaviour.

Screenshot from 2022-05-25 21-01-56

The de facto default for CUDA is to reside in /usr/local/cuda; placing CUDA in /opt/ is fairly obsolete, but nothing a symlink won't fix :)

Changing that inside the makefile or adding a symlink gives a new error! 🥳

Screenshot from 2022-05-25 21-03-12

Do you have an idea here? I am building with CUDA Toolkit 11.7 on a supported platform.

[edit: found it, the repo needs an /obj folder to put the binary in]

Note: Trying to download the binaries gives a JFrog registration page that leads to a sales pitch from said company, no files. Maybe a link to the releases page is in order [?] [edit]
