
Comments (7)

kunzmi avatar kunzmi commented on July 24, 2024

Hi,

you set sumdsqKernel.BlockDimensions to the number of threads executed in parallel, e.g. (16,16), and then sumdsqKernel.GridDimensions to your problem size divided by the block size, as you did. Instead of computing the grid size manually, you can simplify things and just call sumdsqKernel.SetComputeSize(numRows, numCols) and ManagedCuda does the division for you.

In your example, you mixed grid and block dimensions:

sumdsqKernel.BlockDimensions = new dim3(gridDimX, gridDimY); //<-- should be blockSizeX, blockSizeY
sumdsqKernel.GridDimensions = new dim3(blockSizeX, blockSizeY); //<-- should be gridDimX, gridDimY

The maximum block dimensions you can set depend on your actual kernel (number of registers and shared memory used, etc.) and your GPU. You can query the maximum with sumdsqKernel.MaxThreadsPerBlock and set the dimensions so that blockDim.x * blockDim.y * blockDim.z <= sumdsqKernel.MaxThreadsPerBlock.
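To make the fix concrete, here is a sketch of the corrected launch configuration. Variable and kernel names are taken from the snippet above; the existence of numRows/numCols and an already-loaded sumdsqKernel is assumed from context.

```csharp
using ManagedCuda;
using ManagedCuda.VectorTypes;

// Sketch only: sumdsqKernel, numRows and numCols are assumed to exist
// as in the original code.
uint blockSizeX = 16, blockSizeY = 16;   // 16 * 16 = 256 threads <= MaxThreadsPerBlock
uint gridDimX = ((uint)numCols + blockSizeX - 1) / blockSizeX;  // ceiling division
uint gridDimY = ((uint)numRows + blockSizeY - 1) / blockSizeY;
sumdsqKernel.BlockDimensions = new dim3(blockSizeX, blockSizeY);
sumdsqKernel.GridDimensions = new dim3(gridDimX, gridDimY);

// Equivalent, letting ManagedCuda compute the grid from the block size:
sumdsqKernel.BlockDimensions = new dim3(blockSizeX, blockSizeY);
sumdsqKernel.SetComputeSize((uint)numRows, (uint)numCols);
```

Either way, the block dimensions are set first and the grid is derived from them, never the other way around.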

And as a general hint: do not create a CudaContext, load the kernel and allocate memory each time you want to run a kernel. You might also want to have a look at NPPs (or NPPi) functions like Norm-L2 or Norm-L2Sqr.

from managedcuda.

jdanielpa avatar jdanielpa commented on July 24, 2024

Thanks for the help, kunzmi. Adding:

sumdsqKernel.SetComputeSize((uint)numRows, (uint)numColumns);

and removing all my code to attempt to set the grid and block resolved the random crashes. However, when I run it and pass a matrix, the result I get back is different each time I call the method. The result is always close to what is expected but not quite correct. Could this be tied to memory allocations? Can you point me to an example where a CudaContext is not created? Again, I am very new to CUDA so I am sorry if my issues are trivial.


kunzmi avatar kunzmi commented on July 24, 2024

Usually you create the CudaContext once when the application starts; the same holds for loading kernels. Once you know the size of the memory allocations, you allocate them and reuse the buffers, avoiding allocating and freeing memory at each use. Memory allocations are costly in time, require a full device synchronisation, etc. - in short, they decrease performance a lot and are rarely necessary for each kernel call.
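A minimal sketch of that create-once/reuse pattern. The module file name "kernel.ptx", the entry point "sumdsq", and the kernel's parameter list are placeholders for the actual code; the ManagedCuda calls themselves (LoadKernel, CopyToDevice, Run, CopyToHost) are the standard API.

```csharp
using System;
using ManagedCuda;
using ManagedCuda.VectorTypes;

// Sketch: context, kernel and device buffers are created once and
// reused for every call, instead of per invocation.
class SumDsqGpu : IDisposable
{
    private readonly CudaContext ctx;
    private readonly CudaKernel kernel;
    private readonly CudaDeviceVariable<float> dInput;   // reused across calls
    private readonly CudaDeviceVariable<float> dResult;

    public SumDsqGpu(int numRows, int numCols)
    {
        ctx = new CudaContext();                            // once per application
        kernel = ctx.LoadKernel("kernel.ptx", "sumdsq");    // placeholder names
        kernel.BlockDimensions = new dim3(16, 16);
        kernel.SetComputeSize((uint)numRows, (uint)numCols);
        dInput = new CudaDeviceVariable<float>(numRows * numCols);
        dResult = new CudaDeviceVariable<float>(1);
    }

    public float Run(float[] matrix)
    {
        dInput.CopyToDevice(matrix);                // no new allocation here
        dResult.CopyToDevice(new float[] { 0f });   // reset the accumulator
        kernel.Run(dInput.DevicePointer, dResult.DevicePointer);
        float[] result = new float[1];
        dResult.CopyToHost(result);
        return result[0];
    }

    public void Dispose()
    {
        dResult.Dispose();
        dInput.Dispose();
        ctx.Dispose();
    }
}
```

Construct one SumDsqGpu at startup and call Run as often as needed; only the host-to-device copy and the kernel launch happen per call.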

What do you mean by "The result is close"? What order of magnitudes? Note that floating point arithmetic is not exact and that (a + b) + c is not the same as a + (b + c). Given that your kernel is using atomicAdd, the order in which the actual addition is performed is random, so your result will vary in the limits of floating point precision.
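The non-associativity is easy to reproduce on the CPU. This sketch uses three hand-picked floats where the two groupings give different answers:

```csharp
using System;

// Floating point addition is not associative: the same three numbers,
// grouped differently, give different results.
float a = 1e8f, b = -1e8f, c = 1f;
float left = (a + b) + c;   // (0) + 1 = 1
float right = a + (b + c);  // b + c rounds back to -1e8f, so the sum is 0
Console.WriteLine(left);    // 1
Console.WriteLine(right);   // 0
```

The spacing between adjacent floats near 1e8 is 8, so adding 1 to -1e8f is simply lost; an atomicAdd-based reduction hits exactly this effect in a nondeterministic order.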


jdanielpa avatar jdanielpa commented on July 24, 2024

In the image below, sumdsq is the result from a nested for loop in C# and is the expected value. I ran the ManagedCuda code 3 times and got different results each time. The differences seem too big to be floating point precision.

[image: the C# sumdsq value next to three differing ManagedCuda results]


kunzmi avatar kunzmi commented on July 24, 2024

Definitely looks like a floating point precision error. In the value range you have, you get a precision of ~1, which means that if you take one of the result numbers and add or subtract 0.1, nothing will change...
You can take the nested C# loops and let them run backwards; your result will change, too. You could also exchange the inner loop and outer loop; your result will change again, in a similar value range as the GPU results.
Given the larger difference between the CPU and GPU results and the differing GPU results, I'd even guess that your data has increasing values for increasing indices.
Some reading about floating points: https://blog.demofox.org/2017/11/21/floating-point-precision/
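The reversed-loop experiment can be sketched in plain C#: summing the same increasing values in two different orders already produces order-dependent float results, with no GPU involved (the data here is synthetic, just to illustrate the point).

```csharp
using System;

// Summing the same increasing values in two different orders:
// float results depend on summation order.
int n = 10_000_000;
float forward = 0f, backward = 0f;
for (int i = 1; i <= n; i++) forward += i;   // small values first
for (int i = n; i >= 1; i--) backward += i;  // large values first
double exact = (double)n * (n + 1) / 2;      // 50000005000000, exact in double
Console.WriteLine(forward);
Console.WriteLine(backward);
Console.WriteLine(exact);
// Both float sums are off from the exact value by far more than 0.1;
// the two orders typically disagree with each other as well.
```

Once the running sum is large, each new addend is rounded to the sum's much coarser precision, so the order in which the rounding happens changes the final value.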


jdanielpa avatar jdanielpa commented on July 24, 2024

Interesting - Thanks for the help in understanding this!

Separate question: do you have an SVD method in ManagedCUDA? I have a [1577021, 36] matrix that I need to pass into an SVD method. I need to get back the matrices u and v and the vector w. I am trying to replace my current method, which comes directly from:

http://numerical.recipes/webnotes/nr3web2.pdf

This method takes ~7 minutes to run. I need that greatly reduced. Any help would be greatly appreciated!


kunzmi avatar kunzmi commented on July 24, 2024

Have a look at cuSolver. ManagedCuda includes a wrapper for it.
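For orientation, a very rough sketch of what a cuSolver-based SVD could look like from C#. The wrapper type and method names used here (CudaSolveDense, Gesvd, mirroring cuSolver's cusolverDnSgesvd) are assumptions from memory; check the ManagedCuda.CudaSolve namespace for the exact API. cuSolver's dense gesvd requires rows >= cols, which a 1577021 x 36 matrix satisfies.

```csharp
using ManagedCuda;
using ManagedCuda.CudaSolve;  // assumed namespace for the cuSolver wrapper

// Rough sketch only; wrapper method names and signatures are assumptions.
int m = 1577021, n = 36;                         // rows >= cols, as gesvd requires
var ctx = new CudaContext();
var dA = new CudaDeviceVariable<float>(m * n);   // input matrix, column-major
var dS = new CudaDeviceVariable<float>(n);       // singular values (the vector w)
var dU = new CudaDeviceVariable<float>(m * n);   // thin U (full m x m would be ~10 TB)
var dVT = new CudaDeviceVariable<float>(n * n);  // V transposed
// dA.CopyToDevice(hostA);
// using (var solver = new CudaSolveDense())
//     solver.Gesvd(...);  // hypothetical wrapper call around cusolverDnSgesvd
// dS.CopyToHost(w); dU.CopyToHost(u); dVT.CopyToHost(vt);
```

Note that cuSolver returns V transposed rather than V, and the thin (economy) factorization is the only practical choice at this matrix shape.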

