
Comments (7)

kunzmi avatar kunzmi commented on July 24, 2024

Hi,

you set sumdsqKernel.BlockDimensions to the number of threads executed in parallel, e.g. (16,16), and then sumdsqKernel.GridDimensions to your problem size divided by the block size, as you did. Instead of computing the grid size manually, you can simplify things and just call sumdsqKernel.SetComputeSize(numRows, numCols) and ManagedCuda does the division for you.

In your example, you mixed grid and block dimensions:

sumdsqKernel.BlockDimensions = new dim3(gridDimX, gridDimY); //<-- should be blockSizeX, blockSizeY
sumdsqKernel.GridDimensions = new dim3(blockSizeX, blockSizeY); //<-- should be gridDimX, gridDimY

The maximum block dimensions you can set depend on your actual kernel (number of registers and shared memory used, etc.) and your GPU. You can query the maximum with sumdsqKernel.MaxThreadsPerBlock and set the dimensions so that blockDim.x * blockDim.y * blockDim.z <= sumdsqKernel.MaxThreadsPerBlock.
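To make the fix concrete, here is a sketch of the corrected launch configuration. Variable and kernel names are taken from the snippet above; the existence of numRows/numCols and an already-loaded sumdsqKernel is assumed from context.

```csharp
using ManagedCuda;
using ManagedCuda.VectorTypes;

// Sketch only: sumdsqKernel, numRows and numCols are assumed to exist
// as in the original code.
uint blockSizeX = 16, blockSizeY = 16;   // 16 * 16 = 256 threads <= MaxThreadsPerBlock
uint gridDimX = ((uint)numCols + blockSizeX - 1) / blockSizeX;  // ceiling division
uint gridDimY = ((uint)numRows + blockSizeY - 1) / blockSizeY;
sumdsqKernel.BlockDimensions = new dim3(blockSizeX, blockSizeY);
sumdsqKernel.GridDimensions = new dim3(gridDimX, gridDimY);

// Equivalent, letting ManagedCuda compute the grid from the block size:
sumdsqKernel.BlockDimensions = new dim3(blockSizeX, blockSizeY);
sumdsqKernel.SetComputeSize((uint)numRows, (uint)numCols);
```

Either way, the block dimensions are set first and the grid is derived from them, never the other way around.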

And as a general hint: do not create a CudaContext, load the kernel and allocate memory each time you want to run a kernel. You might also want to have a look at NPPs (or NPPi) functions like Norm-L2 or Norm-L2Sqr.

from managedcuda.

jdanielpa avatar jdanielpa commented on July 24, 2024

Thanks for the help, kunzmi. Adding:

sumdsqKernel.SetComputeSize((uint)numRows, (uint)numColumns);

and removing all my code to attempt to set the grid and block resolved the random crashes. However, when I run it and pass a matrix, the result I get back is different each time I call the method. The result is always close to what is expected but not quite correct. Could this be tied to memory allocations? Can you point me to an example where a CudaContext is not created? Again, I am very new to CUDA so I am sorry if my issues are trivial.


kunzmi avatar kunzmi commented on July 24, 2024

Usually you create the CudaContext once when the application starts; the same holds for loading kernels. Once you know the size of the memory allocations, you allocate them and reuse the buffers, avoiding allocating and freeing memory at each use. Memory allocations are costly in time, require a full device synchronisation, etc. - in short, they decrease performance a lot and are rarely necessary for each kernel call.
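A minimal sketch of that create-once/reuse pattern. The module file name "kernel.ptx", the entry point "sumdsq", and the kernel's parameter list are placeholders for the actual code; the ManagedCuda calls themselves (LoadKernel, CopyToDevice, Run, CopyToHost) are the standard API.

```csharp
using System;
using ManagedCuda;
using ManagedCuda.VectorTypes;

// Sketch: context, kernel and device buffers are created once and
// reused for every call, instead of per invocation.
class SumDsqGpu : IDisposable
{
    private readonly CudaContext ctx;
    private readonly CudaKernel kernel;
    private readonly CudaDeviceVariable<float> dInput;   // reused across calls
    private readonly CudaDeviceVariable<float> dResult;

    public SumDsqGpu(int numRows, int numCols)
    {
        ctx = new CudaContext();                            // once per application
        kernel = ctx.LoadKernel("kernel.ptx", "sumdsq");    // placeholder names
        kernel.BlockDimensions = new dim3(16, 16);
        kernel.SetComputeSize((uint)numRows, (uint)numCols);
        dInput = new CudaDeviceVariable<float>(numRows * numCols);
        dResult = new CudaDeviceVariable<float>(1);
    }

    public float Run(float[] matrix)
    {
        dInput.CopyToDevice(matrix);                // no new allocation here
        dResult.CopyToDevice(new float[] { 0f });   // reset the accumulator
        kernel.Run(dInput.DevicePointer, dResult.DevicePointer);
        float[] result = new float[1];
        dResult.CopyToHost(result);
        return result[0];
    }

    public void Dispose()
    {
        dResult.Dispose();
        dInput.Dispose();
        ctx.Dispose();
    }
}
```

Construct one SumDsqGpu at startup and call Run as often as needed; only the host-to-device copy and the kernel launch happen per call.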

What do you mean by "The result is close"? What order of magnitudes? Note that floating point arithmetic is not exact and that (a + b) + c is not the same as a + (b + c). Given that your kernel is using atomicAdd, the order in which the actual addition is performed is random, so your result will vary in the limits of floating point precision.
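The non-associativity is easy to reproduce on the CPU. This sketch uses three hand-picked floats where the two groupings give different answers:

```csharp
using System;

// Floating point addition is not associative: the same three numbers,
// grouped differently, give different results.
float a = 1e8f, b = -1e8f, c = 1f;
float left = (a + b) + c;   // (0) + 1 = 1
float right = a + (b + c);  // b + c rounds back to -1e8f, so the sum is 0
Console.WriteLine(left);    // 1
Console.WriteLine(right);   // 0
```

The spacing between adjacent floats near 1e8 is 8, so adding 1 to -1e8f is simply lost; an atomicAdd-based reduction hits exactly this effect in a nondeterministic order.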


jdanielpa avatar jdanielpa commented on July 24, 2024

In the image below, sumdsq is the result from a nested for loop in C# and is the expected value. I ran the ManagedCuda code 3 times and got different results each time. The differences seem too big to be floating point precision.

[image: the C# sumdsq value next to three differing ManagedCuda results]


kunzmi avatar kunzmi commented on July 24, 2024

Definitely looks like a floating point precision error. In the value range you have, you get a precision of ~1, which means that if you take one of the result numbers and add or subtract 0.1, nothing will change...
You can take the nested C# loops and let them run backwards; your result will change, too. You could also exchange the inner loop and outer loop; your result will change again, in a similar value range as the GPU results.
Given the larger difference between the CPU and GPU results and the differing GPU results, I'd even guess that your data has increasing values for increasing indices.
Some reading about floating points: https://blog.demofox.org/2017/11/21/floating-point-precision/
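The reversed-loop experiment can be sketched in plain C#: summing the same increasing values in two different orders already produces order-dependent float results, with no GPU involved (the data here is synthetic, just to illustrate the point).

```csharp
using System;

// Summing the same increasing values in two different orders:
// float results depend on summation order.
int n = 10_000_000;
float forward = 0f, backward = 0f;
for (int i = 1; i <= n; i++) forward += i;   // small values first
for (int i = n; i >= 1; i--) backward += i;  // large values first
double exact = (double)n * (n + 1) / 2;      // 50000005000000, exact in double
Console.WriteLine(forward);
Console.WriteLine(backward);
Console.WriteLine(exact);
// Both float sums are off from the exact value by far more than 0.1;
// the two orders typically disagree with each other as well.
```

Once the running sum is large, each new addend is rounded to the sum's much coarser precision, so the order in which the rounding happens changes the final value.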


jdanielpa avatar jdanielpa commented on July 24, 2024

Interesting - Thanks for the help in understanding this!

Separate question: do you have an SVD method in ManagedCUDA? I have a [1577021, 36] matrix that I need to pass into an SVD method. I need to get back the matrices u and v and the vector w. I am trying to replace my current method, which comes directly from:

http://numerical.recipes/webnotes/nr3web2.pdf

This method takes ~7 minutes to run. I need that greatly reduced. Any help would be greatly appreciated!


kunzmi avatar kunzmi commented on July 24, 2024

Have a look at cuSolver. ManagedCuda includes a wrapper for it.
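For orientation, a very rough sketch of what a cuSolver-based SVD could look like from C#. The wrapper type and method names used here (CudaSolveDense, Gesvd, mirroring cuSolver's cusolverDnSgesvd) are assumptions from memory; check the ManagedCuda.CudaSolve namespace for the exact API. cuSolver's dense gesvd requires rows >= cols, which a 1577021 x 36 matrix satisfies.

```csharp
using ManagedCuda;
using ManagedCuda.CudaSolve;  // assumed namespace for the cuSolver wrapper

// Rough sketch only; wrapper method names and signatures are assumptions.
int m = 1577021, n = 36;                         // rows >= cols, as gesvd requires
var ctx = new CudaContext();
var dA = new CudaDeviceVariable<float>(m * n);   // input matrix, column-major
var dS = new CudaDeviceVariable<float>(n);       // singular values (the vector w)
var dU = new CudaDeviceVariable<float>(m * n);   // thin U (full m x m would be ~10 TB)
var dVT = new CudaDeviceVariable<float>(n * n);  // V transposed
// dA.CopyToDevice(hostA);
// using (var solver = new CudaSolveDense())
//     solver.Gesvd(...);  // hypothetical wrapper call around cusolverDnSgesvd
// dS.CopyToHost(w); dU.CopyToHost(u); dVT.CopyToHost(vt);
```

Note that cuSolver returns V transposed rather than V, and the thin (economy) factorization is the only practical choice at this matrix shape.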

