Comments (7)
Hi,
you set sumdsqKernel.BlockDimensions to the number of threads executed in parallel per block, e.g. (16, 16), and then sumdsqKernel.GridDimensions to your problem size divided by the block size, as you did. Instead of computing the grid size manually, you can simplify things and just call sumdsqKernel.SetComputeSize(numRows, numCols)
and ManagedCuda does the division for you.
In your example, you mixed grid and block dimensions:
sumdsqKernel.BlockDimensions = new dim3(gridDimX, gridDimY); //<-- should be blockSizeX, blockSizeY
sumdsqKernel.GridDimensions = new dim3(blockSizeX, blockSizeY); //<-- should be gridDimX, gridDimY
The maximum block dimensions you can set depend on your actual kernel (number of registers and shared memory used, etc.) and your GPU. You can query the maximum with sumdsqKernel.MaxThreadsPerBlock
and set the dimensions so that blockDim.x * blockDim.y * blockDim.z <= sumdsqKernel.MaxThreadsPerBlock.
And as a general hint: do not create a CudaContext, load the kernel, and allocate memory each time you want to run a kernel. You might also want to have a look at NPP's (or NPPi's) functions like Norm-L2 or Norm-L2Sqr.
from managedcuda.
Thanks for the help, kunzmi. Adding:
sumdsqKernel.SetComputeSize((uint)numRows, (uint)numColumns);
and removing all my code that attempted to set the grid and block dimensions resolved the random crashes. However, when I run it and pass a matrix, the result I get back is different each time I call the method. The result is always close to what is expected, but not quite correct. Could this be tied to memory allocations? Can you point me to an example where a CudaContext is not created each time? Again, I am very new to CUDA, so I am sorry if my issues are trivial.
Usually you create the CudaContext once when the application starts; the same holds for loading kernels. Once you know the size of the memory allocations, allocate them once and reuse the buffers, avoiding allocating and freeing memory on each use. Memory allocations are costly, require a full device synchronisation, etc. - in short, they decrease performance a lot and are rarely necessary for each kernel call.
What do you mean by "the result is close"? By what order of magnitude? Note that floating point arithmetic is not exact and that (a + b) + c is not the same as a + (b + c). Given that your kernel uses atomicAdd, the order in which the additions are actually performed is random, so your result will vary within the limits of floating point precision.
In the image below, sumdsq is the result from a nested for loop in C# and is the expected value. I ran the ManagedCuda code 3 times and got different results each time. Too big a difference to be floating point precision.
It definitely looks like floating point precision error. In the value range that you have, you get a precision of ~1, which means that if you take one of the result numbers and add or subtract 0.1, nothing will change...
You can take the nested C# loops and run them backwards; your result will change, too. You could also swap the inner and outer loops; your result will change again, in a value range similar to the GPU results.
Given the larger difference between the CPU and GPU results, and the differing GPU results among themselves, I would even guess that your data has increasing values for increasing indices.
Some reading about floating points: https://blog.demofox.org/2017/11/21/floating-point-precision/
Interesting - Thanks for the help in understanding this!
Separate question: do you have an SVD method in ManagedCUDA? I have a matrix of size [1577021, 36] which I need to pass into an SVD routine. I need to get back the matrices u and v, and the vector w. I am trying to replace my current method, which comes directly from:
http://numerical.recipes/webnotes/nr3web2.pdf
This method takes ~7 minutes to run. I need that greatly reduced. Any help would be greatly appreciated!
Have a look at cuSolver. ManagedCuda includes a wrapper for it.
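For orientation, the call sequence behind such a wrapper looks roughly like the untested sketch below, written against the cuSolver dense C API (`cusolverDnSgesvd`); parameter choices such as `jobu = 'S'` are assumptions for a tall-and-skinny matrix, not taken from your code. Note that the legacy gesvd routine requires m >= n, which a 1577021 x 36 matrix satisfies.

```c
/* Sketch of the cuSolver dense SVD call sequence (error checking omitted).
   Computes A = U * S * V^T for a device-resident m x n matrix A, m >= n. */
#include <cusolverDn.h>
#include <cuda_runtime.h>

void svd_sketch(float *dA, int m, int n, float *dS, float *dU, float *dVT)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnSgesvd_bufferSize(handle, m, n, &lwork);

    float *dWork;
    int *devInfo;
    cudaMalloc((void **)&dWork, sizeof(float) * lwork);
    cudaMalloc((void **)&devInfo, sizeof(int));

    /* jobu = 'S': economy-size U (m x n) instead of the huge m x m matrix;
       jobvt = 'A': full n x n V^T. */
    cusolverDnSgesvd(handle, 'S', 'A', m, n, dA, m /*lda*/,
                     dS, dU, m /*ldu*/, dVT, n /*ldvt*/,
                     dWork, lwork, NULL /*rwork*/, devInfo);

    cudaFree(dWork);
    cudaFree(devInfo);
    cusolverDnDestroy(handle);
}
```

With n = 36, S holds your 36 singular values (the vector w in Numerical Recipes notation) and V^T is only 36 x 36; the economy U avoids allocating an unusable 1577021 x 1577021 matrix.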