coffeebeforearch / cuda_programming

Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch

License: GNU General Public License v3.0

Cuda 100.00%

cuda_programming's Issues

`sumReduction` modification suggestions

How do I modify code to accommodate arrays of arbitrary length instead of powers of 2 in sumReduction?
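One common approach, sketched here on the CPU rather than taken from the repository, is to round the number of blocks up and substitute 0 (the identity for addition) whenever a thread's index falls past the end of the input. The names `blocks_for`, `reduce_pass`, and `reduce_sum` are hypothetical, and the sketch assumes the block size is a power of two of at least 2 while the array length is arbitrary.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: number of thread blocks needed to cover n elements.
std::size_t blocks_for(std::size_t n, std::size_t tb_size) {
  return (n + tb_size - 1) / tb_size;  // round up instead of assuming n % tb_size == 0
}

// CPU model of one reduction pass. Each "block" loads tb_size values into a
// scratch buffer (standing in for shared memory), substituting 0 for indices
// past the end of the input, then halves the active range until one value
// remains. Assumes tb_size is a power of two and at least 2.
std::vector<int> reduce_pass(const std::vector<int>& in, std::size_t tb_size) {
  const std::size_t n = in.size();
  std::vector<int> partial(blocks_for(n, tb_size));
  std::vector<int> scratch(tb_size);
  for (std::size_t block = 0; block < partial.size(); ++block) {
    for (std::size_t t = 0; t < tb_size; ++t) {
      const std::size_t i = block * tb_size + t;
      scratch[t] = (i < n) ? in[i] : 0;  // boundary check on load
    }
    for (std::size_t s = tb_size / 2; s > 0; s /= 2) {
      for (std::size_t t = 0; t < s; ++t) scratch[t] += scratch[t + s];
    }
    partial[block] = scratch[0];
  }
  return partial;
}

// Repeat passes until a single value is left.
int reduce_sum(std::vector<int> v, std::size_t tb_size) {
  if (v.empty()) return 0;
  while (v.size() > 1) v = reduce_pass(v, tb_size);
  return v[0];
}
```

In a kernel, the same idea would correspond to guarding the load into shared memory with the thread's global index and launching a rounded-up number of blocks; the tree reduction itself stays unchanged because the padded slots contribute nothing to the sum.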

Assertion `temp == result[i]' failed.

When I run your ~/1d_constant_memory/convolution.cu with CUDA compilation tools, release 7.5, V7.5.17 on Ubuntu 16.04, I get the following error:

~/Resources/Github/cuda_programming/convolution/1d_constant_memory $ nvcc convolution.cu -o convolution.x 
~/Resources/Github/cuda_programming/convolution/1d_constant_memory $ ./convolution.x 
convolution.x: convolution.cu:58: void verify_result(int*, int*, int*, int): Assertion `temp == result[i]' failed.
aborted (core dumped)

Do you have any suggestions? TIA.

Unexpected results with Memory Coalescing

Hi, I am using the following system configuration:

Windows 10
Visual Studio 2019 Community
Cuda 10.2
Nvidia Nsight Compute 2019.5.0
Nvidia RTX 2060 GPU (Turing Architecture)

I am following your tutorials on YouTube and used the file alignment_matrix_mul.cu in three configurations:

No transpose (just as we were doing it before)
Transpose a matrix (temp_sum += a[k * n + row] * b[col + n * k];)
Transpose b matrix (temp_sum += a[k + n * row] * b[col * n + k];)

We would expect that the GPU would perform best when we transpose matrix a, as the memory accesses for each thread are coalesced in this way, but the profiling shows that it performs better when I transpose matrix b.

The only thing I am doing differently is that I am using Nsight Compute as a separate application to profile the binary built by Visual Studio, rather than the built-in extension. I am also attaching the performance images I got:

No Transpose: https://drive.google.com/file/d/18-l8W3csIjCRRoxgASsWevjRIV9hINXp/view?usp=sharing
Transpose a matrix: https://drive.google.com/file/d/1rPwMpalSwfVpZ8-jBpO3ROL1R7POAzRt/view?usp=sharing
Transpose b matrix: https://drive.google.com/file/d/1WHIQBRRk1KjJk5MXVUc4AopGzqWPDwFh/view?usp=sharing

I have double-checked the transpositions and this is what I get. Could some other bottleneck be causing these results? For example, could the cost of fetching multiple elements in the loop (index k) outweigh the benefit of coalesced access?
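Which access counts as coalesced depends on how far apart neighbouring threads land in memory, so it can help to work out the strides for each configuration. The sketch below does that on the CPU for the index expressions quoted above. It is an illustration under stated assumptions, not a diagnosis: the baseline expression is assumed (it is not quoted in the issue), and the helper names `stride_along_col` and `stride_along_row` are hypothetical. Whether `row` or `col` varies fastest across the threads of a warp depends on how the kernel maps `threadIdx.x` and `threadIdx.y`, which is not shown here.

```cpp
#include <cstdio>

// Element offsets into a and b for one loop iteration k of one thread.
struct Access { long a; long b; };
using IndexFn = Access (*)(long row, long col, long k, long n);

// Assumed baseline (not quoted in the issue): row-major a and b.
Access no_transpose(long row, long col, long k, long n) { return {row * n + k, k * n + col}; }
// Quoted in the issue: a[k * n + row] * b[col + n * k]
Access transpose_a(long row, long col, long k, long n) { return {k * n + row, col + n * k}; }
// Quoted in the issue: a[k + n * row] * b[col * n + k]
Access transpose_b(long row, long col, long k, long n) { return {k + n * row, col * n + k}; }

// Distance in elements between the accesses of threads (row, col) and (row, col + 1).
Access stride_along_col(IndexFn f, long n) {
  const Access t0 = f(0, 0, 0, n), t1 = f(0, 1, 0, n);
  return {t1.a - t0.a, t1.b - t0.b};
}

// Distance in elements between the accesses of threads (row, col) and (row + 1, col).
Access stride_along_row(IndexFn f, long n) {
  const Access t0 = f(0, 0, 0, n), t1 = f(1, 0, 0, n);
  return {t1.a - t0.a, t1.b - t0.b};
}

// Print both stride pairs for one configuration.
void print_strides(const char* name, IndexFn f, long n) {
  const Access c = stride_along_col(f, n), r = stride_along_row(f, n);
  std::printf("%s: col-neighbours a=%ld b=%ld, row-neighbours a=%ld b=%ld\n",
              name, c.a, c.b, r.a, r.b);
}
```

A stride of 1 means neighbouring threads read adjacent elements, a stride of 0 means they read the same element, and a stride of n means each thread touches a different row of the matrix. Calling `print_strides` for each configuration with the matrix dimension used in the experiment gives a quick way to compare the expected access patterns against the profiler output.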

Consulting

Hi @CoffeeBeforeArch,

I'm not sure where else to put this, as I don't have a Twitter account and don't plan on getting one (and getting verified) just to get in touch.

Is there any chance you do consulting? If so, would you be open to being hired for a few hours to help me optimize a specific kernel?

Thanks very much,
Ian

example vectorAdd c is zero after adding

I am using

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Running the example returns:

vectorAdd.cu:24: void verify_result(std::vector<int>&, std::vector<int>&, std::vector<int>&): Assertion `c[i] == a[i] + b[i]' failed.
[1]    49753 abort (core dumped)  ./prog

If I print the first few elements of a, b, and c after running the kernel and copying the data from device to host, every element of c is zero.

vector_add_um.cu shows "abort() has been called"

The environment:
Windows 10
VS 2017
CUDA 10.1

The snapshot of this error:
https://drive.google.com/open?id=1V-Mv2xk9Leny3GEUYsMkBWBHumWastxU

However, running the same code in a Linux environment produces no error.

Any help is appreciated.

Issues with the code of SUM Reduction

cuda_programming/03_sum_reduction/diverged/sumReduction.cu

Line 75 in 8711be0

sumReduction<<<1, TB_SIZE>>> (d_v_r, d_v_r);

Shouldn't the second call be <<<1, GRID_SIZE>>> instead of <<<1, TB_SIZE>>>? I think GRID_SIZE is the number of partial sums.

Different TB_SIZE in 03_sum_reduction/diverged won't pass assertion

I found that setting TB_SIZE and SH_SIZE to 128 does not give the expected result of 65536.
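A small CPU model may help show why the block size matters here. It assumes an input of n ones (consistent with the expected result of 65536), a first launch of n / TB_SIZE blocks that each write one partial sum, and a second launch of a single block of TB_SIZE threads that sums only the first TB_SIZE partial sums, as in the `<<<1, TB_SIZE>>>` call quoted in the previous issue. The function name `two_pass_sum` is hypothetical, and the model counts only valid partial sums rather than reproducing what a real kernel would read past them.

```cpp
#include <cstddef>
#include <vector>

// CPU model of the two launches, assuming an input of n ones and that
// tb_size divides n evenly.
//   pass 1: n / tb_size blocks, each producing one partial sum
//   pass 2: one block of tb_size threads summing the first tb_size partials
long two_pass_sum(std::size_t n, std::size_t tb_size) {
  const std::size_t grid_size = n / tb_size;  // number of partial sums
  // Each block sums tb_size ones, so every partial sum equals tb_size.
  std::vector<long> partial(grid_size, static_cast<long>(tb_size));
  long total = 0;
  // The second launch has only tb_size threads, so it sees at most
  // tb_size partial sums, however many the first launch produced.
  for (std::size_t t = 0; t < tb_size && t < grid_size; ++t) total += partial[t];
  return total;
}
```

Under these assumptions, `two_pass_sum(65536, 256)` gives 65536 because 256 blocks produce 256 partial sums and the second block has 256 threads, while `two_pass_sum(65536, 128)` gives 16384 because 512 partial sums are produced but only 128 are added. In this model the two-launch scheme is exact only when the number of partial sums does not exceed the block size of the second launch.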

profiling cuda code with Nvidia Nsight

Hi Nick, I'm trying to profile matrix multiplication CUDA code (the same as your naive matrix multiplication code) with NVIDIA Nsight. With 1<<10 it worked, but with 1<<11 the profiler didn't catch the kernel launch. I have an NVIDIA GTX 960M GPU.
Is this a limitation of my GPU's capability, or is something else wrong?
Thanks in advance.

hello nick

Where is the CUDA Crash Course (v3) series? I can't find it on YouTube 😢

Assertion `tmp == c[i * N + j]' failed

Hi, I'm trying to test my GPGPU-Sim build by following the blog post https://coffeebeforearch.github.io/2020/03/30/gpgpu-sim-1.html
After following the instructions, I get:
mmul: mmul.cu:42: void verify_result(std::vector<int>&, std::vector<int>&, std::vector<int>&, int): Assertion `tmp == c[i * N + j]' failed.
Aborted (core dumped)

My configuration:
Intel i5-8265U
8 GB DDR4 RAM
Ubuntu 20.04 LTS
gcc 10.2
CUDA build version: 11.2
Please help

Link for `CUDA Crash Course (v3)`

I can only find a link for CUDA Crash Course and CUDA Crash Course (v2). Is there a link for the CUDA Crash Course (v3) somewhere?

Thank you so much for the invaluable content. That is extremely helpful.

issue with vector-addion.cu

Hey Nick,
I am getting an error on this line:
// Boundary Check
if (tid < N) c[tid] = a[tid] + b[tid];
When I changed it according to the VS suggestion to:
// Boundary Check
if (tid < N) c[tid] == a[tid] + b[tid];
it says warning #174-D: expression has no effect
1> if (tid < N) c[tid] == a[tid] + b[tid];
However, after this warning, it showed "completed successfully".
I used VS 2022 Community. Can you please explain what happened? I am still confused.

Also, when I compile the same program with nvcc on WSL2 (Ubuntu 20.2), it says
vector-add.cu(26): warning #174-D: expression has no effect
but compilation finished, and running the output file gives this:
vector-add: vector-add.cu:33: void verify_result(std::vector<int>&, std::vector<int>&, std::vector<int>&): Assertion `c[i] == a[i] + b[i]' failed.
Aborted

Can you please help me out with this?

Regards
Pronod
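The warning and the failed assertion are consistent with each other: `=` stores a value, whereas `==` only compares and throws the result away, so `c` is never written. The host-side sketch below illustrates the difference with plain C++ loops standing in for the kernel; the function names `add_assign`, `add_compare`, and `verify` are hypothetical, and the sketch says nothing about what caused the original editor error on the `=` line.

```cpp
#include <cstddef>
#include <vector>

// Assignment: stores the sum in c[i].
void add_assign(const std::vector<int>& a, const std::vector<int>& b,
                std::vector<int>& c) {
  for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];
}

// Comparison: evaluates to true or false and discards the result, so c is
// left untouched. Compilers typically flag this as an expression or
// statement with no effect.
void add_compare(const std::vector<int>& a, const std::vector<int>& b,
                 std::vector<int>& c) {
  for (std::size_t i = 0; i < c.size(); ++i) c[i] == a[i] + b[i];
}

// Same check as the assertion in the issue, returned as a bool.
bool verify(const std::vector<int>& a, const std::vector<int>& b,
            const std::vector<int>& c) {
  for (std::size_t i = 0; i < c.size(); ++i) {
    if (c[i] != a[i] + b[i]) return false;
  }
  return true;
}
```

With `a = {1, 2, 3}`, `b = {4, 5, 6}`, and `c` initialised to zeros, `verify` fails after `add_compare` and passes after `add_assign`, which mirrors the behaviour reported above.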