Comments (11)
Hi,
Thank you for the interest.
The --verify option loads content from the input file into a buffer and does a memcmp with what it got when reading from the disk. It's only useful if you have written the same data to the disk before (either using the --write option or the read_blocks sample program with the write option).
You can also use the --output option (with or without --verify) to dump what was read from the disk to file.
from ssd-gpu-dma.
Thank you for your kind answer.
I had misunderstood: I thought the --verify option compared the data stored on the SSD with the data loaded into GPU memory.
What I actually want to know is how to access the data in GPU memory.
I want to verify that the data stored on the SSD and the data in GPU memory are the same.
Then I want to try to implement simple applications using the data in GPU memory.
Currently, I am studying your latency benchmark. Regarding the latency benchmark, can you kindly advise me on how to access the data in GPU memory?
Thank you
Hi,
Combining --verify and --gpu is the right approach for this.
With the --gpu option, the program allocates the memory buffer that the disk writes to or reads from on the GPU. In this code path, the --input option does an extra cudaMemcpy that loads GPU memory before the benchmark, and the --verify option (and/or the --output option) does a cudaMemcpy from GPU memory in order to verify that the memory content is the same. Without --input, the buffer is memset to zero. In my opinion, the most convenient way of verifying is to use the --input, --verify and --write options together: nvm-latency-bench then loads the file content into memory, writes it to the disk, reads it back from the disk, and finally compares it with the original file content loaded in memory.
If you use the --gpu option in addition to --input, --verify and --write, then nvm-latency-bench does the following:
- Allocate a RAM buffer and read the file content into that buffer.
- Allocate a buffer on the GPU and do a cudaMemcpy to copy from the RAM buffer to the memory chunk on the GPU.
- Write data to the disk from GPU memory (the disk reads from GPU memory directly).
- Read the data back from the disk into GPU memory (the disk writes directly to GPU memory).
- Create a new buffer in RAM and do a cudaMemcpy from the GPU buffer to that.
- Compare the two RAM buffers to verify that the content is the same.
I can't tell from your first post whether you compiled with CUDA support. The status messages when running the cmake command should confirm whether the driver was located.
In order for the above to work, you need to point cmake to the Nvidia driver so that the kernel module build can find the necessary symbols from nv-p2p.h for calling the GPUDirect RDMA API. Where the driver source is located depends on your system and how you installed CUDA. It is also possible to download the local run-file installer and extract the source. Make sure that you run make in the driver folder first, so that cmake can locate the Module.symvers file. Please let me know which distro you are using and how you installed CUDA if you have difficulties with this step.
P.S. You should also have a look at the nvm-cuda-bench example if you're interested in having the CUDA kernel itself initiate disk reads/writes and access that memory.
I would also like to discuss the benchmark binary with the --verify option. We tried running nvm-latency-bench with both the --verify and --write options, using the following command:
nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --gpu 0 --output out.out
Here 'test.in' is the input file and 'out.out' is the output file.
However, the function "verifyTransfer" still throws an exception. In order to check the output contents, we also ran nvm-latency-bench with the --output option. However, all bytes in the output file are filled with '0xFF'. Can you advise how to fix this problem?
We also tried the read_blocks sample program with the write option in order to verify the write operation. The detailed setting is as follows:
./nvm-read-blocks --write test.in --ctrl /dev/libnvm0 --block 1 --output out2.out
In this case, we checked that the output file 'out2.out' contains the same data written from the input file 'test.in'. I want to know the difference between the read_blocks and latency benchmark programs in terms of the write operation.
Thanks for your help.
However, all bytes in the output file are filled with '0xFF'. Can you advise how to fix this problem?
Yes, this indicates that your system is not able to do PCIe peer-to-peer. There is no definitive list of which architectures support this, but in my experience workstation CPUs such as Xeons and other higher-end CPUs tend to support it, while consumer Core i3-i7 CPUs do not. What CPU are you using?
It is possible to put the disk and the GPU in an expansion chassis with a PCIe switch that supports peer-to-peer, but this is expensive and requires extra equipment.
I am using an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz.
Could that be the reason for this problem?
Since the read_blocks example works, I believe so, yes. Reading only 0xFFs from device memory is generally a symptom of that. You can also drop the --gpu option but otherwise use the same options to nvm-latency-bench; if that also works, then I'm pretty convinced that is the issue.
Oh... Thank you for your kind answers.
I might need to buy a new processor to achieve actually what I want.
If you don't mind, can you tell me how much better sending data directly from the PCIe disk to the GPU is than going through main memory?
I really want to know how promising sending data directly is.
Thank you again.
I might need to buy a new processor to achieve actually what I want.
Before you run off to buy that, please also check that you have a GPU that is able to do GPUDirect. In my experience, most Nvidia Quadro or Tesla GPUs are able to do this, while GeForce/GTX GPUs are not.
It depends on your workload. Reading disk data into main memory and then copying it to GPU memory with cudaMemcpy is slow. It is also possible to memory-map the file (with mmap), register that memory with CUDA's unified memory model using cudaHostRegister, and fault it into GPU memory, but that is difficult to control and also not the most efficient. If you are able to do it peer-to-peer, especially with a large PCIe network, writing and reading directly between peering devices can yield very low latency and high bandwidth.
But, as I said, it depends heavily on the scenario. Most NVMe drives are x4, and are unable to provide high bandwidth because of that. If your workload or use case allows it, it is also possible to pipeline the disk I/O for your CUDA program by reading from the disk ahead of time. In that case, using GPUDirect offers very little benefit.
So, to answer your question: I made this primarily to see if it was possible to do. If you require very low latency or have sporadic disk access that is not easily predicted ahead of time, then this approach will have some benefit. I'm currently testing with multiple disks in order to fully saturate the x16 PCIe link to a GPU, and I'm also experimenting with doing work on the GPU at the same time (which will affect the GPU memory latency). From the results I already have, I'm able to achieve maximum disk bandwidth and very low command completion latencies simply by bypassing the kernel's block-device implementation. With Intel Optane 900P and Intel Optane P4800X disks and Quadro P600 and P620 GPUs, I see up to 2.7 GB/s for reads and around 6-7 microsecond command completion latencies, even when accessing GPU memory.
Thank you very much for such a nice library.
I am curious whether you have had any success using GPUDirect on GTX or RTX GPUs. The information seems sparse, and I would like to know about your experience.
Thanks
Thank you! As far as I know, GTX does not support GPUDirect RDMA, only Quadros and Teslas do.