Comments (11)

enfiskutensykkel commented on July 18, 2024

Hi,

Thank you for the interest.

The --verify option loads content from the input file into a buffer and does a memcmp with what it got when reading from the disk. It's only useful if you have written the same data to the disk before (either using the --write option or the read_blocks sample program with the write option).

You can also use the --output option (with or without --verify) to dump what was read from the disk to file.
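
For example (the file name and controller path here are placeholders, mirroring the options used later in this thread), writing a file to the disk and then verifying it on read-back would look something like this:

nvm-latency-bench --ctrl /dev/libnvm0 --input test.in --write --verify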

maxmp1031 commented on July 18, 2024

Thank you for your kind answer.

I had misunderstood: I thought the --verify option compares the data stored on the SSD with the data loaded into GPU memory.

What I actually want to know is how we can access the data in GPU memory.
I want to verify whether the data stored on the SSD and the data in GPU memory are the same.
And then I want to try to implement simple applications using the data in GPU memory.

Currently, I am studying your latency benchmark. Regarding the latency benchmark, can you kindly advise me how to access the data in the GPU memory?

Thank you

enfiskutensykkel commented on July 18, 2024

Hi,

Combining --verify and --gpu is the right approach for this.

With the --gpu option, the program allocates the memory buffer that the disk writes to or reads from on the GPU. In this code path, the --input option does an extra cudaMemcpy that loads GPU memory before the benchmark, and the --verify option (and/or the --output option) does a cudaMemcpy from GPU memory in order to verify that the memory content is the same. Without --input, the buffer is memset to zero. In my opinion, the most convenient way of verifying is to use the --input, --verify and --write options together: nvm-latency-bench then loads the file content into memory, writes it to the disk, reads it back from the disk, and finally compares it with the original file content loaded in memory.

If you use the --gpu option in addition to --input, --verify and --write, then nvm-latency-bench does the following (sketched in code after the list):

  1. Allocate a RAM buffer and read the file content into that buffer.
  2. Allocate a buffer on the GPU and do a cudaMemcpy from the RAM buffer to the memory chunk on the GPU.
  3. Write data to the disk from GPU memory (the disk reads from GPU memory directly).
  4. Read data back from the disk into GPU memory (the disk writes directly to GPU memory).
  5. Create a new buffer in RAM and do a cudaMemcpy from the GPU buffer to it.
  6. Compare the two RAM buffers to verify that the content is the same.
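
To make these steps concrete, here is a rough sketch using plain CUDA runtime calls. It is not the benchmark's actual code: steps 3 and 4 are only indicated as comments, since they go through the library's NVMe queues rather than a standard API, and the file name and transfer size are placeholders.

    /* Rough sketch of steps 1-6 above. Steps 3 and 4 are indicated as
     * comments only; they are issued through the library's NVMe queues. */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t size = 4096;                 /* placeholder transfer size */

        /* 1. Allocate a RAM buffer and read the file content into it. */
        char *original = (char *) malloc(size);
        FILE *fp = fopen("test.in", "rb");
        size_t loaded = fread(original, 1, size, fp);
        fclose(fp);

        /* 2. Allocate a GPU buffer and copy the RAM buffer onto the GPU. */
        void *gpu = NULL;
        cudaMalloc(&gpu, size);
        cudaMemcpy(gpu, original, loaded, cudaMemcpyHostToDevice);

        /* 3. Write to the disk: the controller reads directly from GPU memory. */
        /* 4. Read back from the disk: the controller writes directly into GPU
         *    memory. Both transfers go through the NVMe queues, not this code. */

        /* 5. Copy the GPU buffer back into a fresh RAM buffer. */
        char *readback = (char *) malloc(size);
        cudaMemcpy(readback, gpu, loaded, cudaMemcpyDeviceToHost);

        /* 6. Compare the two RAM buffers to verify that the content matches. */
        printf("verify: %s\n", memcmp(original, readback, loaded) == 0 ? "OK" : "MISMATCH");

        cudaFree(gpu);
        free(original);
        free(readback);
        return 0;
    }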

I can't tell from your first post whether you compiled with CUDA support. The status messages when running the cmake command should confirm where the driver is located.

In order for the above to work, you need to point cmake to the Nvidia driver so that the kernel module build can find the necessary symbols from nv-p2p.h for calling the GPUDirect RDMA API. Where the driver source is located depends on your system and on how you installed CUDA. It is also possible to download the local run-file installer and extract the source from it. Make sure that you run make in the driver folder first, so that cmake can locate the Module.symvers file. Please let me know which distro you are using and how you installed CUDA if you have difficulties with this step.

P.S. You should also have a look at the nvm-cuda-bench example if you're interested in having the CUDA kernel itself initiate disk reads/writes and access that memory.

maxmp1031 commented on July 18, 2024

Hi,

> Thank you for the interest.
>
> The --verify option loads content from the input file into a buffer and does a memcmp with what it got when reading from the disk. It's only useful if you have written the same data to the disk before (either using the --write option or the read_blocks sample program with the write option).
>
> You can also use the --output option (with or without --verify) to dump what was read from the disk to file.

I also want to discuss the benchmark binary with the --verify option. We have tried to run nvm-latency-bench with both the --verify and --write options. The detailed settings are shown in the following command:

nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --gpu 0 --output out.out

where 'test.in' is the input file and 'out.out' is the output file.

By the way, the function "verifyTransfer" still throws an exception. In order to check the output contents, we also ran nvm-latency-bench with the --output option. However, all bytes in the output file are filled with 0xFF. Can you advise how to fix this problem?

We also tried the read_blocks sample program with the write option in order to verify the write operation. The detailed setting is described as follows:

./nvm-read-blocks --write test.in --ctrl /dev/libnvm0 --block 1 --output out2.out

In this case, we checked that the output file 'out2.out' contains the same data as was written from the input file 'test.in'. I want to know the difference between the read_blocks and latency-benchmark programs in terms of the write operation.

Thanks for your help.

enfiskutensykkel commented on July 18, 2024

> However, all bytes in the output file are filled with 0xFF. Can you advise how to fix this problem?

Yes, this indicates that your system is not able to do PCIe peer-to-peer. There is no definitive list of which architectures support this, but in my experience workstation CPUs such as Xeons and other higher-end CPUs tend to support it, while the desktop Core i3 to i7 parts do not. What CPU are you using?

It is possible to put the disk and the GPU in an expansion chassis with a PCIe switch that supports peer-to-peer, but this is expensive and requires extra equipment.

maxmp1031 commented on July 18, 2024

> > However, all bytes in the output file are filled with 0xFF. Can you advise how to fix this problem?
>
> Yes, this indicates that your system is not able to do PCIe peer-to-peer. There is no definitive list of which architectures support this, but in my experience workstation CPUs such as Xeons and other higher-end CPUs tend to support it, while the desktop Core i3 to i7 parts do not. What CPU are you using?
>
> It is possible to put the disk and the GPU in an expansion chassis with a PCIe switch that supports peer-to-peer, but this is expensive and requires extra equipment.

I am using an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz.

Might that be the reason for this problem?

enfiskutensykkel commented on July 18, 2024

Since the read_blocks example works, I believe so, yes. Reading only 0xFFs from device memory is generally a symptom of that. You can also drop the --gpu option but otherwise use the same options to nvm-latency-bench; if that also works, then I'm pretty convinced that this is the issue.
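
For instance, keeping the same placeholders as the command quoted earlier in this thread, but dropping --gpu 0:

nvm-latency-bench --input test.in --write --verify --ctrl /dev/libnvm0 --bytes 4096 --count 100000 --iterations=1 --queue 'no=1' --info --output out.out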

maxmp1031 commented on July 18, 2024

> Since the read_blocks example works, I believe so, yes. Reading only 0xFFs from device memory is generally a symptom of that. You can also drop the --gpu option but otherwise use the same options to nvm-latency-bench; if that also works, then I'm pretty convinced that this is the issue.

Oh... Thank you for your kind answers.

I might need to buy a new processor to achieve what I actually want.

If you don't mind, can you tell me how much better sending data from the PCIe disk directly to the GPU is than going through main memory?

I really want to know how promising sending the data directly is.

Thank you again.

enfiskutensykkel commented on July 18, 2024

> I might need to buy a new processor to achieve what I actually want.

Before you run off to buy that, please also check that you have a GPU that is able to do GPUDirect. In my experience, most Nvidia Quadro or Tesla GPUs are able to do this, while GeForce/GTX GPUs are not.

It depends on your workload. Reading disk data into main memory and then copying it to GPU memory with cudaMemcpy is slow. It is also possible to memory-map the file (with mmap), register that memory with CUDA's unified memory model using cudaHostRegister, and fault it into GPU memory, but that is difficult to control and also not the most efficient. If you are able to do it peer-to-peer, especially with a large PCIe network, writing and reading directly between peer devices can yield very low latency and high bandwidth.
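
As a minimal sketch of the mmap + cudaHostRegister variant mentioned above (not the library's code; the file name is a placeholder, and with the mapped flag the GPU accesses the pinned pages over PCIe rather than having them migrated into GPU memory):

    /* Minimal sketch: memory-map a file and register the mapping with the
     * CUDA runtime so a kernel can read it. Whether a file-backed mapping
     * can be registered at all depends on the platform and driver version. */
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("test.in", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        /* Private copy-on-write mapping; write permission is requested because
         * pinning the pages may require it. mmap returns page-aligned memory. */
        void *host = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

        /* Pin the mapping and make it accessible from the device. */
        cudaError_t err = cudaHostRegister(host, st.st_size, cudaHostRegisterMapped);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaHostRegister failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        void *dev = NULL;
        cudaHostGetDevicePointer(&dev, host, 0);
        /* 'dev' can now be passed to a kernel; the kernel reads the file
         * content across the PCIe bus from the pinned host pages. */

        cudaHostUnregister(host);
        munmap(host, st.st_size);
        close(fd);
        return 0;
    }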

But, as I said, it depends heavily on the scenario. Most NVMe drives are PCIe x4 devices and are unable to provide very high bandwidth because of that. If your workload or use case allows it, it is also possible to pipeline the disk I/O for your CUDA program by reading from the disk ahead of time; in that case, using GPUDirect offers very little benefit.
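
A sketch of what such pipelining could look like, using two pinned host buffers and two CUDA streams; the chunk size, file name and the process kernel are placeholders, this is an illustration rather than code from the library, and error handling is omitted:

    /* Double-buffered pipelining: while one chunk is copied to the GPU and
     * processed on its stream, the host reads the next chunk from disk. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void process(const char *data, size_t n)
    {
        /* Placeholder for the real work done on each chunk. */
    }

    int main(void)
    {
        const size_t chunk = 1 << 20;              /* 1 MiB per chunk */
        FILE *fp = fopen("test.in", "rb");

        char *host[2];
        char *dev[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost((void **) &host[i], chunk);  /* pinned, so copies can overlap */
            cudaMalloc((void **) &dev[i], chunk);
            cudaStreamCreate(&stream[i]);
        }

        for (int i = 0; ; ++i) {
            int b = i & 1;
            /* Wait until the previous round that used this buffer has finished. */
            cudaStreamSynchronize(stream[b]);

            size_t n = fread(host[b], 1, chunk, fp);
            if (n == 0)
                break;

            /* While this chunk is copied and processed on its stream, the next
             * loop iteration reads the following chunk into the other buffer. */
            cudaMemcpyAsync(dev[b], host[b], n, cudaMemcpyHostToDevice, stream[b]);
            process<<<1, 256, 0, stream[b]>>>(dev[b], n);
        }

        cudaDeviceSynchronize();
        for (int i = 0; i < 2; ++i) {
            cudaStreamDestroy(stream[i]);
            cudaFree(dev[i]);
            cudaFreeHost(host[i]);
        }
        fclose(fp);
        return 0;
    }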

So to answer your question: I made this primarily to see if it was possible to do. If you require very low latency, or have sporadic disk accesses that are not easily predicted ahead of time, then this approach will have some benefit. I'm currently in the process of testing with multiple disks in order to fully saturate the x16 PCIe link to a GPU, and I'm also experimenting with doing work on the GPU at the same time (which will affect the GPU memory latency). But from the results I already have, I'm able to achieve maximum disk bandwidth and very low command completion latencies just by bypassing the kernel's block-device implementation. With the Intel Optane 900P and Intel Optane P4800X disks and Quadro P600 and Quadro P620 GPUs, I see up to 2.7 GB/s for reads and around 6-7 microsecond command completion latencies, even when accessing GPU memory.

sureshd-fm commented on July 18, 2024

Thank you very much for such a nice library.
I am curious whether you have had any success using GPUDirect on GTX or RTX GPUs? The information seems sparse, and I would like to know about your experience.
Thanks

enfiskutensykkel commented on July 18, 2024

Thank you! As far as I know, GTX does not support GPUDirect RDMA, only Quadros and Teslas do.
