
ssd-gpu-dma's Issues

Unexpected error: Unexpected CUDA error: an illegal memory access was encountered

Hi,

I am using Ubuntu 18.04 with a V100 GPU.

I ran the benchmark 'nvm-cuda-bench -c /dev/libnvm0', but got the error message "Unexpected error: Unexpected CUDA error: an illegal memory access was encountered".
The dmesg output also shows the following messages:

[263169.171738] Adding controller device: 88:00.0
[263169.172098] Character device /dev/libnvm0 created (504.0)
[263169.172185] libnvm helper loaded
[263209.858820] Mapping for address 7ff72da00000 not found
[263255.876777] NVRM: Xid (PCI:0000:1b:00): 13, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[263255.876795] NVRM: Xid (PCI:0000:1b:00): 13, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x20 0x514728=0x4c1eb72 0x51472c=0x174
[263255.877633] NVRM: Xid (PCI:0000:1b:00): 43, Ch 00000030

It seems the GPU cannot access the NVMe device registers. Is that true? And do you know how to solve it?

Does the --verify option work?

Hi, I'm a student interested in GPU data processing.

I appreciate your great effort in implementing RDMA using GPUDirect. It is very useful for studying this research area.

By the way, while trying to run the project, I have run into some problems.

The current case is:
Trying to send data from SSD to GPU directly, without using smartio.
Below is my cmake command line:
cmake .. -DCMAKE_BUILD_TYPE=Release -Dno_smartio=true -Dno_smartio_samples=true -Dno_smartio_benchmarks=true

And I tried to run the nvm-latency-bench program with the options below:

  1. ./nvm-latency-bench --input test.in --verify --ctrl=/dev/libnvm0 --blocks 1000 --count 1 --iterations=1000 --queue 'no=1,depth=1'
  2. ./nvm-latency-bench --input test.in --verify --ctrl=/dev/libnvm0 --blocks 1000 --count 1 --iterations=1000 --queue 'no=1,depth=1' --gpu 0

When I run the program, both show:
Verifying buffers... FAIL
Unexpected runtime error: Memory buffer differ from file content
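
To clarify what I expect --verify to check, here is a conceptual sketch in C of the comparison I have in mind (my own assumption, not the benchmark's actual code): read the expected bytes back from the input file and memcmp them against the transferred buffer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Conceptual sketch only: compare a transferred buffer against the file it
 * was supposed to contain.  Returns 0 when the contents match. */
static int verify_buffer(const void *buffer, size_t size, const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char *expected = malloc(size);
    size_t n = (expected != NULL) ? fread(expected, 1, size, fp) : 0;
    fclose(fp);

    int rc = (n == size) ? memcmp(expected, buffer, size) : -1;
    free(expected);
    return rc;
}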

Can you please give some advice?

Thank you.

nvm-cuda-bench infinitely waiting for IO completion

I am running the CUDA benchmark with different combinations of the number of pages, chunks, and threads. For some combinations, the GPU thread waiting for IO completion keeps waiting forever on the first chunk, and which combinations work seems random. Below are some of the simple combinations I tried.

Pages  Chunks  Threads  Status
1      1       1        Works
1      1       2        Works
1      1       4        Works
1      1       8        Hangs
1      2       1        Works
1      2       2        Works
1      2       4        Hangs
1      2       8        Works
1      4       1        Works
1      4       2        Hangs
1      4       4        Works
1      4       8        Works
1      8       1        Hangs
1      8       2        Works
1      8       4        Works
1      8       8        Works
2      1       1        Works
2      1       2        Works
2      1       4        Hangs
2      1       8        Works
2      2       1        Works
2      2       2        Hangs
2      2       4        Works
2      2       8        Works
2      4       1        Hangs
2      4       2        Works
2      4       4        Works
2      4       8        Works
2      8       1        Works
2      8       2        Works
2      8       4        Works
2      8       8        Works
1      1       5        Hangs
1      2       5        Hangs
2      1       5        Hangs
2      2       5        Works

I know the last 4 configs use 5 threads, which isn't a power of 2, but I don't see a problem with the provided code running with any thread count as long as __syncthreads() can synchronize the threads in the thread block and there are enough entries in the NVMe queues for each thread. I have changed the settings file to allow this.

So I do not understand why certain configs hang and others don't.
The only changes I have made to the code in your repo are modifying the settings file for the CUDA benchmark to remove the restriction that the thread count be a power of 2, and removing the +1 from the computation of max entries from the MQES register.
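
One generic pitfall I have been ruling out while lifting the power-of-two restriction (purely an illustration, not a claim about this repo's code): ring-index arithmetic that wraps with a bitmask only behaves correctly when the size is a power of two, whereas a modulo wraps correctly for any size.

#include <stdio.h>

int main(void)
{
    unsigned queue_size = 5;   /* e.g. 5 threads sharing a queue, not a power of 2 */
    unsigned tail = 4;         /* last valid slot */

    /* Mask-based wrap assumes queue_size is a power of two: 5 & 4 = 4, so the
     * index gets stuck instead of returning to 0. */
    unsigned next_masked = (tail + 1) & (queue_size - 1);

    /* Modulo-based wrap is correct for any size: 5 % 5 = 0. */
    unsigned next_modulo = (tail + 1) % queue_size;

    printf("masked=%u modulo=%u\n", next_masked, next_modulo);
    return 0;
}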

Floating Point Exception

I am trying to run the example nvm-identify but I get the following output:
Resetting controller and setting up admin queues...
Floating point exception

The dmesg output is this:
[May24 16:40] traps: nvm-identify[3179] trap divide error ip:7f6d2f98a434 sp:7ffd9a74e3b0 error:0 in libnvm.so[7f6d2f985000+9000]
I am not sure what is going on. Any help would be appreciated.
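
For what it's worth, the "Floating point exception" message comes from SIGFPE, which an integer division by zero also raises; here is a minimal illustration that is unrelated to libnvm's actual code (the zero value is purely hypothetical):

#include <stdio.h>

int main(void)
{
    volatile unsigned page_size = 0;    /* hypothetical: a value that unexpectedly reads back as 0 */
    unsigned pages = 8192 / page_size;  /* integer divide by zero -> SIGFPE, which the shell
                                           reports as "Floating point exception" */
    printf("%u\n", pages);
    return 0;
}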

nvm-cuda-bench fails with "an illegal memory access was encountered"

  1. Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
  2. Supermicro X12DPi-N6
  3. NVIDIA RTX A2000
  4. Samsung 980 Pro nvme
  5. Ubuntu 20.04.5 / 5.4.0-135-generic / cuda_12.0.0_525.60.13_linux

$ cmake .. -DCMAKE_BUILD_TYPE=Debug -Dnvidia_archs="86"
$ make identify module cuda-benchmark
$ sudo rmmod nvme
$ sudo make -C module load

$ deviceQuery
......
Device 0: "NVIDIA RTX A2000"
CUDA Driver Version / Runtime Version 12.0 / 11.8
CUDA Capability Major/Minor version number: 8.6
......

$ sudo ./bin/nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version : 1.3.0
Controller page size : 4096
Max queue entries : 16384
Serial Number : S5GXNG0N905360M
Model Number : Samsung SSD 980 PRO 1TB
Firmware revision : 5B2QGXA7
Max data transfer size : 524288
Max outstanding commands: 256
Max number of namespaces: 1
Current number of CQs : 129
Current number of SQs : 129

When I run:
$ sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
CUDA device : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size : 4096 B
Namespace block size : 512 B
Number of threads : 32
Chunks per thread : 32
Pages per chunk : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering : no
Unexpected error: Unexpected CUDA error: an illegal memory access was encountered

$ dmesg
[ 484.710982] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Out Of Range Address
[ 484.710999] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics SM Global Exception on (GPC 2, TPC 0, SM 0): Multiple Warp Errors
[ 484.711014] NVRM: Xid (PCI:0000:98:00): 13, pid='', name=, Graphics Exception: ESR 0x514730=0x201000e 0x514734=0x24 0x514728=0xc81eb60 0x51472c=0x1174
[ 484.711584] NVRM: Xid (PCI:0000:98:00): 43, pid=2037, name=nvm-cuda-bench, Ch 00000008

And if run under compute-sanitizer
$ sudo /usr/local/cuda/bin/compute-sanitizer ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
========= COMPUTE-SANITIZER
CUDA device : 0 NVIDIA RTX A2000 (0000:98:00.0)
Controller page size : 4096 B
Namespace block size : 512 B
Number of threads : 32
Chunks per thread : 32
Pages per chunk : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering : no
========= Invalid __local__ write of size 16 bytes
========= at 0x3e0 in readSingleBuffered(QueuePair *, unsigned long, void *, void *, unsigned long, unsigned long, unsigned long *, CmdTime *)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0xfffcd0 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x302a52]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:__cudart798 [0x30e0b]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:cudaLaunchKernel [0x8cd0b]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/usr/local/cuda/include/cuda_runtime.h:216:cudaError cudaLaunchKernel(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0x1fd21]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/tmp/tmpxft_00002ae3_00000000-6_main.cudafe1.stub.c:1:__device_stub__Z18readSingleBufferedP9QueuePairmPvS1_mmPmP7CmdTime(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fab2]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:306:readSingleBuffered(QueuePair*, unsigned long, void*, void*, unsigned long, unsigned long, unsigned long*, CmdTime*) [0x1fb2a]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:450:launchNvmKernel(Controller const&, std::shared_ptr, Settings const&, cudaDeviceProp const&) [0x1dd7f]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:/home/pc10/ssd-gpu-dma/benchmarks/cuda/main.cu:698:main [0x1ee3a]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
========= Host Frame:../csu/libc-start.c:342:__libc_start_main [0x24083]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0x1bf8e]
========= in /home/pc10/ssd-gpu-dma/build/./bin/nvm-cuda-bench
<...... Same trace from GPU thread 1 to 31 .......>

========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaEventDestroy.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaHostUnregister.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFree.
<...... host backtrace omitted ......>
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaFreeHost.
<...... host backtrace omitted ......>

Unexpected error: Unexpected CUDA error: unspecified launch failure
========= Target application returned an error
========= ERROR SUMMARY: 42 errors
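
As a side note on reading the log: once a kernel faults, the error is sticky on the CUDA context, so every later runtime call (cudaDeviceSynchronize, cudaEventDestroy, cudaFree, ...) reports the same launch failure. A minimal host-side checking pattern, independent of this benchmark:

#include <stdio.h>
#include <cuda_runtime_api.h>

/* Report a CUDA runtime error without aborting; after a kernel fault the same
 * error keeps coming back from every subsequent runtime call. */
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
}

int main(void)
{
    /* ... kernel launch would normally happen here ... */
    check(cudaGetLastError(), "launch");            /* configuration/launch errors        */
    check(cudaDeviceSynchronize(), "synchronize");  /* faults raised while the kernel ran */
    return 0;
}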

Separating SQ, CQ, and PRP List Memories

So in ./benchmarks/cuda/queue.cu, I am trying to use separate allocations and DMA regions for the SQ, CQ, and PRP list, by doing something like the following:

__host__ DmaPtr prepareQueuePair(QueuePair& qp, const Controller& ctrl, const Settings& settings, uint16_t id)
{
    printf("Creating QP %u\n", (unsigned int) id);
    //size_t queueMemSize = 1024 * sizeof(nvm_cmd_t) + 1024 * sizeof(nvm_cpl_t);
    size_t sq_size = 1024 * sizeof(nvm_cmd_t);
    size_t cq_size = 1024 * sizeof(nvm_cpl_t);
    
    size_t prpListSize = ctrl.info.page_size * settings.numThreads * (settings.doubleBuffered + 1);

   
    auto sq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(sq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto cq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(cq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto prp_list_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(prpListSize, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    // Set members
    qp.pageSize = ctrl.info.page_size;
    qp.blockSize = ctrl.ns.lba_data_size;
    qp.nvmNamespace = ctrl.ns.ns_id;
    qp.pagesPerChunk = settings.numPages;
    qp.doubleBuffered = settings.doubleBuffered;
    
    qp.prpList = NVM_DMA_OFFSET(prp_list_mem, 0);
    qp.prpListIoAddr = prp_list_mem->ioaddrs[0];
    
    // Create completion queue
    int status = nvm_admin_cq_create(ctrl.aq_ref, &qp.cq, id, cq_mem->vaddr, cq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create completion queue: ") + nvm_strerror(status));
    }
    printf("CQ MAX_ENTRIES: %u\n", (unsigned int) qp.cq.max_entries);
    // Get a valid device pointer for CQ doorbell
    void* devicePtr = nullptr;
    cudaError_t err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.cq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.cq.db = (volatile uint32_t*) devicePtr;

    // Create submission queue
    status = nvm_admin_sq_create(ctrl.aq_ref, &qp.sq, &qp.cq, id, NVM_DMA_OFFSET(sq_mem, 0), sq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create submission queue: ") + nvm_strerror(status));
    }
    printf("SQ MAX_ENTRIES: %u\n", (unsigned int) qp.sq.max_entries);
    // Get a valid device pointer for SQ doorbell
    err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.sq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.sq.db = (volatile uint32_t*) devicePtr;

    return NULL;
}

All of these allocations seem to be fine.
However, when the GPU threads try to write to the submission queue entry in prepareChunk with *cmd = local;, threads hit illegal memory addresses when they try to write the last 4 bytes of the 64-byte command entry. Am I doing something stupid? I have already tested 1024 entries in the submission and completion queues using the original code, so I know that part is fine. I just want to separate the memories for the two queues to rule out reading or writing in the wrong memory location.

Invalid NSID

I am running the cuda benchmark from your codebase, with the following output for the controller and command line configuration:

Controller page size  : 4096 B
Namespace block size  : 4096 B
Number of threads     : 1
Chunks per thread     : 1
Pages per chunk       : 5
Total number of pages : 5
Total number of blocks: 5
Double buffering      : no

The problem is that the thread never finishes polling for the first chunk. So I exit, reload the regular nvme driver, and check the device's error log.
There I see the following entry each time I try to run the benchmark:

sqid         : 1
cmdid        : 0
status_field : 0x4016(INVALID_NS)
parm_err_loc : 0xffff
lba          : 0
nsid         : 0x1
vs           : 0

The NVMe SSD has only 1 namespace (NSID 1), and it is the one being used for all commands in the codebase. So what could be the issue? Any help in this matter will be appreciated.

Incorrect use of DMA API

I think the kernel module is technically using the DMA API incorrectly. dma_map_page and dma_map_single are supposed to be for 'streaming DMA' where you write to the buffer in userspace, then dma_map it, then hand it over to the device. Otherwise data in the CPU cache can be missed. I think dma_alloc_coherent is the correct thing to use in cases like this where the buffer contents are changed repeatedly. I suspect that x86_64 doesn't have a problem because the cache is coherent anyway, but it's a problem for me on 32-bit ARM. So it probably won't affect 99% of users, but I thought I'd report it in case anyone else has problems.
I've worked around it by using another memory allocation system I have on my platform.
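
To make the distinction concrete, here is a sketch of the two kernel APIs in question; this is illustrative only, not the helper module's actual code:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

static void *coherent_example(struct device *dev, size_t size, dma_addr_t *handle)
{
    void *buf = kmalloc(size, GFP_KERNEL);
    dma_addr_t streaming;

    if (buf != NULL) {
        /* Streaming mapping: meant for buffers that are filled once and then
         * handed to the device; on non-coherent platforms (e.g. 32-bit ARM)
         * the CPU must not touch the buffer again without dma_sync_*() calls. */
        streaming = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);
        dma_unmap_single(dev, streaming, size, DMA_BIDIRECTIONAL);
        kfree(buf);
    }

    /* Coherent allocation: CPU and device can both access the buffer
     * repeatedly (queues, PRP lists) without explicit sync calls. */
    return dma_alloc_coherent(dev, size, handle, GFP_KERNEL);
}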

Issue when using the cuda example/benchmark

I have been successful in running nvm-latency-bench without a GPU. The output of that is as follows:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000  --queue="no=128,location=local" --bw

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... DONE
Running bandwidth benchmark (reading, sequential, 1000 iterations)... DONE
Calculating percentiles...
Queue #128 read percentiles (1000 samples)
            bandwidth,       adj iops,    cmd latency,    prp latency
  max:       2118.074,     517108.001,         67.191,          2.150
 0.99:       2107.156,     514442.488,         65.464,          2.095
 0.97:       2102.182,     513227.943,         64.984,          2.079
 0.95:       2097.901,     512182.780,         64.776,          2.073
 0.90:       2093.795,     511180.541,         64.481,          2.063
 0.75:       2084.105,     508814.706,         63.536,          2.033
 0.50:       2070.331,     505451.803,         61.828,          1.978
 0.25:       2014.709,     491872.302,         61.419,          1.965
 0.10:       1985.443,     484727.223,         61.136,          1.956
 0.05:       1976.456,     482533.263,         61.015,          1.952
 0.01:       1957.190,     477829.660,         60.771,          1.945
  min:       1905.024,     465093.782,         60.432,          1.934
End percentiles
OK!

But when I try to run it with a GPU or run the nvm-cuda-bench binary, I get an error saying the following: "Unexpected error: Failed to map device memory: Invalid argument"

./bin/nvm-cuda-bench --ctrl=/dev/libnvm0

CUDA device           : 0 Tesla V100-PCIE-16GB (0000:07:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Unexpected error: Failed to map device memory: Invalid argument

Clarify use-case involving CUDA

It's not clear to the lay user visiting this repository how CUDA is eventually used to access NVMs with this driver. An explanation/code snippet in README.md and/or an example with actual CUDA API calls would help with that.
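
For illustration, even a short hedged sketch like the following would help; here map_gpu_buffer_for_ssd is a placeholder name rather than a libnvm function, and cudaMalloc is the only real API call:

#include <stdio.h>
#include <cuda_runtime_api.h>

/* Placeholder for the library-specific step that exposes the GPU buffer to the
 * NVMe controller; the real mechanism lives in libnvm and its kernel helper. */
static int map_gpu_buffer_for_ssd(void *gpu_ptr, size_t size)
{
    (void) gpu_ptr;
    (void) size;
    return 0;
}

int main(void)
{
    void *gpu_buffer = NULL;

    /* Allocate device memory with the CUDA runtime... */
    if (cudaMalloc(&gpu_buffer, 1 << 20) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    /* ...then have the library map it so the SSD can DMA blocks directly into
     * GPU memory, bypassing host RAM. */
    if (map_gpu_buffer_for_ssd(gpu_buffer, 1 << 20) != 0) {
        fprintf(stderr, "mapping failed\n");
        return 1;
    }

    cudaFree(gpu_buffer);
    return 0;
}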

Build and Binding the helper driver

Hi,

Trying to run the CUDA benchmark. I built the project successfully with CUDA support. There is no IOMMU support:

root@labuser-pc:/home/labuser# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=UUID=1d724a7d-a2bf-4b8d-b79f-4419bbedd509 ro quiet splash vt.handoff=7

But when I try to load the helper driver, I can't do the second step. With the first step I can see the unbind happening.

$ echo -n "0000:05:00.0" > /sys/bus/pci/devices/0000\:05\:00.0/driver/unbind
$ echo -n "0000:05:00.0" > /sys/bus/pci/drivers/disnvm/bind

I don't see "disnvm". Here is the listing of the drivers folder. I am trying this on Ubuntu 16.04 and need help setting up the driver.

root@labuser-pc:/home/labuser# /sys/bus/pci/drivers/
agpgart-intel/    imsttfb/          nvidia-nvswitch/  skx_uncore/
agpgart-via/      ioatdma/          nvme/             snd_hda_intel/
ahci/             iosf_mbi_pci/     ohci-pci/         uhci_hcd/
asiliantfb/       ipmi_si/          parport_pc/       virtio-pci/
ata_generic/      libnvm helper/    pata_sis/         xen-platform-pci/
ata_piix/         lpc_ich/          pcieport/         xhci_hcd/
ehci-pci/         mei_me/           serial/           
i40e/             nvidia/           shpchp/   

Regards,
MJay

nvm-identify run error

Hi:
I have built libnvm with:
cmake .. -DNVIDIA=/usr/src/nvidia-440.33.01
make identify
and then ran the nvm-identify binary as follows:
1) With the device unbound:
echo "0000:19:00.0" > /sys/bus/pci/devices/0000:19:00.0/driver/unbind
then ran: ./bin/nvm-identify --ctrl=/dev/nvme0n1
which fails with: Failed to open descriptor: No such file or directory
2) With the device bound:
ran: ./bin/nvm-identify --ctrl=/dev/nvme0n1
which fails with: [map_memory] Page mapping kernel request failed: Inappropriate ioctl for device
What do I need to do to fix this problem?

CMake output saying 'Configuring kernel module without CUDA'

Hi;

I have a Jetson Xavier AGX kit, and I plugged an NVMe SSD into its M.2 Key M slot. Now I'm trying to install your libnvm on the Xavier, and I see the following message in the CMake output:

-- Found CUDA: /usr/local/cuda-10.0 (found suitable version "10.0", minimum required is "8.0")
-- Using NVIDIA driver found in
-- Configuring kernel module without CUDA
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ganapathi/Downloads/ssd-gpu-dma-master/build

How can I force CMake to build with CUDA?

Thanks

Does the CQ and SQ memory need to be contiguous?

Can I create separate DMA regions for the CQ memory and SQ memory? Is it supported? Or do the SQ pages have to be right after the CQ pages?
I want to create separate memory regions to make sure I am not reading/writing in the wrong memory location.
When you create the queue and PRP list memory region in benchmarks/cuda/queue.cu, why do you make the alignment (1UL << 16) [65536] and not the controller page size?

Issue with multiple queues for latency benchmark

For the latency benchmark, there is an issue with using multiple queues causing the consumer threads to hang forever waiting for completions. This happens when using an Intel Optane 900P PCIe disk.

Cannot find "nvm-latency-bench" in build/bin

Hi,

I want to run the latency benchmark with the specified controller and for 1000 blocks as in your instructions, but I found that there is no "nvm-latency-bench" in build/bin to run the command:
$ ./bin/nvm-latency-bench --ctrl=0x80000 --blocks=1000 --pattern=sequential

I don't know what happened with "make libray"?

