Comments (32)

pgera avatar pgera commented on August 17, 2024 1

nvm-identify continues to work, but cuda-bench exits with that error. I'm testing all this on Google Compute Engine since I don't have a Tesla/Quadro handy. The CPU is a virtualized Xeon. This is the output from nvm-identify:

sudo ./nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : e0 1a
PCI Subsystem Vendor ID : 0 0
NVM Express version     : 1.0.0
Controller page size    : 4096
Max queue entries       : 4096
Serial Number           : nvme_card
Model Number            : nvme_card
Firmware revision       : 2       
Max data transfer size  : 4096
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------
Goodbye!

Next, I tried latency bench with the following command:

sudo ./nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1 --count=0x1000 --gpu=0
Resetting controller...
Queue #01 remote 1 commands
Allocating 1 pages (GPU)...
Running benchmark...
Queue #01 cmds=1 blocks=1 count=4096 min=33.425 avg=66.039 max=9073.373
	0.99:        107.377
	0.97:         71.325
	0.95:         58.911
	0.90:         55.767
	0.75:         42.562
	0.50:         41.000
OK!

So this works! At least the entire path works hardware-wise. nvm-cuda-bench still doesn't work for some reason.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024 1

Hi,

Yes, this is probably because the _CUDA define isn't set for the module, which can happen when CMake couldn't find the path to the driver source on its own (or if it couldn't find the Module.symvers in that path).

Please make sure that when you run CMake it reports "Configuring kernel module with CUDA". If it doesn't, you should probably point it to the driver path manually (on Ubuntu it would be something like cmake .. -DNVIDIA=/usr/src/nvidia-384-384.111).

Also make sure that you have run make in the driver path, so that CMake is able to find /path/to/driver/Module.symvers.
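
For reference, the whole sequence might look roughly like this on Ubuntu (the driver path and version are just the example from above, and the build directory is the one used later in this thread; adjust both to your installation):

ls /usr/src/nvidia-384-384.111/Module.symvers    # must exist; build the driver sources first if it doesn't
cd ~/ssd-gpu-dma/build
cmake .. -DNVIDIA=/usr/src/nvidia-384-384.111    # should report "Configuring kernel module with CUDA"
make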

pgera avatar pgera commented on August 17, 2024 1

Thanks, I got it to work. The issue was that the standard NVIDIA src directory on Ubuntu doesn't have Module.symvers; I had to build the NVIDIA module to generate it. After that the test works. I had to do this last time too, but I forgot about it today. Now that it works, I'll try to write some CUDA applications using this. Thanks a lot.

pgera avatar pgera commented on August 17, 2024 1

I'm using the git master. This is the output from nvm-cuda-bench:

sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --double-buffer=true --gpu=1
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : yes
Time elapsed: 22053.663 µs
Bandwidth   : 190.186 MiB/s

If I increase the number of pages in the argument, the bandwidth goes up to 300 MiB/s. Isn't this low for the SSD? This is a Samsung 960 EVO SSD.

Edit: This is the output with the block device:

sudo ./bin/nvm-cuda-bench --block-device=/dev/nvme0n1 --gpu=1
Controller page size  : 4096 B
Assumed block size    : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Double buffering      : no
Time elapsed: 9767.744 µs
Bandwidth   : 429.404 MiB/s

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024 1

This indicates that it works, great!

Yes, those numbers aren't very impressive. However, one of the main bottlenecks in that benchmark is actually the moveBytes function, which moves data from a pinned global memory buffer (where the disk writes into) to another global memory buffer. If you comment out its contents, I believe the bandwidth should increase: https://github.com/enfiskutensykkel/ssd-gpu-dma/blob/master/benchmarks/cuda/main.cu#L32

The idea behind moveBytes was to emulate a half-realistic use case where you actually do something with the data you read from disk, in addition to making it possible to verify the data passing through the GPU.

You can also experiment with the chunks and pages parameters; pages per chunk determines how much data the benchmark will try to read per command. Another factor here is of course the PCIe topology.

As a reference, I get around 1000-1100 MiB/s with my Quadro K620 and an Optane 900P. The latency benchmark should give an indication of the maximum bandwidth though, if you run it with the --bandwidth argument.
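
For example, something along these lines, reusing the flags from the latency run above with --bandwidth added (the block count and iteration count are just example values):

sudo ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=0x1000 --count=10 --gpu=0 --bandwidth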

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024 1

@pgera I suggest contacting me directly by email for further discussions. My email is [email protected]

Based on the output from lspci -tv, it appears that your GPU sits behind a different CPU than the disk, which means that traffic has to cross the QPI link; that is what is causing the low performance: https://devblogs.nvidia.com/benchmarking-gpudirect-rdma-on-modern-server-platforms/
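
One quick way to confirm this from a shell is to compare the NUMA node each device is attached to; the PCI addresses below are taken from the lspci output in this thread, and different values for the two devices mean they sit under different sockets:

cat /sys/bus/pci/devices/0000:82:00.0/numa_node    # Tesla K40c
cat /sys/bus/pci/devices/0000:02:00.0/numa_node    # Samsung NVMe disk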

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

There are some benchmark programs ATM that use the API; the bandwidth benchmark, for example, allows reading directly into GPU memory. The relevant path is benchmarks/cuda/ (previously benchmarks/simple-rdma/).

But you're absolutely right. I will hopefully find time to document it better.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

I have added an example showing how disk IO can be initiated from within a CUDA kernel. This should sufficiently demonstrate how you can link the library with CUDA applications. I will add some text to README.md when I have time.

Thank you again for showing interest in this project.

pgera avatar pgera commented on August 17, 2024

I am having some trouble with the CUDA application. nvm-latency-bench passes for me, but the CUDA one fails. I'm calling it like so:

sudo ./bin//nvm-cuda-bench --block-device=/dev/libnvm0
Controller page size : 4096 B
Assumed block size : 512 B
Number of threads : 32
Chunks per thread : 32
Pages per chunk : 1
Total number of pages : 1024
Double buffering : no
Unexpected CUDA error: invalid argument

When I look at syslog, the following message is printed: "Invalid range size", which seems to be from module/pci.c. I am able to test the basic GPUDirect RDMA functionality using this repository (https://github.com/NVIDIA/gdrcopy). Any idea what's going on?

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Hi @pgera, and thank you for your response.

I suspect the CUDA error is coming from cudaMemcpy. The "Invalid range size" message from module/pci.c is a condition I really did not expect to happen, and it occurs a couple of steps before the program even attempts the cudaMemcpy.

Are you using the most recent commit?

Edit: see my next response instead

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Ah, I understand the confusion, and my poor documentation is to blame. The --block-device argument is actually supposed to test the native NVMe driver. If you unbind (or unload) the libnvm helper module, bind the native NVMe driver to the disk instead, and then run sudo ./bin/nvm-cuda-bench --block-device=/dev/nvmen, that is supposed to work.
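
In terms of commands, the switch-over would look roughly like this (the PCI address is just an example taken from the lspci output further down in this thread, and the resulting device node name may differ on your system):

echo -n "0000:02:00.0" | sudo tee /sys/bus/pci/devices/0000:02:00.0/driver/unbind   # detach the disk from the libnvm helper
echo -n "0000:02:00.0" | sudo tee /sys/bus/pci/drivers/nvme/bind                    # hand it to the native nvme driver (module must be loaded)
sudo ./bin/nvm-cuda-bench --block-device=/dev/nvme0n1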

The libnvm helper module is not intended as a driver itself; rather, it allows a userspace program to memory-map controller registers and lock pages so that the disk can do DMA.

pgera avatar pgera commented on August 17, 2024

@enfiskutensykkel, yes that works. So what's the right way of using the helper module with CUDA?

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

If you load the helper module and instead invoke the benchmark with the following arguments:

./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --double-buffer=true

This will run the benchmark doing the NVMe operations from within the CUDA kernel. The --block-device argument is intended just for comparing performance against a memory-mapped file using the native driver.

pgera avatar pgera commented on August 17, 2024

With that command, I am hitting a different error: "Unexpected error: Unexpected CUDA error: the launch timed out and was terminated". The kernel gets launched on the GPU and the utilization goes to 100% as seen in nvidia-smi. Syslog has this message: "kernel: [ 1229.084455] NVRM: Xid (PCI:0000:00:05): 8, Channel 0000001f". Is this a watchdog issue?

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Hmm, my initial guess is that the disk is not responding for some reason. The CUDA kernel will enqueue a bunch of commands and then wait for completions, but if the completions never arrive (because the disk doesn't respond or is unable to respond), the kernel will hang forever and get killed by the watchdog.

The first step in debugging would be to build the samples (make samples) and check whether nvm-identify --ctrl=/dev/libnvm0 still works or whether it also times out.

If it times out, then the disk is in a state where it needs a power cycle to reset.

If it works, then there is some other issue. Maybe PCIe P2P doesn't work? What kind of CPU are you using?
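
Concretely, something like this (paths as used earlier in the thread):

cd ~/ssd-gpu-dma/build
make samples
sudo ./bin/nvm-identify --ctrl=/dev/libnvm0    # should print the controller information and exit, not hang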

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Alternatively, you can also try to run the other benchmark, the latency benchmark:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=0x1000 --count=10 --gpu=<CUDA device id>.

This benchmark does all the NVMe operations on the CPU and only writes data directly into GPU memory.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

At least this means that the disk hasn't ended up in some weird state, which is a good thing. I still suspect that P2P may not be working properly.

It is a bit of a hassle, but what you can do to verify that data is actually getting through is to write some known pattern to the disk (for example using the nvm-integrity-util program).

What I usually do is this:

$ seq 1 8000 > original_file.txt
$  ./bin/nvm-integrity-util --ctrl=/dev/libnvm0 original_file.txt
Reading from file `original_file.txt' and writing to disk (38893 bytes)
Resetting controller and configuring admin queue pair...
------------- Controller information -------------
PCI Vendor ID           : 86 80
PCI Subsystem Vendor ID : 86 80
NVM Express version     : 1.0.0
Controller page size    : 4096
Max queue entries       : 4096
Serial Number           : PHMB742301ER280CGN  
Model Number            : INTEL SSDPED1D280GA                     
Firmware revision       : E2010325
Max data transfer size  : 131072
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------
Using 1 submission queues:
	Queue #0: block 0 to block 76 (page 0 + 10)
Total blocks: 76

This writes the sequence 1...8000 to disk. Then I use the latency benchmark with the output option:

$ ./bin/nvm-latency-bench --ctrl=/dev/libvnm0 --blocks=76 --count=1 --gpu=0 --output=readback
Resetting controller...
Queue #01 remote 1 commands
Allocating 10 pages (GPU)...
Running benchmark...
Queue #01 cmds=1 blocks=76 count=1 min=20.843 avg=20.843 max=0.000
	0.99:          0.000
	0.97:          0.000
	0.95:          0.000
	0.90:          0.000
	0.75:          0.000
	0.50:          0.000
Writing to file...
OK!

Then the file readback-sequential should be (almost) equal to the file original_file.txt, only with some additional trailing 0-bytes to align to a disk block.

$ tail original_file.txt
7991
7992
7993
7994
7995
7996
7997
7998
7999
8000
$ tail readback-sequential 
7992
7993
7994
7995
7996
7997
7998
7999
8000

This will verify whether the disk is actually able to write into GPU memory. My guess is that the virtualised environment is the reason why it isn't working properly.
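
If you want to compare the two files without eyeballing them, something like cmp can be used on the region covered by the original file (this assumes GNU cmp and stat; readback-sequential is the file produced by the --output=readback run above):

cmp -n $(stat -c %s original_file.txt) original_file.txt readback-sequential && echo "readback matches"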

Edit: I realised I had posted the SmartIO arguments for the benchmark, so I changed it to the module version.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

@pgera Thank you again for having the patience to try this out. I really appreciate it, and I apologise for the unfinished state of the code. This is very much a work in progress.

pgera avatar pgera commented on August 17, 2024

I'm happy to test this. I need something like this for one of my projects. latency-bench with those arguments crashes the system. Maybe the virtualized environment adds some other variables to this setup. I'll have to check if I can get hold of a Tesla/Quadro to try this on a real machine. This was my output up to the crash:

sudo ./nvm-integrity-util --ctrl=/dev/libnvm0 original_file.txt
Reading from file `original_file.txt' and writing to disk (38893 bytes)
Resetting controller and configuring admin queue pair...
------------- Controller information -------------
PCI Vendor ID           : e0 1a
PCI Subsystem Vendor ID : 0 0
NVM Express version     : 1.0.0
Controller page size    : 4096
Max queue entries       : 4096
Serial Number           : nvme_card
Model Number            : nvme_card
Firmware revision       : 2       
Max data transfer size  : 4096
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------
Using 1 submission queues:
        Queue #0: block 0 to block 10 (page 0 + 10)
Total blocks: 10

~/ssd-gpu-dma/build/bin$ sudo ./nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=10 --count=1 --gpu=0 --output=readback
Resetting controller...
Queue #01 remote 10 commands
Allocating 10 pages (GPU)...
Running benchmark...

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

I'm unfamiliar with Google Compute Engine and I'm not sure why it crashed, but I suspect that the disk has attempted to DMA into some memory it isn't allowed to access. If you have multiple logins, it is possible to continuously monitor the system log using dmesg -w to see if anything interesting happens (such as "Low memory corruption", which happens if a device overwrites kernel memory by mistake).

I assume that you are running with the (virtual) IOMMU disabled, correct? I'm almost certain that the underlying hypervisor is using an IOMMU, which would introduce another layer of address virtualisation and cause PCIe P2P to not work properly without additional configuration.
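
A quick way to check from inside the guest is to look at the kernel command line and at what is exposed under sysfs, for example:

cat /proc/cmdline        # look for intel_iommu=off / iommu=off
ls /sys/class/iommu/     # no entries here usually means no IOMMU is active in the guest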

In the very latest Linux kernel releases, there have been some changes regarding IOMMU support and P2P. I am totally in the dark here since I don't really know how the Google Compute Engine stuff works, but if IOMMU+P2P is fully supported in the latest kernel version, it may be the case that if I modify the helper module to put the GPU and the disk into the same (virtual) IOMMU domain, whatever underlying hypervisor is in use can trap properly and set up the corresponding mappings. I am not entirely sure this is the case, though, but you did mention earlier that some other RDMA example does actually appear to work, which means that it definitely isn't impossible to get this working. I'm currently not sure how I would do that, and it would require some more investigation on my part.

However, and I'm still guessing here, some changes to the helper module code may be possible. The relevant calls to the Nvidia driver are in module/map.c, namely the call to nvidia_p2p_dma_map_pages on line 294. I am currently using version 4.11.0 of the kernel and version 384.11 of the Nvidia driver, and with my installation this function does not return IO addresses at all, only physical addresses. This is why you need to disable the IOMMU in order to make the CUDA stuff work. It may be possible to manually map things using dma_map_resource and variants (found in include/linux/dma-mapping.h), if this is supported by the Nvidia driver and the Linux implementation of this function for x86-64 isn't just a dummy macro. I need to investigate this some more, but it is on my TODO list already.

I have just recently started working on a related project using so-called mediated device drivers, called mdev (or mdev-vfio). I will probably improve my understanding of how this virtualisation works over the next couple of weeks, and can then give a proper answer on what would be required to make it work in a virtualised environment.

pgera avatar pgera commented on August 17, 2024

The guest IOMMU is disabled, but you are right that the hypervisor most probably has the IOMMU enabled. I tried capturing dmesg -w, but it didn't print anything. I confirmed that if I run nvm-latency-bench without the --gpu argument, it reads back the file successfully (except for some newline difference at the end). With the --gpu argument, it crashes consistently. I do think that P2P in general is supposed to work on most cloud platforms since they let you add multiple GPUs to an instance. I'll try to see if I can make some sense of the code. I'll also try to run this on bare metal.
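
That is, the same command as before, just without the --gpu argument:

sudo ./nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=10 --count=1 --output=readback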

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Yes, I would also expect that the devices were already placed in the same IOMMU domain by the hypervisor before being passed to the guest. However, I suspect that the GPU may actually be using mdev rather than the physical function being passed through; that would at least make sense in a cloud environment, because you can share a single GPU among multiple VM instances without having to implement SR-IOV (https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt). I think RDMA should still work regardless, but I need to sit down and have a look at what's going on. Based on the PCI vendor ID (https://pci-ids.ucw.cz/read/PC/1ae0), I think the disk may also be virtual or using some form of paravirtualization/virtualization with physical backing, so there might be something odd going on there as well.

I really appreciate your efforts so far, and any feedback you could provide me regarding this would be extremely helpful.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

As for where to start looking, the source code for the API is in the src directory. I use IOCTLs to communicate with the helper module, and the function called map_memory in src/dma.c is where I pass in virtual addresses (or, in the case of CUDA, a device pointer) and retrieve a list of IO addresses. I've made a convenience debug print function called dprintf which accepts arguments printf-style. You probably have to set CMAKE_BUILD_TYPE to Debug, otherwise there is some macro magic that will "swallow" the output.

In the module code, the IOCTL entry point is in module/pci.c. All the actual address mapping stuff happens in module/map.c.
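
For reference, a Debug build would be configured along these lines (the NVIDIA driver path is the same example as earlier in the thread; adjust it to your installation):

cd ~/ssd-gpu-dma/build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DNVIDIA=/usr/src/nvidia-384-384.111
make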

pgera avatar pgera commented on August 17, 2024

I managed to get access to a bare metal system with a Tesla K40C. I repeated all the previous steps. The CPU ones still work. With the GPU one, I'm seeing a different error:

sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --double-buffer=true --gpu=1
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : yes
[map_memory] Page mapping kernel request failed: Invalid argument
Unexpected error: Failed to map device memory: Invalid argument

With any GPU test, dmesg has this message:
[ 9314.243165] Unknown ioctl command from process 8481: 1075347458

I haven't debugged it further. Just wanted to check if you have any thoughts.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Great!

I'm just curious, have you tried the nvm-cuda-bench benchmark, and if it works, what kind of bandwidth are you seeing? I'm not sure which commit you are using, but I had a bug some time ago where I didn't compile for all compute modes and architectures, so launching CUDA kernels would just fail silently; if you are seeing ridiculous bandwidths for that benchmark, that may be the case.

I think the K40 should support GPUDirect Async, meaning that the nvm-cuda-bench approach of having the GPU itself directly initiate disk transfers should work, but it may be the case that it only supports GPUDirect RDMA (in which case the benchmark should fail on the call to cudaHostRegister). In any case, nvm-latency-bench should still work because it only uses GPUDirect RDMA.

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Just saw your edit.

Edit 3: See my reply below.

If you get higher bandwidth using the block-device method, that would actually be an argument against my approach of kicking off disk transfers from within the kernel. That's a bit confusing and obviously not the result I would expect. I can only compare it to my own configuration.

For smaller sizes, it seems that the block-device approach is unbeatable:

root@obama:~/ssd-gpu-dma/build# ./bin/nvm-cuda-bench --block-device=/dev/nvme0n1
Controller page size  : 4096 B
Assumed block size    : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Double buffering      : no
Time elapsed: 2522.016 µs
Bandwidth   : 1663.076 MiB/s

vs.

root@obama:~/ssd-gpu-dma/build# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --double-buffer=true
CUDA device: Quadro K620
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : yes
Time elapsed: 3561.856 µs
Bandwidth   : 1177.561 MiB/s

However, with slightly larger transfers that approach's performance starts to drop:

root@obama:~/ssd-gpu-dma/build# ./bin/nvm-cuda-bench --block-device=/dev/nvme0n1 --chunks 64 --pages 32
Controller page size  : 4096 B
Assumed block size    : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 32
Total number of pages : 65536
Double buffering      : no
Time elapsed: 254601.501 µs
Bandwidth   : 1054.336 MiB/s

Obviously, the block-device approach could be optimised further (for example by using double buffering too).

However, compare that to kicking it off from within a kernel:

root@obama:~/ssd-gpu-dma/build# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --chunks 64 --pages 32
CUDA device: Quadro K620
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 32
Total number of pages : 65536
Total number of blocks: 524288
Double buffering      : no
Time elapsed: 282330.231 µs
Bandwidth   : 950.785 MiB/s

And with double buffering:

root@obama:~/ssd-gpu-dma/build# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --chunks 64 --pages 32 --double-buffer=true
CUDA device: Quadro K620
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 32
Total number of pages : 65536
Total number of blocks: 524288
Double buffering      : yes
Time elapsed: 190355.942 µs
Bandwidth   : 1410.176 MiB/s

I will actually get my hands on a Tesla K40 later today, and I think I have a Samsung SSD lying around somewhere. Out of curiosity, what CPU architecture are you using?

Edit 2: My main suggestion would probably be to increase the number of chunks to see if there is any effect at all (which would be the case if the kernel start-up cost of the Tesla is significantly higher than for my Quadro), but I don't expect there to be any considerable difference. See my reply below.

Edit 1: Some info about my configuration

root@obama:~/ssd-gpu-dma/build# cat /proc/cpuinfo | grep "model"
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
...
root@obama:~/ssd-gpu-dma/build# lspci -s 5: -v
05:00.0 Non-Volatile memory controller: Intel Corporation Optane SSD 900P Series (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation 900P Series [Add-in Card]
	Physical Slot: 4
	Flags: bus master, fast devsel, latency 0, IRQ 24
	Memory at fa410000 (64-bit, non-prefetchable) [size=16K]
	Expansion ROM at fa400000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI-X: Enable- Count=32 Masked-
	Capabilities: [60] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Virtual Channel
	Capabilities: [180] Power Budgeting <?>
	Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
	Capabilities: [270] Device Serial Number 55-cd-2e-41-4e-36-39-2a
	Capabilities: [2a0] #19
	Kernel driver in use: libnvm helper
	Kernel modules: nvme
root@obama:~/ssd-gpu-dma/build# sudo dmidecode -t 2
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
	Manufacturer: ASUSTeK COMPUTER INC.
	Product Name: X99-M WS/SE
	Version: Rev 1.xx
	Serial Number: 160880587900105
	Asset Tag: Default string
	Features:
		Board is a hosting board
		Board is replaceable
	Location In Chassis: Default string
	Chassis Handle: 0x0003
	Type: Motherboard
	Contained Object Handles: 0

Invalid entry length (16). Fixed up to 11.
root@obama:~/ssd-gpu-dma/build# lspci -s 1:00.0 -v
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K620] (rev a2) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GM107GL [Quadro K620]
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0, IRQ 28
	Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384
root@obama:~/ssd-gpu-dma/build# lspci -t
...
\-[0000:00]-+-00.0
             +-01.0-[05]----00.0
...
             +-05.0
...

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Hi again @pgera

So I managed to find a Tesla K40 and a Samsung 960 EVO.

[root@petty build2]# ./bin/nvm-identify --ctrl=/dev/libnvm0
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version     : 1.2.0
Controller page size    : 4096
Max queue entries       : 16384
Serial Number           : S3ESNX0JA72980N
Model Number            : Samsung SSD 960 EVO 250GB
Firmware revision       : 2B7QCXE7
Max data transfer size  : 2097152
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------
Goodbye!

With default arguments, I get the same bad performance as you did:

[root@petty build2]# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Time elapsed: 11953.088 µs
Bandwidth   : 350.897 MiB/s

With the same number of chunks as for the Quadro example in my previous comment:

[root@petty build2]# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --pages 32 --chunks 64
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 32
Total number of pages : 65536
Total number of blocks: 524288
Double buffering      : no
Time elapsed: 306186.676 µs
Bandwidth   : 876.705 MiB/s

With even more transfer chunks:

[root@petty build2]# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --pages 32 --chunks 128
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 128
Pages per chunk       : 32
Total number of pages : 131072
Total number of blocks: 1048576
Double buffering      : no
Time elapsed: 604297.852 µs
Bandwidth   : 888.421 MiB/s

And finally with double-buffering:

[root@petty build2]# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --pages 32 --chunks 128 --double-buffer=true
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 128
Pages per chunk       : 32
Total number of pages : 131072
Total number of blocks: 1048576
Double buffering      : yes
Time elapsed: 462272.064 µs
Bandwidth   : 1161.374 MiB/s

I'm not entirely sure why this is the case, but if I were to guess, it's because the Tesla probably has a higher kernel launch cost, which affects the way I measure completion time.

For comparison, here are some results using the block-device option:

[root@petty build2]# ./bin/nvm-cuda-bench --block-device=/dev/nvme1n1
Controller page size  : 4096 B
Assumed block size    : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Double buffering      : no
Time elapsed: 9737.248 µs
Bandwidth   : 430.748 MiB/s

And with increased transfer size:

[root@petty build2]# ./bin/nvm-cuda-bench --block-device=/dev/nvme1n1 --pages 32 --chunks 128
Controller page size  : 4096 B
Assumed block size    : 512 B
Number of threads     : 32
Chunks per thread     : 128
Pages per chunk       : 32
Total number of pages : 131072
Double buffering      : no
Time elapsed: 581437.012 µs
Bandwidth   : 923.352 MiB/s

EDIT: I also noticed that the EVO supports a larger maximum transfer size, which means that pages per chunk can be increased.

[root@petty build2]# ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --pages 512 --chunks 128 --double-buffer=true
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 128
Pages per chunk       : 512
Total number of pages : 2097152
Total number of blocks: 16777216
Double buffering      : yes
Time elapsed: 7192296.875 µs
Bandwidth   : 1194.324 MiB/s

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

Also, just for fun, I commented out the loop in the moveBytes function.

Disk to GPU memory (queues hosted in GPU memory, GPU triggers doorbell):

[root@petty build2]# make cuda-benchmark && ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --pages 512 --chunks 64 --double-buffer=true
[ 50%] Built target libnvm
Scanning dependencies of target cuda-benchmark-module
[ 57%] Building CUDA object benchmarks/cuda/CMakeFiles/cuda-benchmark-module.dir/main.cu.o
/root/ssd-gpu-dma/benchmarks/cuda/main.cu(34): warning: variable "numThreads" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(35): warning: variable "threadNum" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(37): warning: variable "source" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(38): warning: variable "destination" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(34): warning: variable "numThreads" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(35): warning: variable "threadNum" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(37): warning: variable "source" was declared but never referenced

/root/ssd-gpu-dma/benchmarks/cuda/main.cu(38): warning: variable "destination" was declared but never referenced

[ 64%] Linking CUDA device code CMakeFiles/cuda-benchmark-module.dir/cmake_device_link.o
[ 71%] Linking CXX executable ../../bin/nvm-cuda-bench
[100%] Built target cuda-benchmark-module
[100%] Built target cuda-benchmark
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 512
Total number of pages : 1048576
Total number of blocks: 8388608
Double buffering      : yes
Time elapsed: 1209045.654 µs
Bandwidth   : 3552.362 MiB/s

This appears to be quite close to the maximum bandwidth.

Disk to system memory (queues hosted in system memory, CPU triggers doorbell):

[root@petty build2]# ./bin/nvm-latency-bench --pattern=sequential --blocks=0x2000 --bandwidth --ctrl=/dev/libnvm0 --queues 1 --depth 63
Resetting controller...
Queue #01 remote 2 commands
Allocating 1024 pages (host)...
Running benchmark...
Queue #01 cmds=2 blocks=8192 count=1000 min=3498.058 avg=3528.195 max=3535.428
	0.99:       3533.429
	0.97:       3532.798
	0.95:       3532.233
	0.90:       3531.463
	0.75:       3530.223
	0.50:       3528.504
OK!

Disk to GPU memory (queues hosted in system memory, CPU triggers doorbell):

[root@petty build2]# ./bin/nvm-latency-bench --pattern=sequential --blocks=0x2000 --bandwidth --ctrl=/dev/libnvm0  --queues 1 --depth 63 --gpu=0
Resetting controller...
Queue #01 remote 2 commands
Allocating 1024 pages (GPU)...
Running benchmark...
Queue #01 cmds=2 blocks=8192 count=1000 min=3508.489 avg=3528.573 max=3534.829
	0.99:       3533.393
	0.97:       3532.810
	0.95:       3532.358
	0.90:       3531.641
	0.75:       3530.158
	0.50:       3528.753
OK!

Edit: I made a misleading statement before I realised I hadn't used the full queue depth, sorry about that.

pgera avatar pgera commented on August 17, 2024

For some reason, it never goes above roughly 300 MiB/s in my case. I wonder if it's some misconfiguration of the hardware or the driver.

sudo ./bin/nvm-identify --ctrl=/dev/libnvm0 
Resetting controller and setting up admin queues...
------------- Controller information -------------
PCI Vendor ID           : 4d 14
PCI Subsystem Vendor ID : 4d 14
NVM Express version     : 1.2.0
Controller page size    : 4096
Max queue entries       : 16384
Serial Number           : S3ESNX0J327317F     
Model Number            : Samsung SSD 960 EVO 250GB               
Firmware revision       : 2B7QCXE7
Max data transfer size  : 2097152
Max outstanding commands: 0
Max number of namespaces: 1
--------------------------------------------------
Goodbye!


sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --gpu=1
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Time elapsed: 24467.968 µs
Bandwidth   : 171.420 MiB/s

sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --gpu=1 --pages 32 --chunks 64
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 64
Pages per chunk       : 32
Total number of pages : 65536
Total number of blocks: 524288
Double buffering      : no
Time elapsed: 1117581.055 µs
Bandwidth   : 240.193 MiB/s

sudo ./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 --gpu=1 --pages 32 --chunks 128 --double-buffer=true
CUDA device: Tesla K40c
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 128
Pages per chunk       : 32
Total number of pages : 131072
Total number of blocks: 1048576
Double buffering      : yes
Time elapsed: 1783518.188 µs
Bandwidth   : 301.018 MiB/s

Here's some machine info:

cat /proc/cpuinfo | grep "model"
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
82:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40c] (rev a1)
        Subsystem: NVIDIA Corporation GK110BGL [Tesla K40c]
        Physical Slot: 2
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 36
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 37fc0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 37fd0000000 (64-bit, prefetchable) [size=32M]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
       Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_390, nvidia_390_drm
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a804 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Physical Slot: 1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 26
        Region 0: Memory at c7200000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
                Vector table: BAR=0 offset=00003000
                PBA: BAR=0 offset=00002000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158 v1] Power Budgeting <?>
        Capabilities: [168 v1] #19
        Capabilities: [188 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [190 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
        Kernel driver in use: libnvm helper
        Kernel modules: nvme  

enfiskutensykkel avatar enfiskutensykkel commented on August 17, 2024

That's strange, could you show me the output of lspci -tv? Maybe PCIe traffic has to cross the root complex or something?

EDIT: Also, what's the bandwidth reported by nvm-latency-bench if you use --gpu=1 --bandwidth?
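
In other words, something along the lines of (the remaining flags are the ones from my bandwidth runs earlier in this thread, and the values are just an example):

sudo ./bin/nvm-latency-bench --pattern=sequential --blocks=0x2000 --bandwidth --ctrl=/dev/libnvm0 --queues 1 --depth 63 --gpu=1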

pgera avatar pgera commented on August 17, 2024

-+-[0000:ff]-+-08.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-0b.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link Debug
 |           +-0c.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-10.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-12.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
|           +-14.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Error
 |           +-14.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Error
 |           +-14.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-16.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Error
 |           +-17.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Error
 |           +-17.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-1e.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           \-1f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 +-[0000:80]-+-02.0-[81]--+-00.0  NVIDIA Corporation Device 1b02
 |           |            \-00.1  NVIDIA Corporation Device 10ef
 |           +-03.0-[82]----00.0  NVIDIA Corporation GK110BGL [Tesla K40c]
 |           +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management
 |           +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug
 |           +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors
 |           \-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC
 +-[0000:7f]-+-08.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-08.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 0
 |           +-09.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
 |           +-09.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D QPI Link 1
|           +-0b.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link 0/1
 |           +-0b.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R3 QPI Link Debug
 |           +-0c.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0c.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0d.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-0f.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Caching Agent
 |           +-10.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent
 |           +-10.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-10.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Ubox
 |           +-12.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 0
 |           +-12.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-12.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Home Agent 1
 |           +-13.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Target Address/Thermal/RAS
 |           +-13.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder
 |           +-13.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Broadcast
 |           +-13.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-14.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Thermal Control
 |           +-14.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Thermal Control
 |           +-14.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 0 Error
 |           +-14.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel 1 Error
 |           +-14.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-14.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 0/1 Interface
 |           +-16.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Target Address/Thermal/RAS
 |           +-16.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Channel Target Address Decoder
 |           +-16.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Broadcast
 |           +-16.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Global Broadcast
 |           +-17.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Thermal Control
 |           +-17.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Thermal Control
 |           +-17.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 0 Error
 |           +-17.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 1 - Channel 1 Error
 |           +-17.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.5  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.6  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-17.7  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DDRIO Channel 2/3 Interface
 |           +-1e.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.3  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1e.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           +-1f.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 |           \-1f.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Power Control Unit
 \-[0000:00]-+-00.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D DMI2
             +-01.0-[01]--+-00.0  Intel Corporation I350 Gigabit Network Connection
             |            \-00.1  Intel Corporation I350 Gigabit Network Connection
             +-02.0-[02]----00.0  Samsung Electronics Co Ltd Device a804
             +-02.2-[03]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
             +-03.0-[04]--
             +-03.2-[05]--+-00.0  Solarflare Communications SFL9021 [Solarstorm]
             |            \-00.1  Solarflare Communications SFL9021 [Solarstorm]
             +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management
             +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug
             +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors
             +-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC
             +-11.0  Intel Corporation C610/X99 series chipset SPSR
             +-14.0  Intel Corporation C610/X99 series chipset USB xHCI Host Controller
             +-16.0  Intel Corporation C610/X99 series chipset MEI Controller #1
             +-16.1  Intel Corporation C610/X99 series chipset MEI Controller #2
             +-1a.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #2
             +-1c.0-[06]--
             +-1c.2-[07-08]----00.0-[08]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
             +-1d.0  Intel Corporation C610/X99 series chipset USB Enhanced Host Controller #1
             +-1f.0  Intel Corporation C610/X99 series chipset LPC Controller
             +-1f.3  Intel Corporation C610/X99 series chipset SMBus Controller
             \-1f.6  Intel Corporation C610/X99 series chipset Thermal Subsystem
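
As a quick sanity check on topology, it can be useful to confirm whether the NVMe drive and the GPU hang off the same socket, since peer-to-peer transfers that have to cross the inter-socket link usually add latency. Below is a minimal, illustrative C sketch that reads the sysfs numa_node attribute for a PCI device; 0000:02:00.0 is the Samsung device from the tree above, while 0000:81:00.0 is only a placeholder for wherever the GPU actually sits on your system:

/* Illustrative sketch: report which NUMA node a PCI device is attached to
 * by reading its sysfs "numa_node" attribute.  0000:02:00.0 is the Samsung
 * NVMe from the lspci tree above; 0000:81:00.0 is a placeholder address. */
#include <stdio.h>

static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);

    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &node) != 1) {
            node = -1;
        }
        fclose(f);
    }

    return node;  /* -1 means no NUMA information (e.g. single-node system) */
}

int main(void)
{
    printf("NVMe (0000:02:00.0): node %d\n", pci_numa_node("0000:02:00.0"));
    printf("GPU  (0000:81:00.0): node %d\n", pci_numa_node("0000:81:00.0"));  /* placeholder BDF */
    return 0;
}

If the two devices report different nodes, traffic between them crosses the inter-socket link, which is one possible contributor to the gap between the host and GPU numbers below.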

Edit: Here's the output from nvm-latency-bench with and without the GPU:

sudo ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000 --pattern=sequential 
Resetting controller...
Queue #01 remote 1 commands
Allocating 125 pages (host)...
Running benchmark...
Queue #01 cmds=1 blocks=1000 count=1000 min=540.781 avg=543.591 max=573.464
	0.99:        546.186
	0.97:        545.667
	0.95:        545.410
	0.90:        544.983
	0.75:        544.330
	0.50:        543.475
OK!

sudo ./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000 --pattern=sequential --gpu=1
Resetting controller...
Queue #01 remote 1 commands
Allocating 125 pages (GPU)...
Running benchmark...
Queue #01 cmds=1 blocks=1000 count=1000 min=1752.694 avg=1758.617 max=1768.666
	0.99:       1763.780
	0.97:       1763.417
	0.95:       1763.158
	0.90:       1762.851
	0.75:       1761.899
	0.50:       1757.245
OK!
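
As a side note, those averages can be turned into a rough effective bandwidth by dividing the bytes moved per command by the average latency. The sketch below assumes each command transfers blocks × 512 bytes (i.e. 512-byte namespace blocks); adjust the block size if your namespace differs:

/* Rough conversion of the average latencies above into effective bandwidth.
 * Assumes each command transfers blocks * 512 bytes. */
#include <stdio.h>

static double bandwidth_mib(double blocks, double block_size, double avg_us)
{
    double bytes = blocks * block_size;    /* bytes per command */
    double seconds = avg_us * 1e-6;        /* average latency in seconds */
    return (bytes / seconds) / (1024.0 * 1024.0);
}

int main(void)
{
    /* avg values taken from the two runs above */
    printf("host RAM: %.1f MiB/s\n", bandwidth_mib(1000, 512, 543.591));   /* ~898 MiB/s */
    printf("GPU RAM : %.1f MiB/s\n", bandwidth_mib(1000, 512, 1758.617));  /* ~278 MiB/s */
    return 0;
}

With those assumptions, the host-memory run works out to roughly 900 MiB/s and the GPU-memory run to roughly 280 MiB/s, i.e. the GPU path is slower by a factor of about three.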

from ssd-gpu-dma.

eyalroz avatar eyalroz commented on August 17, 2024

@enfiskutensykkel, @pgera: This issue is veering farther and farther from the reason it was originally opened. Also, @enfiskutensykkel, you've renamed it "General QA", which is not what the issue was about. (It's also not a good idea, IMHO, to have that as the name of a single issue.) So please move this discussion to a different issue, or elsewhere.

from ssd-gpu-dma.
