coreylowman / cudarc
Safe rust wrapper around CUDA toolkit
License: Apache License 2.0
At the time of calling launch_async, the callee function may have exclusive ownership of the reference (&mut T), but that ownership is released as soon as launch_async returns.
What if the device mutates the mutable variable at the same time as a host function does? Or, similar to #29, what if t is dropped before launch_async returns?
I accidentally passed &mut u8 instead of u8 and got weird errors. The data behind the borrow can't be guaranteed to be valid, and probably can't be accessed either. I'm not sure whether we have to restrict T to Send or Sync (or maybe even something else).
Perhaps use spin instead?
error[E0433]: failed to resolve: could not find `vec` in `std`
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/cudarc-0.4.0/src/device.rs:387:50
|
387 | let mut buf = buf.unwrap_or_else(|| std::vec![Default::default(); src.len]);
| ^^^ could not find `vec` in `std`
struct CudaDevice {
    ...
    ordinal: usize,
}

fn ordinal(&self) -> &usize {
    &self.ordinal
}
Instantiating multiple devices creates multiple streams, and there are tricky synchronization problems between them.
To reduce additional dependencies: tuples where all elements are ValidAsZeroBits are ValidAsZeroBits too.
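A sketch of how those tuple impls could be generated with a small local macro rather than a new dependency (the trait below is a stand-in defined here for illustration, not imported from cudarc):

```rust
// Local stand-in for cudarc's ValidAsZeroBits marker trait.
unsafe trait ValidAsZeroBits {}

unsafe impl ValidAsZeroBits for u8 {}
unsafe impl ValidAsZeroBits for f32 {}

// A tuple is ValidAsZeroBits when every element is.
macro_rules! impl_tuples {
    ($($name:ident),+) => {
        unsafe impl<$($name: ValidAsZeroBits),+> ValidAsZeroBits for ($($name,)+) {}
    };
}

impl_tuples!(A);
impl_tuples!(A, B);
impl_tuples!(A, B, C);

// Compile-time check that the blanket impls exist.
fn assert_zeroable<T: ValidAsZeroBits>() {}

fn main() {
    assert_zeroable::<(u8, f32)>();
    assert_zeroable::<(f32, f32, u8)>();
}
```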
Sometimes the current limit (4 or 5) of parameters is not enough; can we have tuples implement the IntoKernelParam trait? (Maybe IntoKernelParam would also specify the size of a parameter, so a tuple of size 2 would have double the size.)
Thanks for this lib! I was trying it out and managed to cause an illegal address access in the cuda kernel.
Here is an example (not necessarily the most elegant; I took the matmul example from nvidia).
This example works OK when running in debug mode but fails in release mode.
My guess is that the slices get dropped too early after optimization (since all ops are asynchronous, I don't think the compiler knows that it should wait for a device synchronization for a and b).
Hereafter the program and stacktrace:
use cudarc::device::{CudaDeviceBuilder, LaunchConfig, LaunchCudaFunction};
use cudarc::jit::{compile_ptx_with_opts, CompileError, CompileOptions};

fn main() -> Result<(), CompileError> {
    let start = std::time::Instant::now();
    let opts = CompileOptions {
        ftz: Some(true),
        prec_div: Some(false),
        prec_sqrt: Some(false),
        fmad: Some(true),
        ..Default::default()
    };
    let ptx = compile_ptx_with_opts(
        "
__global__ void matmul(float* A, float* B, float* C, int N) {
    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        // each thread computes one element of the block sub-matrix
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
    }
    // printf(\"pos, (%d, %d) - N %d - value %d\\n\", ROW, COL, N, tmpSum);
    C[ROW * N + COL] = tmpSum;
}
",
        opts,
    )?;

    use core::ffi::CStr;
    let ptx: &CStr = unsafe { CStr::from_ptr(ptx.image().as_ptr()) };
    let ptx: &str = ptx.to_str().unwrap();
    println!("Compilation succeeded!");

    use std::fs::File;
    use std::io::Write;
    let mut file = File::create("matmul.ptx").unwrap();
    file.write_all(ptx.as_bytes()).unwrap();
    println!("Init in {:?}", start.elapsed());

    let dev = CudaDeviceBuilder::new(0)
        .with_ptx_from_file("_Z6matmulPfS_S_i", "./matmul.ptx", &["_Z6matmulPfS_S_i"])
        .build()
        .unwrap();
    println!("Init in {:?}", start.elapsed());

    // "_Z6matmulPfS_S_i" is the key used with CudaDeviceBuilder
    let module = dev.get_module("_Z6matmulPfS_S_i").unwrap();
    // "_Z6matmulPfS_S_i" is also the name of the actual function inside the .ptx file
    let f = module.get_fn("_Z6matmulPfS_S_i").unwrap();
    println!("Loaded in {:?}", start.elapsed());

    let a_host = [1.0f32, 2.0, 3.0, 4.0];
    let b_host = [1.0f32, 2.0, 3.0, 4.0];
    let mut c_host = [0.0f32; 4];
    let a_dev = dev.sync_copy(&a_host).unwrap();
    let b_dev = dev.sync_copy(&b_host).unwrap();
    let mut c_dev = dev.sync_copy(&c_host).unwrap();

    let cfg = LaunchConfig {
        block_dim: (2, 2, 1),
        grid_dim: (1, 1, 1),
        shared_mem_bytes: 0,
    };
    unsafe { dev.launch_async(f, cfg, (&a_dev, &b_dev, &mut c_dev, 2)) }.unwrap();
    drop(a_dev);
    drop(b_dev);
    dev.sync_copy_from(&c_dev, &mut c_host).unwrap();
    drop(c_dev);
    println!("Found {:?} in {:?}", c_host, start.elapsed());
    Ok(())
}
Stacktrace:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
stack backtrace:
0: 0x55b4bfb5c5fa - std::backtrace_rs::backtrace::libunwind::trace::h196c489280b5a090
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x55b4bfb5c5fa - std::backtrace_rs::backtrace::trace_unsynchronized::h13bf778637c29636
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x55b4bfb5c5fa - std::sys_common::backtrace::_print_fmt::h0f4861fc71bccacd
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:65:5
3: 0x55b4bfb5c5fa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5f011e3cd484ad42
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:44:22
4: 0x55b4bfb7ad2e - core::fmt::write::h1524a82fa967022b
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/fmt/mod.rs:1208:17
5: 0x55b4bfb5a445 - std::io::Write::write_fmt::ha3d33adc534ae39a
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/io/mod.rs:1682:15
6: 0x55b4bfb5c3c5 - std::sys_common::backtrace::_print::h11f95f719c502811
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:47:5
7: 0x55b4bfb5c3c5 - std::sys_common::backtrace::print::h71adea94b47d00ae
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:34:9
8: 0x55b4bfb5da4f - std::panicking::default_hook::{{closure}}::he466417f162d07e8
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:267:22
9: 0x55b4bfb5d78b - std::panicking::default_hook::hff6ac2c6adb87df1
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:286:9
10: 0x55b4bfb5e15c - std::panicking::rust_panic_with_hook::hcb61d2700034f188
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:688:13
11: 0x55b4bfb5def9 - std::panicking::begin_panic_handler::{{closure}}::h44457d7d7214b4ef
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:579:13
12: 0x55b4bfb5caac - std::sys_common::backtrace::__rust_end_short_backtrace::hf4f789a62590af34
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:137:18
13: 0x55b4bfb5dc02 - rust_begin_unwind
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:575:5
14: 0x55b4bfb3b193 - core::panicking::panic_fmt::h19d66e5282d7808f
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/panicking.rs:64:14
15: 0x55b4bfb3b5c3 - core::result::unwrap_failed::h5275d1617160e163
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/result.rs:1790:5
16: 0x55b4bfb3bce6 - core::ptr::drop_in_place<cudarc::device::CudaSlice<f32>>::h1e3a20025b0386e7
17: 0x55b4bfb3c7bf - matmul::main::h96270d3cbc9d8793
18: 0x55b4bfb3c823 - std::sys_common::backtrace::__rust_begin_short_backtrace::h77c3c7fd1a851eb0
19: 0x55b4bfb3c843 - std::rt::lang_start::{{closure}}::he524618c01c92a82
20: 0x55b4bfb576ec - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hee44d3bca815d6cd
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/ops/function.rs:606:13
21: 0x55b4bfb576ec - std::panicking::try::do_call::h8e2913788124065e
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
22: 0x55b4bfb576ec - std::panicking::try::h460040cf489c3878
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
23: 0x55b4bfb576ec - std::panic::catch_unwind::h5e12d45b7905f1f3
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
24: 0x55b4bfb576ec - std::rt::lang_start_internal::{{closure}}::h53e0b25b0eaaa72b
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:48
25: 0x55b4bfb576ec - std::panicking::try::do_call::h84ed13c66e27b8c7
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
26: 0x55b4bfb576ec - std::panicking::try::h9972483dcd03469d
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
27: 0x55b4bfb576ec - std::panic::catch_unwind::h56b5ad191546ecc2
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
28: 0x55b4bfb576ec - std::rt::lang_start_internal::hc154d6be0270f9da
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:20
29: 0x55b4bfb3c815 - main
30: 0x7f40d043c290 - <unknown>
31: 0x7f40d043c34a - __libc_start_main
32: 0x55b4bfb3b705 - _start
at /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:115
33: 0x0 - <unknown>
thread panicked while panicking. aborting.
Aborted (core dumped)
Cheers!
This is a small enough library with different enough pieces where it isn't really needed.
This will enable a corresponding CpuRng to implement them.
Related to #2, curand may not support driver api
Would be similar to slices of Vecs.
Ideal interface would be:
let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);
where offset would panic if there aren't enough elements, otherwise make a copy of the device pointer.
Scenarios to figure out:
let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);
drop(a);
// b should still be valid here, so the underlying device pointer should not be dropped
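One way to make that drop(a) scenario work is to reference-count the underlying allocation, so the device pointer is only freed when the last sub-slice goes away. Here is a mock sketch of the idea with a Vec standing in for device memory (Alloc, Slice, and this offset are illustrative names, not the cudarc API):

```rust
use std::sync::Arc;

// Mock of a device allocation; the Vec plays the role of the CUDA buffer.
struct Alloc(Vec<f32>);

// A slice is (shared allocation, start, len); cloning the Arc means the
// allocation outlives any individual sub-slice.
struct Slice {
    alloc: Arc<Alloc>,
    start: usize,
    len: usize,
}

impl Slice {
    fn offset(&self, n: usize) -> Slice {
        assert!(n <= self.len, "not enough elements");
        Slice {
            alloc: Arc::clone(&self.alloc),
            start: self.start + n,
            len: self.len - n,
        }
    }

    fn get(&self, i: usize) -> f32 {
        assert!(i < self.len);
        self.alloc.0[self.start + i]
    }
}

fn main() {
    let a = Slice { alloc: Arc::new(Alloc(vec![1.0, 2.0, 3.0, 4.0])), start: 0, len: 4 };
    let b = a.offset(2);
    drop(a);
    // b keeps the allocation alive even after a is dropped.
    assert_eq!(b.get(0), 3.0);
}
```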
Sometimes you only need a specific part of a CudaSlice, and copying from a CudaView would be very useful. (And why can you only slice with a RangeFrom?)
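Supporting arbitrary ranges rather than just RangeFrom mostly comes down to normalizing any RangeBounds into a (start, len) pair before building the view. A sketch of that normalization (the function name is illustrative):

```rust
use std::ops::{Bound, RangeBounds};

// Normalize any range type (a..b, a.., ..=b, .., etc.) into (start, len)
// over a buffer of `len` elements, panicking on out-of-bounds ranges.
fn bounds_to_offsets(range: impl RangeBounds<usize>, len: usize) -> (usize, usize) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s + 1,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e + 1,
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len,
    };
    assert!(start <= end && end <= len, "range out of bounds");
    (start, end - start)
}

fn main() {
    assert_eq!(bounds_to_offsets(1..3, 5), (1, 2));
    assert_eq!(bounds_to_offsets(2.., 5), (2, 3));
    assert_eq!(bounds_to_offsets(..=3, 5), (0, 4));
    assert_eq!(bounds_to_offsets(.., 5), (0, 5));
}
```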
can this function be public?
Line 388 in 2785abb
Sometimes I know that I'll overwrite/initialize the memory, but memsetting 2-3 GB beforehand is rather a waste of time.
Should take a Vec and overwrite self.host_buf.
Currently CudaDevice only supports a single stream. Look into how multiple streams should be supported.
Cuda does not raise any errors if you access memory past the bytes you allocated. The safe functions account for this, but unit tests should be added to ensure it.
This would be a good justification for ordinal as a const generic on both CudaSlice/CudaDevice.
Originally posted by @M1ngXU in #6 (comment)
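With the ordinal as a const generic, mixing slices from different devices would become a compile error rather than a runtime one. A stub sketch of the idea (these are mock types written here for illustration, not the real cudarc ones):

```rust
use std::marker::PhantomData;

// Mock device/slice types that carry the device ordinal in the type.
struct CudaDevice<const ORDINAL: usize>;
struct CudaSlice<T, const ORDINAL: usize>(PhantomData<T>);

impl<const ORDINAL: usize> CudaDevice<ORDINAL> {
    fn alloc_f32(&self) -> CudaSlice<f32, ORDINAL> {
        CudaSlice(PhantomData)
    }

    // Only accepts slices allocated on this same device ordinal.
    fn launch(&self, _a: &CudaSlice<f32, ORDINAL>) {}
}

fn main() {
    let dev0 = CudaDevice::<0>;
    let a = dev0.alloc_f32();
    dev0.launch(&a); // ok: same ordinal
    // let dev1 = CudaDevice::<1>;
    // dev1.launch(&a); // would not compile: ordinal 0 != ordinal 1
}
```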
unsafe impl Send for CudaDevice {}
unsafe impl Sync for CudaDevice {}
is all that is needed.
It doesn't work out of the box because CudaDevice has raw pointers from the sys api.
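A reduced sketch of why the unsafe impls are needed and what they unlock: the raw pointer field below mimics the sys handles (raw pointers are !Send/!Sync by default), and once the impls are asserted the device can be shared across threads with a plain Arc. The thread-safety claim is an assumption that has to be justified against the CUDA driver API's documentation.

```rust
use std::ffi::c_void;
use std::sync::Arc;
use std::thread;

// Mimics CudaDevice holding raw sys handles.
struct CudaDevice {
    cu_device: *mut c_void,
    ordinal: usize,
}

// SAFETY (assumption): the CUDA driver API is thread-safe, so sharing the
// handles across threads is sound.
unsafe impl Send for CudaDevice {}
unsafe impl Sync for CudaDevice {}

fn main() {
    let dev = Arc::new(CudaDevice { cu_device: std::ptr::null_mut(), ordinal: 0 });
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let dev = Arc::clone(&dev);
            // Without the unsafe impls above, this spawn would not compile.
            thread::spawn(move || dev.ordinal)
        })
        .collect();
    let sum: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(sum, 0);
}
```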
Can we fix some clippy issues (found some in nvrtc)? Maybe add some github workflow like there is in dfdx?
Apparently this is all expected according to their documentation. This crate should find a way around this shortcoming of their host api.
Equivalent to Vec<T> on the cpu. All APIs should accept this instead.
Notably this crate should not worry about compile time safety, and instead be as flexible as possible.
A function that also accepts a vector would be great, since the CudaSlice sizes might be dynamic and it otherwise requires more boilerplate code to actually create a correctly sized slice.
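One way to accept arrays, slices, and Vecs through a single entry point is a generic AsRef<[T]> bound. A sketch of the signature shape (this sync_copy is a CPU stand-in that returns the data it "uploaded" so the behavior is observable; it is not the real cudarc function):

```rust
// Stand-in for a host-to-device copy: accepts anything viewable as a slice.
fn sync_copy<T: Copy, S: AsRef<[T]>>(src: S) -> Vec<T> {
    src.as_ref().to_vec()
}

fn main() {
    // The same function works for a fixed-size array and a dynamically
    // sized Vec, with no extra boilerplate at the call site.
    assert_eq!(sync_copy([1.0f32, 2.0]), vec![1.0, 2.0]);
    let dynamic: Vec<f32> = (0..3).map(|i| i as f32).collect();
    assert_eq!(sync_copy(dynamic), vec![0.0, 1.0, 2.0]);
}
```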
Copying to device depends on these values staying in the same memory location.
There is potential undefined behavior if any of the params for cuLaunchKernel are dropped before the kernel actually executes.
A simple case is just a local parameter variable, which gets turned into a kernel parameter by reference. That means the pointer is into the local stack frame, maybe?
What would need to happen to fix this is ensuring that these values are dropped only after the kernel executes. One potential way to do this is to use cuLaunchHostFunc, where the host function does nothing but has ownership of all the params. Pin also seems very useful for this case.
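The ownership idea can be mocked on the CPU: the "kernel" runs on another thread and holds Arc clones of every param, so the caller can drop its own handles immediately without invalidating anything the kernel reads. MockSlice and this launch_async are illustrative, not the cudarc API:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for CudaSlice; the Vec plays the role of device memory.
struct MockSlice(Vec<f32>);

// The "launch" takes shared ownership of every param, so the buffers
// cannot be freed until the "kernel" thread has finished with them.
fn launch_async(params: Vec<Arc<MockSlice>>) -> thread::JoinHandle<f32> {
    thread::spawn(move || params.iter().map(|p| p.0.iter().sum::<f32>()).sum())
}

fn main() {
    let a = Arc::new(MockSlice(vec![1.0, 2.0]));
    let b = Arc::new(MockSlice(vec![3.0, 4.0]));
    let pending = launch_async(vec![Arc::clone(&a), Arc::clone(&b)]);
    // Dropping the caller's handles is now fine: the launch owns clones.
    drop(a);
    drop(b);
    assert_eq!(pending.join().unwrap(), 10.0);
}
```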
Would look like this:
#[derive(Debug)]
pub struct CudaBytes {
    pub(crate) cu_device_ptr: sys::CUdeviceptr,
    pub(crate) num_bytes: usize,
    pub(crate) device: ...
}
And this can be unsafely transformed into a CudaSlice.
Currently all cuda modules must be specified in the device builder for CudaDevice. This means you can't add any cuda modules after you've built the CudaDevice object.
It would be nice to be able to add a new CudaModule that was compiled on the fly during forward/backward pass, but that would require mutating the CudaDevice even though it's passed by reference.
unclear how this interacts with #10
cuBLAS only has matmuls and vec-mat muls, so no convs unless an im2col is also added. cuDNN definitely has convs and a couple of other nice things, but upon first look it's unclear how matmuls work there.
Currently all operations, including de-allocations, happen on the default stream. In dfdx, after a long forward pass with many operations (e.g. 100 operations, each producing 1+ gradient), all gradients are captured in a Gradients object. After the forward pass is done, the gradients object is dropped, which means ALL temporary gradients are de-allocated at once.
This blocks the default stream at the moment, so all de-allocations occur before any other work can complete.
Instead, we should put de-allocations on a second stream that is synchronized with the default stream with events:
This should free up the default stream to continue working
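The stream/event dance can be mocked with threads and a channel: the send plays the role of cuEventRecord on the default stream, and the recv plays cuStreamWaitEvent on the dealloc stream, so frees never run before the work they depend on. This is an analogy, not driver API code:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (event_tx, event_rx) = mpsc::channel::<()>();
    let (log_tx, log_rx) = mpsc::channel::<&'static str>();

    // "Dealloc stream": waits on the event before freeing anything.
    let log = log_tx.clone();
    let dealloc = thread::spawn(move || {
        event_rx.recv().unwrap(); // cuStreamWaitEvent analogue
        log.send("dealloc").unwrap();
    });

    // "Default stream": does the work, then records the event.
    log_tx.send("kernel work").unwrap();
    event_tx.send(()).unwrap(); // cuEventRecord analogue

    dealloc.join().unwrap();
    // The free is ordered strictly after the work it depends on.
    let order: Vec<_> = log_rx.try_iter().collect();
    assert_eq!(order, ["kernel work", "dealloc"]);
}
```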
Should probably use rand and rand_distributions; the other option is just create/gen_x/destroy.
To reduce dependencies
Each of the top level single file modules (blas.rs, device.rs, jit.rs, and rng.rs) corresponds to one of the other top level modules that contain the sys.rs/result.rs. To make this more consistent and make the structure more obvious, these should all be moved into the corresponding folders and renamed safe.rs.
This would look like:
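Based on the description above, the resulting layout would presumably be something like the following (the folder names are assumed to be the existing sys/result module folders):

```text
src/
  cublas/            // assumed folder name
    mod.rs
    result.rs
    sys.rs
    safe.rs          // was src/blas.rs
  driver/
    mod.rs
    result.rs
    sys.rs
    safe.rs          // was src/device.rs
  nvrtc/
    ...
    safe.rs          // was src/jit.rs
  curand/
    ...
    safe.rs          // was src/rng.rs
```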
Sometimes I get this error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NvrtcLoadingError { key: "cuda", cuda: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") }'
though just recompiling the rust (not nvcc) fixes it.
Currently not accurate to what it actually does