cudarc's People

Contributors

bitemyapp, bloodre, brandonros, chaserileyroberts, corey-lambda, coreylowman, dkales, emchristiansen, fiend-star-666, garymcd, jafioti, jark5455, l3utterfly, laurentmazare, m1ngxu, mert-kurttutan, mneilly, narsil, nkoppel, olivierdehaene, tthebst, vartec, viliamvadocz, wenhaozhao, zjsec

cudarc's Issues

Unsoundness of `&mut T` as a kernel param

At the time launch_async is called, the callee has exclusive access to the value through the `&mut T`, but that exclusivity is released as soon as launch_async returns.

What if the device mutates the variable at the same time as a host function? Or, similar to #29, what if `t` is dropped before the kernel actually executes?

`take_async` should only accept `Send` or `Sync` types

I accidentally passed a `&mut u8` instead of a `u8` and got weird errors. The borrowed data can't be guaranteed to stay valid and probably can't be accessed from the device either. I'm not sure whether we have to restrict `T` to `Send` or `Sync` (or maybe even something else).
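
For reference, here is a minimal standalone sketch (no CUDA involved; `require_owned` is an illustrative stand-in, not a cudarc function) of how such bounds behave. Note that `&mut u8` is itself `Send` and `Sync` (because `u8` is), so those bounds alone would not reject it; a `'static` bound, or an API that takes ownership, is what rules out borrows of locals.

// Illustrative stand-in, not a cudarc function.
fn require_owned<T: Send + Sync + 'static>(_value: T) {}

fn main() {
    let x = 0u8;
    require_owned(x); // OK: `u8` is `Send + Sync + 'static`

    // let mut y = 0u8;
    // require_owned(&mut y);
    // ^ rejected: the mutable borrow of the local `y` is not `'static`.
    // (Note `&mut u8` is still `Send + Sync`, so those bounds alone don't catch it.)
}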

Use of `std::vec` doesn't work with `no_std`

error[E0433]: failed to resolve: could not find `vec` in `std`
   --> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/cudarc-0.4.0/src/device.rs:387:50
    |
387 |         let mut buf = buf.unwrap_or_else(|| std::vec![Default::default(); src.len]);
    |                                                  ^^^ could not find `vec` in `std`
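
One possible fix, assuming the crate pulls in the `alloc` crate when built without `std`: use `alloc::vec!` (which `std::vec!` simply re-exports) instead of the `std::vec!` path. A standalone sketch of the pattern (the function below is an illustrative stand-in for the device.rs:387 line, not the crate's actual code):

extern crate alloc;

use alloc::vec::Vec;

// Illustrative stand-in for the `buf.unwrap_or_else(...)` pattern in the error above.
fn buf_or_default(buf: Option<Vec<f32>>, len: usize) -> Vec<f32> {
    // `alloc::vec!` resolves on both `std` and `no_std` (+ `alloc`) builds.
    buf.unwrap_or_else(|| alloc::vec![Default::default(); len])
}

fn main() {
    assert_eq!(buf_or_default(None, 4), alloc::vec![0.0f32; 4]);
}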

cuDNN support

See the initial discussion in #1 and #16 for work in progress.

On thread safety:

The cuDNN library is thread-safe. Its functions can be called from multiple host threads, so long as the threads do not share the same cuDNN handle simultaneously.
(Source: NVIDIA cuDNN documentation.)

More parameters for launching kernels

Sometimes the current limit (4 or 5) of parameters is not enough; can we have tuples implement the IntoKernelParam trait? (Maybe IntoKernelParam could also specify the size of a parameter, so a tuple of size 2 would have double the size.)
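
A rough sketch of the tuple idea (the trait below is hypothetical and simplified, not the crate's current IntoKernelParam): a params trait that yields a list of raw pointers, so a tuple flattens into several kernel parameters. The pointers borrow from the tuple, so they are only valid while it is alive, which ties into the lifetime issues discussed elsewhere in this tracker.

use core::ffi::c_void;

// Hypothetical trait, not the crate's actual `IntoKernelParam`.
pub trait IntoKernelParams {
    /// Raw pointers to each parameter; only valid while `self` is borrowed.
    fn into_kernel_params(&self) -> Vec<*mut c_void>;
}

macro_rules! impl_tuple_params {
    ($($name:ident : $idx:tt),+) => {
        impl<$($name),+> IntoKernelParams for ($($name,)+) {
            fn into_kernel_params(&self) -> Vec<*mut c_void> {
                vec![$(&self.$idx as *const $name as *mut c_void),+]
            }
        }
    };
}

impl_tuple_params!(A: 0);
impl_tuple_params!(A: 0, B: 1);
impl_tuple_params!(A: 0, B: 1, C: 2);
// ...extend to whatever arity is needed.

fn main() {
    let params = (1.0f32, 2u32, 3i64);
    // Three pointers, one per tuple field, ready to hand to cuLaunchKernel.
    assert_eq!(params.into_kernel_params().len(), 3);
}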

Unsound API?

Thanks for this lib. I was trying it out and managed to cause an illegal address access in the CUDA kernel.

Here is an example (not necessarily the most elegant; I took the matmul example from NVIDIA).

This example works OK when running in debug mode and fails in release mode.

My guess is that the slices get dropped too early after optimization (since all ops are asynchronous, I don't think the compiler knows that it should wait for a device synchronization before dropping a and b).

Here are the program and the stack trace:

use cudarc::device::{CudaDeviceBuilder, LaunchConfig, LaunchCudaFunction};
use cudarc::jit::{compile_ptx_with_opts, CompileError, CompileOptions};

fn main() -> Result<(), CompileError> {
    let start = std::time::Instant::now();
    let opts = CompileOptions {
        ftz: Some(true),
        prec_div: Some(false),
        prec_sqrt: Some(false),
        fmad: Some(true),
        ..Default::default()
    };

    let ptx = compile_ptx_with_opts(
        "

            __global__ void matmul(float* A, float* B, float* C, int N) {

                int ROW = blockIdx.y*blockDim.y+threadIdx.y;
                int COL = blockIdx.x*blockDim.x+threadIdx.x;

                float tmpSum = 0;

                if (ROW < N && COL < N) {
                    // each thread computes one element of the block sub-matrix
                    for (int i = 0; i < N; i++) {
                        tmpSum += A[ROW * N + i] * B[i * N + COL];
                    }
                }
                // printf(\"pos, (%d, %d) - N %d - value %d\\n\", ROW, COL, N, tmpSum);
                C[ROW * N + COL] = tmpSum;
            }

            ",
        opts,
    )?;
    use core::ffi::CStr;
    let ptx: &CStr = unsafe { CStr::from_ptr(ptx.image().as_ptr()) };
    let ptx: &str = ptx.to_str().unwrap();
    println!("Compilation succeeded!");
    use std::fs::File;
    use std::io::Write;
    let mut file = File::create("matmul.ptx").unwrap();
    file.write_all(ptx.as_bytes()).unwrap();
    println!("Init in {:?}", start.elapsed());
    let dev = CudaDeviceBuilder::new(0)
        .with_ptx_from_file("_Z6matmulPfS_S_i", "./matmul.ptx", &["_Z6matmulPfS_S_i"])
        .build()
        .unwrap();
    println!("Init in {:?}", start.elapsed());

    // "_Z6matmulPfS_S_i" is the key registered with CudaDeviceBuilder above
    let module = dev.get_module("_Z6matmulPfS_S_i").unwrap();

    // "_Z6matmulPfS_S_i" is also the mangled name of the function inside the .ptx file
    let f = module.get_fn("_Z6matmulPfS_S_i").unwrap();
    println!("Loaded in {:?}", start.elapsed());

    let a_host = [1.0f32, 2.0, 3.0, 4.0];
    let b_host = [1.0f32, 2.0, 3.0, 4.0];
    let mut c_host = [0.0f32; 4];

    let a_dev = dev.sync_copy(&a_host).unwrap();
    let b_dev = dev.sync_copy(&b_host).unwrap();
    let mut c_dev = dev.sync_copy(&c_host).unwrap();

    let cfg = LaunchConfig {
        block_dim: (2, 2, 1),
        grid_dim: (1, 1, 1),
        shared_mem_bytes: 0,
    };
    unsafe { dev.launch_async(f, cfg, (&a_dev, &b_dev, &mut c_dev, 2)) }.unwrap();

    drop(a_dev);
    drop(b_dev);

    dev.sync_copy_from(&c_dev, &mut c_host).unwrap();
    drop(c_dev);
    println!("Found {:?} in {:?}", c_host, start.elapsed());
    Ok(())
}

Stack trace:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
stack backtrace:
   0:     0x55b4bfb5c5fa - std::backtrace_rs::backtrace::libunwind::trace::h196c489280b5a090
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55b4bfb5c5fa - std::backtrace_rs::backtrace::trace_unsynchronized::h13bf778637c29636
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55b4bfb5c5fa - std::sys_common::backtrace::_print_fmt::h0f4861fc71bccacd
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55b4bfb5c5fa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5f011e3cd484ad42
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55b4bfb7ad2e - core::fmt::write::h1524a82fa967022b
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/fmt/mod.rs:1208:17
   5:     0x55b4bfb5a445 - std::io::Write::write_fmt::ha3d33adc534ae39a
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/io/mod.rs:1682:15
   6:     0x55b4bfb5c3c5 - std::sys_common::backtrace::_print::h11f95f719c502811
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x55b4bfb5c3c5 - std::sys_common::backtrace::print::h71adea94b47d00ae
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x55b4bfb5da4f - std::panicking::default_hook::{{closure}}::he466417f162d07e8
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:267:22
   9:     0x55b4bfb5d78b - std::panicking::default_hook::hff6ac2c6adb87df1
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:286:9
  10:     0x55b4bfb5e15c - std::panicking::rust_panic_with_hook::hcb61d2700034f188
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:688:13
  11:     0x55b4bfb5def9 - std::panicking::begin_panic_handler::{{closure}}::h44457d7d7214b4ef
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:579:13
  12:     0x55b4bfb5caac - std::sys_common::backtrace::__rust_end_short_backtrace::hf4f789a62590af34
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:137:18
  13:     0x55b4bfb5dc02 - rust_begin_unwind
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:575:5
  14:     0x55b4bfb3b193 - core::panicking::panic_fmt::h19d66e5282d7808f
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/panicking.rs:64:14
  15:     0x55b4bfb3b5c3 - core::result::unwrap_failed::h5275d1617160e163
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/result.rs:1790:5
  16:     0x55b4bfb3bce6 - core::ptr::drop_in_place<cudarc::device::CudaSlice<f32>>::h1e3a20025b0386e7
  17:     0x55b4bfb3c7bf - matmul::main::h96270d3cbc9d8793
  18:     0x55b4bfb3c823 - std::sys_common::backtrace::__rust_begin_short_backtrace::h77c3c7fd1a851eb0
  19:     0x55b4bfb3c843 - std::rt::lang_start::{{closure}}::he524618c01c92a82
  20:     0x55b4bfb576ec - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hee44d3bca815d6cd
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/ops/function.rs:606:13
  21:     0x55b4bfb576ec - std::panicking::try::do_call::h8e2913788124065e
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
  22:     0x55b4bfb576ec - std::panicking::try::h460040cf489c3878
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
  23:     0x55b4bfb576ec - std::panic::catch_unwind::h5e12d45b7905f1f3
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
  24:     0x55b4bfb576ec - std::rt::lang_start_internal::{{closure}}::h53e0b25b0eaaa72b
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:48
  25:     0x55b4bfb576ec - std::panicking::try::do_call::h84ed13c66e27b8c7
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
  26:     0x55b4bfb576ec - std::panicking::try::h9972483dcd03469d
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
  27:     0x55b4bfb576ec - std::panic::catch_unwind::h56b5ad191546ecc2
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
  28:     0x55b4bfb576ec - std::rt::lang_start_internal::hc154d6be0270f9da
                               at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:20
  29:     0x55b4bfb3c815 - main
  30:     0x7f40d043c290 - <unknown>
  31:     0x7f40d043c34a - __libc_start_main
  32:     0x55b4bfb3b705 - _start
                               at /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:115
  33:                0x0 - <unknown>
thread panicked while panicking. aborting.
Aborted (core dumped)

Cheers!

Remove `mod prelude`

This is a small enough library, with distinct enough pieces, that a prelude isn't really needed.

Sub-views of CudaSlices

Would be similar to slices of Vecs.

Ideal interface would be:

let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);

where offset would panic if there aren't enough elements, and otherwise return a copy of the device pointer.

Scenarios to figure out:

let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);
drop(a);
// b should still be valid here, so the underlying device pointer should not be dropped
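
One way the drop(a) scenario above could be made to work (everything below is a simplified stand-in, not the crate's real CudaSlice): keep the owning allocation behind an Arc, so sub-views share it and the device pointer is only freed once the last view is gone.

use std::marker::PhantomData;
use std::sync::Arc;

// Simplified stand-ins for illustration only.
struct DeviceAllocation {
    ptr: u64,   // would be sys::CUdeviceptr; a real version would free it in Drop
    len: usize, // total number of elements
}

struct CudaSlice<T> {
    alloc: Arc<DeviceAllocation>,
    start: usize, // element offset into the shared allocation
    len: usize,
    marker: PhantomData<T>,
}

impl<T> CudaSlice<T> {
    /// Panics if `start` is out of bounds; otherwise returns a view that
    /// shares (and keeps alive) the underlying allocation.
    fn offset(&self, start: usize) -> CudaSlice<T> {
        assert!(start <= self.len, "offset out of bounds");
        CudaSlice {
            alloc: Arc::clone(&self.alloc),
            start: self.start + start,
            len: self.len - start,
            marker: PhantomData,
        }
    }
}

fn main() {
    let alloc = Arc::new(DeviceAllocation { ptr: 0xdead_beef, len: 256 });
    let a = CudaSlice::<f32> { alloc, start: 0, len: 256, marker: PhantomData };
    let b = a.offset(123);
    drop(a);
    // `b` still holds an Arc to the allocation, so the device pointer lives on.
    assert_eq!(b.len, 133);
    assert_eq!(b.alloc.len, 256);
    assert_eq!(b.alloc.ptr, 0xdead_beef);
}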

copy from `CudaView`

Sometimes you only need a specific part of a CudaSlice, so copying from a CudaView would be very useful. (Also, why can you only slice with a RangeFrom?)

Multi stream support

Currently CudaDevice only supports a single stream. Look into how multiple streams should be supported.

Docs.rs build fails on 0.9.0

This seems to be related to changes to build.rs that require a CUDA installation to exist, specifically here. This should be fixable with a #[cfg(not(doc))] and a version bump.

Here is the build log of the failing build.
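
As a possible alternative to the cfg approach: docs.rs sets the DOCS_RS environment variable during its builds, so build.rs itself could skip the CUDA probing when that variable is present. A minimal sketch:

// build.rs sketch: bail out early on docs.rs, where no CUDA toolkit exists.
fn main() {
    if std::env::var("DOCS_RS").is_ok() {
        // Only documentation is being built; skip probing for CUDA entirely.
        return;
    }
    // ...the existing logic that locates the CUDA installation would go here.
}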

impl Send/Sync for CudaDevice

unsafe impl Send for CudaDevice {}
unsafe impl Sync for CudaDevice {}

is all that is needed.

It doesn't work out of the box because CudaDevice has raw pointers from the sys API.

Clippy issues

Can we fix some Clippy issues (I found some in nvrtc)? Maybe add a GitHub workflow like the one in dfdx?

Add `CudaSlice<T>` & remove `CudaRc`

Equivalent to Vec<T> on the CPU. All APIs should accept this instead.

Notably this crate should not worry about compile time safety, and instead be as flexible as possible.

`sync_copy_from` - also accept vector

A function that also accepts a vector would be great: CudaSlice sizes might be dynamic, and it currently takes extra boilerplate to create a correctly sized host slice to copy into.
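
In the meantime, a hypothetical helper along these lines can hide the boilerplate. This is a sketch, not crate code: it assumes CudaDevice and CudaSlice are importable from cudarc::device as in the examples above, that sync_copy_from accepts any `&mut [T]`, that the caller knows (or can query) the element count, and it may need extra trait bounds depending on the crate version.

use cudarc::device::{CudaDevice, CudaSlice};

// Hypothetical helper, not part of the crate. `len` is the element count of
// `src`; if `CudaSlice` exposes a length accessor, that could be used instead
// so the caller doesn't have to track it.
fn copy_to_vec<T: Clone + Default>(dev: &CudaDevice, src: &CudaSlice<T>, len: usize) -> Vec<T> {
    let mut host = vec![T::default(); len];
    dev.sync_copy_from(src, &mut host).unwrap();
    host
}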

Unsound lifetimes of kernel params

There is potential undefined behavior if any of the params for cuLaunchKernel are dropped before the kernel actually executes.

A simple case is a local variable that gets turned into a kernel parameter via a reference, which may mean the kernel parameter pointer points into a local stack frame.

What would need to happen to fix this is ensuring that these values are dropped only after the kernel executes. One potential way to do this is to use cuLaunchHostFunc, where the host function does nothing but holds ownership of all the params.

Pin also seems very useful for this case.

Add CudaBytes

Would look like this:

#[derive(Debug)]
pub struct CudaBytes {
    pub(crate) cu_device_ptr: sys::CUdeviceptr,
    pub(crate) num_bytes: usize,
    pub(crate) device: ...
}

This can then be unsafely transformed into a CudaSlice.

Add Mutex<CudaModule> to CudaDevice for true JIT

Currently all cuda modules must be specified in the device builder for CudaDevice. This means you can't add any cuda modules after you've built the CudaDevice object.

It would be nice to be able to add a new CudaModule that was compiled on the fly during forward/backward pass, but that would require mutating the CudaDevice even though it's passed by reference.

Unclear how this interacts with #10.
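
A sketch of the interior-mutability direction (the types and field names below are illustrative stand-ins, not the crate's actual layout): keep the module map behind a lock, so new modules can be registered through a shared &CudaDevice.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Illustrative stand-ins only.
struct CudaModule; // would wrap the sys::CUmodule handle and its loaded functions

struct CudaDevice {
    modules: Mutex<HashMap<String, Arc<CudaModule>>>,
    // ...other handles (context, stream, ...)
}

impl CudaDevice {
    /// Register a module compiled at runtime, without requiring `&mut self`.
    fn insert_module(&self, key: String, module: CudaModule) {
        self.modules.lock().unwrap().insert(key, Arc::new(module));
    }

    fn get_module(&self, key: &str) -> Option<Arc<CudaModule>> {
        self.modules.lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let dev = CudaDevice { modules: Mutex::new(HashMap::new()) };
    dev.insert_module("matmul".to_string(), CudaModule);
    assert!(dev.get_module("matmul").is_some());
}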

Add cuBLAS or cuDNN support

cuBLAS only has matmuls and matrix-vector muls, so no convolutions unless an im2col is also added. cuDNN definitely has convolutions and a couple of other nice things, but it's unclear at first look how matmuls work there.

Use a second stream for de-allocating memory

Currently all operations, including de-allocations, happen on the default stream. In dfdx, after a long forward pass with many operations (e.g. 100 operations, each producing 1+ gradients), all gradients are captured in a Gradients object. After the forward pass is done, the gradients object is dropped, which means ALL temporary gradients are de-allocated at once.

This blocks the default stream at the moment, so all de-allocations occur before any other work can complete.

Instead, we should put de-allocations on a second stream that is synchronized with the default stream with events:

  1. call cuEventCreate
  2. call cuEventRecord with the default stream
  3. call cuStreamWaitEvent with the event and the deallocation stream
  4. call free_async with the deallocation stream

This should free up the default stream to continue working.
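
A sketch of that choreography against the raw driver API (the FFI declarations below are hand-written with simplified types rather than taken from the crate's sys module, error checking is omitted, and cuMemFreeAsync requires CUDA 11.2+):

use core::ffi::c_void;

type CUresult = i32;
type CUstream = *mut c_void;
type CUevent = *mut c_void;
type CUdeviceptr = u64;

const CU_EVENT_DISABLE_TIMING: u32 = 0x2;

#[allow(non_snake_case)]
#[link(name = "cuda")]
extern "C" {
    fn cuEventCreate(event: *mut CUevent, flags: u32) -> CUresult;
    fn cuEventRecord(event: CUevent, stream: CUstream) -> CUresult;
    fn cuStreamWaitEvent(stream: CUstream, event: CUevent, flags: u32) -> CUresult;
    fn cuMemFreeAsync(dptr: CUdeviceptr, stream: CUstream) -> CUresult;
    fn cuEventDestroy(event: CUevent) -> CUresult;
}

/// Free `dptr` on `dealloc_stream`, but only after all work currently queued
/// on `default_stream` has completed. Every CUresult should be checked in
/// real code; that is omitted here for brevity.
unsafe fn free_after_default_stream(
    dptr: CUdeviceptr,
    default_stream: CUstream,
    dealloc_stream: CUstream,
) {
    let mut event: CUevent = core::ptr::null_mut();
    cuEventCreate(&mut event, CU_EVENT_DISABLE_TIMING); // 1. create the event
    cuEventRecord(event, default_stream);               // 2. record it on the default stream
    cuStreamWaitEvent(dealloc_stream, event, 0);        // 3. the dealloc stream waits for it
    cuMemFreeAsync(dptr, dealloc_stream);               // 4. free without blocking the default stream
    cuEventDestroy(event); // destruction is deferred by the driver until the event completes
}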

Add curand

Should probably use rand and rand_distributions; the other option is just create/gen_x/destroy bindings.

Reorganize structure

Each of the top-level single-file modules (blas.rs, device.rs, jit.rs, and rng.rs) corresponds to one of the other top-level modules that contain sys.rs/result.rs. To make the structure more consistent and obvious, these should all be moved into the corresponding folders and renamed safe.rs.

This would look like:

  • driver/
    • sys.rs
    • result.rs
    • safe.rs
  • nvrtc/
    • sys.rs
    • result.rs
    • safe.rs
  • curand/
    • sys.rs
    • result.rs
    • safe.rs
  • cublas/
    • sys.rs
    • result.rs
    • safe.rs

NvrtcLoadingError when using precompiled ptx

Sometimes I get this error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NvrtcLoadingError { key: "cuda", cuda: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") }'

Though just recompiling the Rust code (not re-running nvcc) makes it work again.
