coreylowman / cudarc
Safe rust wrapper around CUDA toolkit
License: Apache License 2.0
At the time of calling launch_async, the callee function may have exclusive ownership of the reference (&mut T), but that ownership is released as soon as launch_async returns.
What if the device mutates the mutable variable at the same time as a host function does? Or, similar to #29, what if t is dropped before launch_async returns?
I accidentally passed &mut u8 instead of u8 and got weird errors. The data behind the borrow can't be guaranteed to be valid, and probably can't be accessed either. I'm not sure whether we have to restrict T to Send or Sync (or maybe even something else).
Perhaps use spin instead?
error[E0433]: failed to resolve: could not find `vec` in `std`
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/cudarc-0.4.0/src/device.rs:387:50
|
387 | let mut buf = buf.unwrap_or_else(|| std::vec![Default::default(); src.len]);
| ^^^ could not find `vec` in `std`
struct CudaDevice {
    ...
    ordinal: usize,
}

fn ordinal(&self) -> &usize {
    &self.ordinal
}
Instantiating multiple devices creates multiple streams, and there are tricky synchronization problems between them.
To reduce additional dependencies: tuples where all elements are ValidAsZeroBits are ValidAsZeroBits too.
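A sketch of how those tuple impls could be generated with a small local macro rather than a new dependency (the trait below is a stand-in defined here for illustration, not imported from cudarc):

```rust
// Local stand-in for cudarc's ValidAsZeroBits marker trait.
unsafe trait ValidAsZeroBits {}

unsafe impl ValidAsZeroBits for u8 {}
unsafe impl ValidAsZeroBits for f32 {}

// A tuple is ValidAsZeroBits when every element is.
macro_rules! impl_tuples {
    ($($name:ident),+) => {
        unsafe impl<$($name: ValidAsZeroBits),+> ValidAsZeroBits for ($($name,)+) {}
    };
}

impl_tuples!(A);
impl_tuples!(A, B);
impl_tuples!(A, B, C);

// Compile-time check that the blanket impls exist.
fn assert_zeroable<T: ValidAsZeroBits>() {}

fn main() {
    assert_zeroable::<(u8, f32)>();
    assert_zeroable::<(f32, f32, u8)>();
}
```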
Sometimes the current limit (4 or 5) of parameters is not enough; can we have tuples implement the IntoKernelParam trait? (Maybe IntoKernelParam would also specify the size of a parameter, so a tuple of size 2 would have double the size.)
Thanks for this lib! I was trying it out and managed to cause an illegal address access in the cuda kernel.
Here is an example (not necessarily the most elegant; I took the matmul example from nvidia).
This example works OK when running in debug mode but fails in release mode.
My guess is that the slices get dropped too early after optimization (since all ops are asynchronous, I don't think the compiler knows that it should wait for a device synchronization for a and b).
Hereafter the program and stacktrace:
use cudarc::device::{CudaDeviceBuilder, LaunchConfig, LaunchCudaFunction};
use cudarc::jit::{compile_ptx_with_opts, CompileError, CompileOptions};

fn main() -> Result<(), CompileError> {
    let start = std::time::Instant::now();
    let opts = CompileOptions {
        ftz: Some(true),
        prec_div: Some(false),
        prec_sqrt: Some(false),
        fmad: Some(true),
        ..Default::default()
    };
    let ptx = compile_ptx_with_opts(
        "
__global__ void matmul(float* A, float* B, float* C, int N) {
    int ROW = blockIdx.y*blockDim.y+threadIdx.y;
    int COL = blockIdx.x*blockDim.x+threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        // each thread computes one element of the block sub-matrix
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
    }
    // printf(\"pos, (%d, %d) - N %d - value %d\\n\", ROW, COL, N, tmpSum);
    C[ROW * N + COL] = tmpSum;
}
",
        opts,
    )?;

    use core::ffi::CStr;
    let ptx: &CStr = unsafe { CStr::from_ptr(ptx.image().as_ptr()) };
    let ptx: &str = ptx.to_str().unwrap();
    println!("Compilation succeeded!");

    use std::fs::File;
    use std::io::Write;
    let mut file = File::create("matmul.ptx").unwrap();
    file.write_all(ptx.as_bytes()).unwrap();
    println!("Init in {:?}", start.elapsed());

    let dev = CudaDeviceBuilder::new(0)
        .with_ptx_from_file("_Z6matmulPfS_S_i", "./matmul.ptx", &["_Z6matmulPfS_S_i"])
        .build()
        .unwrap();
    println!("Init in {:?}", start.elapsed());

    // "_Z6matmulPfS_S_i" is the key used with CudaDeviceBuilder
    let module = dev.get_module("_Z6matmulPfS_S_i").unwrap();
    // "_Z6matmulPfS_S_i" is also the name of the actual function inside the .ptx file
    let f = module.get_fn("_Z6matmulPfS_S_i").unwrap();
    println!("Loaded in {:?}", start.elapsed());

    let a_host = [1.0f32, 2.0, 3.0, 4.0];
    let b_host = [1.0f32, 2.0, 3.0, 4.0];
    let mut c_host = [0.0f32; 4];
    let a_dev = dev.sync_copy(&a_host).unwrap();
    let b_dev = dev.sync_copy(&b_host).unwrap();
    let mut c_dev = dev.sync_copy(&c_host).unwrap();

    let cfg = LaunchConfig {
        block_dim: (2, 2, 1),
        grid_dim: (1, 1, 1),
        shared_mem_bytes: 0,
    };
    unsafe { dev.launch_async(f, cfg, (&a_dev, &b_dev, &mut c_dev, 2)) }.unwrap();
    drop(a_dev);
    drop(b_dev);
    dev.sync_copy_from(&c_dev, &mut c_host).unwrap();
    drop(c_dev);
    println!("Found {:?} in {:?}", c_host, start.elapsed());
    Ok(())
}
Stacktrace:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")', /home/nicolas/src/cudarc/src/device.rs:204:82
stack backtrace:
0: 0x55b4bfb5c5fa - std::backtrace_rs::backtrace::libunwind::trace::h196c489280b5a090
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: 0x55b4bfb5c5fa - std::backtrace_rs::backtrace::trace_unsynchronized::h13bf778637c29636
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x55b4bfb5c5fa - std::sys_common::backtrace::_print_fmt::h0f4861fc71bccacd
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:65:5
3: 0x55b4bfb5c5fa - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5f011e3cd484ad42
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:44:22
4: 0x55b4bfb7ad2e - core::fmt::write::h1524a82fa967022b
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/fmt/mod.rs:1208:17
5: 0x55b4bfb5a445 - std::io::Write::write_fmt::ha3d33adc534ae39a
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/io/mod.rs:1682:15
6: 0x55b4bfb5c3c5 - std::sys_common::backtrace::_print::h11f95f719c502811
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:47:5
7: 0x55b4bfb5c3c5 - std::sys_common::backtrace::print::h71adea94b47d00ae
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:34:9
8: 0x55b4bfb5da4f - std::panicking::default_hook::{{closure}}::he466417f162d07e8
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:267:22
9: 0x55b4bfb5d78b - std::panicking::default_hook::hff6ac2c6adb87df1
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:286:9
10: 0x55b4bfb5e15c - std::panicking::rust_panic_with_hook::hcb61d2700034f188
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:688:13
11: 0x55b4bfb5def9 - std::panicking::begin_panic_handler::{{closure}}::h44457d7d7214b4ef
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:579:13
12: 0x55b4bfb5caac - std::sys_common::backtrace::__rust_end_short_backtrace::hf4f789a62590af34
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/sys_common/backtrace.rs:137:18
13: 0x55b4bfb5dc02 - rust_begin_unwind
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:575:5
14: 0x55b4bfb3b193 - core::panicking::panic_fmt::h19d66e5282d7808f
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/panicking.rs:64:14
15: 0x55b4bfb3b5c3 - core::result::unwrap_failed::h5275d1617160e163
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/result.rs:1790:5
16: 0x55b4bfb3bce6 - core::ptr::drop_in_place<cudarc::device::CudaSlice<f32>>::h1e3a20025b0386e7
17: 0x55b4bfb3c7bf - matmul::main::h96270d3cbc9d8793
18: 0x55b4bfb3c823 - std::sys_common::backtrace::__rust_begin_short_backtrace::h77c3c7fd1a851eb0
19: 0x55b4bfb3c843 - std::rt::lang_start::{{closure}}::he524618c01c92a82
20: 0x55b4bfb576ec - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hee44d3bca815d6cd
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/core/src/ops/function.rs:606:13
21: 0x55b4bfb576ec - std::panicking::try::do_call::h8e2913788124065e
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
22: 0x55b4bfb576ec - std::panicking::try::h460040cf489c3878
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
23: 0x55b4bfb576ec - std::panic::catch_unwind::h5e12d45b7905f1f3
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
24: 0x55b4bfb576ec - std::rt::lang_start_internal::{{closure}}::h53e0b25b0eaaa72b
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:48
25: 0x55b4bfb576ec - std::panicking::try::do_call::h84ed13c66e27b8c7
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:483:40
26: 0x55b4bfb576ec - std::panicking::try::h9972483dcd03469d
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panicking.rs:447:19
27: 0x55b4bfb576ec - std::panic::catch_unwind::h56b5ad191546ecc2
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/panic.rs:137:14
28: 0x55b4bfb576ec - std::rt::lang_start_internal::hc154d6be0270f9da
at /rustc/d0dc9efff14ac0a1eeceffd1e605e37eeb8362a0/library/std/src/rt.rs:148:20
29: 0x55b4bfb3c815 - main
30: 0x7f40d043c290 - <unknown>
31: 0x7f40d043c34a - __libc_start_main
32: 0x55b4bfb3b705 - _start
at /build/glibc/src/glibc/csu/../sysdeps/x86_64/start.S:115
33: 0x0 - <unknown>
thread panicked while panicking. aborting.
Aborted (core dumped)
Cheers!
This is a small enough library with different enough pieces where it isn't really needed.
This will enable a corresponding CpuRng to implement them.
Related to #2, curand may not support driver api
Would be similar to slices of Vecs.
Ideal interface would be:
let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);
where offset would panic if there aren't enough elements, otherwise make a copy of the device pointer.
Scenarios to figure out:
let a: CudaSlice<f32> = ...;
let b: CudaSlice<f32> = a.offset(123);
drop(a);
// b should still be valid here, so the underlying device pointer should not be dropped
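One way to make that drop(a) scenario work is to reference-count the underlying allocation, so the device pointer is only freed when the last sub-slice goes away. Here is a mock sketch of the idea with a Vec standing in for device memory (Alloc, Slice, and this offset are illustrative names, not the cudarc API):

```rust
use std::sync::Arc;

// Mock of a device allocation; the Vec plays the role of the CUDA buffer.
struct Alloc(Vec<f32>);

// A slice is (shared allocation, start, len); cloning the Arc means the
// allocation outlives any individual sub-slice.
struct Slice {
    alloc: Arc<Alloc>,
    start: usize,
    len: usize,
}

impl Slice {
    fn offset(&self, n: usize) -> Slice {
        assert!(n <= self.len, "not enough elements");
        Slice {
            alloc: Arc::clone(&self.alloc),
            start: self.start + n,
            len: self.len - n,
        }
    }

    fn get(&self, i: usize) -> f32 {
        assert!(i < self.len);
        self.alloc.0[self.start + i]
    }
}

fn main() {
    let a = Slice { alloc: Arc::new(Alloc(vec![1.0, 2.0, 3.0, 4.0])), start: 0, len: 4 };
    let b = a.offset(2);
    drop(a);
    // b keeps the allocation alive even after a is dropped.
    assert_eq!(b.get(0), 3.0);
}
```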
Sometimes you only need a specific part of a CudaSlice, and copying from a CudaView would be very useful. (And why can you only slice with a RangeFrom?)
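Supporting arbitrary ranges rather than just RangeFrom mostly comes down to normalizing any RangeBounds into a (start, len) pair before building the view. A sketch of that normalization (the function name is illustrative):

```rust
use std::ops::{Bound, RangeBounds};

// Normalize any range type (a..b, a.., ..=b, .., etc.) into (start, len)
// over a buffer of `len` elements, panicking on out-of-bounds ranges.
fn bounds_to_offsets(range: impl RangeBounds<usize>, len: usize) -> (usize, usize) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s + 1,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e + 1,
        Bound::Excluded(&e) => e,
        Bound::Unbounded => len,
    };
    assert!(start <= end && end <= len, "range out of bounds");
    (start, end - start)
}

fn main() {
    assert_eq!(bounds_to_offsets(1..3, 5), (1, 2));
    assert_eq!(bounds_to_offsets(2.., 5), (2, 3));
    assert_eq!(bounds_to_offsets(..=3, 5), (0, 4));
    assert_eq!(bounds_to_offsets(.., 5), (0, 5));
}
```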
can this function be public?
Line 388 in 2785abb
Sometimes I know that I'll overwrite/initialize the memory, but memsetting 2-3 GB beforehand is rather a waste of time.
Should take a Vec and overwrite self.host_buf.
Currently CudaDevice only supports a single stream. Look into how multiple streams should be supported.
Cuda does not raise any errors if you access memory past the bytes you allocated. The safe functions account for this, but unit tests should be added to ensure it.
This would be a good justification for ordinal as a const generic on both CudaSlice/CudaDevice.
Originally posted by @M1ngXU in #6 (comment)
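With the ordinal as a const generic, mixing slices from different devices would become a compile error rather than a runtime one. A stub sketch of the idea (these are mock types written here for illustration, not the real cudarc ones):

```rust
use std::marker::PhantomData;

// Mock device/slice types that carry the device ordinal in the type.
struct CudaDevice<const ORDINAL: usize>;
struct CudaSlice<T, const ORDINAL: usize>(PhantomData<T>);

impl<const ORDINAL: usize> CudaDevice<ORDINAL> {
    fn alloc_f32(&self) -> CudaSlice<f32, ORDINAL> {
        CudaSlice(PhantomData)
    }

    // Only accepts slices allocated on this same device ordinal.
    fn launch(&self, _a: &CudaSlice<f32, ORDINAL>) {}
}

fn main() {
    let dev0 = CudaDevice::<0>;
    let a = dev0.alloc_f32();
    dev0.launch(&a); // ok: same ordinal
    // let dev1 = CudaDevice::<1>;
    // dev1.launch(&a); // would not compile: ordinal 0 != ordinal 1
}
```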
unsafe impl Send for CudaDevice {}
unsafe impl Sync for CudaDevice {}
is all that is needed.
It doesn't work out of the box because CudaDevice has raw pointers from the sys api.
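A reduced sketch of why the unsafe impls are needed and what they unlock: the raw pointer field below mimics the sys handles (raw pointers are !Send/!Sync by default), and once the impls are asserted the device can be shared across threads with a plain Arc. The thread-safety claim is an assumption that has to be justified against the CUDA driver API's documentation.

```rust
use std::ffi::c_void;
use std::sync::Arc;
use std::thread;

// Mimics CudaDevice holding raw sys handles.
struct CudaDevice {
    cu_device: *mut c_void,
    ordinal: usize,
}

// SAFETY (assumption): the CUDA driver API is thread-safe, so sharing the
// handles across threads is sound.
unsafe impl Send for CudaDevice {}
unsafe impl Sync for CudaDevice {}

fn main() {
    let dev = Arc::new(CudaDevice { cu_device: std::ptr::null_mut(), ordinal: 0 });
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let dev = Arc::clone(&dev);
            // Without the unsafe impls above, this spawn would not compile.
            thread::spawn(move || dev.ordinal)
        })
        .collect();
    let sum: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(sum, 0);
}
```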
Can we fix some clippy issues (found some in nvrtc)? Maybe add some github workflow like there is in dfdx?
Apparently this is all expected according to their documentation. This crate should find a way around this shortcoming of their host api.
Equivalent to Vec<T> on the cpu. All APIs should accept this instead.
Notably this crate should not worry about compile time safety, and instead be as flexible as possible.
A function that also accepts a vector would be great, since the CudaSlice sizes might be dynamic and it otherwise requires more boilerplate code to actually create a correctly sized slice.
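One way to accept arrays, slices, and Vecs through a single entry point is a generic AsRef<[T]> bound. A sketch of the signature shape (this sync_copy is a CPU stand-in that returns the data it "uploaded" so the behavior is observable; it is not the real cudarc function):

```rust
// Stand-in for a host-to-device copy: accepts anything viewable as a slice.
fn sync_copy<T: Copy, S: AsRef<[T]>>(src: S) -> Vec<T> {
    src.as_ref().to_vec()
}

fn main() {
    // The same function works for a fixed-size array and a dynamically
    // sized Vec, with no extra boilerplate at the call site.
    assert_eq!(sync_copy([1.0f32, 2.0]), vec![1.0, 2.0]);
    let dynamic: Vec<f32> = (0..3).map(|i| i as f32).collect();
    assert_eq!(sync_copy(dynamic), vec![0.0, 1.0, 2.0]);
}
```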
Copying to device depends on these values staying in the same memory location.
There is potential undefined behavior if any of the params for cuLaunchKernel are dropped before the kernel actually executes.
A simple case is just a local parameter variable, which gets turned into a kernel parameter by reference. That means the pointer is into the local stack frame, maybe?
What would need to happen to fix this is ensuring that these values are dropped only after the kernel executes. One potential way to do this is to use cuLaunchHostFunc, where the host function does nothing but has ownership of all the params. Pin also seems very useful for this case.
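The ownership idea can be mocked on the CPU: the "kernel" runs on another thread and holds Arc clones of every param, so the caller can drop its own handles immediately without invalidating anything the kernel reads. MockSlice and this launch_async are illustrative, not the cudarc API:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for CudaSlice; the Vec plays the role of device memory.
struct MockSlice(Vec<f32>);

// The "launch" takes shared ownership of every param, so the buffers
// cannot be freed until the "kernel" thread has finished with them.
fn launch_async(params: Vec<Arc<MockSlice>>) -> thread::JoinHandle<f32> {
    thread::spawn(move || params.iter().map(|p| p.0.iter().sum::<f32>()).sum())
}

fn main() {
    let a = Arc::new(MockSlice(vec![1.0, 2.0]));
    let b = Arc::new(MockSlice(vec![3.0, 4.0]));
    let pending = launch_async(vec![Arc::clone(&a), Arc::clone(&b)]);
    // Dropping the caller's handles is now fine: the launch owns clones.
    drop(a);
    drop(b);
    assert_eq!(pending.join().unwrap(), 10.0);
}
```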
Would look like this:
#[derive(Debug)]
pub struct CudaBytes {
    pub(crate) cu_device_ptr: sys::CUdeviceptr,
    pub(crate) num_bytes: usize,
    pub(crate) device: ...
}
And this can be unsafely transformed into a CudaSlice.
Currently all cuda modules must be specified in the device builder for CudaDevice. This means you can't add any cuda modules after you've built the CudaDevice object.
It would be nice to be able to add a new CudaModule that was compiled on the fly during forward/backward pass, but that would require mutating the CudaDevice even though it's passed by reference.
unclear how this interacts with #10
cuBLAS only has matmuls and vec-mat muls, so no convs unless an im2col is also added. cuDNN definitely has convs and a couple of other nice things, but upon first look it's unclear how matmuls work there.
Currently all operations, including de-allocations, happen on the default stream. In dfdx, after a long forward pass with many operations (e.g. 100 operations, each producing 1+ gradient), all gradients are captured in a Gradients object. After the forward pass is done, the gradients object is dropped, which means ALL temporary gradients are de-allocated at once.
This blocks the default stream at the moment, so all de-allocations occur before any other work can complete.
Instead, we should put de-allocations on a second stream that is synchronized with the default stream with events:
This should free up the default stream to continue working
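The stream/event dance can be mocked with threads and a channel: the send plays the role of cuEventRecord on the default stream, and the recv plays cuStreamWaitEvent on the dealloc stream, so frees never run before the work they depend on. This is an analogy, not driver API code:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (event_tx, event_rx) = mpsc::channel::<()>();
    let (log_tx, log_rx) = mpsc::channel::<&'static str>();

    // "Dealloc stream": waits on the event before freeing anything.
    let log = log_tx.clone();
    let dealloc = thread::spawn(move || {
        event_rx.recv().unwrap(); // cuStreamWaitEvent analogue
        log.send("dealloc").unwrap();
    });

    // "Default stream": does the work, then records the event.
    log_tx.send("kernel work").unwrap();
    event_tx.send(()).unwrap(); // cuEventRecord analogue

    dealloc.join().unwrap();
    // The free is ordered strictly after the work it depends on.
    let order: Vec<_> = log_rx.try_iter().collect();
    assert_eq!(order, ["kernel work", "dealloc"]);
}
```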
Should probably use rand and rand_distributions; the other option is just create/gen_x/destroy.
To reduce dependencies
Each of the top level single file modules (blas.rs, device.rs, jit.rs, and rng.rs) corresponds to one of the other top level modules that contain the sys.rs/result.rs. To make this more consistent and make the structure more obvious, these should all be moved into the corresponding folders and renamed safe.rs.
This would look like:
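Based on the description above, the resulting layout would presumably be something like the following (the folder names are assumed to be the existing sys/result module folders):

```text
src/
  cublas/            // assumed folder name
    mod.rs
    result.rs
    sys.rs
    safe.rs          // was src/blas.rs
  driver/
    mod.rs
    result.rs
    sys.rs
    safe.rs          // was src/device.rs
  nvrtc/
    ...
    safe.rs          // was src/jit.rs
  curand/
    ...
    safe.rs          // was src/rng.rs
```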
Sometimes I get this error:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NvrtcLoadingError { key: "cuda", cuda: DriverError(CUDA_ERROR_INVALID_PTX, "a PTX JIT compilation failed") }'
though just recompiling the rust (not nvcc) fixes it.
Currently not accurate to what it actually does