Comments (8)
Ah, so even setindex! can trigger the GC... That would explain this deadlock, which is a separate issue #685, but not the CURAND failures. I'll have a look at fixing the former, for which this backtrace is very helpful.
from cuarrays.jl.
Could you add a call to CUDAdrv.synchronize() before the failing CURAND API call, e.g. in curandGenerateSeeds, to see if we can capture that preexisting failure?
from cuarrays.jl.
I tried both
@checked function curandGenerateSeeds(generator)
    initialize_api()
    CUDAdrv.synchronize()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end
and
@checked function curandGenerateSeeds(generator)
    CUDAdrv.synchronize()
    initialize_api()
    @runtime_ccall((:curandGenerateSeeds, libcurand()), curandStatus_t,
                   (curandGenerator_t,),
                   generator)
end
and I still get the same error / stack trace, although anecdotally it seems like it takes a little longer to trigger (might just be in my head). Is that what you meant?
from cuarrays.jl.
Yes, but sadly it doesn't catch anything. I wonder why CURAND thinks there's a preexisting failure then.
Bisecting would be useful. Due to the coupling between CuArrays/CUDAnative/GPUArrays you'll probably have to use the Manifest that's part of CuArrays (only a few commits don't work; you can bisect skip those).
from cuarrays.jl.
Ok, bisected it to this being the first bad commit: 65a35b1
I checked a couple of times and I'm pretty sure this is it.
I'm using the Manifest like you suggested, so the breakdown is:
- bad: CuArrays 65a35b1, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v3.0.0
- good: CuArrays 138ece7, CUDAapi v4.0.0, CUDAdrv v6.2.0, CUDAnative v2.10.2 #58c6755
I notice that whenever I switch between these two commits I get
Building the CUDAnative run-time library for your sm_70 device, this might take a while...
which may be relevant.
from cuarrays.jl.
Hmm, that doesn't help much. Are you using multiple threads or tasks?
from cuarrays.jl.
My code is single threaded, and can run in a one-MPI-process-per-GPU configuration. I mentioned above that it sometimes hangs instead of giving me the CURAND_STATUS_PREEXISTING_FAILURE error, but based on your comment and that bisect I ran my code with a single MPI process, and it looks like in this case it's just always hanging. Maybe the CURAND_STATUS_PREEXISTING_FAILURE is a red herring / side effect of the real issue?
With a single process, I reproduced the hang about 5 times (with the "bad" versions from above); each time I get this identical stack trace if I just kill it:
free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:393
unknown function (ip: 0x2aac1ff19ad2)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:218 [inlined]
macro expansion at ./util.jl:234 [inlined]
free at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:217 [inlined]
_unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:51
unsafe_free! at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:40
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
run_finalizer at /global/u1/m/marius/src/julia-1.4/src/gc.c:277
jl_gc_run_finalizers_in_list at /global/u1/m/marius/src/julia-1.4/src/gc.c:363
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:391 [inlined]
run_finalizers at /global/u1/m/marius/src/julia-1.4/src/gc.c:370
jl_gc_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:3124
maybe_collect at /global/u1/m/marius/src/julia-1.4/src/gc.c:827 [inlined]
jl_gc_pool_alloc at /global/u1/m/marius/src/julia-1.4/src/gc.c:1142
jl_gc_alloc_ at /global/u1/m/marius/src/julia-1.4/src/julia_internal.h:246 [inlined]
_new_array_ at /global/u1/m/marius/src/julia-1.4/src/array.c:106 [inlined]
_new_array at /global/u1/m/marius/src/julia-1.4/src/array.c:162 [inlined]
jl_alloc_array_1d at /global/u1/m/marius/src/julia-1.4/src/array.c:433
Array at ./boot.jl:405 [inlined]
rehash! at ./dict.jl:193
_setindex! at ./dict.jl:367 [inlined]
setindex! at ./dict.jl:388
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:384 [inlined]
macro expansion at ./lock.jl:183 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory/binned.jl:383
unknown function (ip: 0x2aac1fe7fcb5)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:180 [inlined]
macro expansion at ./util.jl:234 [inlined]
alloc at /global/u1/m/marius/work/s4/dev/CuArrays/src/memory.jl:179 [inlined]
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:107
CuArray at /global/u1/m/marius/work/s4/dev/CuArrays/src/array.jl:115 [inlined]
similar at ./abstractarray.jl:671 [inlined]
similar at ./abstractarray.jl:670 [inlined]
similar at /global/u1/m/marius/work/s4/dev/CuArrays/src/broadcast.jl:11 [inlined]
copy at ./broadcast.jl:840
materialize at ./broadcast.jl:820
copy at /global/homes/m/marius/.julia/packages/GPUArrays/QDGmr/src/host/abstractarray.jl:173 [inlined]
unsafe_execute! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:412 [inlined]
mul! at /global/u1/m/marius/work/s4/dev/CuArrays/src/fft/fft.jl:449 [inlined]
Fourier at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flat_s0.jl:74 [inlined]
Basislike at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:56
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
Ð! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/generic.jl:62
v! at /global/homes/m/marius/work/s4/dev/CMBLensing/src/lenseflow.jl:145
unknown function (ip: 0x2aac5fcebc0e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
RK4Solver at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:25
unknown function (ip: 0x2aac5fce911e)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
odesolve at /global/homes/m/marius/work/s4/dev/CMBLensing/src/numerical_algorithms.jl:53
unknown function (ip: 0x2aac5fce85aa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
back at /global/homes/m/marius/work/s4/dev/CMBLensing/src/flowops.jl:40
#187#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:59 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#175 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/lib/lib.jl:170 [inlined]
#344#back at /global/homes/m/marius/.julia/packages/ZygoteRules/6nssF/src/adjoint.jl:49 [inlined]
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:70 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
lnP at /global/homes/m/marius/work/s4/dev/CMBLensing/src/posterior.jl:53 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
unknown function (ip: 0x2aac5fcdc6e3)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#460 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:286 [inlined]
Pullback at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface2.jl:0
#36 at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:36
unknown function (ip: 0x2aac5fcdb043)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
gradient at /global/homes/m/marius/.julia/packages/Zygote/4tJp5/src/compiler/interface.jl:45
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_apply at /global/u1/m/marius/src/julia-1.4/src/builtins.c:643
jl_f__apply_latest at /global/u1/m/marius/src/julia-1.4/src/builtins.c:693
#invokelatest#1 at ./essentials.jl:712 [inlined]
invokelatest at ./essentials.jl:711 [inlined]
#419#420 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#419 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/util.jl:272 [inlined]
#418 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:14
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:25 [inlined]
macro expansion at /global/homes/m/marius/.julia/packages/ProgressMeter/g1lse/src/ProgressMeter.jl:717 [inlined]
#symplectic_integrate#414 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:23
unknown function (ip: 0x2aac5fc69baa)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69399)
symplectic_integrate##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:17
unknown function (ip: 0x2aac5fc69195)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
macro expansion at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:284 [inlined]
macro expansion at ./util.jl:234 [inlined]
#458 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:277
iterate at ./generator.jl:47 [inlined]
_collect at ./array.jl:678
collect_similar at ./array.jl:607
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
map at ./abstractarray.jl:2072
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2144 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
#sample_joint#449 at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:249
unknown function (ip: 0x2aac5c096c3f)
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
sample_joint##kw at /global/homes/m/marius/work/s4/dev/CMBLensing/src/sampling.jl:176
_jl_invoke at /global/u1/m/marius/src/julia-1.4/src/gf.c:2158 [inlined]
jl_apply_generic at /global/u1/m/marius/src/julia-1.4/src/gf.c:2322
jl_apply at /global/u1/m/marius/src/julia-1.4/src/julia.h:1692 [inlined]
do_call at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:369
eval_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:458
eval_stmt_value at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:409 [inlined]
eval_body at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:803
jl_interpret_toplevel_thunk at /global/u1/m/marius/src/julia-1.4/src/interpreter.c:911
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:814
jl_toplevel_eval_flex at /global/u1/m/marius/src/julia-1.4/src/toplevel.c:764
slurmstepd: error: *** STEP 550402.8 ON cgpu03 CANCELLED AT 2020-04-16T14:18:58 ***
srun: Terminating job step 550402.8
from cuarrays.jl.
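The frames in the trace above read bottom-up as: alloc (binned.jl:383) runs inside a lock (lock.jl:183), setindex! on a Dict rehashes and allocates, the allocation triggers a GC, the GC runs a finalizer, and the finalizer ends up in free (binned.jl:393), where the process appears to hang. A minimal sketch of one way to keep a finalizer from blocking on a lock that the interrupted code already holds is shown below. It follows the "delay the finalizer" strategy described in the Julia manual. Buffer, pool_lock, allocated, alloc!, and release! are hypothetical names for illustration only, not CuArrays' API, and this is not necessarily how the issue was fixed in CuArrays.

```julia
# Hypothetical stand-ins; none of these names come from CuArrays.
const pool_lock = ReentrantLock()
const allocated = Dict{UInt,Int}()        # buffer id => size in bytes

mutable struct Buffer                     # mutable, so a finalizer can be attached
    id::UInt
    freed::Bool
end

# Release path that is safe to use as a finalizer: it never waits on the lock.
function release!(buf::Buffer)
    buf.freed && return nothing           # already released
    if islocked(pool_lock) || !trylock(pool_lock)
        # The lock is held (possibly by the very code this finalizer
        # interrupted), so re-register the finalizer and try again later.
        finalizer(release!, buf)
        return nothing
    end
    try
        delete!(allocated, buf.id)
        buf.freed = true
    finally
        unlock(pool_lock)
    end
    return nothing
end

function alloc!(id::UInt, bytes::Int)
    buf = Buffer(id, false)
    lock(pool_lock) do
        # setindex! may rehash the Dict, allocate, and hit a GC point here.
        allocated[id] = bytes
    end
    finalizer(release!, buf)
    return buf
end

# Simulate a finalizer firing while the lock is held, as in the trace.
buf = alloc!(UInt(1), 1024)
lock(pool_lock)
release!(buf)                             # defers instead of blocking
freed_while_locked = buf.freed
unlock(pool_lock)
release!(buf)                             # lock is free now, so this releases
```

The islocked check comes before trylock because a ReentrantLock can be re-acquired by the task that already holds it, which would let the finalizer modify the Dict in the middle of a rehash.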
This appears to be fixed for me on 2.2.0, presumably by the referenced issue above. Guessing the CURAND thing was just a random side-effect.
from cuarrays.jl.