Comments (6)
How does training halt? You need to be getting some kind of error or reason it halts, right?
Can you profile the code and see where the time goes? A typical problem is memory: GPUs have much less RAM, which doesn't compose well with Julia's GC when running close to the memory limit. Yours has 11GB though, so unless your model is huge that should generally work fine.
Also, which version of Julia?
from cuarrays.jl.
How does training halt? You need to be getting some kind of error or reason it halts, right?
The println statement (second-to-last line) stops being printed. I have tried giving it about an hour, with no updates.
Can you profile the code and see where the time goes? A typical problem is memory: GPUs have much less RAM, which doesn't compose well with Julia's GC when running close to the memory limit. Yours has 11GB though, so unless your model is huge that should generally work fine.
When I look at CuArrays.memory_status(), the usage is very high (99.97%), which I found very weird. I have a similar model written in PyTorch that takes only about 500 MB.
Also, which version of Julia?
I have tested this on Julia 1.3.1 and Julia 1.4.1 and both have this problem. I also wonder if this is related to #350, but I have tried taking out sqrt and still have the same issue.
Update: if I call GC.gc() at the end of each epoch, the problem goes away.
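For readers who want to try the same workaround, here is a minimal sketch of where the GC.gc() call goes. It runs on the CPU only: train_epoch! and train! are hypothetical stand-ins (not from the original script) for a Flux training step that allocates temporary arrays, and the only part taken from this thread is the GC.gc() call at the end of each epoch.

```julia
# Hypothetical stand-in for one epoch of training. In the real script this
# would be the Flux training step that allocates temporary CuArrays.
function train_epoch!(state)
    tmp = rand(Float32, 256, 256)        # temporary allocation, garbage afterwards
    state[] += sum(tmp) / length(tmp)    # pretend this is the running loss
    return state[]
end

# Hypothetical training driver showing where the workaround is placed.
function train!(state; epochs = 3)
    for epoch in 1:epochs
        loss = train_epoch!(state)
        println("epoch $epoch: $loss")
        # Workaround from this thread: force a full collection once per epoch
        # so unreachable arrays are finalized before the next epoch starts.
        GC.gc()
    end
    return state[]
end

state = Ref(0.0f0)
train!(state)
```

Forcing a collection costs time on every epoch, so this is a workaround rather than a fix for the underlying issue.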
Update: if I call GC.gc() at the end of each epoch, the problem goes away.
Ah, so that problem again. I thought training exited, but it hangs, which is consistent with the GC taking up all the time. This is a tough problem, but it's good to have another (small-ish) reproducer.
You can also try using the new, WIP, memory pool: JULIA_CUDA_MEMORY_POOL=split. It improves performance in some workloads, but ultimately still falls back on the Julia GC, so it might have the same problem with your use case.
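A minimal sketch of setting that variable from within Julia rather than in the shell. The variable name and value come from the comment above; that it has to be set before the GPU package is loaded is an assumption, so the safest place is the very top of the script.

```julia
# Assumption: the pool selection is read when the GPU package is loaded,
# so the variable is set before any `using CuArrays` line.
ENV["JULIA_CUDA_MEMORY_POOL"] = "split"

# using CuArrays   # load the GPU package only after the variable is set

println("memory pool setting: ", ENV["JULIA_CUDA_MEMORY_POOL"])
```

Setting the variable in the shell before starting Julia achieves the same thing and avoids any ordering concern.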
Hey @maleadt, I work with @lpjiang97 and spent a bit of time looking into this. For what it's worth, I was able to replicate this on my machine (information below) in 3 of 5 attempts, each in a fresh Julia session. When it does lock up, the stacktrace shows that it's waiting on the lock in either alloc or free:
Stacktrace:
[1] top-level scope at /home/colinxs/workspace/dev/Experiments/flux/foo/debug0.jl:29
[2] lock(::Base.Threads.SpinLock) at ./locks-mt.jl:71
[3] macro expansion at ./lock.jl:181 [inlined]
[4] free(::CUDAdrv.CuPtr{Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory/binned.jl:393
[5] macro expansion at /home/colinxs/.julia/packages/TimerOutputs/NvIUx/src/TimerOutput.jl:245 [inlined]
[6] macro expansion at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:231 [inlined]
[7] macro expansion at ./util.jl:234 [inlined]
[8] free at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/memory.jl:230 [inlined]
[9] _unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:51
[10] unsafe_free!(::CuArray{Float32,2,Nothing}) at /home/colinxs/.julia/packages/CuArrays/4Q1BY/src/array.jl:40
Single GPU (1050)
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
JULIA_DOWNLOAD = /home/colinxs/pkg/installed/julia
JULIA_NUM_THREADS = 6
JULIA_PKG_DEVDIR = /home/colinxs/workspace/juliadev
julia> Pkg.status()
Status `~/workspace/dev/Experiments/flux/foo/Project.toml`
[3895d2a7] CUDAapi v4.0.0
[be33ccc6] CUDAnative v3.0.4
[3a865a2d] CuArrays v2.1.0
[587475ba] Flux v0.10.4
I should've checked the open issues first; it appears you're already well aware of this: #685
Correct, I suspected a performance issue but the backtrace is useful in identifying the actual issue. I'll have a look at the deadlock, since a couple of users have been running into this.