juliagpu / CuArrays.jl
A Curious Cumulation of CUDA Cuisine
Home Page: https://juliagpu.org/cuda/
License: Other
CuArrays has `accumulate!`, but it's limited: it does not support anything but vectors, does not support the `init` keyword, and is slow (it should use the shmem/shfl optimizations from https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/scan.jl).
Old post:
@dpsanders and I just ran into a situation where we wanted to do a `cumsum` on a CuArray. CUDAnative has it as an example, but we should probably add the functionality to CuArrays: https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/scan.jl
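For reference, the usage we'd want to support is a device-side scan; a minimal sketch (the `cumsum` method on CuArray is the hypothetical addition, and the host round-trip is the current workaround):

```julia
using CuArrays

xs = cu(rand(Float32, 1024))

# Desired: cumsum(xs) running on the device, ideally built on the
# shmem/shfl scan from the CUDAnative example linked above.

# Current workaround: scan on the host and copy back.
ys = CuArray(cumsum(collect(xs)))
```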
This is a huge barrier to adoption. Will building Julia from source always be a requirement for this package?
CuArrays downgrades CUDAapi to 0.2.1, CUDAdrv to 0.6.1, and CUDAnative to 0.5.3, just like CUDNN does, which causes problems with the tests in CUDAnative. See JuliaGPU/CUDAnative.jl#144 and JuliaAttic/CUDNN.jl#13.
If I define a module like this:
module M
using CuArrays
# using CUDAnative
function cross_entropy_loss(ŷ::AbstractMatrix, y::AbstractMatrix)
sublosses = -sum(y .* ŷ, 1) .+ CUDAnative.log.(sum(exp.(ŷ), 1))
return mean(sublosses)
end
end
and then invoke it from the REPL or another file like this:
using CuArrays
function main()
y = CuArray(rand(Float32, 3, 4))
ŷ = CuArray(rand(Float32, 3, 4))
M.cross_entropy_loss(ŷ, y)
end
I'm getting:
ERROR: Broadcast output type Any is not concrete
Stacktrace:
[1] broadcast_t at /home/dfdx/.julia/v0.6/CuArrays/src/broadcast.jl:34 [inlined]
[2] broadcast_c at /home/dfdx/.julia/v0.6/CuArrays/src/broadcast.jl:63 [inlined]
[3] broadcast at ./broadcast.jl:434 [inlined]
[4] cross_entropy_loss(::CuArray{Float32,2}, ::CuArray{Float32,2}) at /home/dfdx/Downloads/cross_entropy_test.jl:7
[5] main() at /home/dfdx/Downloads/cross_entropy_test.jl:18
`@code_warntype` shows that the function indeed returns `Any`, but the generated code is too complex for me to infer further details.
If we uncomment `using CUDAnative` inside the module, the error disappears. It also disappears if I define `cross_entropy_loss` in the same module as the calling code, or if I simplify the function. So I won't be surprised if it's not reproducible even under slightly different conditions.
I'm using Julia 0.6.0 and the latest master of both CuArrays (67444add60fee8cfcd346ab7c97dd64bfae4b1ba) and CUDAnative (997142a0281034e96d53132da631d01e6646ca6b).
I'm having some problems with `A_mul_Bc!`. For example:
julia> a0 = cu(rand(1,1) + im*rand(1,1));
julia> ar = cu(rand(1,1) + im*rand(1,1));
julia> en = cu(rand(1,1) + im*rand(1,1));
julia> A_mul_Bc!(en, ar, a0)
ERROR: ReadOnlyMemoryError()
For larger matrix sizes, Julia segfaults in some cases. At the same time, other `A_mul_B`-style variants seem to work fine.
I get the following error during precompile on an 8 GPU machine where devices 0 and 3 are fully occupied:
ERROR: LoadError: InitError: CUDA error: out of memory (code #2, ERROR_OUT_OF_MEMORY)
Stacktrace:
[1] macro expansion at /dev/shm/dyuret/.julia/v0.6/CUDAdrv/src/base.jl:148 [inlined]
[2] CUDAdrv.CuContext(::CUDAdrv.CuDevice, ::CUDAdrv.CUctx_flags) at /dev/shm/dyuret/.julia/v0.6/CUDAdrv/src/context.jl:118
[3] __init__() at /dev/shm/dyuret/.julia/v0.6/CUDAnative/src/CUDAnative.jl:67
Knet always finds the device with the greatest amount of available memory and initializes there by default. Is there a way to do this with CuArrays, either manually or automatically?
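A manual sketch of that policy, assuming the CUDAdrv API of the time (`devices()`, and `Mem.info()` reporting free and total bytes for the active context; exact names may differ between versions):

```julia
using CUDAdrv

# Pick the CUDA device with the most free memory by briefly creating a
# context on each device and querying it. Hypothetical helper, not a
# CuArrays API.
function freest_device()
    best, bestfree = nothing, 0
    for dev in CUDAdrv.devices()
        ctx = CuContext(dev)
        free, total = CUDAdrv.Mem.info()
        destroy!(ctx)
        if free > bestfree
            best, bestfree = dev, free
        end
    end
    return best
end
```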
Hi all,
I just installed CuArrays on top of CUDAdrv. I found that `@everywhere using CuArrays` automatically creates GPU contexts, one per worker, all on device 0. See an example using 5 workers below.
GPU_ID %GPU GPU_MEM PID
0 0 196.8MiB 169014
0 0 197.8MiB 169021
0 0 196.8MiB 169023
0 0 196.8MiB 169027
0 0 196.8MiB 169025
1 0 0
Do I have control over which device a worker initiates its context on? I think it would be nice if the context could be created in a more controllable way, for example only when calling `CUDAdrv.CuContext(CUDAdrv.CuDevice(devInt))`. Ideally, the following code would create a context only when `CUDAdrv.CuContext` is explicitly called:
manager = MPIManager(np=4)
cpus = addprocs(manager)
gpus = [0,0,1,1]
@everywhere using CuArrays
for worker in cpus
@spawnat worker CUDAdrv.CuContext(CUDAdrv.CuDevice(gpus[worker-1]))
end
@parallel (+) for i in 1:4 (xl=cu(rand(10^4,10^4));xr=cu(rand(10^4,10^4)); x=xl*xr;collect(x)) end
Cheers
Yue
CuArrays does not cover deconvolution (transposed convolution) yet, which is essential for matrix upsampling tasks.
Yet another BLAS tolerance test failure:
elty = Float32: Test Failed
Expression: ≈(C[:L], dL, rtol=0.01)
Stacktrace:
[1] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1173 [inlined]
[2] macro expansion at ./test.jl:921 [inlined]
[3] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1150 [inlined]
[4] macro expansion at ./test.jl:860 [inlined]
[5] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1149 [inlined]
[6] macro expansion at ./test.jl:860 [inlined]
[7] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:37 [inlined]
[8] macro expansion at ./test.jl:860 [inlined]
[9] anonymous at ./<missing>:?
elty = Float32: Test Failed
Expression: ≈(C[:U], dU, rtol=0.01)
Stacktrace:
[1] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1174 [inlined]
[2] macro expansion at ./test.jl:921 [inlined]
[3] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1150 [inlined]
[4] macro expansion at ./test.jl:860 [inlined]
[5] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:1149 [inlined]
[6] macro expansion at ./test.jl:860 [inlined]
[7] macro expansion at /var/lib/buildbot/workers/julia/CuArrays-julia06-x86-64bit/packages/v0.6/CuArrays/test/blas.jl:37 [inlined]
[8] macro expansion at ./test.jl:860 [inlined]
[9] anonymous at ./<missing>:?
However, bumping the `rtol` beyond 1% seems awfully high. Is something else going on?
With your comments on the blogpost, I tried to use CuArrays with ForwardDiff but it seems like I've hit a wall with the broadcast mechanisms:
using CuArrays
import CUDAnative
import CUDAdrv: synchronize
# seems missing in CuArrays?
Base.Broadcast.promote_containertype(::Type{CuArray}, ::Type{CuArray}) = CuArray
Base.Broadcast.promote_containertype(::Type{CuArray}, ct) = CuArray
Base.Broadcast.promote_containertype(ct, ::Type{CuArray}) = CuArray
# I couldn't get CuArrays to work with Base intrinsics, shouldn't the cufunc hack solve that?
# HACK: @define_diffrule cannot handle CUDAnative.x
@inline cuda_log10(x) = CUDAnative.log10(x)
@inline cuda_erf(x) = CUDAnative.erf(x)
@inline cuda_sqrt(x) = CUDAnative.sqrt(x)
@inline cuda_exp(x) = CUDAnative.exp(x)
# HACK: diff rules for CUDAnative intrinsics
import DiffBase: @define_diffrule, DiffRule # HACK: @define_diffrule wrongly escapes
@define_diffrule cuda_log10(x) = :( inv($x) / CUDAnative.log(10) )
@define_diffrule cuda_erf(x) = :( (2 / CUDAnative.sqrt(π)) * CUDAnative.exp(-$x * $x) )
@define_diffrule cuda_sqrt(x) = :( inv(2 * CUDAnative.sqrt($x)) )
@define_diffrule cuda_exp(x) = :( CUDAnative.exp($x) )
@inline cndf2(in::AbstractArray{T}) where {T<:Real} = T(0.5) .+ T(0.5) .* cuda_erf.(T(0.707106781) .* in)
function blackscholes(sptprice::AbstractArray{<:Real}, strike::AbstractArray{<:Real},
rate::AbstractArray{<:Real}, volatility::AbstractArray{<:Real},
time::AbstractArray{<:Real})
logterm = cuda_log10.(sptprice ./ strike)
powterm = eltype(volatility)(.5) .* volatility .* volatility
den = volatility .* cuda_sqrt.(time)
d1 = (((rate .+ powterm) .* time) .+ logterm) ./ den
d2 = d1 .- den
NofXd1 = cndf2(d1)
NofXd2 = cndf2(d2)
futureValue = strike .* cuda_exp.(- rate .* time)
c1 = futureValue .* NofXd2
call = sptprice .* NofXd1 .- c1
return call .- futureValue .+ sptprice
end
iterations = 10#^7
sptprice = Float32[ 42.0 for i = 1:iterations ]
strike = Float32[ 40.0 + (i / iterations) for i = 1:iterations ]
rate = Float32[ 0.5 for i = 1:iterations ]
volatility = Float32[ 0.2 for i = 1:iterations ]
time = Float32[ 0.5 for i = 1:iterations ]
sptprice_dev = CuArray(sptprice)
strike_dev = CuArray(strike)
rate_dev = CuArray(rate)
volatility_dev = CuArray(volatility)
time_dev = CuArray(time)
out = zeros(sptprice)
using ForwardDiff
blackscholes_time(time) = blackscholes(sptprice_dev, strike_dev, rate_dev, volatility_dev, time)
g = time -> ForwardDiff.gradient(blackscholes_time, time)
@show g(time_dev)
This doesn't work because CuArrays' broadcast refuses `Any`:
f = #5
A = CuArray(Float32[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2])
Bs = (CuArray(ForwardDiff.Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65},Float32,10}[Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0), Dual{ForwardDiff.Tag{#blackscholes_time,0xd4f33f37fa3a8c65}}(0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0)]),)
T = _broadcast_eltype(f, A, Bs...) = Any
ERROR: LoadError: Broadcast output type Any is not concrete
When I was running a Flux model, I encountered an error about there not being a canonical binary representation of some data I had called `gpu` on, which caused the program to abort. That seemed fine and dandy because I could fix the error and re-run, but now any time I try to import CuArrays with `using CuArrays` (e.g., with `julia -e "using CuArrays"` from Bash), I get the following error message:
*** Error in `/home/maetshju/julia-0.6.2/julia': double free or corruption (!prev): 0x0000560346920010 ***
signal (6): Aborted
while loading no file, in expression starting on line 0
__libc_signal_restore_set at /build/glibc-itYbWN/glibc-2.26/signal/../sysdeps/unix/sysv/linux/nptl-signals.h:80 [inlined]
raise at /build/glibc-itYbWN/glibc-2.26/signal/../sysdeps/unix/sysv/linux/raise.c:48
abort at /build/glibc-itYbWN/glibc-2.26/stdlib/abort.c:90
__libc_message at /build/glibc-itYbWN/glibc-2.26/libio/../sysdeps/posix/libc_fatal.c:181
malloc_printerr at /build/glibc-itYbWN/glibc-2.26/malloc/malloc.c:5426
_int_free at /build/glibc-itYbWN/glibc-2.26/malloc/malloc.c:4175
__libc_free at /build/glibc-itYbWN/glibc-2.26/malloc/malloc.c:3145
unknown function (ip: 0x7fd3b8477d7b)
unknown function (ip: 0x7fd3b8477dc2)
unknown function (ip: 0x7fd3b8478063)
unknown function (ip: 0x7fd3b836a92f)
unknown function (ip: 0x7fd3b8344abb)
cuInit at /usr/lib/x86_64-linux-gnu/libcuda.so.390.30 (unknown line)
macro expansion at /home/maetshju/.julia/v0.6/CUDAdrv/src/base.jl:143 [inlined]
init at /home/maetshju/.julia/v0.6/CUDAdrv/src/init.jl:10
__init__ at /home/maetshju/.julia/v0.6/CUDAdrv/src/init.jl:29
unknown function (ip: 0x7fd3bd2360ff)
signal (11): Segmentation fault
while loading no file, in expression starting on line 0
I'm not quite sure how to give code to reproduce the error, unless you want to run my current script for the neural network model I'm developing and hope that it fails in such a way as to get CuArrays stuck like this, but I'm happy to provide any extra information I can.
It seems permutedims is not correctly permuting CuArrays in some cases. For example:
julia> e = cu(rand(2, 2, 2))
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
0.223654 0.428071
0.498901 0.423364
[:, :, 2] =
0.257477 0.138125
0.612776 0.565442
julia> permutedims(e, (3, 1, 2))
2×2×2 CuArray{Float64,3}:
[:, :, 1] =
0.223654 0.257477
0.428071 0.138125
[:, :, 2] =
0.498901 0.612776
0.423364 0.565442
julia> permutedims(collect(e), (3, 1, 2))
2×2×2 Array{Float64,3}:
[:, :, 1] =
0.223654 0.498901
0.257477 0.612776
[:, :, 2] =
0.428071 0.423364
0.138125 0.565442
where it appears CuArrays has implemented the permutation (2, 3, 1) instead. As an aside, it would be useful if CuArrays also accepted the square-bracket notation [3, 1, 2] for the permutation, as this is what the docs of Base.permutedims suggest.
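For comparison, Base accepts both forms on the CPU; a minimal illustration:

```julia
e = rand(2, 2, 2)
# Base.permutedims accepts the permutation as a tuple or a vector:
permutedims(e, (3, 1, 2)) == permutedims(e, [3, 1, 2])  # true
```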
What's the best way to reduce along only one dimension? For example, `sum(x, 1)`?
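For context, this is the Julia 0.6 dimension-reducing syntax in question, with a host fallback as the workaround (a sketch; whether the GPU path exists is exactly the question being asked):

```julia
using CuArrays

x = cu(rand(Float32, 4, 5))

# Julia 0.6 dimension-reducing form: sum over dimension 1, giving 1×5.
s = sum(x, 1)

# Workaround if the GPU path is missing: reduce on the host.
s_host = sum(collect(x), 1)
```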
I am using Julia 0.6.2 in Atom (GTX 1060, CUDA installed, Ryzen 5 1600X).
I know I can't use the newest version on 0.6.2, but if I install CuArrays it automatically installs an older version of LLVM, and I get a LoadError:
===[ ERROR: LLVM ]===
LoadError: Unknown OS
while loading C:\Users\Max.julia\v0.6\LLVM\deps\build.jl, in expression starting on line 104
Part of the problem:
INFO: Building CUDAnative
ERROR: LoadError: ArgumentError: Module Unicode not found in current path.
Run Pkg.add("Unicode") to install the Unicode package.
If I install the newest version of LLVM, `Pkg.test("CuArrays")` gives:
WARNING: julia is fixed at 0.6.2 conflicting with requirement for LLVM: [0.7.0-DEV.2915,∞)
How can I solve this problem without Julia 0.7?
_ _ _(_)_ | A fresh approach to technical computing
(_) | (_) (_) | Documentation: https://docs.julialang.org
_ _ _| |_ __ _ | Type "?help" for help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 0.6.0 (2017-06-19 13:05 UTC)
_/ |\__'_|_|_|\__'_| |
|__/ | x86_64-linux-gnu
julia> using CuArrays
julia> CuArrays.allowslow(false)
false
julia> x = cu(abs.(randn(10, 10)))
10×10 CuArray{Float64,2}:
1.01167 0.926641 0.122783 0.433447 0.471071 … 0.641311 0.322102 1.11901 1.08687
0.174375 0.902149 0.708563 0.100777 1.17209 0.984024 1.54349 2.31564 1.42121
0.880497 0.260067 0.182906 0.321197 0.419981 1.0425 1.03248 2.39686 0.466474
0.455405 0.189683 0.970617 1.02481 1.20051 0.548943 1.0183 0.922669 0.674622
1.65143 0.423125 0.199559 0.756164 0.769089 1.65847 0.336149 0.0769364 0.728361
1.13424 0.463791 0.487644 0.422425 0.642065 … 0.582538 0.699725 0.0974422 0.725078
1.42058 0.551611 1.55393 1.81143 0.848366 2.38784 0.995038 1.15988 2.07894
0.0542946 1.25654 1.99043 1.3079 0.0102366 0.260701 1.28124 0.301421 0.744825
2.18924 0.345994 1.44563 1.96132 1.38548 1.22024 0.151539 0.120201 0.810535
1.06216 0.0061378 0.0229676 0.110931 0.036234 0.386171 0.931272 0.23151 0.688115
julia> x .^ 1.5
ERROR: LLVM error: Cannot select: 0x17890df0: f64 = fpow 0x17890990, ConstantFP:f64<1.500000e+00>
0x17890990: f64,ch = load<LD8[null(addrspace=101)]> 0x18940cc0, TargetExternalSymbol:i64'julia__7_61683_param_0', undef:i64
0x178908b0: i64 = TargetExternalSymbol'julia__7_61683_param_0'
0x17890920: i64 = undef
0x17890d80: f64 = ConstantFP<1.500000e+00>
In function: julia__7_61683
Stacktrace:
[1] handle_error(::Cstring) at /home/alha02/.julia/v0.6/LLVM/src/core/context.jl:96
[2] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/util/logging.jl:102 [inlined]
[3] macro expansion at /home/alha02/.julia/v0.6/LLVM/src/base.jl:18 [inlined]
[4] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/alha02/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
[5] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/alha02/.julia/v0.6/LLVM/src/targetmachine.jl:39
[6] #mcgen#45(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:303
[7] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
[8] #compile_function#46(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:328
[9] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/alha02/.julia/v0.6/CUDAnative/src/jit.jl:369
[10] macro expansion at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[11] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##7#8, ::CUDAnative.CuDeviceArray{Float64,2,CUDAnative.AS.Global}, ::Tuple{Tuple{Bool,Bool}}, ::Tuple{Tuple{Int64,Int64}}, ::CUDAnative.CuDeviceArray{Float64,2,CUDAnative.AS.Global}, ::Tuple{}) at /home/alha02/.julia/v0.6/CUDAnative/src/execution.jl:80
[12] _broadcast! at /home/alha02/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[13] broadcast_t at /home/alha02/.julia/v0.6/CuArrays/src/broadcast.jl:37 [inlined]
[14] broadcast_c at /home/alha02/.julia/v0.6/CuArrays/src/broadcast.jl:58 [inlined]
[15] broadcast(::Function, ::CuArray{Float64,2}) at ./broadcast.jl:434
Bringing the array to host first works:
julia> collect(x) .^ 1.5
10×10 Array{Float64,2}:
1.01756 0.892005 0.0430235 0.285367 … 0.513574 0.182806 1.18373 1.13309
0.0728158 0.856875 0.596441 0.0319922 0.976132 1.9176 3.52376 1.69429
0.826213 0.132626 0.078224 0.182036 1.06443 1.04911 3.71077 0.318597
0.307324 0.0826121 0.95625 1.03744 0.406716 1.02758 0.886275 0.554103
2.12221 0.275235 0.089147 0.657543 2.1358 0.194894 0.0213402 0.621613
1.20798 0.315852 0.340529 0.274552 … 0.444618 0.585317 0.0304173 0.617415
1.69316 0.409684 1.93708 2.43799 3.68983 0.992566 1.24916 2.99752
0.0126513 1.40853 2.80814 1.49575 0.133111 1.45026 0.165486 0.642808
3.23921 0.203518 1.73815 2.74676 1.34793 0.0589909 0.0416736 0.729722
1.09467 0.00048086 0.00348075 0.0369472 0.239977 0.898701 0.111392 0.570809
and integer-valued powers work:
julia> x .^ 2.0
10×10 CuArray{Float64,2}:
1.02349 0.858663 0.0150756 0.187876 … 0.41128 0.10375 1.25218 1.18129
0.0304066 0.813873 0.502062 0.010156 0.968303 2.38237 5.36219 2.01984
0.775276 0.067635 0.0334544 0.103167 1.08681 1.06602 5.74494 0.217598
0.207394 0.0359798 0.942097 1.05023 0.301338 1.03694 0.851318 0.455115
2.72721 0.179035 0.0398237 0.571784 2.75051 0.112996 0.00591921 0.53051
1.28651 0.215102 0.237797 0.178443 … 0.339351 0.489616 0.00949498 0.525738
2.01804 0.304275 2.4147 3.28127 5.70176 0.9901 1.34532 4.32198
0.0029479 1.5789 3.9618 1.71059 0.0679649 1.64158 0.0908546 0.554764
4.79275 0.119712 2.08985 3.84676 1.48899 0.022964 0.0144482 0.656967
1.12818 3.76726e-5 0.000527509 0.0123058 0.149128 0.867268 0.0535968 0.473502
Maybe related to #72, but the behaviour is different: there is no error, but the result of `cat(3, cu(x), cu(y))` differs from that of `cat(3, x, y)`:
julia> using CuArrays;
julia> x = rand(2, 1);
julia> y = rand(2, 1);
julia> cat(3,x,y)
2×1×2 Array{Float64,3}:
[:, :, 1] =
0.151418
0.388829
[:, :, 2] =
0.732151
0.991128
julia> cat(3,cu(x),cu(y))
2×1×2 CuArray{Float32,3}:
[:, :, 1] =
0.151418
0.388829
[:, :, 2] =
-1.25053f-6
3.89817f-6
After checking out CuArrays, I ran into some test errors.
julia> Pkg.checkout("CuArrays")
INFO: Checking out CuArrays master...
INFO: Pulling CuArrays latest master...
INFO: Cloning cache of Adapt from https://github.com/MikeInnes/Adapt.jl.git
INFO: Installing Adapt v0.1.0
INFO: Upgrading CUDAapi: v0.2.1 => v0.3.0
INFO: Upgrading CUDAdrv: v0.6.1 => v0.7.3
INFO: Upgrading CUDAnative: v0.5.3 => v0.5.4
INFO: Building CUDAdrv
WARNING: Found multiple CUDA driver installations: /usr/lib/x86_64-linux-gnu and /usr
INFO: Building LLVM
INFO: LLVM.jl has already been built for this toolchain, no need to rebuild
INFO: Building CUDAnative
WARNING: Found multiple CUDA toolkit installations: /usr/local/cuda and /usr/local/cuda-8.0
julia> Pkg.test("CuArrays")
INFO: Testing CuArrays
ERROR: LoadError: UndefVarError: configured not defined
Stacktrace:
[1] include_from_node1(::String) at ./loading.jl:576
[2] include(::String) at ./sysimg.jl:14
[3] anonymous at ./<missing>:2
while loading /home/zhuj6/.julia/v0.6/CuArrays/src/CuArrays.jl, in expression starting on line 13
ERROR: LoadError: Failed to precompile CuArrays to /home/zhuj6/.julia/lib/v0.6/CuArrays.ji.
Stacktrace:
[1] compilecache(::String) at ./loading.jl:710
[2] _require(::Symbol) at ./loading.jl:463
[3] require(::Symbol) at ./loading.jl:405
[4] include_from_node1(::String) at ./loading.jl:576
[5] include(::String) at ./sysimg.jl:14
[6] process_options(::Base.JLOptions) at ./client.jl:305
[7] _start() at ./client.jl:371
while loading /home/zhuj6/.julia/v0.6/CuArrays/test/runtests.jl, in expression starting on line 15
ERROR: LoadError: failed process: Process(`/home/zhuj6/julia/usr/bin/julia -Cnative -J/home/zhuj6/julia/usr/lib/julia/sys.so --compile=yes --depwarn=yes --color=yes --compilecache=yes --startup-file=yes --code-coverage=none /home/zhuj6/.julia/v0.6/CuArrays/test/runtests.jl`, ProcessExited(1)) [1]
Stacktrace:
[1] pipeline_error(::Base.Process) at ./process.jl:682
[2] run(::Cmd) at ./process.jl:651
[3] include_from_node1(::String) at ./loading.jl:576
[4] include(::String) at ./sysimg.jl:14
[5] process_options(::Base.JLOptions) at ./client.jl:305
[6] _start() at ./client.jl:371
while loading /home/zhuj6/.julia/v0.6/CuArrays/test/runtests.jl, in expression starting on line 4
==============================[ ERROR: CuArrays ]===============================
failed process: Process(`/home/zhuj6/julia/usr/bin/julia -Cnative -J/home/zhuj6/julia/usr/lib/julia/sys.so --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes /home/zhuj6/.julia/v0.6/CuArrays/test/runtests.jl`, ProcessExited(1)) [1]
================================================================================
ERROR: CuArrays had test errors
Hi Mike,
Thanks for such an effort! I think non-contiguous array indexing is the only missing CuArray feature that we use heavily in our models. Example code snippet:
julia> a1 = rand(Float32, 4,5)
4×5 Array{Float32,2}:
0.698076 0.795957 0.501911 0.148559 0.416837
0.817677 0.430873 0.383991 0.443963 0.201368
0.359252 0.210537 0.277426 0.985338 0.454552
0.821997 0.0168654 0.453663 0.40859 0.908441
julia> c1 = CuArray(a1)
4×5 CuArray{Float32,2}:
0.698076 0.795957 0.501911 0.148559 0.416837
0.817677 0.430873 0.383991 0.443963 0.201368
0.359252 0.210537 0.277426 0.985338 0.454552
0.821997 0.0168654 0.453663 0.40859 0.908441
julia> a1[[1,20]]
2-element Array{Float32,1}:
0.698076
0.908441
julia> g1[[1,20]]
ERROR: MethodError: Cannot `convert` an object of type Tuple{Int64} to an object of type Array{Float64,3}
This may have arisen from a call to the constructor Array{Float64,3}(...),
since type constructors fall back to convert methods.
Stacktrace:
[1] getindex(::GPUArrays.GPUArray{Float64,3,CUDAdrv.CuArray{Float64,3},GPUArrays.CUBackend.CUContext}, ::Array{Int64,1}) at /KUFS/scratch/ikesen16/.julia/newnode/v0.6/GPUArrays/src/abstractarray.jl:398
julia> a1[[1,2],:]
2×5 Array{Float32,2}:
0.698076 0.795957 0.501911 0.148559 0.416837
0.817677 0.430873 0.383991 0.443963 0.201368
julia> c1[[1,2],:]
ERROR: don't know how to handle argument of type Array{Int64,1}
Stacktrace:
[1] cudaconvert(::Array{Int64,1}) at /KUFS/scratch/ikesen16/.julia/newnode/v0.6/CUDAnative/src/execution.jl:20
[2] broadcast(::Function, ::Tuple{Array{Int64,1},Base.Slice{Base.OneTo{Int64}}}) at ./broadcast.jl:17
[3] _unsafe_getindex!(::CuArray{Float32,2}, ::CuArray{Float32,2}, ::Array{Int64,1}, ::Base.Slice{Base.OneTo{Int64}}, ::Vararg{Base.Slice{Base.OneTo{Int64}},N} where N) at /KUFS/scratch/ikesen16/.julia/newnode/v0.6/CuArrays/src/indexing.jl:50
[4] macro expansion at ./multidimensional.jl:460 [inlined]
[5] _unsafe_getindex(::IndexLinear, ::CuArray{Float32,2}, ::Array{Int64,1}, ::Base.Slice{Base.OneTo{Int64}}) at ./multidimensional.jl:453
[6] macro expansion at ./multidimensional.jl:442 [inlined]
[7] _getindex at ./multidimensional.jl:438 [inlined]
[8] getindex(::CuArray{Float32,2}, ::Array{Int64,1}, ::Colon) at ./abstractarray.jl:882
julia> a1[1:2:end,1:2:end]
2×3 Array{Float32,2}:
0.698076 0.501911 0.416837
0.359252 0.277426 0.454552
julia> c1[1:2:end,1:2:end]
2×3 CuArray{Float32,2}:
0.698076 0.501911 0.416837
0.359252 0.277426 0.454552
I think if the last case works, the others should not be hard to implement. I will try to test CuArrays with my dynamic neural-net benchmark examples (for now, without indexing).
We need to wrap more BLAS kernels; currently we only have matmul.
As part of JuliaGPU/CUDAdrv.jl#63, we'll need to fold CUBLAS.jl into this package.
Is there a way I can create a `CuArray` filled with a constant?
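A couple of ways this can be done, as far as I know (Julia 0.6-era API; the exact constructors may differ between versions):

```julia
using CuArrays

# Fill on the host, then upload:
a = cu(fill(3f0, 10, 10))

# Or allocate on the device and fill! in place:
b = fill!(CuArray{Float32}(10, 10), 3f0)
```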
Same as #21 but for CUDNN. See also JuliaGPU/CUDAdrv.jl#63.
I was trying to nail down why I was getting an error in `Flux.crossentropy` and came up with this minimal example. `+`, `-`, `./`, `.*`, `*`, and `exp.` work, but `log.` doesn't work for me. Any ideas why I'm getting this error?
> using CuArrays
> x = CuArray([2.f0]);
> sum(exp.(x))
7.389056f0
> sum(log.(x))
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
[1] macro expansion at /home/ec2-user/.julia/v0.6/CUDAdrv/src/base.jl:148 [inlined]
[2] #download!#5(::Bool, ::Function, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at /home/ec2-user/.julia/v0.6/CUDAdrv/src/memory.jl:224
[3] (::CUDAdrv.Mem.#kw##download!)(::Array{Any,1}, ::CUDAdrv.Mem.#download!, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at ./<missing>:0
[4] #download!#8 at /home/ec2-user/.julia/v0.6/CUDAdrv/src/memory.jl:292 [inlined]
[5] download! at /home/ec2-user/.julia/v0.6/CUDAdrv/src/memory.jl:291 [inlined] (repeats 2 times)
[6] copy!(::Array{Float32,1}, ::CuArray{Float32,1}) at /home/ec2-user/.julia/v0.6/CuArrays/src/array.jl:66
[7] convert at /home/ec2-user/.julia/v0.6/GPUArrays/src/construction.jl:95 [inlined]
[8] convert at ./abstractarray.jl:839 [inlined]
[9] Type at ./sysimg.jl:77 [inlined]
[10] acc_mapreduce(::Function, ::Function, ::Float32, ::CuArray{Float32,1}, ::Tuple{}) at /home/ec2-user/.julia/v0.6/GPUArrays/src/mapreduce.jl:138
[11] sum(::CuArray{Float32,1}) at ./reduce.jl:359
versions:
> using CUDAdrv
> CuDevice(0)
CuDevice(0): Tesla K80
$ ../../julia/julia --version
julia version 0.6.2
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
$ uname
Linux
The readme uses the example zs .= sin.(xs) .+ ys .* 2
to demo broadcasting, but then also says
When broadcasting, watch out for errors like:
julia> sin.(cos.(xs))
ERROR: CUDA error: invalid program counter (code #718, ERROR_INVALID_PC)
A current limitation of CUDAnative means that you'll need to restart Julia and use CuArrays.sin, CuArrays.cos etc. in this case.
It's not clear to me why the usage of `sin` is OK in the first case but not in the second. It might help to expand on what exactly to watch out for in the second case.
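For context, the workaround the readme alludes to looks roughly like this (a sketch, following the readme's own suggestion):

```julia
using CuArrays

xs = cu(rand(Float32, 10))

# Per the readme, after restarting Julia, use the CuArrays-provided
# versions instead of Base's when composing functions in a broadcast:
ys = CuArrays.sin.(CuArrays.cos.(xs))
```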
It would be nice to have a new release so that CURAND is usable in a released version of CuArrays.
Just wondering if there is an obvious way to enable QR decompositions here. My code is now much faster thanks to this package, but I still have to do QR on the CPU and hence have a bit of copying back and forth.
In trying to get ForwardDiff's main example to work, I'm trying to reduce with binary operators:
xs = rand(5)
sum(sin, xs)
Initially, it looked like CuArrays supports this:
xs = CuArray(rand(5))
sum(sin, xs)
But as it turns out, this does a CPU reduction because of (what I presume to be) a missing definition:
Base.sum(f::Base.Callable, xs::CuArray) = reduce(f, 0, xs)
But now it shows how `CuArrays.reduce_grid` calls `op` in a binary fashion, which obviously fails:
CUDAnative.code_warntype(CuArrays.reduce_grid,
(typeof(CUDAnative.sin), Float64,
CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},
CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},
Int32))
val::Any = (op::CUDAnative.#sin)(val::Any, (Base.pointerref)((Core.getfield)((Core.getfield)(input::CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, :ptr)::CUDAnative.DevicePtr{Float64,CUDAnative.AS.Global}, :ptr)::Ptr{Float64}, (Base.zext_int)(Int64, i::UInt32)::Int64, 8)::Float64)::Any
Base.sum(f::Base.Callable, xs::CuArray) = reduce(f, 0, xs)
xs = CuArray(rand(5))
sum(sin, xs)
ERROR: LoadError: error compiling reduce_grid: emit_allocobj for CuArrays/src/reduction.jl:58 requires the dynamic_alloc language feature, which is disabled
When running the simple code
x = cu(Float32[1, 2, 3.0])
log.(exp.(x - maximum(x)))
I got the following error:
warning: ignoring debug info with an invalid version (0) in
3-element CuArray{Float32,1}:
Error showing value of type CuArray{Float32,1}:
ERROR: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
[1] macro expansion at /home/xiucheng/.julia/v0.6/CUDAdrv/src/base.jl:148 [inlined]
[2] download(::Ptr{Float32}, ::CUDAdrv.OwnedPtr{Void}, ::Int64) at /home/xiucheng/.julia/v0.6/CUDAdrv/src/memory.jl:141
[3] copy!(::Array{Float32,1}, ::CuArray{Float32,1}) at /home/xiucheng/.julia/v0.6/CuArrays/src/array.jl:59
[4] #showarray#1(::Bool, ::Function, ::IOContext{Base.Terminals.TTYTerminal}, ::CuArray{Float32,1}, ::Bool) at /home/xiucheng/.julia/v0.6/CuArrays/src/array.jl:121
[5] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1}) at ./REPL.jl:122
[6] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::CuArray{Float32,1}) at ./REPL.jl:125
[7] display(::CuArray{Float32,1}) at ./multimedia.jl:218
[8] eval(::Module, ::Any) at ./boot.jl:235
[9] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:144
[10] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:129
[11] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:646
But it works normally if I separate the composed computation into two steps:
x = cu(Float32[1, 2, 3.0])
x = exp.(x - maximum(x))
log.(x)
The system and version info:
The URL of this package does not match that stored in METADATA.jl.
cc: @MikeInnes
I get the following error after many prediction queries to a Flux-based LSTM. This is the error on CUDA 9.0; the error on CUDA 8.0 is similar, but it relates to garbage collection (when calling `free`). Does it make sense? Is the source of the error obvious, and is it easily fixable? I could try to work on a replicable example, but it's pretty difficult, as I am not sure exactly what triggers the error (a very large code base is involved). My guess is that it's garbage collection.
Just to clarify a bit: this error never occurs when training the LSTM, only when doing inference with it, where SGD is being performed on a vector of parameters (unrelated to the LSTM). The predictions from the LSTM are used to define the loss. Prior to running this SGD on the GPU, the LSTM is "untracked" with mTᵏ = Flux.mapleaves(Flux.Tracker.data, mTᵏ). The whole procedure runs fine on CPUs. Is this a CuArrays issue or a Flux issue? Anything I can do to further diagnose? Much appreciated!
signal (11): Segmentation fault
while loading no file, in expression starting on line 0
cfree at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
cudnnDestroyTensorDescriptor at /usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.1 (unknown line)
unknown function (ip: 0x7f2dd8194a55)
cudnnRNNBackwardData at /usr/lib/x86_64-linux-gnu/libcudnn.so.7.1.1 (unknown line)
macro expansion at /home/ubuntu/.julia/v0.6/CuArrays/src/dnn/error.jl:17 [inlined]
cudnnRNNBackwardData at /home/ubuntu/.julia/v0.6/Flux/src/cuda/cudnn.jl:189
unknown function (ip: 0x7f2dfc1c217b)
jl_call_fptr_internal at /usr/local/src/julia/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /usr/local/src/julia/src/julia_internal.h:358 [inlined]
jl_invoke at /usr/local/src/julia/src/gf.c:41
backwardData at /home/ubuntu/.julia/v0.6/Flux/src/cuda/cudnn.jl:206
unknown function (ip: 0x7f2dfc1c12e0)
jl_call_fptr_internal at /usr/local/src/julia/src/julia_internal.h:339 [inlined]
jl_call_method_internal at /usr/local/src/julia/src/julia_internal.h:358 [inlined]
jl_apply_generic at /usr/local/src/julia/src/gf.c:1926
back_ at /home/ubuntu/.julia/v0.6/Flux/src/cuda/cudnn.jl:363
back_ at /home/ubuntu/.julia/v0.6/Flux/src/tracker/back.jl:25
unknown function (ip: 0x7f2dfc1c077d)
Hi Mike, would it be possible to add support for setindex! or append!? So e.g.
x = cu(randn(10))
x[1:5] .= cu(randn(5))
or
x = cu(randn(5))
append!(x, cu(randn(5)))
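In the meantime, a workaround sketch (hedged: setrange is a hypothetical helper, not a CuArrays API): stage the update on the host, which costs two full transfers but only relies on whole-array copies.

```julia
using CuArrays

# Hypothetical helper: update a range of a CuArray by round-tripping
# through the host. Slow, but avoids device-side setindex!.
function setrange(x::CuArray, r::UnitRange, v::CuArray)
    h = Array(x)       # download the full array to the host
    h[r] = Array(v)    # do the setindex! on the CPU
    copy!(x, h)        # upload the result back to the device
    return x
end
```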
Use case: collect both sum(x)
and sum(abs(x))
in one pass.
a = randn(3, 3)
reduce((x0,x)->(x0[1]+x, x0[2]+abs(x)), (0.,0.), a)
works, but:
a = cu(a)
reduce((x0,x)->(x0[1]+x, x0[2]+abs(x)), (0.,0.), a)
ERROR: MethodError: Cannot `convert` an object of type Tuple{Float64,Float64} to an object of
type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.
Stacktrace:
[1] reduce(::Function, ::Tuple{Float64,Float64}, ::CuArray{Float64,2}) at /.julia/v0
.6/CuArrays/src/reduction.jl:87
Probably harder, but it would be nice to make it work along some dimension as well, for example to collect sum(x, 1)
and sum(abs(x), 1)
in one pass.
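Until tuple-valued accumulators work in the GPU reduce, one fallback I'd sketch is simply two single-valued passes (correct, at the cost of reading the data twice and one temporary for abs.(a)):

```julia
using CuArrays

a = cu(randn(3, 3))
s  = reduce(+, 0f0, a)         # sum(x)
sa = reduce(+, 0f0, abs.(a))   # sum(abs(x)), via a temporary abs.(a)
```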
My Flux
code is taking 10-50x longer to run on a top GPU (Tesla V100) than on an old CPU. There is also a Flux issue open for this, but the problem is actually with CuArrays
, as demonstrated below.
The problem is in the 2nd line of Flux.back
:
function back(::typeof(getindex), Δ, xs::Flux.TrackedArray, i...)
Δ′ = zeros(xs.data)
Δ′[i...] = Δ
Flux.Tracker.@back(xs, Δ′)
end
This kills performance on GPUs. Is there any temporary workaround for this? Our code relies heavily on doing something like out = [out[i,:] for i=1:(3*nparams)]
in advance, as part of inference, so we can't avoid this type of indexing (I really tried to avoid it!!!). Any bandaid solution, anything? We only need back
to work for indices of the type out[i,:]
.
We are trying to wrap up a project that has become solely dependent on Flux at this point. We are completely stuck, as our Flux code is effectively unsuitable for GPUs. It's also impossible for us to finish this project with CPUs. Would it be possible to help us? Please? Any advice would be highly appreciated!
Here is an example that highlights the performance hit and the type of indexing that we need.
Please note that it's much faster to copy the array to the CPU, do the setindex!
, then copy back to the GPU!!! This is shown below in back_hack
. There's got to be a more performant GPU-only solution, no???
julia> using Flux
julia> using CuArrays
julia> testx = rand(2,100);
julia> x = param(testx);
julia> idx = (1,:); # (1, Colon())
julia> l = Flux.getindex(x,idx...);
julia> l2 = sum(l);
julia> @time Flux.back!(l2);
0.000010 seconds (9 allocations: 2.844 KiB)
julia> xg = param(testx) |> gpu;
julia> l = Flux.getindex(xg,idx...);
julia> l2 = sum(l);
julia> @time Flux.back!(l2);
0.044698 seconds (1.77 k allocations: 86.266 KiB)
Here is comparison to CPU version:
function back(::typeof(getindex), Δ, xs::Flux.TrackedArray, i...)
Δ′ = zeros(xs.data)
Δ′[i...] = Δ
Flux.Tracker.@back(xs, Δ′)
end
function back_hack(::typeof(getindex), Δ, xs::Flux.TrackedArray, i...)
Δ′ = zeros(xs.data|>cpu)
Δ′[i...] = (Δ|>cpu)
Flux.Tracker.@back(xs, Δ′|>gpu)
end
## directly call back
## ON CPU:
@time back(getindex, Flux.Tracker.grad(x)[1,1:100], x, idx...)
# 0.000019 seconds (11 allocations: 2.922 KiB)
## ON GPU
@time back(getindex, Flux.Tracker.grad(xg)[1,1:100], xg, idx...)
# 0.030649 seconds (1.78 k allocations: 86.656 KiB)
## even moving to CPU then back to GPU is doing better:
@time back_hack(getindex, Flux.Tracker.grad(xg)[1,1:100], xg, idx...)
# 0.000290 seconds (97 allocations: 6.297 KiB)
# Note that the entire run time is dominated by setindex!
xg = rand(2,100) |> gpu;
repl = zeros(100) |> gpu;
@time xg[1,:] = repl;
# 0.030225 seconds (1.71 k allocations: 84.547 KiB)
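One possible bandaid for the out[i,:] pattern, avoiding setindex! on the device altogether (a hedged sketch, not Flux's actual back method; rowback is a hypothetical name): build a one-hot mask on the host, then form the full gradient with a single broadcast kernel.

```julia
using CuArrays

# Sketch: gradient of getindex for an index of the form (i, :).
# The (n,1) one-hot mask times the (1,m) row broadcasts to the full
# (n,m) gradient in one fused GPU kernel, with no scalar indexing.
function rowback(Δ::CuArray, xs::CuArray, i::Int)
    mask = cu(Float32.((1:size(xs, 1)) .== i))  # one-hot column, built on host
    return mask .* reshape(Δ, 1, length(Δ))
end
```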
Just curious: I am trying to install CuArrays on Julia 0.6.2, but it requires LLVM. LLVM compiles out of the box on Julia 0.7-dev but does not want to on Julia 0.6.2. The other way around: CuArrays does not want to build itself on Julia 0.7-dev.
Is there any possible way out of this circle? Or has someone succeeded recently with the build on 0.6.2? Thank you in advance.
On julia 0.6
julia> using CuArrays
julia> x=cu(rand(3))
3-element CuArray{Float32,1}:
0.130609
0.036018
0.938617
julia> using SpecialFunctions
julia> erf.(x)
3-element CuArray{Float32,1}:
Error showing value of type CuArray{Float32,1}:
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
[1] macro expansion at /home/lucibello/.julia/v0.6/CUDAdrv/src/base.jl:148 [inlined]
[2] #download!#5(::Bool, ::Function, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:224
[3] (::CUDAdrv.Mem.#kw##download!)(::Array{Any,1}, ::CUDAdrv.Mem.#download!, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at ./<missing>:0
[4] #download!#8 at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:292 [inlined]
[5] download! at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:291 [inlined] (repeats 2 times)
[6] copy!(::Array{Float32,1}, ::CuArray{Float32,1}) at /home/lucibello/.julia/v0.6/CuArrays/src/array.jl:66
[7] #showarray#3(::Bool, ::Function, ::IOContext{Base.Terminals.TTYTerminal}, ::CuArray{Float32,1}, ::Bool) at /home/lucibello/.julia/v0.6/CuArrays/src/array.jl:143
[8] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1}) at ./REPL.jl:122
[9] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::CuArray{Float32,1}) at ./REPL.jl:125
[10] display(::CuArray{Float32,1}) at ./multimedia.jl:194
[11] eval(::Module, ::Any) at ./boot.jl:235
[12] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:144
[13] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:129
[14] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:646
julia> x
3-element CuArray{Float32,1}:
Error showing value of type CuArray{Float32,1}:
ERROR: CUDA error: unspecified launch failure (code #719, ERROR_LAUNCH_FAILED)
Stacktrace:
[1] macro expansion at /home/lucibello/.julia/v0.6/CUDAdrv/src/base.jl:148 [inlined]
[2] #download!#5(::Bool, ::Function, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:224
[3] (::CUDAdrv.Mem.#kw##download!)(::Array{Any,1}, ::CUDAdrv.Mem.#download!, ::Base.RefArray{Float32,Array{Float32,1},Void}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CUDAdrv.CuStream) at ./<missing>:0
[4] #download!#8 at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:292 [inlined]
[5] download! at /home/lucibello/.julia/v0.6/CUDAdrv/src/memory.jl:291 [inlined] (repeats 2 times)
[6] copy!(::Array{Float32,1}, ::CuArray{Float32,1}) at /home/lucibello/.julia/v0.6/CuArrays/src/array.jl:66
[7] #showarray#3(::Bool, ::Function, ::IOContext{Base.Terminals.TTYTerminal}, ::CuArray{Float32,1}, ::Bool) at /home/lucibello/.julia/v0.6/CuArrays/src/array.jl:143
[8] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::CuArray{Float32,1}) at ./REPL.jl:122
[9] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::CuArray{Float32,1}) at ./REPL.jl:125
[10] display(::CuArray{Float32,1}) at ./multimedia.jl:194
[11] eval(::Module, ::Any) at ./boot.jl:235
[12] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:144
[13] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:129
[14] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:646
using CuArrays
x = cu(rand(2))
y = cu(rand(2))
vcat(x,y)
Gives me warnings and error:
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_box_uint32", %59 = call i8** @jl_box_uint32(i32 zeroext %58), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_gc_pool_alloc", %60 = call i8** @jl_gc_pool_alloc(i8* %ptls_i8, i32 1456, i32 32), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %63 = call i8** @jl_apply_generic(i8*** %9, i32 3), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %64 = call i8** @jl_apply_generic(i8*** %12, i32 2), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %66 = call i8** @jl_apply_generic(i8*** %4, i32 4), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %67 = call i8** @jl_f_getfield(i8** null, i8*** %7, i32 2), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %68 = call i8** @jl_f_getfield(i8** null, i8*** %11, i32 2), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %70 = call i8** @jl_apply_generic(i8*** %5, i32 4), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %71 = call i8** @jl_f_getfield(i8** null, i8*** %15, i32 2), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %72 = call i8** @jl_f_getfield(i8** null, i8*** %16, i32 2), !dbg !17))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_apply_type", %78 = call i8** @jl_f_apply_type(i8** null, i8*** %8, i32 4), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_box_int64", %79 = call i8** @jl_box_int64(i64 signext %0), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_gc_pool_alloc", %80 = call i8** @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_new_structv", %86 = call i8** @jl_new_structv(i8** %78, i8*** %13, i32 3), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %87 = call i8** @jl_apply_generic(i8*** %6, i32 2), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_tuple", %88 = call i8** @jl_f_tuple(i8** null, i8*** %10, i32 1), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_tuple", %89 = call i8** @jl_f_tuple(i8** null, i8*** %14, i32 2), !dbg !23))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %53 = call i8** @jl_f_getfield(i8** null, i8*** %11, i32 2), !dbg !31))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %54 = call i8** @jl_f_getfield(i8** null, i8*** %12, i32 2), !dbg !31))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_gc_pool_alloc", %55 = call i8** @jl_gc_pool_alloc(i8* %ptls_i8, i32 1480, i32 48), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %58 = call i8** @jl_apply_generic(i8*** %14, i32 3), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_f_getfield", %59 = call i8** @jl_f_getfield(i8** null, i8*** %13, i32 2), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %60 = call i8** @jl_apply_generic(i8*** %15, i32 3), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_gc_pool_alloc", %61 = call i8** @jl_gc_pool_alloc(i8* %ptls_i8, i32 1432, i32 16), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_box_uint32", %65 = call i8** @jl_box_uint32(i32 zeroext %30), !dbg !32))
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_apply_generic", %66 = call i8** @jl_apply_generic(i8*** %10, i32 5), !dbg !32))
ERROR: LLVM IR generated for kernel(Int64, CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 6.1.0 is not compatible
Stacktrace:
[1] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/david/.julia/v0.6/CUDAnative/src/jit.jl:422
[2] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/david/.julia/v0.6/CUDAnative/src/jit.jl:476
[3] macro expansion at /home/david/.julia/v0.6/CUDAnative/src/execution.jl:108 [inlined]
[4] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#kernel#10, ::Int64, ::CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at /home/david/.julia/v0.6/CUDAnative/src/execution.jl:80
[5] _cat(::Int64, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::Vararg{CuArray{Float32,1},N} where N) at /home/david/.julia/v0.6/CuArrays/src/utils.jl:95
[6] cat_t(::Int64, ::Type{T} where T, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::Vararg{CuArray{Float32,1},N} where N) at /home/david/.julia/v0.6/CuArrays/src/utils.jl:103
[7] vcat(::CuArray{Float32,1}, ::CuArray{Float32,1}) at /home/david/.julia/v0.6/CuArrays/src/utils.jl:106
I built Julia 0.6.2 from source and am running on Ubuntu 16.04. The GPU is a GeForce 1060 (compute capability 6.1.0) and I am running CUDA 8.0 with cuDNN 7.0.5.
Has anyone seen this error before? Any help would be greatly appreciated! :)
Similar to #55 but this issue refers to hcat
.
julia> using CuArrays;
julia> x = cu(rand(2));
julia> y = cu(rand(2));
julia> hcat(x,y)
WARNING: Encountered incompatible LLVM IR for kernel(Int64, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 3.7.0: CUDAnative.InvalidIRError("calls the Julia runtime", ("jl_bounds_error_unboxed_int", call void @jl_bounds_error_unboxed_int(i8* %50, i8** inttoptr (i64 139624390903792 to i8**), i64 %.), !dbg !49))
ERROR: LLVM IR generated for kernel(Int64, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at capability 3.7.0 is not compatible
Stacktrace:
[1] #compile_function#58(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at ~/.julia/v0.6/CUDAnative/src/jit.jl:434
[2] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at ~/.julia/v0.6/CUDAnative/src/jit.jl:488
[3] macro expansion at ~/.julia/v0.6/CUDAnative/src/execution.jl:107 [inlined]
[4] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#kernel#12, ::Int64, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}}) at ~/.julia/v0.6/CUDAnative/src/execution.jl:80
[5] _cat(::Int64, ::CuArray{Float32,2}, ::CuArray{Float32,1}, ::Vararg{CuArray{Float32,1},N} where N) at ~/.julia/v0.6/CuArrays/src/utils.jl:96
[6] cat_t(::Int64, ::Type{T} where T, ::CuArray{Float32,1}, ::CuArray{Float32,1}, ::Vararg{CuArray{Float32,1},N} where N) at ~/.julia/v0.6/CuArrays/src/utils.jl:104
[7] hcat(::CuArray{Float32,1}, ::CuArray{Float32,1}) at ~/.julia/v0.6/CuArrays/src/utils.jl:108
The tag name "0.2.0" is not of the appropriate SemVer form (vX.Y.Z).
cc: @MikeInnes
Line 29 in 12269b8
Hi,
I've noticed that after a while, especially when creating a lot of small CuArrays, performance of creating a new CuArray goes down by at least 5 orders of magnitude. If I comment out the line I highlighted, none of this happens. I've also not found any instabilities or crashes after commenting this line out, not even when the memory of the GPU is 99% used. Is there any reason for this gc pass to be there?
This should be pretty easy, we just need to hook into Base's serialize
/deserialize
, however that's done.
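A minimal sketch of that hook, assuming the Julia 0.6 serializer API (Base.AbstractSerializer, Base.serialize_type) and that round-tripping through a host Array is acceptable:

```julia
import Base: serialize, deserialize

# Serialize a CuArray as its host copy; re-upload on deserialize.
function serialize(s::AbstractSerializer, x::CuArray)
    Base.serialize_type(s, CuArray)
    serialize(s, Array(x))
end
deserialize(s::AbstractSerializer, ::Type{CuArray}) = CuArray(deserialize(s))
```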
julia> x = cu(rand(1000,1000));
julia> x[1:10, [8;6]]
error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)
10×2 CuArray{Float64,2}:
Error showing value of type CuArray{Float64,2}:
ERROR: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
[1] macro expansion at /home/viralbshah/.julia/v0.6/CUDAdrv/src/base.jl:130 [inlined]
[2] download(::Ptr{Float64}, ::CUDAdrv.OwnedPtr{Float64}, ::Int64) at /home/viralbshah/.julia/v0.6/CUDAdrv/src/memory.jl:141
[3] copy!(::Array{Float64,2}, ::CuArray{Float64,2}) at /home/viralbshah/.julia/v0.6/CuArrays/src/array.jl:55
[4] #showarray#1(::Bool, ::Function, ::IOContext{Base.Terminals.TTYTerminal}, ::CuArray{Float64,2}, ::Bool) at /home/viralbshah/.julia/v0.6/CuArrays/src/array.jl:111
[5] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::CuArray{Float64,2}) at ./REPL.jl:122
[6] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::CuArray{Float64,2}) at ./REPL.jl:125
[7] display(::CuArray{Float64,2}) at ./multimedia.jl:194
[8] eval(::Module, ::Any) at ./boot.jl:235
[9] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:144
[10] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:129
[11] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:646
The current pool allocator never frees memory except when allocations fail. This quickly leads to CuArrays/Flux using all of the available memory, as frequently observed on cyclops
. We should probably have an idle thread scan for unused pool entries and free those up.
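A rough sketch of the idle-scan idea (CuArrays.reclaim_unused! is an assumed name, not the actual pool interface): fire a timer every few seconds and free pool entries that have not been used since the last scan.

```julia
# Periodically reclaim idle pool memory from a timer callback.
# CuArrays.reclaim_unused! is hypothetical; the real pool API may differ.
const pool_reclaimer = Timer(5, 5) do _
    CuArrays.reclaim_unused!()
end
```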
Earlier I posted a similar issue to FluxML/NNlib.jl, but this one is unrelated to that repo.
The following code:
using CuArrays
A = cu(rand(Float32, 10, 100))
B = cu(ones(Float32, 10, 100))
C = cu(ones(Float32, 1, 100))
D = cu(ones(Float32, 1, 100))
E = cu(rand(Float32, 10, 100))
CUDAnative.exp.(1 ./ A) .* (B .* (C ./ D)) .+ E
generates this error:
WARNING: Method definition convert(Type{LLVM.LLVMType}, Type{T} where T) in module Interop at /home/dfdx/.julia/v0.6/LLVM/src/interop/base.jl:54 overwritten in module CUDAnative at /home/dfdx/.julia/v0.6/CUDAnative/src/cgutils.jl:159.
warning: ignoring debug info with an invalid version (0) in #1
warning: ignoring debug info with an invalid version (0) in
ERROR: LoadError: LLVM error: All DICompileUnits must be listed in llvm.dbg.cu
Stacktrace:
[1] verify(::LLVM.Module) at /home/dfdx/.julia/v0.6/LLVM/src/analysis.jl:11
[2] #add_entry!#26(::Bool, ::Function, ::LLVM.Module, ::Any, ::Any) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:251
[3] (::CUDAnative.#kw##add_entry!)(::Array{Any,1}, ::CUDAnative.#add_entry!, ::LLVM.Module, ::Any, ::Any) at ./<missing>:0
[4] #compile_function#51(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:402
[5] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:465
[6] macro expansion at /home/dfdx/.julia/v0.6/CUDAnative/src/execution.jl:108 [inlined]
[7] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::CuArrays.#broadcast_kernel, ::##1#2, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::NTuple{5,Tuple{Bool,Bool}}, ::NTuple{5,Tuple{Int64,Int64}}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::NTuple{4,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}}) at /home/dfdx/.julia/v0.6/CUDAnative/src/execution.jl:80
[8] _broadcast! at /home/dfdx/.julia/v0.6/CuArrays/src/broadcast.jl:22 [inlined]
[9] broadcast_t(::Function, ::Type{T} where T, ::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::CuArray{Float32,2}) at /home/dfdx/.julia/v0.6/CuArrays/src/broadcast.jl:37
[10] broadcast_c(::Function, ::Type{CuArrays.CuArray}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::Vararg{CuArray{Float32,2},N} where N) at /home/dfdx/.julia/v0.6/CuArrays/src/broadcast.jl:58
[11] broadcast(::Function, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::CuArray{Float32,2}, ::Vararg{CuArray{Float32,2},N} where N) at ./broadcast.jl:455
[12] include_from_node1(::String) at ./loading.jl:576
[13] include(::String) at ./sysimg.jl:14
while loading /home/dfdx/Downloads/broadcast_fail.jl, in expression starting on line 10
I use Julia 0.6.2, CUDAnative 0.5.3 and the latest master of CuArrays. I'm open to testing it on Julia 0.7 and the latest CUDAnative master if it's assumed to be fixed there, but right now that setup seems to be broken (or at least I couldn't build the latest CUDAnative on my machine), so it will take time to resolve.
Currently I have to do cu(rand(...))
. It would be nice to have a native version.
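Until a native GPU RNG is hooked up, a thin convenience wrapper is easy to sketch (curand here is a hypothetical name; it still generates on the CPU and uploads):

```julia
using CuArrays

# Hypothetical convenience: a rand-like constructor returning a CuArray.
curand(dims::Integer...) = cu(rand(Float32, dims...))
```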
This exists here but we need to figure out how to integrate it with CuArrays.
I was trying the "mnist/mlp.jl" file, which on CPU runs just fine. However, when I uncomment the using CuArrays
line I get a broadcast error on the call to crossentropy
:
julia> include("mlp.jl")
INFO: Recompiling stale cache file /home/carlo/.julia/lib/v0.6/Flux.ji for module Flux.
INFO: Recompiling stale cache file /home/carlo/.julia/lib/v0.6/CuArrays.ji for module CuArrays.
ERROR: LoadError: Broadcast output type Any is not concrete
Stacktrace:
[1] broadcast_t at /home/carlo/.julia/v0.6/CuArrays/src/broadcast.jl:34 [inlined]
[2] broadcast_c at /home/carlo/.julia/v0.6/CuArrays/src/broadcast.jl:63 [inlined]
[3] broadcast at ./broadcast.jl:455 [inlined]
[4] tracked_broadcast(::Function, ::Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}, ::TrackedArray{…,CuArray{Float32,2}}, ::Int64) at /home/carlo/.julia/v0.6/Flux/src/tracker/array.jl:278
[5] #crossentropy#71(::Int64, ::Function, ::TrackedArray{…,CuArray{Float32,2}}, ::Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}) at /home/carlo/.julia/v0.6/Flux/src/layers/stateless.jl:8
[6] crossentropy(::TrackedArray{…,CuArray{Float32,2}}, ::Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}) at /home/carlo/.julia/v0.6/Flux/src/layers/stateless.jl:8
[7] loss(::CuArray{Float32,2}, ::Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}) at /home/carlo/Programs/Devel/deeplearning/model-zoo/mnist/mlp.jl:21
[8] #train!#130(::Flux.#throttled#14, ::Function, ::Function, ::Base.Iterators.Take{Base.Iterators.Repeated{Tuple{CuArray{Float32,2},Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}}}}, ::Flux.Optimise.##71#75) at /home/carlo/.julia/v0.6/Flux/src/optimise/train.jl:39
[9] (::Flux.Optimise.#kw##train!)(::Array{Any,1}, ::Flux.Optimise.#train!, ::Function, ::Base.Iterators.Take{Base.Iterators.Repeated{Tuple{CuArray{Float32,2},Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}}}}, ::Function) at ./<missing>:0
[10] include_from_node1(::String) at ./loading.jl:576
[11] include(::String) at ./sysimg.jl:14
while loading /home/carlo/Programs/Devel/deeplearning/model-zoo/mnist/mlp.jl, in expression starting on line 29
I can't figure out what's the issue. I'm on Julia 0.6.3-pre and I tried to get the latest versions of all packages. Any ideas?
permutedims(x, (2,1))
works really fast. transpose
does not. May just need to be hooked up.
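The hook-up could be as simple as routing the 2-D case through the fast kernel (a sketch; it ignores lazy/recursive transpose semantics and conjugation):

```julia
using CuArrays

# Eagerly materialize a matrix transpose via the fast permutedims kernel.
Base.transpose(x::CuArray{T,2}) where T = permutedims(x, (2, 1))
```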
It boils down to this:
a = cu(zeros(100))
b = cu(fill(1,100))
then
copy!(a,b)
throws
ERROR: MethodError: no method matching transfer!(::CUDAdrv.Mem.Buffer, ::CUDAdrv.Mem.Buffer)
From CuArrays, the copy! method calls Mem.transfer! from CUDAdrv's memory.jl module:
function Base.copy!(dst::CuArray{T}, src::CuArray{T}) where T
@assert length(dst) == length(src)
Mem.transfer!(unsafe_buffer(dst), unsafe_buffer(src))
return dst
end
But Mem.transfer! requires at least the number of bytes:
function transfer!(dst::Buffer, src::Buffer, nbytes::Integer,
stream::CuStream=CuDefaultStream(); async::Bool=false)
if async
@apicall(:cuMemcpyDtoDAsync,
(Ptr{Cvoid}, Ptr{Cvoid}, Csize_t, CuStream_t),
dst, src, nbytes, stream)
else
@assert stream==CuDefaultStream()
@apicall(:cuMemcpyDtoD,
(Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
dst, src, nbytes)
end
end
It also seems like this should be caught in Pkg.test("CuArrays").
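A minimal fix consistent with the two signatures quoted above would be to pass the byte count explicitly (a sketch against the current method, not a tested patch):

```julia
function Base.copy!(dst::CuArray{T}, src::CuArray{T}) where T
    @assert length(dst) == length(src)
    # Mem.transfer! needs the transfer size; derive it from the element count.
    Mem.transfer!(unsafe_buffer(dst), unsafe_buffer(src),
                  length(src) * sizeof(T))
    return dst
end
```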
Please see denizyuret/Knet.jl#198.