
GPUArrays.jl's People

Contributors

amontoison, andreasnoack, aterenin, bors[bot], chrisrackauckas, christiangnrd, danielwe, dependabot[bot], dlfivefifty, evelyne-ringoot, giggleliu, github-actions[bot], gvigne, haampie, janebert, kshyatt, lcw, maleadt, maxwindiff, mcabbott, mikeinnes, musm, n5n3, ranocha, simondanisch, simsurace, ssz66666, tgymnich, vchuravy, vpuri3


GPUArrays.jl's Issues

Example doesn't work in 0.6/OS X

I tried running the example but got:

julia> c = a * b # calls to BLAS
ERROR: MethodError: no method matching blas_module(::GPUArrays.CLBackend.CLContext)
Closest candidates are:
  blas_module(::GPUArrays.JLBackend.JLContext) at /Users/solver/.julia/v0.6/GPUArrays/src/backends/julia/julia.jl:81
  blas_module(::Union{GPUArrays.AbstractAccArray{T,1}, GPUArrays.AbstractAccArray{T,2}} where T) at /Users/solver/.julia/v0.6/GPUArrays/src/backends/blas.jl:4
Stacktrace:
 [1] gemm! at /Users/solver/.julia/v0.6/GPUArrays/src/backends/blas.jl:20 [inlined]
 [2] gemm_wrapper!(::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::Char, ::Char, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:369
 [3] *(::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:148
julia> versioninfo()
Julia Version 0.6.0-pre.beta.9
Commit ceb06740e3 (2017-04-02 01:04 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.4.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
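
A minimal sketch of a possible stopgap, assuming the goal is just to get a * b working: the error shows blas_module is only defined for the JLContext backend, so initializing the threaded Julia backend (JLBackend.init(), which appears in other reports below) before multiplying should take the CPU BLAS path instead. Whether an OpenCL BLAS path exists via CLBLAS is not clear from this report.

using GPUArrays
JLBackend.init()                     # CPU backend; its blas_module method exists per the error above
a = GPUArray(rand(Float32, 32, 32))
b = GPUArray(rand(Float32, 32, 32))
c = a * b                            # should dispatch to the JLContext BLAS path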

logreg example does not work, plus a few questions

I'm running Julia 0.6, and the example that uses the ReverseDiff package complains about a conversion to pointer in GPUArray. Is there a way to fix it?

LoadError: conversion to pointer not defined for GPUArrays.GPUArray{Float32,1,Array{Float32,1},GPUArrays.JLBackend.JLContext}
while loading C:\Users\PunkR.julia\v0.6\GPUArrays\examples\logreg.jl, in expression starting on line 23
in gradient at ReverseDiff\src\api\gradients.jl:22
in gradient at ReverseDiff\src\api\gradients.jl:24
in seeded_reverse_pass! at ReverseDiff\src\api\tape.jl:47
in seeded_reverse_pass! at ReverseDiff\src\api\utils.jl:31
in reverse_pass! at ReverseDiff\src\tape.jl:87
in reverse_exec! at ReverseDiff\src\tape.jl:93
in special_reverse_exec! at ReverseDiff\src\derivatives/linalg\arithmetic.jl:224
in A_mul_Bc! at base\linalg\matmul.jl:223
in A_mul_Bt! at base\linalg\matmul.jl:193
in gemm_wrapper! at base\linalg\matmul.jl:369
in gemm! at base\linalg\blas.jl:1027

Second question: is there some documentation for the package, at least some kind of list of available functions?
Third: some of the benchmarks do NT = Base.Threads.nthreads() and JLBackend.init(), but I can't figure out how to set more threads; it always finds only 1 available:

JLContext: Intel(R) Xeon(R) CPU E5-1660 0 @ 3.30GHz with 1 threads
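
On the threads question: the thread count is fixed when Julia starts, via the JULIA_NUM_THREADS environment variable (the same advice appears in the GPUArrays build output quoted further down). A minimal sketch, assuming JLBackend.init() simply uses whatever Base.Threads reports:

# From a shell, start Julia with threads enabled (this cannot be changed
# from inside a running session), e.g.:
#   JULIA_NUM_THREADS=8 julia -O3
# Then, in the REPL:
using GPUArrays
Base.Threads.nthreads()   # should now report 8 instead of 1
JLBackend.init()          # the JLContext should then pick up those threads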

"cublas internal error" with GPUArray, but not with CuArray

This works:

using CUDAdrv
using CUBLAS
A = CuArray(rand(Float32, 100, 100))
B = CuArray(rand(Float32, 100, 100))
C = CuArray(rand(Float32, 100, 100))
A_mul_B!(C, A, B)

but this:

using GPUArrays
A = GPUArray(rand(Float32, 100, 100))
B = GPUArray(rand(Float32, 100, 100))
C = GPUArray(rand(Float32, 100, 100))
A_mul_B!(C, A, B)

gives me:

WARNING: CUBLAS error triggered from:

Stacktrace:
 [1] statuscheck(::UInt32) at /home/dfdx/.julia/v0.6/CUBLAS/src/CUBLAS.jl:71
 [2] gemm!(::Char, ::Char, ::Float32, ::CUDAdrv.CuArray{Float32,2}, ::CUDAdrv.CuArray{Float32,2}, ::Float32, ::CUDAdrv.CuArray{Float32,2}) at /home/dfdx/.julia/v0.6/CUBLAS/src/blas.jl:915
 [3] gemm! at /home/dfdx/.julia/v0.6/GPUArrays/src/backends/blas.jl:21 [inlined]
 [4] gemm_wrapper!(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Char, ::Char, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./linalg/matmul.jl:367
 [5] A_mul_B!(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./linalg/matmul.jl:148
 [6] eval(::Module, ::Any) at ./boot.jl:235
 [7] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [8] macro expansion at ./REPL.jl:97 [inlined]
 [9] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
ERROR: "cublas internal error"
Stacktrace:
 [1] statuscheck(::UInt32) at /home/dfdx/.julia/v0.6/CUBLAS/src/CUBLAS.jl:73
 [2] gemm!(::Char, ::Char, ::Float32, ::CUDAdrv.CuArray{Float32,2}, ::CUDAdrv.CuArray{Float32,2}, ::Float32, ::CUDAdrv.CuArray{Float32,2}) at /home/dfdx/.julia/v0.6/CUBLAS/src/blas.jl:915
 [3] gemm! at /home/dfdx/.julia/v0.6/GPUArrays/src/backends/blas.jl:21 [inlined]
 [4] gemm_wrapper!(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Char, ::Char, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./linalg/matmul.jl:367
 [5] A_mul_B!(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./linalg/matmul.jl:148

GeForce GTX 960M
Ubuntu 16.04

ArrayFire backend support

Do you think it would be worthwhile to add ArrayFire.jl as a backend? How much work would it be? And would it be possible to interoperate between arrays from different backends that are backed by the same device?

cuBLAS support

I'm interested in implementing support for cuBLAS, but I haven't followed the discussions lately, so I'm a bit out of the loop and have a couple of questions.


From the README:

CUDAnative could support these easily as well, but we currently run into problems with the interactions of CUDAdrv and CUDArt

I see the relevant commented-out code, but what exactly is wrong with it?


Do we want cuBLAS to be integrated into CUDAnative.jl or just GPUArrays.jl?

Reduction/Concatenation operations are not supported

Vector reductions (sum(A,1), maximum(A,(2,3)), etc.) and concatenation operations are not currently supported, and the error message is not as informative as it could be: it just points out that indexing is not supported, but this has nothing to do with indexing.
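
Until device-side reductions exist, a possible stopgap is to reduce on the host. A minimal sketch, assuming a GPUArray can be converted back to a plain Array (the Array(A) call below is an assumption, not an API documented in these reports):

A = GPUArray(rand(Float32, 4, 4))
A_host = Array(A)      # assumption: conversion back to a host Array is available
sum(A_host, 1)         # reduce over the first dimension on the CPU (Julia 0.6 syntax)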

Thanks.

ERROR: UndefVarError: CUDAnative not defined

While testing the example on the README, I got the following error. I have CUDAnative already working and tested, so what's going on?

julia> a = GPUArray(rand(Float32, 32, 32))
ERROR: UndefVarError: CUDAnative not defined
Stacktrace:
 [1] current_device at /home/mohd/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl:59 [inlined]
 [2] current_context() at /home/mohd/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl:64
 [3] #call#5(::Array{Any,1}, ::Type{GPUArrays.GPUArray}, ::Array{Float32,2}) at /home/mohd/.julia/v0.6/GPUArrays/src/abstractarray.jl:85
 [4] GPUArrays.GPUArray(::Array{Float32,2}) at /home/mohd/.julia/v0.6/GPUArrays/src/abstractarray.jl:85

Choosing Device

Is there a way to choose the device on which arrays are allocated (and computations are performed)?

I can see what seems to be the current device

julia> CLBackend.current_context()
CLContext: AMD Radeon HD - FirePro D300 Compute Engine

I see, though, that there are several possible devices (in this case too many: there are only 2 GPUs in my machine, not 3).

julia> CLBackend.all_contexts()
3-element Array{GPUArrays.CLBackend.CLContext,1}:
 CLContext: AMD Radeon HD - FirePro D300 Compute Engine
 CLContext: AMD Radeon HD - FirePro D300 Compute Engine
 CLContext: AMD Radeon HD - FirePro D300 Compute Engine

In ArrayFire, for example, ArrayFire.AFInfo() gives you device info, and it can be used to select a device.

julia> ArrayFire.AFInfo()
ArrayFire v3.3.2 (OpenCL, 64-bit Mac OSX, build f65dd97)
[0] APPLE   : AMD Radeon HD - FirePro D300 Compute Engine, 2048 MB
-1- APPLE   : AMD Radeon HD - FirePro D300 Compute Engine, 2048 MB

What are the main ways in which GPUArrays differs from ArrayFire?
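
A device-selection API for GPUArrays itself is not shown in these reports. As a diagnostic, one can at least enumerate the OpenCL platforms and devices directly with OpenCL.jl; how (or whether) a chosen device can then be handed to CLBackend is left open here:

using OpenCL
for p in cl.platforms()
    for d in cl.devices(p)
        println(p[:name], " : ", d[:name])
    end
end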

How to initialise a GPUArray?

Hey, dumb question: how does a user initialise a GPUArray?

julia> using GPUArrays

julia> a = rand(10)
10-element Array{Float64,1}:
 0.480675  
 0.185595  
 0.668687  
 0.315772  
 0.722896  
 0.704025  
 0.443428  
 0.626663  
 0.00459752
 0.414371  

julia> GPUArrays.GPUArray(a)
ERROR: BoundsError: attempt to access 0-element Array{GPUArrays.Context,1} at index [0]
 in GPUArrays.GPUArray{T,N,B,C}(::Array{Float64,1}) at /home/ranjan/.julia/v0.6/GPUArrays/src/arrays.jl:60
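
The BoundsError on a 0-element Array{GPUArrays.Context,1} suggests that no backend context has been initialized yet. A minimal sketch, assuming an explicit backend init (as used in other reports here) is what populates that context list:

using GPUArrays
JLBackend.init()          # or CUBackend.init() / an OpenCL equivalent, as available
a = GPUArray(rand(10))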

Transpiler having trouble with broadcast

uprev = GPUArray(rand(Float32, 32, 32))
k1 = GPUArray(rand(Float32, 32, 32))
k2 = GPUArray(rand(Float32, 32, 32))
k3 = GPUArray(rand(Float32, 32, 32))
k4 = GPUArray(rand(Float32, 32, 32))
dt = 1.2f0
b1 = 1.3f0
b2 = 1.4f0
b3 = 1.5f0
b4 = 1.6f0
utilde = similar(uprev)
@. utilde = uprev + dt*(b1*k1 + b2*k2 + b3*k3 + b4*k4)
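
One way to narrow this down is to split the fused expression into smaller broadcasts and see which step the transpiler rejects. A diagnostic sketch (it allocates temporaries, so it is not a fix):

tmp1 = b1 .* k1 .+ b2 .* k2
tmp2 = b3 .* k3 .+ b4 .* k4
utilde .= uprev .+ dt .* (tmp1 .+ tmp2)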

OpenCL backend

Fails to load on OS X due to CLFFT, CLBLAS, CUDArt build errors

Hey there! I wanted to test out GPUArrays given that it's had its first 0.0.1 release (congrats!), but ran into some trouble. I tried pulling down ddfc16d on OS X on Julia 0.6 and received build errors for CLFFT, CLBLAS (OS X not officially supported), and CUDArt (@windows not defined). GPUArrays still installed; however, attempting to load it with using GPUArrays fails with: could not open file .../.julia/v0.6/GPUArrays/src/backends/supported_backends.jl. Here's my versioninfo:

julia> versioninfo()
Julia Version 0.6.0-pre.alpha.0
Commit 29bc2ac* (2017-02-28 13:15 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

I'm wondering if there's any easy way around these errors, or if I'll need to manually build CLFFT and CLBLAS (as their errors point me to do on OS X), and somehow manually patch or remove the CUDArt dependency?

Regardless of these issues, you've done some amazing work so far!
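
Since the build output quoted in the next report says the :julia backend "should always work", one possible way around the CLFFT/CLBLAS/CUDArt failures is to ignore those optional packages and just rebuild GPUArrays so that supported_backends.jl gets generated. A sketch, assuming the failed optional builds don't block the GPUArrays build itself:

Pkg.build("GPUArrays")    # regenerates src/backends/supported_backends.jl
using GPUArrays           # then load with the default :julia backend
JLBackend.init()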

UndefVarError: CLTranspiler not defined

I'm trying to get GPUArrays.jl working on my new MacBook Pro (with a Radeon card) but I got an error trying to use Transpiler:

julia> Pkg.add("Transpiler")
INFO: Cloning cache of Matcha from https://github.com/SimonDanisch/Matcha.jl.git
INFO: Cloning cache of OpenCL from https://github.com/JuliaGPU/OpenCL.jl.git
INFO: Cloning cache of Sugar from https://github.com/SimonDanisch/Sugar.jl.git
INFO: Cloning cache of Transpiler from https://github.com/SimonDanisch/Transpiler.jl.git
INFO: Installing Matcha v0.0.2
INFO: Installing OpenCL v0.5.2
INFO: Installing Sugar v0.3.0
INFO: Installing Transpiler v0.1.0
INFO: Package database updated

julia> Pkg.add("GPUArrays")
INFO: Cloning cache of CLBLAS from https://github.com/JuliaGPU/CLBLAS.jl.git
INFO: Cloning cache of CLFFT from https://github.com/JuliaGPU/CLFFT.jl.git
INFO: Cloning cache of CUBLAS from https://github.com/JuliaGPU/CUBLAS.jl.git
INFO: Cloning cache of CUDAdrv from https://github.com/JuliaGPU/CUDAdrv.jl.git
INFO: Cloning cache of CUDArt from https://github.com/JuliaGPU/CUDArt.jl.git
INFO: Cloning cache of CUFFT from https://github.com/JuliaGPU/CUFFT.jl.git
INFO: Cloning cache of GPUArrays from https://github.com/JuliaGPU/GPUArrays.jl.git
INFO: Cloning cache of Primes from https://github.com/JuliaMath/Primes.jl.git
INFO: Installing CLBLAS v1.1.1
INFO: Installing CLFFT v0.4.0
INFO: Installing CUBLAS v0.1.0
INFO: Installing CUDAdrv v0.4.2
INFO: Installing CUDArt v0.3.1
INFO: Installing CUFFT v0.0.4
INFO: Installing GPUArrays v0.0.2
INFO: Installing Primes v0.1.3
INFO: Building CLBLAS
===============================[ ERROR: CLBLAS ]================================

LoadError:     OSX not oficially supported.
    Find manual build instructions on: https://github.com/clMathLibraries/clBLAS/wiki/Build

while loading /Users/solver/.julia/v0.6/CLBLAS/deps/build.jl, in expression starting on line 38

================================================================================
INFO: Building CLFFT
================================[ ERROR: CLFFT ]================================

LoadError:     OSX not oficially supported.
    Find manual build instructions on: https://github.com/clMathLibraries/clBLAS/wiki/Build

while loading /Users/solver/.julia/v0.6/CLFFT/deps/build.jl, in expression starting on line 34

================================================================================
INFO: Building CUDAdrv
===============================[ ERROR: CUDAdrv ]===============================

LoadError: Could not find the CUDA driver library (specify the path to libcuda using the CUDA_DRIVER environment variable).
while loading /Users/solver/.julia/v0.6/CUDAdrv/deps/build.jl, in expression starting on line 119

================================================================================
INFO: Building CUDArt
===============================[ ERROR: CUDArt ]================================

LoadError: CUDA runtime library cannot be found.
while loading /Users/solver/.julia/v0.6/CUDArt/deps/build.jl, in expression starting on line 380

================================================================================
INFO: Building GPUArrays
INFO: This process will figure out which acceleration Packages you have installed
and therefore which backends GPUArrays can offer.
Theoretically available:
:cudanative, :julia, :opencl

:julia is the default backend, which should always work.
Just start Julia with:
`JULIA_NUM_THREADS=8 julia -O3` to get it some threads.
8 is just an example and should be chosen depending on the processor you have.
`-O3` is completely optional, but when you're already fishing for multhithreaded
acceleration, you might as well want optimization level 3!
In the future, OpenCL, CUDA and OpenGL will be added as another backend.
INFO: Not installing CUDAnative backend. If you've installed CUDAnative.jl not in the
default location, consider building GPUArrays like this:

ENV[CUDANATIVE_PATH] = "path/to/CUDAnative/"
Pkg.build("GPUArrays")

If not installed, you can get CUDAnative like this:

Install CUDA runtime
Build Julia from the branch: tb/cuda.
Then:
Pkg.clone("https://github.com/JuliaGPU/CUDAnative.jl.git") #
Pkg.test("CUDAnative")
Pkg.checkout("CUDAdrv")
Pkg.checkout("LLVM")

INFO: julia added as a backend.
INFO: CUDAnative doesn't seem to be usable and it won't be installed as a backend. Error: ArgumentError("Module CUDAnative not found in current path.\nRun `Pkg.add(\"CUDAnative\")` to install the CUDAnative package.")
INFO: If error fixed, try Pkg.build("GPUArrays") again!
INFO: OpenCL added as backend!
INFO: import of CLBLAS did not work, not added

WARNING: deprecated syntax "typealias clfftNumber Union{Float64,Float32,Complex128,Complex64}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:55.
Use "const clfftNumber = Union{Float64,Float32,Complex128,Complex64}" instead.

WARNING: deprecated syntax "typealias clfftReal Union{Float64,Float32}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:56.
Use "const clfftReal = Union{Float64,Float32}" instead.

WARNING: deprecated syntax "typealias clfftComplex Union{Complex128,Complex64}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:57.
Use "const clfftComplex = Union{Complex128,Complex64}" instead.

WARNING: deprecated syntax "typealias clfftDouble Union{Float64,Complex128}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:58.
Use "const clfftDouble = Union{Float64,Complex128}" instead.

WARNING: deprecated syntax "typealias clfftSingle Union{Float32,Complex64}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:59.
Use "const clfftSingle = Union{Float32,Complex64}" instead.

WARNING: deprecated syntax "typealias clfftTypeDouble Union{Type{Float64},Type{Complex128}}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:60.
Use "const clfftTypeDouble = Union{Type{Float64},Type{Complex128}}" instead.

WARNING: deprecated syntax "typealias clfftTypeSingle Union{Type{Float32},Type{Complex64}}" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:61.
Use "const clfftTypeSingle = Union{Type{Float32},Type{Complex64}}" instead.

WARNING: deprecated syntax "typealias PlanHandle Csize_t" at /Users/solver/.julia/v0.6/CLFFT/src/CLFFT.jl:63.
Use "const PlanHandle = Csize_t" instead.

WARNING: deprecated syntax "typealias PlanHandle Csize_t" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:33.
Use "const PlanHandle = Csize_t" instead.

WARNING: deprecated syntax "typealias Callback Ptr{Void}" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:34.
Use "const Callback = Ptr{Void}" instead.

WARNING: deprecated syntax "typealias UserData Ptr{Void}" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:35.
Use "const UserData = Ptr{Void}" instead.

WARNING: deprecated syntax "typealias Precision Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:37.
Use "const Precision = Cint" instead.

WARNING: deprecated syntax "typealias Dim Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:38.
Use "const Dim = Cint" instead.

WARNING: deprecated syntax "typealias Direction Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:39.
Use "const Direction = Cint" instead.

WARNING: deprecated syntax "typealias Layout Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:40.
Use "const Layout = Cint" instead.

WARNING: deprecated syntax "typealias ResultLocation Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:41.
Use "const ResultLocation = Cint" instead.

WARNING: deprecated syntax "typealias ResultTransposed Cint" at /Users/solver/.julia/v0.6/CLFFT/src/api.jl:42.
Use "const ResultTransposed = Cint" instead.
INFO: import of CLFFT did not work, not added

================================[ BUILD ERRORS ]================================

WARNING: CUDArt, CUDAdrv, CLBLAS and CLFFT had build errors.

 - packages with build errors remain installed in /Users/solver/.julia/v0.6
 - build the package(s) and all dependencies with `Pkg.build("CUDArt", "CUDAdrv", "CLBLAS", "CLFFT")`
 - build a single package by running its `deps/build.jl` script

================================================================================
INFO: Package database updated

julia> using GPUArrays
INFO: Precompiling module GPUArrays.
ERROR: LoadError: LoadError: LoadError: LoadError: UndefVarError: CLTranspiler not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:569
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:569
 [6] include(::String) at ./sysimg.jl:14
 [7] include_from_node1(::String) at ./loading.jl:569
 [8] include(::String) at ./sysimg.jl:14
 [9] anonymous at ./<missing>:2
while loading /Users/solver/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl, in expression starting on line 17
while loading /Users/solver/.julia/v0.6/GPUArrays/src/backends/supported_backends.jl, in expression starting on line 9
while loading /Users/solver/.julia/v0.6/GPUArrays/src/backends/backends.jl, in expression starting on line 52
while loading /Users/solver/.julia/v0.6/GPUArrays/src/GPUArrays.jl, in expression starting on line 10
ERROR: Failed to precompile GPUArrays to /Users/solver/.julia/lib/v0.6/GPUArrays.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490
 [3] require(::Symbol) at ./loading.jl:398
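
CLTranspiler presumably comes from the Transpiler package installed just above, so a version mismatch between GPUArrays v0.0.2 and Transpiler v0.1.0 is one plausible cause. A hedged sketch of the usual refresh on Julia 0.6 (whether this actually resolves it is untested here):

Pkg.update()
Pkg.build("GPUArrays")
# restart Julia, then try using GPUArrays again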

Failing to compile kernels in OpenCL

On an Intel HD GPU with Beignet drivers.

julia> v=GPUArray(rand(2,2))
GPUArray with ctx: CLContext: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2: 
2×2 Array{Float64,2}:
 0.841527  0.436476 
 0.23317   0.0211867

julia> v[1,1]
Couldn't compile kernel: 
    1   : // dependencies
    2   : // (GPUArrays.linear_index, Tuple{Transpiler.CLIntrinsics.CLArray{Float64,2},Float32})
    3   : uint linear_index_1(__global float * restrict  x2unused2, float state)
    4   : {
    5   :     return (get_global_id)(0) + (uint){1};
    6   : }
    7   : // Type{UInt32}
    8   : typedef int Type3UInt324; // placeholder type instance
    9   : __constant Type3UInt324 TYP_INST_Type3UInt324;
    10  : 
    11  : // (GPUArrays._ind2sub, Tuple{Tuple{UInt32},UInt32})
    12  : uint _ind2sub_5(uint indslast, uint ind)
    13  : {
    14  :     return (uint){ind + (uint){1}};
    15  : }
    16  : // (Base.argtail, Tuple{UInt32,UInt32})
    17  : uint argtail_6(uint x, uint rest)
    18  : {
    19  :     return rest;
    20  : }
    21  : // (Base.tail, Tuple{Tuple{UInt32,UInt32}})
    22  : uint tail_7(uint2 x)
    23  : {
    24  :     uint2 x22_apply_tmp2659;
    25  :     x22_apply_tmp2659 = x;
    26  :     return (argtail_6)(x22_apply_tmp2659.s0, (uint){x22_apply_tmp2659.s1});
    27  : }
    28  : // (GPUArrays._ind2sub, Tuple{Tuple{UInt32,UInt32},UInt32})
    29  : uint2 _ind2sub_2(uint2 inds, uint ind)
    30  : {
    31  :     uint l;
    32  :     uint f;
    33  :     uint indnext;
    34  :     uint r1;
    35  :     r1 = inds.s0;
    36  :     indnext = ind / r1;
    37  :     f = (uint){1};
    38  :     l = r1;
    39  :     uint x22_apply_tmp2658;
    40  :     x22_apply_tmp2658 = (_ind2sub_5)((tail_7)(inds), indnext);
    41  :     return (uint2){(ind - l * indnext) + f, x22_apply_tmp2658};
    42  : }
    43  : // (GPUArrays.gpu_ind2sub, Tuple{Tuple{UInt32,UInt32},UInt32})
    44  : uint2 gpu_ind2sub_2(uint2 dims, uint ind)
    45  : {
    46  :     return (_ind2sub_2)((uint2){dims}, ind - (uint){1});
    47  : }
    48  : // Tuple{}
    49  : typedef int Tuple_; // empty type emitted as an int
    50  : // (Base.argtail, Tuple{UInt32})
    51  : Tuple_ argtail_8(uint x, Tuple_ rest)
    52  : {
    53  :     return rest;
    54  : }
    55  : // (Base.tail, Tuple{Tuple{UInt32}})
    56  : Tuple_ tail_9(uint x)
    57  : {
    58  :     uint x22_apply_tmp2663;
    59  :     x22_apply_tmp2663 = x;
    60  :     return (argtail_8)(x22_apply_tmp2663, (Tuple_){0.0f});
    61  : }
    62  : // (GPUArrays._sub2ind, Tuple{Tuple{},UInt32,UInt32})
    63  : uint _sub2ind_10(Tuple_ x, uint L, uint ind)
    64  : {
    65  :     return ind;
    66  : }
    67  : // (GPUArrays._sub2ind, Tuple{Tuple{UInt32},UInt32,UInt32,UInt32})
    68  : uint _sub2ind_11(uint inds, uint L, uint ind, uint i, Tuple_ I)
    69  : {
    70  :     uint r1;
    71  :     r1 = inds;
    72  :     Tuple_ x22_apply_tmp2662;
    73  :     x22_apply_tmp2662 = I;
    74  :     return (_sub2ind_10)((tail_9)(inds), L * r1, ind + (i - (uint){1}) * L);
    75  : }
    76  : // (GPUArrays._sub2ind, Tuple{Tuple{UInt32,UInt32},UInt32,UInt32,UInt32,UInt32})
    77  : uint _sub2ind_12(uint2 inds, uint L, uint ind, uint i, uint I)
    78  : {
    79  :     uint r1;
    80  :     r1 = inds.s0;
    81  :     uint x22_apply_tmp2661;
    82  :     x22_apply_tmp2661 = I;
    83  :     return (_sub2ind_11)((tail_7)(inds), L * r1, ind + (i - (uint){1}) * L, x22_apply_tmp2661, (Tuple_){0.0f});
    84  : }
    85  : // (GPUArrays.gpu_sub2ind, Tuple{Tuple{UInt32,UInt32},Tuple{UInt32,UInt32}})
    86  : uint gpu_sub2ind_3(uint2 dims, uint2 I)
    87  : {
    88  :     uint2 x22_apply_tmp2660;
    89  :     x22_apply_tmp2660 = I;
    90  :     return (_sub2ind_12)((uint2){dims}, (uint){1}, (uint){1}, x22_apply_tmp2660.s0, (uint){x22_apply_tmp2660.s1});
    91  : }
    92  : // ########################
    93  : // Main inner function
    94  : // (GPUArrays.copy_kernel!, (Float32, Transpiler.CLIntrinsics.CLArray{Float64,2}, Tuple{UInt32,UInt32}, Transpiler.CLIntrinsics.CLArray{Float64,2}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, UInt32))
    95  : __kernel void copy_kernel1_4(float state, __global float * restrict  dest, uint2 dest_offsets, __global float * restrict  src, uint2 src_offsets, uint2 shape, uint2 shape_dest, uint2 shape_source, uint length)
    96  : {
    97  :     uint src_idx;
    98  :     uint dest_idx;
    99  :     uint2 idx;
    100 :     uint i;
    101 :     i = (linear_index_1)(dest, state);
    102 :     if(i <= length){
    103 :         idx = (gpu_ind2sub_2)(shape, i);
    104 :         dest_idx = (gpu_sub2ind_3)(shape_dest, idx + dest_offsets);
    105 :         src_idx = (gpu_sub2ind_3)(shape_source, idx + src_offsets);
    106 :         ;
    107 :         float _ssavalue_0;
    108 :         _ssavalue_0 = (src)[src_idx - 0x00000001];
    109 :         (dest)[dest_idx - 0x00000001] = _ssavalue_0;
    110 :         ;
    111 :     };
    112 :     ;
    113 : }
    114 : 
With the following build error:
stringInput.cl:5:13: error: taking address of function is not allowed
stringInput.cl:9:25: error: variable in constant address space must be initialized
stringInput.cl:26:13: error: taking address of function is not allowed
stringInput.cl:40:26: error: taking address of function is not allowed
stringInput.cl:40:39: error: taking address of function is not allowed
stringInput.cl:46:13: error: taking address of function is not allowed
stringInput.cl:60:13: error: taking address of function is not allowed
stringInput.cl:74:13: error: taking address of function is not allowed
stringInput.cl:74:27: error: taking address of function is not allowed
stringInput.cl:83:13: error: taking address of function is not allowed
stringInput.cl:83:27: error: taking address of function is not allowed
stringInput.cl:90:13: error: taking address of function is not allowed
stringInput.cl:101:10: error: taking address of function is not allowed
stringInput.cl:103:16: error: taking address of function is not allowed
stringInput.cl:104:21: error: taking address of function is not allowed
stringInput.cl:105:20: error: taking address of function is not allowed

ERROR: CLError(code=-11, CL_BUILD_PROGRAM_FAILURE)
Stacktrace:
 [1] macro expansion at /home/carlo/.julia/v0.6/OpenCL/src/macros.jl:6 [inlined]
 [2] #build!#113(::String, ::Bool, ::Function, ::OpenCL.cl.Program) at /home/carlo/.julia/v0.6/OpenCL/src/program.jl:101
 [3] (::OpenCL.cl.#kw##build!)(::Array{Any,1}, ::OpenCL.cl.#build!, ::OpenCL.cl.Program) at ./<missing>:0
 [4] (::Transpiler.##41#42{Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32},GPUArrays.#copy_kernel!,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{9,DataType}})() at /home/carlo/.julia/v0.6/Transpiler/src/clike/opencl/compilation.jl:96
 [5] get!(::Transpiler.##41#42{Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32},GPUArrays.#copy_kernel!,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{9,DataType}}, ::Dict{Any,Transpiler.CLFunction}, ::Tuple{GPUArrays.#copy_kernel!,NTuple{9,DataType}}) at ./dict.jl:449
 [6] Transpiler.CLFunction(::Function, ::Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32}, ::OpenCL.cl.CmdQueue) at /home/carlo/.julia/v0.6/Transpiler/src/clike/opencl/compilation.jl:77
 [7] gpu_call(::Function, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Tuple{GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32}, ::Int64, ::Void) at /home/carlo/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl:220
 [8] copy!(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:329
 [9] copy!(::Array{Float64,2}, ::CartesianRange{CartesianIndex{2}}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:360
 [10] getindex(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Int64, ::Int64) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:401
 [11] macro expansion at ./REPL.jl:97 [inlined]
 [12] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

and analogously

julia> v * v
Couldn't compile kernel: 
    1   : // dependencies
    2   : // (GPUArrays.linear_index, Tuple{Transpiler.CLIntrinsics.CLArray{Float64,2},Float32})
    3   : uint linear_index_1(__global float * restrict  x2unused2, float state)
    4   : {
    5   :     return (get_global_id)(0) + (uint){1};
    6   : }
    7   : // Type{UInt32}
    8   : typedef int Type3UInt324; // placeholder type instance
    9   : __constant Type3UInt324 TYP_INST_Type3UInt324;
    10  : 
    11  : // (GPUArrays._ind2sub, Tuple{Tuple{UInt32},UInt32})
    12  : uint _ind2sub_5(uint indslast, uint ind)
    13  : {
    14  :     return (uint){ind + (uint){1}};
    15  : }
    16  : // (Base.argtail, Tuple{UInt32,UInt32})
    17  : uint argtail_6(uint x, uint rest)
    18  : {
    19  :     return rest;
    20  : }
    21  : // (Base.tail, Tuple{Tuple{UInt32,UInt32}})
    22  : uint tail_7(uint2 x)
    23  : {
    24  :     uint2 x22_apply_tmp2671;
    25  :     x22_apply_tmp2671 = x;
    26  :     return (argtail_6)(x22_apply_tmp2671.s0, (uint){x22_apply_tmp2671.s1});
    27  : }
    28  : // (GPUArrays._ind2sub, Tuple{Tuple{UInt32,UInt32},UInt32})
    29  : uint2 _ind2sub_2(uint2 inds, uint ind)
    30  : {
    31  :     uint l;
    32  :     uint f;
    33  :     uint indnext;
    34  :     uint r1;
    35  :     r1 = inds.s0;
    36  :     indnext = ind / r1;
    37  :     f = (uint){1};
    38  :     l = r1;
    39  :     uint x22_apply_tmp2670;
    40  :     x22_apply_tmp2670 = (_ind2sub_5)((tail_7)(inds), indnext);
    41  :     return (uint2){(ind - l * indnext) + f, x22_apply_tmp2670};
    42  : }
    43  : // (GPUArrays.gpu_ind2sub, Tuple{Tuple{UInt32,UInt32},UInt32})
    44  : uint2 gpu_ind2sub_2(uint2 dims, uint ind)
    45  : {
    46  :     return (_ind2sub_2)((uint2){dims}, ind - (uint){1});
    47  : }
    48  : // Tuple{}
    49  : typedef int Tuple_; // empty type emitted as an int
    50  : // (Base.argtail, Tuple{UInt32})
    51  : Tuple_ argtail_8(uint x, Tuple_ rest)
    52  : {
    53  :     return rest;
    54  : }
    55  : // (Base.tail, Tuple{Tuple{UInt32}})
    56  : Tuple_ tail_9(uint x)
    57  : {
    58  :     uint x22_apply_tmp2675;
    59  :     x22_apply_tmp2675 = x;
    60  :     return (argtail_8)(x22_apply_tmp2675, (Tuple_){0.0f});
    61  : }
    62  : // (GPUArrays._sub2ind, Tuple{Tuple{},UInt32,UInt32})
    63  : uint _sub2ind_10(Tuple_ x, uint L, uint ind)
    64  : {
    65  :     return ind;
    66  : }
    67  : // (GPUArrays._sub2ind, Tuple{Tuple{UInt32},UInt32,UInt32,UInt32})
    68  : uint _sub2ind_11(uint inds, uint L, uint ind, uint i, Tuple_ I)
    69  : {
    70  :     uint r1;
    71  :     r1 = inds;
    72  :     Tuple_ x22_apply_tmp2674;
    73  :     x22_apply_tmp2674 = I;
    74  :     return (_sub2ind_10)((tail_9)(inds), L * r1, ind + (i - (uint){1}) * L);
    75  : }
    76  : // (GPUArrays._sub2ind, Tuple{Tuple{UInt32,UInt32},UInt32,UInt32,UInt32,UInt32})
    77  : uint _sub2ind_12(uint2 inds, uint L, uint ind, uint i, uint I)
    78  : {
    79  :     uint r1;
    80  :     r1 = inds.s0;
    81  :     uint x22_apply_tmp2673;
    82  :     x22_apply_tmp2673 = I;
    83  :     return (_sub2ind_11)((tail_7)(inds), L * r1, ind + (i - (uint){1}) * L, x22_apply_tmp2673, (Tuple_){0.0f});
    84  : }
    85  : // (GPUArrays.gpu_sub2ind, Tuple{Tuple{UInt32,UInt32},Tuple{UInt32,UInt32}})
    86  : uint gpu_sub2ind_3(uint2 dims, uint2 I)
    87  : {
    88  :     uint2 x22_apply_tmp2672;
    89  :     x22_apply_tmp2672 = I;
    90  :     return (_sub2ind_12)((uint2){dims}, (uint){1}, (uint){1}, x22_apply_tmp2672.s0, (uint){x22_apply_tmp2672.s1});
    91  : }
    92  : // ########################
    93  : // Main inner function
    94  : // (GPUArrays.copy_kernel!, (Float32, Transpiler.CLIntrinsics.CLArray{Float64,2}, Tuple{UInt32,UInt32}, Transpiler.CLIntrinsics.CLArray{Float64,2}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, Tuple{UInt32,UInt32}, UInt32))
    95  : __kernel void copy_kernel1_4(float state, __global float * restrict  dest, uint2 dest_offsets, __global float * restrict  src, uint2 src_offsets, uint2 shape, uint2 shape_dest, uint2 shape_source, uint length)
    96  : {
    97  :     uint src_idx;
    98  :     uint dest_idx;
    99  :     uint2 idx;
    100 :     uint i;
    101 :     i = (linear_index_1)(dest, state);
    102 :     if(i <= length){
    103 :         idx = (gpu_ind2sub_2)(shape, i);
    104 :         dest_idx = (gpu_sub2ind_3)(shape_dest, idx + dest_offsets);
    105 :         src_idx = (gpu_sub2ind_3)(shape_source, idx + src_offsets);
    106 :         ;
    107 :         float _ssavalue_0;
    108 :         _ssavalue_0 = (src)[src_idx - 0x00000001];
    109 :         (dest)[dest_idx - 0x00000001] = _ssavalue_0;
    110 :         ;
    111 :     };
    112 :     ;
    113 : }
    114 : 
With the following build error:
stringInput.cl:5:13: error: taking address of function is not allowed
stringInput.cl:9:25: error: variable in constant address space must be initialized
stringInput.cl:26:13: error: taking address of function is not allowed
stringInput.cl:40:26: error: taking address of function is not allowed
stringInput.cl:40:39: error: taking address of function is not allowed
stringInput.cl:46:13: error: taking address of function is not allowed
stringInput.cl:60:13: error: taking address of function is not allowed
stringInput.cl:74:13: error: taking address of function is not allowed
stringInput.cl:74:27: error: taking address of function is not allowed
stringInput.cl:83:13: error: taking address of function is not allowed
stringInput.cl:83:27: error: taking address of function is not allowed
stringInput.cl:90:13: error: taking address of function is not allowed
stringInput.cl:101:10: error: taking address of function is not allowed
stringInput.cl:103:16: error: taking address of function is not allowed
stringInput.cl:104:21: error: taking address of function is not allowed
stringInput.cl:105:20: error: taking address of function is not allowed

ERROR: CLError(code=-11, CL_BUILD_PROGRAM_FAILURE)
Stacktrace:
 [1] macro expansion at /home/carlo/.julia/v0.6/OpenCL/src/macros.jl:6 [inlined]
 [2] #build!#113(::String, ::Bool, ::Function, ::OpenCL.cl.Program) at /home/carlo/.julia/v0.6/OpenCL/src/program.jl:101
 [3] (::OpenCL.cl.#kw##build!)(::Array{Any,1}, ::OpenCL.cl.#build!, ::OpenCL.cl.Program) at ./<missing>:0
 [4] (::Transpiler.##41#42{Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32},GPUArrays.#copy_kernel!,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{9,DataType}})() at /home/carlo/.julia/v0.6/Transpiler/src/clike/opencl/compilation.jl:96
 [5] get!(::Transpiler.##41#42{Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32},GPUArrays.#copy_kernel!,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{9,DataType}}, ::Dict{Any,Transpiler.CLFunction}, ::Tuple{GPUArrays.#copy_kernel!,NTuple{9,DataType}}) at ./dict.jl:449
 [6] Transpiler.CLFunction(::Function, ::Tuple{Float32,GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32}, ::OpenCL.cl.CmdQueue) at /home/carlo/.julia/v0.6/Transpiler/src/clike/opencl/compilation.jl:77
 [7] gpu_call(::Function, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Tuple{GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},Tuple{UInt32,UInt32},UInt32}, ::Int64, ::Void) at /home/carlo/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl:220
 [8] copy!(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:329
 [9] copy!(::Array{Float64,2}, ::CartesianRange{CartesianIndex{2}}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::CartesianRange{CartesianIndex{2}}) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:360
 [10] getindex(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Int64, ::Int64) at /home/carlo/.julia/v0.6/GPUArrays/src/abstractarray.jl:401
 [11] matmul2x2!(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Char, ::Char, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:670
 [12] gemm_wrapper!(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Char, ::Char, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:360
 [13] A_mul_B!(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:148
 [14] *(::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,2,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}) at ./linalg/matmul.jl:146
 [15] macro expansion at ./REPL.jl:97 [inlined]
 [16] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73

ERROR: LoadError: LoadError: LoadError: LoadError: UndefVarError: CLTranspiler not defined

I get the following error when running using GPUArrays for the first time. Any idea what is wrong?

julia> using GPUArrays
INFO: Precompiling module GPUArrays.
ERROR: LoadError: LoadError: LoadError: LoadError: UndefVarError: CLTranspiler not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:569
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:569
 [6] include(::String) at ./sysimg.jl:14
 [7] include_from_node1(::String) at ./loading.jl:569
 [8] include(::String) at ./sysimg.jl:14
 [9] anonymous at ./<missing>:2
while loading /home/mohd/.julia/v0.6/GPUArrays/src/backends/opencl/opencl.jl, in expression starting on line 17
while loading /home/mohd/.julia/v0.6/GPUArrays/src/backends/supported_backends.jl, in expression starting on line 9
while loading /home/mohd/.julia/v0.6/GPUArrays/src/backends/backends.jl, in expression starting on line 52
while loading /home/mohd/.julia/v0.6/GPUArrays/src/GPUArrays.jl, in expression starting on line 10
ERROR: Failed to precompile GPUArrays to /home/mohd/.julia/lib/v0.6/GPUArrays.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490
 [3] require(::Symbol) at ./loading.jl:398

JLArrays are created on the GPU?

Hi,

I want to take advantage of the threaded broadcast of your package but I ran into this:

using GPUArrays
a=JLArray(rand(10))

which gives:

GPU: 10-element Array{Float64,1}:
 0.709778
 0.691078
 0.974742
 0.546105
 0.962552
 0.737111
 0.297245
 0.985732
 0.495128
 0.496053
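
The "GPU:" prefix is just how these arrays print; a JLArray is the threaded CPU backend, so the data stays in host memory (its backing buffer shows up as a plain Array in the type parameters, as in the stack traces above). A small sketch of what to check, assuming broadcast is the operation of interest:

using GPUArrays
a = JLArray(rand(10))
typeof(a)      # type parameters should mention Array{Float64,1} and JLBackend.JLContext
sqrt.(a)       # the broadcast runs on the threaded CPU backend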

Cannot Compile Kernel

Tests pass for the CL context, but I am not sure where to go with this:

using GPUArrays
u0 = GPUArray(rand(Float32, 32, 32))
abs.(u0)
Couldn't compile kernel: 
    1   : // dependant type declarations
    2   : typedef struct {
    3   : float empty; // structs can't be empty
    4   : }Base21abs;
    5   : 
    6   : // dependant function declarations
    7   : float broadcast_index_2(__global float * restrict  arg, int2 shape, int i)
    8   : {
    9   :     ;
    10  :     return arg[i - 1];
    11  :     ;
    12  : }
    13  : // Main inner function
    14  : __kernel void broadcast_kernel_7(__global float * restrict  A, Base21abs f, int2 sz, __global float * restrict  arg_1)
    15  : {
    16  :     int i;
    17  :     i = get_global_id(0) + 1;
    18  :     float _ssavalue_0;
    19  :     _ssavalue_0 = abs(broadcast_index_2(arg_1, sz, i));
    20  :     A[i - 1] = _ssavalue_0;
    21  :     ;
    22  : }
    23  : 
With the following build error:
<kernel>:19:19: error: call to 'abs' is ambiguous
    _ssavalue_0 = abs(broadcast_index_2(arg_1, sz, i));
                  ^~~
cl_kernel.h:2310:24: note: candidate function
uchar __OVERLOADABLE__ abs(char);
                       ^
cl_kernel.h:2311:24: note: candidate function
uchar __OVERLOADABLE__ abs(uchar);
                       ^
cl_kernel.h:2312:25: note: candidate function
ushort __OVERLOADABLE__ abs(short);
                        ^
cl_kernel.h:2313:25: note: candidate function
ushort __OVERLOADABLE__ abs(ushort);
                        ^
cl_kernel.h:2314:23: note: candidate function
uint __OVERLOADABLE__ abs(int);
                      ^
cl_kernel.h:2315:23: note: candidate function
uint __OVERLOADABLE__ abs(uint);
                      ^
cl_kernel.h:2316:24: note: candidate function
ulong __OVERLOADABLE__ abs(long);
                       ^
cl_kernel.h:2317:24: note: candidate function
ulong __OVERLOADABLE__ abs(ulong);
                       ^
cl_kernel.h:2319:25: note: candidate function
uchar2 __OVERLOADABLE__ abs(char2);
                        ^
cl_kernel.h:2321:25: note: candidate function
uchar3 __OVERLOADABLE__ abs(char3);
                        ^
cl_kernel.h:2323:25: note: candidate function
uchar4 __OVERLOADABLE__ abs(char4);                                                              
                        ^
cl_kernel.h:2324:25: note: candidate function
uchar8 __OVERLOADABLE__ abs(char8);                                                              
                        ^
cl_kernel.h:2325:26: note: candidate function
uchar16 __OVERLOADABLE__ abs(char16);                                                            
                         ^
cl_kernel.h:2326:25: note: candidate function
uchar2 __OVERLOADABLE__ abs(uchar2);
                        ^
cl_kernel.h:2328:25: note: candidate function
uchar3 __OVERLOADABLE__ abs(uchar3);                                                             
                        ^
cl_kernel.h:2330:25: note: candidate function
uchar4 __OVERLOADABLE__ abs(uchar4);                                                             
                        ^
cl_kernel.h:2331:25: note: candidate function
uchar8 __OVERLOADABLE__ abs(uchar8);                                                             
                        ^
cl_kernel.h:2332:26: note: candidate function
uchar16 __OVERLOADABLE__ abs(uchar16);                                                           
                         ^
cl_kernel.h:2333:26: note: candidate function
ushort2 __OVERLOADABLE__ abs(short2);                                                            
                         ^
cl_kernel.h:2335:26: note: candidate function
ushort3 __OVERLOADABLE__ abs(short3);                                                            
                         ^
cl_kernel.h:2337:26: note: candidate function
ushort4 __OVERLOADABLE__ abs(short4);                                                            
                         ^
cl_kernel.h:2338:26: note: candidate function
ushort8 __OVERLOADABLE__ abs(short8);                                                            
                         ^
cl_kernel.h:2339:27: note: candidate function
ushort16 __OVERLOADABLE__ abs(short16);                                                          
                          ^
cl_kernel.h:2340:26: note: candidate function
ushort2 __OVERLOADABLE__ abs(ushort2);                                                           
                         ^
cl_kernel.h:2342:26: note: candidate function
ushort3 __OVERLOADABLE__ abs(ushort3);                                                           
                         ^
cl_kernel.h:2344:26: note: candidate function
ushort4 __OVERLOADABLE__ abs(ushort4);                                                           
                         ^
cl_kernel.h:2345:26: note: candidate function
ushort8 __OVERLOADABLE__ abs(ushort8);                                                           
                         ^
cl_kernel.h:2346:27: note: candidate function
ushort16 __OVERLOADABLE__ abs(ushort16);                                                         
                          ^
cl_kernel.h:2347:24: note: candidate function
uint2 __OVERLOADABLE__ abs(int2);
                       ^
cl_kernel.h:2349:24: note: candidate function
uint3 __OVERLOADABLE__ abs(int3);
                       ^
cl_kernel.h:2351:24: note: candidate function
uint4 __OVERLOADABLE__ abs(int4);
                       ^
cl_kernel.h:2352:24: note: candidate function
uint8 __OVERLOADABLE__ abs(int8);
                       ^
cl_kernel.h:2353:25: note: candidate function
uint16 __OVERLOADABLE__ abs(int16);
                        ^
cl_kernel.h:2354:24: note: candidate function
uint2 __OVERLOADABLE__ abs(uint2);
                       ^
cl_kernel.h:2356:24: note: candidate function
uint3 __OVERLOADABLE__ abs(uint3);
                       ^
cl_kernel.h:2358:24: note: candidate function
uint4 __OVERLOADABLE__ abs(uint4);
                       ^
cl_kernel.h:2359:24: note: candidate function
uint8 __OVERLOADABLE__ abs(uint8);
                       ^
cl_kernel.h:2360:25: note: candidate function
uint16 __OVERLOADABLE__ abs(uint16);
                        ^
cl_kernel.h:2361:25: note: candidate function
ulong2 __OVERLOADABLE__ abs(long2);
                        ^
cl_kernel.h:2363:25: note: candidate function
ulong3 __OVERLOADABLE__ abs(long3);
                        ^
cl_kernel.h:2365:25: note: candidate function
ulong4 __OVERLOADABLE__ abs(long4);
                        ^
cl_kernel.h:2366:25: note: candidate function
ulong8 __OVERLOADABLE__ abs(long8);
                        ^
cl_kernel.h:2367:26: note: candidate function
ulong16 __OVERLOADABLE__ abs(long16);
                         ^
cl_kernel.h:2368:25: note: candidate function
ulong2 __OVERLOADABLE__ abs(ulong2);
                        ^
cl_kernel.h:2370:25: note: candidate function
ulong3 __OVERLOADABLE__ abs(ulong3);
                        ^
cl_kernel.h:2372:25: note: candidate function
ulong4 __OVERLOADABLE__ abs(ulong4);
                        ^
cl_kernel.h:2373:25: note: candidate function
ulong8 __OVERLOADABLE__ abs(ulong8);
                        ^
cl_kernel.h:2374:26: note: candidate function
ulong16 __OVERLOADABLE__ abs(ulong16);
                         ^

CLError(code=-11, CL_BUILD_PROGRAM_FAILURE)
in broadcast at GPUArrays\src\abstractarray.jl:186
in acc_broadcast! at GPUArrays\src\backends\opencl\opencl.jl:168
in Type at Transpiler\src\clike/opencl\compilation.jl:79
in get! at base\dict.jl:449
in #27 at Transpiler\src\clike/opencl\compilation.jl:104
in #build! at base\<missing>
in #build!#113 at OpenCL\src\program.jl:101
in macro expansion at OpenCL\src\macros.jl:6

ERROR_ILLEGAL_ADDRESS

This was surprising. I'm guessing it's an issue with limited memory:

julia> using GPUArrays

julia> CUBackend.init()
CUContext

julia> x = GPUArray(rand(Float32, 100000));

julia> @time x .= sqrt.(x);
  9.759133 seconds (3.52 M allocations: 192.281 MiB, 1.01% gc time)

julia> @time x .= sqrt.(x);
  0.000380 seconds (36 allocations: 944 bytes)

julia> y = GPUArray(rand(Float32, 100000));

julia> @time atan2.(y,x);
  0.480490 seconds (279.18 k allocations: 15.781 MiB, 1.60% gc time)

julia> @time atan2.(y,x);
ERROR: Illegal memory access (CUDA error #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] macro expansion at /Users/solver/.julia/v0.6/CUDAdrv/src/base.jl:157 [inlined]
 [2] upload(error in running finalizer: CUDAdrv.CuError(code=700, info=Base.Nullable{String}(hasvalue=false, value=#<null>))
::CUDAdrv.DevicePtr{CUDAnative.CuDeviceArray{Float32,1}}, ::Base.RefValue{CUDAnative.CuDeviceArray{Float32,1}}, ::Int64) at /Users/solver/.julia/v0.6/CUDAdrv/src/memory.jl:48
 [3] upload(::CUDAdrv.DevicePtr{CUDAnative.CuDeviceArray{Float32,1}}, ::CUDAnative.CuDeviceArray{Float32,1}) at /Users/solver/.julia/v0.6/CUDAdrv/src/memory.jl:98
 [4] macro expansion at /Users/solver/.julia/v0.6/CUDAnative/src/execution.jl:122 [inlined]
 [5] generated_cuda at /Users/solver/.julia/v0.6/CUDAnative/src/execution.jl:185 [inlined]
 [6] call_cuda at /Users/solver/.julia/v0.6/GPUArrays/src/backends/cudanative/cudanative.jl:102 [inlined]
 [7] acc_broadcast!(::Base.Math.#atan2, ::GPUArrays.GPUArray{Float32,1,CUDAdrv.CuArray{Float32,1},GPUArrays.CUBackend.CUContext}, ::Tuple{GPUArrays.GPUArray{Float32,1,CUDAdrv.CuArray{Float32,1},GPUArrays.CUBackend.CUContext},GPUArrays.GPUArray{Float32,1,CUDAdrv.CuArray{Float32,1},GPUArrays.CUBackend.CUContext}}) at /Users/solver/.julia/v0.6/GPUArrays/src/backends/cudanative/cudanative.jl:227
 [8] broadcast(::Function, ::GPUArrays.GPUArray{Float32,1,CUDAdrv.CuArray{Float32,1},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,1,CUDAdrv.CuArray{Float32,1},GPUArrays.CUBackend.CUContext}) at /Users/solver/.julia/v0.6/GPUArrays/src/abstractarray.jl:196

Can't create a GPUArray

julia> using GPUArrays

julia> GPUArray{Float32}(10,10)
ERROR: MethodError: no method matching similar(::Type{GPUArrays.GPUArray{Float32,N} where N}, ::Type{Float32}, ::Tuple{Int64,Int64})
Closest candidates are:
  similar(::Array, ::Type, ::Tuple{Vararg{Int64,N}}) where N at array.jl:194
  similar(::SubArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at subarray.jl:58
  similar(::Base.ReshapedArray, ::Type, ::Tuple{Vararg{Int64,N}} where N) at reshapedarray.jl:167
  ...
Stacktrace:
 [1] GPUArrays.GPUArray{Float32,N} where N(::Int64, ::Int64) at /Users/solver/.julia/v0.6/GPUArrays/src/construction.jl:37

julia> versioninfo()
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.7.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)
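
The sized constructor GPUArray{Float32}(10, 10) runs into a missing similar method; the form used throughout the other reports here, constructing from an existing host array, does work. A minimal sketch:

using GPUArrays
a = GPUArray(rand(Float32, 10, 10))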

Transpose error

I got the following error when I tried simply transposing a GPUArray. I was using the most recent GPUArrays and CUDAnative, Julia 0.6, and macOS 10.12.6.

julia> b = z';
ERROR: MethodError: no method matching copy!(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::CartesianRange{CartesianIndex{2}}, ::Array{Float64,1}, ::CartesianRange{CartesianIndex{1}})
Closest candidates are:
copy!(::GPUArrays.AbstractAccArray{T,N} where N, ::CartesianRange{CartesianIndex{1}}, ::AbstractArray{T,N} where N, ::CartesianRange{CartesianIndex{1}}) where T at /Users/shhong/.julia/v0.6/GPUArrays/src/abstractarray.jl:271
copy!(::GPUArrays.AbstractAccArray, ::Integer, ::AbstractArray, ::Integer, ::Integer) at /Users/shhong/.julia/v0.6/GPUArrays/src/abstractarray.jl:254
copy!(::AbstractArray, ::Integer, ::AbstractArray) at abstractarray.jl:672
...
Stacktrace:
[1] setindex!(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Float64, ::Int64, ::Int64) at /Users/shhong/.julia/v0.6/GPUArrays/src/abstractarray.jl:382
[2] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:68
[3] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:74
[4] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:78
[5] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:74
[6] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:78
[7] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:74
[8] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:78
[9] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:74
[10] transposeblock!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::Int64, ::Int64, ::Int64) at ./linalg/transpose.jl:78
[11] transpose_f!(::Base.#transpose, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./linalg/transpose.jl:59
[12] transpose!(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./linalg/transpose.jl:16
[13] transpose(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./linalg/transpose.jl:121
[14] ctranspose(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./linalg/transpose.jl:130
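
A host-side workaround, as a sketch: since the failure comes from scalar setindex! inside the generic transpose, transposing on the CPU and re-uploading sidesteps it (assuming z fits in host memory).

# transpose on the CPU, then copy the result back to the GPU
b = GPUArray(Array(z)')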

LLVM error on A .^ n

The following code:

A = GPUArray(randn(Float32, 3, 2))
n = 2
A .^  n

gives an error:

ERROR: LLVM error: Cannot select: 0xf179100: f32 = fpow 0xf178f40, 0xf179090
  0xf178f40: f32,ch = load<LD4[%30](tbaa=<0xd4c1048>)> 0xf176c80, 0xf177930, undef:i64
    0xf177930: i64 = add 0xf1777e0, Constant:i64<-4>
      0xf1777e0: i64 = add 0xf174cd0, 0xf1792c0
        0xf174cd0: i64,ch = CopyFromReg 0xeb554a0:1, Register:i64 %vreg15
          0xeb55740: i64 = Register %vreg15
        0xf1792c0: i64 = NVPTXISD::MUL_WIDE_UNSIGNED 0xf1773f0, Constant:i32<4>
          0xf1773f0: i32 = NVPTXISD::IMAD 0xf1772a0, 0xeb557b0, 0xeb54c50
            0xf1772a0: i32 = add 0xf1771c0, Constant:i32<-1>
              0xf1771c0: i32 = select 0xf176f20, 0xeb555f0, 0xf178d10
                0xf176f20: i1 = xor 0xf177310, Constant:i1<-1>
                  0xf177310: i1 = truncate 0xf1751a0

                  0xf176eb0: i1 = Constant<-1>
                0xeb555f0: i32,ch = CopyFromReg 0xeb54fd0:1, Register:i32 %vreg10
                  0xeb55190: i32 = Register %vreg10
                0xf178d10: i32,ch,glue = NVPTXISD::LoadParam<LDST4[<unknown>]> 0xf1775b0:1, Constant:i32<1>, Constant:i32<4>, 0xf1775b0:2
                  0xf176e40: i32 = Constant<1>
                  0xf175a60: i32 = Constant<4>
                  0xf1775b0: i32,ch,glue = NVPTXISD::LoadParam<LDST4[<unknown>]> 0xf177690, Constant:i32<1>, Constant:i32<0>, 0xf177690:1



              0xf177230: i32 = Constant<-1>
            0xeb557b0: i32,ch = CopyFromReg 0xeb59f50, Register:i32 %vreg1
              0xeb55820: i32 = Register %vreg1
            0xeb54c50: i32 = select 0xeb54b70, 0xeb54fd0, 0xf1775b0
              0xeb54b70: i1 = xor 0xf175600, Constant:i1<-1>
                0xf175600: i1 = truncate 0xf174f00
                  0xf174f00: i32,ch,glue = NVPTXISD::LoadParam<LDST4[<unknown>](align=1)> 0xf174fe0, Constant:i32<1>, Constant:i32<0>, 0xf174fe0:1



                0xf176eb0: i1 = Constant<-1>
              0xeb54fd0: i32,ch = CopyFromReg 0xeb59f50, Register:i32 %vreg9
                0xf177540: i32 = Register %vreg9
              0xf1775b0: i32,ch,glue = NVPTXISD::LoadParam<LDST4[<unknown>]> 0xf177690, Constant:i32<1>, Constant:i32<0>, 0xf177690:1
                0xf176e40: i32 = Constant<1>
                0xf175130: i32 = Constant<0>
                0xf177690: ch,glue = NVPTXISD::CallArgEnd 0xf177770, Constant:i32<1>, 0xf177770:1
                  0xf176e40: i32 = Constant<1>
                  0xf177770: ch,glue = NVPTXISD::LastCallArg 0xf1756e0, Constant:i32<1>, Constant:i32<1>, 0xf1756e0:1



          0xf175a60: i32 = Constant<4>
      0xf178c30: i64 = Constant<-4>
    0xf176dd0: i64 = undef
  0xf179090: f32 = sint_to_fp 0xf179020
    0xf179020: i64,ch = CopyFromReg 0xeb59f50, Register:i64 %vreg16
      0xf178fb0: i64 = Register %vreg16
In function: ptxcall_broadcast_kernel__61772
Stacktrace:
 [1] handle_error(::Cstring) at /home/dfdx/.julia/v0.6/LLVM/src/core/context.jl:96
 [2] macro expansion at /home/dfdx/.julia/v0.6/LLVM/src/util/logging.jl:102 [inlined]
 [3] macro expansion at /home/dfdx/.julia/v0.6/LLVM/src/base.jl:20 [inlined]
 [4] LLVMTargetMachineEmitToMemoryBuffer(::Ptr{LLVM.API.LLVMOpaqueTargetMachine}, ::Ptr{LLVM.API.LLVMOpaqueModule}, ::UInt32, ::Base.RefValue{Cstring}, ::Base.RefValue{Ptr{LLVM.API.LLVMOpaqueMemoryBuffer}}) at /home/dfdx/.julia/v0.6/LLVM/src/../lib/3.9/libLLVM_h.jl:301
 [5] emit(::LLVM.TargetMachine, ::LLVM.Module, ::UInt32) at /home/dfdx/.julia/v0.6/LLVM/src/targetmachine.jl:39
 [6] #mcgen#46(::Bool, ::Function, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:296
 [7] (::CUDAnative.#kw##mcgen)(::Array{Any,1}, ::CUDAnative.#mcgen, ::LLVM.Module, ::LLVM.Function, ::VersionNumber) at ./<missing>:0
 [8] #compile_function#47(::Bool, ::Function, ::Any, ::Any, ::VersionNumber) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:319
 [9] cufunction(::CUDAdrv.CuDevice, ::Any, ::Any) at /home/dfdx/.julia/v0.6/CUDAnative/src/jit.jl:356
 [10] macro expansion at /home/dfdx/.julia/v0.6/CUDAnative/src/execution.jl:106 [inlined]
 [11] _cuda(::Tuple{Int64,Int64}, ::Int64, ::CUDAdrv.CuStream, ::GPUArrays.#broadcast_kernel!, ::Float32, ::Base.#^, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Tuple{UInt32,UInt32}, ::UInt32, ::Tuple{GPUArrays.BroadcastDescriptorN{Array,2},GPUArrays.BroadcastDescriptorN{Any,0}}, ::CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::Int64) at /home/dfdx/.julia/v0.6/CUDAnative/src/execution.jl:79
 [12] gpu_call(::Function, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Tuple{Base.#^,GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext},Tuple{UInt32,UInt32},UInt32,Tuple{GPUArrays.BroadcastDescriptorN{Array,2},GPUArrays.BroadcastDescriptorN{Any,0}},GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext},Int64}, ::Int64, ::Void) at /home/dfdx/.julia/v0.6/GPUArrays/src/backends/cudanative/cudanative.jl:194
 [13] _broadcast!(::Function, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Tuple{Tuple{Bool,Bool},Tuple{}}, ::Tuple{Tuple{Int64,Int64},Tuple{}}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Tuple{Int64}, ::Type{Val{1}}, ::CartesianRange{CartesianIndex{2}}) at /home/dfdx/.julia/v0.6/GPUArrays/src/broadcast.jl:66
 [14] broadcast_t(::Function, ::Type{Float32}, ::Tuple{Base.OneTo{Int64},Base.OneTo{Int64}}, ::CartesianRange{CartesianIndex{2}}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Int64) at /home/dfdx/.julia/v0.6/GPUArrays/src/broadcast.jl:33
 [15] broadcast_c at ./broadcast.jl:314 [inlined]
 [16] broadcast(::Function, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Int64) at ./broadcast.jl:434

Note that using a constant power, e.g. A .^ 2, works fine.


Version info:

Julia Version 0.6.0
Commit 9036443 (2017-06-19 13:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

GPU: GeForce GTX 960M
CUDA 8.0

Initializing Second Device

When initializing a backend such as OpenCL, it selects the first device by default, as seen in lines 29-38 of opencl.jl:
https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/backends/opencl/opencl.jl

dev = if isempty(devlist)
    devlist = cl.devices(:cpu)
    if isempty(devlist)
        error("no device found to be supporting opencl")
    else
        first(devlist)
    end
else
    first(devlist)
end

Can we add functionality to select from multiple supported devices when initializing? For example, most laptops with discrete graphics also have integrated graphics as an option. For these computers the first device is the integrated chip, which is normally not what you want to use for GPU acceleration.
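
For reference, OpenCL.jl already exposes the device list, so a caller can at least find the discrete GPU manually. A rough sketch (GPUArrays' init would still need to accept the chosen device, which is exactly what this issue asks for):

using OpenCL
# list every device on every platform with its name
for platform in cl.platforms(), dev in cl.devices(platform)
    println(platform[:name], " / ", dev[:name])
end
# then pick the desired one explicitly instead of taking the first:
dev = last(cl.devices(:gpu))   # e.g. the discrete GPU on a dual-GPU laptop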

Why is `x .= x.*x` allocating memory and slow? (in Julia 0.6)

This is surprising:

julia> x = GPUArray(rand(Float32, 1000000));

julia> @time x.=sqrt.(x);
  0.000559 seconds (36 allocations: 944 bytes)

julia> @time x .= x.*x;
  0.135318 seconds (52.30 k allocations: 3.387 MiB, 5.70% gc time)

julia> versioninfo()
Julia Version 0.6.0-rc1.0
Commit 6bdb3950bd (2017-05-07 00:00 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.5.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
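
A quick diagnostic worth running (a sketch; the assumption is that most of that time is one-off compilation of the fused x .* x kernel rather than per-call cost): time the same broadcast twice in a row and compare.

x = GPUArray(rand(Float32, 1000000));
@time x .= x .* x;   # first call may include kernel compilation
@time x .= x .* x;   # subsequent calls should show the steady-state cost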

Sum over a dimension

Not sure if it's expected, but the following:

X = GPUArray(randn(Float32, 10, 10))
sum(X, 2)

gives an error:

 ERROR: MethodError: no method matching indexlength(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Int64, ::CartesianIndex{1})
Closest candidates are:
  indexlength(::Any, ::Any, ::AbstractArray) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:368
  indexlength(::Any, ::Any, ::Number) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:369
  indexlength(::Any, ::Any, ::Colon) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:370
Stacktrace:
 [1] (::GPUArrays.##22#23{GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext},Tuple{Int64,CartesianIndex{1}}})(::Int64) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:375
 [2] ntuple at ./tuple.jl:128 [inlined]
 [3] setindex!(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Float32, ::Int64, ::CartesianIndex{1}) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:374
 [4] macro expansion at ./reducedim.jl:203 [inlined]
 [5] macro expansion at ./simdloop.jl:73 [inlined]
 [6] _mapreducedim!(::Base.#identity, ::Base.#+, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./reducedim.jl:202
 [7] mapreducedim!(::Function, ::Function, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}) at ./reducedim.jl:210
 [8] sum(::GPUArrays.GPUArray{Float32,2,CUDAdrv.CuArray{Float32,2},GPUArrays.CUBackend.CUContext}, ::Int64) at ./reducedim.jl:572
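
Until mapreducedim! gets a GPU implementation, a host round-trip works as a stopgap (just a sketch, and it obviously defeats the purpose of keeping the data on the device):

Xh = Array(X)              # download
s  = GPUArray(sum(Xh, 2))  # reduce on the CPU, re-upload the result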

Loads of possibly unnecessary deps

Hey, I was trying to install GPUArrays. I have a working version of CUDAnative, but after a Pkg.clone I find I also need GLAbstraction, GLFW, etc. installed. Is it possible to decouple the package from all the backends so the user can simply use the backend they care about?

Julia Base GPU support

TODO make a PR to julia base with these:

  • CartesianIndex preserving integer type
  • subind/ind2sub should preserve integer type
  • Broadcast.newindex with better CUint support
  • move inner broadcast kernel to its own function

CC @MikeInnes

Features we need

  • notify julia's GC of GPU memory pressure
    • could do this as a first approach: look up free mem in the GPUArray constructor and call gc if almost full (see the sketch after this list)
    • memory manager (Grab one big buffer?! Grow a buffer? Reuse stuff? Work on only few arrays and make that fast? reuse KnetArrays approach?)
  • views: treat everything as a view into gpu memory. setindex! / getindex view or not?
  • mapreduce dim
  • concatenating / slicing
  • stencil operations
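
A sketch of that first-approach idea (all names here are hypothetical, not an existing API; each backend would have to plug in its own free-memory query):

function maybe_gc_before_alloc(free_bytes::Function, nbytes::Integer)
    # free_bytes() should return the device's currently free memory in bytes,
    # e.g. the figure the CUDAnative context already reports when shown.
    if free_bytes() < nbytes
        gc()   # let finalizers of unreferenced GPU buffers run
    end
end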

@denizyuret, @ilkerkesen

Indexing with arrays

This is with CuArrays' own indexing disabled.

Strangely, plain arrays seem to work as indices, but CuArrays do not.

julia> cu([1:5;])[[5,2,3]]
3-element CuArray{Float32,1}:
 5.0
 2.0
 3.0

julia> cu([1:5;])[cu[5,2,3]]
ERROR: scalar getindex is disabled

Linear solvers

A_ldiv_B! and \ support for GPUArrays is pretty essential and I keep assuming it already exists.
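
In the meantime the only option seems to be a CPU round-trip; a throwaway sketch (not a GPU solver, just a stopgap):

# factorize and solve on the host, then upload the result
cpu_ldiv(A::GPUArray, b::GPUArray) = GPUArray(Array(A) \ Array(b))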

Concatenation

It seems like this is implemented, but commented out.

CUDA compilation lib path error

Running the latest tagged versions of CUDAdrv, CUDArt and GPUArrays, Pkg.test("GPUArrays") yields (among other things):

WARNING: Couldn't find cuda compilation lib in default location.
        This will disable the ability to compile a CUDA kernel from a string
        To fix, install CUDAdrv in default location.

The warning (and subsequent test failure) is caused by this line in the backend definition, which refers to a file that was moved from CUDAdrv to CUDArt by this commit.

Should be a simple three character fix from
compile_lib = Pkg.dir("CUDAdrv", "examples", "compilation", "library.jl")
to
compile_lib = Pkg.dir("CUDArt", "examples", "compilation", "library.jl")
so didn't seem worthy of a PR.

opencl: broadcast complex64 fail (osX)

Hello,

I installed the package with Pkg.add, but some of the tests don't pass (Julia 0.6).

My output is:

julia> Pkg.test("GPUArrays")
INFO: Computing test dependencies for GPUArrays...
INFO: No packages to install, update or remove
INFO: Testing GPUArrays
Test Summary: | Pass  Total
julia         |   48     48
broadcast Complex64: Test Failed
  Expression: ERROR (unhandled task failure): MethodError: no method matching unsafe_string(::Ptr{Void})
Closest candidates are:
  unsafe_string(::Cstring) at c.jl:79
  unsafe_string(::Union{Ptr{Int8}, Ptr{UInt8}}) at strings/string.jl:39
  unsafe_string(::Union{Ptr{Int8}, Ptr{UInt8}}, ::Integer) at strings/string.jl:35
Stacktrace:
 [1] macro expansion at /Users/davidbuchacaprats/.julia/v0.6/OpenCL/src/context.jl:148 [inlined]
 [2] (::OpenCL.cl.##43#44)() at ./task.jl:335
all((x->begin 
            x == angle(10.0f0im)
        end), Array(B))
Stacktrace:
 [1] macro expansion at /Users/davidbuchacaprats/.julia/v0.6/GPUArrays/test/opencl.jl:36 [inlined]
 [2] macro expansion at ./test.jl:856 [inlined]
 [3] anonymous at ./<missing>:?
Test Summary:                        | Pass  Fail  Total
opencl                               |   44     1     45
  broadcast Float32                  |    5            5
  broadcast Complex64                |    4     1      5
  Custom kernel from Julia function  |    1            1
  Custom kernel from string function |    1            1
  transpose                          |    1            1
  mapreduce Float32 (4048,)          |    4            4
  mapreduce Int32 (4048,)            |    4            4
  mapreduce Float32 (1024, 1024)     |    4            4
  mapreduce Int32 (1024, 1024)       |    4            4
  mapreduce Float32 (77,)            |    4            4
  mapreduce Int32 (77,)              |    4            4
  mapreduce Float32 (1923, 209)      |    4            4
  mapreduce Int32 (1923, 209)        |    4            4
ERROR: LoadError: Some tests did not pass: 44 passed, 1 failed, 0 errored, 0 broken.
while loading /Users/davidbuchacaprats/.julia/v0.6/GPUArrays/test/runtests.jl, in expression starting on line 24
==================================[ ERROR: GPUArrays ]==================================

failed process: Process(`/Applications/Julia-0.6.app/Contents/Resources/julia/bin/julia -Ccore2 -J/Applications/Julia-0.6.app/Contents/Resources/julia/lib/julia/sys.dylib --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes /Users/davidbuchacaprats/.julia/v0.6/GPUArrays/test/runtests.jl`, ProcessExited(1)) [1]

========================================================================================
INFO: No packages to install, update or remove
ERROR: GPUArrays had test errors

Any ideas how to solve it ?

Couldn't compile kernel (Windows 10)

All tests pass:

julia> Pkg.test("GPUArrays")
INFO: Computing test dependencies for GPUArrays...
INFO: No packages to install, update or remove
INFO: Testing GPUArrays
Test Summary: | Pass  Total
julia         |   48     48
Test Summary: | Pass  Total
opencl        |   51     51
Test Summary: | Pass  Total
BLAS          |    8      8
Test Summary: | Pass  Total
Shared        |    6      6
INFO: GPUArrays tests passed
INFO: No packages to install, update or remove

Code:

julia> x = GPUArray(ones(100,100,100)); y = GPUArray(randn(100,100,100)); z = x .* y;
Couldn't compile kernel:
    1   : // dependant type declarations
    2   : typedef struct {
    3   :     int field1;
    4   :     int field2;
    5   :     int field3;
    6   : }int[3];
    7   :
    8   : typedef struct {
    9   : float empty; // structs can't be empty
    10  : }Base123;
    11  :
    12  : // dependant function declarations
    13  : float broadcast_index_1(__global float * restrict  arg, int[3] shape, int i)
    14  : {
    15  :     ;
    16  :     return arg[i - 1];
    17  :     ;
    18  : }
    19  : // Main inner function
    20  : __kernel void broadcast_kernel_2(__global float * restrict  A, Base123 f, int[3] sz, __global float * restrict  arg_1, __global float * restrict  arg_2)
    21  : {
    22  :     int i;
    23  :     i = get_global_id(0) + 1;
    24  :     float _ssavalue_0;
    25  :     _ssavalue_0 = broadcast_index_1(arg_1, sz, i) * broadcast_index_1(arg_2, sz, i);
    26  :     A[i - 1] = _ssavalue_0;
    27  :     ;
    28  : }
    29  :
With the following build error:
<kernel>:6:2: error: expected ';' after struct
}int[3];
 ^
 ;
<kernel>:2:1: warning: typedef requires a name
typedef struct {
^~~~~~~
<kernel>:6:5: error: expected identifier or '('
}int[3];
    ^
<kernel>:13:64: error: expected ')'
float broadcast_index_1(__global float * restrict  arg, int[3] shape, int i)
                                                               ^
<kernel>:13:24: note: to match this '('
float broadcast_index_1(__global float * restrict  arg, int[3] shape, int i)
                       ^
<kernel>:13:60: error: parameter name omitted
float broadcast_index_1(__global float * restrict  arg, int[3] shape, int i)
                                                           ^
<kernel>:16:16: error: use of undeclared identifier 'i'
    return arg[i - 1];
               ^
<kernel>:20:82: error: expected ')'
__kernel void broadcast_kernel_2(__global float * restrict  A, Base123 f, int[3] sz, __global float * restrict  arg_1, __global float * restrict  arg_2)
                                                                                 ^
<kernel>:20:33: note: to match this '('
__kernel void broadcast_kernel_2(__global float * restrict  A, Base123 f, int[3] sz, __global float * restrict  arg_1, __global float * restrict  arg_2)
                                ^
<kernel>:20:78: error: parameter name omitted
__kernel void broadcast_kernel_2(__global float * restrict  A, Base123 f, int[3] sz, __global float * restrict  arg_1, __global float * restrict  arg_2)
                                                                             ^
<kernel>:25:37: error: use of undeclared identifier 'arg_1'
    _ssavalue_0 = broadcast_index_1(arg_1, sz, i) * broadcast_index_1(arg_2, sz, i);
                                    ^
<kernel>:25:71: error: use of undeclared identifier 'arg_2'
    _ssavalue_0 = broadcast_index_1(arg_1, sz, i) * broadcast_index_1(arg_2, sz, i);
                                                                      ^
ERROR: CLError(code=-11, CL_BUILD_PROGRAM_FAILURE)
Stacktrace:
 [1] macro expansion at C:\Users\jecs\.julia\v0.6\OpenCL\src\macros.jl:6 [inlined]
 [2] #build!#113(::String, ::Bool, ::Function, ::OpenCL.cl.Program) at C:\Users\jecs\.julia\v0.6\OpenCL\src\program.jl:101
 [3] (::OpenCL.cl.#kw##build!)(::Array{Any,1}, ::OpenCL.cl.#build!, ::OpenCL.cl.Program) at .\<missing>:0
 [4] (::Transpiler.CLTranspiler.##27#28{Tuple{GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Base.#*,Tuple{Int32,Int32,Int32},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}},GPUArrays.CLBackend.#broadcast_kernel,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{5,DataType}})() at C:\Users\jecs\.julia\v0.6\Transpiler\src\clike/opencl\compilation.jl:104
 [5] get!(::Transpiler.CLTranspiler.##27#28{Tuple{GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Base.#*,Tuple{Int32,Int32,Int32},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}},GPUArrays.CLBackend.#broadcast_kernel,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{5,DataType}}, ::Dict{Any,Transpiler.CLTranspiler.CLFunction}, ::Tuple{GPUArrays.CLBackend.#broadcast_kernel,NTuple{5,DataType}}) at .\dict.jl:449
 [6] Transpiler.CLTranspiler.CLFunction(::Function, ::Tuple{GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},Base.#*,Tuple{Int32,Int32,Int32},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}}, ::OpenCL.cl.CmdQueue) at C:\Users\jecs\.julia\v0.6\Transpiler\src\clike/opencl\compilation.jl:79
 [7] acc_broadcast!(::Base.#*, ::GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::Tuple{GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}}) at C:\Users\jecs\.julia\v0.6\GPUArrays\src\backends\opencl\opencl.jl:168
 [8] broadcast(::Function, ::GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float64,3,OpenCL.cl.Buffer{Float64},GPUArrays.CLBackend.CLContext}) at C:\Users\jecs\.julia\v0.6\GPUArrays\src\abstractarray.jl:196

Version info:

Julia Version 0.6.0
Commit 903644385b* (2017-06-19 13:05 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

GPUArrays.jl status:

julia> Pkg.status("GPUArrays")
 - GPUArrays                     0.0.2+             master

free memory, check memory status

During the JuliaCon demo @SimonDanisch seemed to use a free function, yet that function is not available when importing GPUArrays. I have seen issue #30, but I am making a separate issue only for the memory problem.

I am experiencing some problems when using GPUArrays in a notebook.

Out of curiosity I copied the same code that was producing a nice "The kernel appears to have died. It will restart automatically." message into a file, and it runs just fine.

This is the code

using GPUArrays
using BenchmarkTools

sizes = [x for x in 100:100:1000];
cpu_times = Dict()
gpu_times = Dict()

println("\nCPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    res_cpu = @elapsed A_mul_B!(X_result, X,X)
    println("size: ", s, " x ", s, " seconds: ", res_cpu, " seconds")
    #cpu_times[s] = mean(res_cpu.times)/10^6
end

println("\nGPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    X_gpu = GPUArray(X);
    X_result_gpu = GPUArray(zeros(Float32,s,s));

    res_gpu = @elapsed A_mul_B!(X_result_gpu, X_gpu, X_gpu)
    println("size: ", s, " x ", s, " seconds: ", res_gpu, " seconds")
    #gpu_times[s] = mean(res_gpu.times)/10^6
end

I really don't know what GPUArrays is doing internally in the for loop.
For example, in the following code:

s = 100
 X = rand(Float32,s,s);
 X_result = zeros(X);
 X_gpu = GPUArray(X);

s = 200
 X = rand(Float32,s,s);
 X_result = zeros(X);
 X_gpu = GPUArray(X);

Is the first array freed?

  • Can we free an array at any moment using a free function?
  • Can we check the current status of a device easily (e.g. how much RAM is available on my GPU)? See the sketch below.
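
Partial answers, as far as I can tell (a sketch for the CUDAnative backend; it assumes CUDAdrv's Mem.info, and I don't know an OpenCL equivalent):

using CUDAdrv
free, total = CUDAdrv.Mem.info()   # bytes currently free / total on the device
println("free: ", free / 1024^2, " MiB of ", total / 1024^2, " MiB")

# GPUArray buffers are released by finalizers, so dropping the last reference
# and letting the garbage collector run should eventually return the memory:
X_gpu = 0
gc()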

Execution difference between `opencl` backend and native Julia

function power_method(M, v)
    T = eltype(v)
    for i in 1:100
        v = M*v        # repeatedly creates a new vector and destroys the old v
        v ./= T(norm(v))
    end
    
    return v, T(norm(M*v)) / T(norm(v))  # or  (M*v) ./ v
end

M = [2 1; 1 1.]
v = [1., 1]
MM = GPUArray(M)
vv = GPUArray(v)

# OpenCL backend on a K80
vec, val = power_method(MM, vv)
# => ([0.000285175, 2.02974], 1.4145115942187871)

vec, val = power_method(M, v)
# => ([0.850651, 0.525731], 2.618033988749895)
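
To help narrow it down, an in-place variant that reuses buffers instead of rebinding v each iteration might behave differently (just a sketch; it assumes A_mul_B! and norm work on the backend, which the original code already relies on):

function power_method!(M, v)
    T = eltype(v)
    tmp = similar(v)
    for i in 1:100
        A_mul_B!(tmp, M, v)   # tmp = M*v without allocating a new vector
        v, tmp = tmp, v
        v ./= T(norm(v))
    end
    return v, T(norm(M*v)) / T(norm(v))
end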

log.(x) throws ERROR_ILLEGAL_ADDRESS on CUDAnative backend

using GPUArrays
GPUArrays.init(:cudanative)
x = GPUArray(rand(100, 200))
log.(x)

gives

GPUArray with ctx: CUDAnative context with:
Device: CU GeForce GTX 960M
            threads: 1024
             blocks: (1024, 1024, 64)
      global_memory: 2100.232192 mb
 free_global_memory: 1274.478592 mb
       local_memory: 0.065536 mb
: 
Error showing value of type GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}:
ERROR: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] macro expansion at /home/dfdx/.julia/v0.6/CUDAdrv/src/base.jl:130 [inlined]
 [2] download(::Base.RefArray{Float64,Array{Float64,2},Void}, ::CUDAdrv.OwnedPtr{Float64}, ::Int64) at /home/dfdx/.julia/v0.6/CUDAdrv/src/memory.jl:141
 [3] copy!(::Array{Float64,2}, ::Int64, ::CUDAdrv.CuArray{Float64,2}, ::Int64, ::Int64) at /home/dfdx/.julia/v0.6/GPUArrays/src/backends/cudanative/cudanative.jl:154
 [4] copy!(::Array{Float64,2}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:294
 [5] Type at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:103 [inlined]
 [6] Type at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:96 [inlined]
 [7] show(::IOContext{Base.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at /home/dfdx/.julia/v0.6/GPUArrays/src/abstractarray.jl:48
 [8] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::MIME{Symbol("text/plain")}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./REPL.jl:122
 [9] display(::Base.REPL.REPLDisplay{Base.REPL.LineEditREPL}, ::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./REPL.jl:125
 [10] display(::GPUArrays.GPUArray{Float64,2,CUDAdrv.CuArray{Float64,2},GPUArrays.CUBackend.CUContext}) at ./multimedia.jl:194
 [11] eval(::Module, ::Any) at ./boot.jl:235
 [12] print_response(::Base.Terminals.TTYTerminal, ::Any, ::Void, ::Bool, ::Bool, ::Void) at ./REPL.jl:144
 [13] print_response(::Base.REPL.LineEditREPL, ::Any, ::Void, ::Bool, ::Bool) at ./REPL.jl:129
 [14] (::Base.REPL.#do_respond#16{Bool,Base.REPL.##26#36{Base.REPL.LineEditREPL,Base.REPL.REPLHistoryProvider},Base.REPL.LineEditREPL,Base.LineEdit.Prompt})(::Base.LineEdit.MIState, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Bool) at ./REPL.jl:646

Works fine on OpenCL backend.
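
One thing that might help narrow it down (an untested guess): the failing array is Float64, so comparing against a Float32 input would show whether the problem is specific to the double-precision log path.

x32 = GPUArray(rand(Float32, 100, 200))
log.(x32)   # does the illegal address error also occur for Float32?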

What is the compile_lib?

In the CUDAnative backend, this package searches for a specific Julia source file (library.jl), which is expected to be under CUDAdrv/examples/compilation. What is that file? I was getting an error because of this, but I think the error disappeared after I ran the CUDAdrv tests.

Muladd fallback

u0 = GPUArray(rand(Float32, 32, 32))
k1 = similar(u0)
uprev = similar(u0)
muladd.(2,k1,uprev)
MethodError: no method matching clintrinsic(::Tuple{Expr,DataType})
Closest candidates are:
  clintrinsic(!Matched::Base.#getindex, !Matched::Type{Tuple{Tuple{Vararg{T,N}},I<:Integer}}) where {N, T, I<:Integer} at C:\Users\Chris\.julia\v0.6\Transpiler\src\clike/opencl\intrinsics.jl:172
  clintrinsic(!Matched::Base.#getindex, !Matched::Type{Tuple{T,I<:Union{Int32, Int64, UInt64}}}) where {T, I<:Union{Int32, Int64, UInt64}} at C:\Users\Chris\.julia\v0.6\Transpiler\src\clike/opencl\intrinsics.jl:178
  clintrinsic(!Matched::Base.#setindex!, !Matched::Type{Tuple{T<:Union{Transpiler.CLTranspiler.CLIntrinsics.CLArray, Transpiler.CLTranspiler.CLIntrinsics.LocalMemory},Val,I<:Integer}}) where {T<:Union{Transpiler.CLTranspiler.CLIntrinsics.CLArray, Transpiler.CLTranspiler.CLIntrinsics.LocalMemory}, Val, I<:Integer} at C:\Users\Chris\.julia\v0.6\Transpiler\src\clike/opencl\intrinsics.jl:183
  ...
isintrinsic(::Sugar.LazyMethod{:CL}) at intrinsics.jl:157
_dependencies!(::DataStructures.OrderedSet{Sugar.LazyMethod}, ::Sugar.LazyMethod{:CL}) at methods.jl:330
_dependencies!(::Sugar.LazyMethod{:CL}, ::Sugar.LazyMethod{:CL}) at methods.jl:323
_dependencies!(::DataStructures.OrderedSet{Sugar.LazyMethod}, ::Sugar.LazyMethod{:CL}) at methods.jl:331
_dependencies!(::Sugar.LazyMethod{:CL}, ::Sugar.LazyMethod{:CL}) at methods.jl:323
_dependencies!(::DataStructures.OrderedSet{Sugar.LazyMethod}, ::Sugar.LazyMethod{:CL}) at methods.jl:331
_dependencies!(::Sugar.LazyMethod{:CL}, ::Sugar.LazyMethod{:CL}) at methods.jl:323
_dependencies!(::DataStructures.OrderedSet{Sugar.LazyMethod}, ::Sugar.LazyMethod{:CL}) at methods.jl:331
dependencies!(::Sugar.LazyMethod{:CL}, ::Bool) at methods.jl:310
(::Transpiler.CLTranspiler.##27#28{Tuple{GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},##13#14,Tuple{Int32,Int32},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}},GPUArrays.CLBackend.#broadcast_kernel,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{5,DataType}})() at compilation.jl:84
get!(::Transpiler.CLTranspiler.##27#28{Tuple{GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},##13#14,Tuple{Int32,Int32},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}},GPUArrays.CLBackend.#broadcast_kernel,OpenCL.cl.CmdQueue,OpenCL.cl.Context,NTuple{5,DataType}}, ::Dict{Any,Transpiler.CLTranspiler.CLFunction}, ::Tuple{GPUArrays.CLBackend.#broadcast_kernel,NTuple{5,DataType}}) at dict.jl:449
Transpiler.CLTranspiler.CLFunction(::Function, ::Tuple{GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},##13#14,Tuple{Int32,Int32},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}}, ::OpenCL.cl.CmdQueue) at compilation.jl:79
acc_broadcast!(::##13#14, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::Tuple{GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext},GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}}) at opencl.jl:168
broadcast(::Function, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}) at abstractarray.jl:196
include_string(::String, ::String) at loading.jl:515
eval(::Module, ::Any) at boot.jl:235
(::Atom.##61#64)() at eval.jl:102
withpath(::Atom.##61#64, ::Void) at utils.jl:30
withpath(::Function, ::Void) at eval.jl:38
macro expansion at eval.jl:101 [inlined]
(::Atom.##60#63{Dict{String,Any}})() at task.jl:80

Would a simple fallback be fine?

uprev .+ 2.*k1

That works. If it can FMA, that's better.

Broadcast fusion fails when scalars are involved

u0 = GPUArray(rand(Float32, 32, 32))
tmp = ones(u0)
uprev = ones(u0)
k1 = ones(u0)
a = Float32(2.0)
tmp .=  uprev.+a.*k1
indexing not defined for GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}
getindex at abstractarray.jl:874 [inlined]
_getindex at abstractarray.jl:921 [inlined]
getindex at abstractarray.jl:875 [inlined]
_broadcast_getindex at broadcast.jl:133 [inlined]
_broadcast_getindex at broadcast.jl:130 [inlined]
macro expansion at broadcast.jl:151 [inlined]
macro expansion at simdloop.jl:73 [inlined]
macro expansion at broadcast.jl:147 [inlined]
_broadcast!(::##5#6, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::Tuple{Tuple{Bool,Bool},Tuple{},Tuple{Bool,Bool}}, ::Tuple{Tuple{Int64,Int64},Tuple{},Tuple{Int64,Int64}}, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::Tuple{Float32,GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}}, ::Type{Val{2}}, ::CartesianRange{CartesianIndex{2}}) at broadcast.jl:139
broadcast_c! at broadcast.jl:211 [inlined]
broadcast!(::Function, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}, ::Float32, ::GPUArrays.GPUArray{Float32,2,OpenCL.cl.Buffer{Float32},GPUArrays.CLBackend.CLContext}) at broadcast.jl:204
include_string(::String, ::String) at loading.jl:515
include_string(::String, ::String, ::Int64) at eval.jl:30
include_string(::Module, ::String, ::String, ::Int64, ::Vararg{Int64,N} where N) at eval.jl:34
(::Atom.##49#52{String,Int64,String})() at eval.jl:50
withpath(::Atom.##49#52{String,Int64,String}, ::String) at utils.jl:30
withpath(::Function, ::String) at eval.jl:38
macro expansion at eval.jl:49 [inlined]
(::Atom.##48#51{Dict{String,Any}})() at task.jl:80

N-D broadcast is buggy

N-dimensional broadcast operations are buggy with the CUDAnative backend. An example code snippet:

julia> using GPUArrays

julia> CUBackend.init()
CUContext

julia> g1 = GPUArray(rand(4,5,3));

julia> g2 = GPUArray(rand(1,5,3));

julia> a1 = Array(g1); a2 = Array(g2);

julia> isapprox(Array(g1 .+ g2), a1 .+ a2)
false

julia> g3 = GPUArray(rand(1,5,1)); a3 = Array(g3);

julia> isapprox(Array(g1 .+ g3), a1 .+ a3)
false

Option to make indexing warn

Since scalar indexing is a slow operation, it would be nice to have an option to make it emit a warning when used, so we know when we're "safe".
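
A rough sketch of what the opt-in could look like (all names are hypothetical; this is not an existing GPUArrays API):

const WARN_SCALAR_INDEX = Ref(false)
warn_scalar_indexing(flag::Bool) = (WARN_SCALAR_INDEX[] = flag)

function checked_getindex(A, i...)
    WARN_SCALAR_INDEX[] && warn("scalar getindex on a GPUArray is slow")
    getindex(A, i...)
end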

Error blackscholes for N=10^2

Hello

I've tried your Black-Scholes script on a Mac and I get the following results. It seems to work, but an ERROR message is printed. Do you have the same behaviour?

INFO: Running benchmarks number of threads: 1
| Backend | Time (μs) for N = 10^1 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.00 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|

| Backend | Time (μs) for N = 10^2 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.00 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|
ERROR (unhandled task failure): MethodError: no method matching unsafe_string(::Ptr{Void})
Closest candidates are:
  unsafe_string(!Matched::Cstring) at c.jl:79
  unsafe_string(!Matched::Union{Ptr{Int8}, Ptr{UInt8}}) at strings/string.jl:39
  unsafe_string(!Matched::Union{Ptr{Int8}, Ptr{UInt8}}, !Matched::Integer) at strings/string.jl:35
Stacktrace:
 [1] macro expansion at /Users/macpro/.julia/v0.6/OpenCL/src/context.jl:95 [inlined]
 [2] (::OpenCL.cl.##43#44)() at ./task.jl:335

| Backend | Time (μs) for N = 10^3 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.00 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|

| Backend | Time (μs) for N = 10^4 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.00 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|

| Backend | Time (μs) for N = 10^5 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.01 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|

| Backend | Time (μs) for N = 10^6 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.09 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.00 μs|

| Backend | Time (μs) for N = 10^7 |
| ---- | ---- |
| JLContext Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz with 1 threads |   0.92 μs|
| CLContext: AMD Radeon HD - FirePro D300 Compute Engine |   0.04 μs|
