
GPUifyLoops.jl

Support for writing loop-based code that executes both on CPU and GPU

End of Life

GPUifyLoops.jl is no longer under development and has been replaced by KernelAbstractions.jl.

Installation

GPUifyLoops is a registered package, and can be installed using the Julia package manager.

julia> ]
(v1.1) pkg> add GPUifyLoops

Note: The current version of this package requires Julia 1.1.

Debugging

Debugging failures to transform a function for the GPU requires the use of Cthulhu.jl.

using Cthulhu
using GPUifyLoops

# @launch CUDA() f(args...)
descend(GPUifyLoops.signature(f, args...)...)

Development

To test this package locally, run:

julia --project=test/gpuenv
julia> ]
(gpuenv) pkg> resolve
(gpuenv) pkg> instantiate

This will resolve the GPU environment; please do not check in changes to test/gpuenv/. You can then run the tests with julia --project=test/gpuenv test/runtests.jl.

License

GPUifyLoops.jl is licensed under the MIT license.

GPUifyLoops.jl's People

Contributors

ali-ramadhan, bors[bot], dilumaluthge, dpsanders, juliatagbot, lcw, vchuravy


GPUifyLoops.jl's Issues

Inlining Failure?

using GPUifyLoops, CuArrays, StaticArrays
ps = CuArray([@SVector [10,28,8/3] for i in 1:10])
function loop(u, p, t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du1 = σ*(u[2]-u[1])
    du2 = u[1]*(ρ-u[3]) - u[2]
    du3 = u[1]*u[2] - β*u[3]
    @SVector [du1, du2, du3]
end
function ff18(p)
    u0 = @SVector [10.0, 10.0, 10.0]
    tspan = (0.0, 100.0)
    dt = 0.1
    ts = tspan[1]:dt:tspan[2]
    k7 = loop(u0, p, ts[1])
end
_ff18 = GPUifyLoops.contextualize(ff18)
map(_ff18,ps)

function ff19(p)
    u = @SVector [10.0, 10.0, 10.0]
    tspan = (0.0, 100.0)
    dt = 0.1
    ts = tspan[1]:dt:tspan[2]
    σ = p[1]; ρ = p[2]; β = p[3]
    du1 = σ*(u[2]-u[1])
    du2 = u[1]*(ρ-u[3]) - u[2]
    du3 = u[1]*u[2] - β*u[3]
    @SVector [du1, du2, du3]
end
_ff19 = GPUifyLoops.contextualize(ff19)
map(_ff19,ps)

Notice that manually inlining loop (as in ff19) fixes the issue, while without it (ff18) you get

julia> map(_ff18,ps)
ERROR: LLVM error: Program used external function '__nv_fabs' which could not be resolved!
Stacktrace:
 [1] handle_error(::Cstring) at C:\Users\accou\.julia\packages\LLVM\ViliQ\src\core\context.jl:103
 [2] map(::Function, ::CuArray{SArray{Tuple{3},Float64,1,3},1}) at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\base.jl:9
 [3] top-level scope at REPL[29]:1

GPUifyLoops boundschecking on SubArray

function lorenz(du,u,p,t)
 @inbounds begin
     du[1] = 10.0f0(u[2]-u[1])
     du[2] = u[1]*(28.0f0-u[3]) - u[2]
     du[3] = u[1]*u[2] - (8/3f0)*u[3]
 end
 nothing
end
u0 = Float32[1.0;0.0;0.0]

using GPUifyLoops, CuArrays, CUDAnative
function f(du,u,p,t)
    @loop for i in (1:size(u,2); CUDAnative.threadIdx().x)
        @views @inbounds lorenz(du[:,i],u[:,i],p,t)
        nothing
    end
    nothing
end
function _f(du,u,p,t)
    @launch CUDA() threads = size(u,2) f(du,u,p,t)
end

CuArrays.allowscalar(false)
_u0 = CuArray(hcat([u0 for i in 1:128]...))

u = copy(_u0)
du= copy(_u0)
p = nothing
t = 0.0f0
_f(du,u,p,t)

works, but removing the @inbounds in lorenz, like

function lorenz(du,u,p,t)
 du[1] = 10.0f0(u[2]-u[1])
 du[2] = u[1]*(28.0f0-u[3]) - u[2]
 du[3] = u[1]*u[2] - (8/3f0)*u[3]
 nothing
end
u0 = Float32[1.0;0.0;0.0]

using GPUifyLoops, CuArrays, CUDAnative
function f(du,u,p,t)
    @loop for i in (1:size(u,2); CUDAnative.threadIdx().x)
        @views @inbounds lorenz(du[:,i],u[:,i],p,t)
        nothing
    end
    nothing
end
function _f(du,u,p,t)
    @launch CUDA() threads = size(u,2) f(du,u,p,t)
end

CuArrays.allowscalar(false)
_u0 = CuArray(hcat([u0 for i in 1:128]...))

u = copy(_u0)
du= copy(_u0)
p = nothing
t = 0.0f0
_f(du,u,p,t)

causes

julia> _f(du,u,p,t)
ERROR: InvalidIRError: compiling f(Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#397")),Nothing,Cassette.DisableHooks}, typeof(f), CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CuDeviceArray{Float32,2,CUDAnative.AS.Global}, Nothing, Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:508
 [2] multiple call sites at unknown:0
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
 [1] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:508
 [2] multiple call sites at unknown:0
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at C:\Users\accou\.julia\dev\CUDAnative\src\compiler\validation.jl:114
 [2] macro expansion at C:\Users\accou\.julia\packages\TimerOutputs\7zSea\src\TimerOutput.jl:216 [inlined]
 [3] #codegen#121(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\dev\CUDAnative\src\compiler\driver.jl:186
 [4] #codegen at .\none:0 [inlined]
 [5] #compile#120(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\dev\CUDAnative\src\compiler\driver.jl:47
 [6] #compile at C:\Users\accou\.julia\dev\CUDAnative\src\compiler\common.jl:0 [inlined]
 [7] #compile#119 at C:\Users\accou\.julia\dev\CUDAnative\src\compiler\driver.jl:28 [inlined]
 [8] #compile at .\none:0 [inlined] (repeats 2 times)
 [9] macro expansion at C:\Users\accou\.julia\dev\CUDAnative\src\execution.jl:388 [inlined]
 [10] #cufunction#161(::String, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#397")),Nothing,Cassette.DisableHooks},typeof(f),CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},Nothing,Float32}}) at C:\Users\accou\.julia\dev\CUDAnative\src\execution.jl:356
 [11] (::getfield(CUDAnative, Symbol("#kw##cufunction")))(::NamedTuple{(:name,),Tuple{String}}, ::typeof(cufunction), ::Function, ::Type) at .\none:0
 [12] macro expansion at C:\Users\accou\.julia\packages\GPUifyLoops\HaVjN\src\GPUifyLoops.jl:125 [inlined]
 [13] macro expansion at .\gcutils.jl:87 [inlined]
 [14] #launch#46(::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:threads,),Tuple{Int64}}}, ::typeof(GPUifyLoops.launch), ::CUDA, ::typeof(f), ::CuArray{Float32,2}, ::Vararg{Any,N} where N) at C:\Users\accou\.julia\packages\GPUifyLoops\HaVjN\src\GPUifyLoops.jl:121
 [15] #launch at .\none:0 [inlined]
 [16] macro expansion at C:\Users\accou\.julia\packages\GPUifyLoops\HaVjN\src\GPUifyLoops.jl:54 [inlined]
 [17] _f(::CuArray{Float32,2}, ::CuArray{Float32,2}, ::Nothing, ::Float32) at .\REPL[5]:2
 [18] top-level scope at REPL[12]:1

Stencil abstraction does not work on GPU

I'm trying out the stencil abstraction from PR #81, which now works on the CPU (though it allocates a lot) but does not work on the GPU.

@vchuravy says it's because cat and reshape do not work on static arrays on the GPU.

I'm using commit CliMA/Oceananigans.jl@0cf5988 of Oceananigans to test this out and get this error:

julia> using Oceananigans, BenchmarkTools
julia> model = Model(N=(256, 256, 256), L=(100, 100, 100), arch=GPU());
julia> time_step!(model, 1, 1)
ERROR: InvalidIRError: compiling overdub(Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#371")),Nothing,Cassette.DisableHooks}, typeof(Oceananigans.calculate_interior_source_terms!), RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}, PlanetaryConstants{Float64}, LinearEquationOfState{Float64}, Oceananigans.TurbulenceClosures.ConstantAnisotropicDiffusivity{Float64}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}, Forcing{typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func)}) resulted in invalid LLVM IR       
Reason: unsupported dynamic function invocation (call to recurse(ctx::Cassette.Context, ::typeof(Core._apply), f, args...) in Cassette at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:503)                                                                                                                                                                           
Stacktrace:
 [1] #cat_t at none:0
 [2] _cat at abstractarray.jl:1566
 [3] #cat#106 at abstractarray.jl:1565
 [4] recurse at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:503
 [5] #cat at none:0
 [6] #3 at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/custencil.jl:106
 [7] macro expansion at ntuple.jl:50
 [8] ntuple at ntuple.jl:46
 [9] iterate at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/custencil.jl:94
 [10] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [11] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] macro expansion at ntuple.jl:51
 [2] ntuple at ntuple.jl:46
 [3] iterate at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/custencil.jl:94
 [4] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [5] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] iterate at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/custencil.jl:109
 [2] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [3] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] iterate at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/custencil.jl:109
 [2] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [3] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
 [1] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [2] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:160
 [2] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] calculate_interior_source_terms! at /autofs/nccsopen-svm1_home/alir/Oceananigans.jl/src/time_steppers.jl:162
 [2] overdub at /ccsopen/home/alir/.julia/packages/Cassette/xggAf/src/overdub.jl:0
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/compiler/validation.jl:114
 [2] #codegen#119(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /ccsopen/home/alir/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:216
 [3] #codegen at ./none:0 [inlined]
 [4] #compile#118(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:47
 [5] #compile at ./none:0 [inlined]
 [6] #compile#117 at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:28 [inlined]
 [7] compile at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:28 [inlined] (repeats 2 times)
 [8] macro expansion at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:378 [inlined]
 [9] #cufunction#159(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#371")),Nothing,Cassette.DisableHooks},typeof(Oceananigans.calculate_interior_source_terms!),RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}},PlanetaryConstants{Float64},LinearEquationOfState{Float64},Oceananigans.TurbulenceClosures.ConstantAnisotropicDiffusivity{Float64},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},OffsetArrays.OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}},Forcing{typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func)}}}) at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:347
 [10] cufunction(::Function, ::Type) at /ccsopen/home/alir/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:347
 [11] macro expansion at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/GPUifyLoops.jl:127 [inlined]
 [12] macro expansion at ./gcutils.jl:87 [inlined]
 [13] #launch#58(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Vararg{Int64,N} where N},Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::typeof(GPUifyLoops.launch), ::GPUifyLoops.CUDA, ::typeof(Oceananigans.calculate_interior_source_terms!), ::RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}, ::Vararg{Any,N} where N) at /ccsopen/home/alir/.julia/dev/GPUifyLoops/src/GPUifyLoops.jl:121
 [14] time_step!(::Model{GPU,RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}},Oceananigans.TurbulenceClosures.ConstantAnisotropicDiffusivity{Float64},Float64}, ::Int64, ::Int64) at ./none:0
 [15] top-level scope at REPL[3]:1

Use occupancy API to suggest launch configuration

This is a lazy-mode PR, and I'm not sure what the API guarantees of this call are.
Instead of only passing the maximum number of threads, one can use the CUDA API to figure out a thread count that maximizes occupancy:

diff --git a/src/GPUifyLoops.jl b/src/GPUifyLoops.jl
index a73aa4f..abf3b05 100644
--- a/src/GPUifyLoops.jl
+++ b/src/GPUifyLoops.jl
@@ -77,7 +77,7 @@ launch configuration passed to the call.
 Return a NamedTuple that has `blocks`, `threads`, `shmem`, and `stream`.
 All arguments are optional, but blocks and threads is recommended.
 """
-function launch_config(@nospecialize(f), maxthreads, args...; kwargs...)
+function launch_config(@nospecialize(f), maxthreads, suggested, args...; kwargs...)
     return kwargs
 end
 
@@ -105,6 +105,7 @@ end
 
 @init @require CUDAnative="be33ccc6-a3ff-5ff2-a52e-74243cff1e17" begin
     using .CUDAnative
+    using .CUDAnative: CUDAdrv
 
     function version_check()
         project = joinpath(dirname(pathof(CUDAnative)), "../Project.toml")
@@ -129,7 +130,8 @@ end
             end
 
             maxthreads = CUDAnative.maxthreads(kernel)
-            config = launch_config(f, maxthreads, args...; call_kwargs...)
+            suggested = CUDAdrv.launch_configuration(kernel.fun)
+            config = launch_config(f, maxthreads, suggested, args...; call_kwargs...)
 
             kernel(kernel_args...; config...)
         end

Downstream use with DiffEqGPU:

diff --git a/src/DiffEqGPU.jl b/src/DiffEqGPU.jl
index 149af2c..27853df 100644
--- a/src/DiffEqGPU.jl
+++ b/src/DiffEqGPU.jl
@@ -57,9 +57,9 @@ function GPUifyLoops.launch_config(::Union{typeof(gpu_kernel),
                                            typeof(discrete_affect!_kernel),
                                            typeof(continuous_condition_kernel),
                                            typeof(continuous_affect!_kernel)},
-                                           maxthreads,context,g,f,du,u,args...;
+                                           maxthreads,suggested,context,g,f,du,u,args...;
                                            kwargs...)
-    t = min(maxthreads,size(u,2))
+    t = min(suggested.threads,size(u,2))
     blocks = ceil(Int,size(u,2)/t)
     (threads=t,blocks=blocks)
 end
@@ -278,18 +278,18 @@ function ldiv!_kernel(W,x,len,nfacts)
 end
 
 function GPUifyLoops.launch_config(::typeof(qr_kernel),
-                                           maxthreads,context,g,W,len,nfacts;
+                                           maxthreads,suggested,context,g,W,len,nfacts;
                                            kwargs...)
-    t = min(maxthreads,nfacts)
+    t = min(suggested.threads,nfacts)
     blocks = ceil(Int,nfacts/t)
     (threads=t,blocks=blocks)
 end
 
 function GPUifyLoops.launch_config(::typeof(ldiv!_kernel),
-                                           maxthreads,context,g,W,x,len,nfacts,
+                                           maxthreads,suggested,context,g,W,x,len,nfacts,
                                            args...;
                                            kwargs...)
-    t = min(maxthreads,nfacts)
+    t = min(suggested.threads,nfacts)
     blocks = ceil(Int,nfacts/t)
     (threads=t,blocks=blocks)
 end

This assumes no shared memory is dynamically allocated.

cc @ChrisRackauckas

Dispatch to device index function

Last Friday, @glwagner and I discussed an idea: dispatch to a device index function instead of writing out the index calculation by hand.

So instead of writing:

@loop for i in (1:length(A); threadIdx().x)
  ...
end

You could write:

@loop inner for i in 1:length(A)
  ...
end

Which would call: index(kernelf, Val{:inner}(), cpuindices)
This allows the user to specify a particular index depending on a loop label and the kernel function. I am worried that it is a bit too magical, but we would certainly keep the original form around.
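
For concreteness, a rough sketch of how that dispatch could be spelled (nothing here is actual API; mykernel! and index are placeholder names):

using CUDAnative   # only needed for the device-specific override below

mykernel!(A) = nothing   # stand-in kernel function, for illustration only

# fallback: a labelled loop simply iterates over the CPU index range
index(kernel, ::Val, cpuindices) = cpuindices

# kernel/label-specific override: the :inner loop of mykernel! uses the thread index
index(::typeof(mykernel!), ::Val{:inner}, cpuindices) = CUDAnative.threadIdx().x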

Return of ODESolution type fails

With SciML/DiffEqBase.jl#240 and SimpleDiffEq#master, the entire returned object is immutable. However, it segfaults when it is returned. Returning just its solution array, sol.u, is fine.

using GPUifyLoops, CuArrays, SimpleDiffEq
using StaticArrays, Cthulhu

function loop(u, p, t)
    @inbounds begin
        σ = p[1]; ρ = p[2]; β = p[3]
        du1 = σ*(u[2]-u[1])
        du2 = u[1]*(ρ-u[3]) - u[2]
        du3 = u[1]*u[2] - β*u[3]
        return SVector{3}(du1, du2, du3)
    end
end
function liip(du, u, p, t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du[1] = σ*(u[2]-u[1])
    du[2] = u[1]*(ρ-u[3]) - u[2]
    du[3] = u[1]*u[2] - β*u[3]
    return nothing
end

u0 = 10ones(Float32,3)
const su0= SVector{3}(u0)
const dt = 1f-1

odeoop = ODEProblem{false}(loop, SVector{3}(u0), (0.0f0, 10.0f0),  Float32[10, 28, 8/3])
sol2 = solve(odeoop,GPUSimpleTsit5(),dt=dt)
ps = CuArray([@SVector [10f0,28f0,8/3f0] for i in 1:10])
CuArrays.allowscalar(false)

function f(p)
    prob = ODEProblem{false}(loop, su0, (0.0f0, 10.0f0),  p)
    solve(prob,GPUSimpleTsit5(),dt=dt).u
end

_f = GPUifyLoops.contextualize(f)
map(_f,ps)

function f2(p)
    prob = ODEProblem{false}(loop, su0, (0.0f0, 10.0f0),  p)
    solve(prob,GPUSimpleTsit5(),dt=dt)
end

_f2 = GPUifyLoops.contextualize(f2)
map(_f2,ps)

Possible fabs float range issue?

using GPUifyLoops, CuArrays, StaticArrays
ps = CuArray([@SVector [10,28,8/3] for i in 1:10])
function loop(u, p, t)
    σ = p[1]; ρ = p[2]; β = p[3]
    du1 = σ*(u[2]-u[1])
    du2 = u[1]*(ρ-u[3]) - u[2]
    du3 = u[1]*u[2] - β*u[3]
    @SVector [du1, du2, du3]
end
function ff8(p)
    u0 = @SVector [10.0, 10.0, 10.0]
    tspan = (0.0, 100.0)
    dt = 0.1
    tf = tspan[2]
    ts = tspan[1]:dt:tspan[2]
    u = u0
    k7 = loop(u, p, ts[1])
    uprev = u; k1 = k7
    tmp = uprev.+dt*0.161.*k7
end

_ff8 = GPUifyLoops.contextualize(ff8)
map(_ff8,ps)
julia> map(_ff8,ps)
ERROR: LLVM error: Program used external function '__nv_fabs' which could not be resolved!
Stacktrace:
 [1] handle_error(::Cstring) at C:\Users\accou\.julia\packages\LLVM\ViliQ\src\core\context.jl:103
 [2] map(::Function, ::CuArray{SArray{Tuple{3},Float64,1,3},1}) at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\base.jl:9
 [3] top-level scope at REPL[6]:1

setindex! to an MArray with non-literals fails

using GPUifyLoops, SimpleDiffEq, CuArrays, StaticArrays
ps = CuArray([@SVector [10f0,28f0,8/3f0] for i in 1:10])
function ff5(p)
    u0 = @SVector [10.0f0, 10.0f0, 10.0f0]
    us = MVector{10001,SArray{Tuple{3},Float64,1,3}}(undef)
    us[1] = u0
    for i in 2:10001
        us[i] = u0
    end
    us
end

_ff5 = GPUifyLoops.contextualize(ff5)
map(_ff5,ps)

The literal index seems fine; it's the loop that fails.

Out of bounds addressing with StaticArrays

Consider the following kernel

function kernel()
    s = MArray{Tuple{1}, Float32}(undef)
    @inbounds s[1] = zero(Float32)

    nothing
end

When I launch it with CUDAnative directly

@cuda threads=(5,) blocks=1 kernel()

It runs cuda-memcheck clean, i.e.,

========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

However, when I launch the same kernel with GPUifyLoops via

@launch CUDA() threads=(5,) blocks=1 kernel()

under cuda-memcheck I get the following errors

julia --project=. vanilladeriv.jl
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 1
=========     at 0x00000040 in ./Base.jl:15:ptxcall_anonymous12_1
=========     by thread (4,0,0) in block (0,0,0)
=========     Address 0x7f392b367799 is out of bounds
=========     Device Frame:./Base.jl:15:ptxcall_anonymous12_1 (ptxcall_anonymous12_1 : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24f90d]
=========     Host Frame:[0x7f390edbf408]
=========     Host Frame:[0x7f390ee2d8f9]
=========     Host Frame:[0x7f390ee2d20b]
=========     Host Frame:[0x7f390ee2d287]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:[0x7f390edc03e5]
=========     Host Frame:[0x7f390edbd47c]
=========     Host Frame:[0x7f390edbd4e3]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c5ba7]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c58a2]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c64df]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6b44]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x57c2c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6f4d]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x7516e]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x4eeee]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x76113]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xf364a]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffe0c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xff499]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffb60]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:julia [0x26fe]
=========     Host Frame:julia [0x2324]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xeb) [0x2409b]
=========     Host Frame:julia [0x23ca]
=========
========= Invalid __global__ read of size 1
=========     at 0x00000040 in ./Base.jl:15:ptxcall_anonymous12_1
=========     by thread (3,0,0) in block (0,0,0)
=========     Address 0x7f392b367799 is out of bounds
=========     Device Frame:./Base.jl:15:ptxcall_anonymous12_1 (ptxcall_anonymous12_1 : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24f90d]
=========     Host Frame:[0x7f390edbf408]
=========     Host Frame:[0x7f390ee2d8f9]
=========     Host Frame:[0x7f390ee2d20b]
=========     Host Frame:[0x7f390ee2d287]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:[0x7f390edc03e5]
=========     Host Frame:[0x7f390edbd47c]
=========     Host Frame:[0x7f390edbd4e3]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c5ba7]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c58a2]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c64df]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6b44]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x57c2c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6f4d]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x7516e]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x4eeee]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x76113]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xf364a]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffe0c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xff499]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffb60]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:julia [0x26fe]
=========     Host Frame:julia [0x2324]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xeb) [0x2409b]
=========     Host Frame:julia [0x23ca]
=========
========= Invalid __global__ read of size 1
=========     at 0x00000040 in ./Base.jl:15:ptxcall_anonymous12_1
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f392b367799 is out of bounds
=========     Device Frame:./Base.jl:15:ptxcall_anonymous12_1 (ptxcall_anonymous12_1 : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24f90d]
=========     Host Frame:[0x7f390edbf408]
=========     Host Frame:[0x7f390ee2d8f9]
=========     Host Frame:[0x7f390ee2d20b]
=========     Host Frame:[0x7f390ee2d287]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:[0x7f390edc03e5]
=========     Host Frame:[0x7f390edbd47c]
=========     Host Frame:[0x7f390edbd4e3]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c5ba7]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c58a2]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c64df]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6b44]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x57c2c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6f4d]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x7516e]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x4eeee]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x76113]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xf364a]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffe0c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xff499]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffb60]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:julia [0x26fe]
=========     Host Frame:julia [0x2324]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xeb) [0x2409b]
=========     Host Frame:julia [0x23ca]
=========
========= Invalid __global__ read of size 1
=========     at 0x00000040 in ./Base.jl:15:ptxcall_anonymous12_1
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x7f392b367799 is out of bounds
=========     Device Frame:./Base.jl:15:ptxcall_anonymous12_1 (ptxcall_anonymous12_1 : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24f90d]
=========     Host Frame:[0x7f390edbf408]
=========     Host Frame:[0x7f390ee2d8f9]
=========     Host Frame:[0x7f390ee2d20b]
=========     Host Frame:[0x7f390ee2d287]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:[0x7f390edc03e5]
=========     Host Frame:[0x7f390edbd47c]
=========     Host Frame:[0x7f390edbd4e3]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c5ba7]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c58a2]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c64df]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6b44]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x57c2c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6f4d]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x7516e]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x4eeee]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x76113]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xf364a]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffe0c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xff499]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffb60]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:julia [0x26fe]
=========     Host Frame:julia [0x2324]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xeb) [0x2409b]
=========     Host Frame:julia [0x23ca]
=========
========= Invalid __global__ read of size 1
=========     at 0x00000040 in ./Base.jl:15:ptxcall_anonymous12_1
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f392b367799 is out of bounds
=========     Device Frame:./Base.jl:15:ptxcall_anonymous12_1 (ptxcall_anonymous12_1 : 0x40)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuLaunchKernel + 0x2cd) [0x24f90d]
=========     Host Frame:[0x7f390edbf408]
=========     Host Frame:[0x7f390ee2d8f9]
=========     Host Frame:[0x7f390ee2d20b]
=========     Host Frame:[0x7f390ee2d287]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:[0x7f390edc03e5]
=========     Host Frame:[0x7f390edbd47c]
=========     Host Frame:[0x7f390edbd4e3]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x279) [0x451c9]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c5ba7]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c58a2]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c64df]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6b44]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x57c2c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x1c6f4d]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x7516e]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 [0x4eeee]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x76113]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xf364a]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffe0c]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xff499]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/lib/julia/sys.so [0xffb60]
=========     Host Frame:/home/lucasw/opt/julia/git/usr/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x458) [0x453a8]
=========     Host Frame:julia [0x26fe]
=========     Host Frame:julia [0x2324]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xeb) [0x2409b]
=========     Host Frame:julia [0x23ca]
=========
========= ERROR SUMMARY: 5 errors

I am using CUDAnative#master, CUDAdrv#master, and GPUifyLoops#master along with Julia 1.2 built from source with the GPU code generation fix:

git checkout release-1.2
git cherry-pick 8469856

cc: @simonbyrne

v0.2.4 broke things but not sure what

I just tried upgrading from GPUifyLoops v0.2.3 to v0.2.4, but now quite a few Oceananigans tests fail. The failures all mention Cassette.overdub and look very similar. In particular, δz_f2c_ab̄ᶻ always appears at frame [1] of the stack trace; perhaps it is the culprit, or just the first bad call that is encountered. I included an example stacktrace below.

Sorry I know this is probably not a very useful description, but if anyone has any idea how to approach this issue, I'd be grateful!

For now v0.2.3 still works great so this isn't a huge issue.

Time stepping: Error During Test at /home/alir_mit_edu/Oceananigans.jl/test/runtests.jl:240
  Test threw exception
  Expression: time_stepping_works(arch, ft)
  InvalidIRError: compiling #12(RegularCartesianGrid{Float32,StepRangeLen{Float32,Float64,Float64}}, PlanetaryConstants{Float64}, LinearEquationOfState{Float64}, ModelConfiguration{Float64}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Forcing{typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func)}) resulted in invalid LLVM IR
  Reason: unsupported dynamic function invocation (call to Cassette.overdub)
  Stacktrace:
   [1] δz_f2c_ab̄ᶻ at /home/alir_mit_edu/Oceananigans.jl/src/operators/ops_regular_cartesian_grid.jl:130
   [2] div_flux at /home/alir_mit_edu/Oceananigans.jl/src/operators/ops_regular_cartesian_grid.jl:139
   [3] calculate_interior_source_terms! at /home/alir_mit_edu/Oceananigans.jl/src/time_steppers.jl:171
   [4] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/r44Xz/src/context.jl:169
  Reason: unsupported dynamic function invocation (call to Cassette.overdub)
  Stacktrace:
   [1] δz_f2c_ab̄ᶻ at /home/alir_mit_edu/Oceananigans.jl/src/operators/ops_regular_cartesian_grid.jl:130
   [2] div_flux at /home/alir_mit_edu/Oceananigans.jl/src/operators/ops_regular_cartesian_grid.jl:139
   [3] calculate_interior_source_terms! at /home/alir_mit_edu/Oceananigans.jl/src/time_steppers.jl:176
   [4] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/r44Xz/src/context.jl:169
  Stacktrace:
   [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/validation.jl:114
   [2] macro expansion at /home/alir_mit_edu/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:216 [inlined]
   [3] #codegen#119(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:186
   [4] #codegen at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:0 [inlined]
   [5] #compile#118(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:47
   [6] #compile at ./none:0 [inlined]
   [7] #compile#117 at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:28 [inlined]
   [8] compile at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/compiler/driver.jl:28 [inlined] (repeats 2 times)
   [9] macro expansion at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:378 [inlined]
   [10] #cufunction#159(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUifyLoops, Symbol("##12#13")){typeof(Oceananigans.calculate_interior_source_terms!)}, ::Type{Tuple{RegularCartesianGrid{Float32,StepRangeLen{Float32,Float64,Float64}},PlanetaryConstants{Float64},LinearEquationOfState{Float64},ModelConfiguration{Float64},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global},Forcing{typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func)}}}) at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:347
   [11] cufunction(::Function, ::Type) at /home/alir_mit_edu/.julia/packages/CUDAnative/9rZcJ/src/execution.jl:347
   [12] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/r44Xz/src/GPUifyLoops.jl:113 [inlined]
   [13] macro expansion at ./gcutils.jl:87 [inlined]
   [14] #launch#46(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Vararg{Int64,N} where N},Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::Function, ::GPUifyLoops.CUDA, ::typeof(Oceananigans.calculate_interior_source_terms!), ::RegularCartesianGrid{Float32,StepRangeLen{Float32,Float64,Float64}}, ::Vararg{Any,N} where N) at /home/alir_mit_edu/.julia/packages/GPUifyLoops/r44Xz/src/GPUifyLoops.jl:110
   [15] (::getfield(GPUifyLoops, Symbol("#kw##launch")))(::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::typeof(GPUifyLoops.launch), ::GPUifyLoops.CUDA, ::typeof(Oceananigans.calculate_interior_source_terms!), ::RegularCartesianGrid{Float32,StepRangeLen{Float32,Float64,Float64}}, ::Vararg{Any,N} where N) at ./none:0
   [16] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/r44Xz/src/GPUifyLoops.jl:54 [inlined]
   [17] time_step!(::Model{GPU}, ::Int64, ::Int64) at /home/alir_mit_edu/Oceananigans.jl/src/time_steppers.jl:74
   [18] time_stepping_works(::GPU, ::Type) at /home/alir_mit_edu/Oceananigans.jl/test/test_time_stepping.jl:10
   [19] top-level scope at /home/alir_mit_edu/Oceananigans.jl/test/runtests.jl:240
   [20] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Test/src/Test.jl:1083
   [21] top-level scope at /home/alir_mit_edu/Oceananigans.jl/test/runtests.jl:236
   [22] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.1/Test/src/Test.jl:1083
   [23] top-level scope at /home/alir_mit_edu/Oceananigans.jl/test/runtests.jl:17

Overdub print/println

Can we overdub print and println as we do for sin, cos etc.? We may compromise with printed warnings instead of errors in #393.
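
For reference, the math overdubs are defined in roughly the style of the first method below; a println overdub could follow the same pattern for simple argument types. This is only a sketch (Ctx here is our own illustrative context, not the package's internal one) and assumes CUDAnative's @cuprintf:

using Cassette, CUDAnative

Cassette.@context Ctx

# the style in which sin, cos, etc. are already redirected to device intrinsics
@inline Cassette.overdub(::Ctx, ::typeof(Base.sin), x::Float64) = CUDAnative.sin(x)

# a println overdub could forward simple argument types to @cuprintf
@inline Cassette.overdub(::Ctx, ::typeof(Base.println), x::Float64) =
    CUDAnative.@cuprintf("%f\n", x)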

CPU exceptions/assertions?

Can we have a way to get informative error messages on the CPU (they could just throw a generic error on the GPU)? E.g.

@cpuerror "complicated error message: $x"

which on a CPU would expand to

error("complicated error message: $x")

and on the GPU would throw a generic error without worrying about string allocation.
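
A rough sketch of what such a macro could expand to is below. on_cpu() is a placeholder for however the host/device distinction would be queried; it is not existing GPUifyLoops API.

macro cpuerror(msg)
    quote
        if on_cpu()                  # placeholder predicate, assumed for this sketch
            error($(esc(msg)))       # interpolated, informative message on the CPU
        else
            error("kernel error")    # fixed message: no string building on the GPU
        end
    end
end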

Abstract host interface

Now that an abstract launch function (#20) is being considered to address #3, we should think about whether other functionality should be supported. Many distributed codes use:

  • device array allocation
  • multiple streams
  • synchronizing streams
  • pinned memory (async) transfers

Do we want GPUifyLoops to support some of these things with its abstract interface?
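
Purely as an illustration of what such an interface could look like (none of these names are GPUifyLoops API):

abstract type HostBackend end
struct CPUHost  <: HostBackend end
struct CUDAHost <: HostBackend end

allocate(::CPUHost, ::Type{T}, dims...) where {T} = Array{T}(undef, dims...)
synchronize(::CPUHost) = nothing

# The CUDA methods would forward to CuArrays/CUDAdrv, e.g.
# allocate(::CUDAHost, ::Type{T}, dims...) where {T} = CuArray{T}(undef, dims...)
# synchronize(::CUDAHost) = CUDAdrv.synchronize()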

Fuse multiply-add

LLVM needs to know that the fadd is fast so that the MulAdd pass can do its thing. How do we use fma without making the code ugly?
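
For reference, Base already exposes fused multiply-add in a few not-too-ugly spellings:

a, x, b = 2.0, 3.0, 1.0
y1 = fma(a, x, b)            # always emits a fused multiply-add
y2 = muladd(a, x, b)         # lets the compiler use an fma where it is profitable
y3 = @fastmath a * x + b     # marks the fmul/fadd fast so the MulAdd pass can fuse them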

Fix host hack for when CUDA not present

Here is the current hack:

const HAVE_CUDA = try
  using CUDAdrv
  using CUDAnative
  using CuArrays
  true
catch
  false
end
if !HAVE_CUDA
  macro cuStaticSharedMem(x...)
    :()
  end
  macro cuda(x...)
    :()
  end
end
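
One possible cleanup would be to load the CUDA stack lazily via Requires.jl instead of a try/catch at top level. A sketch (the UUID is CUDAnative's, as used elsewhere in this package; the rest is illustrative):

using Requires

function __init__()
    @require CUDAnative="be33ccc6-a3ff-5ff2-a52e-74243cff1e17" begin
        using .CUDAnative
        # define the CUDA-only functionality (e.g. the @launch CUDA() path) here
    end
end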

Change `Device` to `Backend` or something similar ?

The motivation is that there could be different backends for the same device, e.g. a single-threaded and a multithreaded backend for CPUs. The current usage of Device looks to me like it is really an abstraction for a backend. Ideally, those concepts should be separated. Moreover, we currently have

julia> CPU <: Device
true
julia> GPUifyLoops.isdevice(CPU())
false

which does not seem great.
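
A sketch of the separation being suggested, with placeholder names only:

abstract type Device end
abstract type Backend end

struct CPU <: Device end
struct GPU <: Device end

struct SingleThreadedCPU <: Backend end
struct MultiThreadedCPU  <: Backend end
struct CUDABackend       <: Backend end

# every backend maps back to the device it runs on
device(::SingleThreadedCPU) = CPU()
device(::MultiThreadedCPU)  = CPU()
device(::CUDABackend)       = GPU()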

Method definition overwritten warning

WARNING: Method definition overdub(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:524 overwritten in module GPUifyLoops at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:524.
WARNING: Method definition recurse(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:537 overwritten in module GPUifyLoops at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:537.
WARNING: Method definition overdub(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:524 overwritten in module GPUifyLoops at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:524.
WARNING: Method definition recurse(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:537 overwritten in module GPUifyLoops at /home/vsts/.julia/packages/Cassette/YCOeN/src/overdub.jl:537.

Using Cthulhu with GPUifyLoops on the GPU

First I tried something similar to this MWE

using GPUifyLoops, Cthulhu, CuArrays, CUDAnative

function kernel!(a, b)
  @inbounds @loop for i in (1:length(a);
                            (blockIdx().x - 1) * blockDim().x + threadIdx().x)
    a[i] = b[i]
  end
  nothing
end

a = CuArray(rand(Float32, 10 ^ 3))
b = similar(a)

threads = 256
blocks = ceil(Int, size(a, 1) / threads)
@descend @launch(CUDA(), threads = threads, blocks = blocks, kernel!(a, b))

and then I tried adding @descend just before the kernel call inside GPUifyLoops. Both approaches failed to let me step into kernel!. The second one got me much closer, so it seems like there may be two different issues? Should the above MWE work?

GPUify a CuArray of Static Arrays

using GPUifyLoops, CuArrays, StaticArrays
CuArrays.allowscalar(false)
function f2(p)
    sum(p)
end
_f2 = GPUifyLoops.contextualize(f2)
ps = CuArray([@SVector [10,28,8/3] for i in 1:10])
map(_f2,ps)

Support for ranges

using GPUifyLoops, CuArrays

function ff(p)
    ts = 0.0:0.1:100.0
end

_ff = GPUifyLoops.contextualize(ff)
map(_ff,CuArray(ones(1)))

gives

ERROR: InvalidIRError: compiling #23(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff)},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:123
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:128
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] call at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:447
 [2] fallback at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:445
 [3] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:271
 [4] print_to_string at strings/io.jl:131
 [5] string at strings/io.jl:168
 [6] __throw_gcd_overflow at intfuncs.jl:50
 [7] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [8] gcd at intfuncs.jl:47
 [9] lcm at intfuncs.jl:71
 [10] Colon at twiceprecision.jl:396
 [11] ff at REPL[2]:2
 [12] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:131
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] __throw_gcd_overflow at intfuncs.jl:50
 [2] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [3] gcd at intfuncs.jl:47
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:127
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] __throw_gcd_overflow at intfuncs.jl:50
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] gcd at intfuncs.jl:47
 [6] lcm at intfuncs.jl:71
 [7] Colon at twiceprecision.jl:396
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:123
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:128
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] call at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:447
 [2] fallback at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:445
 [3] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:271
 [4] print_to_string at strings/io.jl:131
 [5] string at strings/io.jl:168
 [6] checked_abs at checked.jl:118
 [7] lcm at intfuncs.jl:71
 [8] Colon at twiceprecision.jl:396
 [9] ff at REPL[2]:2
 [10] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [11] _broadcast_getindex_evalf at broadcast.jl:625
 [12] _broadcast_getindex at broadcast.jl:598
 [13] getindex at broadcast.jl:558
 [14] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:131
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] checked_abs at checked.jl:118
 [2] lcm at intfuncs.jl:71
 [3] Colon at twiceprecision.jl:396
 [4] ff at REPL[2]:2
 [5] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [6] _broadcast_getindex_evalf at broadcast.jl:625
 [7] _broadcast_getindex at broadcast.jl:598
 [8] getindex at broadcast.jl:558
 [9] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:123
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:128
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] call at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:447
 [2] fallback at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:445
 [3] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:271
 [4] print_to_string at strings/io.jl:131
 [5] string at strings/io.jl:168
 [6] Type at range.jl:343
 [7] Type at twiceprecision.jl:355
 [8] steprangelen_hp at twiceprecision.jl:320
 [9] floatrange at twiceprecision.jl:368
 [10] Colon at twiceprecision.jl:407
 [11] ff at REPL[2]:2
 [12] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:131
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] Type at range.jl:343
 [2] Type at twiceprecision.jl:355
 [3] steprangelen_hp at twiceprecision.jl:320
 [4] floatrange at twiceprecision.jl:368
 [5] Colon at twiceprecision.jl:407
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:127
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] Type at range.jl:343
 [4] Type at twiceprecision.jl:355
 [5] steprangelen_hp at twiceprecision.jl:320
 [6] floatrange at twiceprecision.jl:368
 [7] Colon at twiceprecision.jl:407
 [8] ff at REPL[2]:2
 [9] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [10] _broadcast_getindex_evalf at broadcast.jl:625
 [11] _broadcast_getindex at broadcast.jl:598
 [12] getindex at broadcast.jl:558
 [13] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:127
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] checked_abs at checked.jl:118
 [4] lcm at intfuncs.jl:71
 [5] Colon at twiceprecision.jl:396
 [6] ff at REPL[2]:2
 [7] #12 at C:\Users\accou\.julia\dev\GPUifyLoops\src\context.jl:169
 [8] _broadcast_getindex_evalf at broadcast.jl:625
 [9] _broadcast_getindex at broadcast.jl:598
 [10] getindex at broadcast.jl:558
 [11] #23 at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:50
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\compiler\validation.jl:114
 [2] macro expansion at C:\Users\accou\.julia\packages\TimerOutputs\7zSea\src\TimerOutput.jl:216 [inlined]
 [3] #codegen#116(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\compiler\driver.jl:186
 [4] #codegen at .\none:0 [inlined]
 [5] #compile#115(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\compiler\driver.jl:47
 [6] #compile#114 at .\none:0 [inlined]
 [7] compile at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\compiler\driver.jl:28 [inlined] (repeats 2 times)
 [8] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\execution.jl:378 [inlined]
 [9] #cufunction#156(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUArrays, Symbol("##23#24")), ::Type{Tuple{CuArrays.CuKernelState,CUDAnative.CuDeviceArray{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff)},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\execution.jl:347
 [10] cufunction(::Function, ::Type) at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\execution.jl:347
 [11] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\execution.jl:174 [inlined]
 [12] macro expansion at .\gcutils.jl:87 [inlined]
 [13] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\wU0tS\src\execution.jl:171 [inlined]
 [14] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},1}, ::Tuple{CuArray{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},1},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff)},Tuple{Base.Broadcast.Extruded{CuArray{Float64,1},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at C:\Users\accou\.julia\packages\CuArrays\PwSdF\src\gpuarray_interface.jl:59
 [15] gpu_call at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\abstract_gpu_interface.jl:151 [inlined]
 [16] gpu_call at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\abstract_gpu_interface.jl:128 [inlined]
 [17] copyto! at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\broadcast.jl:48 [inlined]
 [18] copyto! at .\broadcast.jl:842 [inlined]
 [19] copy(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff)},Tuple{CuArray{Float64,1}}}) at .\broadcast.jl:818
 [20] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff)},Tuple{CuArray{Float64,1}}}) at .\broadcast.jl:798
 [21] map(::Function, ::CuArray{Float64,1}) at C:\Users\accou\.julia\packages\GPUArrays\CjRPU\src\base.jl:9
 [22] top-level scope at REPL[4]:1

Compile-time regression

Loving the new @launch macro! But I noticed my kernels for Oceananigans.jl take much longer to compile since upgrading to GPUifyLoops v0.2.0, and one kernel in particular would never finish compiling: it got stuck in a deeply recursive call that ended with a StackOverflowError.

I was able to track this down to the following line

kernel = cufunction(contextualize(f), kernel_tt; compiler_kwargs...)

which I changed to

kernel = cufunction(f, kernel_tt; compiler_kwargs...) 

which got me up and running again, since none of my kernels require contextualize, and brought compilation time back down from 8.5 minutes to 3.5 minutes. I have no idea why the compiler got stuck when contextualize was turned on, as I didn't have any function calls that needed to be converted.

I was wondering if it would be a good idea to make contextualize an option when launching a kernel? From my understanding, it lets you write exp(x) in your kernel, which gets converted to CUDAnative.exp(x) when called via @launch CUDA(). I can see contextualize being useful, but it might be nice to enable it only for the kernels that need it, to speed up compilation.
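
A minimal, untested sketch of what I have in mind, roughly at the spot where launch currently calls cufunction. The function name and the should_contextualize keyword are made up for illustration; this is not the actual GPUifyLoops API:

using GPUifyLoops, CUDAnative

# Hypothetical opt-out: only rewrite the kernel through contextualize when asked to.
function launch_maybe_contextualized(f, kernel_tt; should_contextualize=true, compiler_kwargs...)
    kernel_f = should_contextualize ? GPUifyLoops.contextualize(f) : f
    return CUDAnative.cufunction(kernel_f, kernel_tt; compiler_kwargs...)
end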

I'm happy to work on this. In case contextualize should always be on, I can just keep this on my fork.

I was able to reproduce this with Julia 1.1 and the latest nightly [Version 1.2.0-DEV.562 (2019-03-28) Commit 11156024da].

Unfortunately I'm not sure what the issue was so I can't provide a minimal working example.

Verbose details included below.


Problematic kernel (I'm working on making it less ugly!)

function calculate_interior_source_terms!(grid::Grid, constants, eos, cfg, u, v, w, T, S, pHY′, Gu, Gv, Gw, GT, GS, F)
    Nx, Ny, Nz = grid.Nx, grid.Ny, grid.Nz
    Δx, Δy, Δz = grid.Δx, grid.Δy, grid.Δz
    fCor = constants.f
    ρ₀ = eos.ρ₀
    𝜈h, 𝜈v, κh, κv = cfg.𝜈h, cfg.𝜈v, cfg.κh, cfg.κv
    @loop for k in (1:grid.Nz; blockIdx().z)
        @loop for j in (1:grid.Ny; (blockIdx().y - 1) * blockDim().y + threadIdx().y)
            @loop for i in (1:grid.Nx; (blockIdx().x - 1) * blockDim().x + threadIdx().x)
                # u-momentum equation
                @inbounds Gu[i, j, k] = (-u∇u(u, v, w, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + fCor*avg_xy(v, Nx, Ny, i, j, k)
                                            - δx_c2f(pHY′, Nx, i, j, k) / (Δx * ρ₀)
                                            + 𝜈∇²u(u, 𝜈h, 𝜈v, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + F.u(u, v, w, T, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k))
                # v-momentum equation
                @inbounds Gv[i, j, k] = (-u∇v(u, v, w, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            - fCor*avg_xy(u, Nx, Ny, i, j, k)
                                            - δy_c2f(pHY′, Ny, i, j, k) / (Δy * ρ₀)
                                            + 𝜈∇²v(v, 𝜈h, 𝜈v, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + F.v(u, v, w, T, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k))
                # w-momentum equation: comment about how pressure and buoyancy are handled
                @inbounds Gw[i, j, k] = (-u∇w(u, v, w, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + 𝜈∇²w(w, 𝜈h, 𝜈v, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + F.w(u, v, w, T, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k))
                # temperature equation
                @inbounds GT[i, j, k] = (-div_flux(u, v, w, T, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + κ∇²(T, κh, κv, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + F.T(u, v, w, T, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k))
                # salinity equation
                @inbounds GS[i, j, k] = (-div_flux(u, v, w, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + κ∇²(S, κh, κv, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k)
                                            + F.S(u, v, w, T, S, Nx, Ny, Nz, Δx, Δy, Δz, i, j, k))
            end
        end
    end
    @synchronize
end

Launching the kernel

@launch CUDA() calculate_interior_source_terms!(grid, constants, eos, cfg, uvw..., TS..., pr.pHY′.data, Gⁿ..., forcing, threads=(Tx, Ty), blocks=(Bx, By, Bz))

Error

ERROR: StackOverflowError:
Stacktrace:                                                                                                                                                                                       
 [1] snca_compress!(::Array{Core.Compiler.Node,1}, ::Array{UInt64,1}, ::UInt64, ::UInt64) at ./compiler/ssair/domtree.jl:204
 [2] snca_compress!(::Array{Core.Compiler.Node,1}, ::Array{UInt64,1}, ::UInt64, ::UInt64) at ./compiler/ssair/domtree.jl:207 (repeats 25909 times)
 [3] SNCA(::Core.Compiler.CFG) at ./compiler/ssair/domtree.jl:258
 [4] construct_domtree(::Core.Compiler.CFG) at ./compiler/ssair/domtree.jl:104
 [5] run_passes(::Core.CodeInfo, ::Int64, ::Core.Compiler.OptimizationState) at ./compiler/ssair/driver.jl:120
 [6] optimize(::Core.Compiler.OptimizationState, ::Any) at ./compiler/optimize.jl:167
 [7] typeinf(::Core.Compiler.InferenceState) at ./compiler/typeinfer.jl:33
 [8] typeinf_edge(::Method, ::Any, ::Core.SimpleVector, ::Core.Compiler.InferenceState) at ./compiler/typeinfer.jl:503
 [9] abstract_call_method(::Method, ::Any, ::Core.SimpleVector, ::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:363
 [10] abstract_call_gf_by_type(::Any, ::Array{Any,1}, ::Any, ::Core.Compiler.InferenceState, ::Int64) at ./compiler/abstractinterpretation.jl:92
 [11] abstract_call(::Any, ::Nothing, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.InferenceState, ::Int64) at ./compiler/abstractinterpretation.jl:802
 [12] abstract_apply(::Any, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.InferenceState, ::Int64) at ./compiler/abstractinterpretation.jl:545
 [13] abstract_call(::Any, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.InferenceState, ::Int64) at ./compiler/abstractinterpretation.jl:593
 [14] abstract_call(::Any, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:592
 [15] abstract_eval_call(::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:831
 [16] abstract_eval(::Any, ::Array{Any,1}, ::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:901
 [17] typeinf_local(::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:1159
 [18] typeinf_nocycle(::Core.Compiler.InferenceState) at ./compiler/abstractinterpretation.jl:1215
 [19] typeinf(::Core.Compiler.InferenceState) at ./compiler/typeinfer.jl:12
 [20] typeinf at ./compiler/typeinfer.jl:8 [inlined]
 [21] typeinf_type(::Method, ::Any, ::Core.SimpleVector, ::Core.Compiler.Params) at ./compiler/typeinfer.jl:614
 [22] return_types(::Any, ::Any) at ./reflection.jl:1068
 [23] check_method(::CUDAnative.CompilerContext) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/validation.jl:12
 [24] compile(::CUDAnative.CompilerContext) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/driver.jl:67
 [25] #compile#109(::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.compile), ::VersionNumber, ::Any, ::Any) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/driver.jl:45
 [26] compile at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/driver.jl:43 [inlined]
 [27] #compile#108(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.compile), ::CUDAdrv.CuDevice, ::Function, ::Any) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/driver.jl:18
 [28] compile at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/compiler/driver.jl:16 [inlined]
 [29] macro expansion at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/execution.jl:269 [inlined]
 [30] #cufunction#123(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUifyLoops, Symbol("##16#17")){typeof(Oceananigans.calculate_interior_source_terms!)}, ::Type{Tuple{RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}},PlanetaryConstants{Float64},LinearEquationOfState{Float64},ModelConfiguration{Float64},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global},Forcing{typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func),typeof(Oceananigans.zero_func)}}}) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/execution.jl:240
 [31] cufunction(::Function, ::Type) at /home/alir_mit_edu/.julia/packages/CUDAnative/PFgO3/src/execution.jl:240
 [32] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/xOg1y/src/GPUifyLoops.jl:109 [inlined]
 [33] macro expansion at ./gcutils.jl:87 [inlined]
 [34] #launch#34(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Vararg{Int64,N} where N},Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::typeof(GPUifyLoops.launch), ::GPUifyLoops.CUDA, ::typeof(Oceananigans.calculate_interior_source_terms!), ::RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}, ::Vararg{Any,N} where N) at /home/alir_mit_edu/.julia/packages/GPUifyLoops/xOg1y/src/GPUifyLoops.jl:106
 [35] (::getfield(GPUifyLoops, Symbol("#kw##launch")))(::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::typeof(GPUifyLoops.launch), ::GPUifyLoops.CUDA, ::typeof(Oceananigans.calculate_interior_source_terms!), ::RegularCartesianGrid{Float64,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}, ::Vararg{Any,N} where N) at ./none:0
 [36] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/xOg1y/src/GPUifyLoops.jl:50 [inlined]
 [37] time_step!(::Model{GPU}, ::Int64, ::Int64) at /home/alir_mit_edu/gpu_test/Oceananigans.jl/src/time_steppers.jl:79
 [38] top-level scope at util.jl:156

Best way to GPUify nested loops over multi-dimensional arrays?

I'm hoping to use GPUifyLoops.jl to rewrite the time stepping for Oceananigans.jl.

I'm just starting very small (basically building off the simple example with the A[i] = 2*A[i] kernel), but since my model works with 3D arrays, I was wondering about the best way to iterate over arrays with sizes like 128×128×64. Here is what I have so far:

using BenchmarkTools
using CUDAnative, CuArrays
using GPUifyLoops

@inline incmod1(a, n) = a == n ? 1 : a+1  # Wrap around indexing for i+1.
δx(f, Nx, i, j, k) = f[incmod1(i, Nx), j, k] - f[i, j, k]  # x-difference operator

function time_stepping_kernel(::Val{Dev}, f, δxf) where Dev
    @setup Dev
    
    Nx, Ny, Nz = size(f)
    @loop for i in (1:Nx; threadIdx().x)
        for k in 1:Nz, j in 1:Ny
            δxf[i, j, k] = δx(f, Nx, i, j, k)
        end
    end
    
    @synchronize
end

time_step!(A::Array, B::Array) = time_stepping_kernel(Val(:CPU), A, B)

function time_step!(A::CuArray, B::CuArray)
    @cuda threads=512 time_stepping_kernel(Val(:GPU), A, B)
end

but then this kernel is quite slow

julia> Nx, Ny, Nz = 128, 128, 64
julia> xc, yc = rand(Nx, Ny, Nz), rand(Nx, Ny, Nz);
julia> xg, yg = cu(rand(Nx, Ny, Nz)), cu(rand(Nx, Ny, Nz));

julia> @benchmark time_step!($xc, $yc)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.830 ms (0.00% GC)
  median time:      5.843 ms (0.00% GC)
  mean time:        5.874 ms (0.00% GC)
  maximum time:     7.908 ms (0.00% GC)
  --------------
  samples:          851
  evals/sample:     1

julia> @benchmark time_step!($xg, $yg)
BenchmarkTools.Trial: 
  memory estimate:  288 bytes
  allocs estimate:  7
  --------------
  minimum time:     3.728 μs (0.00% GC)
  median time:      4.657 ms (0.00% GC)
  mean time:        4.344 ms (0.00% GC)
  maximum time:     4.663 ms (0.00% GC)
  --------------
  samples:          144
  evals/sample:     8

while I was hoping for a ~10x speedup compared to single-core CPU performance (see this thread on Julia Discourse). I'm guessing it's slow because I'm only using a single block? Is there a way to speed up time_stepping_kernel with GPUifyLoops.jl?

I also have a few questions:

  1. Is it possible to automatically use as many threads as possible? I think broadcasting over CuArrays does this?
  2. I should probably read up again on CUDA threads and blocks, but is it possible to use blockIdx, blockDim, and gridDim with GPUifyLoops.jl? (A sketch of what I have in mind is included after these questions.)
  3. I get a CUDA error: too many resources requested for launch (code #701, ERROR_LAUNCH_OUT_OF_RESOURCES) if I ask for too many threads (e.g. 1024). Googling around suggests that not enough registers are available. I assume I should be using blocks as well as threads to avoid jamming too many threads into one block?
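
Here is an untested sketch of what I imagine a multi-block version would look like, in the same style as the other @loop examples in this thread (the thread/block counts are only illustrative):

function time_stepping_kernel2(::Val{Dev}, f, δxf) where Dev
    @setup Dev

    Nx, Ny, Nz = size(f)
    # Spread i over blocks and threads in the x direction.
    @loop for i in (1:Nx; (blockIdx().x - 1) * blockDim().x + threadIdx().x)
        for k in 1:Nz, j in 1:Ny
            δxf[i, j, k] = δx(f, Nx, i, j, k)
        end
    end

    @synchronize
end

# e.g. 128 threads per block and enough blocks to cover Nx:
# @cuda threads=128 blocks=cld(Nx, 128) time_stepping_kernel2(Val(:GPU), xg, yg)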

Thanks so much!

EDIT: Forgot to mention that this is with Julia 1.1, CUDA 9.2, and a Tesla V100.

Contextualize example doesn't seem to work

I just tried the contextualize example, and I'm getting this:

GPUifyLoops.jl>julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.0 (2018-08-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> include("examples/contextualize.jl")
[ Info: Recompiling stale cache file C:\Users\kawcz\.julia\compiled\v1.0\GPUifyLoops\BxvPf.ji for GPUifyLoops [ba82f77b-6841-5d2e-bd9f-4daf811aec27]
WARNING: Method definition overdub(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at C:\Users\kawcz\.julia\packages\Cassette\1rVkq\src\overdub.jl:500 overwritten in module GPUifyLoops at C:\Users\kawcz\.julia\packages\Cassette\1rVkq\src\overdub.jl:500.
WARNING: Method definition recurse(Cassette.Context{N, M, T, P, B, H} where H<:Union{Cassette.DisableHooks, Nothing} where B<:Union{Nothing, Base.IdDict{Module, Base.Dict{Symbol, Cassette.BindingMeta}}} where P<:Cassette.AbstractPass where T<:Union{Nothing, Cassette.Tag{N, X, E} where E where X where N<:Cassette.AbstractContextName} where M where N<:Cassette.AbstractContextName, Any...) in module Cassette at C:\Users\kawcz\.julia\packages\Cassette\1rVkq\src\overdub.jl:512 overwritten in module GPUifyLoops at C:\Users\kawcz\.julia\packages\Cassette\1rVkq\src\overdub.jl:512.
[ Info: Recompiling stale cache file C:\Users\kawcz\.julia\compiled\v1.0\CuArrays\7YFE0.ji for CuArrays [3a865a2d-5b23-5a0f-bc46-62713ec82fae]
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called sin(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:30, maybe you intended to call sin(x::Float32) in CUDAnative at C:\Users\kawcz\.julia\packages\CUDAnative\Mdd3w\src\device\libdevice.jl:13 instead?
│    Stacktrace:
│     [1] sin at special/trig.jl:30
│     [2] #24 at C:\Users\kawcz\Dropbox\Caltech\work\dev\GPUifyLoops.jl\src\context.jl:127
│     [3] kernel! at C:\Users\kawcz\Dropbox\Caltech\work\dev\GPUifyLoops.jl\examples\contextualize.jl:8
└ @ CUDAnative C:\Users\kawcz\.julia\packages\CUDAnative\Mdd3w\src\compiler\irgen.jl:68
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception =
│    You called sin(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:30, maybe you intended to call sin(x::Float32) in CUDAnative at C:\Users\kawcz\.julia\packages\CUDAnative\Mdd3w\src\device\libdevice.jl:13 instead?
│    Stacktrace:
│     [1] sin at special/trig.jl:30
│     [2] #24 at C:\Users\kawcz\Dropbox\Caltech\work\dev\GPUifyLoops.jl\src\context.jl:127
│     [3] kernel! at C:\Users\kawcz\Dropbox\Caltech\work\dev\GPUifyLoops.jl\examples\contextualize.jl:8
└ @ CUDAnative C:\Users\kawcz\.julia\packages\CUDAnative\Mdd3w\src\compiler\irgen.jl:68

Is there some other way that I need to launch Julia to get this working?

Allow users to disable fp contract optimizations ?

Recently I ran into a surprising (for me) behavior, demonstrated by the MWE below:

using CuArrays, CUDAnative, GPUifyLoops

function kernel(rho, T)
  P = rho[1] * T[1]

  if (abs(P - P) > 1e-16)
    @cuprintf("diff = %.16e\n", P - P)
  end
  nothing
end

rho = CuArray([1e-1])
T = CuArray([300.0])
@launch CUDA() kernel(rho, T, threads=1, blocks=1)

with the output

diff = 1.6653345369377348e-15

Basically, if my understanding of the generated PTX is correct, what happens is that P - P is calculated as fma(rho[1], T[1], -P), which is probably not the smartest move by the compiler. However, clang with LLVM 6.0.1 also does this for CUDA C, so I guess it's expected. The issue goes away if I disable contractions. In clang there's an option for that called -ffp-contract. Maybe adding a similar option to GPUifyLoops would be helpful for debugging?
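
For reference, the effect itself is easy to reproduce on the CPU with an explicit fma; this is just an illustration of the rounding, not GPUifyLoops code:

# A fused multiply-add keeps the exact product of rho*T, so subtracting the
# separately rounded product P leaves the rounding error rather than zero.
rho, T = 1e-1, 300.0
P = rho * T                 # rounds to 30.0
diff = fma(rho, T, -P)      # what the contracted PTX effectively computes
println(diff)               # 1.6653345369377348e-15, not 0.0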

For convenience, the generated PTX can be found here:

https://gist.github.com/mwarusz/5ab4ac99b02e77b54178cd95c9820d7b

`@scratch` bug

I have found a bug in @scratch that I can't figure out. I don't think it is an issue with the macro because when I use an MArray directly I still see the bug. However, I was unable to reproduce the bug without GPUifyLoops.

Below is as far as I have been able to reduce the offending code. Any help would be greatly appreciated.

using GPUifyLoops
using StaticArrays

const HAVE_CUDA = try
    using CUDAdrv
    using CUDAnative
    using CuArrays
    true
catch
    false
end
if !HAVE_CUDA
    macro cuStaticSharedMem(x...)
        :()
    end
    macro cuda(x...)
        :()
    end
end

function knl!(::Val{DEV}, a) where {DEV}
  @setup DEV

  # r_a = @scratch Float32 (4, 5, 5) 2
  r_a = MArray{Tuple{4}, Float32}(undef)

  for k in 1:4
    r_a[k] = 0
  end

  nothing
end

function main()
  if HAVE_CUDA
    a = zeros(Float32, 4000)
    d_a = CuArray(a)

    @cuda(threads=(5, 5), blocks=4000, knl!(Val(:GPU), d_a))

    a .= d_a
  end

  nothing
end

main()

This is the error message that I am receiving

$ julia --project=. bug.jl
ERROR: LoadError: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] macro expansion at /home/lwilcox/.julia/packages/CUDAdrv/JWljj/src/base.jl:147 [inlined]
 [2] #download!#11(::Bool, ::Function, ::Ptr{Float32}, ::CUDAdrv.Mem.Buffer, ::Int64, ::CuStream) at /home/lwilcox/.julia/packages/CUDAdrv/JWljj/src/memory.jl:254
 [3] download! at /home/lwilcox/.julia/packages/CUDAdrv/JWljj/src/memory.jl:248 [inlined] (repeats 2 times)
 [4] unsafe_copyto! at /home/lwilcox/.julia/packages/CuArrays/PD3UJ/src/array.jl:166 [inlined]
 [5] copyto! at /home/lwilcox/.julia/packages/GPUArrays/t8tJB/src/abstractarray.jl:110 [inlined]
 [6] copyto! at ./broadcast.jl:838 [inlined]
 [7] copyto! at ./broadcast.jl:797 [inlined]
 [8] materialize! at ./broadcast.jl:756 [inlined]
 [9] main() at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:41
 [10] top-level scope at none:0
 [11] include at ./boot.jl:326 [inlined]
 [12] include_relative(::Module, ::String) at ./loading.jl:1038
 [13] include(::Module, ::String) at ./sysimg.jl:29
 [14] exec_options(::Base.JLOptions) at ./client.jl:267
 [15] _start() at ./client.jl:436
in expression starting at /home/lwilcox/research/code/nps_julialab_2019_Jan/bug.jl:47
error in running finalizer: CUDAdrv.CuError(code=700, meta=nothing)

cuda-memcheck spews a bunch of errors like the following

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 1
=========     at 0x00000018 in ./pointer.jl:118:ptxcall_knl__1
=========     by thread (4,4,0) in block (3733,0,0)
=========     Address 0x00000003 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/nvidia/libcuda.so (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:[0x7fb1d8dfb875]
=========     Host Frame:[0x7fb1d8dfbab4]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_fptr_trampoline + 0x3c) [0x4908c]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x485d6]
=========     Host Frame:[0x7fb1d8dfb61c]
=========     Host Frame:[0x7fb1d8dfb6aa]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_fptr_trampoline + 0x3c) [0x4908c]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x485d6]
=========     Host Frame:[0x7fb1d8dbcd9f]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_fptr_trampoline + 0x3c) [0x4908c]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x485d6]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x1adeb0]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x1adb80]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x1ae634]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x1aee9f]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x5fbfc]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x1af93d]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x7ebcc]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 [0x540a6]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_load + 0x53) [0x7fbf3]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/lib/julia/sys.so [0x694f8f]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/lib/julia/sys.so [0x6a084d]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x485d6]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/lib/julia/sys.so [0x69fe35]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/lib/julia/sys.so [0x6a05d9]
=========     Host Frame:/home/lwilcox/opt/julia/1.1.0/bin/../lib/libjulia.so.1 (jl_apply_generic + 0x136) [0x485d6]
=========     Host Frame:julia [0x191e]
=========     Host Frame:julia [0x1524]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:julia [0x15c5]

The Project.toml and Manifest.toml that I used are below:

[deps]
CUDAdrv = "c5f51814-7f29-56b8-a69c-e4d8f6be1fde"
CUDAnative = "be33ccc6-a3ff-5ff2-a52e-74243cff1e17"
CuArrays = "3a865a2d-5b23-5a0f-bc46-62713ec82fae"
GPUifyLoops = "8c6e557c-24b2-11e9-113b-4b4a3dc760fa"
StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
# This file is machine-generated - editing it directly is not advised

[[AbstractFFTs]]
deps = ["Compat", "LinearAlgebra"]
git-tree-sha1 = "8d59c3b1463b5e0ad05a3698167f85fac90e184d"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "0.3.2"

[[Adapt]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "53d8fec4f662088c1202530e338a11a919407f3b"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "0.4.2"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[BinDeps]]
deps = ["Compat", "Libdl", "SHA", "URIParser"]
git-tree-sha1 = "12093ca6cdd0ee547c39b1870e0c9c3f154d9ca9"
uuid = "9e28174c-4ba2-5203-b857-d8d62c4213ee"
version = "0.8.10"

[[BinaryProvider]]
deps = ["Libdl", "Pkg", "SHA", "Test"]
git-tree-sha1 = "055eb2690182ebc31087859c3dd8598371d3ef9e"
uuid = "b99e7846-7c00-51b0-8f62-c81ae34c0232"
version = "0.5.3"

[[CUDAapi]]
deps = ["Libdl", "Logging", "Test"]
git-tree-sha1 = "350cde12f25d297609369a9acb4c6214211675db"
uuid = "3895d2a7-ec45-59b8-82bb-cfc6a382f9b3"
version = "0.5.4"

[[CUDAdrv]]
deps = ["CUDAapi", "Libdl", "Printf", "Test"]
git-tree-sha1 = "dfe527ba231b6b699f879d1d384c1d39b49fc005"
uuid = "c5f51814-7f29-56b8-a69c-e4d8f6be1fde"
version = "1.0.1"

[[CUDAnative]]
deps = ["Adapt", "CUDAapi", "CUDAdrv", "InteractiveUtils", "LLVM", "Libdl", "Pkg", "Printf", "Statistics", "Test"]
git-tree-sha1 = "92e3ec4f4458e43cc17be4388b68690dbef24f31"
uuid = "be33ccc6-a3ff-5ff2-a52e-74243cff1e17"
version = "1.0.1"

[[CommonSubexpressions]]
deps = ["Test"]
git-tree-sha1 = "efdaf19ab11c7889334ca247ff4c9f7c322817b0"
uuid = "bbf7d656-a473-5ed7-a52c-81e309532950"
version = "0.2.0"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "49269e311ffe11ac5b334681d212329002a9832a"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "1.5.1"

[[Conda]]
deps = ["Compat", "JSON", "VersionParsing"]
git-tree-sha1 = "b625d802587c2150c279a40a646fba63f9bd8187"
uuid = "8f4d0f93-b110-5947-807f-2305c1781a2d"
version = "1.2.0"

[[CuArrays]]
deps = ["AbstractFFTs", "Adapt", "CUDAapi", "CUDAdrv", "CUDAnative", "DiffRules", "ForwardDiff", "GPUArrays", "LinearAlgebra", "MacroTools", "NNlib", "Pkg", "Printf", "Random", "SparseArrays", "Test"]
git-tree-sha1 = "c1cd8792ca783987fcba2ed0d6b3b58176e6b13e"
uuid = "3a865a2d-5b23-5a0f-bc46-62713ec82fae"
version = "0.9.1"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[DiffResults]]
deps = ["Compat", "StaticArrays"]
git-tree-sha1 = "db8acf46717b13d6c48deb7a12007c7f85a70cf7"
uuid = "163ba53b-c6d8-5494-b064-1a9d43ac40c5"
version = "0.0.3"

[[DiffRules]]
deps = ["Random", "Test"]
git-tree-sha1 = "09d69da75967ec48a8b1ad0897ec9144ee052bf9"
uuid = "b552c78f-8df3-52c6-915a-8e097449b14b"
version = "0.0.8"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[FFTW]]
deps = ["AbstractFFTs", "BinaryProvider", "Compat", "Conda", "Libdl", "LinearAlgebra", "Reexport", "Test"]
git-tree-sha1 = "29cda58afbf62f35b1a094882ad6c745a47b2eaa"
uuid = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341"
version = "0.2.4"

[[FillArrays]]
deps = ["LinearAlgebra", "Random", "SparseArrays", "Test"]
git-tree-sha1 = "471b7e33dc9c9c5b9170045dd57c8ba0927b2918"
uuid = "1a297f60-69ca-5386-bcde-b61e274b549b"
version = "0.4.0"

[[ForwardDiff]]
deps = ["CommonSubexpressions", "DiffResults", "DiffRules", "InteractiveUtils", "LinearAlgebra", "NaNMath", "Random", "SparseArrays", "SpecialFunctions", "StaticArrays", "Test"]
git-tree-sha1 = "e393bd3b9102659fb24fe88caedec41f2bc2e7de"
uuid = "f6369f11-7733-5829-9624-2563aa707210"
version = "0.10.2"

[[GPUArrays]]
deps = ["Adapt", "FFTW", "FillArrays", "LinearAlgebra", "Printf", "Random", "Serialization", "StaticArrays", "Test"]
git-tree-sha1 = "2b96d7f25fbea82c08a736d78cbf14df8d2100a5"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "0.6.1"

[[GPUifyLoops]]
deps = ["Requires", "StaticArrays"]
git-tree-sha1 = "b1e8ec8003400ef8dd8cc1efae8c612e36a608fa"
repo-rev = "master"
repo-url = "https://github.com/vchuravy/GPUifyLoops.jl"
uuid = "8c6e557c-24b2-11e9-113b-4b4a3dc760fa"
version = "0.1.0"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[JSON]]
deps = ["Dates", "Distributed", "Mmap", "Sockets", "Test", "Unicode"]
git-tree-sha1 = "1f7a25b53ec67f5e9422f1f551ee216503f4a0fa"
uuid = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"
version = "0.20.0"

[[LLVM]]
deps = ["InteractiveUtils", "Libdl", "Printf", "Test", "Unicode"]
git-tree-sha1 = "d98bd8e6e56591caceb7db300a6877fb6daca6ba"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "1.0.0"

[[LibGit2]]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Compat"]
git-tree-sha1 = "c443e1c8d58a4e9f61b708ad0a88286c7042145b"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.4.4"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[NNlib]]
deps = ["Libdl", "LinearAlgebra", "MacroTools", "Requires", "Test"]
git-tree-sha1 = "51330bb45927379007e089997bf548fbe232589d"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.4.3"

[[NaNMath]]
deps = ["Compat"]
git-tree-sha1 = "ce3b85e484a5d4c71dd5316215069311135fa9f2"
uuid = "77ba4419-2d1f-58cd-9bb1-8ffee604a2e3"
version = "0.3.2"

[[Pkg]]
deps = ["Dates", "LibGit2", "Markdown", "Printf", "REPL", "Random", "SHA", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
deps = ["Pkg"]
git-tree-sha1 = "7b1d07f411bc8ddb7977ec7f377b97b158514fe0"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "0.2.0"

[[Requires]]
deps = ["Test"]
git-tree-sha1 = "f6fbf4ba64d295e146e49e021207993b6b48c7d1"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "0.5.2"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[SpecialFunctions]]
deps = ["BinDeps", "BinaryProvider", "Libdl", "Test"]
git-tree-sha1 = "0b45dc2e45ed77f445617b99ff2adf0f5b0f23ea"
uuid = "276daf66-3868-5448-9aa4-cd146d93841b"
version = "0.7.2"

[[StaticArrays]]
deps = ["InteractiveUtils", "LinearAlgebra", "Random", "Statistics", "Test"]
git-tree-sha1 = "1eb114d6e23a817cd3e99abc3226190876d7c898"
uuid = "90137ffa-7385-5640-81b9-e52037218182"
version = "0.10.2"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[Test]]
deps = ["Distributed", "InteractiveUtils", "Logging", "Random"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[URIParser]]
deps = ["Test", "Unicode"]
git-tree-sha1 = "6ddf8244220dfda2f17539fa8c9de20d6c575b69"
uuid = "30578b45-9adc-5946-b283-645ec420af67"
version = "0.4.0"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[VersionParsing]]
deps = ["Compat"]
git-tree-sha1 = "c9d5aa108588b978bd859554660c8a5c4f2f7669"
uuid = "81def892-9a0e-5fdd-b105-ffc91e053289"
version = "1.1.3"

Slower than CUDA kernels

Based on examples by @skandalaCLIMA
https://gist.github.com/simonbyrne/0ade43d68c79992a3b0cfddc1a6165f9

  • static shared memory seems to be slightly slower than dynamic shared memory
  • GPUifyLoops kernels are much slower than CUDAnative.jl kernels
$ julia --project testgpu.jl
qm1 = 6
size(phir) = (1, 6)
size(phis) = (1, 6)
size(phit) = (1, 6)
CPU launch + time consumed
===========================
  49.237 μs (1292 allocations: 24.89 KiB)
vout = [21.585306429074123]


GPU launch version 1 (static shared memory + val) + time consumed
===========================
  30.564 μs (87 allocations: 3.98 KiB)
d_vout = [21.585306429074112]


GPU launch version 2 + (static shared memory + val) + time consumed
===========================
timing using btime
  26.145 μs (85 allocations: 3.92 KiB)
d_vout = [21.585306429074123]


GPU launch version 1 with dynamic shared memory with size specified at launch + time consumed
===========================
  23.322 μs (46 allocations: 1.48 KiB)
d_vout = [21.585306429074112]


GPU launch version 2 with dynamic shared memory with size specified at launch + time consumed
===========================
  24.152 μs (44 allocations: 1.45 KiB)
d_vout = [21.585306429074123]


GPUifyLoops launch version 1 + time consumed
===========================
device  = CUDA()
  46.484 μs (130 allocations: 6.19 KiB)
d_vout = [21.585306429074112]


GPUifyLoops launch version 2 + time consumed
===========================
device  = CUDA()
  45.152 μs (130 allocations: 6.05 KiB)
d_vout = [21.585306429074123]

Returning something else makes it go boom!

using GPUifyLoops, SimpleDiffEq, CuArrays, StaticArrays
ps = CuArray([@SVector [10f0,28f0,8/3f0] for i in 1:10])
function loop(u, p, t)
    @inbounds begin
        σ = p[1]; ρ = p[2]; β = p[3]
        du1 = σ*(u[2]-u[1])
        du2 = u[1]*(ρ-u[3]) - u[2]
        du3 = u[1]*u[2] - β*u[3]
        return SVector{3}(du1, du2, du3)
    end
end
function ff5(p)
    u0 = @SVector [10.0f0, 10.0f0, 10.0f0]
    tspan = (0.0f0, 100.0f0)
    dt = 0.1f0
    tf = tspan[2]
    ts = tspan[1]:dt:tspan[2]

    u = u0
    k7 = loop(u, p, ts[1])

    cs, as, btildes, rs = SimpleDiffEq._build_atsit5_caches(eltype(u0))
    c1, c2, c3, c4, c5, c6 = cs
    a21, a31, a32, a41, a42, a43, a51, a52, a53, a54,
    a61, a62, a63, a64, a65, a71, a72, a73, a74, a75, a76 = as

    uprev = u; k1 = k7
    tmp = uprev+dt*a21*k1
    u
end

_ff5 = GPUifyLoops.contextualize(ff5)
map(_ff5,ps)

fails with a long error message (included at the bottom). But what if we don't return u?

using GPUifyLoops, SimpleDiffEq, CuArrays, StaticArrays
ps = CuArray([@SVector [10f0,28f0,8/3f0] for i in 1:10])
function loop(u, p, t)
    @inbounds begin
        σ = p[1]; ρ = p[2]; β = p[3]
        du1 = σ*(u[2]-u[1])
        du2 = u[1]*(ρ-u[3]) - u[2]
        du3 = u[1]*u[2] - β*u[3]
        return SVector{3}(du1, du2, du3)
    end
end
function ff5(p)
    u0 = @SVector [10.0f0, 10.0f0, 10.0f0]
    tspan = (0.0f0, 100.0f0)
    dt = 0.1f0
    tf = tspan[2]
    ts = tspan[1]:dt:tspan[2]

    u = u0
    k7 = loop(u, p, ts[1])

    cs, as, btildes, rs = SimpleDiffEq._build_atsit5_caches(eltype(u0))
    c1, c2, c3, c4, c5, c6 = cs
    a21, a31, a32, a41, a42, a43, a51, a52, a53, a54,
    a61, a62, a63, a64, a65, a71, a72, a73, a74, a75, a76 = as

    uprev = u; k1 = k7
    tmp = uprev+dt*a21*k1
end

_ff5 = GPUifyLoops.contextualize(ff5)
map(_ff5,ps)

This works 👍.

The error message:

julia> map(_ff5,ps)
ERROR: InvalidIRError: compiling #23(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{SArray{Tuple{3},Float32,1,3},1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff5)},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{SArray{Tuple{3},Float32,1,3},1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:123
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:128
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to getproperty)
Stacktrace:
 [1] call at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:447
 [2] fallback at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:445
 [3] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\context.jl:271
 [4] print_to_string at strings/io.jl:131
 [5] string at strings/io.jl:168
 [6] result_join at broadcast.jl:446
 [7] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [8] result_style at broadcast.jl:430
 [9] combine_styles at broadcast.jl:406
 [10] broadcasted at broadcast.jl:1216
 [11] broadcast at broadcast.jl:752
 [12] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [13] * at operators.jl:529
 [14] ff5 at REPL[4]:17
 [15] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [16] _broadcast_getindex_evalf at broadcast.jl:625
 [17] _broadcast_getindex at broadcast.jl:598
 [18] getindex at broadcast.jl:558
 [19] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:131
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] result_join at broadcast.jl:446
 [2] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [3] result_style at broadcast.jl:430
 [4] combine_styles at broadcast.jl:406
 [5] broadcasted at broadcast.jl:1216
 [6] broadcast at broadcast.jl:752
 [7] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [8] * at operators.jl:529
 [9] ff5 at REPL[4]:17
 [10] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [11] _broadcast_getindex_evalf at broadcast.jl:625
 [12] _broadcast_getindex at broadcast.jl:598
 [13] getindex at broadcast.jl:558
 [14] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:124
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] print_to_string at strings/io.jl:127
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] print_to_string at strings/io.jl:129
 [2] string at strings/io.jl:168
 [3] result_join at broadcast.jl:446
 [4] overdub at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:0
 [5] result_style at broadcast.jl:430
 [6] combine_styles at broadcast.jl:406
 [7] broadcasted at broadcast.jl:1216
 [8] broadcast at broadcast.jl:752
 [9] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [10] * at operators.jl:529
 [11] ff5 at REPL[4]:17
 [12] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [13] _broadcast_getindex_evalf at broadcast.jl:625
 [14] _broadcast_getindex at broadcast.jl:598
 [15] getindex at broadcast.jl:558
 [16] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub(overdub_context::Cassette.Context, overdub_arguments...) in GPUifyLoops at C:\Users\accou\.julia\packages\Cassette\xggAf\src\overdub.jl:508)
Stacktrace:
 [1] result_style at broadcast.jl:430
 [2] combine_styles at broadcast.jl:406
 [3] broadcasted at broadcast.jl:1216
 [4] broadcast at broadcast.jl:752
 [5] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [6] * at operators.jl:529
 [7] ff5 at REPL[4]:17
 [8] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [9] _broadcast_getindex_evalf at broadcast.jl:625
 [10] _broadcast_getindex at broadcast.jl:598
 [11] getindex at broadcast.jl:558
 [12] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] broadcasted at broadcast.jl:1216
 [2] broadcast at broadcast.jl:752
 [3] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [4] * at operators.jl:529
 [5] ff5 at REPL[4]:17
 [6] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [7] _broadcast_getindex_evalf at broadcast.jl:625
 [8] _broadcast_getindex at broadcast.jl:598
 [9] getindex at broadcast.jl:558
 [10] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_f__apply)
Stacktrace:
 [1] broadcasted at broadcast.jl:1216
 [2] broadcast at broadcast.jl:752
 [3] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [4] * at operators.jl:529
 [5] ff5 at REPL[4]:17
 [6] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [7] _broadcast_getindex_evalf at broadcast.jl:625
 [8] _broadcast_getindex at broadcast.jl:598
 [9] getindex at broadcast.jl:558
 [10] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] broadcast at broadcast.jl:752
 [2] * at C:\Users\accou\.julia\packages\StaticArrays\VyRz3\src\linalg.jl:25
 [3] * at operators.jl:529
 [4] ff5 at REPL[4]:17
 [5] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [6] _broadcast_getindex_evalf at broadcast.jl:625
 [7] _broadcast_getindex at broadcast.jl:598
 [8] getindex at broadcast.jl:558
 [9] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] * at operators.jl:529
 [2] ff5 at REPL[4]:17
 [3] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [4] _broadcast_getindex_evalf at broadcast.jl:625
 [5] _broadcast_getindex at broadcast.jl:598
 [6] getindex at broadcast.jl:558
 [7] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported call to the Julia runtime (call to jl_f__apply)
Stacktrace:
 [1] * at operators.jl:529
 [2] ff5 at REPL[4]:17
 [3] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [4] _broadcast_getindex_evalf at broadcast.jl:625
 [5] _broadcast_getindex at broadcast.jl:598
 [6] getindex at broadcast.jl:558
 [7] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] ff5 at REPL[4]:17
 [2] #12 at C:\Users\accou\.julia\packages\GPUifyLoops\6pLga\src\context.jl:185
 [3] _broadcast_getindex_evalf at broadcast.jl:625
 [4] _broadcast_getindex at broadcast.jl:598
 [5] getindex at broadcast.jl:558
 [6] #23 at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:50
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\compiler\validation.jl:114
 [2] macro expansion at C:\Users\accou\.julia\packages\TimerOutputs\7zSea\src\TimerOutput.jl:216 [inlined]
 [3] #codegen#119(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\compiler\driver.jl:186
 [4] #codegen at .\none:0 [inlined]
 [5] #compile#118(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\compiler\driver.jl:47
 [6] #compile#117 at .\none:0 [inlined]
 [7] #compile at .\none:0 [inlined] (repeats 2 times)
 [8] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\execution.jl:380 [inlined]
 [9] #cufunction#159(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUArrays, Symbol("##23#24")), ::Type{Tuple{CuArrays.CuKernelState,CUDAnative.CuDeviceArray{SArray{Tuple{3},Float32,1,3},1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff5)},Tuple{Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{SArray{Tuple{3},Float32,1,3},1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}}}) at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\execution.jl:348
 [10] cufunction(::Function, ::Type) at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\execution.jl:348
 [11] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\execution.jl:174 [inlined]
 [12] macro expansion at .\gcutils.jl:87 [inlined]
 [13] macro expansion at C:\Users\accou\.julia\packages\CUDAnative\ytV2j\src\execution.jl:171 [inlined]
 [14] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{SArray{Tuple{3},Float32,1,3},1}, ::Tuple{CuArray{SArray{Tuple{3},Float32,1,3},1},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff5)},Tuple{Base.Broadcast.Extruded{CuArray{SArray{Tuple{3},Float32,1,3},1},Tuple{Bool},Tuple{Int64}}}}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at C:\Users\accou\.julia\packages\CuArrays\r9ana\src\gpuarray_interface.jl:59
 [15] gpu_call at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\abstract_gpu_interface.jl:151 [inlined]
 [16] gpu_call at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\abstract_gpu_interface.jl:128 [inlined]
 [17] copyto! at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\broadcast.jl:48 [inlined]
 [18] copyto! at .\broadcast.jl:842 [inlined]
 [19] copy(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Tuple{Base.OneTo{Int64}},getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff5)},Tuple{CuArray{SArray{Tuple{3},Float32,1,3},1}}}) at .\broadcast.jl:818
 [20] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.ArrayStyle{CuArray},Nothing,getfield(GPUifyLoops, Symbol("##12#13")){typeof(ff5)},Tuple{CuArray{SArray{Tuple{3},Float32,1,3},1}}}) at .\broadcast.jl:798
 [21] map(::Function, ::CuArray{SArray{Tuple{3},Float32,1,3},1}) at C:\Users\accou\.julia\packages\GPUArrays\pJw1Y\src\base.jl:9
 [22] top-level scope at REPL[6]:1

Launching a CUDA kernel with an OffsetArray{CuArray} does not work.

I'm trying to use OffsetArrays to implement halo regions (CliMA/Oceananigans.jl#167), but when calling CUDA kernels with an OffsetArray as one of their arguments I get a Cassette error (see below).

The kernel:

using CuArrays, OffsetArrays, GPUifyLoops

# Adapting OffsetArrays to work with CUDA kernels.
import Adapt
Adapt.adapt_structure(to, x::OffsetArray) = OffsetArray(Adapt.adapt(to, parent(x)), x.offsets)

function fill_halo_regions_x!(Nx, Ny, Nz, f)    
    @loop for k in (1:Nz; blockIdx().z)
        @loop for j in (1:Ny; (blockIdx().y - 1) * blockDim().y + threadIdx().y)
            f[0,    j, k] = f[Nx, j, k]
            f[Nx+1, j, k] = f[1, j, k]
        end
    end
    
    @synchronize
end
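
As a quick sanity check of the adaptor above (a sketch added here for illustration, not part of the original report), one can convert an OffsetArray wrapping a CuArray by hand; this assumes CUDAnative is loaded and that CUDAnative.cudaconvert is the conversion the launch path applies to kernel arguments:

using CUDAnative

# The adaptor should recurse through the OffsetArray and convert the parent
# CuArray to a CuDeviceArray, leaving the offsets intact.
A  = OffsetArray(CuArray(zeros(3, 3, 3)), 0:2, 0:2, 1:3)
dA = CUDAnative.cudaconvert(A)
typeof(dA)  # expected: OffsetArray{Float64,3,CuDeviceArray{Float64,3,...}}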

It works on the CPU:

Nx, Ny, Nz = 5, 5, 5

max_threads = 1024
Tx  = min(max_threads, Nx)
Ty  = min(fld(max_threads, Tx), Ny)
Tz  = min(fld(max_threads, Tx*Ty), Nz)
Bx, By, Bz = cld(Nx, Tx), cld(Ny, Ty), cld(Nz, Tz)

# We'll fill the interior of an OffsetArray on the CPU and check to see if the kernel correctly
# fills in the halo regions in the _x_-direction. It does.
underlying_data = zeros(Nx+2, Ny+2, Nz)  # Halos of size 1 on each side.
data = OffsetArray(underlying_data, 0:Nx+1, 0:Ny+1, 1:Nz)
@. @views data[1:Nx, 1:Ny, 1:Nz] = rand()
@launch CPU() fill_halo_regions_x!(Nx, Ny, Nz, data, threads=(Tx, Ty, Tz), blocks=(Bx, By, Bz))
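
To confirm the CPU launch did what we expect, the halo columns should now mirror the opposite interior edge. A quick check using the arrays defined above (added here for illustration):

@assert data[0,    1:Ny, 1:Nz] == data[Nx, 1:Ny, 1:Nz]  # west halo == east interior edge
@assert data[Nx+1, 1:Ny, 1:Nz] == data[1,  1:Ny, 1:Nz]  # east halo == west interior edge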

It does not work on the GPU:

underlying_data = CuArray(zeros(Nx+2, Ny+2, Nz))
data = OffsetArray(underlying_data, 0:Nx+1, 0:Ny+1, 1:Nz)
@. @views data[1:Nx, 1:Ny, 1:Nz] = rand()
@launch CUDA() fill_halo_regions_x!(Nx, Ny, Nz, data, threads=(Tx, Ty, Tz), blocks=(Bx, By, Bz))

and spits out this error and stacktrace:

ERROR: InvalidIRError: compiling #12(Int64, Int64, Int64, OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] call at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:447
 [2] fallback at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:445
 [3] overdub at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:271
 [4] fill_halo_regions_x! at REPL[4]:2
 [5] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:2
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] call at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:447
 [2] fallback at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:445
 [3] overdub at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:271
 [4] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [5] fill_halo_regions_x! at REPL[4]:2
 [6] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:2
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] call at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:447
 [2] fallback at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:445
 [3] overdub at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:271
 [4] fill_halo_regions_x! at REPL[4]:3
 [5] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:3
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] call at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:447
 [2] fallback at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:445
 [3] overdub at /home/alir_mit_edu/.julia/packages/Cassette/xggAf/src/context.jl:271
 [4] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [5] fill_halo_regions_x! at REPL[4]:3
 [6] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:3
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:4
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported dynamic function invocation (call to Cassette.overdub)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:5
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockIdx')
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:2
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:2
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockIdx')
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:2
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:2
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:5
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockIdx')
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:3
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockDim')
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:3
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'threadIdx')
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:3
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] fill_halo_regions_x! at REPL[4]:3
 [2] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockIdx')
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:3
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'blockDim')
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:3
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported use of an undefined name (use of 'threadIdx')
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:3
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Reason: unsupported call to the Julia runtime (call to jl_type_error)
Stacktrace:
 [1] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:198
 [2] fill_halo_regions_x! at REPL[4]:3
 [3] #12 at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/context.jl:136
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/validation.jl:104
 [2] macro expansion at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/driver.jl:182 [inlined]
 [3] macro expansion at /home/alir_mit_edu/.julia/packages/TimerOutputs/7zSea/src/TimerOutput.jl:216 [inlined]
 [4] #codegen#113(::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/driver.jl:178
 [5] #codegen at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/driver.jl:0 [inlined]
 [6] #compile#112(::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/driver.jl:38
 [7] #compile#111 at ./none:0 [inlined]
 [8] compile at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/compiler/driver.jl:22 [inlined] (repeats 2 times)
 [9] macro expansion at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/execution.jl:378 [inlined]
 [10] #cufunction#146(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::getfield(GPUifyLoops, Symbol("##12#13")){typeof(fill_halo_regions_x!)}, ::Type{Tuple{Int64,Int64,Int64,OffsetArray{Float64,3,CUDAnative.CuDeviceArray{Float64,3,CUDAnative.AS.Global}}}}) at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/execution.jl:347
 [11] cufunction(::Function, ::Type) at /home/alir_mit_edu/.julia/packages/CUDAnative/IZwnv/src/execution.jl:347
 [12] macro expansion at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:113 [inlined]
 [13] macro expansion at ./gcutils.jl:87 [inlined]
 [14] #launch#46(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Int64},Tuple{Symbol,Symbol},NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::Function, ::CUDA, ::typeof(fill_halo_regions_x!), ::Int64, ::Vararg{Any,N} where N) at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:110
 [15] (::getfield(GPUifyLoops, Symbol("#kw##launch")))(::NamedTuple{(:threads, :blocks),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::typeof(GPUifyLoops.launch), ::CUDA, ::typeof(fill_halo_regions_x!), ::Int64, ::Vararg{Any,N} where N) at ./none:0
 [16] top-level scope at /home/alir_mit_edu/.julia/packages/GPUifyLoops/hBRid/src/GPUifyLoops.jl:54

It looks like there are some unexpected errors about threadIdx, blockIdx, and blockDim not being defined, but otherwise I'm not sure what the error is.

I'm using the tb/errors branch of CUDAnative.jl (JuliaGPU/CUDAnative.jl#389) to get these error messages. Otherwise it was just a lot of jl_apply_generic errors.

Note: I'm using the wrong number of threads and blocks for this kernel, but I don't think that's relevant to the issue.
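
Until OffsetArray arguments compile on the GPU, a possible workaround (a sketch, not an approach recommended by the package) is to pass the plain parent CuArray and shift the indices inside the kernel by the one-cell halo offset; it also assumes CUDAnative is loaded so that blockIdx, blockDim, and threadIdx resolve:

using CUDAnative

# Same kernel as above, but written against the 1-based parent array. With
# data = OffsetArray(underlying_data, 0:Nx+1, 0:Ny+1, 1:Nz), the offset view
# data[i, j, k] maps to underlying_data[i+1, j+1, k].
function fill_halo_regions_x_parent!(Nx, Ny, Nz, f)
    @loop for k in (1:Nz; blockIdx().z)
        @loop for j in (1:Ny; (blockIdx().y - 1) * blockDim().y + threadIdx().y)
            f[1,    j+1, k] = f[Nx+1, j+1, k]   # data[0,    j, k] = data[Nx, j, k]
            f[Nx+2, j+1, k] = f[2,    j+1, k]   # data[Nx+1, j, k] = data[1,  j, k]
        end
    end

    @synchronize
end

@launch CUDA() fill_halo_regions_x_parent!(Nx, Ny, Nz, underlying_data,
                                           threads=(Tx, Ty, Tz), blocks=(Bx, By, Bz))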

Requesting an example program

An example program (say, applying the log function to an array) was requested at the CliMA Monday morning meeting. @charleskawczynski was having some errors running a similar kernel; perhaps he can file an issue on that as well?
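
A minimal sketch of such a program, written here for illustration rather than taken from the package's tests: it applies log elementwise to an array, with the same kernel launched on either the GPU or the CPU.

using GPUifyLoops, CuArrays, CUDAnative

# Each thread handles one index of the arrays.
function log_kernel!(out, x)
    @inbounds @loop for i in (1:length(x); (blockIdx().x - 1) * blockDim().x + threadIdx().x)
        out[i] = log(x[i])
    end
    @synchronize
end

N   = 1024
x   = CuArray(rand(Float32, N))
out = similar(x)
@launch CUDA() threads=(256,) blocks=(cld(N, 256),) log_kernel!(out, x)

# The same kernel runs on the CPU with plain Arrays:
xc   = rand(Float32, N)
outc = similar(xc)
@launch CPU() log_kernel!(outc, xc)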

`@shmem` with integer results in `Any`

Got bit by this one today. If the size passed to @shmem is an integer (as opposed to a tuple), then on the CPU the resulting array is inferred as Any.

using GPUifyLoops

# Integer for size of shmem
function ker1!(::Val{Nq}, y, x) where Nq
  s_x = @shmem eltype(x) Nq # <<<< HERE
  @inbounds @loop for i in (1:Nq; threadIdx().x)
    s_x[i] = x[i]
  end
  nothing
end

# tuple for size of shmem
function ker2!(::Val{Nq}, y, x) where Nq
  s_x = @shmem eltype(x) (Nq, ) # <<<< HERE
  @inbounds @loop for i in (1:Nq; threadIdx().x)
    s_x[i] = x[i]
  end
  nothing
end

function main(Nq = 10)
  x = rand(Nq)
  y = similar(x)

  @time @launch CPU() threads=(Nq,) ker1!(Val(Nq), y, x)
  @time @launch CPU() threads=(Nq,) ker2!(Val(Nq), y, x)
end
julia> main()
  0.000087 seconds (57 allocations: 2.469 KiB)
  0.000023 seconds (4 allocations: 96 bytes)

@code_warntype reveals that s_x is inferred as Any in ker1!.
