
bumper.jl's Introduction

Bumper.jl

Bumper.jl is a package that aims to make working with bump allocators (also known as arena allocators) easier and safer. You can dynamically allocate memory in these bump allocators and reset them at the end of a code block, much like a stack. Allocating to a bump allocator with Bumper.jl can be just as efficient as stack allocation. Bumper.jl is still a young package and may have bugs; let me know if you find any.

If you use Bumper.jl, please consider submitting a sample of your use-case so I can include it in the test suite.

Basics

Bumper.jl has a task-local default allocator, which uses a slab allocation strategy and can dynamically grow to arbitrary sizes.

The simplest way to use Bumper is to rely on its default buffer implicitly like so:

using Bumper

function f(x)
    # Set up a scope where memory may be allocated, and does not escape:
    @no_escape begin
        # Allocate a `UnsafeVector{eltype(x)}` (see UnsafeArrays.jl) using memory from the default buffer.
        y = @alloc(eltype(x), length(x))
        # Now do some stuff with that vector:
        y .= x .+ 1
        sum(y) # It's okay for the sum of y to escape the block, but references to y itself must not do so!
    end
end

f([1,2,3])
9

When you use @no_escape, you are promising that the code enclosed in the macro will not leak any memory created by @alloc. That is, you are only allowed to do intermediate @alloc allocations inside a @no_escape block, and the lifetime of those allocations is the block. This is important. Once a @no_escape block finishes running, it will reset its internal state to the position it had before the block started, potentially overwriting or freeing any arrays which were created in the block.
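
For instance, here is a hedged sketch of exactly the kind of escape you must avoid (the function name bad_usage is hypothetical):

using Bumper

function bad_usage(x)
    y = nothing
    @no_escape begin
        y = @alloc(eltype(x), length(x)) # memory owned by the block
        y .= x
    end
    # WRONG: the buffer was already reset when the block ended, so the
    # memory behind `y` may be overwritten by any later allocation.
    sum(y)
end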

In addition to @alloc for creating arrays, you can use @alloc_ptr(n) to get an n-byte pointer (of type Ptr{Nothing}) directly.
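
As a minimal sketch (the function name byte_checksum is just for illustration), the raw bytes from @alloc_ptr can be used with the usual unsafe_store!/unsafe_load pointer routines:

using Bumper

function byte_checksum(n)
    @no_escape begin
        p = @alloc_ptr(n)                      # Ptr{Nothing} to n bytes of buffer memory
        bytes = convert(Ptr{UInt8}, p)
        for i in 1:n
            unsafe_store!(bytes, i % UInt8, i) # fill the raw bytes
        end
        s = UInt(0)
        for i in 1:n
            s += unsafe_load(bytes, i)         # read them back and sum
        end
        s
    end
end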

Let's compare the performance of f to the equivalent with an intermediate heap allocation:

using BenchmarkTools
@benchmark f(x) setup=(x = rand(1:10, 30))
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.465 ns … 49.843 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     28.718 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   28.840 ns ±  0.833 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▄▂▇█▅▆▇▅▂▂▁▁▂▁                                             ▂
  ██████████████████▆▇▅▄▅▅▅▆▃▄▄▁▃▄▄▃▄▃▁▁▁▁▁▃▁▁▁▄▅▅▅▅▄▄▃▄▁▃▃▃▄ █
  28.5 ns      Histogram: log(frequency) by time      31.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

and

function g(x::Vector{Int})
    y = x .+ 1
    sum(y)
end

@benchmark g(x) setup=(x = rand(1:10, 30))
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.408 ns …  64.986 μs  ┊ GC (min … max):  0.00% … 99.87%
 Time  (median):     37.443 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   55.929 ns ± 651.009 ns  ┊ GC (mean ± σ):  14.68% ±  5.87%

  ▆█▅▃▁▁▁▁                       ▁▁ ▁                       ▂▁ ▁
  ████████▇██▅▄▃▄▁▁▃▁▁▁▁▁▁▁▁▃▃▁▁██████▇▇▅▁▄▃▃▃▁▁▃▁▁▁▄▃▄▅▄▄▅▇██ █
  32.4 ns       Histogram: log(frequency) by time       227 ns <

 Memory estimate: 304 bytes, allocs estimate: 1.

So, using Bumper.jl in this benchmark gives a slight speedup relative to regular julia Vectors, and a major increase in performance consistency due to the lack of heap allocations.

However, we can actually go a little faster if we're okay with manually passing around a buffer. The way I invoked @no_escape and @alloc implicitly used the task's default buffer, and fetching that default buffer is not as fast as using a const global variable, because Bumper.jl is trying to protect you against concurrency bugs (more on that later).

If we provide the allocator to f explicitly, we go even faster:

function f(x, buf)
    @no_escape buf begin # <----- Notice I specified buf here
        y = @alloc(Int, length(x)) 
        y .= x .+ 1
        sum(y)
    end
end

@benchmark f(x, buf) setup = begin
    x   = rand(1:10, 30)
    buf = default_buffer()
end
BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.425 ns … 40.367 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.494 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.620 ns ±  0.983 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅                                                          ▁
  ██▅█▇▄▃▄▄▃▃▃▄▅▄▅▄▅▄▇▇▅▄▄▅▆▅▅▅▄▄▄▁▄▃▃▃▁▁▄▃▃▄▁▁▁▁▃▃▃▁▄▄▃▁▄▃▁▃ █
  19.4 ns      Histogram: log(frequency) by time      25.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

If you manually specify a buffer like this, it is your responsibility to ensure that you don't have multiple concurrent tasks using that buffer at the same time.
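
One safe pattern, sketched here under the assumption that each unit of work is independent, is to give every spawned task its own private buffer:

using Bumper

function sum_doubled(x, buf)
    @no_escape buf begin
        y = @alloc(eltype(x), length(x))
        y .= 2 .* x
        sum(y)
    end
end

# Each task constructs its own SlabBuffer, so no buffer is ever shared:
tasks = [Threads.@spawn sum_doubled(rand(100), SlabBuffer()) for _ in 1:4]
results = fetch.(tasks)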

Running default_buffer() will give you the current task's default buffer. You can explicitly construct your own N-byte buffer by calling AllocBuffer(N), or you can create a buffer which can dynamically grow by calling SlabBuffer(). AllocBuffers are slightly faster than SlabBuffers, but will throw an error if you overfill them.
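
For example (a sketch; the sizes are arbitrary):

using Bumper

fixed = AllocBuffer(2^20) # fixed-capacity 1 MiB buffer; overfilling it throws
slab  = SlabBuffer()      # grows by adding slabs as needed

@no_escape fixed begin
    v = @alloc(Float64, 256)
    v .= 1.0
    sum(v)
end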

Important notes

  • @no_escape blocks can be nested as much as you want, just don't let references outlive the specific block they were created in.
  • At the end of a @no_escape block, all memory allocations from inside that block are erased and the buffer is reset to its previous state.
  • The @alloc marker can only be used directly inside of a @no_escape block, and it will always use the buffer that the corresponding @no_escape block uses.
  • You cannot use @alloc from a different concurrent task than its parent @no_escape block as this can cause concurrency bugs.
  • If for some reason you need to be able to use the allocated memory outside of the scope of the @no_escape block, there is a function Bumper.alloc!(buf, T, n...) which takes in an explicit buffer buf and uses it to allocate an array of element type T and dimensions n... (see the sketch after this list). Using this is not as safe as @alloc and is not recommended.
  • Bumper.jl only supports isbits types. You cannot use it for allocating vectors containing mutable, abstract, or other pointer-backed objects.
  • As mentioned previously, do not allow any array which was initialized inside a @no_escape block to escape the block. Doing so will cause incorrect results.
  • If you accidentally overfill a buffer, e.g. via a memory leak, and need to reset it, use Bumper.reset_buffer! to do so.
  • You are not allowed to use return or @goto inside a @no_escape block, since this could compromise the cleanup it performs after the block finishes.
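
As a hedged sketch of the escape hatch and reset utilities mentioned in the list above (assuming the task's default buffer):

using Bumper

buf = default_buffer()
y = Bumper.alloc!(buf, Float64, 8) # allocated without a managing @no_escape block
y .= 1.0
total = sum(y)
Bumper.reset_buffer!(buf)          # rewind the buffer manually; `y` is now invalid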

Concurrency and parallelism

Every task has its own independent default buffer. A task's buffer is only created if it is used, so this does not slow down the spawning of Julia tasks in general. Here's a demo showing that the default buffers are different:

using Bumper
let b = default_buffer() # The default buffer on the main task
    t = @async default_buffer() # Get the default buffer on an asynchronous task
    fetch(t) === b
end
false

Whereas if we don't spawn any tasks, there is no unnecessary buffer creation:

let b = default_buffer()
    b2 = default_buffer() 
    b2 === b
end
true

Because of this, we don't have to worry about @no_escape begin ... @alloc() ... end blocks on different threads or tasks interfering with each other, so long as they are only operating on buffers local to that task or the default_buffer().

Allocators provided by Bumper

SlabBuffer

SlabBuffer is a slab-based bump allocator which can dynamically grow to hold an arbitrary amount of memory. Small allocations from a SlabBuffer will live within a specific slab of memory, and if that slab fills up, a new slab is allocated and future allocations happen on that slab. Small allocations are stored in slabs of SlabSize bytes (default 1 megabyte), and the list of live slabs is tracked in a field called slabs. Allocations which are too large to fit into one slab are stored and tracked in a field called custom_slabs.

SlabBuffers are nearly as fast as stack allocation (typically within a couple of nanoseconds) for typical use. One potential performance pitfall: if a SlabBuffer's current position is at the end of a slab, the next allocation will be slow because it requires a new slab to be created. This means that if you do something like

buf = SlabBuffer{N}()
@no_escape buf begin
    @alloc(Int8, N÷2 - 1) # Take up just under half the first slab
    @alloc(Int8, N÷2 - 1) # Take up another half of the first slab
    # Now buf should be practically out of room. 
    for i in 1:1000
        @no_escape buf begin
            y = @alloc(Int8, 10) # This will allocate a new slab because there's no room
            f(y)
        end # At the end of this block, we delete the new slab because it's not needed.
    end
end

then the inner loop will run slower than normal, because at each iteration a new slab of N bytes must be freshly allocated. This should be a rare occurrence, but is possible to encounter.
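
If you know roughly how much memory the hot code needs, one mitigation (a sketch, not an official recommendation) is to construct the buffer with a slab size large enough that the loop never straddles a slab boundary:

using Bumper

buf = SlabBuffer{2^24}() # 16 MiB slabs instead of the 1 MiB default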

Do not manipulate the fields of a SlabBuffer that is in use.

AllocBuffer

AllocBuffer{StorageType} is a very simple bump allocator that can be used to store a fixed amount of memory of type StorageType, so long as ::StorageType supports pointer and sizeof. If it runs out of memory to allocate, an error will be thrown. By default, AllocBuffer stores a Vector{UInt8} of 1 megabyte.
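
For instance, here is a sketch that wraps a pre-made Vector{UInt8}, assuming the AllocBuffer(storage) constructor accepts existing storage:

using Bumper

storage = Vector{UInt8}(undef, 2^16) # 64 KiB of backing memory (assumed constructor form)
buf = AllocBuffer(storage)           # works because Vector supports pointer and sizeof

@no_escape buf begin
    v = @alloc(Int, 100)
    v .= 1
    sum(v)
end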

Allocations using AllocBuffers should be just as fast as stack allocation.

Do not manually manipulate the fields of an AllocBuffer that is in use.

Creating your own allocator types

Bumper.jl's SlabBuffer type is very flexible and fast, and so should almost always be preferred, but you may have specific use-cases where you want a different design or different tradeoffs while still interoperating with Bumper.jl's other features. Hence, Bumper.jl provides an API for you to hook custom allocator types into it.

When someone writes

@no_escape buf begin
    y = @alloc(T, n, m, o)
    f(y)
end 

this turns into the equivalent of

begin
    local cp = Bumper.checkpoint_save(buf)
    local result = begin 
        y = Bumper.alloc!(buf, T, n, m, o)
        f(y)
    end
    Bumper.checkpoint_restore!(cp)
    result
end

checkpoint_save should save the state of buf, alloc! should create an array using memory from buf, and checkpoint_restore! needs to reset buf to the state it was in when the checkpoint was created.

Hence, in order to use your custom allocator with Bumper.jl, all you need to write are the following methods:

  • Bumper.alloc_ptr!(::YourAllocator, n::Int)::Ptr{Nothing}, which returns a pointer to memory that can hold at least n bytes, created from memory supplied by your allocator type however you see fit.
    • Alternatively, you could implement Bumper.alloc!(::YourAllocator, ::Type{T}, s::Vararg{Integer}) which should return a multidimensional array whose sizes are determined by s..., created from memory supplied by your custom allocator. The default implementation of this method calls Bumper.alloc_ptr!.
  • Bumper.checkpoint_save(::YourAllocator)::YourAllocatorCheckpoint which saves whatever information your allocator needs to save in order to later on deallocate all objects which were created after checkpoint_save was called.
  • checkpoint_restore!(::YourAllocatorCheckpoint) which resets the allocator back to the state it was in when the checkpoint was created.

Let's look at a concrete example where we make our own simple copy of AllocBuffer:

mutable struct MyAllocBuffer
    buf::Vector{UInt8} # The memory chunk we'll use for allocations
    offset::UInt       # A simple offset saying where the current position of the allocator is.
	
    #Default constructor
    MyAllocBuffer(n::Int) = new(Vector{UInt8}(undef, n), UInt(0))
end

struct MyCheckpoint
    buf::MyAllocBuffer # The buffer we want to store
    offset::UInt       # The buffer's offset when the checkpoint was created
end

function Bumper.alloc_ptr!(b::MyAllocBuffer, sz::Int)::Ptr{Cvoid}
    ptr = pointer(b.buf) + b.offset
    b.offset += sz
    b.offset > sizeof(b.buf) && error("alloc: Buffer out of memory.")
    ptr
end

function Bumper.checkpoint_save(buf::MyAllocBuffer)
    MyCheckpoint(buf, buf.offset)
end
function Bumper.checkpoint_restore!(cp::MyCheckpoint)
    cp.buf.offset = cp.offset
    nothing
end

That's it!

julia> let x = [1, 2, 3], buf = MyAllocBuffer(100)
           @btime f($x, $buf)
       end
  9.918 ns (0 allocations: 0 bytes)
9

As a bonus (this isn't required), if you want functionality like default_buffer, it can be implemented as follows:

#Some default size, say 16kb
MyAllocBuffer() = MyAllocBuffer(16_000)

const default_buffer_key = gensym(:my_buffer)
function Bumper.default_buffer(::Type{MyAllocBuffer})
    get!(() -> MyAllocBuffer(), task_local_storage(), default_buffer_key)::MyAllocBuffer
end

You may also want to implement Bumper.reset_buffer! for returning your allocator to a freshly initialized state.
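
For example, a minimal sketch for MyAllocBuffer, where resetting just rewinds the offset:

function Bumper.reset_buffer!(b::MyAllocBuffer)
    b.offset = UInt(0) # rewind to the start; all prior allocations become invalid
    b
end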

Usage with StaticCompiler.jl

Bumper.jl is in the process of becoming a dependency of StaticTools.jl (and thus StaticCompiler.jl), which extends Bumper.jl with a new buffer type, MallocSlabBuffer. This is like SlabBuffer but designed to work without needing Julia's runtime at all, allowing for code like the following:

using Bumper, StaticTools
function times_table(argc::Int, argv::Ptr{Ptr{UInt8}})
    argc == 3 || return printf(c"Incorrect number of command-line arguments\n")
    rows = argparse(Int64, argv, 2)            # First command-line argument
    cols = argparse(Int64, argv, 3)            # Second command-line argument

    buf = MallocSlabBuffer()
    @no_escape buf begin
        M = @alloc(Int, rows, cols)
        for i=1:rows
            for j=1:cols
                M[i,j] = i*j
            end
        end
        printf(M)
    end
    free(buf)
end

using StaticCompiler
filepath = compile_executable(times_table, (Int64, Ptr{Ptr{UInt8}}), "./")

giving

shell> ./times_table 12, 7
1   2   3   4   5   6   7
2   4   6   8   10  12  14
3   6   9   12  15  18  21
4   8   12  16  20  24  28
5   10  15  20  25  30  35
6   12  18  24  30  36  42
7   14  21  28  35  42  49
8   16  24  32  40  48  56
9   18  27  36  45  54  63
10  20  30  40  50  60  70
11  22  33  44  55  66  77
12  24  36  48  60  72  84

Docstrings

See the full list of docstrings here.

bumper.jl's People

Contributors

jipolanco, masonprotter, pallharaldsson


bumper.jl's Issues

Undefined function

Precompiling project...
  ✗ Bumper
  0 dependencies successfully precompiled in 3 seconds. 57 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]

Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /home/lime/.julia/compiled/v1.8/Bumper/jl_ujDpjX.
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
 [1] include
   @ ./Base.jl:419 [inlined]
 [2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1554
 [3] top-level scope
   @ stdin:1
in expression starting at /home/lime/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1

@no_escape is incompatible with Threads.@threads

julia> function f1()
           @no_escape begin
               y = @alloc(Int,10)
               Threads.@threads for val in y
                   println(val)
               end
           end
       end
ERROR: LoadError: The `return` keyword is not allowed to be used inside the `@no_escape` macro

I have to nest the for loop inside a function to trick it:

julia> function _f(y)
           Threads.@threads for val in y
               println(val)
           end
       end
_f (generic function with 1 method)
julia> function f2()
           @no_escape begin
               y = @alloc(Int,10)
               _f(y)
           end
       end
f2 (generic function with 1 method)

Massive slowdown when running with `--check-bounds=no`

When running julia with --check-bounds=no something goes wrong with Bumper. It should be noted in the docs.

The MWE is the example from the docs:

using Bumper
using BenchmarkTools
using StrideArrays

function f(x)
    # Set up a scope where memory may be allocated, and does not escape:
    @no_escape begin
        # Allocate a `PtrArray` (see StrideArraysCore.jl) using memory from the default buffer.
        y = @alloc(eltype(x), length(x))
        # Now do some stuff with that vector:
        y .= x .+ 1
        sum(y) # It's okay for the sum of y to escape the block, but references to y itself must not do so!
    end
end

@benchmark f(x) setup=(x = rand(1:10, 30))

Starting julia with --check-bounds=auto I get this output:

BenchmarkTools.Trial: 10000 samples with 997 evaluations.
 Range (min … max):  19.837 ns … 41.080 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.998 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.250 ns ±  1.138 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▇▆▅▄▃▂▁  ▁▁▁   ▁                                          ▂
  ████████████████████▇▇▆█▆▆▅▅▅▆▆▆█▇▆▇▆▆▅▄▅▅▃▅▅▄▅▄▂▂▃▃▄▃▄▅▃▃▄ █
  19.8 ns      Histogram: log(frequency) by time      24.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With --check-bounds=no it is quite a bit slower, and allocating:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  147.137 μs …  4.958 ms  ┊ GC (min … max): 0.00% … 95.54%
 Time  (median):     152.287 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   156.173 μs ± 87.330 μs  ┊ GC (mean ± σ):  1.91% ±  3.47%

         ▁▁▂▅▆█▆▄▄▂▂▂▂▁▂▁                                       
  ▂▁▃▄▆▇████████████████████▇▆▆▆▆▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  147 μs          Histogram: frequency by time          166 μs <

 Memory estimate: 49.56 KiB, allocs estimate: 1050.

Julia Version 1.12.0-DEV.606
Commit 6f569c7ba0* (2024-05-27 08:27 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen Threadripper PRO 5945WX 12-Cores
WORD_SIZE: 64
LLVM: libLLVM-17.0.6 (ORCJIT, znver3)
Threads: 24 default, 0 interactive, 24 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = auto
JULIA_EDITOR = emacs -nw

ldiv! does not accept PtrArray

First off, interesting package! I think my issue is more with StrideArrays and LinearAlgebra not meshing well and Bumper is caught in the middle.

The error I'm getting comes from trying to use ldiv!, which requires a factorized matrix, but StrideArrays always tries to produce a PtrArray regardless of the function applied:

X = rand(100,100)
y = rand(100)

function f(X,y)

    numObs, numFeatures = size(X)
    T = eltype(X)

    @no_escape begin
        Xfact = @alloc(T, numObs, numFeatures)
        b = @alloc(T, numFeatures)
        ŷ = @alloc(T, numObs)

        Xfact .= X
        qr!(Xfact)
        ldiv!(b,Xfact,y) # <-- ERROR: MethodError: no method matching ldiv!(::PtrArray{…}, ::PtrArray{…})
        mul!(ŷ,X,b)

        err = sum((yᵢ - ŷᵢ)^2 for (yᵢ, ŷᵢ) in zip(y,ŷ)) / numObs
    end

    return err
end

I'm guessing there's no easy way to avoid using PtrArrays. I can use X\y but this of course allocates which kind of defeats the purpose.

Integration tests for DynamicExpressions.jl?

I read in the README:

If you use Bumper.jl, please consider submitting a sample of your use-case so I can include it in the test suite.

Happy to share that I just added support for Bumper.jl in DynamicExpressions.jl, which means people can soon also use it for SymbolicRegression.jl and PySR.

My use-case is coded up in this file with the important part being:

function bumper_eval_tree_array(
    tree::AbstractExpressionNode{T},
    cX::AbstractMatrix{T},
    operators::OperatorEnum,
    ::Val{turbo},
) where {T,turbo}
    result = similar(cX, axes(cX, 2))
    n = size(cX, 2)
    all_ok = Ref(false)
    @no_escape begin
        _result_ok = tree_mapreduce(
            # Leaf nodes, we create an allocation and fill
            # it with the value of the leaf:
            leaf_node -> begin
                ar = @alloc(T, n)
                ok = if leaf_node.constant
                    v = leaf_node.val::T
                    ar .= v
                    isfinite(v)
                else
                    ar .= view(cX, leaf_node.feature, :)
                    true
                end
                ResultOk(ar, ok)
            end,
            # Branch nodes, we simply pass them to the evaluation kernel:
            branch_node -> branch_node,
            # In the evaluation kernel, we combine the branch nodes
            # with the arrays created by the leaf nodes:
            ((args::Vararg{Any,M}) where {M}) ->
                dispatch_kerns!(operators, args..., Val(turbo)),
            tree;
            break_sharing=Val(true),
        )
        x = _result_ok.x
        result .= x
        all_ok[] = _result_ok.ok
    end
    return (result, all_ok[])
end

Basically it's a recursive evaluation scheme for an arbitrary symbolic expression over a 2D array of data. Preliminary results show a massive performance gain with bump allocation! It's even faster than LoopVectorization (the user can even turn on both, though I don't see much more of an improvement).

The way you can write an integration test is:

using DynamicExpressions: Node, OperatorEnum, eval_tree_array
using Bumper
using Random: MersenneTwister as RNG

operators = OperatorEnum(binary_operators=(+, -, *), unary_operators=(cos, exp))

x1 = Node{Float32}(feature=1)
x2 = Node{Float32}(feature=2)
x3 = Node{Float32}(feature=3)

tree = cos(x1 * 0.9 - 0.5) + x2 * exp(1.0 - x3 * x3)
# ^ This is a symbolic expression described as a type-stable binary tree

# Evaluate with Bumper:
X = randn(RNG(0), Float32, 3, 1000);

truth, no_nans_truth = eval_tree_array(tree, X, operators)
test, no_nans_test = eval_tree_array(tree, X, operators; bumper=true)

@test truth ≈ test

You could also randomly generate expressions if you want to use this as a way to stress test the bump allocator. The code to generate trees is here, which lets you do

tree = gen_random_tree_fixed_size(20, operators, 2, Float32)

Cheers,
Miles

P.S., any tips on how I'm using bumper allocation would be much appreciated!! For example, I do know exactly how large the allocation should be in advance – can that help me get more perf at all?

Move into Julia and/or under JuliaLang, as stdlib?

Hi,

I believe your package has a good track record by now, just works. Probably not many know of it.

Should it be added to Julia, so that e.g. the compiler/optimizer can use it? It seems we could compete with Mojo that way. It deallocates as fast as possible, before variables go out of scope even (in languages like C++).

A first step would even be helpful on its own:

Phase 1.
Just move it unchanged; this gives more visibility (which could also be had by documenting it in Julia's docs). Julia itself wouldn't use it, but at any point it could, by using Bumper.jl as documented.

Phase 2.
This would be up to the Julia people also, and is the main win of merging: make already-existing idiomatic Julia code, in or out of Julia itself, use Bumper.jl transparently.

I recall our discussion, but can't find it, about dynamically adding to the buffer. I see it's now task-local (would it be per thread, or is that in effect what it is?). I mentioned a problem with dynamically enlarging, so you backed away from it; now I found a solution, but it seems redundant with changes I see you've already implemented. I see you now allocate 1/8th of physical memory, which seems way excessive, though I think the point is that you never have to enlarge. You rely on the VM (i.e. RAM not actually used, just virtual memory reserved, with the OS allocating more of it transparently). So why 1/8th? Why not even larger, all of it, or smaller? I'm guessing if you have e.g. 8 threads then you allocate all of it, and with 16 then 2x overcommit (which is ok, at least on Linux).

I do not believe overcommitting works on Windows, however, so do you know of problems if e.g. you have very many threads? Also, say, Julia with 8 threads and 4 such Julias running at once: is that ok? I don't know about macOS, but it's likely similar. Before merging, such use would need to be confirmed ok, or the 1/8th lowered...


Bug report: allocating custom abstract types

First of all: cool package and thanks for your work!

While I was working with the package I encountered an error starting with
"Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks."
so here is that bug report.

I wanted to allocate some memory for an array with an abstract eltype. Here is a minimal (not) working example:

using Bumper

abstract type MyType end

struct MyStruct <: MyType
    x::Int
end

Base.sizeof(::Type{MyType}) = sizeof(Int)

@no_escape begin
    foo_arr = @alloc(MyType, 10)
    println(foo_arr)
end

I suppose the answer might be 'you cannot define sizeof for your abstract type and expect things to work', but I wanted to open this bug report anyway as requested.

Here is the full stack trace:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x15f21dd2f5e -- _show_default at .\show.jl:465
in expression starting at C:\Users\wolf.nederpel\hive\intsect\scripts\random_julia_test.jl:11
_show_default at .\show.jl:465
show_default at .\show.jl:462 [inlined]
show at .\show.jl:457 [inlined]
show_delim_array at .\show.jl:1346
show_delim_array at .\show.jl:1335 [inlined]
show_vector at .\arrayshow.jl:530
show_vector at .\arrayshow.jl:515 [inlined]
show at .\arrayshow.jl:486 [inlined]
print at .\strings\io.jl:35
print at .\strings\io.jl:46
println at .\strings\io.jl:75
unknown function (ip: 0000015f21dd4f1f)
println at .\coreio.jl:4
unknown function (ip: 0000015f21dd2bcb)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_body at C:/workdir/src\interpreter.c:489
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
include_string at .\loading.jl:2070
_include at .\loading.jl:2130
include at .\client.jl:489
unknown function (ip: 0000015f21dc916b)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
do_call at C:/workdir/src\interpreter.c:126
eval_value at C:/workdir/src\interpreter.c:223
eval_stmt_value at C:/workdir/src\interpreter.c:174 [inlined]
eval_body at C:/workdir/src\interpreter.c:635
jl_interpret_toplevel_thunk at C:/workdir/src\interpreter.c:775
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:934
jl_toplevel_eval_flex at C:/workdir/src\toplevel.c:877
ijl_toplevel_eval at C:/workdir/src\toplevel.c:943 [inlined]
ijl_toplevel_eval_in at C:/workdir/src\toplevel.c:985
eval at .\boot.jl:385 [inlined]
eval_user_input at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:150
repl_backend_loop at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:246
#start_repl_backend#46 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:231
start_repl_backend at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:228
#run_repl#59 at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:389
run_repl at C:\workdir\usr\share\julia\stdlib\v1.10\REPL\src\REPL.jl:375
jfptr_run_repl_95895.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
#1013 at .\client.jl:432
jfptr_YY.1013_86694.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
jl_f__call_latest at C:/workdir/src\builtins.c:812
#invokelatest#2 at .\essentials.jl:887 [inlined]
invokelatest at .\essentials.jl:884 [inlined]
run_main_repl at .\client.jl:416
exec_options at .\client.jl:333
_start at .\client.jl:552
jfptr__start_86719.1 at C:\Users\wolf.nederpel\AppData\Local\Programs\Julia-1.10.0\lib\julia\sys.dll (unknown line)
jl_apply at C:/workdir/src\julia.h:1982 [inlined]
true_main at C:/workdir/src\jlapi.c:582
jl_repl_entrypoint at C:/workdir/src\jlapi.c:731
mainCRTStartup at C:/workdir/cli\loader_exe.c:58
BaseThreadInitThunk at C:\windows\System32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\windows\SYSTEM32\ntdll.dll (unknown line)
Allocations: 635700 (Pool: 634725; Big: 975); GC: 1

Add/use slab-bump allocator?

The basic idea is, you have slabs of some size.
When you run out of memory, you allocate a new slab.

Examples:
llvm: https://llvm.org/doxygen/Allocator_8h_source.html
LoopModels: https://github.com/JuliaSIMD/LoopModels/blob/bumprealloc/include/Utilities/Allocators.hpp

LoopModels' is largely a copy of LLVM's, but supports either bump-up or bump-down.
LoopModels' slab size is constant, but LLVM's slabs grow.

A julia struct itself could look like

mutable struct BumpAlloc{Up,SlabSize}
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    # you could try and get fancy and reduce the number of indirections by having your own array type
    slabs::Vector{Ptr{Cvoid}}
    custom_slabs::Vector{Ptr{Cvoid}}
end
# should probably register a finalizer that `Libc.free`s all the pointers
# optionally use a faster library like `mimalloc` instead of `Libc`

The custom_slabs are for objects too big for the SlabSize.
The point of keeping them separate was largely that in C++ there are possibly faster free/delete functions that take the size (i.e. they might exist, and they might be faster).
Given that we don't have that here, we may as well fuse them, unless you find some allocator API that supports sizes.

Being able to grow lets you default to a much smaller slab size.

I was thinking about modifying SimpleChains to use something like this.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Performance for very small arrays

Hi!

I'm testing various custom CPU arrays implementations in Julia, and comparing them with stack-allocated arrays and heap-allocated arrays in C.

https://gist.github.com/mdmaas/d1b6b1a69a6b235143d7110237ff4ae8

The test first allocates the inverse squares of integers from 1 to N, and then performs the sum.

This is how it looks for Bumper.jl:

@inline function sumArray_bumper(N)
    @no_escape begin
        smallarray = alloc(Float64, N) 
        @turbo for i ∈ 1:N
            smallarray[i] = 1.0 / i^2
        end
        sum = 0.0
        @turbo for i ∈ 1:N
            sum += smallarray[i]
        end
    end
    return sum
end
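
For reference, here is a sketch of the same kernel against the current macro-based API (the gist above used the older function-based alloc; @turbo is replaced with plain loops so the snippet needs only Bumper):

using Bumper

@inline function sumArray_bumper2(N)
    @no_escape begin
        smallarray = @alloc(Float64, N)
        for i in 1:N
            smallarray[i] = 1.0 / i^2
        end
        s = 0.0
        for i in 1:N
            s += smallarray[i]
        end
        s # the scalar may escape the block; the array must not
    end
end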

I am focusing on values of N ranging from 3 to 100, as for larger values of N most implementations converge to similar values (about 10% overhead wrt C), with the exception of the regular Julia arrays, which are generally slower and thus require much larger values of N so the overhead is overshadowed by the actual use of memory.

My favourite method would be to use Bumper, as I think the API is great, but it is the slowest of all the methods I'm considering as alternatives to standard arrays (manually pre-allocating a standard array, MallocArrays from StaticTools, and doing malloc in C). Standard arrays are of course slower than Bumper.

Am I doing something wrong? Do you think there could be a way to remove this overhead and approach the performance of, for example, pre-allocated regular arrays?

Best,

Precompilation error

I still get a precompilation error with the specified Octavian version and StrideArrays added. Is it on my end or pkg-related?

[ Info: Precompiling Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e]
ERROR: LoadError: UndefVarError: calc_strides_len not defined
Stacktrace:
 [1] include
   @ ./Base.jl:419 [inlined]
 [2] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1554
 [3] top-level scope
   @ stdin:1
in expression starting at /Users/usr/.julia/packages/Bumper/rK9gd/src/Bumper.jl:1
in expression starting at stdin:1
ERROR: Failed to precompile Bumper [8ce10254-0962-460f-a3d8-1f77fea1446e] to /Users/usr/.julia/compiled/v1.8/Bumper/jl_LpueaC.
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:1707
  [3] compilecache
    @ ./loading.jl:1651 [inlined]
  [4] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1337
  [5] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
  [6] macro expansion
    @ ./loading.jl:1180 [inlined]
  [7] macro expansion
    @ ./lock.jl:223 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
  [9] eval
    @ ./boot.jl:368 [inlined]
 [10] eval
    @ ./Base.jl:65 [inlined]
 [11] repleval(m::Module, code::Expr, #unused#::String)
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:222
 [12] (::VSCodeServer.var"#107#109"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:186
 [13] with_logstate(f::Function, logstate::Any)
    @ Base.CoreLogging ./logging.jl:511
 [14] with_logger
    @ ./logging.jl:623 [inlined]
 [15] (::VSCodeServer.var"#106#108"{Module, Expr, REPL.LineEditREPL, REPL.LineEdit.Prompt})()
    @ VSCodeServer ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/repl.jl:187
 [16] #invokelatest#2
    @ ./essentials.jl:729 [inlined]
 [17] invokelatest(::Any)
    @ Base ./essentials.jl:726
 [18] macro expansion
    @ ~/.vscode/extensions/julialang.language-julia-1.38.2/scripts/packages/VSCodeServer/src/eval.jl:34 [inlined]
 [19] (::VSCodeServer.var"#61#62")()
    @ VSCodeServer ./task.jl:484

Proposed Interface for "Automatic" Allocations

I've written a small experimental extension to Bumper.jl I currently call WithAlloc.jl, which substitutes

A = @alloc # figure out how to allocate A 
B = @alloc # figure out how to allocate B 
dosomething!(A, B, x1, x2, x3) 

with

A, B = @withalloc dosomething!(x1, x2, x3) 

by specifying how dosomething! wants its outputs allocated. I am interested in

  • feedback about the general idea
  • comments of the concrete approach, potential bugs etc?
  • is this a sufficiently useful general pattern that I should make a PR in Bumper instead of maintaining it as an extension?
  • I am open-minded about alternatives, naming suggestions, pretty much anything ...

Thank you.

Example

using WithAlloc, LinearAlgebra, Bumper 

# simple allocating operation
B = randn(5,10)
C = randn(10, 3)
A1 = B * C

# tell `WithAlloc` how to allocate memory for `mul!`
WithAlloc.whatalloc(::typeof(mul!), B, C) = 
          (promote_type(eltype(B), eltype(C)), size(B, 1), size(C, 2))

# the "naive use" of automated pre-allocation could look like this: 
# This is essentially the code that the macro @withalloc generates
@no_escape begin 
   A2_alloc_info = WithAlloc.whatalloc(mul!, B, C)
   A2 = @alloc(A2_alloc_info...)
   mul!(A2, B, C)
   @show A2 ≈ A1
end

# but the same pattern will be repeated over and over so ...
@no_escape begin 
   A3 = @withalloc mul!(B, C)
   @show A3 ≈ A1
end

Make `AllocBuffer` just store a pointer made (by default) by `malloc`

I'm not sure there's much advantage to letting people wrap whatever type they like for this thing. Might be better to simply do:

mutable struct AllocBuffer
    ptr::Ptr{UInt8}
    length::Int
    offset::UInt
end

function AllocBuffer(length::Int; finalize=true)
    ptr = malloc(length)
    out = AllocBuffer(ptr, length, UInt(0))
    if finalize
        finalizer(x -> free(x.ptr), out)
    end
    out
end

which'd make it more similar to SlabBuffer. This'd be a breaking change, so I'd like to do it before 1.0 if I do it.

Slowdown when using `alloc!`

Hi,

I have the following example where I observe a 2x slowdown with Bumper.alloc!.

Could you please confirm that I use the package correctly?
Do you have ideas on how to fix this?

Thank you !

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19

@btime work0(polys, use_custom_allocator=true)
#   14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19

Running on

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC) 
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Sometimes get boundserror check in CI with v0.7

First of all thanks a lot for the package @MasonProtter, I just started playing with it and it already seems very useful

I just updated a small package of mine that uses Bumper from v0.6 to v0.7.

I briefly tested locally and everything works, but when I push to CI some of the tests fail (apparently randomly) with a boundscheck error.

Here are two examples from different CI runs, where once it failed on Ubuntu and once on Linux:
https://github.com/disberd/SlottedRandomAccess.jl/actions/runs/10055141472/job/27791169494
https://github.com/disberd/SlottedRandomAccess.jl/actions/runs/10054986890?pr=5

The error always seems to appear at this line below:
https://github.com/disberd/SlottedRandomAccess.jl/blob/7e9def4c5f7d6b383a0933b28272dbefd0a7c3d2/src/compute_plr.jl#L71

I did not investigate further what the issue might be (or whether I am using the library wrongly), but it did not seem to happen with v0.6, and it seems to happen randomly, which is more worrying :D.

alloc_nothrow needs improvement, or eliminating

I see that you want nothrow for StaticCompiler.jl, but there are some problems.

It will overwrite memory if you're not careful. I'm thinking you may want to check if the buffer is too small, and then there might be a way to just exit the program? I think you can print something on stderr first and then exit(1), or is there some PANIC, similar to Go?

While alloc_nothrow works in regular Julia (just not vice versa, which is why it exists), I think the functionality above could be folded into the regular alloc. If you really need to use the other malloc, could you use that in all cases? It means an extra dependency on the other package, or maybe rather use Libc.malloc directly? You can use Libc.realloc, and then you need to choose the best growing strategy yourself, but you already have one.

I'm not sure what using Julia's regular Vector buys you; it will be tracked by Julia's GC, probably a minimal slowdown, with no benefit, since you don't want your buffers reclaimed anyway. And it's just an array of bytes, which can't contain pointers to other objects. Or actually that may be possible, but then they will not be considered by the GC anyway.

Zero-dimensional arrays:

It seems Bumper.jl cannot allocate zero-dimensional arrays. There might not be much use for that in itself (as they only contain a single element, they are small), but in a general framework where some arrays may happen to be zero-dimensional, this is somewhat annoying:

julia> @no_escape begin
           a = @alloc(Float64, ())
       end
ERROR: MethodError: no method matching alloc!(::SlabBuffer{1048576}, ::Type{Float64}, ::Tuple{})

Closest candidates are:
  alloc!(::Any, ::Type{T}, ::Integer...) where {T, N}
   @ Bumper ~/.julia/packages/Bumper/eoK0g/src/internals.jl:89

Stacktrace:
 [1] macro expansion
   @ REPL[3]:2 [inlined]
 [2] macro expansion
   @ ~/.julia/packages/Bumper/eoK0g/src/internals.jl:74 [inlined]
 [3] top-level scope
   @ REPL[3]:1

julia> @no_escape begin
           a = @alloc(Float64)
       end
ERROR: MethodError: no method matching eachop(::typeof(Static._get_known), ::Tuple{}, ::Type{Tuple{}})

Closest candidates are:
  eachop(::F, ::Tuple{I}, ::Any...) where {F, I, K}
   @ Static ~/.julia/packages/Static/pkxBE/src/tuples.jl:31
  eachop(::F, ::Tuple{I1, I2, Vararg}, ::Any...) where {F, I1, I2, K}
   @ Static ~/.julia/packages/Static/pkxBE/src/tuples.jl:28

Stacktrace:
 [1] known(::Type{Tuple{}})
   @ Static ~/.julia/packages/Static/pkxBE/src/Static.jl:44
 [2] #s25#5
   @ ~/.julia/packages/StrideArraysCore/fDiB0/src/ptr_array.jl:179 [inlined]
 [3] var"#s25#5"(T::Any, N::Any, ::Any, ptr::Any, s::Any)
   @ StrideArraysCore ./none:0
 [4] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
   @ Core ./boot.jl:602
 [5] alloc!(::SlabBuffer{1048576}, ::Type{Float64})
   @ Bumper.Internals ~/.julia/packages/Bumper/eoK0g/src/internals.jl:91
 [6] macro expansion
   @ REPL[4]:2 [inlined]
 [7] macro expansion
   @ ~/.julia/packages/Bumper/eoK0g/src/internals.jl:74 [inlined]
 [8] top-level scope
   @ REPL[4]:1
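
Until zero-dimensional allocation is supported, one possible workaround (an untested sketch) is to allocate a length-1 vector and take a zero-dimensional reshape of it:

using Bumper

@no_escape begin
    v = @alloc(Float64, 1)
    a = reshape(v, ()) # zero-dimensional view backed by the bump-allocated element
    a[] = 3.0
    a[] * 2
end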

@alloc detection too narrow

In the switchover from 0.3 to 0.4, when replacing the function-based alloc with the macro-based version, I always got the error that @alloc is not within a @no_escape block, even though it obviously was.

Turns out the problem was my usage of Bumper.@alloc instead of just @alloc. From a quick glance over the code, the replacement code is looking explicitly for an @alloc. Perhaps this could be widened?

Composing with distinct allocators

In ArrayAllocators.jl, I made bindings for several allocation functions:

  1. posix_memalign
  2. VirtualAlloc2
  3. VirtualAllocEx
  4. numa_alloc_onnode
  5. numa_alloc_local

What would be a good way to compose ArrayAllocators.jl and Bumper.jl?

StackOverflow with eigen

Dear developers,
I found that the following code gives rise to a stack overflow:

using Bumper
using LinearAlgebra
function trial(x)
    @no_escape begin
        T = @alloc(eltype(x), 2, 2)
        T .= 0
        T[1,1] = x
        T[2,2] = x
        eigval, eigvects = eigen(T)
        sum(eigval)
    end
end

julia> trial(2)

Generates the following error:

ERROR: StackOverflowError:
Stacktrace:
 [1] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:199 [inlined]
 [2] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:456 [inlined]
 [3] AbstractPtrArray
   @ ~/.julia/packages/StrideArraysCore/VyBzA/src/ptr_array.jl:481 [inlined]
 [4] view(A::StrideArraysCore.PtrArray{Int64, 2, (1, 2), Tuple{Int64, Int64}, Tuple{Nothing, Nothing}, Tuple{Static.StaticInt{1}, Static.StaticInt{1}}}, i::StepRange{Int64, Int64}) (repeats 79984 times)
   @ StrideArraysCore ~/.julia/packages/StrideArraysCore/VyBzA/src/stridearray.jl:263

Am I using Bumper in the wrong way? My understanding is that the memory allocated inside @no_escape should not escape the block. Still, here, the block returns a scalar reduction of the allocated array, so the memory should not escape.

Is there another way to diagonalize a matrix allocated on the Bumper stack?

EDIT:
Also, the error occurs in the line that calls eigen(T).

Add some EnzymeRules

Currently, Enzyme.jl's reverse-mode autodiff doesn't work correctly with Bumper.jl: if you give it a Duplicated buffer, it'll += accumulate results into the duplicated buffer, making the answer depend on the state of the buffer at the start of the program.

It'd be good if we could set up some EnzymeRules to explicitly teach Enzyme how to handle Bumper.jl allocations and deallocations. I don't really know how to do this though, so if anyone wants to take it on, or work on it together please do.

Possibility of transforming existing functions to use `Bumper.jl`

Is it in principle possible to have some sort of macro that would take a function and replace its inner calls to Vector{T}(undef, n) (and similar) with @alloc(T, n)? To me this appears possible after the calls are inlined and some escape analysis is applied, but I have a very limited understanding of the problem.
So the macro could do something like:

function mapsum(f, x)
    arr = Vector{Float64}(undef, length(x))
    arr .= f.(x)
    return sum(arr)
end

transforms into

function mapsum_bumpered(f, x)
   @no_escape begin
        arr = @alloc(Float64, length(x))
        arr .= f.(x)
        ans = sum(arr)
    end
   return ans
end

Thanks!

MethodErrors caused by StrideArrays.jl overrides

MWE:

In a fresh REPL:

julia> using Bumper: @no_escape, @alloc

julia> using Random: randn!

julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})

julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
109.13606f0 + 4.8591895f0im

However, if I import StrideArrays, I get an error:

julia> using Bumper: @no_escape, @alloc

julia> using StrideArrays

julia> using Random: randn!

julia> T = ComplexF32
ComplexF32 (alias for Complex{Float32})

julia> @no_escape begin
           ar = @alloc(T, 100)
           randn!(ar)
           @. ar = cos(ar)
           sum(ar)
       end
ERROR: MethodError: no method matching vmaterialize!(::PtrArray{…}, ::Base.Broadcast.Broadcasted{…}, ::Val{…}, ::Val{…}, ::Val{…})

Closest candidates are:
  vmaterialize!(::Any, ::Any, ::Val{Mod}, ::Val{UNROLL}) where {Mod, UNROLL}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:753
  vmaterialize!(::Union{LinearAlgebra.Adjoint{T, A}, LinearAlgebra.Transpose{T, A}}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, A<:AbstractArray{T, N}, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:682
  vmaterialize!(::AbstractArray{T, N}, ::BC, ::Val{Mod}, ::Val{UNROLL}, ::Val{dontbc}) where {T<:Union{Bool, Float16, Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, SIMDTypes.Bit}, N, BC<:Union{Base.Broadcast.Broadcasted, LoopVectorization.Product}, Mod, UNROLL, dontbc}
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:673
  ...

Stacktrace:
 [1] vmaterialize!
   @ LoopVectorization ~/.julia/packages/LoopVectorization/7gWfp/src/broadcast.jl:759 [inlined]
 [2] _materialize!
   @ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:181 [inlined]
 [3] materialize!(dest::PtrArray{…}, bc::Base.Broadcast.Broadcasted{…})
   @ StrideArrays ~/.julia/packages/StrideArrays/PeLtr/src/broadcast.jl:188
 [4] macro expansion
   @ REPL[5]:4 [inlined]
 [5] macro expansion
   @ ~/.julia/packages/Bumper/eoK0g/src/internals.jl:74 [inlined]
 [6] top-level scope
   @ REPL[5]:1
Some type information was truncated. Use `show(err)` to see complete types.

I think maybe a fallback method should be used if the specialized one doesn't exist?
