
BAOrec

Julia code for iterative and finite-difference BAO reconstruction on the CPU and GPU.

Usage

Since the package is not registered, install it from the Julia REPL's package mode (press ]):

] develop https://github.com/dforero0896/BAOrec.jl.git

For optimal performance, launch a single Julia session for many reconstructions so that compilation is triggered only once. Launching a separate Julia script for every catalog loses a lot of time to compilation: while still fast, it is not optimal. For instance, the GPU kernels take ~30 s the first time a reconstruction is run, more than 90% of which is compilation.
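Inside one session, the loop over catalogs might look like the following sketch. The function names `read_catalog` and `run_reconstruction` are placeholders for illustration, not the actual BAOrec API:

```julia
# Hypothetical sketch: process many catalogs in one Julia session so that
# compilation (including GPU kernel compilation) is paid only on the first call.
# `read_catalog` and `run_reconstruction` are placeholder names.
catalog_paths = ["cat_001.dat", "cat_002.dat", "cat_003.dat"]  # placeholder paths
for path in catalog_paths
    data = read_catalog(path)          # first iteration triggers compilation
    result = run_reconstruction(data)  # later iterations reuse compiled code
    # ... write `result` to disk ...
end
```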

Usage on NERSC Perlmutter

To use the code in Perlmutter do the following

module load PrgEnv-nvidia 
module load julia/1.7.2-nvidia

then, in the Julia REPL's package manager (type julia in your terminal, then press ]), you can run

 develop https://github.com/dforero0896/BAOrec.jl.git
 add <Whatever registered package you may need>

To run a job:

srun -n1 -c128 -C gpu --account=mxxxx_g --gpus-per-task=1 julia -t 128 examples/lightcone_gpu.jl

For more information, see the NERSC documentation on submitting jobs to Perlmutter.
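The srun line above can also be wrapped in a batch script. A minimal sketch, in which the account name, resource limits, and queue are placeholders to adapt to your allocation:

```bash
#!/bin/bash
#SBATCH --account=mxxxx_g   # placeholder account
#SBATCH --constraint=gpu
#SBATCH --nodes=1
#SBATCH --time=00:30:00     # placeholder walltime

module load PrgEnv-nvidia
module load julia/1.7.2-nvidia

srun -n1 -c128 --gpus-per-task=1 julia -t 128 examples/lightcone_gpu.jl
```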

Benchmarks

The benchmarks below come from the scripts in test/ and were taken with BenchmarkTools.jl and the base Julia timer (for the less important parts).

On the CPU:

Using 128 threads.
CPU benchmarking...
Struct
0.016441 seconds (20.93 k allocations: 1.261 MiB, 95.29% compilation time)
Reconstruction iterative run
4.503 s (72584 allocations: 7.52 GiB)
Read positions from recon result
598.593 ms (129131072 allocations: 3.72 GiB)
Struct
0.016672 seconds (25.80 k allocations: 1.562 MiB, 96.70% compilation time)
Reconstruction multigrid run
24.342 s (1736157 allocations: 19.96 GiB)
Read positions from recon result
1.474 s (129131109 allocations: 4.22 GiB)

On the GPU:

Using 128 threads.
GPU benchmarking...
Data copy to GPU
1.656354 seconds (2.31 M allocations: 222.114 MiB, 77.85% compilation time)
0.067086 seconds (49.28 k allocations: 101.376 MiB, 51.22% compilation time)
Struct
0.016367 seconds (20.93 k allocations: 1.261 MiB, 94.44% compilation time)
Reconstruction iterative run
TrialEstimate(595.770 ms)
BenchmarkTools.Trial
params: BenchmarkTools.Parameters
seconds: Float64 5.0
samples: Int64 10000
evals: Int64 1
overhead: Float64 0.0
gctrial: Bool true
gcsample: Bool false
time_tolerance: Float64 0.05
memory_tolerance: Float64 0.01
times: Array{Float64}((9,)) [6.19338084e8, 5.85914455e8, 5.95770487e8, 5.95403127e8, 5.89746708e8, 6.44420176e8, 6.67371721e8, 5.82569621e8, 6.0782504e8]
gctimes: Array{Float64}((9,)) [0.0, 0.0, 0.0, 0.0, 4.136115e6, 0.0, 0.0, 4.294532e6, 0.0]
memory: Int64 616880
allocs: Int64 10106
Read positions from recon result
TrialEstimate(52.010 ms)
BenchmarkTools.Trial
params: BenchmarkTools.Parameters
seconds: Float64 5.0
samples: Int64 10000
evals: Int64 1
overhead: Float64 0.0
gctrial: Bool true
gcsample: Bool false
time_tolerance: Float64 0.05
memory_tolerance: Float64 0.01
times: Array{Float64}((91,)) [5.0593235e7, 6.8096218e7, 1.02470894e8, 4.7207952e7, 5.0682268e7, 4.6011949e7, 5.3779904e7, 6.1123841e7, 7.4825356e7, 4.2027007e7 … 4.8147999e7, 4.4062783e7, 4.5516833e7, 5.1040159e7, 5.0341599e7, 5.830944e7, 4.7245284e7, 5.3624995e7, 5.3261573e7, 4.7555834e7]
gctimes: Array{Float64}((91,)) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
memory: Int64 80384
allocs: Int64 1431
Copy results to CPU
0.021447 seconds (4.30 k allocations: 98.762 MiB, 24.60% compilation time)
Struct
0.016492 seconds (25.80 k allocations: 1.562 MiB, 95.65% compilation time)
Reconstruction multigrid run
TrialEstimate(1.820 s)
BenchmarkTools.Trial
params: BenchmarkTools.Parameters
seconds: Float64 5.0
samples: Int64 10000
evals: Int64 1
overhead: Float64 0.0
gctrial: Bool true
gcsample: Bool false
time_tolerance: Float64 0.05
memory_tolerance: Float64 0.01
times: Array{Float64}((3,)) [1.819561052e9, 1.917072578e9, 1.767079466e9]
gctimes: Array{Float64}((3,)) [0.0, 1.7609118e7, 0.0]
memory: Int64 28719152
allocs: Int64 445156
Read positions from recon result
TrialEstimate(43.777 ms)
BenchmarkTools.Trial
params: BenchmarkTools.Parameters
seconds: Float64 5.0
samples: Int64 10000
evals: Int64 1
overhead: Float64 0.0
gctrial: Bool true
gcsample: Bool false
time_tolerance: Float64 0.05
memory_tolerance: Float64 0.01
times: Array{Float64}((105,)) [4.4726826e7, 3.9916308e7, 4.1223725e7, 4.5721919e7, 4.1492334e7, 4.1519807e7, 4.9516615e7, 4.3800055e7, 4.2899173e7, 4.6753172e7 … 4.3837387e7, 3.8426798e7, 3.8705437e7, 4.8431356e7, 4.432548e7, 3.7102207e7, 3.9450497e7, 5.8317155e7, 5.4813162e7, 4.5284053e7]
gctimes: Array{Float64}((105,)) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.454491e6 … 0.0, 0.0, 0.0, 0.0, 3.027712e6, 0.0, 0.0, 0.0, 0.0, 0.0]
memory: Int64 75824
allocs: Int64 1433
Copy results to CPU
0.013548 seconds (10 allocations: 98.512 MiB)

I am not fully confident that the GPU runs are being benchmarked properly, given that they look somewhat too fast.
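One common pitfall when timing CUDA code is that kernel launches are asynchronous, so a naive measurement may only capture the launch overhead rather than the GPU work itself. With CUDA.jl, wrapping the timed expression in CUDA.@sync blocks until the device has finished. A minimal sketch (not taken from the BAOrec test scripts; the array and operation are placeholders):

```julia
using CUDA, BenchmarkTools

x = CUDA.rand(Float32, 512^2)  # placeholder device array

# Without CUDA.@sync, @btime may return as soon as the kernel is launched;
# with it, the reported time includes the full GPU execution.
@btime CUDA.@sync $x .* 2f0
```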

Issues are welcome.


Open issues

Distributed CPU implementation

Methods are still missing for using distributed arrays. While the current reconstruction part works in a distributed fashion, the real priority for the reconstruction code is distributing the catalog in a domain-aware form, as well as being able to read the shifts back in a similar manner.

  • Create a function that takes the catalog array on the main rank and distributes it.
    • This can be done via Scatter, or by sending the relevant points to each process.
    • The first is easier and would allow reading different files by design. The second would avoid the checks in cic! and read_cic! (though I don't feel comfortable removing them anyway), but it requires knowing which process is responsible for a given point so the data can be redistributed once in memory.
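A Scatter-based version of the first option could look like the following sketch using MPI.jl, where the catalog is represented as a flat coordinate vector on rank 0 and the sizes are placeholders:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

ntotal = 1_000  # placeholder catalog size
# Split the catalog as evenly as possible across ranks.
counts = [div(ntotal, nprocs) + (r < mod(ntotal, nprocs) ? 1 : 0) for r in 0:nprocs-1]

# Rank 0 holds the full (placeholder) catalog; the other ranks send nothing.
sendbuf = rank == 0 ? rand(Float32, ntotal) : nothing
recvbuf = Vector{Float32}(undef, counts[rank + 1])

MPI.Scatterv!(rank == 0 ? VBuffer(sendbuf, counts) : nothing, recvbuf, comm; root=0)

MPI.Finalize()
```

Each coordinate column (and the weights) would be scattered the same way; reading the shifts back on the main rank would use the matching Gatherv operation.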
