quokka-astro / quokka Goto Github PK

View Code? Open in Web Editor NEW

36.0 4.0 11.0 41.43 MB

Two-moment AMR radiation hydrodynamics (with self-gravity, particles, and chemistry) on CPUs/GPUs for astrophysics

Home Page: https://quokka-astro.github.io/quokka/

License: MIT License

C++ 57.18% CMake 3.06% Shell 1.17% TeX 22.60% Python 0.83% Jupyter Notebook 15.17%

cuda gpu hydrodynamics adaptive-mesh-refinement hip rocm astrochemistry astrophysics particles self-gravity

quokka's Introduction

QUOKKA

Quadrilateral, Umbra-producing, Orthogonal, Kangaroo-conserving Kode for Astrophysics!

The Quokka methods paper is now available: https://arxiv.org/abs/2110.01792

For detailed instructions on installing the code, please refer to the Quokka Documentation. You can start a Discussion for technical support, or open an Issue for any bug reports.

Quokka is a two-moment radiation hydrodynamics code that uses the piecewise-parabolic method, with AMR and subcycling in time. Runs on CPUs (MPI+vectorized) or NVIDIA GPUs (MPI+CUDA) with a single-source codebase. Written in C++17. (100% Fortran-free.)

Here is a a Kelvin-Helmholz instability simulated with Quokka on a 512x512 uniform grid:

This is a 3D Rayleigh-Taylor instability simulated on a $256^3$ grid:

Quokka also features advanced Adaptive Quokka Refinement:tm: technology:

Dependencies

C++ compiler (with C++17 support)
CMake 3.16+
MPI library with GPU-aware support (OpenMPI, MPICH, or Cray MPI)
HDF5 1.10+ (serial version)
CUDA 11.7+ (optional, for NVIDIA GPUs)
ROCm 5.2.0+ (optional, for AMD GPUs)
Ninja (optional, for faster builds)
Python 3.7+ (optional)
ADIOS2 2.9+ with GPU-aware support (optional, for writing terabyte-sized or larger outputs)

Problems?

If you run into problems, please start a Discussion for technical support. If you discover a bug, please let us know by opening an Issue.

quokka's People

Contributors

Stargazers

Watchers

Forkers

knut0815 psharda adacs-australia astrokriel mooerslab aditivijayan pakshingli justinh2002 dmasoumi

quokka's Issues

set arena size to max 16GiB to avoid libfabric bug

The initial memory pool allocation size must be set to amrex.the_arena_init_size=17179869184 (or smaller) in order to avoid a bug in the Cray libfabric memory registration cache on Slingshot-11 systems.

Note that 17179869184 = 16*1024*1024*1024.

use not-quite-transcendental functions for scaled interpolation of cooling table

See code: https://github.com/lanl/not-quite-transcendental and paper: https://arxiv.org/abs/2206.08957.

This should make the cooling solve (depends on performing 10^x multiple times for each substep in each cell!) significantly faster.

do coarse-fine interpolation for ghost cells with internal energy

To avoid temperature glitches, it is necessary to interpolate at coarse-fine boundaries with the primitive variables when filling the ghost cells in the hydro update.

This could be accomplished by creating a temporary multifab with the primitive variables, then performing the ghost cell exchange with it. This would increase GPU memory usage significantly, since primitive variable multifabs would be needed to be stored for every AMR level.

compute 1D profiles using amrex::sumToLine

For various applications, it is useful to compute a 1D profile of some quantity at each timestep. This should be useful for several things, so it should go in RadhydroSimulation or a friend class.

This can be implemented by filling a vector of MultiFabs (one for each level) with the quantity of interest, averaging down (recursively) to the level 0 MultiFab, and then using amrex::sumToLine(), which returns a vector on the host consisting of the 1D profile along a given axis.

Doxygen reference: https://amrex-codes.github.io/amrex/doxygen/namespaceamrex.html#a882a700cf7fab66ee3ce8de37de3ef73

checkHydroStates does not work

checkHydroStates causes a segfault. This is because it was originally written to run on the CPU but access the MultiFab data stored on the GPU by taking advantage of CUDA managed memory. Since managed memory is now disabled in Quokka for MPI performance reasons, this no longer works.

We need to rewrite checkHydroStates to run on device with amrex::ParReduce.

CI builds fail on external PRs due to sonarcloud limitations

GitHub tokens aren't available to external PRs for security reasons. This means that Sonarcloud analysis can't run on external PRs. This causes the GitHub actions CMake build to fail.

Solution: put the Sonarcloud analysis in a separate GitHub action that only runs on push to development (not on pull requests).

suppress unphysical oscillations in the diffusion limit

See the last paragraph of section 3.1 of [1].

[1] https://ui.adsabs.harvard.edu/abs/2022MNRAS.512.1499R/abstract

first-order flux correction should revert to forward Euler

The short story is that the current way we do flux correction doesn't guarantee positivity for the result of the second (or later, if any) stages. To guarantee positivity, we need to revert to first-order in both space and time.

First-order flux correction, as originally proposed by [1], uses the VL2 integrator, and as such can simply revert to the fluxes produced in the predictor step. For RK2 integration, things are less clear, since it is a multi-stage integrator where each stage is a convex combination of the current-stage fluxes and the previous stage state values. However, RK2 can be written as a convex combination of the first- and second-stage fluxes alone. Then for fluxes adjacent to problem cells, we can replace the high-order combination with the first-order (both space and time) flux. This is conceptually straightforward, but requires an additional set of flux multifabs for temporary storage during the hydro update.

A similar first-order correction has been described by [2], but they apply the correction to the cell state, rather than to the fluxes, so it is not conservative.

[1] https://ui.adsabs.harvard.edu/abs/2009ApJ...691.1092L/abstract
[2] https://doi.org/10.1016/S0045-7825(97)00228-4
[3] Linde, Timur, Philip Roe. Robust Euler Codes. 13th Computational Fluid Dynamics Conference, 2098 (1997)
[4] Perthame, B., Shu, CW. On positivity preserving finite volume schemes for Euler equations . Numer. Math. 73, 119–130 (1996). https://doi.org/10.1007/s002110050187

use open-boundary Poisson solver for self-gravity

Use the new AMReX open-boundary ('vacuum' boundary conditions) Poisson solver for composite and level solves of the Poisson equation sourced by the gas density. This will enable simulations of self-gravitating baryonic systems (e.g., star clusters).

See AMReX-Codes/amrex#2912 for details.

migrate ReadTheDocs to Github Pages

Github.io/Github Pages allows us to manage everything with GitHub org permissions and recompile the docs with GitHub Actions, rather than needing to use a separate cloud service. This also means that we can automatically check that PRs that modify the docs successfully compile before they're merged.

Sphinx compile action: https://github.com/ammaraskar/sphinx-action
AMReX docs compile action: https://github.com/AMReX-Codes/amrex/blob/development/.github/workflows/docs.yml

implement backtracking line search in Newton-Raphson solver

This [1] can greatly increase the basin of convergence of the Newton-Raphson method used for matter-radiation energy exchange.

[1] Eisenstat & Walker (1994), https://doi.org/10.1137/0804022
[2] Numerical Recipes (3rd Edition, 2007), Section 9.7.1

Sedov problem does not conserve energy to (double) machine precision on GPU

In some commit after 21.10, the Sedov problem stopped conserving energy on GPU to (double) machine precision. Instead, it seems to only conserve energy to ~6-7 decimal places, which would be expected if running in single-precision (which it is not). On CPU, it conserves energy to double precision machine epsilon or better. On both CPU and GPU, all tests still pass, so it appears that there is an inadvertent conversion to float somewhere (but only when compiling for GPU)...

clean-up Grackle HDF5 code

This code was borrowed from Grackle and doesn't adhere to C++ best practices. For some reason, it also keeps triggering false positive issues on Sonarcloud.

yt fails to read HDF5 (Chombo format) outputs

Tested with yt 4.0.1. See error message below.

>> ds = yt.load('./plt00000.h5')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-fe75068625c1> in <module>
----> 1 ds = yt.load('./plt00000.h5')

~/.local/lib/python3.6/site-packages/yt/loaders.py in load(fn, *args, **kwargs)
     91 
     92     if len(candidates) == 1:
---> 93         return candidates[0](fn, *args, **kwargs)
     94 
     95     if len(candidates) > 1:

~/.local/lib/python3.6/site-packages/yt/frontends/chombo/data_structures.py in __init__(self, filename, dataset_type, storage_filename, ini_filename, units_override, unit_system, default_species_fields)
    262             units_override=units_override,
    263             unit_system=unit_system,
--> 264             default_species_fields=default_species_fields,
    265         )
    266         self.storage_filename = storage_filename

~/.local/lib/python3.6/site-packages/yt/data_objects/static_output.py in __init__(self, filename, dataset_type, file_style, units_override, unit_system, default_species_fields)
    226         self._create_unit_registry(used_unit_system)
    227 
--> 228         self._parse_parameter_file()
    229         self.set_units()
    230         self.setup_cosmology()

~/.local/lib/python3.6/site-packages/yt/frontends/chombo/data_structures.py in _parse_parameter_file(self)
    293 
    294         self.dimensionality = self._handle["Chombo_global/"].attrs["SpaceDim"]
--> 295         self.domain_left_edge = self._calc_left_edge()
    296         self.domain_right_edge = self._calc_right_edge()
    297         self.domain_dimensions = self._calc_domain_dimensions()

~/.local/lib/python3.6/site-packages/yt/frontends/chombo/data_structures.py in _calc_left_edge(self)
    343         dx0 = fileh["/level_0"].attrs["dx"]
    344         D = self.dimensionality
--> 345         LE = dx0 * ((np.array(list(fileh["/level_0"].attrs["prob_domain"])))[0:D])
    346         return LE
    347 

TypeError: only integer scalar arrays can be converted to a scalar index

combine small FABs in a single kernel launch

Since CUDA stream parallelism does not work in practice (too many registers are used for multiple streams to run concurrently), we need to combine several small FABs within each MFIter iteration so that we get >= 128^3 cells per kernel. Otherwise, GPU threads/blocks are just sitting idle.

For single-level simulations, we can just increase the blocking factor to 128. However, this is not viable for AMR simulations, where the typical blocking_factor is 32 (ideally would be 16).

See AMReX-Codes/amrex#2638 for suggested implementation strategy.

add output plotfiles with a 2D slice through the data

See https://github.com/AMReX-Combustion/PelePhysics/blob/b5e6bbf1eed1382ea29a1a5a85fd640ec36f7b67/Utility/Diagnostics/DiagFramePlane.cpp for an example of how to do this.

carbuncle correction may be insufficient for strongly-cooling blast waves

The carbuncle correction from Minoshima & Miyoshi (2021) [1] is not sufficient for blasts in strongly-cooling environments. We need to fall back to HLL in directions perpendicular to strong shocks, following Quirk (1994) [2].

[1] https://arxiv.org/abs/2108.04991
[2] https://onlinelibrary.wiley.com/doi/abs/10.1002/fld.1650180603

some tests fail to compile when CUDA is enabled with strange errors

Some tests fail to compile due to compiler errors about missing functions. However, the errors come from headers that are also included in tests that do successfully compile, so it doesn't make much sense. Need to bisect to find when this was introduced, and also add a CUDA build to GitHub Actions so CUDA build issues can be identified on each commit.

unreadable plotfiles produced in some runs when amrex.async_out=1

A small fraction (about 3%) of plotfiles seem to be corrupted when running with amrex.async_out=1 on Gadi in production AMR simulations (with the shock-cloud problem).

This only affects runs with async output enabled. It should be safe to run with async_out=0 (which is the default).

See AMReX-Codes/amrex#2702

non-conservation occurs with reflecting boundary conditions

When restarting a hydro problem with reflecting boundary conditions, non-conservation of mass occurs within 1 timestep. The conservation relative error is of the order 10^{-9}. (Mass is conserved to machine precision with fully-periodic boundaries.)

This does not happen with the HydroBlast3D problem, so there must be something subtle going on.

check for CFL violation at the end of the hydro advance

Due to nonlinear effects, it's possible for the CFL stable timestep to be violated over the coarse of a timestep, even if the timestep was computed immediately prior to the hydro update. We can prevent this from causing the simulation to crash by checking whether the CFL timestep was violated post-hoc and (if needed) redoing the update after halving the timestep.

See, e.g.: https://github.com/AMReX-Astro/Castro/blob/f41722cc24da0843210337258b8e6e8d7c2ceaf5/Source/hydro/Castro_hydro.cpp#L234

reflux does not work correctly for radiation subsystem

NaNs start to propagate...

This may be caused by the non-conservative way the first-order flux correction is currently done. But it may be caused by subcycling, in which case a sync solve (or deferred sync similar to Castro) will be necessary. Further investigation needed.

add primordial chemistry network from Microphysics

Piyush has successfully used the AMReX-Astro/Microphysics integrator to integrate a primordial chemistry network. This needs to be called for each cell during the Strang-split update. A suitable test problem needs to be written.

See AMReX-Astro/Microphysics#962.

implement IMEX PD-ARS integrator for radiation subsystem

IMEX PD-ARS is a convex-invariant, asymptotic-preserving integrator that reduces to RK2-SSP in the free-streaming limit [1].

The current integrator is RK2-SSP followed by an operator-split step, which is not asymptotic-preserving for non-steady-state solutions.

[1] https://doi.org/10.1016/j.jcp.2019.03.037

unnecessary global synchronization in cooling solve

There is a check for the maximum iteration count in the cooling solve that is computed with MPI_Allreduce. This is not necessary, since it's only used to crash the simulation if the cooling solve has failed. Instead, we can check failure conditions on each rank, then call amrex::Abort() if the cooling solve has failed.

ParaView does not render output correctly for large boxes

This is an issue when examining output from single-GPU performance benchmarks.

Appears to be a ParaView bug.

nvcc internal compiler error when calling template parameter function from inside device lambda

This code

template <typename problem_t>
template <typename F>
void HyperbolicSystem<problem_t>::PredictStep(
    arrayconst_t &consVarOld, array_t &consVarNew,
    std::array<arrayconst_t, AMREX_SPACEDIM> fluxArray, const double dt_in,
    amrex::GpuArray<amrex::Real, AMREX_SPACEDIM> dx_in, amrex::Box const &indexRange,
    const int nvars, F&& isStateValid, amrex::Array4<int> const &redoFlag)
{
	auto const dt = dt_in;
	auto const dx = dx_in[0];
	auto const x1Flux = fluxArray[0];
#if (AMREX_SPACEDIM >= 2)
	auto const dy = dx_in[1];
	auto const x2Flux = fluxArray[1];
#endif
#if (AMREX_SPACEDIM == 3)
	auto const dz = dx_in[2];
	auto const x3Flux = fluxArray[2];
#endif

	amrex::ParallelFor(
	    indexRange, [=] AMREX_GPU_DEVICE(int i, int j, int k) noexcept {
			for (int n = 0; n < nvars; ++n) {
				consVarNew(i, j, k, n) =
				consVarOld(i, j, k, n) +
				(AMREX_D_TERM( (dt / dx) * (x1Flux(i, j, k, n) - x1Flux(i + 1, j, k, n)),
							+ (dt / dy) * (x2Flux(i, j, k, n) - x2Flux(i, j + 1, k, n)),
							+ (dt / dz) * (x3Flux(i, j, k, n) - x3Flux(i, j, k + 1, n))
							));
			}

			// check if state is valid -- flag for re-do if not
			if (!isStateValid(consVarNew, i, j, k)) {
				redoFlag(i, j, k) = quokka::redoFlag::redo;
			} else {
				redoFlag(i, j, k) = quokka::redoFlag::none;
			}
	    });
}

triggers the following nvcc compiler bug:

[137/219] Building CUDA object src/Advection/CMakeFiles/test_advection.dir/test_advection.cpp.o
FAILED: src/Advection/CMakeFiles/test_advection.dir/test_advection.cpp.o 
/usr/local/cuda-11.5/bin/nvcc -forward-unknown-to-host-compiler -DHAVE_PYTHON -I/usr/include/python3.6m -I/avatar/bwibking/quokka/src -I/avatar/bwibking/quokka/extern/amrex/Src/Base -I/avatar/bwibking/quokka/extern/amrex/Src/Base/Parser -I/avatar/bwibking/quokka/extern/amrex/Src/Boundary -I/avatar/bwibking/quokka/extern/amrex/Src/AmrCore -I/avatar/bwibking/quokka/build/amrex -I/avatar/bwibking/quokka/extern/fmt/include -isystem=/avatar/bwibking/spack/opt/spack/linux-rocky8-skylake_avx512/gcc-8.4.1/openmpi-4.1.2-tsfc7y5zaav3q77d3trmeja4273mayvi/include --fmad=false -O3 -DNDEBUG --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_80,code=[compute_80,sm_80] --expt-relaxed-constexpr --expt-extended-lambda -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -maxrregcount=255 -Xcudafe --display_error_number --Wext-lambda-captures-this --generate-line-info --source-in-ptx -Xcompiler -pthread -std=c++17 -MD -MT src/Advection/CMakeFiles/test_advection.dir/test_advection.cpp.o -MF src/Advection/CMakeFiles/test_advection.dir/test_advection.cpp.o.d -x cu -dc /avatar/bwibking/quokka/src/Advection/test_advection.cpp -o src/Advection/CMakeFiles/test_advection.dir/test_advection.cpp.o
nvcc_internal_extended_lambda_implementation: In instantiation of ‘struct __nv_dl_wrapper_t<__nv_dl_tag<void (*)(const amrex::Array4<const double>&, const amrex::Array4<double>&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, bool (&)(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int>&), HyperbolicSystem<SawtoothProblem>::PredictStep<bool (&)(const amrex::Array4<const double>&, int, int, int)>, 1>, const int, const amrex::Array4<double>, const amrex::Array4<const double>, const double, const double, const amrex::Array4<const double>, bool(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int> >’:
/avatar/bwibking/quokka/src/hyperbolic_system.hpp:347:32:   required from ‘static void HyperbolicSystem<problem_t>::PredictStep(arrayconst_t&, array_t&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, F&&, const amrex::Array4<int>&) [with F = bool (&)(const amrex::Array4<const double>&, int, int, int); problem_t = SawtoothProblem; arrayconst_t = const amrex::Array4<const double>; array_t = const amrex::Array4<double>]’
/tmp/tmpxft_00173f31_00000000-6_test_advection.compute_80.cudafe1.stub.c:53:613:   required from here
nvcc_internal_extended_lambda_implementation:262:43: error: field ‘__nv_dl_wrapper_t<__nv_dl_tag<void (*)(const amrex::Array4<const double>&, const amrex::Array4<double>&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, bool (&)(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int>&), HyperbolicSystem<SawtoothProblem>::PredictStep<bool (&)(const amrex::Array4<const double>&, int, int, int)>, 1>, const int, const amrex::Array4<double>, const amrex::Array4<const double>, const double, const double, const amrex::Array4<const double>, bool(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int> >::f7’ invalidly declared function type
nvcc_internal_extended_lambda_implementation: In instantiation of ‘struct __nv_dl_wrapper_t<__nv_dl_tag<void (*)(const amrex::Array4<double>&, const amrex::Array4<const double>&, const amrex::Array4<const double>&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, bool (&)(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int>&), HyperbolicSystem<SawtoothProblem>::AddFluxesRK2<bool (&)(const amrex::Array4<const double>&, int, int, int)>, 1>, const int, const amrex::Array4<const double>, const amrex::Array4<const double>, const double, const double, const amrex::Array4<const double>, const amrex::Array4<double>, bool(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int> >’:
/avatar/bwibking/quokka/src/hyperbolic_system.hpp:394:32:   required from ‘static void HyperbolicSystem<problem_t>::AddFluxesRK2(array_t&, arrayconst_t&, arrayconst_t&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, F&&, const amrex::Array4<int>&) [with F = bool (&)(const amrex::Array4<const double>&, int, int, int); problem_t = SawtoothProblem; array_t = const amrex::Array4<double>; arrayconst_t = const amrex::Array4<const double>]’
/tmp/tmpxft_00173f31_00000000-6_test_advection.compute_80.cudafe1.stub.c:54:655:   required from here
nvcc_internal_extended_lambda_implementation:278:43: error: field ‘__nv_dl_wrapper_t<__nv_dl_tag<void (*)(const amrex::Array4<double>&, const amrex::Array4<const double>&, const amrex::Array4<const double>&, std::array<const amrex::Array4<const double>, 1>, double, amrex::GpuArray<double, 1>, const amrex::Box&, int, bool (&)(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int>&), HyperbolicSystem<SawtoothProblem>::AddFluxesRK2<bool (&)(const amrex::Array4<const double>&, int, int, int)>, 1>, const int, const amrex::Array4<const double>, const amrex::Array4<const double>, const double, const double, const amrex::Array4<const double>, const amrex::Array4<double>, bool(const amrex::Array4<const double>&, int, int, int), const amrex::Array4<int> >::f8’ invalidly declared function type

Originally posted by @BenWibking in #25 (comment)

upgrade GPU CI machine to CUDA 12

We need to upgrade the driver to 525.00 or higher. This requires IT staff to upgrade the driver and reboot avatargpu.

allow adding non-conservative source terms to RK stages

For multiple physics modules, we need to integrate source terms in an temporally-unsplit manner. We need to add the ability to pass a FAB corresponding to rhs terms at each RK stage in non-conservation law form. (Note that this requires the rhs to be computed using the values at the beginning of the stage -- this is not equivalent to operator split updates at each stage.)

Note that the current RK integrator functions only specify arguments for the flux-divergence terms along each coordinate direction. This was done in order to ensure exact symmetry preservation of the flux-divergence terms. This is the approach taken in Athena++ for source terms that can be case in flux-divergence form [1], but is not used for gravity source terms [2] [3] [4] -- the Athena++ gravity methods paper is not actually the method used in the code!

[1] https://github.com/PrincetonUniversity/athena/wiki/Diffusion-Processes
[2] PrincetonUniversity/athena@1d264a9
[3] https://github.com/PrincetonUniversity/athena/blob/master/src/hydro/srcterms/self_gravity.cpp
[4] https://github.com/PrincetonUniversity/athena/blob/master/src/hydro/srcterms/hydro_srcterms.cpp

incorrect code generated by intel compiler classic

The 'classic' intel compilers icc/icpc perform incorrect floating-point optimizations above -O1. Simulations turn into NaNs.

Make EOS consistent between hydro and chemistry

In simulations with hydro+chemistry, we should use the same EOS for hydro and chemistry. This is currently not the case. For primordial chem, we are using the EOS in Microphysics whereas for hydro we are using the one in Quokka Physics Traits.

buffer overflow in HydroQuirk

Somehow, I have created a buffer overflow in the HydroQuirk problem. (This does not happen for any other problem.)

MPI initialized with 1 MPI processes
MPI initialized with thread support level 3
AMReX (22.06-3-g6e907ff7cf8e) initialized
ishock = 50
=================================================================
==1393086==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7fffef049d20 at pc 0x000000422c5e bp 0x7fffffff81d0 sp 0x7fffffff81c8
WRITE of size 8 at 0x7fffef049d20 thread T0
    #0 0x422c5d in operator() ../src/HydroQuirk/test_quirk.cpp:122
    #1 0x422c5d in call_f<RadhydroSimulation<QuirkProblem>::setInitialConditionsAtLevel(int)::<lambda(int, int, int)> > ../extern/amrex/Src/Base/AMReX_GpuLaunchFunctsC.H:29
    #2 0x422c5d in ParallelFor<RadhydroSimulation<QuirkProblem>::setInitialConditionsAtLevel(int)::<lambda(int, int, int)> > ../extern/amrex/Src/Base/AMReX_GpuLaunchFunctsC.H:119
    #3 0x422c5d in RadhydroSimulation<QuirkProblem>::setInitialConditionsAtLevel(int) ../src/HydroQuirk/test_quirk.cpp:82
    #4 0x58bc98 in AMRSimulation<QuirkProblem>::MakeNewLevelFromScratch(int, double, amrex::BoxArray const&, amrex::DistributionMapping const&) ../src/simulation.hpp:944
    #5 0x69d02d in amrex::AmrMesh::MakeNewGrids(double) ../extern/amrex/Src/AmrCore/AMReX_AmrMesh.cpp:779
    #6 0x56eca5 in AMRSimulation<QuirkProblem>::setInitialConditions() ../src/simulation.hpp:435
    #7 0x4318bf in problem_main() ../src/HydroQuirk/test_quirk.cpp:288
    #8 0x4167f4 in main ../src/main.cpp:53
    #9 0x7ffff4549492 in __libc_start_main (/lib64/libc.so.6+0x23492)
    #10 0x41a21d in _start (/avatar/bwibking/quokka/build/src/HydroQuirk/test_quirk+0x41a21d)

0x7fffef049d20 is located 4384 bytes to the right of 156672-byte region [0x7fffef022800,0x7fffef048c00)
allocated by thread T0 here:
    #0 0x7ffff72ad35f in __interceptor_malloc /tmp/bwibking/spack-stage/spack-stage-gcc-11.2.0-2q7tmnw67c2l56v4ljao3sh3fwokgvck/spack-src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x630e41 in amrex::DataAllocator::alloc(unsigned long) const ../extern/amrex/Src/Base/AMReX_DataAllocator.H:17
    #2 0x630e41 in amrex::BaseFab<double>::define() ../extern/amrex/Src/Base/AMReX_BaseFab.H:1891

SUMMARY: AddressSanitizer: heap-buffer-overflow ../src/HydroQuirk/test_quirk.cpp:122 in operator()
Shadow bytes around the buggy address:
  0x10007de01350: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de01360: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de01370: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de01380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de01390: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x10007de013a0: fa fa fa fa[fa]fa fa fa fa fa fa fa fa fa fa fa
  0x10007de013b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de013c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de013d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de013e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10007de013f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1393086==ABORTING

add Strang split source term function

Modify the hydro update so that Strang split sources are added before and after the hydro update. This requires copying the old-time state state_old_[lev] to a temporary MultiFab before the first source terms are applied (otherwise, the time interpolation on finer AMR levels does not work).

A100 performance regression

Performance on the 3D Sedov test on a single A100 is now ~210 Mupdates/s, rather than ~250 Mupdates/s as of the code paper.

Need to automate git bisect to find out when it drops.

use FillPatcher class to reduce communication

See AMReX-Codes/amrex#2972

Mellanox hardware-accelerated collectives cause crash when using >= 64 nodes/IB cards

Reported upstream here: openucx/ucx#7391

Workaround: disable multicast when using hcoll by adding -x HCOLL_ENABLE_MCAST=0 to the mpirun flags.

add PerformanceHints to ensure reasonable grid balance and grid parameters

Something like this: https://github.com/ax3l/WarpX/blob/9f204ef8532ebbfeab645ec4779f1ffdf9c1fba8/Source/Initialization/WarpXInitData.cpp#L536

Unlike WarpX, we need to do this check for each AMR level, since subcycling means we can't parallelize over levels.

Also: ensure blocking_factor >= 32 and max_grid_size >= 128 if running on GPUs.

reduce memory usage in timestep retries

We can now copy the data stored in YAFluxRegister and restore it, rather than creating temporary flux multifabs on each level: AMReX-Codes/amrex#3004

The memory usage is currently a limitation for the shock-cloud problem when max_lev >= 3.

add macOS CI

Use GitHub Actions to run the test suite automatically on macOS. This will ensure that the code is able to compile for new users who have Macs despite the quirks in Apple's version of Clang.

ComputeVelocityDifferences kernel is not correct for 1D or 2D builds

While it works correctly when AMREX_SPACEDIM=3, the array accessors try to access elements that don't exist in 1D and 2D. This needs to be fixed.

Spherical Collapse test takes too long on GPUs

The spherical collapse takes a long time on GPUs. On 1 GPU (with 12 CPUs), it took 3.5 hours:

              Resource Usage on 2022-12-28 06:32:05:

Job Id: 68252464.gadi-pbs
Project: jh2
Exit Status: 0
Service Units: 258.52
NCPUs Requested: 12 NCPUs Used: 12
CPU Time Used: 43:02:55
Memory Requested: 192.0GB Memory Used: 10.78GB
NGPUs Requested: 1 GPU Utilisation: 0%
GPU Memory Used: 359.0MB
Walltime requested: 04:00:00 Walltime Used: 03:35:26
JobFS requested: 100.0MB JobFS used: 8.07MB

On 48 CPUs, it took 17 minutes:

======================================================================================
Resource Usage on 2022-12-21 10:13:03:
Job Id: 67521319.gadi-pbs
Project: jh2
Exit Status: 0
Service Units: 85.04
NCPUs Requested: 48 NCPUs Used: 48
CPU Time Used: 14:05:27
Memory Requested: 192.0GB Memory Used: 21.67GB
Walltime requested: 04:00:00 Walltime Used: 00:17:43
JobFS requested: 100.0MB JobFS used: 8.16MB

checkpoint and stop when out of memory

Currently, AMReX throws an exception and crashes when we run out of GPU memory (because we have refined too much). This requires manually restarting from the last saved checkpoint on a larger number of nodes. In principle, we should be able to catch the exception and write out restartfiles, tell the user that more nodes are needed, and then stop.

This is non-trivial because the out-of-memory condition will usually be reached when creating temporary MultiFabs on refined levels, so the state data across levels will not be at a consistent time.

CUDA 11.6 generates invalid device code for all physicalBoundaryFunctors

nvcc 11.6.55 generates invalid device code for all test problems without fully-periodic boundary conditions:

Starting program: /avatar/bwibking/quokka/build/src/HydroShocktube/test_hydro_shocktube shocktube.in
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
warning: Cannot parse .gnu_debugdata section; LZMA support was disabled at compile time
[Detaching after fork from child process 231811]
[New Thread 0x7ffff4661000 (LWP 231815)]
[New Thread 0x7ffff3e60000 (LWP 231816)]
Initializing CUDA...
[Detaching after fork from child process 231818]
[New Thread 0x7ffff162f000 (LWP 231828)]
[New Thread 0x7ffff0e2e000 (LWP 231829)]
CUDA initialized with 1 GPU per MPI rank; 1 GPU(s) used in total
MPI initialized with 1 MPI processes
MPI initialized with thread support level 0
AMReX (22.01-1-ga73ce6759c91) initialized
warning: Cuda API error detected: cudaLaunchKernel returned (0x62)

warning: Cuda API error detected: cudaGetLastError returned (0x62)

terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU last error detected in file /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 862: invalid device function

Thread 1 "test_hydro_shoc" received signal SIGABRT, Aborted.
0x00007ffff63ae37f in raise () from /lib64/libc.so.6
(cuda-gdb) bt
#0  0x00007ffff63ae37f in raise () from /lib64/libc.so.6
#1  0x00007ffff6398db5 in abort () from /lib64/libc.so.6
#2  0x00007ffff6d6609b in ?? () from /lib64/libstdc++.so.6
#3  0x00007ffff6d6c53c in ?? () from /lib64/libstdc++.so.6
#4  0x00007ffff6d6b559 in ?? () from /lib64/libstdc++.so.6
#5  0x00007ffff6d6bed8 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6  0x00007ffff674cb03 in ?? () from /lib64/libgcc_s.so.1
#7  0x00007ffff674d071 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#8  0x00007ffff6d6c7eb in __cxa_throw () from /lib64/libstdc++.so.6
#9  0x0000000000484c22 in amrex::Abort_host (
    msg=0xec5c050 "GPU last error detected in file /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 862: invalid device function") at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX.cpp:237
#10 0x0000000000484a27 in amrex::Abort (
    msg=0xec5c050 "GPU last error detected in file /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 862: invalid device function") at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX.H:154
#11 amrex::Abort (msg=...) at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX.cpp:197
#12 0x000000000041a394 in amrex::Gpu::ErrorCheck (
    file=0x81c988 "/avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H", line=862)
    at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuError.H:57
#13 0x0000000000453e94 in amrex::ParallelFor<__nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::ccfcdoit<amrex::FilccCell>, 1u>, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, setBoundaryFunctor<ShocktubeProblem> const, amrex::GeometryData const, double const, int const> > (box=..., f=...)
    at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:862
#14 0x000000000044632b in amrex::ParallelFor<__nv_dl_wrapper_t<__nv_dl_tag<void (amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::*)(amrex::Box const&, amrex::FArrayBox&, int, int, amrex::Geometry const&, double, amrex::Vector<amrex::BCRec, std::allocator<amrex::BCRec> > const&, int, int, amrex::FilccCell&&), &amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::ccfcdoit<amrex::FilccCell>, 1u>, amrex::FilccCell, amrex::Array4<double> const, int const, int const, amrex::Box const, amrex::BCRec*, setBoundaryFunctor<ShocktubeProblem> const, amrex::GeometryData const, double const, int const> > (box=..., f=...)
    at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1275
#15 0x000000000042864d in amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::ccfcdoit<amrex::FilccCell> (
    this=0x7fffffff8130, bx=..., dest=..., dcomp=0, numcomp=9, geom=..., time=0, bcr=..., bcomp=0, orig_comp=0,
    fillfunc=...) at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_PhysBCFunct.H:383
#16 0x0000000000455a0d in amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> >::operator() (
    this=0x7fffffff8130, bx=..., dest=..., dcomp=0, numcomp=9, geom=..., time=0, bcr=..., bcomp=0, orig_comp=0)
    at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_PhysBCFunct.H:204
#17 0x0000000000449b91 in amrex::PhysBCFunct<amrex::GpuBndryFuncFab<setBoundaryFunctor<ShocktubeProblem> > >::operator() (this=0x7fffffff80c0, mf=..., icomp=0, ncomp=9, nghost=..., time=0, bccomp=0)
    at /avatar/bwibking/quokka/extern/amrex/Src/Base/AMReX_PhysBCFunct.H:177
#18 0x000000000046cca1 in AMRSimulation<ShocktubeProblem>::fillBoundaryConditions (this=0x7fffffff8620, S_filled=...,
    state=..., lev=0, time=0) at /avatar/bwibking/quokka/src/simulation.hpp:695
#19 0x000000000046a6e1 in AMRSimulation<ShocktubeProblem>::MakeNewLevelFromScratch (this=0x7fffffff8620, level=0,
    time=0, ba=..., dm=...) at /avatar/bwibking/quokka/src/simulation.hpp:628
#20 0x00000000007a0435 in amrex::AmrMesh::MakeNewGrids (this=0x7fffffff8620, time=0)
    at /avatar/bwibking/quokka/extern/amrex/Src/AmrCore/AMReX_AmrMesh.cpp:779
#21 0x00000000006fb236 in amrex::AmrCore::InitFromScratch (this=0x7fffffff8620, time=0)
    at /avatar/bwibking/quokka/extern/amrex/Src/AmrCore/AMReX_AmrCore.cpp:64
#22 0x000000000041e227 in AMRSimulation<ShocktubeProblem>::setInitialConditions (this=0x7fffffff8620)
    at /avatar/bwibking/quokka/src/simulation.hpp:280
#23 0x0000000000410ee8 in problem_main () at /avatar/bwibking/quokka/src/HydroShocktube/test_hydro_shocktube.cpp:332
#24 0x000000000040dde5 in main (argc=2, argv=0x7fffffff8d28) at /avatar/bwibking/quokka/src/main.cpp:41

Workaround: Use CUDA <= 11.5.

add the ability to retry coarse steps with a smaller timestep

Sometimes the level-by-level retries fail for hydro problems, especially with strong cooling. Limiting the timestep on each level to the cooling time on each level fixes the problem, but increases the computational cost by over an order of magnitude. In principle, we should be able to re-try the whole coarse timestep when the retries for a given level fail.

This requires additional memory usage to be able to restore the state from the previous coarse timestep (all MultiFabs on all levels must be stored), but we could follow Castro and copy the backup data to host memory to reduce GPU memory pressure.

See: https://github.com/AMReX-Astro/Castro/blob/main/Source/driver/Castro_advance_ctu.cpp#L486-L514

RK3-SSP integration

Three stage, third-order RK strong-stability-preserving (SSP) method [1]. Even without high-order time interpolation for boundary conditions, this should improve stability with PPM.

[1] https://epubs.siam.org/doi/10.1137/S003614450036757X

HDF5 outputs only include level 0

Unclear if this is a Quokka bug or an AMReX bug. More investigation needed.

RT3D performance anomaly

The RayleighTaylor3D problem is significantly slower than the HydroBlast3D problem:

On HydroBlast3D on A100, I get:
Performance figure-of-merit: 0.004781215441 μs/zone-update [209.1518385 Mupdates/s]

RayleighTaylor3D is significantly slower, with
Performance figure-of-merit: 0.006103228624 μs/zone-update [163.8477045 Mupdates/s]

regrid level 0 on restart

We need to regrid level 0 when restarting, since users may restart with more MPI ranks / processors than in the previous run.

adaptively fall-back to PCInterp when AMR interpolation produces negative pressure states

In problems with regions near vacuum, using CellConsLinInterp can lead to states with P < 0. In this case, we should switch to piecewise-constant interpolation onto the fine grid using PCInterp, but only in those cells where CellConsLinInterp fails to produce physical states.

save axis-aligned projections as plotfiles

See AMReX-Codes/amrex#3187.