
DisCoTec: Distributed Combination Technique Framework


What is DisCoTec?

This project contains DisCoTec, a code for running the distributed sparse grid combination technique with MPI parallelization. While it originates from the excellent SGpp project, the extensive parallelization has made it a very different code, so it has become a project of its own.

DisCoTec is designed as a framework that can run multiple instances of a (black-box) grid-based solver implementation. The most basic example we use is a mass-conserving FDM/FVM constant advection upwinding solver. An example of a separate, coupled solver is SeLaLib.

Sparse Grid Combination Technique with Time Stepping

The sparse grid combination technique (Griebel et al. 1992, Garcke 2013, Harding 2016) can be used to alleviate the curse of dimensionality encountered in high-dimensional simulations. Instead of using your solver on a single structured full grid (where every dimension is finely resolved), you use it on many different structured full grids (each of them resolved differently). We call these coarsely-resolved grids component grids. Taken together, all component grids form a sparse grid approximation, which can be obtained explicitly as a linear superposition of the individual grid functions, weighted by the so-called combination coefficients.
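
Written as a formula (a standard statement of the combination technique, see e.g. Garcke 2013; the notation here is schematic), the combined approximation is the weighted sum of the component grid solutions $f_{\vec{l}}$ over the chosen set of levels $\mathcal{I}$:

$f^{(c)} = \sum_{\vec{l} \in \mathcal{I}} c_{\vec{l}} \, f_{\vec{l}}$,

where the $c_{\vec{l}}$ are the combination coefficients.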

schematic of a combination scheme in 2D

In this two-dimensional combination scheme, the combination coefficients are either 1 or -1. Figure originally published in (Pollinger 2024).
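
For the classical two-dimensional scheme shown here (in one common index convention), this specializes to

$f^{(c)}_n = \sum_{l_1 + l_2 = n} f_{(l_1,l_2)} - \sum_{l_1 + l_2 = n-1} f_{(l_1,l_2)}$,

i.e., coefficient +1 for the component grids on the finest included diagonal and -1 for those on the next-coarser one (Griebel et al. 1992).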

Between time steps, the grids exchange data through a multi-scale approach, which is summarized as the "combination" step in DisCoTec. Assuming a certain smoothness in the solution, this allows for a good approximation of the finely-resolved function, while achieving drastic reductions in compute and memory requirements.

Parallelism in DisCoTec

The DisCoTec framework can work with existing MPI parallelized solver codes operating on structured grids. In addition to the parallelism provided by the solver, it adds the combination technique's parallelism. This is achieved through process groups (pgs): MPI_COMM_WORLD is subdivided into equal-sized process groups (and optionally, a manager rank).

schematic of MPI ranks in DisCoTec

The image describes the two ways of scaling up: One can either increase the size or the number of process groups. Figure originally published in (Pollinger 2024).

All component grids are distributed to process groups (either statically, or dynamically through the manager rank). During the solver time step and most of the combination, MPI communication only happens within the process groups. Conversely, for the sparse grid reduction using the combination coefficients, MPI communication only happens between a rank and its colleagues in the other process groups, e.g., rank 0 in group 0 will only talk to rank 0 in all other groups. Thus, major bottlenecks arising from global communication can be avoided altogether.
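
As a rough illustration of this communicator layout (a simplified sketch, not DisCoTec's actual setup code; the function name is invented), the two kinds of communicators can be obtained with two MPI_Comm_split calls:

#include <mpi.h>

// Split MPI_COMM_WORLD into equal-sized process groups and cross-group communicators.
// localComm: communication within one process group (solver time steps, hierarchization).
// globalComm: connects rank i of every group, used for the sparse grid reduction.
void splitIntoProcessGroups(int nprocsPerGroup, MPI_Comm* localComm, MPI_Comm* globalComm) {
  int worldRank;
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  int groupIndex = worldRank / nprocsPerGroup;  // which process group this rank belongs to
  int localRank = worldRank % nprocsPerGroup;   // position of this rank within its group
  MPI_Comm_split(MPI_COMM_WORLD, groupIndex, localRank, localComm);
  MPI_Comm_split(MPI_COMM_WORLD, localRank, groupIndex, globalComm);
}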

Combining the two ways of scaling up, DisCoTec's scalability was demonstrated on several machines, with the experiments comprising up to 524288 cores:

Timings for the advection solver step and for the combination step on HAWK at various parallelizations (two plots).

The plots show the timings (in seconds) for the advection solver step and the combination step, respectively. This weak scaling experiment used four OpenMP threads per rank and started with one pg of four processes in the upper left corner. The largest parallelization is 64 pgs of 2048 processes each. Figure originally published in (Pollinger 2024).

This image also describes the challenges in large-scale experiments with DisCoTec: If the process groups become too large, the MPI communication of the multiscale transform starts to dominate the combination time. If there are too many pgs, the combination reduction will dominate the combination time. However, the times required for the solver stay relatively constant; they are determined by the solver's own scaling and the load balancing quality.

There are only a few codes that allow weak scaling up to this problem size: a size that uses most of the available main memory of the entire system.

Contributing

We welcome contributions! To find a good place to start coding, have a look at the currently open issues.

  • Please describe issues and intended changes in the issue tracker.
  • Please develop new features in a new branch (typically on your own fork) and then create a pull request.
  • New features will only be merged to the main branch if they are sufficiently tested: please add unit tests in /tests (a minimal sketch follows below).
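
A minimal, self-contained Boost.Test sketch of such a unit test is shown below. It is illustrative only, not an actual DisCoTec test: the suite and case names are placeholders, and the header-only Boost.Test variant is assumed here, whereas the existing test binary defines its test module centrally.

#define BOOST_TEST_MODULE example_feature
#include <boost/test/included/unit_test.hpp>

BOOST_AUTO_TEST_SUITE(examplefeature)

BOOST_AUTO_TEST_CASE(addsCorrectly) {
  const int result = 2 + 2;        // replace with a call into the new feature
  BOOST_CHECK_EQUAL(result, 4);    // assert the expected behavior
}

BOOST_AUTO_TEST_SUITE_END()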

Installing

Installation instructions: spack

DisCoTec can be installed via spack, which handles all dependencies. If you want to develop DisCoTec code or examples, the spack dev-build workflow is recommended as follows.

Clone both spack and DisCoTec to find or build the dependencies and then compile DisCoTec:

git clone git@github.com:spack/spack.git  # use https if your ssh is not set up on github
./spack/bin/spack external find  # find already-installed packages
./spack/bin/spack compiler find  # find compilers present on system
./spack/bin/spack info discotec@main  # shows DisCoTec's variants
./spack/bin/spack spec discotec@main  # shows DisCoTec's dependency tree and which parts are already found

git clone git@github.com:SGpp/DisCoTec.git
cd DisCoTec
../spack/bin/spack dev-build -b install discotec@main

This will first build all dependencies, and then build DisCoTec inside the cloned folder. The executables are placed in the respective example and test folders.

Installation instructions: CMake

Dependencies

  • cmake (>= 3.24.2)
  • libmpich-dev (>= 3.2-6), or another MPI library
  • libboost-all-dev (>= 1.60)

Additional (optional) dependencies:

  • OpenMP
  • HDF5
  • HighFive

Complete build

git clone https://github.com/SGpp/DisCoTec.git DisCoTec
cmake -S DisCoTec -B DisCoTec/build
cmake --build DisCoTec/build -j

Advanced build

All examples and the tools can be built on their own. To build a specific example, run cmake in the respective example folder. For example, run the following commands

git clone https://github.com/SGpp/DisCoTec.git DisCoTec
cd DisCoTec/examples/combi_example
cmake -S . -B build
cmake --build build -j

to build the combi_example.

For building only the DisCoTec library, run cmake with the src folder as source folder.

Optional CMake Options

  • DISCOTEC_TEST=**ON**|OFF - Build tests when building the complete project.
  • DISCOTEC_BUILD_MISSING_DEPS=**ON**|OFF - Automatically build first-order dependencies that are not found (glpk is always built).
  • DISCOTEC_TIMING=**ON**|OFF - Enables internal timing.
  • DISCOTEC_USE_HDF5=**ON**|OFF - Enables HDF5 support.
  • DISCOTEC_USE_HIGHFIVE=**ON**|OFF - Enables HDF5 support via HighFive. If DISCOTEC_USE_HIGHFIVE=ON, DISCOTEC_USE_HDF5 must also be ON.
  • DISCOTEC_UNIFORMDECOMPOSITION=**ON**|OFF - Enables the uniform decomposition of the grid.
  • DISCOTEC_GENE=ON|**OFF** - Currently, GENE is not supported with CMake!
  • DISCOTEC_OPENMP=ON|**OFF** - Enables OpenMP support.
  • DISCOTEC_ENABLEFT=ON|**OFF** - Enables the use of the FT library.
  • DISCOTEC_USE_LTO=**ON**|OFF - Enables link-time optimization if the compiler supports it.
  • DISCOTEC_OMITREADYSIGNAL=ON|**OFF** - Omits the ready signal in the MPI communication; this can be used to reduce communication overhead.
  • DISCOTEC_USENONBLOCKINGMPICOLLECTIVE=ON|**OFF** - Flag currently unused.
  • DISCOTEC_WITH_SELALIB=ON|**OFF** - Looks for SeLaLib dependencies and compiles the matching example.

To run the compiled tests, go to the tests folder and run

mpiexec -np 9 ./test_distributedcombigrid_boost

where you can use all the parameters of the Boost test suite. If timings matter, consider the pinning described in the respective section below. Alternatively, you can run the tests with ctest in the build folder.

Executing DisCoTec Binaries

DisCoTec executables are configured through ctparam files, which are parsed on startup. The ctparam file will contain the combination technique parameters (dimension, minimum and maximum level) as well as parallelization parameters (number and size of process groups, parallelism within process groups) in .ini file format.

Any DisCoTec executable must be run through MPI (either mpirun or mpiexec), and if no argument to the binary is specified, it will use the file called ctparam in the current working directory. Make sure that it exists and describes the parameters you want to run.

The exact format and naming in ctparam is not (yet) standardized, to allow adaptation for different solver applications. Please refer to existing parameter files.
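
For orientation, a minimal ctparam sketch might look as follows (the field names are modeled after the parameter files quoted in the issues below; the exact keys and values depend on the example and solver):

[ct]
dim = 2
lmin = 3 3
lmax = 10 10
leval = 5 5
p = 2 2
ncombi = 10

[application]
dt = 0.001
nsteps = 100

[manager]
ngroup = 4
nprocs = 4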

Rank and Thread Pinning

The number and size of process groups (in MPI ranks) can be read from the ctparam files:

NGROUP=$(grep ngroup ctparam | awk -F"=" '{print $2}')
NPROCS=$(grep nprocs ctparam | awk -F"=" '{print $2}')

Correct pinning is of utmost importance to performance, especially if DisCoTec was compiled with OpenMP support. The desired pinning is the simplest one imaginable: each node and, within a node, each socket is filled consecutively with MPI ranks. If OpenMP is used, the MPI pinning is strided by the number of threads, and the threads should be placed close to their MPI rank (for example, with four threads per rank, rank 0 occupies the first four cores of the first socket, rank 1 the next four, and so on). For all compilers, this means that if the desired number of OpenMP threads per rank is N, one needs

export OMP_NUM_THREADS=$N;export OMP_PLACES=cores;export OMP_PROC_BIND=close

If OpenMP is used for compilation but should not be used for execution, N should be set to 1 to avoid unintended effects.

Intel-MPI

Intel-MPI requires some environment variables, in particular for OpenMP:

mpiexec -n $(($NGROUP*$NPROCS)) -genv I_MPI_PIN_PROCESSOR_LIST="allcores" -genv I_MPI_PIN_DOMAIN=omp $DISCOTEC_EXECUTABLE

In SLURM, it is important to set --ntasks-per-node to match the number of desired tasks ($CORES_PER_NODE/$OMP_NUM_THREADS). Validate with verbose output: export I_MPI_DEBUG=4

OpenMPI

OpenMPI uses command-line arguments; pinning may clash with SLURM settings depending on the exact call.

mpiexec.openmpi --rank-by core --map-by node:PE=${OMP_NUM_THREADS} -n $(($NGROUP*$NPROCS)) $DISCOTEC_EXECUTABLE

Validate with verbose output: --report-bindings

MPT

With MPT, one uses the omplace wrapper tool to set the correct pinning.

mpirun -n $(($NGROUP*$NPROCS)) omplace -v -nt ${OMP_NUM_THREADS} $DISCOTEC_EXECUTABLE

Validate with very verbose output: -vv.

GENE submodules as dependencies for GENE examples

Warning: The CMake Integration is currently not adapted to use GENE!

There are GENE versions included as submodules: a linear one in the gene_mgr folder, and a nonlinear one in gene-non-linear. To get them, you need access to their respective repositories at MPCDF. Then go into the folder and run

git submodule init
git submodule update

or use the --recursive flag when cloning the repo.



DisCoTec's Issues

Error: Assertion failed in file src/mpi/coll/helper_fns.c at line 84: FALSE

The build on master currently fails with

integration/test_1 2 4
integration/test_1 integration/test_1 2 4
2 4
integration/test_1 integration/test_1 2 4
2 4
integration/test_1 2 4
integration/test_1 integration/test_1 2 4
2 4
integration/test_1 2 4
new manager rank: 2 
new manager rank: new manager rank: 2 
2 
sending class 
manager received status 
sending class 
manager received status 
Assertion failed in file src/mpi/coll/helper_fns.c at line 84: FALSE
memcpy argument memory ranges overlap, dst_=0x7fff9a595c70 src_=0x7fff9a595c68 len_=16
internal ABORT - process 0
scons: *** [distributedcombigrid/tests/test_distributedcombigrid_boost_run] Error 1

But it was still working on 8946f8b. These are the changes that happened since then:

8946f8b...master

It works fine on my system, though!

Replace std::fill( ... 0 ...) with memset

For the changes in #73, we saw huge run-time differences between std::fill and memset for zero values. There are probably many more places where the code could benefit from memset (see the sketch below).
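
For illustration only (not taken from the DisCoTec sources; the function and variable names are placeholders), the two variants discussed here look like this. For IEEE-754 floating-point buffers, byte-wise zeroing with memset is equivalent to filling with 0.0:

#include <algorithm>
#include <cstring>
#include <vector>

void zeroWithFill(std::vector<double>& data) {
  std::fill(data.begin(), data.end(), 0.0);  // element-wise assignment
}

void zeroWithMemset(std::vector<double>& data) {
  // byte-wise zeroing; yields 0.0 for IEEE-754 doubles and was observed to be much faster in #73
  std::memset(data.data(), 0, data.size() * sizeof(double));
}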

Openmpi error in MPI_Recv: MPI_ERR_TRUNCATE

I set up DisCoTec on a relatively new machine, but some of the tests (and also examples) are giving me this error here:

$ mpiexec.openmpi -n 9 ./test_distributedcombigrid_boost --run_test=ftolerance,loadmodel,rescheduling
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
Running 10 test cases...
[ipvs-epyc1:3260534] *** An error occurred in MPI_Recv
[ipvs-epyc1:3260534] *** reported by process [2024538113,8]
[ipvs-epyc1:3260534] *** on communicator MPI COMMUNICATOR 10 CREATE FROM 8
[ipvs-epyc1:3260534] *** MPI_ERR_TRUNCATE: message truncated
[ipvs-epyc1:3260534] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ipvs-epyc1:3260534] ***    and potentially your MPI job)

When I run individual tests, they sometimes fail and sometimes pass.
I have made absolutely sure that DisCoTec and Boost link to the system's OpenMPI version 4.0.3 (I even installed an extra Boost via spack).
@obersteiner have you seen this error before?

Commit breaks test for ftolerance

... although it really should not!

After this commit here
fca872c
the ftolerance tests somehow suddenly end before the RUN_FIRST signal is finished.

However, the commit only adds a runtime error in DFG, a function in test_helper, and calls to that function in some tests, but not in the ftolerance test.

To really show this, I have added an error to the test

BOOST_TEST_ERROR("worker never cleanly exits! " + std::to_string(signal));

which is never thrown!

Optimization: `extraSparseGrid` as `MPI_Datatype` instead of `DistributedSparseGridUniform`

To reduce memory usage for IO ranks, we could avoid allocating the extraSparseGrid as a sparse grid and replace it with an MPI_Datatype, e.g., via MPI_Type_create_hindexed_block (cf. DistributedHierarchization.hpp). A sketch follows after the list of consequences below.

Consequences:

  • saves memory for the extra sparse grid (at least for larger scenarios)
  • IO on the sparse grid would be strided in main memory, but not in the file system accesses
  • could also be used for the SG-broadcast, if all ranks have the MPI_Datatype defined (requires broadcast of extra SG sizes)
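
A minimal sketch of the proposed replacement (not existing DisCoTec code; the displacement input and the use of MPI_DOUBLE as base type are assumptions): describe the scattered subspace blocks of the extra sparse grid as a derived datatype instead of allocating them separately.

#include <mpi.h>
#include <vector>

// Build a datatype that addresses equally-sized blocks at the given byte displacements.
MPI_Datatype makeExtraSparseGridType(const std::vector<MPI_Aint>& byteDisplacements,
                                     int blockLengthInElements) {
  MPI_Datatype extraSGType;
  MPI_Type_create_hindexed_block(static_cast<int>(byteDisplacements.size()), blockLengthInElements,
                                 byteDisplacements.data(), MPI_DOUBLE, &extraSGType);
  MPI_Type_commit(&extraSGType);
  return extraSGType;  // can then be used, e.g., for the SG-broadcast or file IO without an extra copy
}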

submodule `gene-dev-exahd`

Hm, I just noticed that the submodule for gene-dev-exahd is still set wrong. It needs to have

ssh://git@localhost:7990/ext-ec8a21db5900/gene-dev.git

set as origin, on branch exahd, at the current commit (or one merged with master).

Test assertion breaking w/ MPICH

Since CI with MPICH was introduced, there have been a bunch of failing tests due to this check failing:

DisCoTec-gcc-mpich-Debug/tests/test_distributedsparsegrid.cpp(907): error: in "distributedsparsegrid/test_sparseGridAndSubspaceReduce": check subspaceStart[j] == i * nprocs has failed [108071 != 108135]

The check in question is:

BOOST_CHECK_EQUAL(subspaceStart[j], i * nprocs);

This never happens for OpenMPI, but often for MPICH, in both debug and release mode.

???

Sparsify data transfer for mass-conserving basis functions

Currently, the basis transforms for the biorthogonal and full weighting functions exchange all of the data along one pole at once:

static void exchangeAllData1d(const DistributedFullGrid<FG_ELEMENT>& dfg, DimType dim,

This is not optimal, in particular if hierarchization is performed only to some minimal level > 0.

This thesis describes some conditions:

  • the hat dehierarchization needs the data defined by the "Kegel-Bedingung" (cone condition) / KegelFill algorithm. This optimization is already implemented for the hat functions. (For hierarchization, the hat function only needs its direct parents.)
  • the biorthogonal and full weighting bases need variants of the LiftingFill algorithm for both transforms. In particular, the data dependency is symmetric for our purposes: if a finer-level point needs the data from a coarser-level point, then the reverse is also true, as the scaling function coefficients need to be updated.

Thus, it is worth considering hierarchizing in multiple "rounds" and communicating the updated data for each level of hierarchization only.

Optimization: use (more of) subspace reduce

To save memory, we could move from sparse grid reduce to subspace reduce, cf.
https://ebooks.iospress.nl/doi/10.3233/978-1-61499-381-0-564

This was implemented in the past for

  • small setups
  • without parallelization within the grids / without process groups

To do this with process groups, one could "vote" within each process group on which subspaces are present. A set of process groups that shares the same subspaces (while all others do not) could have its own reduce-communicator; the reduction would then iterate over all reduce-communicators (a sketch follows at the end of this issue).

A challenge we have already identified:

The (potential) number of communicators can become excessively large: O(process group size * 2^(number of subspaces)). The process group size can be up to 2^13 - 2^14 for current scenarios, the number of subspaces can be in the hundreds of thousands, and the 2^ comes from the power set.

-> We are not sure whether created communicators use memory only on the ranks they contain, or whether the info is collected globally somewhere. How do implementations handle this?

-> This could maybe be remedied by a good scenario splitting, where partitions = process groups in https://github.com/SGpp/DisCoTec-combischeme-utilities. Then, there should be many groups that share the exact same set of subspaces.

-> There could be a trade-off between sparse grid reduce and subspace reduce (if only some subspaces are allocated in addition to the ones that are strictly required).
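
As a conceptual sketch of the "voting" idea (not DisCoTec code; the bitmask encoding and hash-based split color are assumptions, and a real implementation would have to handle hash collisions and bound the number of communicators): ranks holding the same set of subspaces join a common reduce-communicator.

#include <mpi.h>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

MPI_Comm makeSubspaceReduceComm(const std::vector<bool>& subspacePresent, MPI_Comm comm) {
  // encode which subspaces this rank holds as a bit string and hash it to a split color
  std::string mask(subspacePresent.size(), '0');
  for (std::size_t i = 0; i < subspacePresent.size(); ++i)
    if (subspacePresent[i]) mask[i] = '1';
  int color = static_cast<int>(std::hash<std::string>{}(mask) & 0x7fffffff);
  int rank;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm reduceComm;
  MPI_Comm_split(comm, color, rank, &reduceComm);  // one communicator per distinct subspace set
  return reduceComm;
}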

Broadcast combischeme instead of reading it from `.json` in every rank

It appears there are setups where it is not feasible for every rank to read ~4 MB of data from .json files.

To this end, it would be good if this section

CombiMinMaxSchemeFromFile(DimType dim, LevelVector& lmin, LevelVector& lmax, std::string ctschemeFile)

could be reworked such that only one rank (maybe one per process group) reads the data and broadcasts the resulting vectors to the other ranks (see the sketch below).
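
One simple variant of the proposed rework could look like the following sketch (not the current DisCoTec implementation; broadcasting the already-parsed vectors would work analogously): only rank 0 touches the file system, all other ranks receive the raw bytes and parse them locally.

#include <mpi.h>
#include <fstream>
#include <sstream>
#include <string>

std::string readAndBroadcastScheme(const std::string& ctschemeFile, MPI_Comm comm) {
  int rank;
  MPI_Comm_rank(comm, &rank);
  std::string contents;
  if (rank == 0) {
    std::ifstream in(ctschemeFile);
    std::stringstream buffer;
    buffer << in.rdbuf();  // read the whole .json file once
    contents = buffer.str();
  }
  unsigned long long size = contents.size();
  MPI_Bcast(&size, 1, MPI_UNSIGNED_LONG_LONG, 0, comm);                    // broadcast length first
  contents.resize(size);
  MPI_Bcast(contents.data(), static_cast<int>(size), MPI_CHAR, 0, comm);  // then the bytes
  return contents;  // every rank can now parse the scheme without file system access
}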

Memory balancing necessary?

I am again running a weak scaling experiment on Hawk (2 GB/core, 128 cores/node).
This time, I am trying to get sensible scaling data for GENE+DisCoTec, but the required resolutions and memory footprints make it a lot harder!

While running the scheme on one process group of 4096 workers finishes fine,

[ct]
#last element has to be 1 -> specify species with special field
#dimension of problem
dim = 6
#minimum and maximum level of combination technique
lmin = 5 5 4 4 3 1
lmax = 10 5 9 9 8 1

#levelvector at which 2 final outputs are evaluated (with potential interpolation)
leval = 7 5 5 6 5 1
leval2 = 7 4 4 4 4 1
#indicates number of processors per dimension in domain decomposition
#this is the same for each process group
p = 8 1 8 8 8 1
#number of combination steps
ncombi = 12
#indicates whether combischeme is read from file
readspaces = 1
#indicates the file name of the 2 plot files
fg_file_path = ../plot.dat
fg_file_path2 = ../plot2.dat
#indicates which dimensions have boundary points
boundary = 1 1 1 1 1 0
#indicates which dimensions will be hierarchized
hierarchization_dims = 1 0 1 1 1 0

#possibility to reduce the level of the sparse grid for the combination step
reduceCombinationDimsLmin = 0 0 0 0 0 0
reduceCombinationDimsLmax = 1 0 1 1 1 0

[application]
#timestep size
dt = 0.6960E-02
#timesteps 100000
#number of timesteps between combinations
nsteps = 100000000
#allowed maximal simulation time (physical time) between combination steps
#if it would be exceeded finish also with less steps as defined above
combitime = 0.1
#physical parameters
#shat = 0.7960
kymin = 0.1525E-01
#box size
lx = 125.00
#numbers of species
numspecies = 1
#T for local runs F for global runs
GENE_local = F
#T for nonlinear F for linear runs
GENE_nonlinear = T
#The number of combinations after which we write out checkpoint to disk
checkpointFrequency = 50

[preproc]
#name of gene instance folders
basename = ginstance
#executable name of gene manager
executable = ./gene_hawk
#used mpi version
mpi = mpiexec
startscript = start.bat

[manager]
#number of mpi ranks in each group
ngroup = 1
#number of process groups
nprocs = 4096

it seems that we're running out of memory when running the same problem on eight process groups of 512 each:

#last element has to be 1 -> specify species with special field
#dimension of problem
dim = 6
#minimum and maximum level of combination technique
lmin = 5 5 4 4 3 1
lmax = 10 5 9 9 8 1

#levelvector at which 2 final outputs are evaluated (with potential interpolation)
leval = 7 5 5 6 5 1
leval2 = 7 4 4 4 4 1
#indicates number of processors per dimension in domain decomposition
#this is the same for each process group
p = 4 1 4 8 4 1
#number of combination steps
ncombi = 12
#indicates whether combischeme is read from file
readspaces = 1
#indicates the file name of the 2 plot files
fg_file_path = ../plot.dat
fg_file_path2 = ../plot2.dat
#indicates which dimensions have boundary points
boundary = 1 1 1 1 1 0
#indicates which dimensions will be hierarchized
hierarchization_dims = 1 0 1 1 1 0

#possibility to reduce the level of the sparse grid for the combination step
reduceCombinationDimsLmin = 0 0 0 0 0 0
reduceCombinationDimsLmax = 1 0 1 1 1 0

[application]
#timestep size
dt = 0.6960E-02
#timesteps 100000
#number of timesteps between combinations
nsteps = 100000000
#allowed maximal simulation time (physical time) between combination steps
#if it would be exceeded finish also with less steps as defined above
combitime = 0.1
#physical parameters
#shat = 0.7960
kymin = 0.1525E-01
#box size
lx = 125.00
#numbers of species
numspecies = 1
#T for local runs F for global runs
GENE_local = F
#T for nonlinear F for linear runs
GENE_nonlinear = T
#The number of combinations after which we write out checkpoint to disk
checkpointFrequency = 50

[preproc]
#name of gene instance folders
basename = ginstance
#executable name of gene manager
executable = ./gene_hawk
#used mpi version
mpi = mpiexec
startscript = start.bat

[manager]
#number of mpi ranks in each group
ngroup = 8
#number of process groups
nprocs = 512

It might be that one process group is assigned more memory-intensive GENE tasks than the others. This would mean that memory does not correlate strongly enough with the run time of the first step for our current round-robin assignment approach to work. (Remember: we have an estimate based on run time; I am currently using the grid-point-based LinearLoadModel, which we use to assign one grid to each process group. The next grids are then assigned to process groups as they finish. For instance, if there is a component grid that takes VERY long to compute the first time step, no other grid would be assigned to this group.)

I am currently working to verify that this is the exact problem.

The other possible source of error (that I can imagine) would be that the overhead of the 3 additionally allocated sparse grids is the culprit. Given the sparse grid size of 481371297 grid points * 16 bytes (double complex) * 3 ≈ 21.5 gibibytes, this seems unlikely (this overhead is spread over all 4096 workers!).

This would make some kind of "memory balancing" necessary in the case of memory scarcity (in analogy to load balancing in the task assignment).
