aluminum's People

Contributors

andy-yoo, benson31, bvanessen, naoyam, ndryden, nobles5e

aluminum's Issues

Add note and CMake requirement for minimum CUDA version

For reasons, I tried to compile with CUDA 7.0 but couldn't because:

[  8%] Building CUDA object src/CMakeFiles/Al.dir/mpi_cuda/cuda_kernels.cu.o
In file included from /opt/Aluminum-0.2/src/mempool.hpp:35:0,
                 from /opt/Aluminum-0.2/src/mpi_impl.hpp:36,
                 from /opt/Aluminum-0.2/src/Al.hpp:826,
                 from /opt/Aluminum-0.2/src/cuda.cpp:30:
/opt/Aluminum-0.2/src/cuda.cpp: In function 'void Al::internal::cuda::init(int&, char**&)':
/opt/Aluminum-0.2/src/cuda.cpp:61:30: error: 'CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS' was not declared in this scope
                       &attr, CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS, dev));
                              ^
/opt/Aluminum-0.2/src/cuda.hpp:89:39: note: in definition of macro 'AL_FORCE_CHECK_CUDA_DRV_NOSYNC'
     CUresult status_CHECK_CUDA_DRV = (cuda_call);               \
                                       ^
/opt/Aluminum-0.2/src/cuda.cpp:60:3: note: in expansion of macro 'AL_CHECK_CUDA_DRV'
   AL_CHECK_CUDA_DRV(cuDeviceGetAttribute(
   ^
[ 10%] Building CUDA object src/CMakeFiles/Al.dir/helper_kernels.cu.o
make[2]: *** [src/CMakeFiles/Al.dir/cuda.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/opt/Aluminum-0.2/src/helper_kernels.cu(48): error: identifier "CU_STREAM_WAIT_VALUE_EQ" is undefined

/opt/Aluminum-0.2/src/helper_kernels.cu(48): error: identifier "cuStreamWaitValue32" is undefined

2 errors detected in the compilation of "/tmp/tmpxft_0000260c_00000000-7_helper_kernels.cpp1.ii".
make[2]: *** [src/CMakeFiles/Al.dir/helper_kernels.cu.o] Error 2
make[1]: *** [src/CMakeFiles/Al.dir/all] Error 2
make: *** [all] Error 2

Looking at the docs, CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS was added in CUDA 9.0, and the other missing defines seem to be mostly from CUDA 8.0. It seems that at least CUDA 9.0 is therefore required. The README.md does not mention any minimum version (even though it mentions that at least MPI 3.0 is required), and the CMakeLists.txt does not check against a minimum version. IMHO, both should be added, maybe after some testing to confirm it really works with CUDA 9.0. I might test this with a Singularity container.
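
Independent of a CMake check (e.g. against CMAKE_CUDA_COMPILER_VERSION), a source-level guard would at least turn the kernel errors above into a clear diagnostic. A minimal sketch:

#include <cuda.h>

// CUDA_VERSION is defined by cuda.h as major*1000 + minor*10, so 9000
// corresponds to CUDA 9.0, which is when
// CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS appeared.
#if !defined(CUDA_VERSION) || CUDA_VERSION < 9000
#error "Aluminum's CUDA backends require at least CUDA 9.0"
#endif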

Support HIP stream memory operations

As of ROCm 4.2, HIP supports hipStreamWaitValue32/64 and hipStreamWriteValue32/64 (analogous to the corresponding cuStreamWaitValue/WriteValue methods we use). We should support these as well and make them the default implementation on AMD, rather than the manual spin-wait kernel-style approach we currently use.

Hipify should support these, so we may not need to do anything other than change some #defines.
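
A rough sketch of what the wait path might look like on ROCm >= 4.2, assuming the hipStreamWaitValueEq flag and the trailing mask argument behave as documented; the function and variable names here are illustrative, not Aluminum's:

#include <hip/hip_runtime.h>

#include <cstdint>
#include <stdexcept>

// Block `stream` until the 32-bit word at `sync_word` (device-visible
// memory) equals `expected`, without launching a spin-wait kernel.
// hipStreamWaitValue32 mirrors cuStreamWaitValue32; the mask selects which
// bits participate in the comparison.
void wait_for_value(hipStream_t stream, uint32_t* sync_word, uint32_t expected) {
  hipError_t err = hipStreamWaitValue32(stream, sync_word, expected,
                                        hipStreamWaitValueEq, 0xFFFFFFFFu);
  if (err != hipSuccess) {
    throw std::runtime_error(hipGetErrorString(err));
  }
}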

Early exit for trivial NCCL collectives

Right now, some NCCL operations will exit early when their count parameter is 0. However, non-blocking collectives still set up CUDA event synchronization between streams, which will add some overhead.

For some collectives, we may also be able to exit early if the size of the communicator is 1.
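
A guard along these lines (purely illustrative; the real entry points would still return their usual request object) is probably all the non-blocking paths need before setting up any events:

#include <cstddef>

// Returns true when a collective can return immediately without touching
// the CUDA event machinery.  count == 0 is always trivial; a single-rank
// communicator is only trivial for in-place operations, since out-of-place
// calls still need a local copy.
inline bool collective_is_trivial(size_t count, int comm_size, bool inplace) {
  return count == 0 || (comm_size == 1 && inplace);
}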

Better memory pools

Our memory pools are currently inefficient, ugly, and hard to extend if we need different kinds of memory.

Implement a generic memory pool interface that operates something like:

  • T* pool_type::GetMemory<MemoryKindTag>(size_t n)
  • void pool_type::ReleaseMemory<MemoryKindTag>(T* mem)

(This is just a starting point. We may want to put the memory kind on the pool instead.)

We should probably not rely on CUB/etc. in order to minimize cross-platform issues.

The pool should be thread-safe.
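
A rough sketch of what this could look like, with the memory kind on the pool (one of the options mentioned above); only host allocation is shown, and all names beyond GetMemory/ReleaseMemory are assumptions:

#include <cstdlib>
#include <map>
#include <mutex>
#include <vector>

// Tags naming the kind of memory a pool hands out; pinned and device
// memory would get their own tags and allocators.
struct HostMemoryTag {};

// Minimal thread-safe sketch: a free list per requested byte size, guarded
// by a single mutex.  Alignment, size-class rounding, and non-host
// allocators are left out.
template <typename MemoryKindTag>
class MemoryPool {
public:
  template <typename T>
  T* GetMemory(size_t n) {
    const size_t bytes = n * sizeof(T);
    std::lock_guard<std::mutex> lock(mutex_);
    auto& bin = free_lists_[bytes];
    if (!bin.empty()) {
      void* p = bin.back();
      bin.pop_back();
      return static_cast<T*>(p);
    }
    void* p = std::malloc(bytes);
    allocated_bytes_[p] = bytes;
    return static_cast<T*>(p);
  }

  template <typename T>
  void ReleaseMemory(T* mem) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_lists_[allocated_bytes_.at(mem)].push_back(mem);
  }

private:
  std::mutex mutex_;
  std::map<size_t, std::vector<void*>> free_lists_;
  std::map<void*, size_t> allocated_bytes_;
};

Usage would then look like MemoryPool<HostMemoryTag> pool; float* buf = pool.GetMemory<float>(1024); ...; pool.ReleaseMemory(buf);.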

Testing half

Our current testing infrastructure does not actually check results when using half, since MPI does not support it.

Host-transfer allreduce on one processor fails

$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend ht --op allreduce --size 1                                         
[lassen3:09550] *** An error occurred in MPI_Irecv
[lassen3:09550] *** reported by process [4456568,0]
[lassen3:09550] *** on communicator MPI COMMUNICATOR 5 DUP FROM 0
[lassen3:09550] *** MPI_ERR_RANK: invalid rank
[lassen3:09550] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lassen3:09550] ***    and potentially your MPI job)
$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend ht --op allreduce --inplace --size 1
[lassen3:08077] *** An error occurred in MPI_Irecv
[lassen3:08077] *** reported by process [3087233017,0]
[lassen3:08077] *** on communicator MPI COMMUNICATOR 5 DUP FROM 0
[lassen3:08077] *** MPI_ERR_RANK: invalid rank
[lassen3:08077] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lassen3:08077] ***    and potentially your MPI job)

This appears to be a problem directly with the host-transfer backend, as the corresponding MPI backend operations work fine.

NCCL inplace reduce-scatterv trashes rank 0 buffer

We currently implement the reduce-scatterv as a reduce to rank 0 followed by a scatterv. When doing an in-place op, the reduce is in-place on the input sendbuf. This therefore writes to portions of sendbuf on rank 0 that are outside of the region where the final scattered value would be placed.

I can't find anything explicitly prohibiting this in the MPI standard, but:

  • It's a bit aesthetically displeasing.
  • In other cases, like an MPI_Recv with a buffer/count larger than the actual message length, MPI does guarantee that no more memory will be touched than is actually needed by the message.
  • Avoiding it shouldn't take too much overhead if we use a memory pool.
  • A better, direct implementation can probably avoid it.

Thread safety

Aluminum currently does not provide a documented thread-safety guarantee. In practice, different backends implicitly provide different safety guarantees. I believe the main bottleneck is with the progress engine, which currently uses a single-producer/single-consumer queue.

We should:

  • Provide an explicit, documented standard for thread safety. I propose that, for performance reasons, we allow this to be chosen at compile time to be either (a rough sketch of this compile-time switch follows the lists below):
    • Multiple threads can simultaneously submit communication operations. If the operations are on the same communicator or compute stream, the ordering must be provided by the user.
    • Only a single thread submits communication at a time. (We can allow this to be different threads over time; it is up to the user to synchronize.)
  • Implement the above. I believe the main task is to improve the queues the progress engine uses and ensure memory pools are thread-safe.

Things that are not necessarily thread-safe:

  • Internal CUDA stuff. (See, e.g., NCCLRequest).
  • Memory pools (depending on setting).
  • Progress engine.
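
Regarding the compile-time choice in the first bullet above, a rough illustration (AL_THREAD_MULTIPLE is the existing setting referenced elsewhere; the mutex is only a stand-in for a proper multi-producer queue):

#include <mutex>

// With AL_THREAD_MULTIPLE defined, concurrent submitters are serialized
// here; without it, the lock disappears at compile time and the existing
// single-producer path is unchanged.
void submit_to_progress_engine(/* operation state */) {
#ifdef AL_THREAD_MULTIPLE
  static std::mutex submit_mutex;
  std::lock_guard<std::mutex> lock(submit_mutex);
#endif
  // ... push the operation onto the progress engine's queue ...
}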

Rooted in-place NCCL calls wrong

I have not verified it, but looking at the code, I believe we do not correctly set the buffers for rooted NCCL collectives (e.g., Gather) when the root is not 0.

We should add a test that checks this case, and fix it if needed.

Probably we just need to switch to passing internal::IN_PLACE<T>() in the right places.

Hang in progress engine binding

(This issue is already fixed in #181, but I'm writing up an issue to document it. I'm writing about Aluminum as it was before that PR.)

The progress engine does some MPI communication to decide how to bind the progress engine thread. This involves collectives being run among the processes on each physical node (i.e., there is no global collective, just concurrent collectives within each node). If progress engine startup is deferred (with AL_PE_START_ON_DEMAND), then this is not executed until the progress engine actually starts. However, if not every rank on a node performs an operation starting the progress engine (e.g., because they're doing a point-to-point operation), then the ranks may hang and the progress engine not fully start.

Update examples

Our example codes have gotten a bit stale. They need to be updated to reflect changes made in the main Aluminum test/benchmark utilities:

  • Support Flux environment variables.
  • Properly handle half and bfloat16.

Support half in MPI and Host-Transfer

Add support for half to the MPI and Host-Transfer backends.

If we're testing it (see #104), we have to support it with MPI in some way anyway, so we might as well add full support.

Note that since these will primarily be performing computation on the CPU, and CPUs don't (at present) have good support for half, reduction performance may be poor.
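
One way to get MPI itself to reduce half data is a custom datatype plus a user-defined MPI_Op that widens each element to float. A hedged sketch (the conversion helpers are simplified, and the names are illustrative, not Aluminum's):

#include <mpi.h>

#include <cstdint>
#include <cstring>

// Half values travel as raw 2-byte words; the reduction converts each pair
// to float, adds, and converts back.  The conversions are simplified for
// brevity: subnormals flush to zero and float-to-half truncates rather than
// rounding, which is fine for a sketch but not for production.
static float half_to_float(uint16_t h) {
  uint32_t sign = (uint32_t)(h & 0x8000) << 16;
  uint32_t exp  = (h >> 10) & 0x1F;
  uint32_t mant = h & 0x3FF;
  uint32_t bits;
  if (exp == 0)       bits = sign;                                      // zero (subnormals flushed)
  else if (exp == 31) bits = sign | 0x7F800000u | (mant << 13);         // inf/NaN
  else                bits = sign | ((exp + 112) << 23) | (mant << 13); // 112 = 127 - 15
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

static uint16_t float_to_half(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  uint16_t sign = (uint16_t)((bits >> 16) & 0x8000);
  int32_t exp = (int32_t)((bits >> 23) & 0xFF) - 127 + 15;
  uint32_t mant = bits & 0x7FFFFFu;
  if (exp >= 31) return (uint16_t)(sign | 0x7C00);  // overflow, inf, NaN -> inf
  if (exp <= 0)  return sign;                       // underflow -> signed zero
  return (uint16_t)(sign | (exp << 10) | (mant >> 13));
}

// MPI_User_function: element-wise half sum over raw 2-byte words.
static void half_sum_op(void* invec, void* inoutvec, int* len, MPI_Datatype*) {
  const uint16_t* in = static_cast<const uint16_t*>(invec);
  uint16_t* inout = static_cast<uint16_t*>(inoutvec);
  for (int i = 0; i < *len; ++i) {
    inout[i] = float_to_half(half_to_float(in[i]) + half_to_float(inout[i]));
  }
}

void register_half_sum(MPI_Datatype* half_type, MPI_Op* half_sum) {
  MPI_Type_contiguous(2, MPI_BYTE, half_type);  // opaque 2-byte element
  MPI_Type_commit(half_type);
  MPI_Op_create(&half_sum_op, /*commute=*/1, half_sum);
  // e.g. MPI_Allreduce(MPI_IN_PLACE, buf, count, *half_type, *half_sum, comm);
}

The scalar convert-add-convert loop is also a good illustration of why CPU-side reductions on half are expected to be slow.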

MPSC queue debug checks

The SPSC queue has a sanity check in debug mode for when the queue is full. We should add a similar check to the MPSC queue.

Could not get NUMA node

When trying to run lbann --help on my notebook, compiled without CUDA but with Aluminum support, I get this error:

terminate called after throwing an instance of 'Al::al_exception'
  what():  /opt/lbann/src/Aluminum-0.2/src/progress.cpp:145 - Could not get NUMA node.
[ThinkPad-X240:24279] *** Process received signal ***
[ThinkPad-X240:24279] Signal: Aborted (6)
[ThinkPad-X240:24279] Signal code:  (-6)
[ThinkPad-X240:24279] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x126e0)[0x7f291dce96e0]
[ThinkPad-X240:24279] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f291d5b48bb]
[ThinkPad-X240:24279] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f291d59f535]
[ThinkPad-X240:24279] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c983)[0x7f291d969983]
[ThinkPad-X240:24279] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x928e6)[0x7f291d96f8e6]
[ThinkPad-X240:24279] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x92921)[0x7f291d96f921]
[ThinkPad-X240:24279] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b54)[0x7f291d96fb54]
[ThinkPad-X240:24279] [ 7] /opt/lbann/Aluminum/lib/libAl.so(+0xf14e)[0x7f291dc1114e]
[ThinkPad-X240:24279] [ 8] /opt/lbann/Aluminum/lib/libAl.so(_ZN2Al8internal14ProgressEngine6engineEv+0x1a)[0x7f291dc1604a]
[ThinkPad-X240:24279] [ 9] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f)[0x7f291d998b2f]
[ThinkPad-X240:24279] [10] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3)[0x7f291dcdefa3]
[ThinkPad-X240:24279] [11] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f291d67680f]
[ThinkPad-X240:24279] *** End of error message ***

System setup (singularity container on a system with no GPU, linux kernel 4.15.0 and i7-4600 CPU):

Singularity File:
Bootstrap: docker
From: debian:buster-slim

%environment
    # elevating this to /bin/bash is not possible; therefore this must also be runnable in /bin/dash on Ubuntu

    PREFIX=/opt/lbann

    exportPath()
    {
        if test -d "$2"; then
            export "$1"="$2"
            printf "\e[37mExported existing path '$2' into environment variable '$1'\e[0m\n"
        else
            printf "\e[31m[Warning] '$2' is not a directory. Won't export it\e[0m\n"
        fi
    }

    add2path()
    {
        local targetVar=PATH
        if test "$#" -gt 1; then
            targetVar=$1
            shift 1
        fi
        local targetContent=$( eval echo \$$targetVar )
        local oldContent=$targetContent

        while test "$#" -gt 0; do
            if test -d "$1"; then
                case ":$targetContent:" in
                    *:"$1":*)
                    printf "\e[37m[Info] Path '$1' already exists in \$$targetVar. Won't add it.\e[0m\n"
                    ;;
                    *)
                    targetContent=$1:$targetContent
                    ;;
                esac
            else
                printf "\e[33m[Warning] '$1' is not a directory. Won't append to \$$targetVar variable.\e[0m\n"
            fi
            shift 1
        done

        if test "${#targetContent}" -gt "${#oldContent}"; then
            export $targetVar=$targetContent
            printf "\e[37mExporting new \$$targetVar: $targetContent\e[0m\n"
        elif test "${#targetContent}" -lt "${#oldContent}"; then
            printf "\e[31m[Error] After adding paths, the variable is erroneously shorter (${#targetContent}) than before (${#oldContent})"'!'"\e[0m\n"
        fi
    }

    findPath()
    {
        local fileName=$1
        local searchPath=$2

        if test "$( find "$searchPath" -xtype f -name "$fileName" | head -2 | wc -l )" -gt 1; then
            printf "\e[33m[Warning] Found more than one matching sub path in the searchPath '$searchPath'.\e[0m\n" 1>&2
            printf "\e[37mMatches:\n" 1>&2
            find "$searchPath" -xtype f -name "$fileName" 1>&2
            printf "\e[0m\n" 1>&2
        fi

        local matchingPath=$( find "$searchPath" -xtype f -name "$fileName" | head -1 )
        printf '%s' "${matchingPath%$fileName}"
    }

    exportPath ALUMINUM_DIR "$PREFIX/Aluminum"
    exportPath CEREAL_DIR "$PREFIX/cereal"
    exportPath CNPY_DIR "$PREFIX/cnpy"
    exportPath CUB_DIR "$PREFIX"/cub-*/
    exportPath HWLOC_DIR "$PREFIX/hwloc"
    exportPath HYDROGEN_DIR "$PREFIX/Elemental"
    exportPath OPENCV_DIR "$PREFIX/opencv"

    exportPath PROTOBUF_ROOT "$PREFIX/protobuf"
    if test -d "$PROTOBUF_ROOT"; then
        add2path 'PATH' "$PROTOBUF_ROOT/bin"
        add2path 'CMAKE_PREFIX_PATH' "$PROTOBUF_ROOT"
        PROTOBUF_LIB=$( find "$PROTOBUF_ROOT" -mindepth 1 -maxdepth 1 -type d -name 'lib*' | head -1 ) &&
        add2path 'LIBRARY_PATH' "$PROTOBUF_LIB"
        add2path 'LD_LIBRARY_PATH' "$PROTOBUF_LIB"
    fi

    exportPath LBANN_DIR "$PREFIX/lbann"
    if test -d "$LBANN_DIR"; then
        add2path 'PATH' "$LBANN_DIR/bin"
        add2path 'CMAKE_PREFIX_PATH' "$LBANN_DIR"
        add2path 'LIBRARY_PATH' "$LBANN_DIR/lib"
        add2path 'LD_LIBRARY_PATH' "$LBANN_DIR/lib"
    fi

    add2path 'CMAKE_PREFIX_PATH' "$OPENCV_DIR" "$HYDROGEN_DIR" "$ALUMINUM_DIR"

    add2path 'PATH' "$PREFIX/cmake/bin"

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed -n -z '$p' "/proc/$$/cmdline" | sed 's/\x00/\n/g' | /bin/bash -ve
        exit $?
    fi

    apt-get -y update &&
    apt-get -y install --no-install-recommends \
        findutils sed grep coreutils curl ca-certificates tar dpkg wget cmake \
        gcc g++ gfortran python make zlib*-dev libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev

    PREFIX="/opt/lbann"

    mkdir -p -- "$PREFIX/src"

    version-ge() { test "$1" = "$( printf '%s\n%s' "$1" "$2" | sort -V | tail -n 1 )"; }

    commandExists() { command -v "$@" &>/dev/null; }

    unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }

    remoteExtract()
    {
        local compression=
        local url="${@: -1}"
        local ext="$( printf '%s' "$url" | sed 's/\?.*//; s/.*\.//;' )"
        local iTry=5

        for (( ; iTry > 0; iTry-- )); do
            case "$ext" in
                tgz|gz) compression=--gzip ;;
                xz) compression=--xz ;;
                tbz2|bz2) compression=--bzip2 ;;
            esac

            (
                if command -v wget &>/dev/null; then
                    wget -O- \
                        --retry-connrefused \
                        --timeout=5 \
                        --tries=5 \
                        --waitretry=5 \
                        --read-timeout=20 \
                        "$@" |
                    tar -x $compression
                fi ||
                if command -v curl &>/dev/null; then
                    curl -L \
                        --connect-timeout 5 \
                        --max-time 20 \
                        --retry 5 \
                        --retry-delay 5 \
                        --retry-max-time 60 \
                        "$@" |
                    tar -x $compression
                fi ||
                false
            ) &&
            break
        done
    }

    setupCub()
    {
        cd -- "$PREFIX" &&
        if ! test -d cub-*; then
            remoteExtract 'https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz'
        fi &&
        cd cub-* && export CUB_DIR=$( pwd )
    }

    setupCereal()
    {
        export CEREAL_DIR="$PREFIX"/cereal &&
        if ! test -d "$CEREAL_DIR"; then
            cd -- "$PREFIX/src" &&
            remoteExtract 'https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz' &&
            cd cereal-* && mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$PREFIX"/cereal -DJUST_INSTALL_CEREAL=ON .. &&
            make -j "$( nproc )" install
        fi
    }

    setupCnpy()
    {
        # commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact but has no release and only 1 commit in last year
        export CNPY_DIR="$PREFIX"/cnpy &&
        if ! test -d "$CNPY_DIR"; then
            cd -- "$PREFIX/src" && curl -L 'https://github.com/rogersce/cnpy/archive/master.zip' -o master.zip &&
            unzip "$_" && command rm -f "$_" && cd cnpy-* && mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. && make -j "$( nproc )" install
        fi
    }

    buildAluminum()
    {
        # allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz' &&
        cd Aluminum-* && mkdir -p build && cd "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              .. &&
        make -j "$( nproc )" VERBOSE=1 install || true
    }

    buildHydrogen()
    {
        # Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz' &&
        cd Elemental-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              -DHydrogen_USE_64BIT_INTS=ON           \
              -DHydrogen_ENABLE_OPENMP=ON            \
              -DBUILD_SHARED_LIBS=ON                 \
              -DHydrogen_ENABLE_ALUMINUM=ON          \
              .. &&
        make -j "$( nproc )" VERBOSE=1 install
    }

    buildOpenCV()
    {
        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/opencv/opencv/archive/3.4.3.tar.gz' &&
        cd opencv-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                             \
              -DCMAKE_BUILD_TYPE=Release           \
              -DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
              -DWITH_{JPEG,PNG,TIFF}=ON            \
              -DWITH_{CUDA,JASPER}=OFF             \
              -DBUILD_SHARED_LIBS=ON               \
              -DBUILD_JAVA=OFF                     \
              -DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF .. &&
        make -j "$( nproc )" install
    }

    buildLBANN()
    {
        fixLibZBug()
        {
            find . -type f -execdir  bash -c '
                if grep "g++.*libcnpy\.so" "$0" | grep -q -v " -lz"; then
                    sed -i -r "/g\+\+ .*libcnpy\.so( |$)/{ s:(libcnpy\.so |$):\1-lz : }" "$0";
                fi' {} \;
        }

        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/LLNL/lbann/archive/v0.98.1.tar.gz' &&
        cd lbann-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$PREFIX"/lbann \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              -DHydrogen_DIR="$HYDROGEN_DIR"         \
              -DLBANN_WITH_ALUMINUM:BOOL=ON          \
              -DLBANN_USE_PROTOBUF_MODULE=$( if test -f "$PROTOBUF_ROOT/lib/cmake/protobuf/protobuf-config.cmake"; then echo OFF; else echo ON; fi )  .. &&
        fixLibZBug && make -j 2 VERBOSE=1 install
        # only building with -j 2 instead of -j 4 because the VM on Taurus doesn't seem to have enough memory to run four compilations in parallel ...
    }

    setupCub
    setupCereal
    setupCnpy

    ALUMINUM_DIR="$PREFIX"/Aluminum &&
    if ! test -d "$ALUMINUM_DIR"; then buildAluminum; fi &&
    export CMAKE_PREFIX_PATH=$ALUMINUM_DIR:$CMAKE_PREFIX_PATH

    HYDROGEN_DIR="$PREFIX"/Elemental &&
    if ! test -d "$HYDROGEN_DIR"; then buildHydrogen; fi &&
    export CMAKE_PREFIX_PATH=$HYDROGEN_DIR:$CMAKE_PREFIX_PATH

    OPENCV_DIR="$PREFIX"/opencv
    if ! test -d "$OPENCV_DIR"; then buildOpenCV; fi &&
    export CMAKE_PREFIX_PATH=$OPENCV_DIR:$CMAKE_PREFIX_PATH

    buildLBANN

    exit 0

Support no progress engine

Support a compile-time flag to only start the progress engine on demand (i.e., when something is actually submitted to it). Making this a compile-time flag means we only pay the runtime check penalty when we need it.

This would primarily benefit the NCCL backend, since we would not have a progress engine thread spinning unless it were needed.
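
A hypothetical sketch of deferred startup (guarded here by the AL_PE_START_ON_DEMAND name that appears in the progress-engine-binding issue above; class and method names are illustrative):

#include <mutex>

// The progress thread is spawned only on the first submission, so backends
// that never use the progress engine never pay for a spinning thread.
// std::call_once keeps the check cheap and race-free after startup.
class ProgressEngineSketch {
public:
  void submit(/* operation */) {
#ifdef AL_PE_START_ON_DEMAND
    std::call_once(started_, [this] { start_thread(); });
#endif
    // ... enqueue the operation ...
  }

private:
  void start_thread() { /* spawn and bind the progress thread */ }
  std::once_flag started_;
};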

Consistent issues with the RingMPICUDA initialization

The RingMPICUDA ring is consistently failing on most platforms (I think I've seen issues on sierra and pascal). I have this commented out in the MPI-CUDA communicator in my local branches. Perhaps it would be better to lazily initialize it, esp. since it's heap-alloc'd anyway. I'll update this issue if I do any of the debugging legwork to more precisely characterize the error. Superficially, a Sendrecv is called where at least one of the ranks is gibberish in the sense that it's often negative and also way more than 1, 2, or 3 digits long...

Host-transfer multi-threading issue

When using multiple threads (even on #130), some threads in some ranks non-deterministically get incorrect results when testing. Anecdotally, the incorrect results are either a vector of all 0s (which is strange since the code does not zero memory) or simply incorrect values (but which seem to be in a reasonable range, e.g., no NaNs or clear garbage).

I do not observe this issue with only a single thread.

Tests crash

[yv:04179] *** Process received signal ***
[yv:04179] Signal: Segmentation fault (11)
[yv:04179] Signal code: Address not mapped (1)
[yv:04179] Failing at address: 0x440000c8
[yv:04179] [ 0] 0x8260bcb1e <pthread_sigmask+0x54e> at /lib/libthr.so.3
[yv:04179] [ 1] 0x8260bc0cf <pthread_setschedparam+0x83f> at /lib/libthr.so.3
[yv:04179] [ 2] 0x7ffffffff2d3 <_fini+0x7fffffbdb5e7> at ???
[yv:04179] [ 3] 0x822f320d8 <MPI_Comm_get_attr+0x58> at /usr/local/mpi/openmpi/lib/libmpi.so.40
[yv:04179] [ 4] 0x82165de57 <_ZN2Al8internal3mpi4initERiRPPc+0xf7> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.2.1
[yv:04179] [ 5] 0x82165ab49 <_ZN2Al10InitializeERiRPPc+0x19> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.2.1
[yv:04179] [ 6] 0x4054b0 <main+0x40> at /usr/ports/net/aluminum/work/.build/test/test_exchange
[yv:04179] *** End of error message ***
*** Signal 11

Version: 1.2.1
clang-14
FreeBSD 13.1

Reorganize MPI backend allreduces

The existing implementations of the MPI allreduces have been haphazardly updated with a number of changes to the progress engine and other APIs. We should clean them up.

Build breaks because of missing 'cuda.h' when ALUMINUM_ENABLE_CUDA=OFF

===>  Building for Aluminum-0.2.1
[1/39] /usr/local/bin/clang++70  -DAl_EXPORTS -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIC   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT src/CMakeFiles/Al.dir/profiling.cpp.o -MF src/CMakeFiles/Al.dir/profiling.cpp.o.d -o src/CMakeFiles/Al.dir/profiling.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src/profiling.cpp
[2/39] /usr/local/bin/clang++70   -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/test -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIE   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -MF benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o.d -o benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp
FAILED: benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o 
/usr/local/bin/clang++70   -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/test -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIE   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -MF benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o.d -o benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp
/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp:4:10: fatal error: 'cuda.h' file not found
#include <cuda.h>
         ^~~~~~~~
1 error generated.

Optimize NCCL collectives

Provide optimized versions of our custom-implemented NCCL collectives:

  • Alltoall
  • Gather
  • Scatter
  • Allgatherv
  • Alltoallv
  • Reduce_scatterv
  • Gatherv
  • Scatterv

Traces are not thread-safe

Saving a trace entry is not thread-safe when AL_THREAD_MULTIPLE is set and could trash the trace log.
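
A minimal sketch of a fix, with illustrative names: serialize appends to the trace log. A mutex per log is the simplest option; something lock-free would only matter if tracing sits on a hot path.

#include <mutex>
#include <string>
#include <utility>
#include <vector>

class TraceLog {
public:
  void add_entry(std::string entry) {
    std::lock_guard<std::mutex> lock(mutex_);  // one writer at a time
    entries_.push_back(std::move(entry));
  }

private:
  std::mutex mutex_;
  std::vector<std::string> entries_;
};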

GPU memory pool

The NCCL Alltoall and Alltoallv operations allocate temporary buffers on the GPU to handle in-place operations. Right now this directly calls cudaMalloc/cudaFree. We should use a memory pool, since these operations have a large overhead.

Right now our memory pools only support host memory. We should probably use cub (or something similar).
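
If cub (or a similar caching allocator) is acceptable here, it is roughly a drop-in replacement for the cudaMalloc/cudaFree pair; a sketch with error handling omitted and illustrative names:

#include <cub/util_allocator.cuh>

// cub's CachingDeviceAllocator keeps freed blocks in size-class bins, so
// the in-place Alltoall(v) path would only hit the driver on the first
// allocation of each size class.
cub::CachingDeviceAllocator device_pool;

void* get_device_workspace(size_t bytes, cudaStream_t stream) {
  void* ptr = nullptr;
  // The stream argument lets cub hand the block back to work on that
  // stream without extra synchronization.
  device_pool.DeviceAllocate(&ptr, bytes, stream);
  return ptr;
}

void release_device_workspace(void* ptr) {
  device_pool.DeviceFree(ptr);
}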

Check buffers in debug mode

We should check in debug mode that the provided buffers are sane (a sketch follows the list). Specifically:

  • Buffers should not be null (except when permitted, e.g., for non-roots in some collectives).
  • Buffers do not intersect (except when in-place).
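
Illustrative debug-mode checks (the names are not Aluminum's); null_ok covers the permitted cases, e.g. non-root ranks in rooted collectives:

#include <cassert>
#include <cstddef>

inline void check_buffer(const void* buf, bool null_ok) {
  assert((null_ok || buf != nullptr) && "null buffer passed to collective");
}

// Verifies that [send, send+send_bytes) and [recv, recv+recv_bytes) do not
// intersect unless the operation is explicitly in-place.
inline void check_no_overlap(const void* send, size_t send_bytes,
                             const void* recv, size_t recv_bytes,
                             bool inplace) {
  if (inplace) return;  // overlapping buffers are expected when in-place
  const char* s = static_cast<const char*>(send);
  const char* r = static_cast<const char*>(recv);
  assert((s + send_bytes <= r || r + recv_bytes <= s) &&
         "send and receive buffers overlap");
}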

Tests use invalid integer types in std::uniform_int_distribution

-- Build files have been written to: /usr/ports/net/aluminum/work/.build
[ 50% 1/2] /usr/local/libexec/ccache/c++  -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/include -I/usr/ports/net/aluminum/work/.build -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/test -isystem /usr/ports/net/aluminum/work/Aluminum-1.4.0/third_party/cxxopts/include -isystem /usr/local/include -isystem /usr/local/mpi/openmpi/include -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -Wall -Wextra -pedantic -faligned-new -g3 -O2 -pipe -fstack-protector-strong -fno-strict-aliasing   -DNDEBUG -std=gnu++17 -fPIE -fexceptions -pthread -MD -MT test/CMakeFiles/test_ops.dir/test_ops.cpp.o -MF test/CMakeFiles/test_ops.dir/test_ops.cpp.o.d -o test/CMakeFiles/test_ops.dir/test_ops.cpp.o -c /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp
FAILED: test/CMakeFiles/test_ops.dir/test_ops.cpp.o 
/usr/local/libexec/ccache/c++  -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/include -I/usr/ports/net/aluminum/work/.build -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/test -isystem /usr/ports/net/aluminum/work/Aluminum-1.4.0/third_party/cxxopts/include -isystem /usr/local/include -isystem /usr/local/mpi/openmpi/include -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -Wall -Wextra -pedantic -faligned-new -g3 -O2 -pipe -fstack-protector-strong -fno-strict-aliasing   -DNDEBUG -std=gnu++17 -fPIE -fexceptions -pthread -MD -MT test/CMakeFiles/test_ops.dir/test_ops.cpp.o -MF test/CMakeFiles/test_ops.dir/test_ops.cpp.o.d -o test/CMakeFiles/test_ops.dir/test_ops.cpp.o -c /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp
In file included from /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:28:
In file included from /usr/ports/net/aluminum/work/Aluminum-1.4.0/include/Al.hpp:39:
In file included from /usr/include/c++/v1/vector:312:
In file included from /usr/include/c++/v1/algorithm:1851:
In file included from /usr/include/c++/v1/__algorithm/ranges_sample.h:13:
In file included from /usr/include/c++/v1/__algorithm/sample.h:18:
/usr/include/c++/v1/__random/uniform_int_distribution.h:162:5: error: static assertion failed due to requirement '__libcpp_random_is_valid_inttype<char>::value': IntType must be a supported integer type
    static_assert(__libcpp_random_is_valid_inttype<_IntType>::value, "IntType must be a supported integer type");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:138:36: note: in instantiation of template class 'std::uniform_int_distribution<char>' requested here
  std::uniform_int_distribution<T> rng;
                                   ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:149:14: note: in instantiation of function template specialization 'gen_random_val<char, std::linear_congruential_engine<unsigned int, 48271, 0, 2147483647>, true>' requested here
      v[i] = gen_random_val<T>(g);
             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:207:30: note: in instantiation of function template specialization 'RandVectorGen<char>::gen<std::linear_congruential_engine<unsigned int, 48271, 0, 2147483647>>' requested here
    return RandVectorGen<T>::gen(count, rng_gen);
                             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:139:35: note: in instantiation of member function 'VectorType<char, Al::MPIBackend>::gen_data' requested here
    input(VectorType<T, Backend>::gen_data(in_size, comm_wrapper.comm().get_stream())),
                                  ^
/usr/include/c++/v1/__memory/allocator.h:165:28: note: in instantiation of member function 'TestData<Al::MPIBackend, char>::TestData' requested here
        ::new ((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                           ^
/usr/include/c++/v1/__memory/allocator_traits.h:290:13: note: (skipping 3 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
        __a.construct(__p, _VSTD::forward<_Args>(__args)...);
            ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:356:14: note: in instantiation of function template specialization 'std::vector<TestData<Al::MPIBackend, char>>::emplace_back<unsigned long &, unsigned long &, CommWrapper<Al::MPIBackend> &>' requested here
        data.emplace_back(in_size, out_size, comm_wrappers[i]);
             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:419:7: note: in instantiation of function template specialization 'run_test<Al::MPIBackend, char, true>' requested here
      run_test<Backend, T>(parsed_opts);
      ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:321:39: note: in instantiation of function template specialization 'test_dispatcher::operator()<Al::MPIBackend, char>' requested here
    {"char", [&]() { functor.template operator()<Backend, char>(parsed_opts); } },
                                      ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:359:20: note: in instantiation of function template specialization 'dispatch_to_backend_type_helper<Al::MPIBackend, test_dispatcher>' requested here
    {"mpi", [&](){ dispatch_to_backend_type_helper<Al::MPIBackend>(parsed_opts, functor); } },
                   ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:488:3: note: in instantiation of function template specialization 'dispatch_to_backend<test_dispatcher>' requested here
  dispatch_to_backend(parsed_opts, test_dispatcher());
  ^
1 error generated.

clang-15
FreeBSD 13.2

In-place NCCL reduce-scatter is not in-place

The NCCL backend's in-place reduce-scatter uses sendbuf = recvbuf here. But per the NCCL documentation, the in-place reduce-scatter should actually have recvbuf be the appropriate offset into sendbuf (see here). This appears to differ from the usual MPI semantics. Our current version works, but I'm not sure whether using overlapping buffers without telling NCCL that we are is safe. We may also be missing some performance benefits.
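
For reference, a minimal sketch of NCCL's documented in-place call shape, which points recvbuff at this rank's chunk of sendbuff (this is not what the backend currently does, and it places the result in the rank's slot rather than at the start of the buffer, unlike the MPI-style semantics):

#include <nccl.h>

// `buffer` holds all ranks' contributions; the reduced chunk for this rank
// is written back into its own slot, satisfying
// recvbuff == sendbuff + rank * recvcount.
ncclResult_t inplace_reduce_scatter(float* buffer, size_t recvcount, int rank,
                                    ncclComm_t comm, cudaStream_t stream) {
  return ncclReduceScatter(buffer, buffer + rank * recvcount, recvcount,
                           ncclFloat, ncclSum, comm, stream);
}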

NCCL group limits

According to the NCCL documentation, there is a maximum of 2048 calls that can be aggregated within a ncclGroupStart/ncclGroupEnd pair. Our current implementations are fine at small scale, but collectives will exceed this at large scale. For example, Allgatherv performs a send and a receive for each rank, meaning we exceed this at only 1025 processors (assuming all processors contribute data).

We need an API to automagically manage NCCL groups, at least for send/recv operations. We have to be careful in how we split things up (esp. when skipping calls due to 0-length buffers), as a group of send/recv operations only completes when all send/recv operations complete. Thus we need to ensure groups are split up in a way that is consistent across all processors.
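
A hedged sketch of what the group management could look like for an allgatherv-style exchange (assumes NCCL >= 2.7 point-to-point; error handling omitted; names are illustrative). The key property is that group boundaries are computed from a per-peer call budget rather than from the calls actually issued, so skipping zero-length peers cannot desynchronize the groups across ranks:

#include <nccl.h>

#include <cstddef>
#include <vector>

void grouped_allgatherv(const float* sendbuf, size_t sendcount,
                        float* recvbuf, const std::vector<size_t>& recvcounts,
                        const std::vector<size_t>& displs,
                        int nranks, ncclComm_t comm, cudaStream_t stream,
                        size_t max_calls_per_group = 2048) {
  size_t calls_in_group = 0;
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    // Budget the calls this peer *would* issue, whether or not we skip
    // them, so every rank closes each group at the same peer index.
    const size_t calls_for_peer = 2;  // one send + one recv
    if (calls_in_group + calls_for_peer > max_calls_per_group) {
      ncclGroupEnd();    // flush the current group on every rank
      ncclGroupStart();
      calls_in_group = 0;
    }
    if (sendcount > 0) {
      ncclSend(sendbuf, sendcount, ncclFloat, peer, comm, stream);
    }
    if (recvcounts[peer] > 0) {
      ncclRecv(recvbuf + displs[peer], recvcounts[peer], ncclFloat,
               peer, comm, stream);
    }
    calls_in_group += calls_for_peer;
  }
  ncclGroupEnd();
}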

Support CMake components in export

We should have component support to make life easier for picky downstreams. For example, distconv requires NCCL and HostTransfer backends, and it'd be better to just

find_package(Aluminum COMPONENTS NCCL HT)

(so the complexities of searching/checking/re-searching/re-checking/... are handled by CMake rather than requiring us to write that logic).

I've added this to my TODO list, but it's long. I wanted to leave this here to keep myself honest.

Serialized MPI

Support a (compile-time chosen) serialized mode for the MPI backend. In this mode, MPI will be initialized with MPI_THREAD_SERIALIZED if it is not already initialized. Aside from communicator creation, in this mode Aluminum will serialize all calls to MPI by running them on the progress engine. Blocking calls will be internally transformed into non-blocking calls on the progress engine followed immediately by a wait.
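
The initialization half of this could look roughly like the following sketch (standard MPI calls only; the function name is illustrative):

#include <mpi.h>

// Request MPI_THREAD_SERIALIZED only if the application has not already
// initialized MPI, and report what was actually provided.  Since the
// progress engine would be the only caller of MPI afterwards,
// MPI_THREAD_SERIALIZED (or stronger) is sufficient.
int init_serialized_mpi(int& argc, char**& argv) {
  int initialized = 0;
  MPI_Initialized(&initialized);
  int provided = MPI_THREAD_SINGLE;
  if (!initialized) {
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
  } else {
    MPI_Query_thread(&provided);
  }
  return provided;
}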

Fix coding style

Our coding style here is a bit of a mess and needs to be unified, especially variable names.

Improve progress thread binding

Improve the automatic node topology detection for progress thread binding. We currently end up using AL_PROGRESS_RANKS_PER_NUMA_NODE more frequently than not.

[1.4.1] Tests crash

===>  Testing for Aluminum-1.4.1
===>   Aluminum-1.4.1 depends on package: cxxopts>0 - found
-- Configuring done (4.9s)
-- Generating done (0.0s)
-- Build files have been written to: /usr/ports/net/aluminum/work/.build
ninja: no work to do.
[  0% 1/1] cd /usr/ports/net/aluminum/work/.build && /usr/local/bin/ctest --force-new-ctest-process
Test project /usr/ports/net/aluminum/work/.build
No tests were found!!!
[yv:33027] *** Process received signal ***
[yv:33027] Signal: Segmentation fault (11)
[yv:33027] Signal code: Address not mapped (1)
[yv:33027] Failing at address: 0x440000c8
[yv:33027] [ 0] 0x826d6762c <pthread_sigmask+0x54c> at /lib/libthr.so.3
[yv:33027] [ 1] 0x826d66bd9 <pthread_setschedparam+0x839> at /lib/libthr.so.3
[yv:33027] [ 2] 0x7ffffffff923 <_fini+0x7fffffdd3aa7> at ???
[yv:33027] [ 3] 0x824332fe8 <MPI_Comm_get_attr+0x58> at /usr/local/mpi/openmpi/lib/libmpi.so.40
[yv:33027] [ 4] 0x821b7a4b2 <_ZN2Al8internal3mpi4initERiRPPci+0x102> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 5] 0x821b7735a <_ZN2Al10InitializeERiRPPci+0x1a> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 6] 0x20d730 <main+0x40> at /usr/ports/net/aluminum/work/.build/test/test_exchange
[yv:33027] *** End of error message ***
*** Signal 11

clang-15
FreeBSD 13.2

Host-transfer allgather fails

The host-transfer allgather results in different errors coming from SMPI on Lassen, depending on exactly how many processors are used. It works fine on a single processor, but fails with 2+.

I have not yet identified whether this is an error in SMPI or Aluminum. It is possibly similar to #90 (I see it sometimes trying to dereference address 1).

Support bfloat16

Both NCCL and RCCL support bfloat16; we should add support for it, as we did with half.

I'm a newbie. Can I ask you a question

Following the README, I executed the cmake and make directives. Then I went to the examples directory, followed the README there, executed make, and got three executables: hello_world, allreduce, and pingpong. When I execute ./hello_world, I get the following error. I want to fix this error.

shangda02@abc-Super-Server:~/LLNL_Aluminum/examples/build$ ./hello_world
terminate called after throwing an instance of 'Al::al_exception'
what(): /home/shangda02/LLNL_Aluminum/src/progress.cpp:88 - Tried to exchange infinite bitmap
[abc-Super-Server:29183] *** Process received signal ***
[abc-Super-Server:29183] Signal: Aborted (6)
[abc-Super-Server:29183] Signal code: (-6)
[abc-Super-Server:29183] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fdb3bd6ef10]
[abc-Super-Server:29183] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdb3bd6ee87]
[abc-Super-Server:29183] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdb3bd707f1]
[abc-Super-Server:29183] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7fdb3c3c5957]
[abc-Super-Server:29183] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7fdb3c3cbae6]
[abc-Super-Server:29183] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7fdb3c3cbb21]
[abc-Super-Server:29183] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7fdb3c3cbd54]
[abc-Super-Server:29183] [ 7] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngine9bind_initEv+0xbc8)[0x7fdb3ca1e678]
[abc-Super-Server:29183] [ 8] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngineC1Ev+0x14b)[0x7fdb3ca1ec3b]
[abc-Super-Server:29183] [ 9] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al10InitializeERiRPPcP19ompi_communicator_t+0x39)[0x7fdb3ca19a89]
[abc-Super-Server:29183] [10] ./hello_world(+0xe30)[0x5598fd9ace30]
[abc-Super-Server:29183] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fdb3bd51c87]
[abc-Super-Server:29183] [12] ./hello_world(+0xfca)[0x5598fd9acfca]
[abc-Super-Server:29183] *** End of error message ***
Aborted (core dumped)

To be honest, I don't have a good understanding of the whole project

In-place MPI scatter segfaults on one processor

$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend mpi --op scatter --inplace
Aborting after hang in Al size=1

Scatterv also hangs.

Regular scatter works fine. Also works with more processors. Likewise, other rooted collectives (bcast, gather, reduce) work in this config. This is a bit strange since this should be a NOP.

We also need to verify this is not an SMPI bug.
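
A minimal standalone reproducer of the underlying MPI call might help with that (a sketch, not Aluminum code): on the root, an in-place scatter passes MPI_IN_PLACE as the receive buffer, and with a single rank this should be a no-op.

#include <mpi.h>

#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::vector<float> buf(size, static_cast<float>(rank));
  if (rank == 0) {
    // In-place at the root: recvbuf is MPI_IN_PLACE, data stays in buf.
    MPI_Scatter(buf.data(), 1, MPI_FLOAT, MPI_IN_PLACE, 1, MPI_FLOAT,
                0, MPI_COMM_WORLD);
  } else {
    // Send buffer is ignored on non-root ranks.
    MPI_Scatter(nullptr, 1, MPI_FLOAT, buf.data(), 1, MPI_FLOAT,
                0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}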
