aluminum's People

Contributors

andy-yoo, benson31, bvanessen, naoyam, ndryden, nobles5e

aluminum's Issues

Add note and CMake requirement for minimum CUDA version

For reasons, I tried to compile with CUDA 7.0 but couldn't because:

[  8%] Building CUDA object src/CMakeFiles/Al.dir/mpi_cuda/cuda_kernels.cu.o
In file included from /opt/Aluminum-0.2/src/mempool.hpp:35:0,
                 from /opt/Aluminum-0.2/src/mpi_impl.hpp:36,
                 from /opt/Aluminum-0.2/src/Al.hpp:826,
                 from /opt/Aluminum-0.2/src/cuda.cpp:30:
/opt/Aluminum-0.2/src/cuda.cpp: In function 'void Al::internal::cuda::init(int&, char**&)':
/opt/Aluminum-0.2/src/cuda.cpp:61:30: error: 'CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS' was not declared in this scope
                       &attr, CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS, dev));
                              ^
/opt/Aluminum-0.2/src/cuda.hpp:89:39: note: in definition of macro 'AL_FORCE_CHECK_CUDA_DRV_NOSYNC'
     CUresult status_CHECK_CUDA_DRV = (cuda_call);               \
                                       ^
/opt/Aluminum-0.2/src/cuda.cpp:60:3: note: in expansion of macro 'AL_CHECK_CUDA_DRV'
   AL_CHECK_CUDA_DRV(cuDeviceGetAttribute(
   ^
[ 10%] Building CUDA object src/CMakeFiles/Al.dir/helper_kernels.cu.o
make[2]: *** [src/CMakeFiles/Al.dir/cuda.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/opt/Aluminum-0.2/src/helper_kernels.cu(48): error: identifier "CU_STREAM_WAIT_VALUE_EQ" is undefined

/opt/Aluminum-0.2/src/helper_kernels.cu(48): error: identifier "cuStreamWaitValue32" is undefined

2 errors detected in the compilation of "/tmp/tmpxft_0000260c_00000000-7_helper_kernels.cpp1.ii".
make[2]: *** [src/CMakeFiles/Al.dir/helper_kernels.cu.o] Error 2
make[1]: *** [src/CMakeFiles/Al.dir/all] Error 2
make: *** [all] Error 2

Looking at the docs, CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS was added in CUDA 9.0, and the other missing defines seem to be mostly from CUDA 8.0. It seems that at least CUDA 9.0 is therefore required. The README.md does not mention any minimum version (even though it mentions that at least MPI 3.0 is required), and the CMakeLists.txt does not check against a minimum version. IMHO, both should be added, maybe after some testing to confirm it really works with CUDA 9.0. I might test this with a Singularity container.
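
Independent of a CMake check (e.g. against CMAKE_CUDA_COMPILER_VERSION), a source-level guard would at least turn the kernel errors above into a clear diagnostic. A minimal sketch:

#include <cuda.h>

// CUDA_VERSION is defined by cuda.h as major*1000 + minor*10, so 9000
// corresponds to CUDA 9.0, which is when
// CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS appeared.
#if !defined(CUDA_VERSION) || CUDA_VERSION < 9000
#error "Aluminum's CUDA backends require at least CUDA 9.0"
#endif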

Support HIP stream memory operations

As of ROCm 4.2, HIP supports hipStreamWaitValue32/64 and hipStreamWriteValue32/64 (analogous to the corresponding cuStreamWaitValue/WriteValue methods we use). We should support these as well and make them the default implementation on AMD, rather than the manual spin-wait kernel-style approach we currently use.

Hipify should support these, so we may not need to do anything other than change some #defines.
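
A rough sketch of what the wait path might look like on ROCm >= 4.2, assuming the hipStreamWaitValueEq flag and the trailing mask argument behave as documented; the function and variable names here are illustrative, not Aluminum's:

#include <hip/hip_runtime.h>

#include <cstdint>
#include <stdexcept>

// Block `stream` until the 32-bit word at `sync_word` (device-visible
// memory) equals `expected`, without launching a spin-wait kernel.
// hipStreamWaitValue32 mirrors cuStreamWaitValue32; the mask selects which
// bits participate in the comparison.
void wait_for_value(hipStream_t stream, uint32_t* sync_word, uint32_t expected) {
  hipError_t err = hipStreamWaitValue32(stream, sync_word, expected,
                                        hipStreamWaitValueEq, 0xFFFFFFFFu);
  if (err != hipSuccess) {
    throw std::runtime_error(hipGetErrorString(err));
  }
}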

Early exit for trivial NCCL collectives

Right now, some NCCL operations will exit early when their count parameter is 0. However, non-blocking collectives still set up CUDA event synchronization between streams, which will add some overhead.

For some collectives, we may also be able to exit early if the size of the communicator is 1.
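
A guard along these lines (purely illustrative; the real entry points would still return their usual request object) is probably all the non-blocking paths need before setting up any events:

#include <cstddef>

// Returns true when a collective can return immediately without touching
// the CUDA event machinery.  count == 0 is always trivial; a single-rank
// communicator is only trivial for in-place operations, since out-of-place
// calls still need a local copy.
inline bool collective_is_trivial(size_t count, int comm_size, bool inplace) {
  return count == 0 || (comm_size == 1 && inplace);
}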

Better memory pools

Our memory pools are currently inefficient, ugly, and hard to extend if we need different kinds of memory.

Implement a generic memory pool interface that operates something like:

  • T* pool_type::GetMemory<MemoryKindTag>(size_t n)
  • void pool_type::ReleaseMemory<MemoryKindTag>(T* mem)

(This is just a starting point. We may want to put the memory kind on the pool instead.)

We should probably not rely on CUB/etc. in order to minimize cross-platform issues.

The pool should be thread-safe.
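
A rough sketch of what this could look like, with the memory kind on the pool (one of the options mentioned above); only host allocation is shown, and all names beyond GetMemory/ReleaseMemory are assumptions:

#include <cstdlib>
#include <map>
#include <mutex>
#include <vector>

// Tags naming the kind of memory a pool hands out; pinned and device
// memory would get their own tags and allocators.
struct HostMemoryTag {};

// Minimal thread-safe sketch: a free list per requested byte size, guarded
// by a single mutex.  Alignment, size-class rounding, and non-host
// allocators are left out.
template <typename MemoryKindTag>
class MemoryPool {
public:
  template <typename T>
  T* GetMemory(size_t n) {
    const size_t bytes = n * sizeof(T);
    std::lock_guard<std::mutex> lock(mutex_);
    auto& bin = free_lists_[bytes];
    if (!bin.empty()) {
      void* p = bin.back();
      bin.pop_back();
      return static_cast<T*>(p);
    }
    void* p = std::malloc(bytes);
    allocated_bytes_[p] = bytes;
    return static_cast<T*>(p);
  }

  template <typename T>
  void ReleaseMemory(T* mem) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_lists_[allocated_bytes_.at(mem)].push_back(mem);
  }

private:
  std::mutex mutex_;
  std::map<size_t, std::vector<void*>> free_lists_;
  std::map<void*, size_t> allocated_bytes_;
};

Usage would then look like MemoryPool<HostMemoryTag> pool; float* buf = pool.GetMemory<float>(1024); ...; pool.ReleaseMemory(buf);.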

Testing half

Our current testing infrastructure does not actually check results when using half, since MPI does not support it.

Host-transfer allreduce on one processor fails

$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend ht --op allreduce --size 1                                         
[lassen3:09550] *** An error occurred in MPI_Irecv
[lassen3:09550] *** reported by process [4456568,0]
[lassen3:09550] *** on communicator MPI COMMUNICATOR 5 DUP FROM 0
[lassen3:09550] *** MPI_ERR_RANK: invalid rank
[lassen3:09550] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lassen3:09550] ***    and potentially your MPI job)
$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend ht --op allreduce --inplace --size 1
[lassen3:08077] *** An error occurred in MPI_Irecv
[lassen3:08077] *** reported by process [3087233017,0]
[lassen3:08077] *** on communicator MPI COMMUNICATOR 5 DUP FROM 0
[lassen3:08077] *** MPI_ERR_RANK: invalid rank
[lassen3:08077] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[lassen3:08077] ***    and potentially your MPI job)

This appears to be a problem directly with the host-transfer backend, as the corresponding MPI backend operations work fine.

NCCL inplace reduce-scatterv trashes rank 0 buffer

We currently implement the reduce-scatterv as a reduce to rank 0 followed by a scatterv. When doing an in-place op, the reduce is in-place on the input sendbuf. This therefore writes to portions of sendbuf on rank 0 that are outside of the region where the final scattered value would be placed.

I can't find anything explicitly prohibiting this in the MPI standard, but:

  • It's a bit aesthetically displeasing.
  • In other cases, like an MPI_Recv with a buffer/count larger than the actual message length, MPI does guarantee that no more memory will be touched than is actually needed by the message.
  • Avoiding it shouldn't take too much overhead if we use a memory pool.
  • A better, direct implementation can probably avoid it.

Thread safety

Aluminum currently does not provide a documented thread-safety guarantee. In practice, different backends implicitly provide different safety guarantees. I believe the main bottleneck is with the progress engine, which currently uses a single-producer/single-consumer queue.

We should:

  • Provide an explicit, documented standard for thread safety. I propose that, for performance reasons, we allow this to be chosen at compile time to be either (a rough sketch of this compile-time switch follows the lists below):
    • Multiple threads can simultaneously submit communication operations. If the operations are on the same communicator or compute stream, the ordering must be provided by the user.
    • Only a single thread submits communication at a time. (We can allow this to be different threads over time; it is up to the user to synchronize.)
  • Implement the above. I believe the main task is to improve the queues the progress engine uses and ensure memory pools are thread-safe.

Things that are not necessarily thread-safe:

  • Internal CUDA stuff. (See, e.g., NCCLRequest).
  • Memory pools (depending on setting).
  • Progress engine.
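
Regarding the compile-time choice in the first bullet above, a rough illustration (AL_THREAD_MULTIPLE is the existing setting referenced elsewhere; the mutex is only a stand-in for a proper multi-producer queue):

#include <mutex>

// With AL_THREAD_MULTIPLE defined, concurrent submitters are serialized
// here; without it, the lock disappears at compile time and the existing
// single-producer path is unchanged.
void submit_to_progress_engine(/* operation state */) {
#ifdef AL_THREAD_MULTIPLE
  static std::mutex submit_mutex;
  std::lock_guard<std::mutex> lock(submit_mutex);
#endif
  // ... push the operation onto the progress engine's queue ...
}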

Rooted in-place NCCL calls wrong

I have not verified it, but looking at the code, I believe we do not correctly set the buffers for rooted NCCL collectives (e.g., Gather) when the root is not 0.

We should add a test that checks this case, and fix it if needed.

Probably we just need to switch to passing internal::IN_PLACE<T>() in the right places.

Hang in progress engine binding

(This issue is already fixed in #181, but I'm writing up an issue to document it. I'm writing about Aluminum as it was before that PR.)

The progress engine does some MPI communication to decide how to bind the progress engine thread. This involves collectives being run among the processes on each physical node (i.e., there is no global collective, just concurrent collectives within each node). If progress engine startup is deferred (with AL_PE_START_ON_DEMAND), then this is not executed until the progress engine actually starts. However, if not every rank on a node performs an operation starting the progress engine (e.g., because they're doing a point-to-point operation), then the ranks may hang and the progress engine not fully start.

Update examples

Our example codes have gotten a bit stale. They need to be updated to reflect changes made in the main Aluminum test/benchmark utilities:

  • Support Flux environment variables.
  • Properly handle half and bfloat16.

Support half in MPI and Host-Transfer

Add support for half to the MPI and Host-Transfer backends.

If we're testing it (see #104), we have to support it with MPI in some way anyway, so we might as well add full support.

Note that since these will primarily be performing computation on the CPU, and CPUs don't (at present) have good support for half, reduction performance may be poor.
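
One way to get MPI itself to reduce half data is a custom datatype plus a user-defined MPI_Op that widens each element to float. A hedged sketch (the conversion helpers are simplified, and the names are illustrative, not Aluminum's):

#include <mpi.h>

#include <cstdint>
#include <cstring>

// Half values travel as raw 2-byte words; the reduction converts each pair
// to float, adds, and converts back.  The conversions are simplified for
// brevity: subnormals flush to zero and float-to-half truncates rather than
// rounding, which is fine for a sketch but not for production.
static float half_to_float(uint16_t h) {
  uint32_t sign = (uint32_t)(h & 0x8000) << 16;
  uint32_t exp  = (h >> 10) & 0x1F;
  uint32_t mant = h & 0x3FF;
  uint32_t bits;
  if (exp == 0)       bits = sign;                                      // zero (subnormals flushed)
  else if (exp == 31) bits = sign | 0x7F800000u | (mant << 13);         // inf/NaN
  else                bits = sign | ((exp + 112) << 23) | (mant << 13); // 112 = 127 - 15
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

static uint16_t float_to_half(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  uint16_t sign = (uint16_t)((bits >> 16) & 0x8000);
  int32_t exp = (int32_t)((bits >> 23) & 0xFF) - 127 + 15;
  uint32_t mant = bits & 0x7FFFFFu;
  if (exp >= 31) return (uint16_t)(sign | 0x7C00);  // overflow, inf, NaN -> inf
  if (exp <= 0)  return sign;                       // underflow -> signed zero
  return (uint16_t)(sign | (exp << 10) | (mant >> 13));
}

// MPI_User_function: element-wise half sum over raw 2-byte words.
static void half_sum_op(void* invec, void* inoutvec, int* len, MPI_Datatype*) {
  const uint16_t* in = static_cast<const uint16_t*>(invec);
  uint16_t* inout = static_cast<uint16_t*>(inoutvec);
  for (int i = 0; i < *len; ++i) {
    inout[i] = float_to_half(half_to_float(in[i]) + half_to_float(inout[i]));
  }
}

void register_half_sum(MPI_Datatype* half_type, MPI_Op* half_sum) {
  MPI_Type_contiguous(2, MPI_BYTE, half_type);  // opaque 2-byte element
  MPI_Type_commit(half_type);
  MPI_Op_create(&half_sum_op, /*commute=*/1, half_sum);
  // e.g. MPI_Allreduce(MPI_IN_PLACE, buf, count, *half_type, *half_sum, comm);
}

The scalar convert-add-convert loop is also a good illustration of why CPU-side reductions on half are expected to be slow.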

MPSC queue debug checks

The SPSC queue has a sanity check in debug mode for when the queue is full. We should add a similar check to the MPSC queue.

Could not get NUMA node

When trying to run lbann --help on my notebook, compiled without CUDA but with Aluminum support, I get this error:

terminate called after throwing an instance of 'Al::al_exception'
  what():  /opt/lbann/src/Aluminum-0.2/src/progress.cpp:145 - Could not get NUMA node.
[ThinkPad-X240:24279] *** Process received signal ***
[ThinkPad-X240:24279] Signal: Aborted (6)
[ThinkPad-X240:24279] Signal code:  (-6)
[ThinkPad-X240:24279] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x126e0)[0x7f291dce96e0]
[ThinkPad-X240:24279] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f291d5b48bb]
[ThinkPad-X240:24279] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f291d59f535]
[ThinkPad-X240:24279] [ 3] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c983)[0x7f291d969983]
[ThinkPad-X240:24279] [ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x928e6)[0x7f291d96f8e6]
[ThinkPad-X240:24279] [ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x92921)[0x7f291d96f921]
[ThinkPad-X240:24279] [ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b54)[0x7f291d96fb54]
[ThinkPad-X240:24279] [ 7] /opt/lbann/Aluminum/lib/libAl.so(+0xf14e)[0x7f291dc1114e]
[ThinkPad-X240:24279] [ 8] /opt/lbann/Aluminum/lib/libAl.so(_ZN2Al8internal14ProgressEngine6engineEv+0x1a)[0x7f291dc1604a]
[ThinkPad-X240:24279] [ 9] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xbbb2f)[0x7f291d998b2f]
[ThinkPad-X240:24279] [10] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3)[0x7f291dcdefa3]
[ThinkPad-X240:24279] [11] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f291d67680f]
[ThinkPad-X240:24279] *** End of error message ***

System setup (singularity container on a system with no GPU, linux kernel 4.15.0 and i7-4600 CPU):

Singularity File:
Bootstrap: docker
From: debian:buster-slim

%environment
    # elevating this to /bin/bash is not possible; therefore this must also be runnable in /bin/dash on Ubuntu

    PREFIX=/opt/lbann

    exportPath()
    {
        if test -d "$2"; then
            export "$1"="$2"
            printf "\e[37mExported existing path '$2' into environment variable '$1'\e[0m\n"
        else
            printf "\e[31m[Warning] '$2' is not a directory. Won't export it\e[0m\n"
        fi
    }

    add2path()
    {
        local targetVar=PATH
        if test "$#" -gt 1; then
            targetVar=$1
            shift 1
        fi
        local targetContent=$( eval echo \$$targetVar )
        local oldContent=$targetContent

        while test "$#" -gt 0; do
            if test -d "$1"; then
                case ":$targetContent:" in
                    *:"$1":*)
                    printf "\e[37m[Info] Path '$1' already exists in \$$targetVar. Won't add it.\e[0m\n"
                    ;;
                    *)
                    targetContent=$1:$targetContent
                    ;;
                esac
            else
                printf "\e[33m[Warning] '$1' is not a directory. Won't append to \$$targetVar variable.\e[0m\n"
            fi
            shift 1
        done

        if test "${#targetContent}" -gt "${#oldContent}"; then
            export $targetVar=$targetContent
            printf "\e[37mExporting new \$$targetVar: $targetContent\e[0m\n"
        elif test "${#targetContent}" -lt "${#oldContent}"; then
            printf "\e[31m[Error] After adding paths, the variable is erroneously shorter (${#targetContent}) than before (${#oldContent})"'!'"\e[0m\n"
        fi
    }

    findPath()
    {
        local fileName=$1
        local searchPath=$2

        if test "$( find "$searchPath" -xtype f -name "$fileName" | head -2 | wc -l )" -gt 1; then
            printf "\e[33m[Warning] Found more than one matching sub path in the searchPath '$searchPath'.\e[0m\n" 1>&2
            printf "\e[37mMatches:\n" 1>&2
            find "$searchPath" -xtype f -name "$fileName" 1>&2
            printf "\e[0m\n" 1>&2
        fi

        local matchingPath=$( find "$searchPath" -xtype f -name "$fileName" | head -1 )
        printf '%s' "${matchingPath%$fileName}"
    }

    exportPath ALUMINUM_DIR "$PREFIX/Aluminum"
    exportPath CEREAL_DIR "$PREFIX/cereal"
    exportPath CNPY_DIR "$PREFIX/cnpy"
    exportPath CUB_DIR "$PREFIX"/cub-*/
    exportPath HWLOC_DIR "$PREFIX/hwloc"
    exportPath HYDROGEN_DIR "$PREFIX/Elemental"
    exportPath OPENCV_DIR "$PREFIX/opencv"

    exportPath PROTOBUF_ROOT "$PREFIX/protobuf"
    if test -d "$PROTOBUF_ROOT"; then
        add2path 'PATH' "$PROTOBUF_ROOT/bin"
        add2path 'CMAKE_PREFIX_PATH' "$PROTOBUF_ROOT"
        PROTOBUF_LIB=$( find "$PROTOBUF_ROOT" -mindepth 1 -maxdepth 1 -type d -name 'lib*' | head -1 ) &&
        add2path 'LIBRARY_PATH' "$PROTOBUF_LIB"
        add2path 'LD_LIBRARY_PATH' "$PROTOBUF_LIB"
    fi

    exportPath LBANN_DIR "$PREFIX/lbann"
    if test -d "$LBANN_DIR"; then
        add2path 'PATH' "$LBANN_DIR/bin"
        add2path 'CMAKE_PREFIX_PATH' "$LBANN_DIR"
        add2path 'LIBRARY_PATH' "$LBANN_DIR/lib"
        add2path 'LD_LIBRARY_PATH' "$LBANN_DIR/lib"
    fi

    add2path 'CMAKE_PREFIX_PATH' "$OPENCV_DIR" "$HYDROGEN_DIR" "$ALUMINUM_DIR"

    add2path 'PATH' "$PREFIX/cmake/bin"

%post
    if test "$0" = "/bin/sh"; then
        echo "Elevating script to bash"
        sed -n -z '$p' "/proc/$$/cmdline" | sed 's/\x00/\n/g' | /bin/bash -ve
        exit $?
    fi

    apt-get -y update &&
    apt-get -y install --no-install-recommends \
        findutils sed grep coreutils curl ca-certificates tar dpkg wget cmake \
        gcc g++ gfortran python make zlib*-dev libopenblas-dev libopenmpi-dev libprotobuf-dev protobuf-compiler liblapack-dev

    PREFIX="/opt/lbann"

    mkdir -p -- "$PREFIX/src"

    version-ge() { test "$1" = "$( printf '%s\n%s' "$1" "$2" | sort -V | tail -n 1 )"; }

    commandExists() { command -v "$@" &>/dev/null; }

    unzip(){ python -c "from zipfile import PyZipFile; PyZipFile( '''$1''' ).extractall()"; }

    remoteExtract()
    {
        local compression=
        local url="${@: -1}"
        local ext="$( printf '%s' "$url" | sed 's/\?.*//; s/.*\.//;' )"
        local iTry=5

        for (( ; iTry > 0; iTry-- )); do
            case "$ext" in
                tgz|gz) compression=--gzip ;;
                xz) compression=--xz ;;
                tbz2|bz2) compression=--bzip2 ;;
            esac

            (
                if command -v wget &>/dev/null; then
                    wget -O- \
                        --retry-connrefused \
                        --timeout=5 \
                        --tries=5 \
                        --waitretry=5 \
                        --read-timeout=20 \
                        "$@" |
                    tar -x $compression
                fi ||
                if command -v curl &>/dev/null; then
                    curl -L \
                        --connect-timeout 5 \
                        --max-time 20 \
                        --retry 5 \
                        --retry-delay 5 \
                        --retry-max-time 60 \
                        "$@" |
                    tar -x $compression
                fi ||
                false
            ) &&
            break
        done
    }

    setupCub()
    {
        cd -- "$PREFIX" &&
        if ! test -d cub-*; then
            remoteExtract 'https://github.com/NVlabs/cub/archive/v1.8.0.tar.gz'
        fi &&
        cd cub-* && export CUB_DIR=$( pwd )
    }

    setupCereal()
    {
        export CEREAL_DIR="$PREFIX"/cereal &&
        if ! test -d "$CEREAL_DIR"; then
            cd -- "$PREFIX/src" &&
            remoteExtract 'https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz' &&
            cd cereal-* && mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$PREFIX"/cereal -DJUST_INSTALL_CEREAL=ON .. &&
            make -j "$( nproc )" install
        fi
    }

    setupCnpy()
    {
        # commit 4e8810b1a8637695171ed346ce68f6984e585ef4 to be exact but has no release and only 1 commit in last year
        export CNPY_DIR="$PREFIX"/cnpy &&
        if ! test -d "$CNPY_DIR"; then
            cd -- "$PREFIX/src" && curl -L 'https://github.com/rogersce/cnpy/archive/master.zip' -o master.zip &&
            unzip "$_" && command rm -f "$_" && cd cnpy-* && mkdir -p build && cd -- "$_" &&
            cmake -Wno-dev -DCMAKE_INSTALL_PREFIX="$CNPY_DIR" .. && make -j "$( nproc )" install
        fi
    }

    buildAluminum()
    {
        # allow Aluminum build to fail (requires at least CUDA 9 because it uses CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS)
        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/LLNL/Aluminum/archive/v0.2.tar.gz' &&
        cd Aluminum-* && mkdir -p build && cd "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$ALUMINUM_DIR" \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              .. &&
        make -j "$( nproc )" VERBOSE=1 install || true
    }

    buildHydrogen()
    {
        # Needs at least CUDA 7.5 because it uses cuda_fp16.h even though Hydrogen_ENABLE_HALF=OFF Oo
        cd -- "$PREFIX/src" &&
        remoteExtract 'https://github.com/LLNL/Elemental/archive/v1.1.0.tar.gz' &&
        cd Elemental-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$HYDROGEN_DIR" \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              -DHydrogen_USE_64BIT_INTS=ON           \
              -DHydrogen_ENABLE_OPENMP=ON            \
              -DBUILD_SHARED_LIBS=ON                 \
              -DHydrogen_ENABLE_ALUMINUM=ON          \
              .. &&
        make -j "$( nproc )" VERBOSE=1 install
    }

    buildOpenCV()
    {
        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/opencv/opencv/archive/3.4.3.tar.gz' &&
        cd opencv-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                             \
              -DCMAKE_BUILD_TYPE=Release           \
              -DCMAKE_INSTALL_PREFIX="$OPENCV_DIR" \
              -DWITH_{JPEG,PNG,TIFF}=ON            \
              -DWITH_{CUDA,JASPER}=OFF             \
              -DBUILD_SHARED_LIBS=ON               \
              -DBUILD_JAVA=OFF                     \
              -DBUILD_opencv_{calib3d,cuda,dnn,features2d,flann,java,{java,python}_bindings_generator,ml,python{2,3},stitching,ts,superres,video{,io,stab}}=OFF .. &&
        make -j "$( nproc )" install
    }

    buildLBANN()
    {
        fixLibZBug()
        {
            find . -type f -execdir  bash -c '
                if grep "g++.*libcnpy\.so" "$0" | grep -q -v " -lz"; then
                    sed -i -r "/g\+\+ .*libcnpy\.so( |$)/{ s:(libcnpy\.so |$):\1-lz : }" "$0";
                fi' {} \;
        }

        cd -- "$PREFIX/src" && remoteExtract 'https://github.com/LLNL/lbann/archive/v0.98.1.tar.gz' &&
        cd lbann-* && mkdir -p build && cd -- "$_" &&
        cmake -Wno-dev                               \
              -DCMAKE_BUILD_TYPE=Release             \
              -DCMAKE_INSTALL_PREFIX="$PREFIX"/lbann \
              -DCMAKE_LIBRARY_PATH="$LIBRARY_PATH"   \
              -DHydrogen_DIR="$HYDROGEN_DIR"         \
              -DLBANN_WITH_ALUMINUM:BOOL=ON          \
              -DLBANN_USE_PROTOBUF_MODULE=$( if test -f "$PROTOBUF_ROOT/lib/cmake/protobuf/protobuf-config.cmake"; then echo OFF; else echo ON; fi )  .. &&
        fixLibZBug && make -j 2 VERBOSE=1 install
        # only building with -j 2 instead of -j 4 because the VM on Taurus doesn't seem to have enough memory to run four compilations in parallel ...
    }

    setupCub
    setupCereal
    setupCnpy

    ALUMINUM_DIR="$PREFIX"/Aluminum &&
    if ! test -d "$ALUMINUM_DIR"; then buildAluminum; fi &&
    export CMAKE_PREFIX_PATH=$ALUMINUM_DIR:$CMAKE_PREFIX_PATH

    HYDROGEN_DIR="$PREFIX"/Elemental &&
    if ! test -d "$HYDROGEN_DIR"; then buildHydrogen; fi &&
    export CMAKE_PREFIX_PATH=$HYDROGEN_DIR:$CMAKE_PREFIX_PATH

    OPENCV_DIR="$PREFIX"/opencv
    if ! test -d "$OPENCV_DIR"; then buildOpenCV; fi &&
    export CMAKE_PREFIX_PATH=$OPENCV_DIR:$CMAKE_PREFIX_PATH

    buildLBANN

    exit 0

Support no progress engine

Support a compile-time flag to only start the progress engine on demand (i.e., when something is actually submitted to it). Making this a compile-time flag means we only pay the runtime check penalty when we need it.

This would primarily benefit the NCCL backend, since we would not have a progress engine thread spinning unless it were needed.
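
A hypothetical sketch of deferred startup (guarded here by the AL_PE_START_ON_DEMAND name that appears in the progress-engine-binding issue above; class and method names are illustrative):

#include <mutex>

// The progress thread is spawned only on the first submission, so backends
// that never use the progress engine never pay for a spinning thread.
// std::call_once keeps the check cheap and race-free after startup.
class ProgressEngineSketch {
public:
  void submit(/* operation */) {
#ifdef AL_PE_START_ON_DEMAND
    std::call_once(started_, [this] { start_thread(); });
#endif
    // ... enqueue the operation ...
  }

private:
  void start_thread() { /* spawn and bind the progress thread */ }
  std::once_flag started_;
};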

Consistent issues with the RingMPICUDA initialization

The RingMPICUDA ring is consistently failing on most platforms (I think I've seen issues on sierra and pascal). I have this commented out in the MPI-CUDA communicator in my local branches. Perhaps it would be better to lazily initialize it, esp. since it's heap-alloc'd anyway. I'll update this issue if I do any of the debugging legwork to more precisely characterize the error. Superficially, a Sendrecv is called where at least one of the ranks is gibberish in the sense that it's often negative and also way more than 1, 2, or 3 digits long...

Host-transfer multi-threading issue

When using multiple threads (even on #130), some threads in some ranks non-deterministically get incorrect results when testing. Anecdotally, the incorrect results are either a vector of all 0s (which is strange since the code does not zero memory) or simply incorrect values (but which seem to be in a reasonable range, e.g., no NaNs or clear garbage).

I do not observe this issue with only a single thread.

Tests crash

[yv:04179] *** Process received signal ***
[yv:04179] Signal: Segmentation fault (11)
[yv:04179] Signal code: Address not mapped (1)
[yv:04179] Failing at address: 0x440000c8
[yv:04179] [ 0] 0x8260bcb1e <pthread_sigmask+0x54e> at /lib/libthr.so.3
[yv:04179] [ 1] 0x8260bc0cf <pthread_setschedparam+0x83f> at /lib/libthr.so.3
[yv:04179] [ 2] 0x7ffffffff2d3 <_fini+0x7fffffbdb5e7> at ???
[yv:04179] [ 3] 0x822f320d8 <MPI_Comm_get_attr+0x58> at /usr/local/mpi/openmpi/lib/libmpi.so.40
[yv:04179] [ 4] 0x82165de57 <_ZN2Al8internal3mpi4initERiRPPc+0xf7> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.2.1
[yv:04179] [ 5] 0x82165ab49 <_ZN2Al10InitializeERiRPPc+0x19> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.2.1
[yv:04179] [ 6] 0x4054b0 <main+0x40> at /usr/ports/net/aluminum/work/.build/test/test_exchange
[yv:04179] *** End of error message ***
*** Signal 11

Version: 1.2.1
clang-14
FreeBSD 13.1

Reorganize MPI backend allreduces

The existing implementations of the MPI allreduces have been haphazardly updated with a number of changes to the progress engine and other APIs. We should clean them up.

Build breaks because of missing 'cuda.h' when ALUMINUM_ENABLE_CUDA=OFF

===>  Building for Aluminum-0.2.1
[1/39] /usr/local/bin/clang++70  -DAl_EXPORTS -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIC   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT src/CMakeFiles/Al.dir/profiling.cpp.o -MF src/CMakeFiles/Al.dir/profiling.cpp.o.d -o src/CMakeFiles/Al.dir/profiling.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src/profiling.cpp
[2/39] /usr/local/bin/clang++70   -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/test -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIE   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -MF benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o.d -o benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp
FAILED: benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o 
/usr/local/bin/clang++70   -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/test -I/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/src -I. -isystem /usr/local/mpi/openmpi/include -isystem /usr/local/include -Wall -Wextra -pedantic -g -faligned-new -O2 -pipe -fstack-protector -fno-strict-aliasing -fPIE   -pthread -fopenmp=libomp -std=gnu++11 -MD -MT benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -MF benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o.d -o benchmark/CMakeFiles/benchmark_waits.exe.dir/benchmark_waits.cpp.o -c /wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp
/wrkdirs/usr/ports/net/aluminum/work/Aluminum-0.2.1/benchmark/benchmark_waits.cpp:4:10: fatal error: 'cuda.h' file not found
#include <cuda.h>
         ^~~~~~~~
1 error generated.

Optimize NCCL collectives

Provide optimized versions of our custom-implemented NCCL collectives:

  • Alltoall
  • Gather
  • Scatter
  • Allgatherv
  • Alltoallv
  • Reduce_scatterv
  • Gatherv
  • Scatterv

Traces are not thread-safe

Saving a trace entry is not thread-safe when AL_THREAD_MULTIPLE is set and could trash the trace log.
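
A minimal sketch of a fix, with illustrative names: serialize appends to the trace log. A mutex per log is the simplest option; something lock-free would only matter if tracing sits on a hot path.

#include <mutex>
#include <string>
#include <utility>
#include <vector>

class TraceLog {
public:
  void add_entry(std::string entry) {
    std::lock_guard<std::mutex> lock(mutex_);  // one writer at a time
    entries_.push_back(std::move(entry));
  }

private:
  std::mutex mutex_;
  std::vector<std::string> entries_;
};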

GPU memory pool

The NCCL Alltoall and Alltoallv operations allocate temporary buffers on the GPU to handle in-place operations. Right now this directly calls cudaMalloc/cudaFree. We should use a memory pool, since these operations have a large overhead.

Right now our memory pools only support host memory. We should probably use cub (or something similar).
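
If cub (or a similar caching allocator) is acceptable here, it is roughly a drop-in replacement for the cudaMalloc/cudaFree pair; a sketch with error handling omitted and illustrative names:

#include <cub/util_allocator.cuh>

// cub's CachingDeviceAllocator keeps freed blocks in size-class bins, so
// the in-place Alltoall(v) path would only hit the driver on the first
// allocation of each size class.
cub::CachingDeviceAllocator device_pool;

void* get_device_workspace(size_t bytes, cudaStream_t stream) {
  void* ptr = nullptr;
  // The stream argument lets cub hand the block back to work on that
  // stream without extra synchronization.
  device_pool.DeviceAllocate(&ptr, bytes, stream);
  return ptr;
}

void release_device_workspace(void* ptr) {
  device_pool.DeviceFree(ptr);
}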

Check buffers in debug mode

We should check in debug mode that the provided buffers are sane (a sketch follows the list). Specifically:

  • Buffers should not be null (except when permitted, e.g., for non-roots in some collectives).
  • Buffers do not intersect (except when in-place).
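
Illustrative debug-mode checks (the names are not Aluminum's); null_ok covers the permitted cases, e.g. non-root ranks in rooted collectives:

#include <cassert>
#include <cstddef>

inline void check_buffer(const void* buf, bool null_ok) {
  assert((null_ok || buf != nullptr) && "null buffer passed to collective");
}

// Verifies that [send, send+send_bytes) and [recv, recv+recv_bytes) do not
// intersect unless the operation is explicitly in-place.
inline void check_no_overlap(const void* send, size_t send_bytes,
                             const void* recv, size_t recv_bytes,
                             bool inplace) {
  if (inplace) return;  // overlapping buffers are expected when in-place
  const char* s = static_cast<const char*>(send);
  const char* r = static_cast<const char*>(recv);
  assert((s + send_bytes <= r || r + recv_bytes <= s) &&
         "send and receive buffers overlap");
}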

Tests use invalid integer types in std::uniform_int_distribution

-- Build files have been written to: /usr/ports/net/aluminum/work/.build
[ 50% 1/2] /usr/local/libexec/ccache/c++  -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/include -I/usr/ports/net/aluminum/work/.build -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/test -isystem /usr/ports/net/aluminum/work/Aluminum-1.4.0/third_party/cxxopts/include -isystem /usr/local/include -isystem /usr/local/mpi/openmpi/include -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -Wall -Wextra -pedantic -faligned-new -g3 -O2 -pipe -fstack-protector-strong -fno-strict-aliasing   -DNDEBUG -std=gnu++17 -fPIE -fexceptions -pthread -MD -MT test/CMakeFiles/test_ops.dir/test_ops.cpp.o -MF test/CMakeFiles/test_ops.dir/test_ops.cpp.o.d -o test/CMakeFiles/test_ops.dir/test_ops.cpp.o -c /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp
FAILED: test/CMakeFiles/test_ops.dir/test_ops.cpp.o 
/usr/local/libexec/ccache/c++  -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/include -I/usr/ports/net/aluminum/work/.build -I/usr/ports/net/aluminum/work/Aluminum-1.4.0/test -isystem /usr/ports/net/aluminum/work/Aluminum-1.4.0/third_party/cxxopts/include -isystem /usr/local/include -isystem /usr/local/mpi/openmpi/include -O2 -pipe -fstack-protector-strong -fno-strict-aliasing -Wall -Wextra -pedantic -faligned-new -g3 -O2 -pipe -fstack-protector-strong -fno-strict-aliasing   -DNDEBUG -std=gnu++17 -fPIE -fexceptions -pthread -MD -MT test/CMakeFiles/test_ops.dir/test_ops.cpp.o -MF test/CMakeFiles/test_ops.dir/test_ops.cpp.o.d -o test/CMakeFiles/test_ops.dir/test_ops.cpp.o -c /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp
In file included from /usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:28:
In file included from /usr/ports/net/aluminum/work/Aluminum-1.4.0/include/Al.hpp:39:
In file included from /usr/include/c++/v1/vector:312:
In file included from /usr/include/c++/v1/algorithm:1851:
In file included from /usr/include/c++/v1/__algorithm/ranges_sample.h:13:
In file included from /usr/include/c++/v1/__algorithm/sample.h:18:
/usr/include/c++/v1/__random/uniform_int_distribution.h:162:5: error: static assertion failed due to requirement '__libcpp_random_is_valid_inttype<char>::value': IntType must be a supported integer type
    static_assert(__libcpp_random_is_valid_inttype<_IntType>::value, "IntType must be a supported integer type");
    ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:138:36: note: in instantiation of template class 'std::uniform_int_distribution<char>' requested here
  std::uniform_int_distribution<T> rng;
                                   ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:149:14: note: in instantiation of function template specialization 'gen_random_val<char, std::linear_congruential_engine<unsigned int, 48271, 0, 2147483647>, true>' requested here
      v[i] = gen_random_val<T>(g);
             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:207:30: note: in instantiation of function template specialization 'RandVectorGen<char>::gen<std::linear_congruential_engine<unsigned int, 48271, 0, 2147483647>>' requested here
    return RandVectorGen<T>::gen(count, rng_gen);
                             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:139:35: note: in instantiation of member function 'VectorType<char, Al::MPIBackend>::gen_data' requested here
    input(VectorType<T, Backend>::gen_data(in_size, comm_wrapper.comm().get_stream())),
                                  ^
/usr/include/c++/v1/__memory/allocator.h:165:28: note: in instantiation of member function 'TestData<Al::MPIBackend, char>::TestData' requested here
        ::new ((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                           ^
/usr/include/c++/v1/__memory/allocator_traits.h:290:13: note: (skipping 3 contexts in backtrace; use -ftemplate-backtrace-limit=0 to see all)
        __a.construct(__p, _VSTD::forward<_Args>(__args)...);
            ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:356:14: note: in instantiation of function template specialization 'std::vector<TestData<Al::MPIBackend, char>>::emplace_back<unsigned long &, unsigned long &, CommWrapper<Al::MPIBackend> &>' requested here
        data.emplace_back(in_size, out_size, comm_wrappers[i]);
             ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:419:7: note: in instantiation of function template specialization 'run_test<Al::MPIBackend, char, true>' requested here
      run_test<Backend, T>(parsed_opts);
      ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:321:39: note: in instantiation of function template specialization 'test_dispatcher::operator()<Al::MPIBackend, char>' requested here
    {"char", [&]() { functor.template operator()<Backend, char>(parsed_opts); } },
                                      ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_utils.hpp:359:20: note: in instantiation of function template specialization 'dispatch_to_backend_type_helper<Al::MPIBackend, test_dispatcher>' requested here
    {"mpi", [&](){ dispatch_to_backend_type_helper<Al::MPIBackend>(parsed_opts, functor); } },
                   ^
/usr/ports/net/aluminum/work/Aluminum-1.4.0/test/test_ops.cpp:488:3: note: in instantiation of function template specialization 'dispatch_to_backend<test_dispatcher>' requested here
  dispatch_to_backend(parsed_opts, test_dispatcher());
  ^
1 error generated.

clang-15
FreeBSD 13.2

In-place NCCL reduce-scatter is not in-place

The NCCL backend's in-place reduce-scatter uses sendbuf = recvbuf here. But per the NCCL documentation, the in-place reduce-scatter should actually have recvbuf be the appropriate offset into sendbuf (see here). This appears to differ from the usual MPI semantics. Our current version works, but I'm not sure whether using overlapping buffers without telling NCCL that we are is safe. We may also be missing some performance benefits.
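
For reference, a minimal sketch of NCCL's documented in-place call shape, which points recvbuff at this rank's chunk of sendbuff (this is not what the backend currently does, and it places the result in the rank's slot rather than at the start of the buffer, unlike the MPI-style semantics):

#include <nccl.h>

// `buffer` holds all ranks' contributions; the reduced chunk for this rank
// is written back into its own slot, satisfying
// recvbuff == sendbuff + rank * recvcount.
ncclResult_t inplace_reduce_scatter(float* buffer, size_t recvcount, int rank,
                                    ncclComm_t comm, cudaStream_t stream) {
  return ncclReduceScatter(buffer, buffer + rank * recvcount, recvcount,
                           ncclFloat, ncclSum, comm, stream);
}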

NCCL group limits

According to the NCCL documentation, there is a maximum of 2048 calls that can be aggregated within a ncclGroupStart/ncclGroupEnd pair. Our current implementations are fine at small scale, but collectives will exceed this at large scale. For example, Allgatherv performs a send and a receive for each rank, meaning we exceed this at only 1025 processors (assuming all processors contribute data).

We need an API to automagically manage NCCL groups, at least for send/recv operations. We have to be careful in how we split things up (esp. when skipping calls due to 0-length buffers), as a group of send/recv operations only completes when all send/recv operations complete. Thus we need to ensure groups are split up in a way that is consistent across all processors.
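
A hedged sketch of what the group management could look like for an allgatherv-style exchange (assumes NCCL >= 2.7 point-to-point; error handling omitted; names are illustrative). The key property is that group boundaries are computed from a per-peer call budget rather than from the calls actually issued, so skipping zero-length peers cannot desynchronize the groups across ranks:

#include <nccl.h>

#include <cstddef>
#include <vector>

void grouped_allgatherv(const float* sendbuf, size_t sendcount,
                        float* recvbuf, const std::vector<size_t>& recvcounts,
                        const std::vector<size_t>& displs,
                        int nranks, ncclComm_t comm, cudaStream_t stream,
                        size_t max_calls_per_group = 2048) {
  size_t calls_in_group = 0;
  ncclGroupStart();
  for (int peer = 0; peer < nranks; ++peer) {
    // Budget the calls this peer *would* issue, whether or not we skip
    // them, so every rank closes each group at the same peer index.
    const size_t calls_for_peer = 2;  // one send + one recv
    if (calls_in_group + calls_for_peer > max_calls_per_group) {
      ncclGroupEnd();    // flush the current group on every rank
      ncclGroupStart();
      calls_in_group = 0;
    }
    if (sendcount > 0) {
      ncclSend(sendbuf, sendcount, ncclFloat, peer, comm, stream);
    }
    if (recvcounts[peer] > 0) {
      ncclRecv(recvbuf + displs[peer], recvcounts[peer], ncclFloat,
               peer, comm, stream);
    }
    calls_in_group += calls_for_peer;
  }
  ncclGroupEnd();
}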

Support CMake components in export

We should have component support to make life easier for picky downstreams. For example, distconv requires NCCL and HostTransfer backends, and it'd be better to just

find_package(Aluminum COMPONENTS NCCL HT)

(so the complexities of searching/checking/re-searching/re-checking/... are handled by CMake rather than requiring us to write that logic).

I've added this to my TODO list, but it's long. I wanted to leave this here to keep myself honest.

Serialized MPI

Support a (compile-time chosen) serialized mode for the MPI backend. In this mode, MPI will be initialized with MPI_THREAD_SERIALIZED if it is not already initialized. Aside from communicator creation, in this mode Aluminum will serialize all calls to MPI by running them on the progress engine. Blocking calls will be internally transformed into non-blocking calls on the progress engine followed immediately by a wait.
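
The initialization half of this could look roughly like the following sketch (standard MPI calls only; the function name is illustrative):

#include <mpi.h>

// Request MPI_THREAD_SERIALIZED only if the application has not already
// initialized MPI, and report what was actually provided.  Since the
// progress engine would be the only caller of MPI afterwards,
// MPI_THREAD_SERIALIZED (or stronger) is sufficient.
int init_serialized_mpi(int& argc, char**& argv) {
  int initialized = 0;
  MPI_Initialized(&initialized);
  int provided = MPI_THREAD_SINGLE;
  if (!initialized) {
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
  } else {
    MPI_Query_thread(&provided);
  }
  return provided;
}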

Fix coding style

Our coding style here is a bit of a mess and needs to be unified, especially variable names.

Improve progress thread binding

Improve the automatic node topology detection for progress thread binding. We currently end up using AL_PROGRESS_RANKS_PER_NUMA_NODE more frequently than not.

[1.4.1] Tests crash

===>  Testing for Aluminum-1.4.1
===>   Aluminum-1.4.1 depends on package: cxxopts>0 - found
-- Configuring done (4.9s)
-- Generating done (0.0s)
-- Build files have been written to: /usr/ports/net/aluminum/work/.build
ninja: no work to do.
[  0% 1/1] cd /usr/ports/net/aluminum/work/.build && /usr/local/bin/ctest --force-new-ctest-process
Test project /usr/ports/net/aluminum/work/.build
No tests were found!!!
[yv:33027] *** Process received signal ***
[yv:33027] Signal: Segmentation fault (11)
[yv:33027] Signal code: Address not mapped (1)
[yv:33027] Failing at address: 0x440000c8
[yv:33027] [ 0] 0x826d6762c <pthread_sigmask+0x54c> at /lib/libthr.so.3
[yv:33027] [ 1] 0x826d66bd9 <pthread_setschedparam+0x839> at /lib/libthr.so.3
[yv:33027] [ 2] 0x7ffffffff923 <_fini+0x7fffffdd3aa7> at ???
[yv:33027] [ 3] 0x824332fe8 <MPI_Comm_get_attr+0x58> at /usr/local/mpi/openmpi/lib/libmpi.so.40
[yv:33027] [ 4] 0x821b7a4b2 <_ZN2Al8internal3mpi4initERiRPPci+0x102> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 5] 0x821b7735a <_ZN2Al10InitializeERiRPPci+0x1a> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 6] 0x20d730 <main+0x40> at /usr/ports/net/aluminum/work/.build/test/test_exchange
[yv:33027] *** End of error message ***
*** Signal 11

clang-15
FreeBSD 13.2

Host-transfer allgather fails

The host-transfer allgather results in different errors coming from SMPI on Lassen, depending on exactly how many processors are used. It works fine on a single processor, but fails with 2+.

I have not yet identified whether this is an error in SMPI or Aluminum. It is possibly similar to #90 (I see it sometimes trying to dereference address 1).

Support bfloat16

Both NCCL and RCCL support bfloat16; we should add support for it, as we did with half.

I'm a newbie. Can I ask you a question

Following the README, I executed the cmake and make directives. Then I went to the examples directory, followed the README there, executed make, and got three executables: hello_world, allreduce, and pingpong. When I execute ./hello_world, I get the following error. I want to fix this error.

shangda02@abc-Super-Server:~/LLNL_Aluminum/examples/build$ ./hello_world
terminate called after throwing an instance of 'Al::al_exception'
what(): /home/shangda02/LLNL_Aluminum/src/progress.cpp:88 - Tried to exchange infinite bitmap
[abc-Super-Server:29183] *** Process received signal ***
[abc-Super-Server:29183] Signal: Aborted (6)
[abc-Super-Server:29183] Signal code: (-6)
[abc-Super-Server:29183] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fdb3bd6ef10]
[abc-Super-Server:29183] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdb3bd6ee87]
[abc-Super-Server:29183] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdb3bd707f1]
[abc-Super-Server:29183] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7fdb3c3c5957]
[abc-Super-Server:29183] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7fdb3c3cbae6]
[abc-Super-Server:29183] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7fdb3c3cbb21]
[abc-Super-Server:29183] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7fdb3c3cbd54]
[abc-Super-Server:29183] [ 7] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngine9bind_initEv+0xbc8)[0x7fdb3ca1e678]
[abc-Super-Server:29183] [ 8] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngineC1Ev+0x14b)[0x7fdb3ca1ec3b]
[abc-Super-Server:29183] [ 9] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al10InitializeERiRPPcP19ompi_communicator_t+0x39)[0x7fdb3ca19a89]
[abc-Super-Server:29183] [10] ./hello_world(+0xe30)[0x5598fd9ace30]
[abc-Super-Server:29183] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fdb3bd51c87]
[abc-Super-Server:29183] [12] ./hello_world(+0xfca)[0x5598fd9acfca]
[abc-Super-Server:29183] *** End of error message ***
Aborted (core dumped)

To be honest, I don't have a good understanding of the whole project

In-place MPI scatter segfaults on one processor

$ jsrun --bind packed:8 --nrs 1 --rs_per_host 1 --tasks_per_rs 1 --launch_distribution packed --cpu_per_rs ALL_CPUS --gpu_per_rs ALL_GPUS ./test_ops.exe --backend mpi --op scatter --inplace
Aborting after hang in Al size=1

Scatterv also hangs.

Regular scatter works fine. Also works with more processors. Likewise, other rooted collectives (bcast, gather, reduce) work in this config. This is a bit strange since this should be a NOP.

We also need to verify this is not an SMPI bug.
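
A minimal standalone reproducer of the underlying MPI call might help with that (a sketch, not Aluminum code): on the root, an in-place scatter passes MPI_IN_PLACE as the receive buffer, and with a single rank this should be a no-op.

#include <mpi.h>

#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::vector<float> buf(size, static_cast<float>(rank));
  if (rank == 0) {
    // In-place at the root: recvbuf is MPI_IN_PLACE, data stays in buf.
    MPI_Scatter(buf.data(), 1, MPI_FLOAT, MPI_IN_PLACE, 1, MPI_FLOAT,
                0, MPI_COMM_WORLD);
  } else {
    // Send buffer is ignored on non-root ranks.
    MPI_Scatter(nullptr, 1, MPI_FLOAT, buf.data(), 1, MPI_FLOAT,
                0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}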
