
ucc's Issues

Add topology support in UCC

Issue

Information about the node and network topology is useful to achieve many optimizations including implementing topology-aware collectives, routing data via fast paths, and minimizing the impact of congestion. Current UCC interfaces lack the topology information.

Potential solution

Add topology abstraction to library and team creation interfaces.

Note

This issue is a placeholder to capture the discussion, proposals, and details related to topology abstraction.

Team ID passed by the user into team_create

For several reasons we might need to generate a context-wise team ID inside UCC. There are ways to do that using OOB, but they involve extra communication. Sometimes the runtime already has that information, and it can be passed by the user through team_params.

Proposal: extend team_params with a uint64_t team ID field. We might have restrictions on the maximum team ID that we can support internally (due to implementation specifics), but that is our internal decision: if the ID provided by the runtime is too large, we can still fall back to generating an internal ID.
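A hypothetical sketch of the proposal, assuming a params.id field guarded by a UCC_TEAM_PARAM_FIELD_ID mask bit (both names are illustrative and would have to match whatever the extended team_params ends up defining):

#include <ucc/api/ucc.h>

static ucc_status_t create_team_with_runtime_id(ucc_context_h ctx,
                                                uint64_t runtime_team_id,
                                                ucc_team_h *team)
{
    ucc_team_params_t params = {0};

    params.mask = UCC_TEAM_PARAM_FIELD_ID;   /* hypothetical mask bit */
    params.id   = runtime_team_id;           /* e.g. an ID the MPI runtime already has */
    /* If the runtime-provided ID is larger than what a TL can encode
     * internally, UCC would fall back to generating its own team ID. */
    return ucc_team_create_post(&ctx, 1, &params, team);
}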

Are ucc_status_t and ucs_status_t interchangeable?

My compiler (clang with -Wenum-conversion) is complaining about the following:

third-party/ucc/src/components/mc/cuda/mc_cuda.c:187:9: error: implicit conversion from enumeration type 'ucs_status_t' to different enumeration type 'ucc_status_t' [-Werror,-Wenum-conversion]
        CUDADRV_FUNC(cuDeviceGetAttribute(&mem_ops_attr,
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
third-party/ucc/src/components/mc/cuda/mc_cuda.h:116:32: note: expanded from macro 'CUDADRV_FUNC'
        ucc_status_t _status = UCS_OK;
                     ~~~~~~~   ^~~~~~

It looks like a typo to me:

ucc_status_t _status = UCS_OK; \

Another similar variable in TL NCCL:
ucs_status_t status;

Just wanted to check whether ucc_status_t and ucs_status_t are interchangeable, since they are slightly different in my understanding. Is ucs_status_t a superset of ucc_status_t?

cc @Sergei-Lebedev
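For illustration only, a hypothetical explicit conversion helper (not the library's official API): the two enums agree on only a few values (e.g. both use 0 for OK) and diverge on most error codes, which is why the implicit conversion above is flagged. The mapping below is a sketch and would need to be checked against ucc_status.h and ucs/type/status.h.

#include <ucc/api/ucc_status.h>
#include <ucs/type/status.h>

static inline ucc_status_t ucs_to_ucc_status_sketch(ucs_status_t st)
{
    switch (st) {
    case UCS_OK:            return UCC_OK;
    case UCS_INPROGRESS:    return UCC_INPROGRESS;
    case UCS_ERR_NO_MEMORY: return UCC_ERR_NO_MEMORY;
    /* ... remaining UCS error codes mapped explicitly ... */
    default:                return UCC_ERR_NO_MESSAGE;
    }
}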

UCG needs address lookup callbacks

For reference, UCG parameters require the following:

typedef struct ucg_params {

    /* Callback functions for address lookup, used at connection establishment */
    struct {
        int (*lookup_f)(void *cb_group_context,
                        ucg_group_member_index_t index,
                        ucp_address_t **addr,
                        size_t *addr_len);
        void (*release_f)(ucp_address_t *addr);
    } address;

Currently, the OMPI-based implementation satisfies this requirement as follows:

int mca_coll_ucx_resolve_address(void *cb_group_obj,
                                 ucg_group_member_index_t rank,
                                 ucp_address_t **addr,
                                 size_t *addr_len)
{
    /* Sanity checks */
    ompi_communicator_t* comm = (ompi_communicator_t*)cb_group_obj;
    if (rank == (ucg_group_member_index_t)comm->c_my_rank) {
        COLL_UCX_ERROR("mca_coll_ucx_resolve_address(rank=%lu)"
                       "shouldn't be called on its own rank (loopback)", rank);
        return 1;
    }

    /* Check the cache for a previously established connection to that rank */
    ompi_proc_t *proc_peer =
          (struct ompi_proc_t*)ompi_comm_peer_lookup((ompi_communicator_t*)cb_group_obj, rank);
    *addr = proc_peer->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_COLL];
    *addr_len = 0; /* UCX doesn't need the length to unpack the address */
    if (*addr) {
       return 0;
    }

    /* Obtain the UCP address of the remote */
    int ret = mca_coll_ucx_recv_worker_address(proc_peer, addr, addr_len);
    if (ret < 0) {
        COLL_UCX_ERROR("mca_coll_ucx_recv_worker_address(proc=%d rank=%lu) failed",
                       proc_peer->super.proc_name.vpid, rank);
        return 1;
    }

    /* Cache the connection for future invocations with this rank */
    proc_peer->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_COLL] = *addr;
    return 0;
}

void mca_coll_ucx_release_address(ucp_address_t *addr)
{
    /* no need to free - the address is stored in proc_peer->proc_endpoints */
}

What CUDA versions are supported?

Hi, I am trying to build UCC on PyTorch's CI, and it failed on the CUDA 10.2 CI because __double2half was only added in CUDA 11. I will NOT request UCC to support CUDA 10.2, because PyTorch is already working on dropping CUDA 10.2 support as well.

But I searched this repository and didn't see any mention of versioning, and the CI seems to test only one CUDA version. I really hope that UCC can clearly state the lowest supported CUDA version in README.md and run a CI job on that CUDA version.
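As a hedged illustration of how such a dependency could be version-gated (this is not the code currently in UCC, just a sketch assuming the standard CUDART_VERSION macro), older toolkits could fall back to a float round-trip:

#include <cuda_runtime.h>
#include <cuda_fp16.h>

/* Sketch: on CUDA >= 11 use __double2half directly; on older toolkits fall
 * back to a float round-trip (with the corresponding loss of precision). */
static inline __device__ __half double_to_half_compat(double x)
{
#if CUDART_VERSION >= 11000
    return __double2half(x);
#else
    return __float2half((float)x);
#endif
}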

Question: What's the status of Core Direct?

I realize I'm about a decade late, but is Core Direct still a thing, and will UCC be using it?
I've been trying to find information on core direct, but it seems like support got dropped somewhere.
Is there any documentation regarding core direct, or even a header file with the endpoints?

Thanks

gcc 11.2 error: implicit conversion from ‘ucs_status_t’ to ‘ucc_status_t’

UCC fails to compile on Ubuntu 21.10 with gcc 11.2.0, as UCC sets -Werror and -Wenum-conversion in the CFLAGS:

make[3]: Entering directory '/home/fabecassis/github/ucc/src/components/tl/sharp'
  CC       libucc_tl_sharp_la-tl_sharp_context.lo
tl_sharp_context.c: In function ‘ucc_tl_sharp_context_t_init’:
tl_sharp_context.c:285:15: error: implicit conversion from ‘ucs_status_t’ to ‘ucc_status_t’ [-Werror=enum-conversion]
  285 |        status = ucc_rcache_create(&rcache_params, "SHARP", NULL, &self->rcache);
      |               ^
cc1: all warnings being treated as errors
make[3]: *** [Makefile:582: libucc_tl_sharp_la-tl_sharp_context.lo] Error 1
  CC       libucc_tl_sharp_la-tl_sharp_coll.lo
tl_sharp_coll.c: In function ‘ucc_tl_sharp_mem_register’:
tl_sharp_coll.c:87:16: error: implicit conversion from ‘ucs_status_t’ to ‘ucc_status_t’ [-Werror=enum-conversion]
   87 |         status = ucc_rcache_get(ctx->rcache, (void *)addr, length,
      |                ^
cc1: all warnings being treated as errors
make[3]: *** [Makefile:596: libucc_tl_sharp_la-tl_sharp_coll.lo] Error 1

UCX comes from HPC-X 2.9.0:

$ ucx_info -v
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu20.04-x86_64/ucx

The workaround is simple for now though: make CFLAGS=-Wno-error=enum-conversion

coll_args flags documentation

ucc.h (as well as the doxygen output) lacks descriptions of the coll_args flags. While some of them are relatively obvious (persistent, in-place), others are not (I have received questions about what these flags mean):
UCC_COLL_ARGS_FLAG_IN_PLACE
UCC_COLL_ARGS_FLAG_PERSISTENT
UCC_COLL_ARGS_FLAG_COUNT_64BIT
UCC_COLL_ARGS_FLAG_DISPLACEMENTS_64BIT
UCC_COLL_ARGS_FLAG_CONTIG_SRC_BUFFER
UCC_COLL_ARGS_FLAG_CONTIG_DST_BUFFER

Need to update docs.
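Until the docs are updated, here is a minimal usage sketch, assuming the standard ucc_coll_args_t layout (a mask plus a flags field); the comments are a best-effort reading of the flag names, which is exactly what the missing documentation should spell out:

#include <ucc/api/ucc.h>

static void fill_coll_args_flags(ucc_coll_args_t *args)
{
    args->mask  = UCC_COLL_ARGS_FIELD_FLAGS;
    args->flags = UCC_COLL_ARGS_FLAG_IN_PLACE            | /* operate in place: src buffer is ignored */
                  UCC_COLL_ARGS_FLAG_PERSISTENT          | /* args are reused across repeated posts   */
                  UCC_COLL_ARGS_FLAG_COUNT_64BIT         | /* count arrays hold 64-bit values         */
                  UCC_COLL_ARGS_FLAG_DISPLACEMENTS_64BIT;  /* displacement arrays hold 64-bit values  */
    /* UCC_COLL_ARGS_FLAG_CONTIG_SRC_BUFFER / _DST_BUFFER presumably assert that
     * the v-collective buffers are laid out contiguously, but this is the part
     * that most needs documenting. */
}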

NCCL TL

In PR #84 we are adding support for the NCCL TL. If UCC was built with NCCL support, TL NCCL might be selected by CLs for CUDA collectives, i.e. when both source and destination buffers are of memory type CUDA. However, there are some known limitations when NCCL is used, such as launching multiple collectives on different streams concurrently. Therefore users are encouraged to follow the NCCL guidelines to avoid potential deadlocks. From the UCC perspective this means that if multiple teams are created and the NCCL TL is used, the user should not post CUDA collectives to different teams at the same time.

Road map for a release

We implemented a UCX-based transport for the Cylon project, which is a distributed DataFrame library. We are looking forward to integrating collective communications into Cylon. Are you planning to release UCC sometime soon? Is it released separately or along with UCX?

non blocking team destroy

Something for the WG to discuss.
Currently UCC team creation is defined to be non-blocking: team_create_post + test.
The corresponding ucc_team_destroy is a blocking call.

However, in some cases ucc_team_destroy will actually involve communication among the ranks of the team that is being destroyed. Examples: a TL UCP team might have connected UCP EPs that must be disconnected during destroy - this involves the ucp_ep_close protocol, which is non-local. Another example would be mcast group destruction, which requires a synchronizing flush over the participating ranks.

I faced this issue when I was adding team_destroy to the gtest. There we simulate multiple ranks from a single process, and it is then obviously impossible to destroy the team with a blocking API (the very first rank in the gtest will hang). I've implemented non-blocking team destruction internally (in the CL/TL and Base interfaces) and use it in the gtest; a sketch of the pattern follows below. Currently the UCC API is kept the same: blocking team destruction (implemented as while (UCC_OK != ucc_team_destroy_nb(team))).

Question: do we want to define ucc_team_destroy as a non-blocking call in the UCC API? There is probably not much of a use case for it, but it would make the API more consistent.
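A minimal sketch of the gtest pattern described above, assuming an internal non-blocking ucc_team_destroy_nb (not a public API) that returns UCC_OK once the team is fully destroyed:

#include <ucc/api/ucc.h>

/* Destroy the teams of all simulated ranks without letting any single rank
 * block the others: keep retrying the non-blocking destroy and progressing
 * each rank's context until every team is gone. */
static void destroy_all_teams(ucc_team_h *teams, ucc_context_h *ctxs, int n_ranks)
{
    int i, done;

    do {
        done = 1;
        for (i = 0; i < n_ranks; i++) {
            if (teams[i] == NULL) {
                continue;                                  /* this rank already finished */
            }
            if (UCC_OK == ucc_team_destroy_nb(teams[i])) { /* hypothetical internal call */
                teams[i] = NULL;
            } else {
                ucc_context_progress(ctxs[i]);             /* drive the disconnect protocol */
                done = 0;
            }
        }
    } while (!done);
}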

[MTT] build fail on atest with CUDA

Shell: command failed "cd atest
./autogen.sh
./configure --with-mpipath=/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install --enable-debug
make -j 9
"

make  all-recursive
make[1]: Entering directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest'
Making all in src
make[2]: Entering directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest/src'
depbase=`echo main.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpicc -DHAVE_CONFIG_H -I. -I..  -I../src/tests -I../src/cmd -I../src/types -I../src/comms -I../src/env -I../src/output -I../src/prof -I../src/utils -D__LINUX__  -Wall -Werror -g -D_DEBUG -g -O2 -MT main.o -MD -MP -MF $depbase.Tpo -c -o main.o main.c &&\
mv -f $depbase.Tpo $depbase.Po
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpicc: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
make[2]: *** [main.o] Error 127
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest'
make: *** [all] Error 2

Default MT progress queue

Currently, the default MT task progress queue is the locked PQ. For the 2-thread case they perform almost the same, but for a larger number of threads we need the lock-free progress queue. We need to decide what the default will be, or perhaps use some mechanism other than the config flag:

{"LOCK_FREE_PROGRESS_Q", "0",

ucc_context_progress return value

Currently we simply return ucc_status_t. Do we want to return an int representing how much progress was made? Or a boolean indicating progress vs. no progress?

TL/CUDA: what behavior if we cannot use cache?

In ucc_tl_cuda_create_cache, it is possible that ucs_pgtable_init returns an error; see:

if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
                               ucc_tl_cuda_cache_pgt_dir_alloc,
                               ucc_tl_cuda_cache_pgt_dir_release)) {
    goto err_destroy_rwlock;
}

In this case, what error code should we return (or should we just return UCC_OK and proceed silently)? And can I safely assume that TL/CUDA still works without the cache (i.e., just with worse expected performance)?

I am asking because I got a compilation error (using clang) along these lines:

tl_cuda_cache.c:82:9: error: variable 'status' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
    if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tl_cuda_cache.c:101:12: note: uninitialized use occurs here
    return status;
           ^~~~~~
tl_cuda_cache.c:82:5: note: remove the 'if' if its condition is always false
    if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tl_cuda_cache.c:65:5: note: variable 'status' is declared here
    ucc_status_t status;
    ^
1 error generated.

cc @Sergei-Lebedev @vspetrov @bureddy
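For reference, one possible shape of the fix, shown only as a sketch: assign an explicit UCC error code on the failure path so that status is never returned uninitialized. Which code is the right one (or whether to degrade to running without the cache) is exactly the open question above.

if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
                               ucc_tl_cuda_cache_pgt_dir_alloc,
                               ucc_tl_cuda_cache_pgt_dir_release)) {
    status = UCC_ERR_NO_MESSAGE; /* placeholder error code, see the question above */
    goto err_destroy_rwlock;
}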

undefined reference to `ucm_set_global_opts'

I am seeing the following error when building UCC in NVIDIA's PyTorch container:

  CCLD     ucc_info
/usr/bin/ld: /usr/local/lib/libucs.so: undefined reference to `ucm_set_global_opts'
/usr/bin/ld: /usr/local/lib/libucs.so: undefined reference to `ucm_mmap_hook_modes'

To reproduce, run the following docker container:

docker run -it nvcr.io/nvidia/pytorch:22.03-py3

And inside the container, run the following script:

#!/bin/bash

set -ex

export UCX_HOME="/usr"
export UCC_HOME="/usr"

install_ucx() {
    set -ex
    echo "Will install ucx at: $UCX_HOME"
    rm -rf ucx
    git clone --recursive https://github.com/openucx/ucx.git
    pushd ucx
    ./autogen.sh
    ./configure --prefix=$UCX_HOME      \
        --without-bfd                   \
        --enable-mt                     \
        --with-cuda=/usr/local/cuda/    \
        --enable-profiling              \
        --enable-stats
    make -j
    make install
    popd
}

install_ucc() {
    set -ex
    echo "Will install ucc at: $UCC_HOME"
    rm -rf ucc
    git clone --recursive https://github.com/openucx/ucc.git
    pushd ucc
    ./autogen.sh
    ./configure --prefix=$UCC_HOME      \
        --with-ucx=$UCX_HOME            \
        --with-nccl=/usr                \
        --with-cuda=/usr/local/cuda/
    make -j
    make install
    popd
}

install_ucx
install_ucc

UCG needs local/global rank/size information

This is a special case of topology information (#13), likely easier to accomplish.

For reference, UCG parameters require the following:

    /**
     * Information about other processes running UCX on the same node, used for
     * the UCG - Group operations (e.g. MPI collective operations). This includes
     * both the total number of processes (including myself) and a zero-based
     * index of my process, guaranteed to be unique among the local processes
     * which this process will contact. One such pair refers strictly to the
     * peers on the same host, and the other pair refers to the total amount
     * of peers for communication across the network. Typically the process with
     * index #0 (in either pair) performs special duties in group-aware
     * transports, and those transports need this information on every process.
     *
     * @note Both fields are indicated by the same bit in @ref field_mask.
     */
    struct {
        uint32_t                           num_local;
        uint32_t                           local_idx;
        uint32_t                           num_global;
        uint32_t                           global_idx;
    } peer_info;

Full disclosure: this is NOT part of the upstream UCP version, but rather a modified UCP I've been using for UCG.

Currently, the OMPI-based implementation satisfies this requirement as follows:

    ucp_params.peer_info.num_local  = ompi_process_info.num_local_peers + 1;
    ucp_params.peer_info.local_idx  = ompi_process_info.my_local_rank;
    ucp_params.peer_info.num_global = ompi_process_info.num_procs;
    ucp_params.peer_info.global_idx = ompi_process_info.myprocid.rank;

To clarify, the reason this code has ucp_params is that this information is passed to UCP (and UCT), but is used exclusively for collective operations and not P2P.

Outstanding Issues - API (v1.0)

Capture outstanding issues on PR #1

  • Endpoint

  • Request object

  • Void on destroy?

  • Negative status codes?

  • bool or _Bool?

  • Add UCC_TYPE_COLL_LAST?

Docs refinement

  • ucc_team_create_post returns UCC_OK or an error; it doesn't return UCC_INPROGRESS if the team is not created yet
  • If ucc_team_create_post returns UCC_OK but ucc_team_create_test later returns an error, the user is responsible for calling ucc_team_destroy to free the allocated resources
  • If ucc_collective_init returns UCC_OK but ucc_collective_post, ucc_collective_triggered_post, or ucc_collective_test later returns an error, the user is responsible for calling ucc_collective_finalize to free the allocated resources
  • add ucc_error_type_t to ucc_team_params_t and ucc_context_params_t

Gtest: test_context_config.modify fails under Valgrind

[----------] 4 tests from test_context_config, where TypeParam =
[ RUN      ] test_context_config.read_release <> <>
[       OK ] test_context_config.read_release (10 ms)
[ RUN      ] test_context_config.print <> <>
[       OK ] test_context_config.print (10 ms)
[ RUN      ] test_context_config.modify <> <>
**3465** [1617199293.251438] [0]     ucc_context.c:161  UCC  ERROR failed to modify CL "basic" configuration, name _UNKNOWN_FIELD, value _unknown_value
../../../test/gtest/core/test_context_config.cc:65: Failure
Expected: (std::string::npos) != (output.find("failed to modify")), actual: 18446744073709551615 vs 18446744073709551615
**3465** [1617199293.309419] [0]          ucc_cl.c:71   UCC  ERROR incorrect value is passed as part of UCC_CLS list: _unknown_cl
**3465** [1617199293.310071] [0]     ucc_context.c:145  UCC  ERROR failed to parse cls string: _unknown_cl
../../../test/gtest/core/test_context_config.cc:72: Failure
Expected: (std::string::npos) != (output.find("incorrect value")), actual: 18446744073709551615 vs 18446744073709551615
[  FAILED  ] test_context_config.modify, where TypeParam =  and GetParam() =  (83 ms)
[ RUN      ] test_context_config.modify_core <> <>
**3465** [1617199293.330501] [0]     ucc_context.c:137  UCC  ERROR failed to modify CORE configuration, name _UNKNOWN_FIELD, value _unknown_value
../../../test/gtest/core/test_context_config.cc:97: Failure
Expected: (std::string::npos) != (output.find("failed to modify")), actual: 18446744073709551615 vs 18446744073709551615
[  FAILED  ] test_context_config.modify_core, where TypeParam =  and GetParam() =  (12 ms)
[----------] 4 tests from test_context_config (121 ms total)

Tested with commit: f744574
Configure: --with-ucx (uses ucx debug build) --enable-gtest --without-cuda --enable-debug --with-valgrind

UCC compiled with CUDA prints ERROR if GPU is not available

  1. UCC compiled with CUDA
  2. cudart is available in runtime
  3. GPU is not available (or not enabled)
    [~/workspace/ompi/build_rel]
[valentinp@hpchead]> mpirun -np 2 osu_allreduce -x 0 -i 1 -m 4:4
[1619030637.735735] [jazz20:166840:0]         mc_cuda.c:145  ERROR cuda failed with ret:100(no CUDA-capable device is detected)
[1619030637.735735] [jazz20:166841:0]         mc_cuda.c:145  ERROR cuda failed with ret:100(no CUDA-capable device is detected)

@Sergei-Lebedev should we handle this w/o error maybe?

[mtt] Atest failed on host

Configuration

OMPI: 5.0.0a1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: none
Test module: none
Nodes: jazz x6 (ppn=28(x6), nodelist=jazz[06,08,11,16,25,32])

 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/html/test_stdout_PTm1nw.txt
 
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 168 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core --mca pmix_server_max_wait 8 /hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/installs/6uoo/tests/atest/mtt-tests.git/atest/src/atest -e 200
 
Output:

Total tests:                    818
Total tests exec time (secs):   94
Total failed:                   12
Total success:                  806

[MTT] infinite log with cuda errors

UCX: 1.11
UCC: master
OMPI: v5.0.x

setup: GPU (cuda)

The following is printed to the log indefinitely:

[1624426315.715236] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715239] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715244] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715247] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715252] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715256] [vulcan02:9197 :0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426315.715260] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703205] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703210] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703214] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703218] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703221] [vulcan04:18934:0] reduce_scatter_knomial.c:151  TL_UCP ERROR failed to perform dt reduction
[1624426301.703226] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206  cuda mc ERROR cuda failed with ret:400(invalid resource handle)

Multithreading support clarification

We declare 3 types of threading support but we don't explicitly say which APIs are allowed to be used concurrently from multiple threads:

  • Obviously the user can use collective init/post/test/finalize from multiple threads (with the proper thread level support; see the init sketch below).
  • Can the user call ucc_context_create from multiple threads? Using different ucc_lib_h handles?
  • Can the user call ucc_team_create_post/test/destroy? I guess not if the context is the same, but what about different contexts?
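For reference, the thread level itself is requested at library initialization; a minimal sketch (error handling omitted), assuming the standard ucc_lib_params_t fields:

#include <ucc/api/ucc.h>

static ucc_status_t init_ucc_mt(ucc_lib_config_h config, ucc_lib_h *lib)
{
    ucc_lib_params_t params = {0};

    params.mask        = UCC_LIB_PARAM_FIELD_THREAD_MODE;
    params.thread_mode = UCC_THREAD_MULTIPLE; /* one of the 3 declared thread levels */
    return ucc_init(&params, config, lib);
}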

[MTT] Fail build in mpi-small-tests. MPI_Errhandler_set was removed in MPI-3.0

UCX: 1.11
UCC: latest
OMPI: v5.0.x
test: https://gitlab.com/Mellanox/mtt-tests

+ cd mpi-small-tests
+ mpicc rdmacm_perf.c -g -o rdmacm_perf -libverbs -lrdmacm
+ cd misc
+ make MPI_HOME=/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g perftest.c -o perftest.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g collmeas.c -o collmeas.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g comm_create.c -o comm_create.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g ctxalloc.c -o ctxalloc.o
ctxalloc.c: In function ‘main’:
ctxalloc.c:39:20: error: call to ‘MPI_Errhandler_set’ declared with attribute error: MPI_Errhandler_set was removed in MPI-3.0.  Use MPI_Comm_set_errhandler instead.
  MPI_Errhandler_set( newcomm1, MPI_ERRORS_RETURN );
                    ^
make: *** [ctxalloc.o] Error 1
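The fix the compiler suggests is a one-line change in ctxalloc.c; MPI_Comm_set_errhandler takes the same arguments as the removed MPI_Errhandler_set:

/* MPI_Errhandler_set was removed in MPI-3.0; use the replacement call instead. */
MPI_Comm_set_errhandler(newcomm1, MPI_ERRORS_RETURN);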

Build Issue with ompi

I am trying to build Open MPI with UCC. I followed the README guidelines and was able to install UCX and UCC, and then I tried to install Open MPI as follows. I am not sure whether this is an issue that needs to be reported to ompi, but I would like to get insight from the UCC community.

git clone --recursive https://github.com/open-mpi/ompi
cd ompi/
./autogen.pl 
./configure --prefix=/home/vibhatha/github/dist/ompi --with-ucx=/home/vibhatha/github/dist/ucx --with-ucc=/home/vibhatha/github/dist/ucc
make -j8 

When running make, I am getting the following error.

/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:80:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
   80 | typedef prte_reachable_t *(*prte_reachable_base_module_reachable_fn_t)(pmix_list_t *local_ifs,
      |                                                                        ^~~~~~~~~~~
      |                                                                        pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:81:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
   81 |                                                                        pmix_list_t *remote_ifs);
      |                                                                        ^~~~~~~~~~~
      |                                                                        pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:90:5: error: unknown type name ‘prte_reachable_base_module_reachable_fn_t’
   90 |     prte_reachable_base_module_reachable_fn_t reachable;
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from reachable_netlink.h:17,
                 from reachable_netlink_module.c:24:
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:80:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
   80 | typedef prte_reachable_t *(*prte_reachable_base_module_reachable_fn_t)(pmix_list_t *local_ifs,
      |                                                                        ^~~~~~~~~~~
      |                                                                        pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:81:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
   81 |                                                                        pmix_list_t *remote_ifs);
      |                                                                        ^~~~~~~~~~~
      |                                                                        pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:90:5: error: unknown type name ‘prte_reachable_base_module_reachable_fn_t’
   90 |     prte_reachable_base_module_reachable_fn_t reachable;
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reachable_netlink_module.c:34:24: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
   34 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if);
      |                        ^~~~~~~~~~
      |                        pmix_list_t
make[4]: *** [Makefile:837: reachable_netlink_component.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
reachable_netlink_module.c:34:46: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
   34 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if);
      |                                              ^~~~~~~~~~
      |                                              pmix_list_t
reachable_netlink_module.c: In function ‘netlink_reachable’:
reachable_netlink_module.c:62:5: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
   62 |     pmix_pif_t *local_iter, *remote_iter;
      |     ^~~~~~~~~~
      |     pmix_list_t
In file included from /home/vibhatha/github/ompi/3rd-party/prrte/src/mca/base/prte_mca_base_framework.h:21,
                 from /home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/base/base.h:20,
                 from reachable_netlink_module.c:25:
reachable_netlink_module.c:71:46: error: ‘pmix_pif_t’ undeclared (first use in this function); did you mean ‘pmix_list_t’?
   71 |     PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
      |                                              ^~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:18: note: in definition of macro ‘PMIX_LIST_FOREACH’
  221 |     for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
      |                  ^~~~
reachable_netlink_module.c:71:46: note: each undeclared identifier is reported only once for each function it appears in
   71 |     PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
      |                                              ^~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:18: note: in definition of macro ‘PMIX_LIST_FOREACH’
  221 |     for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
      |                  ^~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:24: error: expected expression before ‘)’ token
  221 |     for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
      |                        ^
reachable_netlink_module.c:71:5: note: in expansion of macro ‘PMIX_LIST_FOREACH’
   71 |     PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
      |     ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:223:24: error: expected expression before ‘)’ token
  223 |          item = (type *) ((pmix_list_item_t *) (item))->pmix_list_next)
      |                        ^
reachable_netlink_module.c:71:5: note: in expansion of macro ‘PMIX_LIST_FOREACH’
   71 |     PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
      |     ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:24: error: expected expression before ‘)’ token
  221 |     for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
      |                        ^
reachable_netlink_module.c:74:9: note: in expansion of macro ‘PMIX_LIST_FOREACH’
   74 |         PMIX_LIST_FOREACH(remote_iter, remote_ifs, pmix_pif_t)
      |         ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:223:24: error: expected expression before ‘)’ token
  223 |          item = (type *) ((pmix_list_item_t *) (item))->pmix_list_next)
      |                        ^
reachable_netlink_module.c:74:9: note: in expansion of macro ‘PMIX_LIST_FOREACH’
   74 |         PMIX_LIST_FOREACH(remote_iter, remote_ifs, pmix_pif_t)
      |         ^~~~~~~~~~~~~~~~~
reachable_netlink_module.c:76:48: warning: implicit declaration of function ‘get_weights’ [-Wimplicit-function-declaration]
   76 |             reachable_results->weights[i][j] = get_weights(local_iter, remote_iter);
      |                                                ^~~~~~~~~~~
reachable_netlink_module.c: At top level:
reachable_netlink_module.c:85:24: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
   85 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if)
      |                        ^~~~~~~~~~
      |                        pmix_list_t
reachable_netlink_module.c:85:46: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
   85 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if)
      |                                              ^~~~~~~~~~
      |                                              pmix_list_t
reachable_netlink_module.c:189:73: warning: initialization of ‘int’ from ‘prte_reachable_t * (*)(pmix_list_t *, pmix_list_t *)’ {aka ‘struct prte_reachable_t * (*)(struct pmix_list_t *, struct pmix_list_t *)’} makes integer from pointer without a cast [-Wint-conversion]
  189 |                                                                         netlink_reachable};
      |                                                                         ^~~~~~~~~~~~~~~~~
reachable_netlink_module.c:189:73: note: (near initialization for ‘prte_prtereachable_netlink_module.reachable’)
reachable_netlink_module.c:189:73: error: initializer element is not computable at load time
reachable_netlink_module.c:189:73: note: (near initialization for ‘prte_prtereachable_netlink_module.reachable’)
make[4]: *** [Makefile:837: reachable_netlink_module.lo] Error 1
make[4]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/netlink'
make[3]: *** [Makefile:1653: all-recursive] Error 1
make[3]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte/src'
make[2]: *** [Makefile:936: all-recursive] Error 1
make[2]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte'
make[1]: *** [Makefile:1347: all-recursive] Error 1
make[1]: Leaving directory '/home/vibhatha/github/ompi/3rd-party'
make: *** [Makefile:1469: all-recursive] Error 1

Please note that this was attempted on Ubuntu 20.04.

[MTT] Fail in osu_put_bibw

Configuration

OMPI: v5.0.0a1
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: none
Test module: none
Nodes: jazz x12 (ppn=28(x12), nodelist=jazz[12-21,29-30])

 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/html/test_stdout_Ccnq4w.txt
 
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core /hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/installs/OSS4/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
 
Output:

# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size      Bandwidth (MB/s)
[jazz13.swx.labs.mlnx:163928] ../../../../opal/mca/common/ucx/common_ucx_wpool.h:526  Error: ucp_atomic_cswap64 failed: -1
[jazz13:00000] *** An error occurred in MPI_Win_post
[jazz13:00000] *** reported by process [1362690049,1]
[jazz13:00000] *** on win ucx window 3
[jazz13:00000] *** MPI_ERR_OTHER: known error not in list
[jazz13:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[jazz13:00000] ***    and MPI will try to terminate your MPI job as well)
+ rc=16

Team params: team_size and oob.participants consistency

We have 2 fields in the team_params structure:
uint64_t team_size and
uint32_t oob.participants
both representing the same thing: the number of processes in a team. Yet they have different types: uint32_t and uint64_t.

We need a consistent definition here. Firstly, both should be of the same size; which one do we want to support?
Secondly, what guidelines should the user follow when providing them?

  • Are these fields mutually exclusive, or can they be provided simultaneously?
  • If both are given, I assume they should be equal, right?

@manjugv plz have a look.

Gtest: test_lib failure: duplicate pool creation

Configs:
UCC:

$./ucc_info -v
# UCC version=1.0.355 revision f744574

# configured with: --prefix=... --with-ucx=[1] --enable-gtest --with-cuda --enable-debug

UCX:

# UCT version=1.11.0 revision 10705c5
# configured with: --enable-gtest --enable-examples --with-valgrind --enable-profiling --enable-frame-pointer --enable-stats --enable-memtrack --enable-fault-injection --enable-debug-data --enable-mt --prefix=[1]

Reproduce:

$GTEST_FILTER="test_lib.*" make -C test/gtest test_valgrind
...
[       OK ] test_lib.init_finalize (37350 ms)
[ RUN      ] test_lib.init_multiple <> <>
 
Memcheck: the 'impossible' happened:
   MC_(create_mempool): duplicate pool creation
 
host stacktrace:
==5653==    at 0x5803FC7D: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x5803FD94: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x58040034: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x5800CF6D: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x5801DB6F: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x58058460: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x58096437: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653==    by 0x580A4C2A: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 5653)
==5653==    at 0x529EDEA: ucs_mpool_init (mpool.c:88)
==5653==    by 0xA97F955: ucc_mc_cuda_init (mc_cuda.c:149)
==5653==    by 0x503FC53: ucc_mc_init (ucc_mc.c:51)
==5653==    by 0x503D280: ucc_init_version (ucc_lib.c:282)
==5653==    by 0x473D47: ucc_init (ucc.h:544)
==5653==    by 0x474215: test_lib_init_multiple_Test::test_body() (test_lib.cc:38)
==5653==    by 0x464DFA: ucc::test_base::run() (test.cc:89)
==5653==    by 0x464E26: ucc::test_base::TestBodyProxy() (test.cc:95)
==5653==    by 0x463BC3: ucc::test::TestBody() (test.h:107)
==5653==    by 0x45948B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3562)
==5653==    by 0x454B81: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3598)
==5653==    by 0x43C804: testing::Test::Run() (gtest-all.cc:3635)
==5653==    by 0x43CF69: testing::TestInfo::Run() (gtest-all.cc:3812)
==5653==    by 0x43D57B: testing::TestCase::Run() (gtest-all.cc:3930)
==5653==    by 0x443BBF: testing::internal::UnitTestImpl::RunAllTests() (gtest-all.cc:5808)
==5653==    by 0x45A260: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest-all.cc:3562)
==5653==    by 0x455941: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest-all.cc:3598)
==5653==    by 0x442967: testing::UnitTest::Run() (gtest-all.cc:5422)
==5653==    by 0x464492: RUN_ALL_TESTS() (gtest.h:20059)
==5653==    by 0x4643D9: main (main.cc:43)

Thread 2: status = VgTs_WaitSys (lwpid 5658)
==5653==    at 0x5A126FD: ??? (in /usr/lib64/libpthread-2.17.so)
==5653==    by 0x8B9CCEC: ucs_vfs_fuse_wait_for_path (vfs_fuse.c:265)
==5653==    by 0x8B9D089: ucs_vfs_fuse_thread_func (vfs_fuse.c:345)
==5653==    by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)

Thread 3: status = VgTs_WaitSys (lwpid 5717)
==5653==    at 0x84CD33F: accept4 (in /usr/lib64/libc-2.17.so)
==5653==    by 0x621ECD2: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653==    by 0x62C0BD5: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653==    by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)

Thread 4: status = VgTs_WaitSys (lwpid 5718)
==5653==    at 0x84C120D: ??? (in /usr/lib64/libc-2.17.so)
==5653==    by 0x62CAB60: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653==    by 0x62A3DE9: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653==    by 0x62C0BD5: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653==    by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)

how to get ucc initialized with mt support

I was running the command in run_tests_torch_ucc.sh

echo "INFO: UCC barrier (CPU)"
/bin/bash ${SRC_DIR}/test/start_test.sh ${SRC_DIR}/test/torch_barrier_test.py --backend=gloo

And encountered this error:

[E torch_ucc_comm.cpp:156] ucc library wasn't initialized with mt support check ucc compile options 

What does this "mt" mean? Does it mean "multi-thread"? I guess.

But I can't find compile options regarding multithread mode.
Thank you.

[mtt] Failed osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw on host

Configuration

OMPI: 5.0.0a1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: none
Test module: none
Nodes: jazz x6 (ppn=28(x6), nodelist=jazz[06,08,11,16,25,32])

 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/html/test_stdout_7HNGrU.txt
 
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core /hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/installs/6uoo/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
 
Output:

=============================================================
# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size      Bandwidth (MB/s)
[jazz06.swx.labs.mlnx:154306] ../../../../opal/mca/common/ucx/common_ucx_wpool.h:526  Error: ucp_atomic_cswap64 failed: -1
[jazz06:00000] *** An error occurred in MPI_Win_post
[jazz06:00000] *** reported by process [2164850689,0]
[jazz06:00000] *** on win ucx window 3
[jazz06:00000] *** MPI_ERR_OTHER: known error not in list
[jazz06:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[jazz06:00000] ***    and MPI will try to terminate your MPI job as well)
+ rc=16

Component architecture diagram

The README file points to the old component architecture diagram (docs/images/ucc_components.png), and there is no PNG file in the docs directory for the new diagram.

Inconsistency in maximum rank

It looks to me as if there is an inconsistency in how UCC determines the maximum rank within a team. On the one hand, there is this code in src/utils/ucc_datastruct.h that says the maximum rank is UINT32_MAX - 1 (4,294,967,294):

#define UCC_RANK_INVALID UINT32_MAX
#define UCC_RANK_MAX UCC_RANK_INVALID - 1

On the other hand, within the UCP TL, this code seems to handle only a maximum of 24 bits (16,777,216 ranks):

/*
 * UCP tag structure:
 *
 *    01      | 01234567 01234567 |   234    |     567     | 01234567 01234567 01234567 | 01234567 01234567
 *            |                   |          |             |                            |
 *  RESERV(2) | message tag (16)  | SCOPE(3) | SCOPE_ID(3) |      source rank (24)      |   team id (16)
 */
#define UCC_TL_UCP_RESERVED_BITS 2
#define UCC_TL_UCP_SCOPE_BITS 3
#define UCC_TL_UCP_SCOPE_ID_BITS 3
#define UCC_TL_UCP_TAG_BITS 16
#define UCC_TL_UCP_SENDER_BITS 24
#define UCC_TL_UCP_ID_BITS 16

I don't think it's likely that either maximum is being hit right now, but it would probably be good to be consistent. Particularly as different TLs could potentially support different maximum sizes.
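A hedged sketch of one way the mismatch could be made explicit: derive the tag-encodable maximum from the bit width and compare it against the team size (the macro values are restated from the snippets above; the check itself is hypothetical, not existing UCC code):

#include <stdint.h>
#include <stdio.h>

#define UCC_RANK_INVALID        UINT32_MAX
#define UCC_RANK_MAX            (UCC_RANK_INVALID - 1)
#define UCC_TL_UCP_SENDER_BITS  24

/* Largest rank that fits in the UCP tag's source-rank field. */
#define UCC_TL_UCP_MAX_SENDER   ((1ull << UCC_TL_UCP_SENDER_BITS) - 1)

int main(void)
{
    uint64_t team_size = 20000000; /* example: more ranks than the UCP TL tag can encode */

    printf("tl/ucp limit: %llu ranks, ucc limit: %u ranks\n",
           (unsigned long long)(UCC_TL_UCP_MAX_SENDER + 1), (uint32_t)UCC_RANK_MAX);
    /* A TL-level guard along these lines would surface the smaller limit
     * instead of silently wrapping the source rank inside the tag. */
    if (team_size > UCC_TL_UCP_MAX_SENDER + 1) {
        fprintf(stderr, "team too large for tl/ucp tag encoding\n");
        return 1;
    }
    return 0;
}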

[JENKINS] UCC unit tests breaks the CI system

It looks like the UCC unit tests break the CI cluster: during the final phase of testing it hangs and the overall system becomes non-operable (only a hard reset helps). The tests have been temporarily disabled. This looks like a regression (we found a UCC git commit without this issue).

[mtt][atest] Fail test

Test: ucc-ompi.*;atest;atest

Configuration

OMPI: v5.0.0rc2
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: none
Test module: none
Nodes: jazz x2 (ppn=28(x2), nodelist=jazz[18,24])

 
MTT log:

http://hpcweb.lab.mtl.com//hpc/mtr_scrap/users/mtt/scratch/ucc/20211207_012044_191759_83678_jazz18.swx.labs.mlnx/html/test_stdout_aivlfY.txt 

Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20211207_012044_191759_83678_jazz18.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 56 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core --mca pmix_server_max_wait 8 /hpc/mtr_scrap/users/mtt/scratch/ucc/20211207_012044_191759_83678_jazz18.swx.labs.mlnx/installs/YsHX/tests/atest/mtt-tests.git/atest/src/atest --time 0 -v 1 --test-cross 0
 
Output:

Total tests:                    817
Total tests exec time (secs):   64
Total failed:                   1
Total success:                  816

comment:
one test out of 10 fails

[mtt] fail ucc MPI TEST

Configuration

OMPI: 4.1.2a1
MOFED: MLNX_OFED_LINUX-5.2-2.2.0.0
Module: hpcx-gcc (2021-06-15)
Test module: none
Nodes: dgx x2 (ppn=80(x2), nodelist=swx-dgx[01,04])

 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210617_075227_34765_41420_swx-dgx01.swx.labs.mlnx/html/test_stdout_err9o5.txt
 
Cmd:
/hpc/local/benchmarks/daily/next/2021-06-15/hpcx-gcc-redhat7.6/ompi/bin/mpirun -np 64 --map-by node --bind-to hwthread -x UCC_TL_NCCL_TUNE=0 /hpc/mtr_scrap/users/mtt/scratch/ucc/20210617_075227_34765_41420_swx-dgx01.swx.labs.mlnx/installs/HAOp/tests/ucc_repo/ucc.git/test/mpi/.libs/ucc_test_mpi --colls allreduce --mtypes cuda --inplace 1 --set_device 2
 
Output:

===== UCC MPI TEST REPORT =====
   total tests : 224
   passed      : 223
   skipped     : 0
   failed      : 1
   elapsed     : 18s
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    swx-dgx01
  Remote host:   swx-dgx04
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
--------------------------------------------------------------------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Gtest segfault in gtest with many ifaces

When one process creates too many UCP contexts (with multiple interfaces), UCP fails to initialize. Currently we segfault; we need to fail gracefully instead.
repro:
UCC_TL_NCCL_TUNE=0 ./test/gtest/gtest

Value of: ucc_context_create(lib_h, &ctx_params, ctx_config, &ctx_h)
  Actual: -6
Expected: UCC_OK
Which is: 0
[1620760142.526132] [jazz23:193943:0]            sock.c:139  UCX  ERROR socket create failed: Too many open files
[1620760142.526143] [jazz23:193943:0]            sock.c:139  UCX  ERROR socket create failed: Too many open files
[1620760142.526148] [jazz23:193943:0]            sock.c:139  UCX  ERROR socket create failed: Too many open files
[1620760142.526962] [jazz23:193943:0]  tl_ucp_context.c:89   TL_UCP ERROR failed to create ucp worker, Input/output error
[1620760142.529811] [jazz23:193943:0]     ucc_context.c:293  UCC  WARN  failed to create tl context for ucp
[1620760142.529821] [jazz23:193943:0] cl_basic_context.c:23   CL_BASIC WARN  TL UCP context is not available, CL BASIC can't proceed
[1620760142.529824] [jazz23:193943:0]     ucc_context.c:362  UCC  WARN  failed to create cl context for basic, skipping
[1620760142.529827] [jazz23:193943:0]     ucc_context.c:370  UCC  ERROR no CL context created in ucc_context_create
../../../test/gtest/common/test_ucc.cc:22: Failure
Value of: ucc_context_create(lib_h, &ctx_params, ctx_config, &ctx_h)
  Actual: -6
Expected: UCC_OK
Which is: 0
[jazz23:193943:0:193943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x88)

/labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c: [ ucc_team_create_post_single() ]
      ...
       37 {
       38     ucc_status_t status;
       39     if ((team->params.mask & UCC_TEAM_PARAM_FIELD_EP) &&
==>    40         (team->params.mask & UCC_TEAM_PARAM_FIELD_EP_RANGE) &&
       41         (team->params.ep_range == UCC_COLLECTIVE_EP_RANGE_CONTIG)) {
       42         team->rank =
       43             team->params.ep; //TODO need to make sure we don't exceed rank size

==== backtrace (tid: 193943) ====

 0 0x00000000000080c5 ucc_team_create_post_single()  /labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c:40
 1 0x00000000000080c5 ucc_team_create_post()  /labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c:161
 2 0x000000000047d7bb UccTeam::init_team()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:160
 3 0x000000000047d7bb UccTeam::init_team()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:160
 4 0x000000000047ef41 UccTeam::UccTeam()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:221
 5 0x000000000047f21d construct<UccTeam, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>()  /usr/include/c++/4.8.2/ext/new_allocator.h:120
 6 0x000000000047f21d __shared_ptr<std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>()  /usr/include/c++/4.8.2/bits/shared_ptr_base.h:961
 7 0x000000000047f21d shared_ptr<std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>()  /usr/include/c++/4.8.2/bits/shared_ptr.h:316
 8 0x000000000047f21d allocate_shared<UccTeam, std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>()  /usr/include/c++/4.8.2/bits/shared_ptr.h:598
 9 0x000000000047f21d make_shared<UccTeam, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>()  /usr/include/c++/4.8.2/bits/shared_ptr.h:614
10 0x000000000047f21d UccJob::create_team()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:303
11 0x00000000004b3651 test_team_team_create_multiple_preconnect_Test::test_body()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/core/test_team.cc:53
12 0x00000000004b3651 std::vector<std::shared_ptr<UccTeam>, std::allocator<std::shared_ptr<UccTeam> > >::push_back()  /usr/include/c++/4.8.2/bits/stl_vector.h:920
13 0x00000000004b3651 test_team_team_create_multiple_preconnect_Test::test_body()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/core/test_team.cc:53
14 0x000000000047aa96 ucc::test_base::run()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test.cc:89
15 0x0000000000476523 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3562
16 0x000000000046990d testing::Test::Run()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3635
17 0x00000000004699dc testing::TestInfo::Run()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3812
18 0x0000000000469b3f testing::TestCase::Run()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3930
19 0x000000000046df47 testing::internal::UnitTestImpl::RunAllTests()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:5808
20 0x000000000046e24b testing::internal::UnitTestImpl::RunAllTests()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:5725
21 0x0000000000453f89 RUN_ALL_TESTS()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest.h:20059
22 0x0000000000453f89 main()  /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/main.cc:43
23 0x00000000000223d5 __libc_start_main()  ???:0
24 0x0000000000455e60 _start()  ???:0
=================================

OOB Allgather details

I've got several questions regarding OOB Allgather:

  1. Do we assume that the ordering of the buffers in the destination is fixed when oob.allgather is called several times? For example, I can imagine (though it would be very odd) an allgather implementation where all ranks send their data to the root, which receives the buffers in ANY order and packs them one after another. Such an implementation would produce different results from one call to another. Do we allow this?
  2. Suppose the user calls ucc_team_create_post providing an EP value as input plus EP_RANGE_CONTIG; in other words, the user specifies the "rank" of the calling process in the team. Does this "rank" have to match the ordering in oob.allgather? I.e., if I invoke oob.allgather with a unique process identifier as input and then parse the result buffer, will I find my local ID at the position corresponding to "rank"? (See the MPI reference sketch below.)
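For comparison, a plain MPI_Allgather does give the positional guarantee question 2 asks about: rank i's contribution is always stored at offset i * size of the receive buffer. The sketch below only illustrates that MPI semantics; whether UCC's oob.allgather is required to behave the same way is the open question.

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    uint64_t  my_id = (uint64_t)rank;                 /* unique process identifier */
    uint64_t *all   = malloc(nprocs * sizeof(*all));

    MPI_Allgather(&my_id, 1, MPI_UINT64_T, all, 1, MPI_UINT64_T, MPI_COMM_WORLD);

    /* With MPI ordering semantics, all[rank] == my_id on every process,
     * i.e. position in the result buffer and "rank" always match. */
    printf("rank %d finds its id at position %d: %llu\n",
           rank, rank, (unsigned long long)all[rank]);

    free(all);
    MPI_Finalize();
    return 0;
}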

Share Topology information

Currently, the topology information (socket, core, and binding information) is computed by each process on the node. This issue is a placeholder for an optimization where one process computes this information and shares it with the others.

Related to #266

UCC team create/destroy in a loop leaks memory due to UCP

UCC code like below:

ucc_init
ucc_context_create->context
for i in [1..100]:
    ucc_team_create(context)->team
    ucc_alltoall(team)
    ucc_team_destroy(team)

leaks memory when executed using TL/UCP. The reason is that each time we create a team with TL/UCP we allocate UCP endpoints. The call to alltoall forces ALL EPs to connect (N^2 total connections). The call to ucc_team_destroy results in ucp_ep_close(FLUSH) being executed on each local endpoint. HOWEVER, for some transports (RC is one of them) only the local part of the p2p EP (QP) is cleaned up by UCP - i.e. the corresponding part of the connection stored on the remote UCP worker is left behind and consumes memory -> leak. Attached is a stand-alone UCP reproducer for the issue.
ucp_test.txt

It looks like this is a known limitation and we can't expect it to be resolved in UCP soon.

What can we do in UCC about it?

  1. First: a short-term solution.
    Use hashing and store UCP endpoints in a global hashtable (see the sketch below). This requires unique participant ID generation. I will prepare a prototype PR.
  2. Long term:
    For some runtimes (mostly I'm talking about MPI) we want to optimize. We have a notion of "WORLD" there, so we can store a global endpoint array and re-map team ranks into it. A first approach was made in PR #77. This implies a global (collective) ucc context AND a context-rank notion, which we would like to avoid. Another approach could be to use a comm_world ucc team and an alternative interface for team creation.
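A minimal sketch of the short-term idea, assuming a process-global table keyed by a unique participant ID; the table layout and lookup_or_create_ep() are hypothetical names for illustration, not existing UCC code:

#include <stdint.h>
#include <stdlib.h>

typedef struct ep_entry {
    uint64_t         key;     /* unique participant ID (e.g. a context-wide proc id) */
    void            *ucp_ep;  /* would be ucp_ep_h in real code */
    struct ep_entry *next;
} ep_entry_t;

#define EP_TABLE_SIZE 1024
static ep_entry_t *ep_table[EP_TABLE_SIZE];

/* Return the endpoint for "key", creating it only on first use; repeated team
 * create/destroy cycles then reuse the existing connection instead of opening
 * (and leaking) a new one on every iteration. */
static void *lookup_or_create_ep(uint64_t key, void *(*create_ep)(uint64_t))
{
    size_t      bucket = key % EP_TABLE_SIZE;
    ep_entry_t *e;

    for (e = ep_table[bucket]; e != NULL; e = e->next) {
        if (e->key == key) {
            return e->ucp_ep;            /* reuse: no new UCP connection */
        }
    }
    e                = malloc(sizeof(*e));
    e->key           = key;
    e->ucp_ep        = create_ep(key);   /* connect only the first time */
    e->next          = ep_table[bucket];
    ep_table[bucket] = e;
    return e->ucp_ep;
}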
