openucx / ucc
Unified Collective Communication Library
Home Page: https://openucx.github.io/ucc/
License: BSD 3-Clause "New" or "Revised" License
Information about the node and network topology is useful for many optimizations, including implementing topology-aware collectives, routing data via fast paths, and minimizing the impact of congestion. The current UCC interfaces lack this topology information.
Add topology abstraction to library and team creation interfaces.
This issue is a placeholder to capture the discussion, proposals, and details related to topology abstraction.
For several reasons we might need to generate a context-wide team id inside UCC. There are ways to do that using OOB, but they involve extra communication. Sometimes the runtime already has that information, and it can be passed by the user through team_params.
Proposal: extend team_params with a uint64_t TEAM ID field. We might have restrictions on the maximum team id that we can support internally (due to implementation specifics), but that is our internal decision: if the ID provided by the runtime is too large, we can still generate an internal ID.
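As a sketch of what this extension might look like (all names below are hypothetical stand-ins, not the actual UCC API), the team id could be an optional field gated by a field mask, with an internal fallback when the runtime's id is absent or too large:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical extension of the team params: an optional, runtime-provided
 * team id gated by a field mask. Names are illustrative, not the UCC API. */
#define TEAM_PARAM_FIELD_ID (1ULL << 0)

typedef struct team_params_sketch {
    uint64_t mask;    /* which optional fields below are valid */
    uint64_t team_id; /* context-unique id supplied by the runtime */
} team_params_sketch_t;

/* If the runtime did not provide an id, or the id exceeds what the
 * implementation can encode internally, fall back to internal generation. */
static uint64_t resolve_team_id(uint64_t mask, uint64_t team_id,
                                uint64_t max_supported, uint64_t internal_id)
{
    if ((mask & TEAM_PARAM_FIELD_ID) && team_id <= max_supported) {
        return team_id;
    }
    return internal_id;
}
```

This keeps the "too large id" case purely an internal decision, as proposed above.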
My compiler (clang with -Wenum-conversion) reports the following error:
third-party/ucc/src/components/mc/cuda/mc_cuda.c:187:9: error: implicit conversion from enumeration type 'ucs_status_t' to different enumeration type 'ucc_status_t' [-Werror,-Wenum-conversion]
CUDADRV_FUNC(cuDeviceGetAttribute(&mem_ops_attr,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
third-party/ucc/src/components/mc/cuda/mc_cuda.h:116:32: note: expanded from macro 'CUDADRV_FUNC'
ucc_status_t _status = UCS_OK;
~~~~~~~ ^~~~~~
It looks like a typo to me:
ucc/src/components/mc/cuda/mc_cuda.h
Line 116 in d13e395
ucc/src/components/tl/nccl/tl_nccl_team.c
Line 135 in d13e395
Just wanted to check whether ucc_status_t and ucs_status_t are interchangeable, since they are slightly different in my understanding. Is ucs_status_t a superset of ucc_status_t?
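For what it's worth, the two enums are distinct types whose values do not line up one-to-one, so an explicit mapping is the safe pattern rather than an implicit conversion. A minimal self-contained sketch (the enums below are stand-ins with made-up values, not the real definitions from the UCS and UCC headers):

```c
#include <assert.h>

/* Stand-in enums with made-up values: the real ucs_status_t and ucc_status_t
 * are defined independently and their values do not line up one-to-one. */
typedef enum { UCS_OK_SK = 0, UCS_ERR_NO_MEMORY_SK = -4 } ucs_status_sk_t;
typedef enum { UCC_OK_SK = 0, UCC_ERR_NO_MESSAGE_SK = -1,
               UCC_ERR_NO_MEMORY_SK = -2 } ucc_status_sk_t;

/* An explicit switch avoids the implicit enum conversion the compiler flags
 * and collapses unmapped codes to a generic error instead of a wrong one. */
static ucc_status_sk_t ucs_to_ucc_sk(ucs_status_sk_t s)
{
    switch (s) {
    case UCS_OK_SK:            return UCC_OK_SK;
    case UCS_ERR_NO_MEMORY_SK: return UCC_ERR_NO_MEMORY_SK;
    default:                   return UCC_ERR_NO_MESSAGE_SK;
    }
}
```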
For reference, UCG parameters require the following:
typedef struct ucg_params {
/* Callback functions for address lookup, used at connection establishment */
struct {
int (*lookup_f)(void *cb_group_context,
ucg_group_member_index_t index,
ucp_address_t **addr,
size_t *addr_len);
void (*release_f)(ucp_address_t *addr);
} address;
Currently, the OMPI-based implementation satisfies this requirement as follows:
int mca_coll_ucx_resolve_address(void *cb_group_obj,
ucg_group_member_index_t rank,
ucp_address_t **addr,
size_t *addr_len)
{
/* Sanity checks */
ompi_communicator_t* comm = (ompi_communicator_t*)cb_group_obj;
if (rank == (ucg_group_member_index_t)comm->c_my_rank) {
COLL_UCX_ERROR("mca_coll_ucx_resolve_address(rank=%lu) "
"shouldn't be called on its own rank (loopback)", rank);
return 1;
}
/* Check the cache for a previously established connection to that rank */
ompi_proc_t *proc_peer =
(struct ompi_proc_t*)ompi_comm_peer_lookup((ompi_communicator_t*)cb_group_obj, rank);
*addr = proc_peer->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_COLL];
*addr_len = 0; /* UCX doesn't need the length to unpack the address */
if (*addr) {
return 0;
}
/* Obtain the UCP address of the remote */
int ret = mca_coll_ucx_recv_worker_address(proc_peer, addr, addr_len);
if (ret < 0) {
COLL_UCX_ERROR("mca_coll_ucx_recv_worker_address(proc=%d rank=%lu) failed",
proc_peer->super.proc_name.vpid, rank);
return 1;
}
/* Cache the connection for future invocations with this rank */
proc_peer->proc_endpoints[OMPI_PROC_ENDPOINT_TAG_COLL] = *addr;
return 0;
}
void mca_coll_ucx_release_address(ucp_address_t *addr)
{
/* no need to free - the address is stored in proc_peer->proc_endpoints */
}
Hi, I am trying to build UCC in PyTorch's CI, and it failed on the CUDA 10.2 CI because __double2half was added in CUDA 11. I will NOT request UCC to support CUDA 10.2, because PyTorch is already working on dropping CUDA 10.2 support as well.
But I searched this repository and didn't see any mention of versioning. The CI also seems to test only one CUDA version. I really hope that UCC can clearly state the lowest supported CUDA version in README.md and run a CI on that CUDA version.
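One common way to keep older toolkits compiling, assuming the standard CUDART_VERSION macro from cuda_runtime.h (major*1000 + minor*10, e.g. 10020 for CUDA 10.2), is a compile-time guard around the CUDA-11-only intrinsic. A self-contained sketch; the local stand-in define and the UCC_HAVE_DOUBLE2HALF name are illustrative only:

```c
#include <assert.h>

/* CUDART_VERSION normally comes from cuda_runtime.h (10020 for CUDA 10.2,
 * 11000 for CUDA 11.0); define a stand-in so this sketch is self-contained. */
#ifndef CUDART_VERSION
#define CUDART_VERSION 11000
#endif

/* __double2half first appeared in CUDA 11, so gate any use of it: */
#if CUDART_VERSION >= 11000
#define UCC_HAVE_DOUBLE2HALF 1
#else
#define UCC_HAVE_DOUBLE2HALF 0 /* a host-side fallback would be compiled here */
#endif
```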
I realize I'm about a decade late, but is CORE-Direct still a thing, and will UCC be using it?
I've been trying to find information on CORE-Direct, but it seems like support got dropped somewhere.
Is there any documentation regarding CORE-Direct, or even a header file with the endpoints?
Thanks
UCC fails to compile on Ubuntu 21.10 with gcc 11.2.0, as UCC sets -Werror and -Wenum-conversion in the CFLAGS:
make[3]: Entering directory '/home/fabecassis/github/ucc/src/components/tl/sharp'
CC libucc_tl_sharp_la-tl_sharp_context.lo
tl_sharp_context.c: In function ‘ucc_tl_sharp_context_t_init’:
tl_sharp_context.c:285:15: error: implicit conversion from ‘ucs_status_t’ to ‘ucc_status_t’ [-Werror=enum-conversion]
285 | status = ucc_rcache_create(&rcache_params, "SHARP", NULL, &self->rcache);
| ^
cc1: all warnings being treated as errors
make[3]: *** [Makefile:582: libucc_tl_sharp_la-tl_sharp_context.lo] Error 1
CC libucc_tl_sharp_la-tl_sharp_coll.lo
tl_sharp_coll.c: In function ‘ucc_tl_sharp_mem_register’:
tl_sharp_coll.c:87:16: error: implicit conversion from ‘ucs_status_t’ to ‘ucc_status_t’ [-Werror=enum-conversion]
87 | status = ucc_rcache_get(ctx->rcache, (void *)addr, length,
| ^
cc1: all warnings being treated as errors
make[3]: *** [Makefile:596: libucc_tl_sharp_la-tl_sharp_coll.lo] Error 1
UCX comes from HPC-X 2.9.0:
$ ucx_info -v
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu20.04-x86_64/ucx
The workaround is simple for now though: make CFLAGS=-Wno-error=enum-conversion
ucc.h (as well as the doxygen docs) lacks descriptions of the coll_args flags. While some of them are relatively obvious (persistent, in-place), others are not (we got questions about what these flags mean):
UCC_COLL_ARGS_FLAG_IN_PLACE
UCC_COLL_ARGS_FLAG_PERSISTENT
UCC_COLL_ARGS_FLAG_COUNT_64BIT
UCC_COLL_ARGS_FLAG_DISPLACEMENTS_64BIT
UCC_COLL_ARGS_FLAG_CONTIG_SRC_BUFFER
UCC_COLL_ARGS_FLAG_CONTIG_DST_BUFFER
Need to update docs.
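To illustrate how such flags are typically consumed: they are OR-ed into a single bitmask in the collective args and tested individually. The bit values and the one-line meanings below are my best-effort paraphrases, not taken from ucc.h:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in bit values and best-effort paraphrases of the flag meanings;
 * the authoritative definitions are the UCC_COLL_ARGS_FLAG_* values in ucc.h. */
#define SK_FLAG_IN_PLACE            (1u << 0) /* dst buffer doubles as src */
#define SK_FLAG_PERSISTENT          (1u << 1) /* args reused across many posts */
#define SK_FLAG_COUNT_64BIT         (1u << 2) /* counts arrays hold 64-bit values */
#define SK_FLAG_DISPLACEMENTS_64BIT (1u << 3) /* displacement arrays are 64-bit */
#define SK_FLAG_CONTIG_SRC_BUFFER   (1u << 4) /* v-collective src is contiguous */
#define SK_FLAG_CONTIG_DST_BUFFER   (1u << 5) /* v-collective dst is contiguous */

/* Flags are OR-ed into a single bitmask field of the collective args. */
static int sk_has_flag(uint64_t flags, uint64_t f)
{
    return (flags & f) == f;
}
```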
In PR #84 we are adding support for the NCCL TL. If UCC was built with NCCL support, TL NCCL might be selected by CLs for CUDA collectives, i.e., when both source and destination buffers are of memory type CUDA. However, there are some known limitations when NCCL is used, such as launching multiple collectives on different streams concurrently. Therefore, users are encouraged to follow the NCCL guidelines to avoid potential deadlocks. From the UCC perspective this means that if multiple teams are created and the NCCL TL is used, the user should not post CUDA collectives to different teams at the same time.
We implemented a UCX-based transport for the Cylon project, which is a distributed DataFrame library. We are looking forward to integrating the collective communications with the Cylon project. Are you planning to release UCC sometime soon? Is it released separately or along with UCX?
Something for the WG to discuss.
Currently we define UCC team creation to be non-blocking: team_create_post + test.
The corresponding ucc_team_destroy is a blocking call.
However, in some cases ucc_team_destroy will actually involve communication among the ranks of the team being destroyed. Examples: a TL UCP team might have UCP EPs connected that must be disconnected during destroy - this involves the ucp ep_close protocol, which is non-local. Another example is mcast group destruction, which requires a synchronizing flush over the participating ranks.
I faced this issue when I was adding team_destroy to the gtest. There we simulate multiple ranks from a single process, and it is obviously impossible then to destroy the team with a blocking API (the very first rank in gtest will hang). I've implemented non-blocking team destruction internally (in CL/TL and the Base interface) and use it in gtest. Currently the UCC API is kept the same: blocking team destruction (implemented as while (UCC_OK != ucc_team_destroy_nb(team))).
Question: do we want to define ucc_team_destroy as a non-blocking call in the UCC API? There is probably not much of a use case for it, but it would make the API more consistent.
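The blocking-over-non-blocking layering described above can be sketched as a simple polling loop. Everything below is a self-contained stand-in (stub status enum and sk_destroy_nb simulating a multi-pass disconnect), not the real UCC calls:

```c
#include <assert.h>

/* Stand-ins: sk_destroy_nb simulates a non-blocking team destroy that needs
 * several progress passes before the disconnect protocol completes. */
typedef enum { SK_OK = 0, SK_INPROGRESS = 1 } sk_status_t;

static int sk_pending = 3; /* pretend the EP disconnects take 3 passes */

static sk_status_t sk_destroy_nb(void)
{
    return (sk_pending-- > 0) ? SK_INPROGRESS : SK_OK;
}

/* The blocking API can then be layered on top of the non-blocking one,
 * polling (and, in the real library, progressing the context) until done. */
static sk_status_t sk_destroy_blocking(void)
{
    sk_status_t st;
    while (SK_INPROGRESS == (st = sk_destroy_nb())) {
        /* ucc_context_progress(...) would be driven here */
    }
    return st;
}
```

With this split, the gtest case above can drive all simulated ranks' non-blocking destroys in round-robin fashion instead of hanging in the first rank's blocking call.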
Shell: command failed "cd atest
./autogen.sh
./configure --with-mpipath=/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install --enable-debug
make -j 9
"
make all-recursive
make[1]: Entering directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest'
Making all in src
make[2]: Entering directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest/src'
depbase=`echo main.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpicc -DHAVE_CONFIG_H -I. -I.. -I../src/tests -I../src/cmd -I../src/types -I../src/comms -I../src/env -I../src/output -I../src/prof -I../src/utils -D__LINUX__ -Wall -Werror -g -D_DEBUG -g -O2 -MT main.o -MD -MP -MF $depbase.Tpo -c -o main.o main.c &&\
mv -f $depbase.Tpo $depbase.Po
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/ompi_src/install/bin/mpicc: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
make[2]: *** [main.o] Error 127
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210624_101458_17631_43994_vulcan02.swx.labs.mlnx/installs/dneq/tests/atest/mtt-tests.git/atest'
make: *** [all] Error 2
Currently, the default MT task progress queue is the locked pq. For two-threaded cases they perform almost the same, but for a larger number of threads we need the lock-free progress queue. We need to decide what the default will be, or perhaps some other mechanism besides the config flag.
Line 21 in 61da0e9
Currently we simply return ucc_status_t. Do we want to return some int to represent how much progress was made? Or a boolean for progress vs. no progress?
In ucc_tl_cuda_create_cache, it is possible that ucs_pgtable_init returns an error. Refer to
ucc/src/components/tl/cuda/tl_cuda_cache.c
Lines 82 to 86 in 4f0ad00
In this case, which error code should we return (or just UCC_OK to proceed silently)? And can I safely assume that TL/CUDA still works without the cache (i.e., only with worse expected performance)?
Asking because I got a compilation error (using clang) along these lines:
tl_cuda_cache.c:82:9: error: variable 'status' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tl_cuda_cache.c:101:12: note: uninitialized use occurs here
return status;
^~~~~~
tl_cuda_cache.c:82:5: note: remove the 'if' if its condition is always false
if (UCS_OK != ucs_pgtable_init(&cache_desc->pgtable,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tl_cuda_cache.c:65:5: note: variable 'status' is declared here
ucc_status_t status;
^
1 error generated.
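The clang warning itself points at the minimal fix: make sure status is assigned on the failure path (or initialized at declaration) before the goto. A self-contained sketch of the pattern, with stand-in names and error codes in place of the tl_cuda_cache.c code:

```c
#include <assert.h>

/* Minimal reproduction of the pattern clang flags, with the fix applied:
 * status is initialized and explicitly assigned on the failure branch.
 * All names are stand-ins for the tl_cuda_cache.c code. */
typedef enum { SK_CACHE_OK = 0, SK_CACHE_ERR = -1 } sk_cache_status_t;

static sk_cache_status_t sk_create_cache(int pgtable_init_succeeds)
{
    sk_cache_status_t status = SK_CACHE_OK; /* initialize up front... */

    if (!pgtable_init_succeeds) {
        status = SK_CACHE_ERR; /* ...and set an error before the jump */
        goto err;
    }
    /* ...the rest of the cache setup would go here... */
    return SK_CACHE_OK;
err:
    return status;
}
```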
I am trying to test UCC with NCCL backend. Is there any documentation to get started?
This is never freed in the normal destroy process:
ucc/src/components/tl/nccl/tl_nccl_team.c
Line 119 in be42d81
I am seeing the following error when building UCC in NVIDIA's PyTorch container:
CCLD ucc_info
/usr/bin/ld: /usr/local/lib/libucs.so: undefined reference to `ucm_set_global_opts'
/usr/bin/ld: /usr/local/lib/libucs.so: undefined reference to `ucm_mmap_hook_modes'
To reproduce, run the following docker container:
docker run -it nvcr.io/nvidia/pytorch:22.03-py3
And inside the container, run the following script:
#!/bin/bash
set -ex
export UCX_HOME="/usr"
export UCC_HOME="/usr"
install_ucx() {
set -ex
echo "Will install ucx at: $UCX_HOME"
rm -rf ucx
git clone --recursive https://github.com/openucx/ucx.git
pushd ucx
./autogen.sh
./configure --prefix=$UCX_HOME \
--without-bfd \
--enable-mt \
--with-cuda=/usr/local/cuda/ \
--enable-profiling \
--enable-stats
make -j
make install
popd
}
install_ucc() {
set -ex
echo "Will install ucc at: $UCC_HOME"
rm -rf ucc
git clone --recursive https://github.com/openucx/ucc.git
pushd ucc
./autogen.sh
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-nccl=/usr \
--with-cuda=/usr/local/cuda/
make -j
make install
popd
}
install_ucx
install_ucc
This is a special case of topology information (#13), likely easier to accomplish.
For reference, UCG parameters require the following:
/**
* Information about other processes running UCX on the same node, used for
* the UCG - Group operations (e.g. MPI collective operations). This includes
* both the total number of processes (including myself) and a zero-based
* index of my process, guaranteed to be unique among the local processes
* which this process will contact. One such pair refers strictly to the
* peers on the same host, and the other pair refers to the total amount
* of peers for communication across the network. Typically the process with
* index #0 (in either pair) performs special duties in group-aware
* transports, and those transports need this information on every process.
*
 * @note Both fields are indicated by the same bit in @ref field_mask.
*/
struct {
uint32_t num_local;
uint32_t local_idx;
uint32_t num_global;
uint32_t global_idx;
} peer_info;
Full disclosure: this is NOT part of the upstream UCP version, but rather a modified UCP I've been using for UCG.
Currently, the OMPI-based implementation satisfies this requirement as follows:
ucp_params.peer_info.num_local = ompi_process_info.num_local_peers + 1;
ucp_params.peer_info.local_idx = ompi_process_info.my_local_rank;
ucp_params.peer_info.num_global = ompi_process_info.num_procs;
ucp_params.peer_info.global_idx = ompi_process_info.myprocid.rank;
To clarify, the reason this code has ucp_params is that this information is passed to UCP (and UCT), but is used exclusively for collective operations and not P2P.
Are there any examples for UCC and how to use the API?
Capture outstanding issues on PR #1
Endpoint
Request object
Void on destroy ?
Negative status codes ?
bool or _Bool ?
Add UCC_TYPE_COLL_LAST ?
Recap: I want your feedback on the OMPI code for UCC: https://github.com/alex--m/ompi/tree/topic/shared-ucx
In addition, I'll be asking the UCX community to do the same, seeing as during the last meeting we agreed that this is required for UCC.
[----------] 4 tests from test_context_config, where TypeParam =
[ RUN ] test_context_config.read_release <> <>
[ OK ] test_context_config.read_release (10 ms)
[ RUN ] test_context_config.print <> <>
[ OK ] test_context_config.print (10 ms)
[ RUN ] test_context_config.modify <> <>
**3465** [1617199293.251438] [0] ucc_context.c:161 UCC ERROR failed to modify CL "basic" configuration, name _UNKNOWN_FIELD, value _unknown_value
../../../test/gtest/core/test_context_config.cc:65: Failure
Expected: (std::string::npos) != (output.find("failed to modify")), actual: 18446744073709551615 vs 18446744073709551615
**3465** [1617199293.309419] [0] ucc_cl.c:71 UCC ERROR incorrect value is passed as part of UCC_CLS list: _unknown_cl
**3465** [1617199293.310071] [0] ucc_context.c:145 UCC ERROR failed to parse cls string: _unknown_cl
../../../test/gtest/core/test_context_config.cc:72: Failure
Expected: (std::string::npos) != (output.find("incorrect value")), actual: 18446744073709551615 vs 18446744073709551615
[ FAILED ] test_context_config.modify, where TypeParam = and GetParam() = (83 ms)
[ RUN ] test_context_config.modify_core <> <>
**3465** [1617199293.330501] [0] ucc_context.c:137 UCC ERROR failed to modify CORE configuration, name _UNKNOWN_FIELD, value _unknown_value
../../../test/gtest/core/test_context_config.cc:97: Failure
Expected: (std::string::npos) != (output.find("failed to modify")), actual: 18446744073709551615 vs 18446744073709551615
[ FAILED ] test_context_config.modify_core, where TypeParam = and GetParam() = (12 ms)
[----------] 4 tests from test_context_config (121 ms total)
Tested with commit: f744574
Configure: --with-ucx (uses ucx debug build) --enable-gtest --without-cuda --enable-debug --with-valgrind
[valentinp@hpchead]> mpirun -np 2 osu_allreduce -x 0 -i 1 -m 4:4
[1619030637.735735] [jazz20:166840:0] mc_cuda.c:145 ERROR cuda failed with ret:100(no CUDA-capable device is detected)
[1619030637.735735] [jazz20:166841:0] mc_cuda.c:145 ERROR cuda failed with ret:100(no CUDA-capable device is detected)
@Sergei-Lebedev should we handle this w/o error maybe?
Configuration
OMPI: 5.0.0a1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: none
Test module: none
Nodes: jazz x6 (ppn=28(x6), nodelist=jazz[06,08,11,16,25,32])
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/html/test_stdout_PTm1nw.txt
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 168 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core --mca pmix_server_max_wait 8 /hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/installs/6uoo/tests/atest/mtt-tests.git/atest/src/atest -e 200
Output:
Total tests: 818
Total tests exec time (secs): 94
Total failed: 12
Total success: 806
UCX: 1.11
UCC: master
OMPI: v5.0.x
setup: GPU (cuda)
Prints errors to the log indefinitely:
[1624426315.715236] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715239] [vulcan02:9197 :0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426315.715244] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715247] [vulcan02:9197 :0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426315.715252] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426315.715256] [vulcan02:9197 :0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426315.715260] [vulcan02:9197 :0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703205] [vulcan04:18934:0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426301.703210] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703214] [vulcan04:18934:0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426301.703218] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
[1624426301.703221] [vulcan04:18934:0] reduce_scatter_knomial.c:151 TL_UCP ERROR failed to perform dt reduction
[1624426301.703226] [vulcan04:18934:0] mc_cuda_reduce_multi.cu:206 cuda mc ERROR cuda failed with ret:400(invalid resource handle)
We declare 3 types of threading support, but we don't explicitly say which APIs are allowed to be used concurrently from multiple threads.
UCX: 1.11
UCC: latest
OMPI: v5.0.x
test: https://gitlab.com/Mellanox/mtt-tests
+ cd mpi-small-tests
+ mpicc rdmacm_perf.c -g -o rdmacm_perf -libverbs -lrdmacm
+ cd misc
+ make MPI_HOME=/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g perftest.c -o perftest.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g collmeas.c -o collmeas.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g comm_create.c -o comm_create.o
/hpc/mtr_scrap/users/anatolyv/scratch/ucc/20210623_131945_106403_43826_jazz29.swx.labs.mlnx/ompi_src/install/bin/mpicc -c -g ctxalloc.c -o ctxalloc.o
ctxalloc.c: In function ‘main’:
ctxalloc.c:39:20: error: call to ‘MPI_Errhandler_set’ declared with attribute error: MPI_Errhandler_set was removed in MPI-3.0. Use MPI_Comm_set_errhandler instead.
MPI_Errhandler_set( newcomm1, MPI_ERRORS_RETURN );
^make: *** [ctxalloc.o] Error 1
I am trying to build Open MPI with UCC. I followed the README guidelines and was able to install UCX and UCC, and then I tried to install Open MPI as follows. I am not sure if this is an issue that needs to be reported to ompi, but I would like to get insight from the UCC community.
git clone --recursive https://github.com/open-mpi/ompi
cd ompi/
./autogen.pl
./configure --prefix=/home/vibhatha/github/dist/ompi --with-ucx=/home/vibhatha/github/dist/ucx --with-ucc=/home/vibhatha/github/dist/ucc
make -j8
When running make, I am getting the following error.
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:80:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
80 | typedef prte_reachable_t *(*prte_reachable_base_module_reachable_fn_t)(pmix_list_t *local_ifs,
| ^~~~~~~~~~~
| pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:81:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
81 | pmix_list_t *remote_ifs);
| ^~~~~~~~~~~
| pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:90:5: error: unknown type name ‘prte_reachable_base_module_reachable_fn_t’
90 | prte_reachable_base_module_reachable_fn_t reachable;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from reachable_netlink.h:17,
from reachable_netlink_module.c:24:
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:80:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
80 | typedef prte_reachable_t *(*prte_reachable_base_module_reachable_fn_t)(pmix_list_t *local_ifs,
| ^~~~~~~~~~~
| pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:81:72: error: unknown type name ‘pmix_list_t’; did you mean ‘pmix_class_t’?
81 | pmix_list_t *remote_ifs);
| ^~~~~~~~~~~
| pmix_class_t
/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/prtereachable.h:90:5: error: unknown type name ‘prte_reachable_base_module_reachable_fn_t’
90 | prte_reachable_base_module_reachable_fn_t reachable;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
reachable_netlink_module.c:34:24: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
34 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if);
| ^~~~~~~~~~
| pmix_list_t
make[4]: *** [Makefile:837: reachable_netlink_component.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
reachable_netlink_module.c:34:46: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
34 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if);
| ^~~~~~~~~~
| pmix_list_t
reachable_netlink_module.c: In function ‘netlink_reachable’:
reachable_netlink_module.c:62:5: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
62 | pmix_pif_t *local_iter, *remote_iter;
| ^~~~~~~~~~
| pmix_list_t
In file included from /home/vibhatha/github/ompi/3rd-party/prrte/src/mca/base/prte_mca_base_framework.h:21,
from /home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/base/base.h:20,
from reachable_netlink_module.c:25:
reachable_netlink_module.c:71:46: error: ‘pmix_pif_t’ undeclared (first use in this function); did you mean ‘pmix_list_t’?
71 | PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
| ^~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:18: note: in definition of macro ‘PMIX_LIST_FOREACH’
221 | for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
| ^~~~
reachable_netlink_module.c:71:46: note: each undeclared identifier is reported only once for each function it appears in
71 | PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
| ^~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:18: note: in definition of macro ‘PMIX_LIST_FOREACH’
221 | for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
| ^~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:24: error: expected expression before ‘)’ token
221 | for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
| ^
reachable_netlink_module.c:71:5: note: in expansion of macro ‘PMIX_LIST_FOREACH’
71 | PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
| ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:223:24: error: expected expression before ‘)’ token
223 | item = (type *) ((pmix_list_item_t *) (item))->pmix_list_next)
| ^
reachable_netlink_module.c:71:5: note: in expansion of macro ‘PMIX_LIST_FOREACH’
71 | PMIX_LIST_FOREACH(local_iter, local_ifs, pmix_pif_t)
| ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:221:24: error: expected expression before ‘)’ token
221 | for (item = (type *) (list)->pmix_list_sentinel.pmix_list_next; \
| ^
reachable_netlink_module.c:74:9: note: in expansion of macro ‘PMIX_LIST_FOREACH’
74 | PMIX_LIST_FOREACH(remote_iter, remote_ifs, pmix_pif_t)
| ^~~~~~~~~~~~~~~~~
/home/vibhatha/github/ompi/3rd-party/openpmix/src/class/pmix_list.h:223:24: error: expected expression before ‘)’ token
223 | item = (type *) ((pmix_list_item_t *) (item))->pmix_list_next)
| ^
reachable_netlink_module.c:74:9: note: in expansion of macro ‘PMIX_LIST_FOREACH’
74 | PMIX_LIST_FOREACH(remote_iter, remote_ifs, pmix_pif_t)
| ^~~~~~~~~~~~~~~~~
reachable_netlink_module.c:76:48: warning: implicit declaration of function ‘get_weights’ [-Wimplicit-function-declaration]
76 | reachable_results->weights[i][j] = get_weights(local_iter, remote_iter);
| ^~~~~~~~~~~
reachable_netlink_module.c: At top level:
reachable_netlink_module.c:85:24: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
85 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if)
| ^~~~~~~~~~
| pmix_list_t
reachable_netlink_module.c:85:46: error: unknown type name ‘pmix_pif_t’; did you mean ‘pmix_list_t’?
85 | static int get_weights(pmix_pif_t *local_if, pmix_pif_t *remote_if)
| ^~~~~~~~~~
| pmix_list_t
reachable_netlink_module.c:189:73: warning: initialization of ‘int’ from ‘prte_reachable_t * (*)(pmix_list_t *, pmix_list_t *)’ {aka ‘struct prte_reachable_t * (*)(struct pmix_list_t *, struct pmix_list_t *)’} makes integer from pointer without a cast [-Wint-conversion]
189 | netlink_reachable};
| ^~~~~~~~~~~~~~~~~
reachable_netlink_module.c:189:73: note: (near initialization for ‘prte_prtereachable_netlink_module.reachable’)
reachable_netlink_module.c:189:73: error: initializer element is not computable at load time
reachable_netlink_module.c:189:73: note: (near initialization for ‘prte_prtereachable_netlink_module.reachable’)
make[4]: *** [Makefile:837: reachable_netlink_module.lo] Error 1
make[4]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte/src/mca/prtereachable/netlink'
make[3]: *** [Makefile:1653: all-recursive] Error 1
make[3]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte/src'
make[2]: *** [Makefile:936: all-recursive] Error 1
make[2]: Leaving directory '/home/vibhatha/github/ompi/3rd-party/prrte'
make[1]: *** [Makefile:1347: all-recursive] Error 1
make[1]: Leaving directory '/home/vibhatha/github/ompi/3rd-party'
make: *** [Makefile:1469: all-recursive] Error 1
Please note that this was attempted on Ubuntu 20.04.
Configuration
OMPI: v5.0.0a1
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: none
Test module: none
Nodes: jazz x12 (ppn=28(x12), nodelist=jazz[12-21,29-30])
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/html/test_stdout_Ccnq4w.txt
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCC_TL_UCP_TUNE=allreduce:1 --map-by node --bind-to core /hpc/mtr_scrap/users/mtt/scratch/ucc/20210802_170250_189628_54877_jazz12.swx.labs.mlnx/installs/OSS4/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
Output:
# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size Bandwidth (MB/s)[jazz13.swx.labs.mlnx:163928] ../../../../opal/mca/common/ucx/common_ucx_wpool.h:526 Error: ucp_atomic_cswap64 failed: -1
[jazz13:00000] *** An error occurred in MPI_Win_post
[jazz13:00000] *** reported by process [1362690049,1]
[jazz13:00000] *** on win ucx window 3
[jazz13:00000] *** MPI_ERR_OTHER: known error not in list
[jazz13:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[jazz13:00000] *** and MPI will try to terminate your MPI job as well)+ rc=16
Spoke with @artemry-nv, he can do it in a separate PR
We have 2 fields in the team_params structure:
uint64_t team_size and
uint32_t oob.participants
both representing the same thing: the number of processes in a team. Yet they have different types: uint64_t and uint32_t.
We need a consistent definition here. First, both should be of the same size - which one do we want to support?
Second, what guidelines should the user follow when providing them?
@manjugv plz have a look.
Configs:
UCC:
$./ucc_info -v
# UCC version=1.0.355 revision f744574
# configured with: --prefix=... --with-ucx=[1] --enable-gtest --with-cuda --enable-debug
UCX:
# UCT version=1.11.0 revision 10705c5
# configured with: --enable-gtest --enable-examples --with-valgrind --enable-profiling --enable-frame-pointer --enable-stats --enable-memtrack --enable-fault-injection --enable-debug-data --enable-mt --prefix=[1]
Reproduce:
$GTEST_FILTER="test_lib.*" make -C test/gtest test_valgrind
...
[ OK ] test_lib.init_finalize (37350 ms)
[ RUN ] test_lib.init_multiple <> <>
Memcheck: the 'impossible' happened:
MC_(create_mempool): duplicate pool creation
host stacktrace:
==5653== at 0x5803FC7D: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x5803FD94: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x58040034: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x5800CF6D: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x5801DB6F: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x58058460: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x58096437: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==5653== by 0x580A4C2A: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
sched status:
running_tid=1
Thread 1: status = VgTs_Runnable (lwpid 5653)
==5653== at 0x529EDEA: ucs_mpool_init (mpool.c:88)
==5653== by 0xA97F955: ucc_mc_cuda_init (mc_cuda.c:149)
==5653== by 0x503FC53: ucc_mc_init (ucc_mc.c:51)
==5653== by 0x503D280: ucc_init_version (ucc_lib.c:282)
==5653== by 0x473D47: ucc_init (ucc.h:544)
==5653== by 0x474215: test_lib_init_multiple_Test::test_body() (test_lib.cc:38)
==5653== by 0x464DFA: ucc::test_base::run() (test.cc:89)
==5653== by 0x464E26: ucc::test_base::TestBodyProxy() (test.cc:95)
==5653== by 0x463BC3: ucc::test::TestBody() (test.h:107)
==5653== by 0x45948B: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3562)
==5653== by 0x454B81: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3598)
==5653== by 0x43C804: testing::Test::Run() (gtest-all.cc:3635)
==5653== by 0x43CF69: testing::TestInfo::Run() (gtest-all.cc:3812)
==5653== by 0x43D57B: testing::TestCase::Run() (gtest-all.cc:3930)
==5653== by 0x443BBF: testing::internal::UnitTestImpl::RunAllTests() (gtest-all.cc:5808)
==5653== by 0x45A260: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest-all.cc:3562)
==5653== by 0x455941: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest-all.cc:3598)
==5653== by 0x442967: testing::UnitTest::Run() (gtest-all.cc:5422)
==5653== by 0x464492: RUN_ALL_TESTS() (gtest.h:20059)
==5653== by 0x4643D9: main (main.cc:43)
Thread 2: status = VgTs_WaitSys (lwpid 5658)
==5653== at 0x5A126FD: ??? (in /usr/lib64/libpthread-2.17.so)
==5653== by 0x8B9CCEC: ucs_vfs_fuse_wait_for_path (vfs_fuse.c:265)
==5653== by 0x8B9D089: ucs_vfs_fuse_thread_func (vfs_fuse.c:345)
==5653== by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)
Thread 3: status = VgTs_WaitSys (lwpid 5717)
==5653== at 0x84CD33F: accept4 (in /usr/lib64/libc-2.17.so)
==5653== by 0x621ECD2: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653== by 0x62C0BD5: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653== by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)
Thread 4: status = VgTs_WaitSys (lwpid 5718)
==5653== at 0x84C120D: ??? (in /usr/lib64/libc-2.17.so)
==5653== by 0x62CAB60: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653== by 0x62A3DE9: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653== by 0x62C0BD5: ??? (in /usr/lib64/libcuda.so.460.32.03)
==5653== by 0x5A0BDD4: start_thread (in /usr/lib64/libpthread-2.17.so)
I was running the command in run_tests_torch_ucc.sh
echo "INFO: UCC barrier (CPU)"
/bin/bash ${SRC_DIR}/test/start_test.sh ${SRC_DIR}/test/torch_barrier_test.py --backend=gloo
And encountered this error:
[E torch_ucc_comm.cpp:156] ucc library wasn't initialized with mt support check ucc compile options
What does this "mt" mean? I guess it means "multi-thread"?
But I can't find any compile options related to multi-threaded mode.
Thank you.
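"mt" does stand for multi-threading. A sketch of requesting a multi-threaded UCC library at init time through the public API and verifying that the library actually granted it (assuming a standard UCC installation; error details elided):

```c
#include <ucc/api/ucc.h>

/* Sketch: ask for UCC_THREAD_MULTIPLE at init and confirm it was granted. */
int init_ucc_mt(ucc_lib_h *lib)
{
    ucc_lib_config_h config;
    ucc_lib_params_t params = {
        .mask        = UCC_LIB_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCC_THREAD_MULTIPLE,
    };

    if (ucc_lib_config_read(NULL, NULL, &config) != UCC_OK) {
        return -1;
    }
    ucc_status_t st = ucc_init(&params, config, lib);
    ucc_lib_config_release(config);
    if (st != UCC_OK) {
        return -1;
    }

    /* The library may fall back to a weaker mode; check what we got. */
    ucc_lib_attr_t attr = {.mask = UCC_LIB_ATTR_FIELD_THREAD_MODE};
    if (ucc_lib_get_attr(*lib, &attr) != UCC_OK ||
        attr.thread_mode != UCC_THREAD_MULTIPLE) {
        return -1; /* built or initialized without MT support */
    }
    return 0;
}
```

The torch_ucc error above is raised exactly when this post-init check fails, i.e. the UCC build in use cannot honor UCC_THREAD_MULTIPLE.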
Configuration
OMPI: 5.0.0a1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: none
Test module: none
Nodes: jazz x6 (ppn=28(x6), nodelist=jazz[06,08,11,16,25,32])
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/html/test_stdout_7HNGrU.txt
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 2 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core /hpc/mtr_scrap/users/mtt/scratch/ucc/20210705_042928_99018_46207_jazz06.swx.labs.mlnx/installs/6uoo/tests/osu_micro_benchmark/osu-micro-benchmarks-5.6.2/mpi/one-sided/osu_put_bibw
Output:
=============================================================
# OSU MPI_Put Bi-directional Bandwidth Test v5.6.2
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_post/start/complete/wait
# Size Bandwidth (MB/s)
[jazz06.swx.labs.mlnx:154306] ../../../../opal/mca/common/ucx/common_ucx_wpool.h:526 Error: ucp_atomic_cswap64 failed: -1
[jazz06:00000] *** An error occurred in MPI_Win_post
[jazz06:00000] *** reported by process [2164850689,0]
[jazz06:00000] *** on win ucx window 3
[jazz06:00000] *** MPI_ERR_OTHER: known error not in list
[jazz06:00000] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[jazz06:00000] *** and MPI will try to terminate your MPI job as well)
+ rc=16
The readme file points to the old component architecture diagram (docs/images/ucc_components.png), and there is no png file in the docs directory for the new diagram.
It looks to me as if there is an inconsistency in how UCC determines the maximum rank within a team. On the one hand, there's this code in src/utils/ucc_datastruct.h
that says the maximum rank is UINT32_MAX - 1
(4,294,967,294):
ucc/src/utils/ucc_datastruct.h
Lines 14 to 15 in 56df2df
On the other hand, within the UCP TL, this code seems to only handle a maximum of 24 bits (16,777,216):
ucc/src/components/tl/ucp/tl_ucp_tag.h
Lines 9 to 22 in 56df2df
I don't think either maximum is likely to be hit right now, but it would probably be good to be consistent, particularly as different TLs could potentially support different maximum sizes.
Addressing the changes introduced in #274
To be discussed in the WG
It looks like the UCC unit tests break the CI cluster: during the final phase of testing it hangs and the overall system becomes non-operable (only a hard reset helps). The tests were temporarily disabled. Looks like a regression (we found a UCC git commit without this issue).
Test: ucc-ompi.*;atest;atest
Configuration
OMPI: v5.0.0rc2
MOFED: MLNX_OFED_LINUX-5.4-1.0.3.0
Module: none
Test module: none
Nodes: jazz x2 (ppn=28(x2), nodelist=jazz[18,24])
MTT log:
Cmd:
/hpc/mtr_scrap/users/mtt/scratch/ucc/20211207_012044_191759_83678_jazz18.swx.labs.mlnx/ompi_src/install/bin/mpirun -np 56 --display map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node --bind-to core --mca pmix_server_max_wait 8 /hpc/mtr_scrap/users/mtt/scratch/ucc/20211207_012044_191759_83678_jazz18.swx.labs.mlnx/installs/YsHX/tests/atest/mtt-tests.git/atest/src/atest --time 0 -v 1 --test-cross 0
Output:
Total tests: 817
Total tests exec time (secs): 64
Total failed: 1
Total success: 816
comment:
the test fails in one out of 10 runs
Configuration
OMPI: 4.1.2a1
MOFED: MLNX_OFED_LINUX-5.2-2.2.0.0
Module: hpcx-gcc (2021-06-15)
Test module: none
Nodes: dgx x2 (ppn=80(x2), nodelist=swx-dgx[01,04])
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucc/20210617_075227_34765_41420_swx-dgx01.swx.labs.mlnx/html/test_stdout_err9o5.txt
Cmd:
/hpc/local/benchmarks/daily/next/2021-06-15/hpcx-gcc-redhat7.6/ompi/bin/mpirun -np 64 --map-by node --bind-to hwthread -x UCC_TL_NCCL_TUNE=0 /hpc/mtr_scrap/users/mtt/scratch/ucc/20210617_075227_34765_41420_swx-dgx01.swx.labs.mlnx/installs/HAOp/tests/ucc_repo/ucc.git/test/mpi/.libs/ucc_test_mpi --colls allreduce --mtypes cuda --inplace 1 --set_device 2
Output:
===== UCC MPI TEST REPORT =====
total tests : 224
passed : 223
skipped : 0
failed : 1
elapsed : 18s
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: swx-dgx01
Remote host: swx-dgx04
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
--------------------------------------------------------------------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
When one process creates too many UCP contexts (with multiple interfaces), UCP fails to initialize. Currently we segfault; we need to stop gracefully instead.
repro:
UCC_TL_NCCL_TUNE=0 ./test/gtest/gtest
Value of: ucc_context_create(lib_h, &ctx_params, ctx_config, &ctx_h)
Actual: -6
Expected: UCC_OK
Which is: 0
[1620760142.526132] [jazz23:193943:0] sock.c:139 UCX ERROR socket create failed: Too many open files
[1620760142.526143] [jazz23:193943:0] sock.c:139 UCX ERROR socket create failed: Too many open files
[1620760142.526148] [jazz23:193943:0] sock.c:139 UCX ERROR socket create failed: Too many open files
[1620760142.526962] [jazz23:193943:0] tl_ucp_context.c:89 TL_UCP ERROR failed to create ucp worker, Input/output error
[1620760142.529811] [jazz23:193943:0] ucc_context.c:293 UCC WARN failed to create tl context for ucp
[1620760142.529821] [jazz23:193943:0] cl_basic_context.c:23 CL_BASIC WARN TL UCP context is not available, CL BASIC can't proceed
[1620760142.529824] [jazz23:193943:0] ucc_context.c:362 UCC WARN failed to create cl context for basic, skipping
[1620760142.529827] [jazz23:193943:0] ucc_context.c:370 UCC ERROR no CL context created in ucc_context_create
../../../test/gtest/common/test_ucc.cc:22: Failure
Value of: ucc_context_create(lib_h, &ctx_params, ctx_config, &ctx_h)
Actual: -6
Expected: UCC_OK
Which is: 0
[jazz23:193943:0:193943] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x88)
/labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c: [ ucc_team_create_post_single() ]
...
37 {
38 ucc_status_t status;
39 if ((team->params.mask & UCC_TEAM_PARAM_FIELD_EP) &&
==> 40 (team->params.mask & UCC_TEAM_PARAM_FIELD_EP_RANGE) &&
41 (team->params.ep_range == UCC_COLLECTIVE_EP_RANGE_CONTIG)) {
42 team->rank =
43 team->params.ep; //TODO need to make sure we don't exceed rank size
==== backtrace (tid: 193943) ====
0 0x00000000000080c5 ucc_team_create_post_single() /labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c:40
1 0x00000000000080c5 ucc_team_create_post() /labhome/valentinp/workspace/ucc/build_rel/src/../../src/core/ucc_team.c:161
2 0x000000000047d7bb UccTeam::init_team() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:160
3 0x000000000047d7bb UccTeam::init_team() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:160
4 0x000000000047ef41 UccTeam::UccTeam() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:221
5 0x000000000047f21d construct<UccTeam, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>() /usr/include/c++/4.8.2/ext/new_allocator.h:120
6 0x000000000047f21d __shared_ptr<std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>() /usr/include/c++/4.8.2/bits/shared_ptr_base.h:961
7 0x000000000047f21d shared_ptr<std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>() /usr/include/c++/4.8.2/bits/shared_ptr.h:316
8 0x000000000047f21d allocate_shared<UccTeam, std::allocator<UccTeam>, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>() /usr/include/c++/4.8.2/bits/shared_ptr.h:598
9 0x000000000047f21d make_shared<UccTeam, std::vector<std::shared_ptr<UccProcess>, std::allocator<std::shared_ptr<UccProcess> > >&>() /usr/include/c++/4.8.2/bits/shared_ptr.h:614
10 0x000000000047f21d UccJob::create_team() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test_ucc.cc:303
11 0x00000000004b3651 test_team_team_create_multiple_preconnect_Test::test_body() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/core/test_team.cc:53
12 0x00000000004b3651 std::vector<std::shared_ptr<UccTeam>, std::allocator<std::shared_ptr<UccTeam> > >::push_back() /usr/include/c++/4.8.2/bits/stl_vector.h:920
13 0x00000000004b3651 test_team_team_create_multiple_preconnect_Test::test_body() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/core/test_team.cc:53
14 0x000000000047aa96 ucc::test_base::run() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/test.cc:89
15 0x0000000000476523 HandleSehExceptionsInMethodIfSupported<testing::Test, void>() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3562
16 0x000000000046990d testing::Test::Run() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3635
17 0x00000000004699dc testing::TestInfo::Run() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3812
18 0x0000000000469b3f testing::TestCase::Run() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:3930
19 0x000000000046df47 testing::internal::UnitTestImpl::RunAllTests() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:5808
20 0x000000000046e24b testing::internal::UnitTestImpl::RunAllTests() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest-all.cc:5725
21 0x0000000000453f89 RUN_ALL_TESTS() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/gtest.h:20059
22 0x0000000000453f89 main() /labhome/valentinp/workspace/ucc/build_rel/test/gtest/../../../test/gtest/common/main.cc:43
23 0x00000000000223d5 __libc_start_main() ???:0
24 0x0000000000455e60 _start() ???:0
=================================
I've got several questions regarding OOB Allgather:
Currently, the topology information (socket, core, and binding information) is computed by each process on the node. This issue is a placeholder to add an optimization where one process computes this information and shares it with the other processes on the node.
Related to #266
UCC code like below:
ucc_init
ucc_context_create -> context
for i in [1..100]:
    ucc_team_create(context) -> team
    ucc_alltoall(team)
    ucc_team_destroy(team)
leaks memory when executed using TL/UCP. The reason is that each time we create a team with TL/UCP we allocate UCP endpoints. The call to alltoall forces ALL eps to connect (N^2 total connections). The call to ucc_team_destroy results in ucp_ep_close(FLUSH) being executed on each local endpoint. HOWEVER: for some transports (rc is one of them) only the local part of the p2p ep (qp) is cleaned up by UCP, i.e. the corresponding part of the connection stored on the remote ucp worker is left behind and consumes memory -> leak. Attached is a stand-alone UCP reproducer for the issue.
ucp_test.txt
Looks like this is a known limitation, and we can't expect it to be resolved in UCP soon.
What can we do in UCC about it?