
mana's Issues

Missing list.h for contrib/infiniband

The Makefile.in includes the following lines:

HEADERS = $(DMTCP_INCLUDE_PATH)/dmtcp.h lib/list.h
ibvctx.h ibv_internal.h ibvidentifier.h debug.h

However, the repo does not contain lib/list.h anywhere. This prevents the infiniband plugin from compiling at all, because the subsequent pattern rule that depends on HEADERS fails.

Where is this file?
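For reference while the missing header is tracked down: assuming lib/list.h was meant to be an intrusive doubly-linked list header in the usual kernel/Pintos style (an assumption; the real file may differ), a minimal stand-in could look like this:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical stand-in for the missing lib/list.h: a minimal
 * intrusive doubly-linked list with a sentinel head node.
 * The actual header shipped with DMTCP/MANA may differ. */
struct list_elem { struct list_elem *prev, *next; };
struct list { struct list_elem head; };

static inline void list_init(struct list *l) {
  l->head.prev = l->head.next = &l->head;   /* empty list points at itself */
}
static inline int list_empty(const struct list *l) {
  return l->head.next == &l->head;
}
static inline void list_push_back(struct list *l, struct list_elem *e) {
  e->prev = l->head.prev;                   /* splice before the sentinel */
  e->next = &l->head;
  l->head.prev->next = e;
  l->head.prev = e;
}
static inline void list_remove(struct list_elem *e) {
  e->prev->next = e->next;                  /* unlink without freeing */
  e->next->prev = e->prev;
}
```

This is only a sketch so the HEADERS rule has something to point at; the infiniband plugin presumably relies on the real header's exact API.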

Are in-flight messages lost after checkpointing?

Hi, I'm interested in the idea of checkpoint & restore. In the case of a distributed app, I'm wondering: are in-flight messages lost after checkpointing?

I've noticed that MANA currently may have large runtime overhead or a loss of accuracy on restart, so I guess the answer is yes?

Supporting MPI_Buffer_attach in MANA

One of our users wanted to run TiledArray (https://github.com/ValeevGroup/tiledarray) with MANA, but ran into the following error:

[41000] ERROR at mpi_unimplemented_wrappers.cpp:79 in MPI_Buffer_attach; REASON='JASSERT(false) failed'
"wrapper: MPI_Buffer_attach not implemented" = wrapper: MPI_Buffer_attach not implemented
ccd (41000): Terminating...

Could you please add support for MPI_Buffer_attach?

Thanks,
Zhengji
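To illustrate what such a wrapper has to track: MPI_Buffer_attach registers a single user buffer for buffered sends, and a checkpoint/restart layer would need to remember that buffer so it can be re-attached in the restarted lower half. The sketch below uses mock types instead of mpi.h, and all names are illustrative, not MANA's actual wrapper API:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the state an MPI_Buffer_attach wrapper must record
 * (mock handles; not MANA's real code). MPI permits at most one
 * attached buffer per process at a time. */
static void *g_buffer = NULL;
static int   g_size   = 0;

int wrapper_Buffer_attach(void *buf, int size) {
  if (g_buffer != NULL)
    return -1;            /* MPI allows only one attached buffer */
  g_buffer = buf;         /* recorded so restart can re-attach it */
  g_size = size;
  /* A real wrapper would also forward to the lower-half
   * MPI_Buffer_attach here. */
  return 0;
}

int wrapper_Buffer_detach(void **buf, int *size) {
  if (g_buffer == NULL)
    return -1;            /* nothing attached */
  *buf = g_buffer;        /* MPI_Buffer_detach returns the old buffer */
  *size = g_size;
  g_buffer = NULL;
  g_size = 0;
  return 0;
}
```

The checkpoint-time work would then be a detach before suspension and a re-attach of the recorded pointer/size after restart.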

`./configure-mana` enables debug by default, whereas just running `./configure` apparently does not.

This behavior is undocumented and might lead users to enable debug builds accidentally.

./configure and ./configure-mana are not descriptive names anyway. Perhaps ./configure-mana should be renamed to ./configure-debug, if it is not required for a normal build. I have been working on Perlmutter these past months using ./configure and have not observed any glitches related to it.

MANA CP Failed for MPI Program

Hello Team,

I can successfully checkpoint and restore single-threaded and multi-threaded (OpenMP) applications on our HPC system using the DMTCP and MANA checkpointing tools.
Please note:

  1. Fixes were applied for known bugs.
  2. The failed cases are noted and are the exception.

To checkpoint a parallel application using MANA, I referred to the documentation links (CentOS Build Info and Build Info for other OSes, respectively) and prepared scripts customized to our HPC system, attached to this issue as a ZIP archive.

A standard-error trace is produced when checkpointing an MPI program; it is captured in the files error.log and error_verbose.log, respectively. The same error occurs on the MANA branches master, refactoring, and feature/centos.

The following are the build steps:

  1. ./build_glibc.sh
  2. ./build_mpich.sh
  3. ./build_liblzma.sh
  4. ./build_mana.sh
  5. source ./env.sh

Additionally, I have followed the debug steps suggested by a MANA developer, which print the process address layout:

  1. gdb mana/bin/lh_proxy
  2. break main
  3. run
  4. source mana/util/gdb-dmtcp-utils
  5. procmaps

The dump is as follows:

[********************]$ gdb ./bin/lh_proxy
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bin/lh_proxy...done.
(gdb) break main
Breakpoint 1 at 0xe003470: file lh_proxy.c, line 35.
(gdb) r
Starting program: /scratch/hpc-prf-dmtcp/iteration_2_mana_master/mana/bin/lh_proxy
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x000000000e2ffd44 in _IO_new_fclose (fp=0x0) at iofclose.c:48
48        if (fp->_flags & _IO_IS_FILEBUF)
(gdb) source ./util/gdb-dmtcp-utils
(gdb) procmaps

***************************************** procmaps *********************************************

0e000000-0e51b000 r-xp 00000000 4fa:ff5a 216174549593571439 *****************/iteration_2_mana_master/mana/bin/lh_proxy
0e51c000-0e526000 rw-p 0051b000 4fa:ff5a 216174549593571439 *****************/iteration_2_mana_master/mana/bin/lh_proxy
0e526000-0e595000 rw-p 00000000 00:00 0 [heap]
155555551000-155555554000 r--p 00000000 00:00 0 [vvar]
155555554000-155555556000 r-xp 00000000 00:00 0 [vdso]
7ffffffda000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
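One observation on the backtrace above: _IO_new_fclose is entered with fp=0x0, and glibc's fclose dereferences fp->_flags unconditionally, so calling fclose(NULL) segfaults exactly at the line shown. A defensive helper illustrating the crash cause (illustrative only; not MANA code, and the real fix is to find out why lh_proxy passes a NULL FILE*):

```c
#include <stdio.h>
#include <assert.h>

/* fclose(NULL) is undefined behavior; glibc reads fp->_flags and
 * segfaults, matching the backtrace in the gdb dump above. */
int safe_fclose(FILE *fp) {
  if (fp == NULL)
    return EOF;   /* treat a never-opened stream as already closed */
  return fclose(fp);
}
```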

Please help resolve this error.

Thanks.

MANA_Build.zip

Singularity / apptainer / container support

Hi,

I cannot find anywhere in this repository an answer to the question of container support, in particular for Singularity/Apptainer. I haven't found the time to try it, as I already spent a lot of time trying other checkpoint/restore technologies before looking at MANA!

Thank you.

linker error on CentOS

When installing MANA with `./configure; make -j mana` on CentOS, I am getting:

make[4]: Entering directory '/home/osboxes/LAB/mana/contrib/mpi-proxy-split/lower-half'
if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
  rm -f tmp.sh; \
  mpicc -show -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
    sed -e 's^-lunwind ^ ^'> tmp.sh; \
  sh tmp.sh; \
  rm -f tmp.sh; \
else \
  mpicc -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
            lh_proxy.o libproxy.a -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group; \
fi
/usr/bin/ld: cannot find -lmpi
/usr/bin/ld: cannot find -llzma
/usr/bin/ld: cannot find -lz
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lxml2
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
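Since lh_proxy is linked with -static, the linker needs static archives (lib*.a) for every -l flag, and "cannot find -lmpi ... -lc" usually means the corresponding *-static / *-devel packages are not installed. A quick sanity check along these lines may help (directory list is illustrative; adjust for your distribution):

```shell
# Report which static archives the lh_proxy link line can find.
# Usage: check_static_libs "<dir1> <dir2> ..." lib1 lib2 ...
check_static_libs() {
  dirs="$1"; shift
  for lib in "$@"; do
    found=""
    for d in $dirs; do
      # A static link needs lib<name>.a, not lib<name>.so
      [ -f "$d/lib$lib.a" ] && found="$d/lib$lib.a" && break
    done
    if [ -n "$found" ]; then
      echo "lib$lib.a: $found"
    else
      echo "lib$lib.a: MISSING"
    fi
  done
}
```

For example, `check_static_libs "/usr/lib64 /usr/lib" mpi lzma z m xml2 rt pthread c` would show which of the archives from the failing link line are actually present.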

Unable to checkpoint: Checkpointing during dense collective calls hangs

When attempting to checkpoint with densely grouped collective calls, the checkpointing process does not complete. Instead, the ranks are unable to progress beyond the PRESUSPEND barrier.

Coordinator:
  Host: nid00223
  Port: 7779
Client List:
#, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE, BARRIER
1, Allgather_test.mana.exe[40000:21271]@nid00223, 658e8dad50cc029e-40000-4dabcde180831, WorkerState::PRESUSPEND, MANA-PRESUSPEND-469
2, Allgather_test.mana.exe[41000:21270]@nid00223, 658e8dad50cc029e-41000-4dabcde1743cc, WorkerState::PRESUSPEND, MANA-PRESUSPEND-469
3, Allgather_test.mana.exe[42000:21269]@nid00223, 658e8dad50cc029e-42000-4dabcde16dcce, WorkerState::PRESUSPEND, MANA-PRESUSPEND-470
4, Allgather_test.mana.exe[43000:21268]@nid00223, 658e8dad50cc029e-43000-4dabcde155b45, WorkerState::PRESUSPEND, MANA-PRESUSPEND-470

Based on my testing, it appears that the current_phase variable for some ranks is not being set to IS_READY, which is what allows a rank to proceed past the PRESUSPEND barrier. More specifically, one or two ranks enter a collective call with commit_begin, and the checkpoint request arrives before the other ranks reach commit_begin. The ranks that have entered the commit therefore have current_phase = IN_CS, while the other ranks have current_phase = IS_READY. None of the ranks can then proceed, and the ranks that entered the commit early never reach commit_finish.

I believe this logic is related to the sequence number changes to the two phase commit. @gc00 @xuyao0127 do you have any pointers on where you think the error might be?

To reproduce, you can run:
python3 $MANA_ROOT/mpi-proxy-split/test/mana_test.py $MANA_ROOT/mpi-proxy-split/test/Allgather_test -i 100000000 -n 4
on Cori, then checkpoint manually.
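The race described above can be modeled in a few lines. This is a toy model only: the names mirror the issue text, not MANA's actual implementation. The coordinator can release the PRESUSPEND barrier only when every rank reports IS_READY, so a checkpoint request that lands while some ranks are IN_CS (past commit_begin) and others are IS_READY satisfies neither condition for progress:

```c
#include <assert.h>

/* Toy model of the per-rank phase state from the issue description
 * (illustrative names, not MANA's real code). A rank that has
 * executed commit_begin stays IN_CS until the collective finishes. */
typedef enum { IS_READY, IN_CS } phase_t;

/* Checkpointing can proceed only if every rank is IS_READY. */
int can_checkpoint(const phase_t *phase, int nranks) {
  for (int i = 0; i < nranks; i++)
    if (phase[i] != IS_READY)
      return 0;
  return 1;
}
```

In the hang reported here, ranks 3 and 4 would be IN_CS while ranks 1 and 2 are IS_READY, so can_checkpoint never becomes true, and the IN_CS ranks never reach commit_finish because their peers are suspended.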

Centos7.6 MPICH ping_pang.c restart segmentation fault

[root@6248r-node121 test-ckpt-restart]# mpirun -np 2 mana_restart
[28951] mtcp_restart.c:799 main:
[Rank: 0] Choosing ckpt image: ./ckpt_rank_0/ckpt_a.out_3c3936238b6c9197-41000-1ceb312627b83.dmtcp
[28952] mtcp_restart.c:799 main:
[Rank: 1] Choosing ckpt image: ./ckpt_rank_1/ckpt_a.out_3c3936238b6c9197-40000-1ceb311d18559.dmtcp
[28951] mtcp_restart.c:1458 unmap_memory_areas_and_restore_vdso:
***Error: vdso/vvar order was different during ckpt.
[28952] mtcp_restart.c:1458 unmap_memory_areas_and_restore_vdso:
***Error: vdso/vvar order was different during ckpt.
/home/mana/bin/mana_restart: line 125: 28952 Segmentation fault (core dumped) $dir/dmtcp_restart --mpi --join-coordinator --coord-host $submissionHost --coord-port $submissionPort $options

When I run `make -j mana` according to the CentOS install instructions, I get the following error:
make[3]: `../../lib/dmtcp/libmpidummy.so' is up to date.
make[3]: Leaving directory `/root/mana/contrib/mpi-proxy-split'
make ../../bin/lh_proxy
make[3]: Entering directory `/root/mana/contrib/mpi-proxy-split'
make -C lower-half install
make[4]: Entering directory `/root/mana/contrib/mpi-proxy-split/lower-half'
if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
  rm -f tmp.sh; \
  mpicc -show -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
    sed -e 's^-lunwind ^ ^'> tmp.sh; \
  sh tmp.sh; \
  rm -f tmp.sh; \
else \
  mpicc -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group; \
fi
/opt/rh/devtoolset-8/root/usr/libexec/gcc/x86_64-redhat-linux/8/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
cp -f lh_proxy gethostbyname-static/gethostbyname_static.o ../../../bin/
cp: cannot stat ‘lh_proxy’: No such file or directory
make[4]: *** [install] Error 1
make[4]: Leaving directory `/root/mana/contrib/mpi-proxy-split/lower-half'
make[3]: *** [../../bin/lh_proxy] Error 2
make[3]: Leaving directory `/root/mana/contrib/mpi-proxy-split'
make[2]: *** [install] Error 2
make[2]: Leaving directory `/root/mana/contrib/mpi-proxy-split'
make[1]: *** [mana_part2] Error 2
make[1]: Leaving directory `/root/mana'
make: *** [mana] Error 2

I found that only the shared udev library is available; there is no static libudev.a. So I downloaded the systemd source RPM and recompiled it with static libraries enabled, but that also produced an error:

(error screenshot attached)

I then changed the configure script manually to drop those restrictions, and it compiled against libudev.a.

Error configuring and compiling mana

Hi,

When cloning the feature/dmtcp-master branch on a Centos 7 machine, configuring mana gives the following warning:
configure: WARNING: no configuration information is in dmtcp

And when trying to compile:

make[1]: *** No targets specified and no makefile found. Stop.

Thanks!

Aarch64 support

Hi,

When trying to compile MANA on an AArch64 cluster, I get the following output:

mtcp_restart.c: In function ‘restorememoryareas’:
mtcp_restart.c:587:3: warning: #warning __FUNCTION__ "TODO: Implementation for ARM64" [-Wcpp]
 # warning __FUNCTION__ "TODO: Implementation for ARM64"
   ^
$HOME/mana_install/mana/restart_plugin/getcontext.S: Assembler messages:
$HOME/mana_install/mana/restart_plugin/getcontext.S:42: Error: unknown mnemonic `movq' -- `movq %rbx,128(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:43: Error: unknown mnemonic `movq' -- `movq %rbp,120(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:44: Error: unknown mnemonic `movq' -- `movq %r12,72(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:45: Error: unknown mnemonic `movq' -- `movq %r13,80(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:46: Error: unknown mnemonic `movq' -- `movq %r14,88(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:47: Error: unknown mnemonic `movq' -- `movq %r15,96(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:49: Error: unknown mnemonic `movq' -- `movq %rdi,104(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:50: Error: unknown mnemonic `movq' -- `movq %rsi,112(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:51: Error: unknown mnemonic `movq' -- `movq %rdx,136(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:52: Error: unknown mnemonic `movq' -- `movq %rcx,152(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:53: Error: unknown mnemonic `movq' -- `movq %r8,40(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:54: Error: unknown mnemonic `movq' -- `movq %r9,48(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:56: Error: unknown mnemonic `movq' -- `movq (%rsp),%rcx'
$HOME/mana_install/mana/restart_plugin/getcontext.S:57: Error: unknown mnemonic `movq' -- `movq %rcx,168(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:58: Error: unknown mnemonic `leaq' -- `leaq 8(%rsp),%rcx'
$HOME/mana_install/mana/restart_plugin/getcontext.S:59: Error: unknown mnemonic `movq' -- `movq %rcx,160(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:65: Error: unknown mnemonic `leaq' -- `leaq 424(%rdi),%rcx'
$HOME/mana_install/mana/restart_plugin/getcontext.S:66: Error: unknown mnemonic `movq' -- `movq %rcx,224(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:68: Error: unknown mnemonic `fnstenv' -- `fnstenv (%rcx)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:69: Error: unknown mnemonic `fldenv' -- `fldenv (%rcx)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:70: Error: unknown mnemonic `stmxcsr' -- `stmxcsr 448(%rdi)'
$HOME/mana_install/mana/restart_plugin/getcontext.S:74: Error: unknown mnemonic `leaq' -- `leaq 296(%rdi),%rdx'
$HOME/mana_install/mana/restart_plugin/getcontext.S:75: Error: unknown mnemonic `xorl' -- `xorl %esi,%esi'
$HOME/mana_install/mana/restart_plugin/getcontext.S:77: Error: unknown mnemonic `xorl' -- `xorl %edi,%edi'
$HOME/mana_install/mana/restart_plugin/getcontext.S:81: Error: unknown mnemonic `movl' -- `movl $8,%r10d'
$HOME/mana_install/mana/restart_plugin/getcontext.S:82: Error: unknown mnemonic `movl' -- `movl $135,%eax'
$HOME/mana_install/mana/restart_plugin/getcontext.S:83: Error: unknown mnemonic `syscall' -- `syscall'
$HOME/mana_install/mana/restart_plugin/getcontext.S:84: Error: unknown mnemonic `cmpq' -- `cmpq $-4095,%rax'
$HOME/mana_install/mana/restart_plugin/getcontext.S:89: Error: unknown mnemonic `xorl' -- `xorl %eax,%eax'
make[3]: *** [getcontext.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: Leaving directory `$HOME/mana_install/mana/dmtcp/src/mtcp'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `$HOME/mana_install/mana/dmtcp/src'
make[1]: *** [dmtcp] Error 2
make[1]: Leaving directory `$HOME/mana_install/mana/dmtcp'
make: *** [dmtcp] Error 2

My understanding is that the AArch64 architecture is not yet supported (the errors show the x86-64 assembly in getcontext.S being rejected by the AArch64 assembler); is that correct? Is there ongoing development to make MANA work on this architecture?

Thanks!

Marc

Deadlock in the pre-ckpt phase when the checkpoint interval is short

Quoting @marcpb94 from #133:

It seems to be working now! I tried with mpi_hello_world with 1 rank and a heat distribution application (that we usually use for testing) with 4 ranks.

However, there seems to be an issue we already noticed in the old version of MANA, and we were hoping it was fixed in the new version. It seems to happen with our heat distribution application, I attach the source code so that you are able to reproduce the problem. heatdis.zip

The issue is that when checkpointing with relatively short intervals (and letting it execute for a few minutes), the execution eventually encounters a deadlock in the pre-checkpoint phase. Specifically, at least one of the MPI processes gets stuck in the drainSendRecv() function in the DMTCP_EVENT_PRECHECKPOINT case of mpi_plugin_event_hook().

Handling completed asynchronous request mappings

We are seeing a bug with CP2K where asynchronous request mappings are not being handled correctly. For example, we are observing the following sequence of events:

  1. An Isend call creates a new virtual --> real request mapping.
  2. The request is tested several times (normal behavior).
  3. The request completes, and Test_internal sets the real request value to MPI_REQUEST_NULL (also expected behavior).
  4. The next time the request is tested, the Test method removes the virtual --> real request mapping from our table.
  5. Wait is then called on the virtual request, which no longer has a mapping; this causes Test_internal to fail with an invalid request.

How should this issue be approached? One potential fix is moving the checks for deleted mappings from Test to Test_internal (see here). However, this wouldn't solve the problem of multiple calls to Test or Wait on an already-completed request. Given that, would checking for deleted mappings in Test_internal be the way to solve this issue?
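One alternative worth considering is a tombstone scheme: instead of deleting the virtual-to-real mapping as soon as the real request completes, keep the entry with the real handle set to a null sentinel, and free it only when Wait (or an equivalent consuming call) retires the virtual request. The sketch below uses mock integer handles rather than MPI types, and every name in it is illustrative, not MANA's actual table code:

```c
#include <assert.h>

/* Toy virtual->real request table with tombstones (mock handles,
 * not MANA's real implementation). */
#define REQ_NULL  (-1)
#define MAX_REQS  16

static int real_of[MAX_REQS];   /* virtual id -> real handle */
static int valid[MAX_REQS];     /* nonzero while a mapping exists */

void map_add(int virt, int real) { real_of[virt] = real; valid[virt] = 1; }

/* Test wrapper: marks completion but keeps the mapping (tombstone),
 * so repeated Test calls on a completed request still succeed. */
int req_test(int virt, int *flag) {
  if (!valid[virt]) return -1;        /* truly unknown request */
  if (real_of[virt] == REQ_NULL) {    /* already completed */
    *flag = 1;
    return 0;
  }
  /* A real wrapper would call the lower-half Test here; we simulate
   * immediate completion for the sketch. */
  real_of[virt] = REQ_NULL;
  *flag = 1;
  return 0;
}

/* Wait wrapper: the only place the mapping is actually freed. */
int req_wait(int virt) {
  int flag = 0;
  if (req_test(virt, &flag) != 0) return -1;
  valid[virt] = 0;
  return 0;
}
```

Under this scheme, step 4 from the sequence above never removes the mapping, so the later Wait in step 5 finds the tombstone and succeeds instead of failing with an invalid request. The trade-off is that tombstones accumulate until the application retires each request.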

Missing mana_coordinator.Tpo

Trying the neil-development branch and the "make" phase produces this:

../contrib/mpi-proxy-split/mana_coordinator.cpp:313:1: fatal error: opening dependency file .deps/../contrib/mpi-proxy-split/mana_coordinator.Tpo: No such file or directory

When will the version with aarch64 support be released?

MANA is really a great and useful tool, and I'm looking forward to using it under aarch64 + CentOS 7.6 + OpenMPI.

When will the version that supports aarch64 + CentOS 7.6 + OpenMPI be released? Will there be a release or an announcement? How can I tell whether the main branch supports aarch64?

core dump on CentOS

After compiling without errors, I am running MANA as follows:

  $>  /****/mana/bin/dmtcp_coordinator --mpi --daemon --exit-on-last -q -i200
  $> mpiexec -np 4 /****/mana/bin/mana_launch ./heat_mpi

trying to checkpoint a simple heat distribution app (from -> https://repository.prace-ri.eu/git/CodeVault/training-material/parallel-programming/MPI.git)

I am getting a vast amount (~80,000 lines) of this:

a.out: mmap64.c:122: getNextAddr: Assertion `0' failed.

before getting finally:

/****/mana/bin/mana_launch: line 156: 188021 Segmentation fault      (core dumped) env MPICH_SMP_SINGLE_COPY_OFF=1 $dir/dmtcp_launch --coord-host $submissionHost --coord-port $submissionPort --no-gzip --join-coordinator --disable-dl-plugin --with-plugin $plugindir/lib/dmtcp/libmana.so $options

It's not the app, since I have tried it with others.

MANA build fails on CentOS 8

MANA build is failing on CentOS 8 with the following message:

sh: tmp.sh: No such file or directory
cp -f lh_proxy lh_proxy_da gethostbyname-static/gethostbyname_static.o /home/tom/mana/bin/
cp: cannot stat 'lh_proxy': No such file or directory
cp: cannot stat 'lh_proxy_da': No such file or directory
make[3]: *** [Makefile:97: install] Error 1
make[3]: Leaving directory '/home/tom/mana/mpi-proxy-split/lower-half'
make[2]: *** [Makefile:106: /home/tom/mana/bin/lh_proxy] Error 2
make[2]: Leaving directory '/home/tom/mana/mpi-proxy-split'
make[1]: *** [Makefile:120: install] Error 2
make[1]: Leaving directory '/home/tom/mana/mpi-proxy-split'
make: *** [Makefile:49: mana] Error 2

Building MANA error on NERSC Perlmutter

I attempted to build MANA on the NERSC Perlmutter system but encountered the following error during `make -j mana`:

make -j mana > log.mana.txt (log.mana.txt)
openjdk version "11.0.20.1" 2023-08-24
OpenJDK Runtime Environment (build 11.0.20.1+0-suse-150000.3.102.1-x8664)
OpenJDK 64-Bit Server VM (build 11.0.20.1+0-suse-150000.3.102.1-x8664, mixed mode)
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: libproxy.a(libproxy.o): in function `getVdsoPointerInLinkMap':
/global/cfs/cdirs/../../checkpointR/mana/mpi-proxy-split/lower-half/libproxy.c:237:(.text+0x5ae): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: /global/cfs/cdirs/cr/pm_dependencies/libcurl.a(libcurl_la-netrc.o): in function `Curl_parsenetrc':
netrc.c:(.text+0x664): warning: Using 'getpwuid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: libnl3.a(lib_libnl_3_la-utils.o): in function `nl_ip_proto2str':
/global/cfs/cdirs/../../checkpointR/mana/mpi-proxy-split/lower-half/rpmbuild/BUILD/libnl-3.3.0/lib/utils.c:868:(.text+0xf8d): warning: Using 'getprotobynumber' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: libnl3.a(lib_libnl_3_la-utils.o): in function `nl_str2ip_proto':
/global/cfs/cdirs/../../checkpointR/mana/mpi-proxy-split/lower-half/rpmbuild/BUILD/libnl-3.3.0/lib/utils.c:881:(.text+0xff9): warning: Using 'getprotobyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: gethostbyname-static/gethostbyname_static.o: in function `gethostbyname2_r':
/global/cfs/cdirs/../../checkpointR/mana/mpi-proxy-split/lower-half/gethostbyname-static/gethostbyname_static.c:154:(.text+0x61f): warning: Using 'gethostbyname_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: warning: lib_libmpi_gnu_123_la-cray_memcpy_rome.o: missing .note.GNU-stack section implies executable stack
/usr/lib64/gcc/x86_64-suse-linux/12/../../../../x86_64-suse-linux/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
day1_mpi.f90:151:20:

128 | call MPI_Recv ( r_buffer, r_dim, MPI_REAL, source, tag, &
| 2
......
151 | call MPI_Recv ( i_buffer, i_dim, MPI_INTEGER, source, tag, &
| 1
Warning: Type mismatch between actual argument at (1) and actual argument at (2) (INTEGER(4)/REAL(4)).
day1_mpi.f90:185:20:

119 | call MPI_Send ( i_buffer, count, MPI_INTEGER, dest, tag, &
| 2
......
185 | call MPI_Send ( r_buffer, count2, MPI_REAL, dest, tag, &
| 1
Warning: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(4)/INTEGER(4)).
wave_mpi.f90:459:22:

440 | call MPI_Recv ( buffer, 2, MPI_INTEGER, i, collect1, MPI_COMM_WORLD, &
| 2
......
459 | call MPI_Recv ( u_global(i_global_lo+1), n_local2, MPI_DOUBLE_PRECISION, &
| 1
Warning: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(8)/INTEGER(4)).
wave_mpi.f90:490:20:

486 | call MPI_Send ( buffer, 2, MPI_INTEGER, 0, collect1, MPI_COMM_WORLD, error )
| 2
......
490 | call MPI_Send ( u_local, n_local, MPI_DOUBLE_PRECISION, 0, collect2, &
| 1
Warning: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(8)/INTEGER(4)).
mtcp_split_process.c: In function ‘initializeLowerHalf’:
mtcp_split_process.c:421:18: error: ‘LowerHalfInfo_t’ {aka ‘struct _LowerHalfInfo’} has no member named ‘endOfHeapFrozenAddr’
421 | *lh_info_addr->endOfHeapFrozenAddr = 1;
| ^~
make[1]: *** [Makefile:121: mtcp_split_process.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make: *** [Makefile:52: mana] Error 2

read_lh_proxy_bits error

Hi,
I tried to compile MANA using nvidia and gcc 8.3.0. The steps were ./configure and make -j mana. I encountered the following error:
[################]: enter readMapsLine
[################]: enter readMapsLine
[################]: enter readMapsLine

end
[################]: after setLhMemRange
[JTRACE-WEILU###] [split_process.cpp ] [75 ] [splitProcess ()] , msg='END startProxy'
[JTRACE-WEILU###] [split_process.cpp ] [78 ] [splitProcess ()] , msg='BEFORE read_lh_proxy_bits'
[################]: enter read_lh_proxy_bits
%%%%%%%%%%%%%%0urrent_pid = -155596608*****************************[realPid = 28563] [childpid = 191000]
[JTRACE-WEILU###] [split_process.cpp ] [228] [read_lh_proxy_bits ()] , msg='before mmap_iov 1'
[################ [base = 0x0e000000] [len = 3776512] [prot = 7] [flags = 50] [image_fd = 12]
[JTRACE-WEILU###] [split_process.cpp ] [194] [mmap_iov ()] , msg='beforeWEILU################ mmap result: Success
[################ mmap result: Success
[190000] ERROR at split_process.cpp:280 in read_lh_proxy_bits; REASON='JASSERT(ret != -1) failed'
(strerror((*__errno_location ()))) = No such process
Message: Error reading data from lh_proxy
hello_mpi_init_thread.mana.exe (190000): Terminating...

How can I solve this error? I need your help.
Best

Cannot build Mana on container CentOS 7.9

Expected result:
Building MANA from either main or the latest tag should work.

Current result:
It fails with this error:

f951: Warning: command-line option '-std=gnu11' is valid for C/ObjC but not for Fortran
In file included from mpi_type_wrappers.cpp:29:
mpi_nextfunc.h:99:19: error: conflicting declaration of C function 'int MPI_Type_struct(int, int*, MPI_Aint*, MPI_Datatype*, MPI_Datatype*)'
   99 |   EXTERNC rettype MPI_##name(APPLY(PAIR, args))
      |                   ^~~~
mpi_type_wrappers.cpp:186:1: note: in expansion of macro 'USER_DEFINED_WRAPPER'
  186 | USER_DEFINED_WRAPPER(int, Type_struct, (int) count,
      | ^~~~~~~~~~~~~~~~~~~~
In file included from ../mpi_plugin.h:25,
                 from mpi_type_wrappers.cpp:22:
/usr/include/mpich-3.2-x86_64/mpi.h:986:5: note: previous declaration 'int MPI_Type_struct(int, const int*, const MPI_Aint*, const MPI_Datatype*, MPI_Datatype*)'
  986 | int MPI_Type_struct(int count, const int *array_of_blocklengths,
      |     ^~~~~~~~~~~~~~~
In file included from mpi_type_wrappers.cpp:29:

Steps to reproduce:
Build this docker container:

FROM centos:7.9.2009 as builder
RUN yum update -y
RUN yum install centos-release-scl epel-release -y \
  && rpms="bzip2 cmake3 devtoolset-11-toolchain git python3-pip wget" \
  && yum install ${rpms} -y \
  && for rpm in ${rpms} ; do yum install "${rpm}" -y ; done

RUN ln -s /usr/bin/cmake3 /usr/bin/cmake \
  && sclCreateProxy() { \
  cmd="$1" \
  sclName="$2" \
  && echo '#!/bin/sh' >/usr/bin/"${cmd}" \
  && echo exec scl enable "${sclName}" -- "${cmd}" \"\$@\" >>/usr/bin/"${cmd}" \
  && chmod 775 /usr/bin/"${cmd}" \
  ; } \
  && sclCreateProxy make devtoolset-11 \
  && sclCreateProxy gcc devtoolset-11 \
  && sclCreateProxy g++ devtoolset-11

####
# MPICH
####

RUN rpms="mpich-3.2 mpich-3.2-devel libxml2-static zlib-static" \
  && yum install ${rpms} -y \
  && for rpm in ${rpms} ; do yum install "${rpm}" -y ; done && yum clean all

ENV PATH="${PATH}:/usr/lib64/mpich-3.2/bin"

RUN set -x && mkdir /tmp/mana && git clone https://github.com/mpickpt/mana.git -b nersc-release-phase-2-v2 \
  && cd ./mana && export MANA_ROOT=$PWD \
  && git submodule update --init && ./configure && make -j mana

TLA+/PlusCal specification

Your paper "MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing" briefly mentions a verification with TLA+/PlusCal. Can you please make the specification available? Thank you.

`make -j mana` can fail with `p2p_deterministic.c`, where `make mana` does not.

When I build MANA using `make -j mana`, I sometimes get:

[...]
mpic++ -std=c++14 -fno-stack-protector -Wall -g3 -O0 -DDEBUG -fPIC -I../../dmtcp/include -I../../dmtcp/jalib -I.. -I../../dmtcp/src -I../lower-half -c -o mpi_p2p_wrappers.o mpi_p2p_wrappers.cpp
mpic++ -std=c++14 -fno-stack-protector -Wall -g3 -O0 -DDEBUG -fPIC -I../../dmtcp/include -I../../dmtcp/jalib -I.. -I../../dmtcp/src -I../lower-half -c -o mpi_request_wrappers.o mpi_request_wrappers.cpp
p2p-deterministic.c: In function ‘set_next_msg’:
p2p-deterministic.c:154:34: error: ‘P2P_LOG_MSG’ undeclared (first use in this function)
  154 |     snprintf(buf, sizeof(buf)-1, P2P_LOG_MSG, rank);
      |                                  ^~~~~~~~~~~
p2p-deterministic.c:164:26: error: invalid application of ‘sizeof’ to incomplete type ‘struct p2p_log_msg’
  164 |   p2p_msg = malloc(sizeof(*p2p_msg) + size * count);
      |                          ^
p2p-deterministic.c:165:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  165 |   p2p_msg->count = count;
      |          ^~
p2p-deterministic.c:166:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  166 |   p2p_msg->datatype = datatype;
      |          ^~
p2p-deterministic.c:167:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  167 |   p2p_msg->source = source;
      |          ^~
p2p-deterministic.c:168:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  168 |   p2p_msg->tag = tag;
      |          ^~
p2p-deterministic.c:169:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  169 |   p2p_msg->comm = comm;
      |          ^~
p2p-deterministic.c:170:17: error: invalid use of undefined type ‘struct p2p_log_msg’
  170 |   memcpy(p2p_msg->data, data, size * count);
      |                 ^~
p2p-deterministic.c:173:10: error: invalid use of undefined type ‘struct p2p_log_msg’
  173 |   p2p_msg->request = (request ? *request : MPI_REQUEST_NULL);
      |          ^~
p2p-deterministic.c:175:12: error: invalid use of undefined type ‘struct p2p_log_msg’
  175 |     p2p_msg->source = status->MPI_SOURCE;
      |            ^~
p2p-deterministic.c:176:12: error: invalid use of undefined type ‘struct p2p_log_msg’
  176 |     p2p_msg->tag = status->MPI_TAG;
      |            ^~
p2p-deterministic.c:183:3: warning: implicit declaration of function ‘writeall’ [-Wimplicit-function-declaration]
  183 |   writeall(fd, p2p_msg, sizeof(*p2p_msg) + size * count);
      |   ^~~~~~~~
p2p-deterministic.c:183:31: error: invalid application of ‘sizeof’ to incomplete type ‘struct p2p_log_msg’
  183 |   writeall(fd, p2p_msg, sizeof(*p2p_msg) + size * count);
      |                               ^
p2p-deterministic.c: In function ‘p2p_replay_pre_irecv’:
p2p-deterministic.c:194:22: error: storage size of ‘p2p_msg’ isn’t known
  194 |   struct p2p_log_msg p2p_msg;
      |                      ^~~~~~~
p2p-deterministic.c:194:22: warning: unused variable ‘p2p_msg’ [-Wunused-variable]
p2p-deterministic.c: In function ‘p2p_replay_post_iprobe’:
p2p-deterministic.c:209:24: error: storage size of ‘p2p_msg’ isn’t known
  209 |     struct p2p_log_msg p2p_msg;
      |                        ^~~~~~~
p2p-deterministic.c:209:24: warning: unused variable ‘p2p_msg’ [-Wunused-variable]
p2p-deterministic.c: In function ‘save_request_info’:
p2p-deterministic.c:253:26: error: storage size of ‘p2p_request’ isn’t known
  253 |   struct p2p_log_request p2p_request;
      |                          ^~~~~~~~~~~
p2p-deterministic.c:260:34: error: ‘P2P_LOG_REQUEST’ undeclared (first use in this function)
  260 |     snprintf(buf, sizeof(buf)-1, P2P_LOG_REQUEST, rank);
      |                                  ^~~~~~~~~~~~~~~
p2p-deterministic.c:253:26: warning: unused variable ‘p2p_request’ [-Wunused-variable]
  253 |   struct p2p_log_request p2p_request;
      |                          ^~~~~~~~~~~
p2p-deterministic.c: At top level:
p2p-deterministic.c:38:27: error: storage size of ‘next_msg_entry’ isn’t known
   38 | static struct p2p_log_msg next_msg_entry = {0, MPI_CHAR, 0, 0, 0, MPI_COMM_NULL, MPI_REQUEST_NULL};
      |                           ^~~~~~~~~~~~~~
[...]

And the build fails.

This does not happen if I use make mana instead.

Latest commit does not compile. make fails with "no known conversion from 'const void*' to 'MPI_Datatype' {aka 'ompi_datatype_t*'}"

Hi,

Summary

Compile fails with:

"no known conversion from 'const void*' to 'MPI_Datatype' {aka 'ompi_datatype_t*'}"

Details

Cloned this commit:
Ran the following as per docs:

$./configure
$make -j mana

Lots of errors like

../record-replay.h:186:5: note:   no known conversion from ‘const void*’ to ‘MPI_Datatype’ {aka ‘ompi_datatype_t*’}

Full stderr dump can be seen here

System Information:

OS: Ubuntu 22.04-LTS
CPU: 6-core model: Intel Core i7-8750H
MPI Information: OpenMPI 4.1.2
MPI Info details: Standard ubuntu deb package. Details on pastebin

Compile failed using OpenMPI on CentOS on x86

make failure info:
mpic++ -g -O2 -fno-stack-protector -fPIC -I../../include -I../../jalib -Impi-wrappers -I. -I../../src -Ilower-half -std=c++11 -c -o mana_coordinator.o mana_coordinator.cpp
mana_coordinator.cpp: In function ‘void processPreSuspendClientMsgHelper(dmtcp::DmtcpCoordinator*, dmtcp::CoordClient*, int&, const dmtcp::DmtcpMessage&, const void*)’:
mana_coordinator.cpp:219:22: error: invalid conversion from ‘MPI_Comm {aka ompi_communicator_t*}’ to ‘std::map<dmtcp::CoordClient*, unsigned int, std::less<dmtcp::CoordClient*>, dmtcp::DmtcpAlloc<std::pair<dmtcp::CoordClient* const, unsigned int> > >::mapped_type {aka unsigned int}’ [-fpermissive]

env info:
Openmpi-version:
mpirun (Open MPI) 4.1.0
Linux:
3.10.0-957.el7.x86_64
ibstatus:
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:7cd9:a003:00cd:a498
base lid: 0x5
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:7ed9:a0ff:fecd:7978
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: Ethernet

libproxy.a(sbrk.o): In function `__sbrk':

Hi,
I tried to compile mana using IntelMPI and gcc9.3.0. The steps were ./configure and make -j mana. I encountered the following error:

if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
  rm -f tmp.sh; \
  mpicc -show -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
    sed -e 's^-lunwind ^ ^'> tmp.sh; \
  sh tmp.sh; \
  rm -f tmp.sh; \
else \
  mpicc -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group; \
fi
libproxy.a(sbrk.o): In function `__sbrk':
/home/luxu/mana-interface7/contrib/mpi-proxy-split/lower-half/sbrk.c:51: undefined reference to `__brk'
collect2: error: ld returned 1 exit status
make[4]: *** [lh_proxy] Error 1
make[4]: Leaving directory `/home/luxu/mana-interface7/contrib/mpi-proxy-split/lower-half'
make[3]: *** [../../bin/lh_proxy] Error 2
make[3]: Leaving directory `/home/luxu/mana-interface7/contrib/mpi-proxy-split'
make[2]: *** [install] Error 2
make[2]: Leaving directory `/home/luxu/mana-interface7/contrib/mpi-proxy-split'
make[1]: *** [mana_part2] Error 2
make[1]: Leaving directory `/home/luxu/mana-interface7'
make: *** [mana] Error 2

How can I solve this error? I need your help.
Best

Final issues with iPic3D

Good afternoon!
As you will be aware from recent issues I have posted, I am trying to use MANA with iPic3D. You have already helped me with unimplemented wrappers. After the wrappers for MPI_Type_hvector and MPI_Cart_create worked properly, several other MPI routines required wrappers (MPI_Wtick, MPI_Type_create_subarray, MPI_File_write, MPI_File_set_view, MPI_File_write_all). Although I would like to learn in more depth how MANA works, I still have much to learn, and I did my best to write wrappers for the aforementioned MPI routines. I have reached a situation where iPic3D works with MANA, but I have two main problems which I would like to discuss.

  1. iPic3D uses non-blocking communications (Isend and Irecv) between neighbouring domains and uses MPI_Waitall before continuing with the next section of communications, so that all processes are at the same point. Unfortunately, if I leave those MPI_Waitall calls intact and run iPic3D with MANA, when the application gets to the communication part, it exits with error code 1. This is because MPI_Waitall sets the statuses to something different from MPI_SUCCESS. I think it might have something to do with the bug that was found by @chirag-singh-memverge recently (#202), and fixed by him in recent commits. I see from those commits that the wrapper for MPI_Waitall was not modified, and I was wondering if it has to be changed following the same logic as discussed in #202. A possible way out that works is changing these MPI_Waitall calls to MPI_Barrier, but that comes with a significantly increased overhead, so it is not a very desirable option.

  2. Using MPI_Barrier instead of MPI_Waitall, the application finishes, and from the info printed to the terminal it seems that the calculations are the same in a native run and when using MANA. However, issues appear when writing with the MPI_File_* routines. With the wrappers I implemented, the output file is basically nonsense. This is not strange, because the application itself complains with

Error in MPI_File_set_view: Invalid datatype

I believe the issue might be in the wrappers I wrote (or, should I say, copied from similar methods that were already implemented) for MPI_File_set_view itself or for MPI_Type_create_subarray.
For MPI_File_set_view I just wrote (as with MPI_File_write and MPI_File_write_all)

DEFINE_FNC(int, File_set_view, (MPI_File) fh, (MPI_Offset) disp, (MPI_Datatype) etype,
                     (MPI_Datatype) filetype, (const char*) datarep, (MPI_Info) info);

For MPI_Type_create_subarray I wrote

USER_DEFINED_WRAPPER(int, Type_create_subarray, (int) ndims, (const int*) array_of_sizes,
                     (const int*) array_of_subsizes, (const int*) array_of_starts,
                     (int) order, (MPI_Datatype) oldtype, (MPI_Datatype*) newtype)
{
  int retval;
  DMTCP_PLUGIN_DISABLE_CKPT();
  MPI_Datatype realType = VIRTUAL_TO_REAL_TYPE(oldtype);
  JUMP_TO_LOWER_HALF(lh_info.fsaddr);
  retval = NEXT_FUNC(Type_create_subarray)(ndims, array_of_sizes, array_of_subsizes,
                                           array_of_starts,  order, realType, newtype);
  RETURN_TO_UPPER_HALF();
  if (retval == MPI_SUCCESS && MPI_LOGGING()) {
    MPI_Datatype virtType = ADD_NEW_TYPE(*newtype);
    *newtype = virtType;
    FncArg sizes = CREATE_LOG_BUF(array_of_sizes, ndims * sizeof(int));
    FncArg subsizes = CREATE_LOG_BUF(array_of_subsizes, ndims * sizeof(int));
    FncArg starts = CREATE_LOG_BUF(array_of_starts, ndims * sizeof(int));
    LOG_CALL(restoreTypes, Type_create_subarray, ndims, sizes, subsizes, starts,
             order, oldtype, virtType);
  }
  DMTCP_PLUGIN_ENABLE_CKPT();
  return retval;
}
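One thing I suspect: DEFINE_FNC passes the upper half's virtual datatype handles straight to the lower-half MPI library, which would explain the "Invalid datatype" error from MPI_File_set_view. A sketch (only a guess, reusing the macro names from my Type_create_subarray wrapper above; the rest is assumption) of a File_set_view wrapper that translates the handles first:

```
// Sketch only: assumes the MANA macros behave as in the
// Type_create_subarray wrapper above; the real plumbing may differ.
USER_DEFINED_WRAPPER(int, File_set_view, (MPI_File) fh, (MPI_Offset) disp,
                     (MPI_Datatype) etype, (MPI_Datatype) filetype,
                     (const char*) datarep, (MPI_Info) info)
{
  int retval;
  DMTCP_PLUGIN_DISABLE_CKPT();
  // Translate the upper-half (virtual) handles before the lower half sees them.
  MPI_Datatype realEtype = VIRTUAL_TO_REAL_TYPE(etype);
  MPI_Datatype realFiletype = VIRTUAL_TO_REAL_TYPE(filetype);
  JUMP_TO_LOWER_HALF(lh_info.fsaddr);
  retval = NEXT_FUNC(File_set_view)(fh, disp, realEtype, realFiletype,
                                    datarep, info);
  RETURN_TO_UPPER_HALF();
  DMTCP_PLUGIN_ENABLE_CKPT();
  return retval;
}
```

I have not checked whether File_set_view also needs record-and-replay logging for restart; that part is omitted here.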

Do you know where the problem might be? Thank you!

Best,
Marc

License

What license is MANA released under? Can an appropriate license file be added?

Error loading libmana.so

I configured and installed mana from the dmtcp-master branch, on a local centos 7 machine. When trying to execute the mpi_hello_world.exe example, i get the following error:

ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object 'bin/../lib/dmtcp/libmana.so' from LD_PRELOAD cannot be preloaded: ignored.
It seems that the mana_launch script tries to point dmtcp_launch to the wrong libmana.so location (bin/../lib/dmtcp instead of bin/../dmtcp/lib/dmtcp), but even if I try to point to where libmana.so is, the program gets stuck at the start.

Am I doing something wrong? Is the dmtcp-master branch currently working? Thanks!

Implemented wrapper for MPI_Cart_create() not working with iPic3D

Good morning!
After @chirag-singh-memverge kindly solved the issue with the implemented wrapper for MPI_Type_hvector, a new problem arose. Now the program is complaining that the current implementation for the MPI_Cart_create() wrapper only supports one cartesian communicator. However, iPic3D creates more than one. Is there any easy solution?
Thank you,
Marc

Problem with the MPI_Type_hvector wrapper for iPic3D

Good afternoon,
Some days ago I asked whether a wrapper for MPI_Type_hvector could be implemented, as iPic3D required it. @chirag-singh-memverge was kind enough to implement it. Thank you! However, the assumption that the stride must be a multiple of the size of the type does not hold for iPic3D. Is there an easy fix to this?

Thank you,
Marc

type "MPI_Comm" cannot be assigned to an entity of type "unsigned int"

Hello, when I use Intel Open MPI to build MANA, I encounter a problem.

make[3]: Entering directory `/gs/mana/contrib/mpi-proxy-split'
mpic++ -g -O2 -fno-stack-protector -fPIC -I../../include -I../../jalib -Impi-wrappers -I. -I../../src -Ilower-half -std=c++11 -c -o mana_coordinator.o mana_coordinator.cpp
mana_coordinator.cpp(201): error: a value of type "MPI_Comm" cannot be assigned to an entity of type "unsigned int"
clientGids[client] = state.comm;
^
compilation aborted for mana_coordinator.cpp (code 2)
make[3]: *** [mana_coordinator.o] Error 2

please help me, thanks.

`mpi-proxy-split/util/mpi-logger` will not work for OpenMPI.

Build error:

In file included from mpi_logger.cpp:5:
mpi_logger_utils.h: In function ‘void get_datatype_string(MPI_Datatype, char*)’:
mpi_logger_utils.h:36:11: error: switch quantity not an integer
   36 |   switch (datatype) {
      |           ^~~~~~~~
mpi_logger_utils.h: In function ‘void get_op_string(MPI_Op, char*)’:
mpi_logger_utils.h:93:11: error: switch quantity not an integer
   93 |   switch (op) {
      |           ^~
mpi_logger.cpp: In function ‘int MPI_Cart_create(MPI_Comm, int, const int*, const int*, int, ompi_communicator_t**)’:
mpi_logger.cpp:86:25: warning: ‘sizeof’ on array function parameter ‘dims’ will return size of ‘const int*’ [-Wsizeof-array-argument]
   86 |   int dim_size = sizeof(dims) / sizeof(int);
      |                        ~^~~~~
mpi_logger.cpp:73:61: note: declared here
   73 | int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],const int periods[], int reorder, MPI_Comm * comm_cart) {
      |                                                   ~~~~~~~~~~^~~~~~
mpi_logger.cpp:96:23: warning: ‘sizeof’ on array function parameter ‘periods’ will return size of ‘const int*’ [-Wsizeof-array-argument]
   96 |   int p_size = sizeof(periods) / sizeof(int);
      |                      ~^~~~~~~~
mpi_logger.cpp:73:78: note: declared here
   73 | int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],const int periods[], int reorder, MPI_Comm * comm_cart) {
      |                                                                    ~~~~~~~~~~^~~~~~~~~
mpi_logger.cpp: In function ‘int MPI_Cart_sub(MPI_Comm, const int*, ompi_communicator_t**)’:
mpi_logger.cpp:124:25: warning: ‘sizeof’ on array function parameter ‘remain_dims’ will return size of ‘const int*’ [-Wsizeof-array-argument]
  124 |   int dim_size = sizeof(remain_dims) / sizeof(int);
      |                        ~^~~~~~~~~~~~
mpi_logger.cpp:111:43: note: declared here
  111 | int MPI_Cart_sub(MPI_Comm comm, const int remain_dims[], MPI_Comm *newcomm) {
      |                                 ~~~~~~~~~~^~~~~~~~~~~~~
mpi_logger.cpp: In function ‘int MPI_Alltoall(const void*, int, MPI_Datatype, void*, int, MPI_Datatype, MPI_Comm)’:
mpi_logger.cpp:531:1: warning: no return statement in function returning non-void [-Wreturn-type]
  531 | }
      | ^

Which makes sense, since OpenMPI datatypes are not integers.

Since we are trying to build MANA against multiple MPI implementations, it would be useful to have this utility build against multiple MPI implementations as well.

And, since getting a working Fortran toolchain has been problematic for us on NEU Discovery (and may be problematic at other sites), it would be great if we could optionally compile without Fortran support (with the understanding that Fortran functionality would not work).

CUDA support

Hi,

I see mention of Perlmutter in the repository here, so I wonder if MANA supports CUDA? (Or any other GPU platform.)

Thanks.

Error installing mana on SuSE cluster

I'm currently trying to install mana in a cluster running SuSE Linux Enterprise Server. I have installed xml2, xz, libpciaccess static libraries in a local_install directory. When trying to install mana, i get the following error:

make[4]: Leaving directory '***/mana/mpi-proxy-split/lower-half/gethostbyname-static'
cd gethostbyname-static && make gethostbyname_proxy
make[4]: Entering directory '***/mana/mpi-proxy-split/lower-half/gethostbyname-static'
gcc -g3 -O0 gethostbyname_proxy.c -o gethostbyname_proxy
ar cr libproxy.a libproxy.o procmapsutils.o sbrk.o mmap64.o munmap.o shmat.o shmget.o
make[4]: Leaving directory '***/mana/mpi-proxy-split/lower-half/gethostbyname-static'
cp -f gethostbyname-static/gethostbyname_proxy ../../bin/
if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
  rm -f tmp.sh; \
  mpicc -show -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L***/mpich-static/lib/ -L***/local_install/lib/ -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group | \
    sed -e 's^-lunwind ^ ^'> tmp.sh; \
  sh tmp.sh; \
  rm -f tmp.sh; \
else \
  mpicc -static -Wl,-Ttext-segment -Wl,0xE000000 -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy -Wl,-start-group \
            lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L***/mpich-static/lib/ -L***/local_install/lib/ -lmpi -llzma -lz -lm -lxml2 -lrt -lpthread -lc -Wl,-end-group; \
fi
/usr/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
cp -f lh_proxy gethostbyname-static/gethostbyname_static.o ../../bin/
cp: cannot stat 'lh_proxy': No such file or directory
Makefile:81: recipe for target 'install' failed
make[3]: *** [install] Error 1
make[3]: Leaving directory '***/mana/mpi-proxy-split/lower-half'
Makefile:105: recipe for target '../bin/lh_proxy' failed
make[2]: *** [../bin/lh_proxy] Error 2
make[2]: Leaving directory '***/mana/mpi-proxy-split'
Makefile:116: recipe for target 'install' failed
make[1]: *** [install] Error 2
make[1]: Leaving directory '***/mana/mpi-proxy-split'
Makefile:49: recipe for target 'mana' failed
make: *** [mana] Error 2

It is worth mentioning that I had to change MPI_LD_FLAG in mpi-proxy-split/Makefile_config so that they would point to the local_install and mpich_static directories. The error seems to complain about the lack of a static libudev library. If it was another library, I would just compile and install it statically in the local install directory, but libudev is supposedly integrated inside systemd so I'm unsure of how to proceed. I'm using commit 6fb119f of the mana repo, the updated main branch also gave an error at the same place but the message was not very descriptive.

More information about the environment:

  • compiler: GCC 7.2.0
  • MPICH 3.3.2
  • libxml2 2.9.14
  • xz 5.2.2
  • libpciaccess 0.14

Thanks!

Problem with MPI_STATUS_IGNORE in MPI_Sendrecv

Good morning,

I was trying to run MANA with LAMMPS and encountered a problem with the implemented wrapper for MPI_Sendrecv. LAMMPS uses MPI_STATUS_IGNORE, but the wrapper then tries to dereference it; the application does not crash, but it does not continue either.
Changing

*status = sts[1];

for

if (status != MPI_STATUS_IGNORE) {
  *status = sts[1];
}

solved the issue. I attach minimal code reproducing the bug. Have a nice day!
main.c.gz

MPI_Scan Implementation is Incorrect for MPI_COLLECTIVE_P2P Defined

It looks like the MPI_Scan implementation for collective P2P calls is incorrect (see here). This implementation will always hang, because only the root sends data to every other rank, while every rank other than the root expects a Recv from every rank other than the root.

Is it worth fully fixing this implementation right now, or should it be enough to add a small placeholder where the root (rank 0) gathers all data, performs all reductions, and then scatters the data to each rank? @gc00

Note that this implementation can be easily tested with the Scan_test.c test.

Building MANA on non-NERSC cluster

Hello,

When building MANA on CentOS Linux release 7.4.1708, gcc 9.1.0, and mpich 3.2, I get the following error message:
if mpicc -v 2>&1 | grep -q 'MPICH version'; then \
  rm -f tmp1.sh; \
  mpicc -show -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy_da -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -L$HOME/local_install/lib -llzma -lz -lm -lxml2 -lrt -lpthread -lc -ldl -Wl,-end-group | \
    sed -e 's^-lunwind ^ ^'> tmp1.sh; \
  sh tmp1.sh; \
  rm -f tmp1.sh; \
elif false; then \
  make libnl3.a; \
  mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy_da -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -L$HOME/local_install/lib -llzma -lz -lm -lxml2 `cat static_libs.txt` -ldl -Wl,--end-group; \
else \
  mpicc -static -Wl,--wrap -Wl,__munmap -Wl,--wrap -Wl,shmat -Wl,--wrap -Wl,shmget -o lh_proxy_da -Wl,-start-group \
    lh_proxy.o libproxy.a gethostbyname-static/gethostbyname_static.o -L$HOME/mpich-static/usr/lib64 -lmpi -L$HOME/local_install/lib -llzma -lz -lm -lxml2 -lrt -lpthread -lc -ldl -Wl,-end-group; \
fi
libproxy.a(sbrk.o): In function `__sbrk':
/mana/mpi-proxy-split/lower-half/sbrk.c:105: undefined reference to `__brk'
libproxy.a(munmap.o): In function `munmap':
/mana/mpi-proxy-split/lower-half/munmap.c:110: undefined reference to `__munmap'
collect2: error: ld returned 1 exit status
libproxy.a(sbrk.o): In function `__sbrk':
/mana/mpi-proxy-split/lower-half/sbrk.c:105: undefined reference to `__brk'
libproxy.a(munmap.o): In function `munmap':
/mana/mpi-proxy-split/lower-half/munmap.c:110: undefined reference to `__munmap'
collect2: error: ld returned 1 exit status
make[3]: *** [lh_proxy] Error 1
make[3]: *** Waiting for unfinished jobs....
make[3]: *** [lh_proxy_da] Error 1
make[3]: Leaving directory `/mana/mpi-proxy-split/lower-half'
make[2]: *** [../bin/lh_proxy] Error 2
make[2]: Leaving directory `/mana/mpi-proxy-split'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/mana/mpi-proxy-split'
make: *** [mana] Error 2

Any help with this matter is much appreciated. Thank you for your time.

Best,

Georges
Graduate student from the University of Illinois at Urbana Champaign

dmtcp_coordinator stuck with simple hello world MPI example

Hello everyone,

I'm trying dmtcp_coordinator with DMTCP version 3.0.0 and I'm stuck. I want to try DMTCP with MPI, but the console hangs after I type this line:
dmtcp_launch mpiexec ./hello_world

I'm running the coordinator in a second terminal and this is the hello_world.c file:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int ierr;

  ierr = MPI_Init(&argc, &argv);
  printf("Hello world\n");

  ierr = MPI_Finalize();
  return 0;
}

The coordinator triggers this output and then gets stuck:
[7527] NOTE at dmtcp_coordinator.cpp:1079 in initializeComputation; REASON='Resetting computation'

Any help appreciated.

Thank you.

Is it possible to adapt MANA into a split tool like Elfie?

MANA saves a checkpoint and then restores and runs it to the end of the program. Is there a way to make the restored program stop at the next checkpoint?

Intel's Elfie, for example, splits a binary into multiple parts, each of which is an executable binary that replays a specific part of the original binary.

Since MANA supports OpenMP and MPI, would it be possible to adapt MANA into a splitter like Elfie? Any ideas or suggestions?

Infiniband build support fails

Trying to build the current repo on a system and I reach this error:

cd contrib && make
make[1]: Entering directory `/turquoise/users/dog/checkpointing/mana/contrib'
cd infiniband && make
gcc -I../../jalib -I../../include -g -O2 -fPIC -c -o infinibandwrappers.o infinibandwrappers.c
infinibandwrappers.c:124:1: error: conflicting types for ‘ibv_get_device_guid’
124 | ibv_get_device_guid(struct ibv_device *device)
| ^~~~~~~~~~~~~~~~~~~
In file included from infinibandwrappers.c:11:
/usr/include/infiniband/verbs.h:1977:8: note: previous declaration of ‘ibv_get_device_guid’ was here
1977 | __be64 ibv_get_device_guid(struct ibv_device *device);
| ^~~~~~~~~~~~~~~~~~~
make: *** [infinibandwrappers.o] Error 1

The problem is that the system verbs.h file declares
__be64 ibv_get_device_guid(struct ibv_device *device);
while the current infinibandwrappers.c file declares it as returning a uint64_t.
