gutmann / coarray_icar
Testing implementation of CoArrays for the basic ICAR algorithms
License: MIT License
Cheyenne has Intel MPI, MPT, MPICH, and possibly OpenMPI.
Cori will have Cray MPI; anything else?
(Converted to an issue to create a place for discussion.)
A CMake build process exists and seems to work, but I'm having problems with it on Cheyenne, building with gcc 6.3, OpenCoarrays 1.9.1, and SGI's MPT MPI implementation 2.15f. Interestingly, the static-makefile compile works and the executable runs just fine, so CMake seems to be doing something different.
I suspect CMake is finding a different Fortran compiler or runtime library. Does CMake know about MPT, or will it try to use MPICH or OpenMPI (both of which are installed on Cheyenne)? Perhaps this, combined with my module environment loading MPT, is breaking something at runtime.
Is it possible to have CMake (or the makefile it generates) print out the compile line and/or all the libraries it links against?
Do we stick with gfortran 6.3, or can we get gfortran 8 to work (on all machines)?
Which version of ifort is available on Cori? (Cheyenne has 16, 17, and 18, all x.0.1, plus 16.0.3.)
Which Cray ftn version? (Not on Cheyenne, obviously.)
Needs a core set of test cases (can be idealized) with "correct" output that can be checked as different parallelization/scaling strategies are investigated.
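As a sketch of what such automated checks could look like (the tolerance, the test field, and the function name here are illustrative assumptions, not part of the repo):

```python
import numpy as np

def check_against_baseline(result, baseline, rel_tol=1e-6):
    """Return True when every gridcell of a run's output agrees with a
    stored "correct" answer to within a relative tolerance, so the same
    idealized case can be re-run under different parallelization
    strategies and verified automatically."""
    result = np.asarray(result, dtype=np.float64)
    baseline = np.asarray(baseline, dtype=np.float64)
    if result.shape != baseline.shape:
        return False
    return bool(np.allclose(result, baseline, rtol=rel_tol, atol=0.0))

# Example: a multi-image run should reproduce the single-image answer
# up to round-off.
baseline = np.linspace(200.0, 300.0, 100).reshape(10, 10)
perturbed = baseline * (1.0 + 1e-8)  # round-off-level difference
print(check_against_baseline(perturbed, baseline))        # True
print(check_against_baseline(baseline * 1.01, baseline))  # False
```

A relative (rather than absolute) tolerance matters here because the output fields (pressure, temperature, mixing ratios) span several orders of magnitude.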
Depends on: #10
@gutmann Feel free to assign issues #19 and #20 to @scrasmussen, adjust the paper milestone to have a July 31 deadline, and you can probably close issue #26 (or at least remove it from the paper milestone), because presumably @afanfa won't have time to work on it.
There could also be a paper project set up via GitHub for gathering issues into a kanban board, but that may be overkill in this case. Your call.
Preliminary performance analysis of the coarray version of ICAR indicates that the algorithm exhibits significant load imbalance. We believe this is due to the lopsided expense of evaluating a few physics kernels over the mountainous parts of the grid versus the parts that do not feature mountains. (Ethan, can you comment on this to verify that I stated the problem correctly?)
Achieving good code performance and scalability requires that we address this load imbalance. One approach would be to partition the grid asymmetrically, so that the regions requiring expensive kernels are distributed evenly among images rather than concentrated on a few images. Other approaches may also be possible.
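A minimal sketch of the asymmetric-partitioning idea (the per-column cost model is invented for illustration; a real version would weight columns by measured kernel time over terrain):

```python
def partition_by_cost(costs, n_images):
    """Split a 1-D list of per-column costs into n_images contiguous
    chunks whose total costs are roughly equal, instead of giving
    every image the same number of columns."""
    bounds, start = [], 0
    remaining = float(sum(costs))
    for img in range(n_images - 1):
        target = remaining / (n_images - img)  # fair share of what's left
        acc, end = 0.0, start
        while end < len(costs):
            nxt = acc + costs[end]
            # stop where the running sum is closest to the fair share
            if acc > 0 and abs(nxt - target) > abs(acc - target):
                break
            acc, end = nxt, end + 1
        # keep at least one column for each remaining image
        end = max(start + 1, min(end, len(costs) - (n_images - 1 - img)))
        bounds.append((start, end))
        remaining -= sum(costs[start:end])
        start = end
    bounds.append((start, len(costs)))
    return bounds

# "Mountain" columns (cost 5) concentrated mid-grid, flat elsewhere.
costs = [1] * 8 + [5] * 4 + [1] * 8
chunks = partition_by_cost(costs, 4)
loads = [sum(costs[a:b]) for a, b in chunks]
print(chunks, loads)
```

For this toy grid the weighted split yields per-image loads of [8, 10, 10, 8], whereas an even five-columns-per-image split would concentrate the mountain range on two images with loads [5, 13, 13, 5].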
One appealing feature of coarrays is the one-sided nature of communication among images, which allows overlapping communication with computation. OpenCoarrays exploits this by using MPI-3 RMA functions and shared-memory windows.
It would be interesting to measure what benefit ICAR sees from one-sided communication via coarrays over traditional two-sided MPI communication, which has been the standard programming model for bulk-synchronous parallel applications for many years. We can measure this by writing an analogous two-sided MPI version of ICAR and comparing its performance with the coarray version.
The Cray Fortran compiler integrates coarrays with a communication library that uses the Aries interconnect (featured, e.g., on Cray XC series systems). It would be an interesting exercise to determine whether Intel MPI and coarrays can also be built to run over Aries, so that we can compare the two coarray implementations on the same hardware.
In commit c86ac77 ("now prints precipitation mid way through the domain after the run completes"), a loop was added to the end of test-ideal.f90 that invokes "sync all" num_images() times. This becomes expensive at large numbers of images. What was the intent of this?
coarray_icar/src/tests/test-ideal.f90
Line 63 in f1d8e4a
Good afternoon,
I'm trying to compile the latest version of the master branch with the Cray compiler CCE v8.5.4. I used the CMake command:
cmake -DCMAKE_Fortran_FLAGS="-e T" ..
(The "-e T" flag tells CCE to run the preprocessor on all files, not just those with a capitalized .F90 suffix.)
The compiler fails on test-initialization.f90 with this error:
cd /global/homes/f/friesen/coarrays/coarray_icar/build/src/tests && /opt/cray/pe/craype/2.5.7/bin/ftn -I/global/homes/f/friesen/coarrays/coarray_icar/src -I/global/homes/f/friesen/coarrays/coarray_icar/build/mod -e T -em -J. -c /global/homes/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90 -o CMakeFiles/initialization-test.dir/test-initialization.f90.o
ftn-942 crayftn: ERROR block, File = ../../../global/u2/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90, Line = 10, Column = 23
Object "DOMAIN" is a type with a coarray ultimate component, so it must be a dummy argument or have the ALLOCATABLE or SAVE attribute.
ftn-942 crayftn: ERROR block, File = ../../../global/u2/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90, Line = 16, Column = 23
Object "DOMAIN" is a type with a coarray ultimate component, so it must be a dummy argument or have the ALLOCATABLE or SAVE attribute.
Cray Fortran : Version 8.5.4 (20160920191524_bfcbad9dba1deb728a485dee62483f6acb821568)
Cray Fortran : Mon Jul 24, 2017 17:38:21
Cray Fortran : Compile time: 0.0120 seconds
Cray Fortran : 28 source lines
Cray Fortran : 2 errors, 0 warnings, 0 other messages, 0 ansi
Cray Fortran : "explain ftn-message number" gives more information about each message.
Could you remind me whether coarray_icar works at all with gfortran versions > 6.x? When I build the develop branch of this fork with gfortran 8.2.0 and a recent OpenCoarrays commit as follows, I get the following:
$ cd src/tests
$ export COMPILER=gnu
$ make USE_ASSERTIONS=.true.
$ cafrun -n 4 ./test-ideal
Number of images = 4
1 domain%initialize_from_file('input-parameters.txt')
ximgs= 2 yimgs= 2
call master_initialize(this)
call this%variable%initialize(this%get_grid_dimensions(),variable_test_val)
Layer height Pressure Temperature Water Vapor
[m] [hPa] [K] [kg/kg]
9750.00000 271.047180 206.509430 9.17085254E-06
7750.00000 364.236786 224.725372 7.91714992E-05
5750.00000 481.825287 243.449936 5.01311326E-04
3750.00000 628.424316 262.669800 2.46796501E-03
1750.00000 809.217651 282.372711 9.08217765E-03
ThompMP: read qr_acr_qg.dat instead of computing
qr_acr_qg initialized: 0.229000002
ThompMP: read qr_acr_qs.dat instead of computing
qr_acr_qs initialized: 0.170000002
ThompMP: read freezeH2O.dat instead of computing
freezeH2O initialized: 1.02300000
qi_aut_qs initialized: 1.79999992E-02
Beginning simulation...
Assertion "put_north: conformable halo_south_in and local " failed on image 1
ERROR STOP
Assertion "put_south: conformable halo_north_in and local " failed on image 4
ERROR STOP
Assertion "put_south: conformable halo_north_in and local " failed on image 3
ERROR STOP
Assertion "put_north: conformable halo_south_in and local " failed on image 2
ERROR STOP
[proxy:0:0@Sourcery-Institute-VM] HYDU_sock_write (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/utils/sock/sock.c:294): write error (Broken pipe)
[proxy:0:0@Sourcery-Institute-VM] HYD_pmcd_pmip_control_cmd_cb (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:932): unable to write to downstream stdin
[proxy:0:0@Sourcery-Institute-VM] HYDT_dmxu_poll_wait_for_event (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@Sourcery-Institute-VM] main (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@Sourcery-Institute-VM] control_cb (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[mpiexec@Sourcery-Institute-VM] HYDT_dmxu_poll_wait_for_event (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@Sourcery-Institute-VM] HYD_pmci_wait_for_completion (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@Sourcery-Institute-VM] main (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
Error: Command:
`/opt/mpich/3.2.1/gnu/8.2.0/bin/mpiexec -n 4 --disable-auto-cleanup ./test-ideal`
failed to run.
I've also attempted to build with the Intel 18 and 19 compilers on pegasus.nic.uoregon.edu and got the following runtime messages, after which execution hangs:
$ mpiexec -np 1 ./test-ideal
[mpiexec@pegasus] HYDU_parse_hostfile (../../utils/args/args.c:553): unable to open host file: ./cafconfig.txt
[mpiexec@pegasus] config_tune_fn (../../ui/mpich/utils.c:2192): error parsing config file
[mpiexec@pegasus] match_arg (../../utils/args/args.c:243): match handler returned error
[mpiexec@pegasus] HYDU_parse_array_single (../../utils/args/args.c:294): argument matching returned error
[mpiexec@pegasus] HYD_uii_mpx_get_parameters (../../ui/mpich/utils.c:4999): error parsing input array
(Full Intel MPI mpiexec usage/help text elided.)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
^C[mpiexec@pegasus] Sending Ctrl-C to processes as requested
[mpiexec@pegasus] Press Ctrl-C again to force abort
Needs some form of IO to begin testing larger cases and visualizing output.
A dependency on NetCDF could be limiting, but most supercomputers have NetCDF installed, and it is not that difficult to install elsewhere (Homebrew, apt-get, etc.).
The next core physics component to add from ICAR is the linear-theory mountain-wave solution. This is actually non-trivial because it requires FFTs over the full domain. However, these FFTs can be run at initialization to create a lookup table (LUT), and in the future this LUT can be written to and read from disk.
Although the LT solution (LUT access) only needs to be computed once every IO forcing time step, with LARGE numbers of cores this could be a limiting step because some sort of more global communication becomes necessary: e.g., the LUT has to be stored on a master process and accessed by all, or (probably better, because it can be very large) it has to be distributed among all processes, and each process has to access another process's memory once for EVERY gridcell in its domain.
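The bookkeeping for the distributed option can be sketched as follows (a simple block distribution, purely illustrative of the access pattern rather than ICAR's actual layout):

```python
def lut_owner(lut_index, lut_size, n_images):
    """Map a global LUT entry to the image that stores it under a
    simple block distribution: image 0 holds the first block of
    entries, image 1 the next block, and so on."""
    block = -(-lut_size // n_images)  # ceiling division: entries per image
    return lut_index // block

# Each image holds a contiguous block; every gridcell then triggers one
# remote (one-sided) access to whichever image owns its LUT entry.
lut_size, n_images = 1000, 8
owners = [lut_owner(i, lut_size, n_images) for i in range(lut_size)]
print(min(owners), max(owners))  # 0 7
```

With one-sided coarray access, the owning image does not need to participate in each lookup, which is what makes the distributed layout attractive despite the per-gridcell remote reference.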
Depends on #10
When compiled with the Cray Fortran compiler, the coarray version of ICAR scales to very high concurrencies: nearly 100,000 images on the Edison system at NERSC, and up to several tens of thousands of images on Intel Xeon Phi on the Cori system at NERSC. However, the OpenCoarrays version, compiled against Cray MPI, scales poorly on Xeon Phi, even at relatively low concurrencies. We should investigate the cause of this poor scaling and determine how it can be fixed in OpenCoarrays.
Early benchmarking of the coarray version of ICAR has expressed the concurrency in coarrays by placing one coarray image on each physical core of a compute node. On architectures such as Intel Xeon Phi, this can result in large numbers of images on relatively few nodes. Such a large degree of concurrency may affect code performance and scaling, depending on how the underlying coarray communication model is implemented in a particular compiler.
One way to mitigate the large degree of concurrency (and therefore communication) among coarray images is to combine the coarray implementation with OpenMP, so that some of the parallelism within each compute node is expressed over shared memory rather than over a global address space. We should compare the performance of coarrays+OpenMP against pure coarrays in ICAR to evaluate this.
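The design space for such a hybrid run reduces to simple arithmetic over node resources; a sketch (the 68-core node is illustrative of a Xeon Phi part, not a measured configuration):

```python
def hybrid_layouts(cores_per_node, n_nodes):
    """Enumerate (total_images, images_per_node, threads_per_image)
    combinations that exactly fill each node, i.e. where
    images_per_node * threads_per_image == cores_per_node.
    Fewer images per node means fewer coarray endpoints communicating,
    with OpenMP threads covering the remaining on-node parallelism."""
    layouts = []
    for images in range(1, cores_per_node + 1):
        if cores_per_node % images == 0:
            threads = cores_per_node // images
            layouts.append((images * n_nodes, images, threads))
    return layouts

# e.g. 4 nodes of 68 cores: pure coarrays is the 68-images-per-node row;
# the hybrid options shrink the image count while keeping all cores busy.
for total, ipn, tpi in hybrid_layouts(68, 4):
    print(f"{total:4d} images total: {ipn:2d}/node x {tpi:2d} threads")
```

Sweeping these layouts in a benchmark would show directly how much of the poor scaling is attributable to the sheer number of communicating images per node.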
Experience suggests that resolving issue #21 is a long shot. I got the Intel compiler to work on a Cray back in 2012 through Herculean effort and have never gotten it to work since. Nonetheless, it will be important to have results from using the Intel compiler at scale.