gutmann / coarray_icar
Testing implementation of CoArrays for the basic ICAR algorithms
License: MIT License
Cheyenne has Intel MPI, MPT, MPICH, and possibly OpenMPI.
Cori will have Cray MPI; anything else?
(Converted to an issue to create a place for discussion.)
A CMake build process exists and seems to work, but I'm having problems with it on Cheyenne, building with gcc 6.3, OpenCoarrays 1.9.1, and SGI's MPT MPI implementation 2.15f. Interestingly, the static-makefile compile works and the executable runs just fine, so CMake seems to be doing something different.
I suspect CMake is finding a different Fortran compiler or runtime library. Does CMake know about MPT, or will it try to use MPICH or OpenMPI (both of which are installed on Cheyenne)? Perhaps this, combined with my module environment loading MPT, is breaking something at runtime.
Is it possible to have CMake (or the makefile it generates) print out the compile line and/or all the libraries it links against?
Do we stick with gfortran 6.3, or can we get gfortran 8 to work (on all machines)?
Which version of ifort is available on Cori? (Cheyenne has 16, 17, and 18, all x.0.1, plus 16.0.3.)
Which Cray ftn version? (Not on Cheyenne, obviously.)
Needs a core set of test cases (can be idealized) with "correct" output that can be checked as different parallelization/scaling strategies are investigated.
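As a sketch of what such automated checks could look like (the tolerance, the test field, and the function name here are illustrative assumptions, not part of the repo):

```python
import numpy as np

def check_against_baseline(result, baseline, rel_tol=1e-6):
    """Return True when every gridcell of a run's output agrees with a
    stored "correct" answer to within a relative tolerance, so the same
    idealized case can be re-run under different parallelization
    strategies and verified automatically."""
    result = np.asarray(result, dtype=np.float64)
    baseline = np.asarray(baseline, dtype=np.float64)
    if result.shape != baseline.shape:
        return False
    return bool(np.allclose(result, baseline, rtol=rel_tol, atol=0.0))

# Example: a multi-image run should reproduce the single-image answer
# up to round-off.
baseline = np.linspace(200.0, 300.0, 100).reshape(10, 10)
perturbed = baseline * (1.0 + 1e-8)  # round-off-level difference
print(check_against_baseline(perturbed, baseline))        # True
print(check_against_baseline(baseline * 1.01, baseline))  # False
```

A relative (rather than absolute) tolerance matters here because the output fields (pressure, temperature, mixing ratios) span several orders of magnitude.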
Depends on: #10
@gutmann Feel free to assign issues #19 and #20 to @scrasmussen, adjust the paper milestone to have a July 31 deadline, and you can probably close issue #26 (or at least remove it from the paper milestone), because presumably @afanfa won't have time to work on it.
There could also be a paper project set up via GitHub for gathering issues into a kanban board, but that may be overkill in this case. Your call.
Preliminary performance analysis of the coarray version of ICAR indicates that the algorithm exhibits significant load imbalance. We believe this is due to the lopsided expense of evaluating a few physics kernels over the mountainous parts of the grid versus the parts that do not feature mountains. (Ethan, can you comment on this to verify that I stated the problem correctly?)
Achieving good code performance and scalability requires that we address this load imbalance. One approach would be to partition the grid asymmetrically, so that the regions requiring expensive kernels are distributed evenly among images rather than concentrated on a few images. Other approaches may also be possible.
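A minimal sketch of the asymmetric-partitioning idea (the per-column cost model is invented for illustration; a real version would weight columns by measured kernel time over terrain):

```python
def partition_by_cost(costs, n_images):
    """Split a 1-D list of per-column costs into n_images contiguous
    chunks whose total costs are roughly equal, instead of giving
    every image the same number of columns."""
    bounds, start = [], 0
    remaining = float(sum(costs))
    for img in range(n_images - 1):
        target = remaining / (n_images - img)  # fair share of what's left
        acc, end = 0.0, start
        while end < len(costs):
            nxt = acc + costs[end]
            # stop where the running sum is closest to the fair share
            if acc > 0 and abs(nxt - target) > abs(acc - target):
                break
            acc, end = nxt, end + 1
        # keep at least one column for each remaining image
        end = max(start + 1, min(end, len(costs) - (n_images - 1 - img)))
        bounds.append((start, end))
        remaining -= sum(costs[start:end])
        start = end
    bounds.append((start, len(costs)))
    return bounds

# "Mountain" columns (cost 5) concentrated mid-grid, flat elsewhere.
costs = [1] * 8 + [5] * 4 + [1] * 8
chunks = partition_by_cost(costs, 4)
loads = [sum(costs[a:b]) for a, b in chunks]
print(chunks, loads)
```

For this toy grid the weighted split yields per-image loads of [8, 10, 10, 8], whereas an even five-columns-per-image split would concentrate the mountain range on two images with loads [5, 13, 13, 5].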
One appealing feature of coarrays is the one-sided nature of communication among images, which allows overlapping communication with computation. OpenCoarrays exploits this by using MPI-3 RMA functions and shared-memory windows.
It would be interesting to measure what benefit ICAR sees from one-sided communication via coarrays over traditional two-sided MPI communication, which has been the standard programming model for bulk-synchronous parallel applications for many years. We can measure this by writing an analogous two-sided MPI version of ICAR and comparing its performance with the coarray version.
The Cray Fortran compiler integrates coarrays with a communication library that uses the Aries interconnect (featured, e.g., on Cray XC series systems). It would be an interesting exercise to determine whether Intel MPI and coarrays can also be built to run over Aries, so that we can compare the two coarray implementations on the same hardware.
In commit c86ac77 ("now prints precipitation mid way through the domain after the run completes"), a loop was added to the end of test-ideal.f90 that invokes "sync all" num_images() times. This becomes expensive at large numbers of images. What was the intent of this?
coarray_icar/src/tests/test-ideal.f90
Line 63 in f1d8e4a
Good afternoon,
I'm trying to compile the latest version of the master branch with the Cray compiler CCE v8.5.4. I used the CMake command:
cmake -DCMAKE_Fortran_FLAGS="-e T" ..
(The "-e T" flag tells CCE to run the preprocessor on all files, not just those with a capitalized .F90 suffix.)
The compiler fails on test-initialization.f90 with this error:
cd /global/homes/f/friesen/coarrays/coarray_icar/build/src/tests && /opt/cray/pe/craype/2.5.7/bin/ftn -I/global/homes/f/friesen/coarrays/coarray_icar/src -I/global/homes/f/friesen/coarrays/coarray_icar/build/mod -e T -em -J. -c /global/homes/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90 -o CMakeFiles/initialization-test.dir/test-initialization.f90.o
ftn-942 crayftn: ERROR block, File = ../../../global/u2/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90, Line = 10, Column = 23
Object "DOMAIN" is a type with a coarray ultimate component, so it must be a dummy argument or have the ALLOCATABLE or SAVE attribute.
ftn-942 crayftn: ERROR block, File = ../../../global/u2/f/friesen/coarrays/coarray_icar/src/tests/test-initialization.f90, Line = 16, Column = 23
Object "DOMAIN" is a type with a coarray ultimate component, so it must be a dummy argument or have the ALLOCATABLE or SAVE attribute.
Cray Fortran : Version 8.5.4 (20160920191524_bfcbad9dba1deb728a485dee62483f6acb821568)
Cray Fortran : Mon Jul 24, 2017 17:38:21
Cray Fortran : Compile time: 0.0120 seconds
Cray Fortran : 28 source lines
Cray Fortran : 2 errors, 0 warnings, 0 other messages, 0 ansi
Cray Fortran : "explain ftn-message number" gives more information about each message.
Could you remind me whether coarray_icar works at all with gfortran versions > 6.x? When I build the develop branch of this fork with gfortran 8.2.0 and a recent OpenCoarrays commit as follows, I get the following:
$ cd src/tests
$ export COMPILER=gnu
$ make USE_ASSERTIONS=.true.
$ cafrun -n 4 ./test-ideal
Number of images = 4
1 domain%initialize_from_file('input-parameters.txt')
ximgs= 2 yimgs= 2
call master_initialize(this)
call this%variable%initialize(this%get_grid_dimensions(),variable_test_val)
Layer height Pressure Temperature Water Vapor
[m] [hPa] [K] [kg/kg]
9750.00000 271.047180 206.509430 9.17085254E-06
7750.00000 364.236786 224.725372 7.91714992E-05
5750.00000 481.825287 243.449936 5.01311326E-04
3750.00000 628.424316 262.669800 2.46796501E-03
1750.00000 809.217651 282.372711 9.08217765E-03
ThompMP: read qr_acr_qg.dat instead of computing
qr_acr_qg initialized: 0.229000002
ThompMP: read qr_acr_qs.dat instead of computing
qr_acr_qs initialized: 0.170000002
ThompMP: read freezeH2O.dat instead of computing
freezeH2O initialized: 1.02300000
qi_aut_qs initialized: 1.79999992E-02
Beginning simulation...
Assertion "put_north: conformable halo_south_in and local " failed on image 1
ERROR STOP
Assertion "put_south: conformable halo_north_in and local " failed on image 4
ERROR STOP
Assertion "put_south: conformable halo_north_in and local " failed on image 3
ERROR STOP
Assertion "put_north: conformable halo_south_in and local " failed on image 2
ERROR STOP
[proxy:0:0@Sourcery-Institute-VM] HYDU_sock_write (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/utils/sock/sock.c:294): write error (Broken pipe)
[proxy:0:0@Sourcery-Institute-VM] HYD_pmcd_pmip_control_cmd_cb (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:932): unable to write to downstream stdin
[proxy:0:0@Sourcery-Institute-VM] HYDT_dmxu_poll_wait_for_event (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@Sourcery-Institute-VM] main (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec@Sourcery-Institute-VM] control_cb (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[mpiexec@Sourcery-Institute-VM] HYDT_dmxu_poll_wait_for_event (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@Sourcery-Institute-VM] HYD_pmci_wait_for_completion (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@Sourcery-Institute-VM] main (/home/sourcerer/Desktop/opencoarrays/prerequisites/downloads/mpich-3.2.1/src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion
Error: Command:
`/opt/mpich/3.2.1/gnu/8.2.0/bin/mpiexec -n 4 --disable-auto-cleanup ./test-ideal`
failed to run.
I've also attempted to build with the Intel 18 and 19 compilers on pegasus.nic.uoregon.edu and got the following runtime messages, after which execution hangs:
$ mpiexec -np 1 ./test-ideal
[mpiexec@pegasus] HYDU_parse_hostfile (../../utils/args/args.c:553): unable to open host file: ./cafconfig.txt
[mpiexec@pegasus] config_tune_fn (../../ui/mpich/utils.c:2192): error parsing config file
[mpiexec@pegasus] match_arg (../../utils/args/args.c:243): match handler returned error
[mpiexec@pegasus] HYDU_parse_array_single (../../utils/args/args.c:294): argument matching returned error
[mpiexec@pegasus] HYD_uii_mpx_get_parameters (../../ui/mpich/utils.c:4999): error parsing input array
(Full Intel MPI mpiexec usage/help text elided.)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329)
Copyright 2003-2018 Intel Corporation.
^C[mpiexec@pegasus] Sending Ctrl-C to processes as requested
[mpiexec@pegasus] Press Ctrl-C again to force abort
Needs some form of IO to begin testing larger cases and visualizing output.
A dependency on NetCDF could be limiting, but most supercomputers have NetCDF installed, and it is not that difficult to install elsewhere (Homebrew, apt-get, etc.).
The next core physics component to add from ICAR is the linear-theory mountain-wave solution. This is actually non-trivial because it requires FFTs over the full domain. However, these FFTs can be run at initialization to create a lookup table (LUT), and in the future this LUT can be written to and read from disk.
Although the LT solution (LUT access) only needs to be computed once every IO forcing time step, with LARGE numbers of cores this could be a limiting step because some sort of more global communication becomes necessary: e.g., the LUT has to be stored on a master process and accessed by all, or (probably better, because it can be very large) it has to be distributed among all processes, and each process has to access another process's memory once for EVERY gridcell in its domain.
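The bookkeeping for the distributed option can be sketched as follows (a simple block distribution, purely illustrative of the access pattern rather than ICAR's actual layout):

```python
def lut_owner(lut_index, lut_size, n_images):
    """Map a global LUT entry to the image that stores it under a
    simple block distribution: image 0 holds the first block of
    entries, image 1 the next block, and so on."""
    block = -(-lut_size // n_images)  # ceiling division: entries per image
    return lut_index // block

# Each image holds a contiguous block; every gridcell then triggers one
# remote (one-sided) access to whichever image owns its LUT entry.
lut_size, n_images = 1000, 8
owners = [lut_owner(i, lut_size, n_images) for i in range(lut_size)]
print(min(owners), max(owners))  # 0 7
```

With one-sided coarray access, the owning image does not need to participate in each lookup, which is what makes the distributed layout attractive despite the per-gridcell remote reference.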
Depends on #10
When compiled with the Cray Fortran compiler, the coarray version of ICAR scales to very high concurrencies: nearly 100,000 images on the Edison system at NERSC, and up to several tens of thousands of images on Intel Xeon Phi on the Cori system at NERSC. However, the OpenCoarrays version, compiled against Cray MPI, scales poorly on Xeon Phi, even at relatively low concurrencies. We should investigate the cause of this poor scaling and determine how it can be fixed in OpenCoarrays.
Early benchmarking of the coarray version of ICAR has expressed the concurrency in coarrays by placing one coarray image on each physical core of a compute node. On architectures such as Intel Xeon Phi, this can result in large numbers of images on relatively few nodes. Such a large degree of concurrency may affect code performance and scaling, depending on how the underlying coarray communication model is implemented in a particular compiler.
One way to mitigate the large degree of concurrency (and therefore communication) among coarray images is to combine the coarray implementation with OpenMP, so that some of the parallelism within each compute node is expressed over shared memory rather than over a global address space. We should compare the performance of coarrays+OpenMP against pure coarrays in ICAR to evaluate this.
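The design space for such a hybrid run reduces to simple arithmetic over node resources; a sketch (the 68-core node is illustrative of a Xeon Phi part, not a measured configuration):

```python
def hybrid_layouts(cores_per_node, n_nodes):
    """Enumerate (total_images, images_per_node, threads_per_image)
    combinations that exactly fill each node, i.e. where
    images_per_node * threads_per_image == cores_per_node.
    Fewer images per node means fewer coarray endpoints communicating,
    with OpenMP threads covering the remaining on-node parallelism."""
    layouts = []
    for images in range(1, cores_per_node + 1):
        if cores_per_node % images == 0:
            threads = cores_per_node // images
            layouts.append((images * n_nodes, images, threads))
    return layouts

# e.g. 4 nodes of 68 cores: pure coarrays is the 68-images-per-node row;
# the hybrid options shrink the image count while keeping all cores busy.
for total, ipn, tpi in hybrid_layouts(68, 4):
    print(f"{total:4d} images total: {ipn:2d}/node x {tpi:2d} threads")
```

Sweeping these layouts in a benchmark would show directly how much of the poor scaling is attributable to the sheer number of communicating images per node.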
Experience suggests that resolving issue #21 is a long shot. I got the Intel compiler to work on a Cray back in 2012 through Herculean effort and have never gotten it to work since. Nonetheless, it will be important to have results from using the Intel compiler at scale.