
transport_se's Introduction

The tracer-transport mini-app (transport_se) is a simplified version of
HOMME designed to facilitate experimentation and performance improvements.

To configure and build the mini-app:

  1) cp configure.sh ~/work/        copy the configure script to a directory of your choosing
  2) vi ~/work/configure.sh         customize the USER-DEFINED PARAMETERS using a text editor
  3) ~/work/configure.sh            execute the script to configure and build the mini-app

To test the mini-app:

  4) cd ~/work/bld/test             navigate to the test directory
  5) make install                   install the namelists and test scripts
  6) vi run_ne8_test.sh             customize the mpirun command, tasks per node, etc.
  7) qsub run_ne8_test.sh           submit a test to the debug queue (or execute it on a laptop)
  8) tail -n 40 ./out_ne8_jobid     examine timing and error norm results
  9) display ~/work/run/image_*     examine plots of tracer cross-sections


Tests:
run_ne8_test.sh     ultra-low-res test for a quick debug check
run_ne30_test.sh    1 degree test with 4 tracers for verification
run_ne120_test.sh   1/4 degree test for additional, optional verification at high resolution.
                    This test is expensive and often sits in the queue for a day.

Performance tests:
run_ne8_perf.sh
run_ne30_perf.sh
run_ne120_perf.sh   Test cases for performance testing: DCMIP1-1 for 1 hour
                    of model time (we may need to increase the run time).
                    Output is disabled and most diagnostics are disabled;
                    the tracer count is increased to 35. A namelist sketch
                    of this configuration follows.
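For reference, a minimal namelist sketch of that configuration, assuming
HOMME's usual &ctl_nl variable names (qsize, statefreq, and test_case are
standard HOMME controls, but the exact names and values shipped in the
installed perf namelists may differ):

  &ctl_nl
    test_case        = "dcmip1-1"  ! assumed label; check the installed namelist
    qsize            = 35          ! tracer count increased to 35
    rsplit           = 3           ! vertical remap every 3 tracer timesteps
    limiter_option   = 8           ! monotone limiter
    vert_remap_q_alg = 1           ! PPM vertical remap
    statefreq        = 999999      ! effectively disable per-step stdout diagnostics
  /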

Recommendations:

   Step 1. Use "run_ne8_test.sh" to get the code up and running.

   Step 2. Use "run_ne30_test.sh" to verify that the results are correct.
   The L1, L2, and Linf errors and the overshoots and undershoots should
   agree to 2-3 digits.  One should also check that tracer mass is
   conserved by looking at the "Q, Q diss" values in the stdout; a sketch
   of that check follows below.  For additional verification, compare the
   PDF plots of the solution with the reference plots on the ACME
   confluence page, "Transport Mini-app Setup and Test Results".
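   A minimal sketch of that conservation check, assuming you paste the
   global tracer masses from the first and last "Q, Q diss" diagnostic
   lines (qmass0, qmass1, and the tolerance are illustrative):

     ! Sketch: tracer mass conservation check using stdout diagnostics.
     program check_mass
       implicit none
       real(kind=8) :: qmass0, qmass1, relerr
       qmass0 = 1.0d0           ! paste mass from the first "Q, Q diss" line
       qmass1 = 1.0d0 + 1.0d-15 ! paste mass from the last "Q, Q diss" line
       relerr = abs(qmass1 - qmass0) / abs(qmass0)
       if (relerr < 1.0d-12) then
          print *, 'tracer mass conserved to machine precision:', relerr
       else
          print *, 'WARNING: tracer mass drift:', relerr
       end if
     end program check_mass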

   The results should be BFB (bit-for-bit) when run on different numbers
   of MPI tasks and/or OpenMP threads.  We should add a test for this.

   Step 3. Use "run_ne120_perf.sh" as a starting point for performance testing.  


Note: This mini-app uses DCMIP tests 1-1 and 1-2, modified to use the
ACME 72L configuration instead of the DCMIP equally spaced levels.  The
performance characteristics will be very faithful to the HOMME dycore as
used by ACME.  However, the ACME 72L configuration has much larger
errors, since fewer levels lie where the tracers are located.  In
addition, surface pressure is not treated correctly: for consistency
with prescribed winds and disabled dynamics, we evolve the surface
pressure using the implicit surface pressure from the tracer advection
scheme.  Thus the simulation output of this mini-app should not be used
for DCMIP test case results.
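Concretely, that treatment amounts to diagnosing surface pressure from
the advected Lagrangian layer thicknesses rather than evolving a separate
prognostic equation, roughly as in this sketch (names follow HOMME
conventions; the actual code path differs):

  ! Sketch: surface pressure implied by the tracer advection scheme.
  subroutine diagnose_ps(dp, ptop, ps_v, np, nlev)
    implicit none
    integer, intent(in)       :: np, nlev
    real(kind=8), intent(in)  :: dp(np,np,nlev) ! layer pressure thicknesses
    real(kind=8), intent(in)  :: ptop           ! model-top pressure
    real(kind=8), intent(out) :: ps_v(np,np)    ! diagnosed surface pressure
    ! surface pressure = model-top pressure + sum of layer thicknesses
    ps_v = ptop + sum(dp, dim=3)
  end subroutine diagnose_ps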



__________________________________________________________________________________________________________________

Typical results for run_ne8_test.sh on NERSC:Hopper

            name     procs  threads  count         walltotal     wallmax (proc   thrd  )   wallmin (proc   thrd  )
  DCMIP1-1: prim_run 48     48       3.456000e+04  1.853381e+03  38.974  (     0      0)   25.267  (     1      0)
  DCMIP1-2: prim_run 48     48       2.880000e+03  9.024055e+01  1.892   (    32      0)   1.660   (     1      0)

  DCMIP1-1: L1=0.564863 L2= 0.527211 Linf= 0.478169 q_max= 0.648135 q_min= -0.0615289
  DCMIP1-2: L1=0.368728 L2= 0.558725 Linf= 1.040230 q_max= 0.991667 q_min= -5.59564e-05

  Used Resources: cput=00:00:07,energy_used=0,mem=12220kb,vmem=123196kb,walltime=00:01:29

Updated 2015-6-27 
With rsplit=1 and limiter=8 (monotone limiter) on Edison
  DCMIP 1-1: L1=0.529721 L2= 0.637361 Linf= 0.616722 q_max= 0.459436 q_min= -2.24628e-08
  DCMIP 1-2: L1=0.386038 L2= 0.556078 Linf= 0.914061 q_max= 0.987406 q_min= -2.64163e-06
On Darwin:
  DCMIP 1-1: L1=0.529721 L2= 0.637360 Linf= 0.616721 q_max= 0.459436 q_min= -2.246253e-08
  DCMIP 1-2: L1=0.386038 L2= 0.556077 Linf= 0.914060 q_max= 0.987405 q_min= -2.642738e-06

Updated 2015-7-14 (new direct addressing limiter)
  DCMIP 1-1: L1=0.529676 L2= 0.637311 Linf= 0.616969 q_max= 0.459701 q_min= -9.78542e-09
  DCMIP 1-2: L1=0.386039 L2= 0.556071 Linf= 0.91406 q_max= 0.987389 q_min= -2.77826e-06

Updated 2015-11-27 (rsplit=3, to match CAM)
  DCMIP 1-1: L1=0.529725 L2=0.637015 Linf=0.610602 q_max=0.458165 q_min= -9.533290e-14
  DCMIP 1-2: L1=0.263420 L2=0.388462 Linf=0.638387 q_max=0.989853 q_min= -1.274771e-08

Updated 2015-11-27 (rsplit=3, ACME 72 level configuration, skybridge)
  DCMIP 1-1: L1=0.578151 L2=0.865526 Linf=0.883168 q_max=0.187204 q_min= -3.207090e-13
  DCMIP 1-2: L1=0.307665 L2=0.622099 Linf=0.839133 q_max=0.813105 q_min= -9.385639e-06




__________________________________________________________________________________________________________________

Typical results for run_ne30_test.sh on NERSC:Hopper

            name      procs  threads  count         walltotal     wallmax (proc   thrd  )   wallmin (proc   thrd  )
  DCMIP1-1  prim_run  216    216      7.464960e+05  8.528053e+04  394.886 (     6      0)   393.686 (     1      0)
  DCMIP1-2  prim_run  216    216      6.220800e+04  5.899456e+03  27.387  (   202      0)    25.604 (     1      0)

  DCMIP1-1: L1=0.1810760 L2= 0.209121 Linf= 0.317952 q_max= 1.014850 q_min= -0.0410253
  DCMIP1-2: L1=0.0799994 L2= 0.150355 Linf= 0.297211 q_max= 0.977932 q_min= -5.7599e-13

  Used Resources: cput=00:00:09,energy_used=0,mem=97628kb,vmem=208712kb,walltime=00:08:41

Updated 2015-6-27
With rsplit=1 and limiter=8 (monotone limiter) on Edison
  DCMIP 1-1: L1=0.138421  L2= 0.202415 Linf= 0.298503 q_max= 0.974222 q_min= -7.4405e-14
  DCMIP 1-2: L1=0.0578899 L2= 0.123954 Linf= 0.270832 q_max= 0.979736 q_min= -1.26331e-12

Updated 2015-7-14 (new direct addressing limiter)
  DCMIP 1-1: L1=0.138438  L2= 0.202473 Linf= 0.298522 q_max= 0.974238 q_min= -7.27863e-14
  DCMIP 1-2: L1=0.0578902 L2= 0.123954 Linf= 0.270832 q_max= 0.979733 q_min= -4.54058e-14

Updated 2015-11-27  (rsplit=3, to match CAM)
  DCMIP 1-1: L1=0.138438 L2= 0.202473 Linf= 0.298522 q_max= 0.97423 q_min= -7.633613e-14
  DCMIP 1-2: L1=0.057890 L2= 0.123954 Linf= 0.270831 q_max= 0.97973 q_min= -4.540577e-14

Updated 2015-11-27  (rsplit=3, ACME 72 level config, skybridge)
  DCMIP 1-1: L1=0.490013 L2=0.789052 Linf=0.918454 q_max=0.445141 q_min= -3.559994e-11
  DCMIP 1-2: L1=0.121783 L2=0.361005 Linf=1.092784 q_max=0.836177 q_min= -3.671997e-05



__________________________________________________________________________________________________________________

Typical results for run_ne120_test.sh on NERSC:Hopper

Updated 2015-6-27
With rsplit=1 and limiter=8 (monotone limiter) on Edison

  DCMIP1-1 prim_run   960   960   4.423680e+06  2.117038e+06  2205.253 (  709     0)  2205.234 (  259     0)
  DCMIP1-2 prim_run   960   960   3.686400e+05  1.279739e+05   133.365 (  662     0)  133.245  (  443     0)

  DCMIP 1-1: L1=0.0794909 L2= 0.139211  Linf= 0.306975 q_max= 0.984071 q_min= -1.84532e-13
  DCMIP 1-2: L1=0.0363203 L2= 0.0987049 Linf= 0.277493 q_max= 0.994146 q_min= -5.89671e-14

Updated 2015-7-14 (new direct addressing limiter)
  DCMIP 1-1: L1=0.0794911 L2= 0.139212  Linf= 0.30697  q_max= 0.984063 q_min= -1.76014e-13
  DCMIP 1-2: L1=0.0363203 L2= 0.0987048 Linf= 0.277493 q_max= 0.994146 q_min= -5.06671e-14

Updated 2015-11-27 (rsplit=3, ACME 72 level config)
  (something is wrong with the DCMIP 1-1 configuration on 72L, but it should still be OK for performance)
  DCMIP 1-1: L1=0.479398 L2=0.782613 Linf=0.922696 q_max=0.501561 q_min= -1.070570e-09
  DCMIP 1-2: L1=0.081287 L2=0.264887 Linf=0.591157 q_max=0.959530 q_min= -2.795861e-09




__________________________________________________________________________________________________________________

Typical results for run_ne120_perf.sh on Edison:

Cases are labeled A x B x C, with
A = nodes
B = mpitasks_per_node
C = threads_per_mpitask

40-node cases.  2160 elements per node.  1 h of simulation time.
Times in seconds; a dash marks a value that was not recorded.

                                     2015-6-27            2015-11-29
                                     50 tracers           35 tracers
                                     64L                  72L
                                     rsplit=1             rsplit=3
40x24x1
prim_run                                -                  42.643
prim_advec_tracers_remap_rk2         53.343                37.195
vertical_remap                        1.800                 1.527

40x12x2
prim_run                                -                  43.306
prim_advec_tracers_remap_rk2         56.085                37.826
vertical_remap                        1.801                 1.518

40x6x4
prim_run                                -                  44.176
prim_advec_tracers_remap_rk2         63.195                38.644
vertical_remap                        1.769                 1.473

40x2x12
prim_run                                -                  46.040
prim_advec_tracers_remap_rk2         88.949                40.390
vertical_remap                        1.765                 1.492






transport_se's People

Contributors

amametjanov, ambrad, drhansj, halldm2000, mt5555


transport_se's Issues

verbose stdout?

I note that the test cases are producing quite a bit of diagnostic output. This can be a performance hit, since it involves lots of global reductions. We should consider reducing this substantially, so that our performance profiles don't include these global reductions.
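One low-risk option is to gate the diagnostics on a namelist frequency so the global reductions drop out of timed steps. A minimal sketch of the pattern (the frequency value and loop body are illustrative; the real diagnostic routine lives in HOMME):

  ! Sketch: only compute global diagnostics every statefreq steps.
  program gate_diagnostics
    implicit none
    integer :: nstep
    integer, parameter :: nmax = 100
    integer, parameter :: statefreq = 50  ! hypothetical diagnostic frequency
    do nstep = 1, nmax
       ! ... advance dynamics/tracers here ...
       if (statefreq > 0 .and. mod(nstep, statefreq) == 0) then
          print *, 'step', nstep, ': compute and print global diagnostics'
       end if
    end do
  end program gate_diagnostics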

vertical remap algorithm namelist value needs to be changed

To match ACME, we need to add this to all namelists:

vert_remap_q_alg = 1
rsplit = 3

Setting rsplit=3 may increase the errors, because I think the prescribed winds will not be computed on the floating Lagrangian layers, and so will not be 100% correct. But in practice we remap every 3 timesteps, so this should be turned on to match ACME.

lim8 produces diff warnings when using COLUMN_OMP and -cc numa_node

Lots of difference values are printed in prim_advec after lim8 when COLUMN_OMP=true, the number of threads is > 1, and the -cc numa_node flag is used on Edison. We have examined the DCMIP solutions and found them to be damaged in this case, so the errors are real, not spurious. These errors might be related to issue #11.

Most tracers are set to 0

David, a couple of issues: The current code, when setting the velocity, calls "set_element_state()" each timestep. This also sets the tracers, but I think only at the future timelevel, which is then overwritten by the actual tracer timestep. I wanted to make sure this is correct.

Second question: where is the best place to set the values of tracers 5..qsize? If it were done in "set_element_state" it would automatically apply to all test cases, but this doesn't seem like a good idea.

Finally, if we bump this up to 50 tracers, because of the way set_element_state() works, we will make a copy of them every timestep. I worry this will skew our timings a small amount. As this copy is unneeded, should this DCMIP wrapper interface be changed, perhaps into a set_dynamics_state() and a set_tracer_state()?
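A hedged sketch of the proposed split (set_dynamics_state and set_tracer_state are the hypothetical names suggested above; the argument lists are illustrative, not existing HOMME code):

  ! Sketch: prescribe winds every step, but set tracers only once, so the
  ! per-step copy of all qsize tracer fields disappears.
  module dcmip_wrapper_sketch
    implicit none
  contains
    subroutine set_dynamics_state(v, time)     ! call every timestep
      real(kind=8), intent(out) :: v(:,:,:,:)  ! prescribed wind field
      real(kind=8), intent(in)  :: time
      v = 0.0d0   ! ... evaluate the analytic DCMIP winds at 'time' ...
    end subroutine set_dynamics_state

    subroutine set_tracer_state(q)             ! call once, at initialization
      real(kind=8), intent(out) :: q(:,:,:,:)  ! all qsize tracer fields
      q = 0.0d0   ! ... analytic tracers 1..4, zeros for tracers 5..qsize ...
    end subroutine set_tracer_state
  end module dcmip_wrapper_sketch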

Conservation errors when threading inside the column, in the limiter

This recent patch from John Dennis in HOMME (see below) might be related to this bug.

Commit by dennis: "Fixed issues with COLUMN_OPENMP in the Eulerian Advection code."

Modified: trunk/src/share/prim_advection_mod.F90

--- trunk/src/share/prim_advection_mod.F90      2015-07-10 03:38:41 UTC (rev 4738)
+++ trunk/src/share/prim_advection_mod.F90      2015-07-16 15:27:16 UTC (rev 4739)
@@ -1,7 +1,7 @@
 #ifdef HAVE_CONFIG_H
 #include "config.h"
 #endif
-#define NEWEULER_B4B 1
+#undef NEWEULER_B4B
 #define OVERLAP 1
 module EXTRAE_MODULE
@@ -2355,11 +2355,14 @@
 do ie = nets , nete
 ! add hyperviscosity to RHS. apply to Q at timelevel n0, Qdp(n0)/dp
 #if (defined COLUMN_OPENMP)
-!$omp parallel do private(k, q)
+!$omp parallel do private(k)
 #endif
 do k = 1 , nlev    !  Loop index added with implicit inversion (AAM)
   dp(:,:,k) = elem(ie)%derived%dp(:,:,k) - rhs_multiplier*dt*elem(ie)%derived%divdp_proj(:,:,k)
 enddo
+#if (defined COLUMN_OPENMP)
+!$omp parallel do private(q,k)
+#endif
 do q = 1 , qsize
 do k=1,nlev
 Qtens_biharmonic(:,:,k,q,ie) = elem(ie)%state%Qdp(:,:,k,q,n0_qdp)/dp(:,:,k)
@@ -2390,10 +2393,10 @@
 ! nu_p>0): qtens_biharmonc *= elem()%psdiss_ave (for consistency, if nu_p=nu_q)
 if ( nu_p > 0 ) then
 do ie = nets , nete
+#ifdef NEWEULER_B4B
 #if (defined COLUMN_OPENMP)
- !$omp parallel do private(k, q, dp0, dpdiss)
+ !$omp parallel do private(k, q, dpdiss)
 #endif
-#ifdef NEWEULER_B4B
 do k = 1 , nlev
 dpdiss(:,:) = elem(ie)%derived%dpdiss_ave(:,:,k)
 do q = 1 , qsize
@@ -2402,9 +2405,15 @@
 enddo
 enddo
 #else
+#if (defined COLUMN_OPENMP)
+ !$omp parallel do private(k)
+#endif
 do k = 1 , nlev
 dpdissk(:,:,k) = elem(ie)%derived%dpdiss_ave(:,:,k)/dp0(k)
 enddo
+#if (defined COLUMN_OPENMP)
+ !$omp parallel do private(q,k)
+#endif
 do q = 1 , qsize
 do k = 1 , nlev
 ! NOTE: divide by dp0 since we multiply by dp0 below
@@ -2428,7 +2437,7 @@
 call biharmonic_wk_scalar(elem,qtens_biharmonic,deriv,edgeAdv,hybrid,nets,nete)
 do ie = nets , nete
 #if (defined COLUMN_OPENMP)
- !$omp parallel do private(k, q, dp0)
+!$omp parallel do private(k, q, dp0)
 #endif
 do q = 1 , qsize
 do k = 1 , nlev ! Loop inversion (AAM)
@@ -2447,7 +2456,7 @@
 do ie = nets , nete
 #if (defined COLUMN_OPENMP)
- !$omp parallel do private(k, q, dp0)
+!$omp parallel do private(k, q)
 #endif
 do q = 1 , qsize
 do k = 1 , nlev ! Loop inversion (AAM)
@@ -2476,7 +2485,7 @@
 ! Compute velocity used to advance Qdp
 #if (defined COLUMN_OPENMP)
-!$omp parallel do private(k)
+ !$omp parallel do private(k)
 #endif
 do k = 1 , nlev ! Loop index added (AAM)
 ! derived variable divdp_proj() (DSS'd version of divdp) will only be correct on 2nd and 3rd stage
@@ -2487,6 +2496,9 @@
 enddo
 if ( limiter_option == 8) then
 ! Note that the term dpdissk is independent of Q
+#if (defined COLUMN_OPENMP)
+ !$omp parallel do private(q,k,dpdiss)
+#endif
 do k = 1 , nlev ! Loop index added (AAM)
 ! UN-DSS'ed dp at timelevel n0+1:
 dpdissk(:,:,k) = dp(:,:,k) - dt * elem(ie)%derived%divdp(:,:,k)
@@ -2508,7 +2520,7 @@
 ! advance Qdp
 #if (defined COLUMN_OPENMP)
-!$omp parallel do private(q,k,gradQ,dp_star,qtens,dpdiss)
+ !$omp parallel do private(q,k,gradQ,dp_star,qtens,kptr)
 #endif
 do q = 1 , qsize
 do k = 1 , nlev ! dp_star used as temporary instead of divdp (AAM)
@@ -2561,9 +2573,9 @@
 call edgeVpack(edgeAdvp1 , elem(ie)%state%Qdp(:,:,:,q,np1_qdp) , nlev , kptr , ie )
 enddo
 ! also DSS extra field
-#if (defined COLUMN_OPENMP)
-!$omp parallel do private(k)
-#endif
+!JMD#if (defined COLUMN_OPENMP)
+!JMD !$omp parallel do private(k)
+!JMD#endif
 do k = 1 , nlev
 DSSvar(:,:,k) = elem(ie)%spheremp(:,:) * DSSvar(:,:,k)
 enddo
@@ -2578,12 +2590,15 @@
 if ( DSSopt == DSSdiv_vdp_ave ) DSSvar => elem(ie)%derived%divdp_proj(:,:,:)
 call edgeVunpack( edgeAdvp1 , DSSvar(:,:,1:nlev) , nlev , qsize*nlev , ie )
+#if (defined COLUMN_OPENMP)
+ !$omp parallel do private(k)
+#endif
 do k = 1 , nlev
 DSSvar(:,:,k) = DSSvar(:,:,k) * elem(ie)%rspheremp(:,:)
 enddo
 #if (defined COLUMN_OPENMP)
- !$omp parallel do private(q,k,kptr)
+!$omp parallel do private(q,k,kptr)
 #endif
 do q = 1 , qsize
 kptr = nlev*(q-1)

Homme mailing list
[email protected]
http://mailman.ucar.edu/mailman/listinfo/homme

Why two test cases (DCMIP 1-1 and DCMIP 1-2)?

This mini-app will mostly be used for performance work, and the test cases are needed to verify correctness. Is one test sufficient to verify correctness? When should a user need to run both tests?

create test case which matches ACME watercycle prototype

Small test cases are good for debugging. For performance, we need an ne120 test case that matches the ACME watercycle prototype (our best estimate of what ACME v1 will look like).

To match ACME, we need, for all test cases:
rsplit = 3 (remap every 3 timesteps)
limiter_option = 8 (monotone limiter)
PPM vertical remap

Just for the ne120 performance test:
set tracers to 50

These settings are gathered into a namelist sketch below.
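In namelist form, the settings above amount to something like this sketch (identifying PPM with vert_remap_q_alg = 1 follows the earlier issue "vertical remap algorithm namelist value needs to be changed"):

  &ctl_nl
    rsplit           = 3   ! remap every 3 timesteps
    limiter_option   = 8   ! monotone limiter
    vert_remap_q_alg = 1   ! PPM vertical remap
    qsize            = 50  ! ne120 performance test only
  /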

Vectorize np loops in limiter_optim_iter_full

There are 8 np loops in the limiter_optim_iter_full subroutine in prim_advection_mod.F90. In most cases np is 4, so most of the loops have trip counts of 4-by-4, 4, or 16. Since the call to this subroutine is already inside a nested OMP parallel region, further improvement should come from SIMD. If vectorization is not possible, we should explore unrolling by a factor of 4.
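For illustration, one way to expose the 4x4 iteration space to the vectorizer is to collapse it into a single 16-trip loop under an OpenMP SIMD directive. This is a sketch of the loop shape only, with hypothetical array names, not the actual limiter_optim_iter_full code:

  ! Sketch: collapse an np x np (4x4) nest into one np*np-trip SIMD loop.
  subroutine clip_and_sum(ptens, dpmass, qmin_k, mass, np2)
    implicit none
    integer, intent(in)         :: np2          ! np*np, typically 16
    real(kind=8), intent(inout) :: ptens(np2)   ! tracer values, flattened
    real(kind=8), intent(in)    :: dpmass(np2)  ! layer mass weights
    real(kind=8), intent(in)    :: qmin_k       ! lower bound to enforce
    real(kind=8), intent(out)   :: mass
    integer :: i
    mass = 0.0d0
    !$omp simd reduction(+:mass)
    do i = 1, np2
       ptens(i) = max(ptens(i), qmin_k)         ! clip to the bound
       mass = mass + ptens(i)*dpmass(i)         ! accumulate limited mass
    end do
  end subroutine clip_and_sum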
