Atmosphere transport mini app
The tracer-transport mini-app (transport_se) is a simplified version of HOMME designed to facilitate experimentation and performance improvements.

To configure and build the mini-app:

  1) cp configure.sh ~/work/       copy the configure script to a directory of your choosing
  2) vi ~/work/configure.sh        customize the USER-DEFINED PARAMETERS using a text editor
  3) ~/work/configure.sh           execute the script to configure and build the mini-app

To test the mini-app:

  4) cd ~/work/bld/test            navigate to the test directory
  5) make install                  install the namelists and test scripts
  6) vi run_ne8_test.sh            customize the mpirun command, tasks per node, etc.
  7) qsub run_ne8_test.sh          submit a test to the debug queue (or execute it on a laptop)
  8) tail -n 40 ./out_ne8_jobid    examine timing and error norm results
  9) display ~/work/run/image_*    examine plots of tracer cross-sections

Tests:

  run_ne8_test.sh      ultra low-res test for quick debug/check
  run_ne30_test.sh     1 degree test with 4 tracers for verification
  run_ne120_test.sh    1/4 degree test for additional, optional verification at high-res.
                       This test is expensive and often sits in the queue for 1 day.

  run_ne8_perf.sh
  run_ne30_perf.sh
  run_ne120_perf.sh    Test case for performance testing:
                       DCMIP1-1 for 1 hour of model time (we might have to increase run time)
                       output disabled
                       most diagnostics disabled
                       tracers increased to 35

Recommendations:

  Step 1. Use "run_ne8_test.sh" to get the code up and running.

  Step 2. Use "run_ne30_test.sh" to verify the results are correct.
          L1, L2 and Linf errors, overshoots and undershoots should agree to 2-3 digits.
          One should also check that the tracer mass is conserved by looking at
          the "Q, Q diss" values in the stdout.
          For additional verification, compare the PDF plots of the solution with
          reference plots on the ACME confluence page,
          "Transport Mini-app Setup and Test Results".
          The results should be bit-for-bit (BFB) identical when run on different
          numbers of MPI tasks and/or OpenMP threads. We should add a test for this.

  Step 3. Use "run_ne120_perf.sh" as a starting point for performance testing.

Note: This mini-app uses DCMIP tests 1-1 and 1-2, modified to use the ACME 72L configuration instead of the DCMIP equally spaced levels. The performance characteristics will be very faithful to the HOMME dycore as used by ACME. However, the ACME 72L configuration has much larger errors since there are fewer levels where the tracers are located. In addition, surface pressure is not treated correctly: for consistency with prescribed winds and disabled dynamics, we evolve the surface pressure using the implicit surface pressure from the tracer advection scheme. Thus the simulation output of this mini-app should not be used for DCMIP test case results.

__________________________________________________________________________________________________________________

Typical results for run_ne8_test.sh on NERSC:Hopper

  name                 procs  threads  count         walltotal     wallmax (proc thrd)   wallmin (proc thrd)
  DCMIP1-1: prim_run   48     48       3.456000e+04  1.853381e+03   38.974 (   0    0)    25.267 (   1    0)
  DCMIP1-2: prim_run   48     48       2.880000e+03  9.024055e+01    1.892 (  32    0)     1.660 (   1    0)

  DCMIP1-1: L1=0.564863 L2= 0.527211 Linf= 0.478169 q_max= 0.648135 q_min= -0.0615289
  DCMIP1-2: L1=0.368728 L2= 0.558725 Linf= 1.040230 q_max= 0.991667 q_min= -5.59564e-05

  Used Resources: cput=00:00:07,energy_used=0,mem=12220kb,vmem=123196kb,walltime=00:01:29

Updated 2015-6-27
With rsplit=1 and limiter=8 (monotone limiter) on Edison

  DCMIP 1-1: L1=0.529721 L2= 0.637361 Linf= 0.616722 q_max= 0.459436 q_min= -2.24628e-08
  DCMIP 1-2: L1=0.386038 L2= 0.556078 Linf= 0.914061 q_max= 0.987406 q_min= -2.64163e-06

On Darwin:

  DCMIP 1-1: L1=0.529721 L2= 0.637360 Linf= 0.616721 q_max= 0.459436 q_min= -2.246253e-08
  DCMIP 1-2: L1=0.386038 L2= 0.556077 Linf= 0.914060 q_max= 0.987405 q_min= -2.642738e-06

Updated 2015-7-14 (new direct addressing limiter)

  DCMIP 1-1: L1=0.529676 L2= 0.637311 Linf= 0.616969 q_max= 0.459701 q_min= -9.78542e-09
  DCMIP 1-2: L1=0.386039 L2= 0.556071 Linf= 0.91406 q_max= 0.987389 q_min= -2.77826e-06

Updated 2015-11-27 (rsplit=3, to match CAM)

  DCMIP 1-1: L1=0.529725 L2=0.637015 Linf=0.610602 q_max=0.458165 q_min= -9.533290e-14
  DCMIP 1-2: L1=0.263420 L2=0.388462 Linf=0.638387 q_max=0.989853 q_min= -1.274771e-08

Updated 2015-11-27 (rsplit=3, ACME 72 level configuration, skybridge)

  DCMIP 1-1: L1=0.578151 L2=0.865526 Linf=0.883168 q_max=0.187204 q_min= -3.207090e-13
  DCMIP 1-2: L1=0.307665 L2=0.622099 Linf=0.839133 q_max=0.813105 q_min= -9.385639e-06

__________________________________________________________________________________________________________________

Typical results for run_ne30_test.sh on NERSC:Hopper

  name                 procs  threads  count         walltotal     wallmax (proc thrd)   wallmin (proc thrd)
  DCMIP1-1 prim_run    216    216      7.464960e+05  8.528053e+04  394.886 (   6    0)   393.686 (   1    0)
  DCMIP1-2 prim_run    216    216      6.220800e+04  5.899456e+03   27.387 ( 202    0)    25.604 (   1    0)

  DCMIP1-1: L1=0.1810760 L2= 0.209121 Linf= 0.317952 q_max= 1.014850 q_min= -0.0410253
  DCMIP1-2: L1=0.0799994 L2= 0.150355 Linf= 0.297211 q_max= 0.977932 q_min= -5.7599e-13

  Used Resources: cput=00:00:09,energy_used=0,mem=97628kb,vmem=208712kb,walltime=00:08:41

Updated 2015-6-27
With rsplit limiter=8 (monotone limiter) on Edison

  DCMIP 1-1: L1=0.138421 L2= 0.202415 Linf= 0.298503 q_max= 0.974222 q_min= -7.4405e-14
  DCMIP 1-2: L1=0.0578899 L2= 0.123954 Linf= 0.270832 q_max= 0.979736 q_min= -1.26331e-12

Updated 2015-7-14 (new direct addressing limiter)

  DCMIP 1-1: L1=0.138438 L2= 0.202473 Linf= 0.298522 q_max= 0.974238 q_min= -7.27863e-14
  DCMIP 1-2: L1=0.0578902 L2= 0.123954 Linf= 0.270832 q_max= 0.979733 q_min= -4.54058e-14

Updated 2015-11-27 (rsplit=3, to match CAM)

  DCMIP 1-1: L1=0.138438 L2= 0.202473 Linf= 0.298522 q_max= 0.97423 q_min= -7.633613e-14
  DCMIP 1-2: L1=0.057890 L2= 0.123954 Linf= 0.270831 q_max= 0.97973 q_min= -4.540577e-14

Updated 2015-11-27 (rsplit=3, ACME 72 level config, skybridge)

  DCMIP 1-1: L1=0.490013 L2=0.789052 Linf=0.918454 q_max=0.445141 q_min= -3.559994e-11
  DCMIP 1-2: L1=0.121783 L2=0.361005 Linf=1.092784 q_max=0.836177 q_min= -3.671997e-05

__________________________________________________________________________________________________________________

Typical results for run_ne120_test.sh on NERSC:Hopper

Updated 2015-6-27
With rsplit limiter=8 (monotone limiter) on Edison

  DCMIP1-1 prim_run    960    960      4.423680e+06  2.117038e+06  2205.253 ( 709    0)  2205.234 ( 259    0)
  DCMIP1-2 prim_run    960    960      3.686400e+05  1.279739e+05   133.365 ( 662    0)   133.245 ( 443    0)

  DCMIP 1-1: L1=0.0794909 L2= 0.139211 Linf= 0.306975 q_max= 0.984071 q_min= -1.84532e-13
  DCMIP 1-2: L1=0.0363203 L2= 0.0987049 Linf= 0.277493 q_max= 0.994146 q_min= -5.89671e-14

Updated 2015-7-14 (new direct addressing limiter)

  DCMIP 1-1: L1=0.0794911 L2= 0.139212 Linf= 0.30697 q_max= 0.984063 q_min= -1.76014e-13
  DCMIP 1-2: L1=0.0363203 L2= 0.0987048 Linf= 0.277493 q_max= 0.994146 q_min= -5.06671e-14

Updated 2015-11-27 (rsplit=3, ACME 72 level config)
(something is wrong with the DCMIP 1-1 configuration on 72L, but it should still be ok for performance)

  DCMIP 1-1: L1=0.479398 L2=0.782613 Linf=0.922696 q_max=0.501561 q_min= -1.070570e-09
  DCMIP 1-2: L1=0.081287 L2=0.264887 Linf=0.591157 q_max=0.959530 q_min= -2.795861e-09

__________________________________________________________________________________________________________________

Typical results for run_ne120_perf.sh on Edison:

  A x B x C, with A = nodes, B = mpitasks_per_node, C = threads_per_mpitask
  40 node cases. 2160 elements per node. 1h simulation time

                                            2015-6-27     2015-11-29
                                            50 tracers    35 tracers
                                            64L           72L
                                            rsplit=1      rsplit=3

  40x24x1   prim_run                                      42.643
            prim_advec_tracers_remap_rk2    53.343        37.195
            vertical_remap                   1.800         1.527

  40x12x2   prim_run                                      43.306
            prim_advec_tracers_remap_rk2    56.085        37.826
            vertical_remap                   1.801         1.518

  40x6x4    prim_run                                      44.176
            prim_advec_tracers_remap_rk2    63.195        38.644
            vertical_remap                   1.769         1.473

  40x2x12   prim_run                                      46.040
            prim_advec_tracers_remap_rk2    88.949        40.390
            vertical_remap                   1.765         1.492
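The "agree to 2-3 digits" check in Step 2 of the recommendations can be done mechanically. What follows is a minimal Python sketch, not part of the mini-app: the helpers `parse_norms` and `compare` are hypothetical names, and the regular expression assumes the `name= value` layout of the result lines quoted in this README. By default q_min is left out of the comparison, on the assumption that tiny undershoots near zero will not reproduce to several digits.

```python
import math
import re

# Matches fields such as "L1=0.529721" or "q_min= -2.24628e-08".
# The "name= value" layout is assumed from the result lines in this README.
FIELD = re.compile(r"(L1|L2|Linf|q_max|q_min)\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_norms(line):
    """Return {field name: float value} for one result line."""
    return {name: float(value) for name, value in FIELD.findall(line)}


def compare(line, reference, digits=2, names=("L1", "L2", "Linf", "q_max")):
    """Per-field check that values agree to roughly `digits` significant digits."""
    got, ref = parse_norms(line), parse_norms(reference)
    tol = 10.0 ** -digits  # relative tolerance
    return {n: math.isclose(got[n], ref[n], rel_tol=tol) for n in names}


# ne8 DCMIP 1-1 lines quoted from the results in this README.
hopper = "DCMIP1-1: L1=0.564863 L2= 0.527211 Linf= 0.478169 q_max= 0.648135 q_min= -0.0615289"
edison = "DCMIP 1-1: L1=0.529721 L2= 0.637361 Linf= 0.616722 q_max= 0.459436 q_min= -2.24628e-08"
darwin = "DCMIP 1-1: L1=0.529721 L2= 0.637360 Linf= 0.616721 q_max= 0.459436 q_min= -2.246253e-08"

for name, ok in compare(edison, darwin, digits=3).items():
    print(name, "agrees" if ok else "DIFFERS")
```

With these inputs the Edison and Darwin lines agree in every compared field, while comparing the older Hopper line against Edison flags L1 as different, which is expected since the limiter settings changed between those runs.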
I note that the test cases are producing quite a bit of diagnostics output. This can be a performance hit, since it involves lots of global reductions. We should consider reducing this substantially, so our performance profiles don't include these global reductions.
Need to update cmake/machines/edison.cmake to allow for OpenMP testing on Edison.
The mini-app should only launch threads that have a non-zero number of elements.
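One reading of this requirement is that a rank owning fewer elements than the requested thread count should launch fewer threads. The Python sketch below illustrates the arithmetic only. The helper names are hypothetical, and it assumes a cubed-sphere grid of 6*ne*ne elements split as evenly as possible over MPI ranks, with threading over elements. The mini-app's actual partitioning may differ.

```python
def total_elements(ne):
    """Elements on a cubed sphere with ne x ne elements per cube face."""
    return 6 * ne * ne


def elements_per_rank(ne, nranks):
    """Even split of elements over ranks; the first `extra` ranks get one more."""
    base, extra = divmod(total_elements(ne), nranks)
    return [base + 1 if r < extra else base for r in range(nranks)]


def active_threads(nelem_on_rank, requested_threads):
    """Threads to launch so that every launched thread owns at least one element."""
    return max(1, min(requested_threads, nelem_on_rank))


# ne8 on 48 ranks gives 8 elements per rank, so 12 requested threads
# would leave 4 threads idle unless the count is capped at 8.
counts = elements_per_rank(8, 48)
print(counts[0], active_threads(counts[0], 12))
```

The same element count reproduces the "2160 elements per node" figure quoted for ne120 on 40 nodes (6 * 120 * 120 / 40).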
To match ACME, we need to add this to all namelists:
vert_remap_q_alg = 1
rsplit = 3
Setting rsplit=3 may increase the errors, because I think the prescribed winds will not be computed on the floating Lagrangian layers, and so will not be 100% correct. But in practice we remap every 3 timesteps, so this should be turned on to match ACME.
Lots of difference values are printed in prim_advec after lim8 when COLUMN_OMP=true, and number of threads > 1 and -cc numa_node flag is used on Edison. Have examined the DCMIP solutions and found them to be damaged in this case. So the errors are real, not spurious. These errors might be related to issue #11.
David, a couple of issues: The current code, when setting the velocity, calls "set_element_state()" each timestep. This also sets the tracers, but I think only at the future timelevel, which is then overwritten by the actual tracer timestep. But I wanted to make sure this was correct?
Second question: where is the best place to set the values of tracers 5..qsize? If it was done in "set_element_state" it would automatically apply to all test cases, but this doesn't seem like a good idea.
Finally, if we bump this up to 50 tracers, because of the way set_element_state() works, we will make a copy of them every timestep. I worry this will skew our timings a small amount. As this copy is unneeded, should this DCMIP wrapper interface be changed, perhaps to a set_dynamics_state and a set_tracer_state?
Az reported that the ne120 configuration crashes (negative layer thickness - time step too small).
This recent patch from John Dennis in HOMME (see below) might be related to this bug.
dennis
Fixed issues with COLUMN_OPENMP in the Eulerian Advection code.
--- trunk/src/share/prim_advection_mod.F90 2015-07-10 03:38:41 UTC (rev 4738)
+++ trunk/src/share/prim_advection_mod.F90 2015-07-16 15:27:16 UTC (rev 4739)
@@ -1,7 +1,7 @@
-#define NEWEULER_B4B 1
+#undef NEWEULER_B4B
module EXTRAE_MODULE
@@ -2355,11 +2355,14 @@
do ie = nets , nete
! add hyperviscosity to RHS. apply to Q at timelevel n0, Qdp(n0)/dp
-!$omp parallel do private(k, q)
+!$omp parallel do private(k)
do k = 1 , nlev ! Loop index added with implicit inversion (AAM)
dp(:,:,k) = elem(ie)%derived%dp(:,:,k) - rhs_multiplier*dt*elem(ie)%derived%divdp_proj(:,:,k)
enddo
+#if (defined COLUMN_OPENMP)
+!$omp parallel do private(q,k)
+#endif
do q = 1 , qsize
do k=1,nlev
Qtens_biharmonic(:,:,k,q,ie) = elem(ie)%state%Qdp(:,:,k,q,n0_qdp)/dp(:,:,k)
@@ -2390,10 +2393,10 @@
! nu_p>0): qtens_biharmonc *= elem()%psdiss_ave (for consistency, if nu_p=nu_q)
if ( nu_p > 0 ) then
do ie = nets , nete
+#ifdef NEWEULER_B4B
!$omp parallel do private(k, q, dp0, dpdiss)
!$omp parallel do private(k, q, dpdiss)
!$omp parallel do private(k)
!$omp parallel do private(q,k)
!$omp parallel do private(k, q, dp0)
!$omp parallel do private(k, q, dp0)
!$omp parallel do private(q,k,kptr)
This mini-app will mostly be used for performance work, and the test cases are needed to verify correctness. Is one test sufficient to verify correctness? When should a user need to run both tests?
We need to update the namelists so that limiter_option = 8
Small test cases are good for debugging. For performance, we need a ne120 test case that matches the ACME watercycle prototype (our best estimate of what ACME v1 will look like).
To match ACME, we need:
for all test cases:
rsplit = 3 (remap every 3 timesteps)
limiter_option=8 (monotone limiter)
use PPM vertical remap
Just for the ne120 performance test:
set tracers to 50
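A quick way to confirm that a namelist carries these settings is to scan it for the expected key/value pairs. The Python sketch below is illustrative only: the helper names are hypothetical, the sample namelist text is made up, and `vert_remap_q_alg = 1` is assumed (from the related namelist note in this document) to be the setting that selects the PPM vertical remap. The tracer count is not checked.

```python
import re

# Values taken from the settings listed in these notes.
REQUIRED = {"rsplit": "3", "limiter_option": "8", "vert_remap_q_alg": "1"}


def namelist_settings(text):
    """Collect simple `key = value` pairs from Fortran namelist text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0]  # drop Fortran-style comments
        for key, value in re.findall(r"(\w+)\s*=\s*([^,\s/]+)", line):
            settings[key.lower()] = value
    return settings


def missing_or_wrong(text, required=REQUIRED):
    """Return {key: value found (or None)} for every setting that does not match."""
    found = namelist_settings(text)
    return {k: found.get(k) for k, v in required.items() if found.get(k) != v}


# Made-up namelist fragment for illustration.
sample = """&ctl_nl
  ne = 8
  rsplit = 1        ! remap every step
  limiter_option = 8
/
"""

print(missing_or_wrong(sample))
```

For the sample above, the report flags `rsplit` (found `1`) and `vert_remap_q_alg` (not present), while `limiter_option` passes.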
There are 8 loops over np in the limiter_optim_iter_full subroutine in prim_advection_mod.F90. In most cases, np is 4 and most of the loops have trip counts of 4-by-4, 4, or 16. Since the call to this subroutine is already inside a nested OMP parallel region, further improvement should come from SIMD. If vectorization is not possible, we should explore loop unrolling by a factor of 4.