So this is a weird one for @tclune and @aoloso to ponder. Maybe even @atrayano.
To wit: for some work @aoloso is doing on the bad performance of the Skylake OPA nodes at NCCS (cc @wmputman), I built Open MPI 4.0.6 for both Skylake/OPA and Cascade Lake/InfiniBand at NCCS and then built Baselibs.
I then built the model twice, once on Cascade Lake and once on Skylake (since a given Open MPI build can only handle one interconnect at a time). The model ran just fine. The history outputs after a day are identical to those from Intel MPI. The state checkpoints are also identical to Intel MPI EXCEPT for DU and SS!
Comparing NC4 du_internal_checkpoint using nccmp...
Failure!
Checking for data differences
Variable Group Count Sum AbsSum Min Max Range Mean StdDev
DU / 4914428 0.00111215 0.00118107 -7.42602e-07 1.28244e-06 2.02504e-06 2.26302e-10 5.29118e-09
...
Comparing NC4 ss_internal_checkpoint using nccmp...
Failure!
Checking for data differences
Variable Group Count Sum AbsSum Min Max Range Mean StdDev
SS / 4912017 0.00117208 0.00126001 -4.20413e-08 1.49067e-07 1.91109e-07 2.38615e-10 1.89506e-09
And if we look closer, they are not just roundoff-level different, but crazy different, with one side exactly 0 where the other is nonzero:
DIFFER : VARIABLE : DU : POSITION : [1,1,1,1] : VALUES : 0 <> 3.40241e-13
DIFFER : VARIABLE : DU : POSITION : [2,1,1,1] : VALUES : 0 <> 1.23087e-13
DIFFER : VARIABLE : DU : POSITION : [3,1,1,1] : VALUES : 0 <> 3.18833e-14
DIFFER : VARIABLE : DU : POSITION : [4,1,1,1] : VALUES : 0 <> 5.12426e-15
...
DIFFER : VARIABLE : DU : POSITION : [45,288,72,5] : VALUES : 1.06061e-23 <> 0
DIFFER : VARIABLE : DU : POSITION : [46,288,72,5] : VALUES : 3.76284e-25 <> 0
DIFFER : VARIABLE : DU : POSITION : [47,288,72,5] : VALUES : 1.55894e-25 <> 0
DIFFER : VARIABLE : DU : POSITION : [48,288,72,5] : VALUES : 8.44485e-26 <> 0
Weirdly, the differences are mostly "0 in Open MPI", and the run of "0 <> ..." lines (zero on the left) ends at [48,288,1,1]; right after that the zero shows up on the right instead:
DIFFER : VARIABLE : DU : POSITION : [47,288,1,1] : VALUES : 0 <> 1.55894e-25
DIFFER : VARIABLE : DU : POSITION : [48,288,1,1] : VALUES : 0 <> 8.44485e-26
DIFFER : VARIABLE : DU : POSITION : [1,1,2,1] : VALUES : 1.17549e-38 <> 0
DIFFER : VARIABLE : DU : POSITION : [2,1,2,1] : VALUES : 1.17549e-38 <> 0
I think the indexing is [lat,lon,lev,unknown_dim1], so the zero-on-the-left differences are only in level one of the first ungridded dimension??
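Rather than eyeballing the log, one way to check that guess is to tally the DIFFER lines by their third and fourth indices and by which side is exactly zero. This is only a minimal sketch: it assumes the nccmp lines look exactly like the ones pasted above, the function name `tally_zero_side` is made up for illustration, and it just says "left"/"right" rather than assuming which run is which.

```python
import re
from collections import Counter

# Matches nccmp lines of the form pasted above, e.g.
# DIFFER : VARIABLE : DU : POSITION : [1,1,1,1] : VALUES : 0 <> 3.40241e-13
DIFFER_RE = re.compile(
    r"DIFFER : VARIABLE : (\w+) : POSITION : \[([\d,]+)\] : VALUES : (\S+) <> (\S+)"
)


def tally_zero_side(lines):
    """Count differences per (variable, 3rd index, 4th index, which side is zero).

    Hypothetical helper: the meaning of the 3rd/4th index (level, ungridded
    dimension) is a guess, so the keys only carry the raw index values.
    """
    counts = Counter()
    for line in lines:
        match = DIFFER_RE.search(line)
        if match is None:
            continue  # skip headers, "..." and anything else
        var, pos, left, right = match.groups()
        idx = [int(p) for p in pos.split(",")]
        if len(idx) != 4:
            continue  # this sketch only handles 4-D variables
        left_v, right_v = float(left), float(right)
        if left_v == 0.0 and right_v != 0.0:
            side = "left_zero"
        elif right_v == 0.0 and left_v != 0.0:
            side = "right_zero"
        else:
            side = "both_nonzero"
        counts[(var, idx[2], idx[3], side)] += 1
    return counts


# A few lines copied from the output above.
sample = [
    "DIFFER : VARIABLE : DU : POSITION : [1,1,1,1] : VALUES : 0 <> 3.40241e-13",
    "DIFFER : VARIABLE : DU : POSITION : [2,1,1,1] : VALUES : 0 <> 1.23087e-13",
    "DIFFER : VARIABLE : DU : POSITION : [47,288,1,1] : VALUES : 0 <> 1.55894e-25",
    "DIFFER : VARIABLE : DU : POSITION : [48,288,1,1] : VALUES : 0 <> 8.44485e-26",
    "DIFFER : VARIABLE : DU : POSITION : [1,1,2,1] : VALUES : 1.17549e-38 <> 0",
    "DIFFER : VARIABLE : DU : POSITION : [2,1,2,1] : VALUES : 1.17549e-38 <> 0",
    "DIFFER : VARIABLE : DU : POSITION : [45,288,72,5] : VALUES : 1.06061e-23 <> 0",
]

counts = tally_zero_side(sample)
for key in sorted(counts):
    print(key, counts[key])
```

Feeding it the full nccmp output (saved to a text file and read line by line) should show whether "left_zero" really only ever appears for third index 1 and fourth index 1.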
This is baffling.
Again, all other checkpoints and all of history are zero-diff apart from this!