Comments (36)

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

Wow! @ye-luo

This is an important breakthrough in the spline bug arena. This is a VMC run (even though the files are labeled DMC), and in the real code the energies are crazy on a point-by-point basis, i.e. it is not just a small part of phase space that is wrong. This suggests a problem with pointers, indexing, conversions etc. It is interesting that the code does not crash and that the complex version appears to get a reasonable (correct?) result.

My suggestion is that Ye (@ye-luo) takes a look at this as a priority unless he is "full". A first challenge is to reproduce the problem on another system. It is certainly a very real bug on a Cray Intel system (eos), so it is likely to be general.

This is our scariest and most important known bug. Hopefully spline bug the second is the same one as spline bug the first.

(Comment written after speaking to Jaron by phone)

from qmcpack.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

This bug is particularly scary, just like the first spline bug, because we cannot rule out the possibility that it is slightly biasing the results of production runs that otherwise appear normal.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

No problem shows up on BG/Q. I ran both the DFT (QE 5.3.0) and the VMC (QMCPACK rev 7337) on BG/Q:
qmca -q ev *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.870550 +/- 0.007271 404.970022 +/- 1.933910 0.1585
dmcJ0_real_twist0 series 0 -2554.833197 +/- 0.010036 407.298302 +/- 1.146522 0.1594

I then transferred the h5 to EOS and ran QMCPACK rev7344 on EOS:
qmca -q ev -e 5 *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.834115 +/- 0.079253 404.658918 +/- 5.366533 0.1584
dmcJ0_real_twist0 series 0 -2555.016185 +/- 0.084016 400.174285 +/- 3.251338 0.1566

@jtkrogel Please
1. Transfer the h5 file to Mira and I will run QMCPACK.
2. Copy my h5 (/gpfs/mira-fs1/projects/QMCSim/yeluo/spline_bug/Jaron/rerun/pwscf_output/pwscf.pwscf.h5) to your machine and run VMC with your QMCPACK build.
That way we can investigate the issue further.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

I was able to track down the original orbital file and I am now transferring it to Mira (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5). I reran with this file and the large variance behavior is present; Ye's H5 file resulted in a normal variance with rev7044 on EOS.

The file itself is unusually large, which may relate to the problems seen (it is ~106 GB (!!) compared to 3.5 GB from Ye). This is directly due to the presence of psi_r data in the file. There may be a bug in QMCPACK's handling of the file.

It is unclear (a) why the file is so large, (b) why QMCPACK failed w/ the real code at this and one other volume, but was (apparently) fine at other volumes w/ real or complex code even though large H5 files were produced at each volume.

I will rebuild QE on EOS to see if normal size files and normal behavior result. I will also check the "spline bug 1" orbital file for size irregularities.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

The "spline bug 1" files do not present size irregularities as compared w/ Ye's file. They were also generated with QE 5.1, but on another machine (OIC5).

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

It is definitely important to exchange the exact file that is known to cause problems on at least one machine.

Perhaps the FFTs are incorrect when the plane wave orbitals are transformed to the real space mesh inside QMCPACK, or some of the surrounding code is bad? This could be FFT and machine ( BG vs Intel ) dependent.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

@jtkrogel Is the transfer still going? I don't see anything but a single txt file.

Try this on your sick file.

h5ls -r pwscf.pwscf.h5/electrons > size.out
grep psi_g size.out | awk '{print $3,$4}' | uniq
My file yields {56799, 2}
I was wondering if one k point associated with the gamma point WF was corrupted or at least with a wrong size.
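The same check can be scripted. The sketch below is a Python stand-in for the grep/awk/uniq pipeline above: it reads the text that `h5ls -r` wrote to size.out and collects the distinct psi_g dataset shapes. It assumes each dataset line looks like `<path> Dataset {56799, 2}` (inferred from the awk fields used above), and `psi_g_shapes` is a hypothetical helper name.

```python
import re

# Matches lines such as "/electrons/.../psi_g  Dataset {56799, 2}".
# The line layout is an assumption based on the awk fields used above.
_PSI_G_LINE = re.compile(r"^(\S*psi_g)\s+Dataset\s+\{([^}]*)\}")


def psi_g_shapes(h5ls_text):
    """Return the set of distinct psi_g dataset shapes in `h5ls -r` output."""
    shapes = set()
    for line in h5ls_text.splitlines():
        match = _PSI_G_LINE.match(line.strip())
        if match:
            # Extendible datasets may print "current/max"; keep the current size.
            dims = tuple(int(d.split("/")[0]) for d in match.group(2).split(","))
            shapes.add(dims)
    return shapes


if __name__ == "__main__":
    with open("size.out") as handle:
        print(psi_g_shapes(handle.read()))
```

A single shape in the returned set means every k-point and band has the same number of plane-wave coefficients; more than one shape would point at the odd dataset.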

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

@ye-luo The file transfer is still going; the estimated remaining time is 1 hour (transfer 60% complete at 10MB/s).

The 106GB file yields the same:
eos>h5ls -r pwscf.pwscf.h5/electrons | grep psi_g | awk '{print $3,$4}' | uniq
{56799, 2}

I found that the file contains psi_r data on a 120x120x120 grid. This accounts for the size difference between your file (3.5GB) and mine (106GB):
120.**3/56799*3.5 = 106.5

Am I correct that QMCPACK is reading this psi_r data? If so, this clearly has a bearing on the source and/or location of the bug.
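The size estimate above is a single ratio. Spelled out in Python, under the assumption that each file is dominated by one kind of per-orbital dataset (psi_g in the 3.5 GB file, psi_r in the large one) and that both store the same number of values per entry:

```python
# Numbers quoted in this thread.
grid_points = 120 ** 3   # psi_r mesh points per orbital
g_vectors = 56799        # psi_g coefficients per orbital
small_file_gb = 3.5      # size of the file without psi_r

# If psi_r dominates the large file and psi_g dominates the small one,
# the file sizes scale like the per-orbital dataset sizes.
size_ratio = grid_points / g_vectors
estimated_large_file_gb = size_ratio * small_file_gb

print(f"ratio = {size_ratio:.2f}, estimate = {estimated_large_file_gb:.1f} GB")
```

This prints `ratio = 30.42, estimate = 106.5 GB`, in line with the observed ~106 GB.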

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

@ye-luo The file transfer is now complete to Mira (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5).

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

Since you also got {56799, 2}, we have the same number of g-vectors in reciprocal space in the DFT.
Our files agree on the psi_g size per k-point and band.
Your estimate of the file size is correct; psi_r dominates.

I remember now: when I did the conversion, I commented out "write_psir = .true." to keep the real-space WF out of the h5.
This makes the huge difference in file size.
QMCPACK should not use psi_r. It loads psi_g and does the FFT internally.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

There is one psi_r for each psi_g in the file. Since psi_g dominates the size of the small file and psi_r dominates the large one, the ratio of psi_r size (120**3) to psi_g size (56799) also gives the approximate file size ratio.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

@jtkrogel QMCPACK may have used psi_r in the past, but the current code does not care about psi_r.

I tried the h5 from Jaron and the issue was reproduced.
rerun-h5-Jaron/dmcJ0_comp_twist0 series 0 -2554.833358 +/- 0.015526 404.594802 +/- 1.375452 0.1584
rerun-h5-Jaron/dmcJ0_real_twist0 series 0 -2178.009295 +/- 1.189484 1375076.599433 +/- 21024.229610 631.3456

I reran the DFT with the exact input files (using collect and write_psir=.true.) and got an h5 of size 83GB. The size can differ from machine to machine due to different libhdf5.
rerun-dft-collect/dmcJ0_comp_twist0 series 0 -2554.798981 +/- 0.041567 399.644731 +/- 0.489775 0.1564
rerun-dft-collect/dmcJ0_real_twist0 series 0 -2554.839310 +/- 0.024979 406.804079 +/- 2.831341 0.1592

I also tried a conversion with write_psir=.false.
rerun-dft-collect-nopsi_r/dmcJ0_comp_twist0 series 0 -2554.860441 +/- 0.017068 406.506065 +/- 2.525934 0.1591
rerun-dft-collect-nopsi_r/dmcJ0_real_twist0 series 0 -2554.848793 +/- 0.023298 405.541774 +/- 1.655292 0.1587

Another case, neither collect nor write_psir,
rerun-dft-no-collect/dmcJ0_comp_twist0 series 0 -2554.824338 +/- 0.013409 405.586031 +/- 1.107173 0.1588
rerun-dft-no-collect/dmcJ0_real_twist0 series 0 -2554.844554 +/- 0.015553 404.325833 +/- 1.217949 0.1583

So it seems that a corrupt h5 causes the crazy behaviour.
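To make comparisons like the one above less error-prone, the qmca lines can be parsed and screened automatically. A minimal Python sketch, assuming the column layout shown in this thread (name, series, energy, variance, and a final column equal to variance/|energy|); the helper names and the ratio cutoff of 1.0 are illustrative choices, not part of qmca:

```python
import re

# One qmca line as printed in this thread:
#   <name> series <n> <E> +/- <dE> <V> +/- <dV> <ratio>
_QMCA_LINE = re.compile(
    r"^(?P<name>\S+)\s+series\s+(?P<series>\d+)\s+"
    r"(?P<energy>\S+)\s+\+/-\s+(?P<energy_err>\S+)\s+"
    r"(?P<variance>\S+)\s+\+/-\s+(?P<variance_err>\S+)\s+"
    r"(?P<ratio>\S+)\s*$"
)

_FLOAT_KEYS = ("energy", "energy_err", "variance", "variance_err", "ratio")


def parse_qmca(text):
    """Parse qmca-style result lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        match = _QMCA_LINE.match(line.strip())
        if match:
            row = match.groupdict()
            for key in _FLOAT_KEYS:
                row[key] = float(row[key])
            row["series"] = int(row["series"])
            rows.append(row)
    return rows


def suspicious_runs(rows, max_ratio=1.0):
    """Names of runs whose variance/|energy| ratio exceeds max_ratio.

    max_ratio is a hypothetical cutoff chosen for this example.
    """
    return [row["name"] for row in rows if row["ratio"] > max_ratio]
```

Applied to the eight result lines above, only rerun-h5-Jaron/dmcJ0_real_twist0 (ratio 631.3456) exceeds the cutoff; every other run sits near 0.16.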

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

As far as I can tell, the H5 file is valid; it just contains what we currently consider to be irrelevant information. It seems clear that QMCPACK is mishandling the file (same file but complex works and real doesn't).

The bigger question I have is whether this mishandling is generic to large files, i.e. will we run into this routinely in the future in, say, 256-512 atom defect cells of NiO? I think we should still track down and patch the source of this mishandling.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

I think we need to narrow down the difference further (or at least I need to understand it better).

e.g. Is the file being read incorrectly or somehow processed incorrectly internally after reading?

Could we be near an integer limit somewhere, hence the apparent "large file" dependency?

Have we reproduced this problem on enough different systems that we can claim it is not (say) a problem with a particular HDF5 version and installation?

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

@jtkrogel Are the kinetic energies of the orbitals in the two files identical?

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

We've also regenerated the orbital file w/ psir=.false. on EOS using QE 5.1. The file size is 3.4GB as expected.

The large change in energy and variance is still present for this file (will refer to this file as "vol0.98_eos_small"):
dmcJ0_real_twist0 series 0 -2177.251376 +/- 3.313361 1383981.935105 +/- 56682.731771 635.6556

Rerunning Ye's small file ("vol0.98_mira_small") on EOS does not present the problem, consistent w/ Ye's runs:
dmcJ0_real_twist0 series 0 -2554.890268 +/- 0.081604 399.433829 +/- 4.111847 0.1563

I've calculated the per orbital kinetic energy by directly summing the coefficients (and k^2) with a Python tool of mine. The largest KE difference across all orbitals in the two files above is 0.5 mHa. Typical KE's per orbital range from 0.5-8.0 Ha.

Conclusions: the problem can persist w/o psir data (i.e. in small files). The two small files that do/do not trigger the bug are nearly identical in orbital contents (as manifested by matching orbital KE), and so the problem is most likely limited to QMCPACK's usage of the files, not the files themselves (i.e. probably not the converter, unless something else in the files differs materially).
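For reference, the kind of per-orbital check described above can be sketched in a few lines of Python. This is not the tool used here. It assumes Hartree atomic units, coefficients stored as (real, imaginary) pairs like the psi_g datasets of shape {56799, 2}, and G-vectors and the twist k already converted to Cartesian components in inverse bohr. The function name is hypothetical.

```python
def orbital_kinetic_energy(coeffs, gvecs, kpoint=(0.0, 0.0, 0.0)):
    """Kinetic energy (Ha) of one plane-wave orbital.

    coeffs: sequence of (real, imag) pairs, one per G-vector
    gvecs:  sequence of (gx, gy, gz) in 1/bohr, same order as coeffs
    kpoint: twist vector (kx, ky, kz) in 1/bohr

    Uses KE = 0.5 * sum_G |c_G|^2 |G + k|^2 / sum_G |c_G|^2, so the
    result does not depend on how the orbital is normalized.
    """
    if len(coeffs) != len(gvecs):
        raise ValueError("coeffs and gvecs must have the same length")
    weighted = 0.0
    norm = 0.0
    for (c_re, c_im), g in zip(coeffs, gvecs):
        weight = c_re * c_re + c_im * c_im
        gk2 = sum((gi + ki) ** 2 for gi, ki in zip(g, kpoint))
        weighted += weight * gk2
        norm += weight
    if norm == 0.0:
        raise ValueError("orbital has zero norm")
    return 0.5 * weighted / norm
```

Comparing this number orbital by orbital between two h5 files tests the file contents independently of anything QMCPACK does after reading them.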

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

Does the current converter & QE 5.3 give the same file? Ye made a lot of changes. Do you need any help with an eos build?

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

So far I'm of the opinion that the evidence points away from a bug in the QE SCF+conversion. We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3. Tests outside of QMCPACK on files w/ and w/o apparent bug show the orbitals are the same. Bug only shows up going from complex to real. To me everything is pointing pretty clearly to a bug inside QMCPACK and not with the orbital files. Thoughts?

I'm happy to explore QE5.3 on EOS. It would be good to know for sure that the converter is behaving consistently across versions/builds. I think we should start digging into QMCPACK itself concurrent with this.

I also plan to rerun the KE checks across all of Ye's QE 5.3 files from spline bug 1 that showed huge sensitivity in the VMC variance when making small changes in the DFT convergence parameters. If the KE's are not sensitive, then we will know the problem is in QMCPACK and it will give us an easier way to track down the bug (force VMC variance to match across files) than doing full walker traces, etc.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

Q1. How do we explain Ye's "rerun" results https://app.assembla.com/spaces/qmcdev/tickets/49/details?comment=1113850913

Q2. Have we shown a difference between complex and real versions for a single electron? i.e. cut down one of the "bad" runs to have only 1 or 2 electrons?

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

No real need to run 5.3 on eos. 5.1 is plenty old though.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

@jtkrogel "We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3." Do we have one from QE5.3?

I will try to compare the real and complex QMCPACK builds in the following aspects: spline coefficients, phase, evaluation.
If possible, I will also run directly with the planewave WF.
This is not possible today because of the maintenance, but probably tomorrow.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

Planewave is super slow. I advise trying to cut down before running.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

Paul/Ye: you are correct, we do not yet have a QE5.3 run that produces a buggy result for this ticket (I misread/misremembered Ye's rerun result). We do know that odd results can happen using the 5.3 toolchain (in context of similarly huge variances obtained intermittently for spline bug 1 https://app.assembla.com/spaces/qmcdev/tickets/40/details?comment=1113785513).

Might this be the time to put orbital KE sums (PW sum from H5 and Riemann sum or analytic for splined orbs) into QMCPACK for orbital quality checks? A fringe benefit would be no more need to do meshfactor scans at the VMC level.

We are starting some low electron count runs here.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

  1. Let us get to the root cause of the spline bugs before adding general tests. Anything looking at orbital quality should be coordinated with the APW projection conversion which needs something similar.

  2. 5.3 runs needed then. Will remove one parameter from our comparison matrix.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

Agreed on 1 and 2. QE 5.3 runs will be useful. I'm suspicious of the intermittency issues shown in the other bug, so I may vary the convergence parameters and look for a similar pattern. If the resulting orbital KE's come out similar across the board then I think we can safely conclude that any variability seen at the VMC level resides solely in QMCPACK.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

The problematic "small" 3.4 GB file ("vol0.98_eos_small") is now available on Mira:
/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/vol0.98_eos_small/pwscf.pwscf.h5

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

Some brief results for a reduced number of electrons (occupying from 2 to 100 up and down orbitals, full occupation is nup=208, ndown=200).

Results with real code:


nud   LocalEnergy                 Variance                          ratio 
  2  -1711.586138 +/-  0.022133       25.163081 +/-      0.201123     0.0147 
  4  -1757.027013 +/-  0.035981       52.466286 +/-      0.386141     0.0299 
  5  -1764.564283 +/-  3.576613    31020.140442 +/-   6531.582336    17.5795 
 10  -1500.868653 +/- 19.722575  1505729.495207 +/- 111609.817630  1003.2387 
 25  -2021.780528 +/-  4.744664   255578.583315 +/-  21096.982477   126.4126 
 50  -2375.185522 +/-  0.573504    12489.309935 +/-   2157.800910     5.2582 
100  -2511.362514 +/-  0.064731      375.992336 +/-     10.289014     0.1497 

Combined w/ the complex results below, it looks as though problem orbitals might start around nup/down==5. Still, the variance pattern is strange, with nup/down=100 appearing almost normal. This suggests more intermittency to me rather than specific problem orbitals.

Results with complex code:


nud   LocalEnergy                 Variance                     ratio 
  2  -1711.568537 +/- 0.000000     26.088409 +/-    0.000000   0.0152 
  4  -1757.114780 +/- 0.024181     49.651033 +/-    0.349108   0.0283 
  5  -1779.701254 +/- 0.042572     63.252476 +/-    1.437347   0.0355 
 10  -1867.579168 +/- 4.446231   2696.267913 +/- 2406.244516   1.4437  <=== large variance
 25  -2095.414049 +/- 0.081082    203.158111 +/-    1.780770   0.0970 
 50  -2378.986474 +/- 0.290910    249.256439 +/-    3.030250   0.1048 
100  -2511.442451 +/- 0.115850    315.629761 +/-    1.803683   0.1257 

The results at nup/down==10 show the first signs that there may also be a problem with the complex code. We are currently rerunning the nup/down=10 complex case w/ variations in the DFT convergence parameters to see if this behavior remains.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

This hints at an MPI bug in our code, or memory usage problem (bad pointer usage, incorrect free/alloc etc.). One possible strategy would be to compute a checksum of the spline buffers on each MPI task.
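A minimal sketch of the checksum idea in Python, with the MPI part left out: each task would hash the raw bytes of its spline coefficient buffer, and the digests would then be gathered and compared. The per-rank buffers here are plain lists of floats standing in for the real spline tables, and both function names are hypothetical. A bitwise comparison like this also assumes that every task builds its table with exactly the same sequence of operations.

```python
import hashlib
from array import array


def buffer_checksum(values):
    """SHA-256 hex digest of a buffer of doubles, based on its raw bytes."""
    return hashlib.sha256(array("d", values).tobytes()).hexdigest()


def mismatched_ranks(buffers_by_rank):
    """Ranks whose checksum differs from the most common one.

    buffers_by_rank: dict mapping rank -> sequence of floats. In a real run
    each rank would compute its own digest and only the digests would be
    gathered; here everything lives in one process for illustration.
    """
    digests = {rank: buffer_checksum(buf) for rank, buf in buffers_by_rank.items()}
    counts = {}
    for digest in digests.values():
        counts[digest] = counts.get(digest, 0) + 1
    reference = max(counts, key=counts.get)
    return sorted(rank for rank, digest in digests.items() if digest != reference)
```

If every task reports the same digest and the energies are still wrong, per-task memory corruption becomes a less likely explanation than a problem in how the buffer is built or used.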

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

We have rerun this in serial, and the problem persists. Also, running with traces on confirms that every walker on every node has a large kinetic energy.

The large variance seen above with complex for nud==10 is due to the runs being too short (equilibration issues w/ partial occupation, similar to isolated molecules in a box).

With equilibration properly accounted for, the real code demonstrates no issue w/ occupation up to nud=100. We are currently performing a bisection search on nud between 100 and 200 to find at least one orbital that has a large kinetic energy.
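The bisection over the occupation can be sketched as follows in Python. It assumes the failure is monotonic in nud, i.e. once a bad orbital is occupied every larger occupation also shows the large kinetic energy, and that the lower bound is known good while the upper bound is known bad. `is_bad` is a hypothetical callback that would run VMC at the given occupation and report whether the result is anomalous.

```python
def first_bad_occupation(is_bad, known_good, known_bad):
    """Smallest nud in (known_good, known_bad] for which is_bad(nud) is True.

    Requires is_bad(known_good) to be False and is_bad(known_bad) to be True,
    and assumes is_bad never flips back to False as nud grows.
    """
    low, high = known_good, known_bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(mid):
            high = mid  # a bad orbital is already included at mid
        else:
            low = mid   # everything up to mid is still fine
    return high
```

For the range 100 to 200 this needs at most seven VMC runs, and the returned occupation is the first one that includes a problematic orbital.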

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

The bug has been fixed by improving the orbital phase rotation algorithm. The QMCPACK real code is no longer picky about the h5.
Image: Orbital_scan.png

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

@jtkrogel Could you confirm that the new fix from last Friday solves the bug? I would like to close the ticket asap.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

@ye-luo I can confirm that spline bug 2 is now resolved; for our test case, we get results identical to the complex code. I am running long trace runs for spline bug 1 to see whether it is also resolved.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

Fantastic. In principle, I would like to urge everyone using the real code + spline to adopt this fix. It is critical.

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: jtkrogel

I absolutely agree. Has anyone run long versions of the ctest runs to see if there are changes vs the reference values?

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: ye-luo

In principle, the energy should not change because any rotation is valid.
But the variance may decrease a bit if the old rotation scheme did not handle the h5 well.
For the tests, I noticed the diamond files were generated with an old pwscf. When I scan the orbitals, there is some strange behaviour; if I rerun the DFT, it becomes normal. However, no change in energy / flux estimator changes.
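To illustrate why any global phase is valid while some choices are numerically better than others, here is a generic Python sketch. It is not the algorithm implemented in QMCPACK, whose details are not given in this ticket. It assumes an orbital sampled at a few points as complex numbers and picks the global phase that puts as much weight as possible into the real part; both function names are hypothetical.

```python
import cmath
import math


def best_real_rotation(values):
    """Phase theta maximizing sum(Re(exp(i*theta) * v)**2) over the samples.

    Writing v = a + i*b, the real part after rotation is
    a*cos(theta) - b*sin(theta); maximizing its squared sum gives
    theta = -0.5 * atan2(2*S_ab, S_aa - S_bb).
    """
    s_aa = sum(v.real * v.real for v in values)
    s_bb = sum(v.imag * v.imag for v in values)
    s_ab = sum(v.real * v.imag for v in values)
    return -0.5 * math.atan2(2.0 * s_ab, s_aa - s_bb)


def rotate(values, theta):
    """Apply the global phase exp(i*theta) to every sample."""
    phase = cmath.exp(1j * theta)
    return [phase * v for v in values]
```

The rotation leaves |psi|^2 unchanged at every point, which is consistent with the energy not depending on it; what changes is how much of the orbital survives when only the real part is kept.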

qmc-robot avatar qmc-robot commented on July 30, 2024

Comment by: prckent

Long tests: no. These need to be run. We have not been running them on oxygen recently due to clashes with the nightlies (something is taking too long, needs to be investigated).
