
Comments (6)

jderouillat commented on May 28, 2024

Dear Savio,
This behavior is surprising.
In my opinion, the first thing to do is to check the integrity of the checkpoint.
If you haven't done so already, can you confirm the result of the following command:

$ h5dump  -a /patch-000000/species   ../ClusterSim_2/checkpoints/dump-00000-0000000000.h5

According to your error, it should return:

... {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 0
   }
}
}

If that is the case, don't you have an error file from the third simulation?
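The same check can be scripted when several checkpoint files need inspecting. The following is a minimal sketch, assuming `h5dump` is on the `PATH` and prints the scalar attribute in the `(0): <value>` form shown above; the function names `parse_scalar_attribute` and `read_species_attribute` are hypothetical helpers, not part of Smilei or HDF5.

```python
import re
import subprocess


def parse_scalar_attribute(h5dump_output: str) -> int:
    """Return the integer that follows "(0):" in h5dump's text output."""
    match = re.search(r"\(0\):\s*(-?\d+)", h5dump_output)
    if match is None:
        raise ValueError("no scalar value found in h5dump output")
    return int(match.group(1))


def read_species_attribute(checkpoint_path: str, patch: str = "patch-000000") -> int:
    """Run h5dump on one checkpoint file and return the patch's "species" attribute.

    Assumes the attribute lives at /<patch>/species, as in the command above.
    Raises CalledProcessError if h5dump exits with a non-zero status.
    """
    result = subprocess.run(
        ["h5dump", "-a", f"/{patch}/species", checkpoint_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_scalar_attribute(result.stdout)
```

Calling `read_species_attribute("../ClusterSim_2/checkpoints/dump-00000-0000000000.h5")` would then return the stored number of species, which can be compared with the number declared in the namelist.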

Regards.

Julien

from smilei.

iclaserplasma commented on May 28, 2024

image

It seems the checkpoint file might be corrupted. I've attached the log file from the final run to show the error it reports.
log5.txt

If ClusterSim2 does need to be rerun, can you recommend how to avoid this error?
Thanks
Savio


iltommi commented on May 28, 2024

It looks like the h5dump command is not properly installed, so we still don't know whether the file is corrupted or why.

Since the simulation wrote several checkpoints, a wise thing to try is to keep more than one checkpoint on disk. You can achieve this with keep_n_dumps: https://smileipic.github.io/Smilei/namelist.html#keep_n_dumps

Set it to 2, and even if the latest checkpoint is corrupted you will still have the previous one.
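As a rough illustration, the suggestion above might look like this in the namelist. This is a fragment only: `Checkpoints` is provided by Smilei when it reads the namelist, and the `dump_step` value is an illustrative assumption, not taken from this thread.

```python
# Namelist fragment (not runnable on its own): Smilei supplies Checkpoints.
Checkpoints(
    dump_step = 10000,   # assumed value: write a checkpoint every 10000 iterations
    keep_n_dumps = 2,    # keep the two most recent checkpoints on disk
)
```

Keeping two checkpoints roughly doubles the disk space the checkpoints occupy.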


srozario121 commented on May 28, 2024

Ah, I forgot to load some of the MPI modules before. Here are the results from h5dump and h5stat on the checkpoint file:
[svr11@cx2-login checkpoints]$ h5dump -a /patch-000000/species dump-00000-0000000000.h5
HDF5 "dump-00000-0000000000.h5" {
ATTRIBUTE "species" {
DATATYPE H5T_STD_U32LE
DATASPACE SCALAR
DATA {
(0): 3
}
}
}

[svr11@cx2-login checkpoints]$ h5stat dump-00000-00000000*.h5
Filename: dump-00000-0000000000.h5
File information
# of unique groups: 48501
# of unique datasets: 336547
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 12126
File space information for file metadata (in bytes):
Superblock: 96
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 9013816/0
Datasets(exclude compact data): 91540784/43684480
Datatypes: 0/0
Groups:
B-tree/List: 51970648
Heap: 9471152
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Free-space managers:
Header: 0
Amount of free space: 0
Small groups (with 0 to 9 links):
# of groups with 0 link(s): 11106
# of groups with 9 link(s): 25269
Total # of small groups: 36375
Group bins:
# of groups with 0 link: 11106
# of groups with 1 - 9 links: 25269
# of groups with 10 - 99 links: 12125
# of groups with 10000 - 99999 links: 1
Total # of groups: 48501
Dataset dimension information:
Max. rank of datasets: 1
Dataset ranks:
# of dataset with rank 1: 336547
1-D Dataset information:
Max. dimension size of 1-D datasets: 57353
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 4: 7
# of datasets with dimension sizes 7: 14
# of datasets with dimension sizes 9: 21
Total # of small datasets: 42
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 42
# of datasets with dimension size 10 - 99: 52716
# of datasets with dimension size 100 - 999: 146748
# of datasets with dimension size 1000 - 9999: 131273
# of datasets with dimension size 10000 - 99999: 5768
Total # of datasets: 336547
Dataset storage information:
Total raw data size: 2459043134
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 336547
Dataset layout counts[CHUNKED]: 0
Dataset layout counts[VIRTUAL]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 336547
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 3
Dataset datatype #0:
Count (total/named) = (260739/0)
Size (desc./elmt) = (22/8)
Dataset datatype #1:
Count (total/named) = (25269/0)
Size (desc./elmt) = (14/2)
Dataset datatype #2:
Count (total/named) = (50539/0)
Size (desc./elmt) = (14/4)
Total dataset datatype count: 336547
Small # of attributes (objects with 1 to 10 attributes):
# of objects with 1 attributes: 12125
# of objects with 2 attributes: 36375
Total # of objects with small # of attributes: 48500
Attribute bins:
# of objects with 1 - 9 attributes: 48500
# of objects with 10 - 99 attributes: 1
Total # of objects with attributes: 48501
Max. # of attributes to objects: 12
Free-space persist: FALSE
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
Total # of small size sections: 0
Free-space section bins:
Total # of sections: 0
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 161996496 bytes
Raw data: 2459043134 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 1747760 bytes
Total space: 2622787390 bytes

I'll try with two dump files. Maybe one will work.


mccoys commented on May 28, 2024

One note about this issue. If you terminate your job too soon after a checkpoint starts, the writing of data into the checkpoint files may be interrupted, producing corrupt files. To avoid this, you should allow the simulation at least 5 minutes to complete the checkpoint. In some cases, 5 minutes is not sufficient.
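One way to apply this margin is to compute the latest checkpoint time from the job's wall-time limit. The sketch below is only an arithmetic illustration: the helper name `latest_dump_minutes` is hypothetical, and it assumes the result is used for a time-based checkpoint setting such as `dump_minutes`, with the clock starting when the simulation starts rather than when the job is submitted.

```python
def latest_dump_minutes(walltime_minutes: float, margin_minutes: float = 5.0) -> float:
    """Latest elapsed time (in minutes) at which a checkpoint may start.

    margin_minutes is the time reserved for writing the checkpoint files;
    5 minutes is the minimum suggested above and may need to be larger.
    """
    if margin_minutes <= 0:
        raise ValueError("margin_minutes must be positive")
    if margin_minutes >= walltime_minutes:
        raise ValueError("margin leaves no time to run the simulation")
    return walltime_minutes - margin_minutes
```

For a 24-hour job, `latest_dump_minutes(24 * 60, margin_minutes=15)` leaves a quarter of an hour for the files to be written.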


iclaserplasma commented on May 28, 2024

The issue was in fact that I had run out of space and the checkpoint file couldn't complete its save! I think it is running fine now.
Thanks!
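For anyone hitting the same problem, a quick check before launching can catch this. The following is a minimal sketch, assuming each MPI process writes one checkpoint file of roughly the size h5stat reported above (2622787390 bytes) and that free space on the filesystem is the limiting factor; the function names and the rank count in the usage note are hypothetical. Note that `shutil.disk_usage` does not account for per-user quotas, which are common on clusters.

```python
import shutil


def checkpoint_space_needed(bytes_per_file: int, n_ranks: int, keep_n_dumps: int = 1) -> int:
    """Estimate the bytes needed for checkpoints: one file per rank per kept dump."""
    return bytes_per_file * n_ranks * keep_n_dumps


def has_room_for_checkpoints(directory: str, bytes_per_file: int,
                             n_ranks: int, keep_n_dumps: int = 1) -> bool:
    """Return True if the filesystem holding `directory` has enough free space."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= checkpoint_space_needed(bytes_per_file, n_ranks, keep_n_dumps)
```

For example, `has_room_for_checkpoints("checkpoints", 2622787390, n_ranks=4, keep_n_dumps=2)` asks whether about 21 GB is free in the checkpoint directory for a hypothetical 4-rank run.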

