Hello, I'm trying to start a simulation from the output of a previous simulation.


Restart issue about smilei HOT 6 CLOSED

smileipic commented on May 28, 2024

Restart issue

from smilei.

Comments (6)

jderouillat commented on May 28, 2024

Dear Savio,
This behavior is surprising.
In my opinion the 1st thing to do is to check the integrity of the checkpoint.
If you haven't done so already, can you confirm the result of the following command:

$ h5dump  -a /patch-000000/species   ../ClusterSim_2/checkpoints/dump-00000-0000000000.h5

According to your error, it should return:

... {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 0
   }
}
}

If that is the case, do you have an error file from the third simulation?
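If you want to script this check over many dump files, here is a minimal Python sketch that reads the value out of the text printed by `h5dump -a`. The function name `parse_scalar_attribute` is made up for this example, and the sample string just mirrors the output shape shown above; it assumes the attribute is a scalar integer printed as `(0): <value>`.

```python
import re

def parse_scalar_attribute(h5dump_text, name):
    """Return the integer value of a scalar attribute in `h5dump -a` output, or None."""
    # Find the ATTRIBUTE "<name>" block, then the first "(0): <int>" entry after it.
    pattern = r'ATTRIBUTE\s+"' + re.escape(name) + r'"\s*\{.*?\(0\):\s*(-?\d+)'
    match = re.search(pattern, h5dump_text, flags=re.DOTALL)
    return int(match.group(1)) if match else None

# Sample text copied from the expected output above (the "..." is left as-is).
sample = '''... {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 0
   }
}
}'''

n_species = parse_scalar_attribute(sample, "species")
if n_species is None:
    print("could not read the attribute (h5dump failed or the file is unreadable)")
else:
    print("species stored in checkpoint:", n_species)
```

A result of `None` only means the text could not be parsed, for example because `h5dump` was not found; it does not by itself prove the file is corrupted.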

Regards.

Julien


iclaserplasma commented on May 28, 2024

It seems the checkpoint file might be corrupted. I've attached the log file from the final run to show the error it reports.
log5.txt

If ClusterSim2 does need to be rerun, can you recommend how I can avoid this error?
Thanks
Savio


iltommi commented on May 28, 2024

It looks like the h5dump command is not properly installed, so we still don't know whether the file is corrupted or, if so, why.

Since the simulation wrote several checkpoints, a wise thing to try is to keep more than one checkpoint on disk. You can achieve this with keep_n_dumps: https://smileipic.github.io/Smilei/namelist.html#keep_n_dumps

Set it to 2 and, even if the latest checkpoint is corrupted, you will still have the previous one.
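As a rough sketch of what that could look like in the namelist (which is a Python file): `keep_n_dumps` and `dump_step` are parameters of the `Checkpoints` block, but the `dump_step` value here is a placeholder I made up, and the `checkpoint_settings` dict plus the guard are only there so the snippet also runs as plain Python outside Smilei.

```python
# Checkpoint settings for a Smilei namelist (values are illustrative assumptions).
checkpoint_settings = dict(
    dump_step=10000,   # hypothetical: write a checkpoint every 10000 iterations
    keep_n_dumps=2,    # keep the two most recent checkpoints on disk
)

# Inside a Smilei namelist, Checkpoints is predefined. The guard only lets this
# snippet run as ordinary Python when Smilei is not present.
if "Checkpoints" in globals():
    Checkpoints(**checkpoint_settings)
```

In a real namelist you would normally just write `Checkpoints(dump_step=..., keep_n_dumps=2)` directly. Keep in mind that two retained checkpoints take about twice the disk space of one.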


srozario121 commented on May 28, 2024

Ah, I forgot to load some of the MPI modules before. Here are the results from h5dump:
[svr11@cx2-login checkpoints]$ h5dump -a /patch-000000/species dump-00000-0000000000.h5
HDF5 "dump-00000-0000000000.h5" {
ATTRIBUTE "species" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 3
   }
}
}

[svr11@cx2-login checkpoints]$ h5stat dump-00000-00000000*.h5
Filename: dump-00000-0000000000.h5
File information
# of unique groups: 48501
# of unique datasets: 336547
# of unique named datatypes: 0
# of unique links: 0
# of unique other: 0
Max. # of links to object: 1
Max. # of objects in group: 12126
File space information for file metadata (in bytes):
Superblock: 96
Superblock extension: 0
User block: 0
Object headers: (total/unused)
Groups: 9013816/0
Datasets(exclude compact data): 91540784/43684480
Datatypes: 0/0
Groups:
B-tree/List: 51970648
Heap: 9471152
Attributes:
B-tree/List: 0
Heap: 0
Chunked datasets:
Index: 0
Datasets:
Heap: 0
Shared Messages:
Header: 0
B-tree/List: 0
Heap: 0
Free-space managers:
Header: 0
Amount of free space: 0
Small groups (with 0 to 9 links):
# of groups with 0 link(s): 11106
# of groups with 9 link(s): 25269
Total # of small groups: 36375
Group bins:
# of groups with 0 link: 11106
# of groups with 1 - 9 links: 25269
# of groups with 10 - 99 links: 12125
# of groups with 10000 - 99999 links: 1
Total # of groups: 48501
Dataset dimension information:
Max. rank of datasets: 1
Dataset ranks:
# of dataset with rank 1: 336547
1-D Dataset information:
Max. dimension size of 1-D datasets: 57353
Small 1-D datasets (with dimension sizes 0 to 9):
# of datasets with dimension sizes 4: 7
# of datasets with dimension sizes 7: 14
# of datasets with dimension sizes 9: 21
Total # of small datasets: 42
1-D Dataset dimension bins:
# of datasets with dimension size 1 - 9: 42
# of datasets with dimension size 10 - 99: 52716
# of datasets with dimension size 100 - 999: 146748
# of datasets with dimension size 1000 - 9999: 131273
# of datasets with dimension size 10000 - 99999: 5768
Total # of datasets: 336547
Dataset storage information:
Total raw data size: 2459043134
Total external raw data size: 0
Dataset layout information:
Dataset layout counts[COMPACT]: 0
Dataset layout counts[CONTIG]: 336547
Dataset layout counts[CHUNKED]: 0
Dataset layout counts[VIRTUAL]: 0
Number of external files : 0
Dataset filters information:
Number of datasets with:
NO filter: 336547
GZIP filter: 0
SHUFFLE filter: 0
FLETCHER32 filter: 0
SZIP filter: 0
NBIT filter: 0
SCALEOFFSET filter: 0
USER-DEFINED filter: 0
Dataset datatype information:
# of unique datatypes used by datasets: 3
Dataset datatype #0:
Count (total/named) = (260739/0)
Size (desc./elmt) = (22/8)
Dataset datatype #1:
Count (total/named) = (25269/0)
Size (desc./elmt) = (14/2)
Dataset datatype #2:
Count (total/named) = (50539/0)
Size (desc./elmt) = (14/4)
Total dataset datatype count: 336547
Small # of attributes (objects with 1 to 10 attributes):
# of objects with 1 attributes: 12125
# of objects with 2 attributes: 36375
Total # of objects with small # of attributes: 48500
Attribute bins:
# of objects with 1 - 9 attributes: 48500
# of objects with 10 - 99 attributes: 1
Total # of objects with attributes: 48501
Max. # of attributes to objects: 12
Free-space persist: FALSE
Free-space section threshold: 1 bytes
Small size free-space sections (< 10 bytes):
Total # of small size sections: 0
Free-space section bins:
Total # of sections: 0
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
File metadata: 161996496 bytes
Raw data: 2459043134 bytes
Amount/Percent of tracked free space: 0 bytes/0.0%
Unaccounted space: 1747760 bytes
Total space: 2622787390 bytes

I'll try with two dump files. Maybe one will work.


mccoys commented on May 28, 2024

One note about this issue: if you terminate your job too soon after a checkpoint starts, the writing of data to the checkpoint files may be interrupted, leaving corrupt files. To avoid this, give the simulation at least 5 minutes to complete the checkpoint. In some cases, even 5 minutes is not sufficient.


iclaserplasma commented on May 28, 2024

The issue was in fact that I had run out of space, so the checkpoint file couldn't finish saving! I think it is running fine now.
Thanks!
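
For anyone hitting the same thing, a quick pre-flight check of free space might look like the sketch below. The helper names are made up for this example, the count of 48 dump files is a placeholder (assuming one dump file per MPI process), and the 20% safety margin is arbitrary. The per-file size is the "Total space" that h5stat reported above.

```python
import shutil

def checkpoint_space_needed(bytes_per_file, n_files, keep_n_dumps=1, safety=1.2):
    """Estimate the bytes needed to hold `keep_n_dumps` full checkpoints."""
    return int(bytes_per_file * n_files * keep_n_dumps * safety)

def has_room(directory, needed_bytes):
    """Return True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= needed_bytes

# 2622787390 bytes is the h5stat "Total space" of one dump file in this thread.
# n_files=48 is a hypothetical number of dump files per checkpoint.
needed = checkpoint_space_needed(2622787390, n_files=48, keep_n_dumps=2)
print("bytes needed:", needed)
print("enough free space here:", has_room(".", needed))
```

Note that `shutil.disk_usage` reports free space on the filesystem, not your user quota, so on a cluster with quotas you would still need to check the quota separately.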

