Describe the bug Cached band mask seems to be corrupted <p di

The bug persists I'm afraid... To reproduce <ol

Thanks for updating the report <a class="user-mention notranslate" data-hovercard-type

Corrupted cached band mask about flarestack HOT 5 CLOSED

icecube commented on August 25, 2024

Corrupted cached band mask

from flarestack.

Comments (5)

mlincett commented on August 25, 2024 1

Yes, running 1 trial locally just to get the band masks written is the way to do it. I have a wrapper script that does just that, which we tested with @JannisNe yesterday and it works. I'm streamlining it today and one thought was to have it in the submitter module as a prior step to running trials on cluster, so as in the submit method
if self.use_cluster:
            if self.mh_dict["mh_name"] == "large_catalogue": 
                <run script locally module>
                self.submit_cluster(mh_dict)
            else: self.submit_cluster(mh_dict)
Another thought is to change the submits to dagmans, where first run the script that only writes the band masks and then run the trials in how many jobs specified, and run everything on the cluster. This will require changing a bit the HTCondorSubmitter, and it will be a solid fix, but honestly it won't make it faster than running the first step locally. Any ideas/suggestions how to move forward with this?

I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the Submitter class taking care of this. If we need/want to use a dagman we can change the implementation at a later stage.

As soon as you have a working implementation feel free to submit a PR :)

from flarestack.

robertdstein commented on August 25, 2024

This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when again locally after the failure on the cluster.

The way to do that would be to do a try/except BadZipfile statement, to catch this exception when loading the band mask, and instead re-make the band mask from scratch.

Specifically, this could be done for:

File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])

That could be replaced with:

try:
    self.load_band_mask(mask_index[0])
except BadZipFile:
    self.make_injection_band_mask()
    self.load_band_mask(mask_index[0])

Stopping the error on the cluster would be an additional and better fix for this specific problem, but this interim thing should be easy.

from flarestack.

sathanas31 commented on August 25, 2024

The bug persists I'm afraid...

To reproduce

Have a catalog with 10 sources and reduce the chunk size to 10, so just 1 chunk per cat
10 jobs running on DESY cluster
Band masks are created on the cache dir but cannot be loaded

Traceback error

File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 513, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 495, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/scipy/sparse/_matrix_io.py", line 144, in load_npz
    return cls((loaded['data'], loaded['indices'], loaded['indptr']), shape=loaded['shape'])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    magic = bytes.read(len(format.MAGIC_PREFIX))
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 923, in read
    data = self._read1(n)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 999, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 162, in <module>
    run_multiprocess(n_cpu=cfg.n_cpu, mh_dict=mh_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 141, in run_multiprocess
    with MultiProcessor(n_cpu=n_cpu, mh_dict=mh_dict) as r:
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 57, in __init__
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 310, in get_injector
    self._injectors[season_name] = self.add_injector(
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 1267, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/data/__init__.py", line 322, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 203, in create
    return BaseInjector.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 427, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 240, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 457, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 295, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.0)["ow"])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 288, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 261, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 517, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 493, in load_band_mask
    del self.band_mask_cache
AttributeError: band_mask_cache

Expected behaviour
After masks are created by make_injection_band_mask() they can be loaded by load_band_mask() while jobs are running.

Additional info

It goes into the except clause you added, the band masks are created, and I checked if they can be loaded externally which they can. Not surprisingly, if I run the analysis again for the same catalog the masks are loaded fine.

It may be that while the jobs are running, masks are written by one job while others try to read it simultaneously, hence the error

from flarestack.

mlincett commented on August 25, 2024

Thanks for updating the report @sathanas31 . A couple of questions since I am not too familiar with this part of the code:

Is it a possibility to run a minimal number of trials locally before launching the jobs, and would that prevent any further issue when running on the cluster? If so, I think this would be the best workaround for the time being.

I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).

from flarestack.

sathanas31 commented on August 25, 2024

Yes, running 1 trial locally just to get the band masks written is the way to do it.
I have a wrapper script that does just that, which we tested with @JannisNe yesterday and it works. I'm streamlining it today and one thought was to have it in the submitter module as a prior step to running trials on cluster, so as in the submit method

if self.use_cluster:
            if self.mh_dict["mh_name"] == "large_catalogue": 
                <run script locally module>
                self.submit_cluster(mh_dict)
            else: self.submit_cluster(mh_dict)

Another thought is to change the submits to dagmans, where first run the script that only writes the band masks and then run the trials in how many jobs specified, and run everything on the cluster. This will require changing a bit the HTCondorSubmitter, and it will be a solid fix, but honestly it won't make it faster than running the first step locally.
Any ideas/suggestions how to move forward with this?

from flarestack.

Corrupted cached band mask about flarestack HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent