Comments (5)
I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the `Submitter` class taking care of this. If we need/want to use a dagman we can change the implementation at a later stage.
As soon as you have a working implementation feel free to submit a PR :)
from flarestack.
This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when run again locally after the failure on the cluster.
The way to do that would be to wrap the load in a try/except `BadZipFile` statement, catching this exception when loading the band mask and re-making the band mask from scratch instead.
Specifically, this could be done for:
File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])
That could be replaced with:
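    # requires `from zipfile import BadZipFile` at the top of injector.py
    try:
        self.load_band_mask(mask_index[0])
    except BadZipFile:
        # The cached mask is unreadable: rebuild it from scratch, then reload.
        self.make_injection_band_mask()
        self.load_band_mask(mask_index[0])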
Stopping the error on the cluster would be an additional and better fix for this specific problem, but this interim workaround should be easy to implement.
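For illustration, here is a minimal, self-contained sketch of the same load-then-rebuild pattern outside flarestack. Everything in it is an assumption made for the example: `write_mask`, `read_mask` and `load_or_rebuild` are hypothetical helpers, and the mask is stored as JSON inside a zip archive rather than as a sparse `.npz` file. It also catches `zlib.error`, since a damaged archive can fail during decompression rather than when it is opened.

```python
import json
import zipfile
import zlib
from pathlib import Path


def write_mask(path: Path, mask: list) -> None:
    """Hypothetical stand-in for make_injection_band_mask(): write a zipped cache."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("mask.json", json.dumps(mask))


def read_mask(path: Path) -> list:
    """Hypothetical stand-in for load_band_mask(): read the zipped cache."""
    with zipfile.ZipFile(path) as zf:
        return json.loads(zf.read("mask.json"))


def load_or_rebuild(path: Path, rebuild) -> list:
    """Load the cache; if the archive is unreadable, rebuild it once and reload."""
    try:
        return read_mask(path)
    except (zipfile.BadZipFile, zlib.error):
        # The file on disk is not a usable archive: recreate it, then retry.
        rebuild(path)
        return read_mask(path)
```

A call such as `load_or_rebuild(path, lambda p: write_mask(p, mask))` only retries once, so a second failure still surfaces as an exception instead of looping.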
from flarestack.
The bug persists I'm afraid...
To reproduce
- Have a catalog with 10 sources and reduce the chunk size to 10, so just 1 chunk per cat
- 10 jobs running on DESY cluster
- Band masks are created in the cache dir but cannot be loaded
Traceback error
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 513, in get_band_mask
self.load_band_mask(mask_index[0])
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 495, in load_band_mask
self.band_mask_cache = sparse.load_npz(path)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/scipy/sparse/_matrix_io.py", line 144, in load_npz
return cls((loaded['data'], loaded['indices'], loaded['indptr']), shape=loaded['shape'])
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
magic = bytes.read(len(format.MAGIC_PREFIX))
File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 923, in read
data = self._read1(n)
File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 999, in _read1
data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 162, in <module>
run_multiprocess(n_cpu=cfg.n_cpu, mh_dict=mh_dict)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 141, in run_multiprocess
with MultiProcessor(n_cpu=n_cpu, mh_dict=mh_dict) as r:
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 57, in __init__
inj = self.mh.get_injector(season)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 310, in get_injector
self._injectors[season_name] = self.add_injector(
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 1267, in add_injector
return season.make_injector(sources, **self.inj_dict)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/data/__init__.py", line 322, in make_injector
return MCInjector.create(self, sources, **inj_kwargs)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 203, in create
return BaseInjector.subclasses[inj_name](season, sources, **inj_dict)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 427, in __init__
MCInjector.__init__(self, season, sources, **kwargs)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 240, in __init__
self.n_exp = self.calculate_n_exp()
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 457, in calculate_n_exp
self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 295, in calculate_n_exp_single
return np.sum(self.calculate_single_source(source, 1.0)["ow"])
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 288, in calculate_single_source
source_mc, omega, band_mask = self.select_mc_band(source)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 261, in select_mc_band
band_mask = self.get_band_mask(source, min_dec, max_dec)
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 517, in get_band_mask
self.load_band_mask(mask_index[0])
File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 493, in load_band_mask
del self.band_mask_cache
AttributeError: band_mask_cache
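Reading the traceback, the final `AttributeError` comes from `del self.band_mask_cache` on the second call to `load_band_mask`. One possible explanation, which is only an inference from the line numbers shown, is that the first call deleted the attribute and then failed in `sparse.load_npz`, so the retry found nothing left to delete. A minimal sketch of a loader that tolerates this, assuming a hypothetical `BandMaskHolder` class and an injected `loader` callable in place of the real injector code:

```python
class BandMaskHolder:
    """Hypothetical stand-in for the part of the injector that caches a band mask."""

    def __init__(self):
        # Keep the attribute defined at all times, even when no mask is loaded.
        self.band_mask_cache = None

    def load_band_mask(self, path, loader):
        # Release the previous mask before loading the next one. Assigning None
        # (instead of `del`) means a failed load leaves the attribute in place,
        # so calling this method again cannot raise AttributeError.
        self.band_mask_cache = None
        self.band_mask_cache = loader(path)
```

With this shape, a failed load leaves `band_mask_cache` set to `None`, and the retry in the `except` branch can run normally.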
Expected behaviour
After masks are created by `make_injection_band_mask()` they can be loaded by `load_band_mask()` while jobs are running.
Additional info
It goes into the `except` clause you added and the band masks are created. I checked whether they can be loaded externally, and they can. Not surprisingly, if I run the analysis again for the same catalog the masks are loaded fine.
It may be that, while the jobs are running, masks are being written by one job while others try to read them at the same time, hence the error.
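If concurrent writing and reading really is the cause, one common mitigation is to write each cache file to a temporary name in the same directory and rename it into place only once it is complete, so readers see either the old file or the finished new one. A minimal sketch, assuming a hypothetical helper `atomic_write_bytes` that is not part of flarestack. `os.replace` is atomic on POSIX file systems when source and target are on the same file system. Whether that guarantee holds on AFS or other network file systems would need checking.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload to a temporary file next to path, then rename it over path."""
    path = Path(path)
    # Same directory as the target, so the final rename stays on one file system.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_name, path)
    except BaseException:
        # Do not leave a partial temporary file behind if anything went wrong.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This does not stop several jobs from each recomputing the same mask, it only prevents a reader from opening a half-written file.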
from flarestack.
Thanks for updating the report @sathanas31 . A couple of questions since I am not too familiar with this part of the code:
Is it possible to run a minimal number of trials locally before launching the jobs, and would that prevent any further issues when running on the cluster? If so, I think this would be the best workaround for the time being.
I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).
from flarestack.
Yes, running 1 trial locally just to get the band masks written is the way to do it.
I have a wrapper script that does just that, which we tested with @JannisNe yesterday, and it works. I'm streamlining it today. One thought is to put it in the submitter module as a step that runs before the trials are submitted to the cluster, for example in the `submit` method:
    if self.use_cluster:
        if self.mh_dict["mh_name"] == "large_catalogue":
            ...  # <run script locally module>
        self.submit_cluster(mh_dict)
Another thought is to change the submissions to DAGMan workflows: the first job runs the script that only writes the band masks, then the trials run in however many jobs were specified, with everything running on the cluster. This would require some changes to the `HTCondorSubmitter`. It would be a solid fix, but honestly it won't be any faster than running the first step locally.
Any ideas/suggestions on how to move forward with this?
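To make the first option concrete, here is a rough, self-contained sketch of a submitter that runs a local preparation step before handing the trials to the cluster. It is not the flarestack implementation: `SubmitterSketch` and the three injected callables (`prepare_locally`, `submit_cluster`, `submit_local`) are hypothetical names for this example, and the `"large_catalogue"` check is taken from the snippet above.

```python
from typing import Any, Callable, Dict

MhDict = Dict[str, Any]


class SubmitterSketch:
    """Hypothetical, simplified stand-in for a submitter class."""

    def __init__(
        self,
        mh_dict: MhDict,
        use_cluster: bool,
        prepare_locally: Callable[[MhDict], None],
        submit_cluster: Callable[[MhDict], None],
        submit_local: Callable[[MhDict], None],
    ) -> None:
        self.mh_dict = mh_dict
        self.use_cluster = use_cluster
        self._prepare_locally = prepare_locally
        self._submit_cluster = submit_cluster
        self._submit_local = submit_local

    def submit(self) -> None:
        if not self.use_cluster:
            self._submit_local(self.mh_dict)
            return
        # Write the cached files (e.g. band masks) once, locally, so that the
        # cluster jobs only ever read them.
        if self.mh_dict.get("mh_name") == "large_catalogue":
            self._prepare_locally(self.mh_dict)
        self._submit_cluster(self.mh_dict)
```

Passing the steps in as callables keeps the ordering logic testable without a cluster, and a DAGMan-based variant could later replace `submit_cluster` without touching `submit`.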
from flarestack.