
Comments (5)

mlincett commented on August 25, 2024

> Yes, running 1 trial locally just to get the band masks written is the way to do it. I have a wrapper script that does just that; we tested it with @JannisNe yesterday and it works. I'm streamlining it today. One thought is to put it in the submitter module as a step that runs before the trials are submitted to the cluster, along these lines in the submit method:
>
> if self.use_cluster:
>     if self.mh_dict["mh_name"] == "large_catalogue":
>         <run script locally module>
>     self.submit_cluster(mh_dict)
>
> Another thought is to change the submissions to DAGMan workflows: first run the script that only writes the band masks, then run the trials in however many jobs are specified, with everything running on the cluster. This would require some changes to the HTCondorSubmitter. It would be a solid fix, but honestly it won't be any faster than running the first step locally. Any ideas or suggestions on how to move forward with this?

I prefer to run all the "preparatory" phases locally for the sake of easier control and (easier) debugging, so I like the idea of the Submitter class taking care of this. If we need/want to use a dagman we can change the implementation at a later stage.

As soon as you have a working implementation feel free to submit a PR :)
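A minimal sketch of that idea, assuming the submitter can be handed a local preparation hook. The function `submit` and the callables `prepare_locally`, `submit_cluster` and `run_locally` are hypothetical stand-ins for illustration, not flarestack's actual Submitter API; only the `mh_name == "large_catalogue"` check is taken from the snippet quoted above.

```python
def submit(mh_dict, use_cluster, prepare_locally, submit_cluster, run_locally):
    """Run the preparatory step locally, then dispatch the trials.

    The three callables are hypothetical hooks standing in for the real
    submitter methods; each receives the minimisation-handler dict.
    """
    if not use_cluster:
        # A purely local run writes its own cache files as it goes.
        return run_locally(mh_dict)
    if mh_dict["mh_name"] == "large_catalogue":
        # Write the band masks once, before any cluster job can race on them.
        prepare_locally(mh_dict)
    return submit_cluster(mh_dict)
```

Keeping the preparation behind a single hook would also leave room to swap the local step for a DAGMan node later without touching the callers.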

from flarestack.

robertdstein commented on August 25, 2024

This problem probably arises from many jobs trying to write the same zip file on the cluster, which is bad behaviour. Regardless of the reason, we can make sure that the script at least works when run again locally after the failure on the cluster.

The way to do that would be a try/except BadZipFile statement that catches this exception when loading the band mask and re-makes the band mask from scratch instead.

Specifically, this could be done for:

File "/afs/ifh.de/user/b/bradascf/flarestack/flarestack/core/injector.py", line 474, in get_band_mask
self.load_band_mask(mask_index[0])

That could be replaced with:

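from zipfile import BadZipFile  # needed at the top of injector.py

try:
    self.load_band_mask(mask_index[0])
except BadZipFile:
    # The cached mask file is corrupt: rebuild it from scratch, then reload.
    self.make_injection_band_mask()
    self.load_band_mask(mask_index[0])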

Stopping the error on the cluster would be an additional and better fix for this specific problem, but this interim workaround should be easy to implement.
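A slightly more general sketch of the same fallback, written as a standalone helper. It assumes that a half-written archive can surface as more than one exception type: the traceback posted elsewhere in this thread ends in `zlib.error` rather than `BadZipFile`. The helper name `load_with_rebuild` and its two callable arguments are hypothetical, not part of flarestack.

```python
import zipfile
import zlib

# Exception types a truncated or half-written zip-based archive may raise on read.
CORRUPT_ARCHIVE_ERRORS = (zipfile.BadZipFile, zlib.error, EOFError)


def load_with_rebuild(load, rebuild):
    """Call load(); if the archive looks corrupt, rebuild() once and retry.

    `load` and `rebuild` are hypothetical zero-argument callables, standing
    in for loading and re-making a cached band mask.
    """
    try:
        return load()
    except CORRUPT_ARCHIVE_ERRORS:
        rebuild()
        # A second failure propagates to the caller instead of looping forever.
        return load()
```

Retrying only once keeps a genuinely unreadable cache from turning into an endless rebuild loop.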


sathanas31 commented on August 25, 2024

The bug persists I'm afraid...

To reproduce

  1. Have a catalogue with 10 sources and reduce the chunk size to 10, so there is just 1 chunk per catalogue
  2. Run 10 jobs on the DESY cluster
  3. Band masks are created in the cache directory but cannot be loaded

Traceback error

File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 513, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 495, in load_band_mask
    self.band_mask_cache = sparse.load_npz(path)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/scipy/sparse/_matrix_io.py", line 144, in load_npz
    return cls((loaded['data'], loaded['indices'], loaded['indptr']), shape=loaded['shape'])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack_venv/lib/python3.10/site-packages/numpy/lib/npyio.py", line 249, in __getitem__
    magic = bytes.read(len(format.MAGIC_PREFIX))
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 923, in read
    data = self._read1(n)
  File "/cvmfs/icecube.opensciencegrid.org/py3-v4.2.1/RHEL_7_x86_64/lib/python3.10/zipfile.py", line 999, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 162, in <module>
    run_multiprocess(n_cpu=cfg.n_cpu, mh_dict=mh_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 141, in run_multiprocess
    with MultiProcessor(n_cpu=n_cpu, mh_dict=mh_dict) as r:
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/multiprocess_wrapper.py", line 57, in __init__
    inj = self.mh.get_injector(season)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 310, in get_injector
    self._injectors[season_name] = self.add_injector(
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/minimisation.py", line 1267, in add_injector
    return season.make_injector(sources, **self.inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/data/__init__.py", line 322, in make_injector
    return MCInjector.create(self, sources, **inj_kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 203, in create
    return BaseInjector.subclasses[inj_name](season, sources, **inj_dict)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 427, in __init__
    MCInjector.__init__(self, season, sources, **kwargs)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 240, in __init__
    self.n_exp = self.calculate_n_exp()
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 457, in calculate_n_exp
    self.n_exp[i]["n_exp"] = self.calculate_n_exp_single(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 295, in calculate_n_exp_single
    return np.sum(self.calculate_single_source(source, 1.0)["ow"])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 288, in calculate_single_source
    source_mc, omega, band_mask = self.select_mc_band(source)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 261, in select_mc_band
    band_mask = self.get_band_mask(source, min_dec, max_dec)
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 517, in get_band_mask
    self.load_band_mask(mask_index[0])
  File "/afs/ifh.de/group/amanda/scratch/sathanas/stacking/flarestack/flarestack/core/injector.py", line 493, in load_band_mask
    del self.band_mask_cache
AttributeError: band_mask_cache

Expected behaviour
After masks are created by make_injection_band_mask() they can be loaded by load_band_mask() while jobs are running.

Additional info

It goes into the except clause you added and the band masks are created. I checked whether they can be loaded externally, and they can. Not surprisingly, if I run the analysis again for the same catalogue the masks are loaded fine.
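Reading the traceback above, the second exception is an `AttributeError` raised by `del self.band_mask_cache` inside `load_band_mask`. That suggests (an inference from the traceback, not checked against the source) that the first, failed load had already deleted the attribute before `sparse.load_npz` raised, so the retry had nothing left to delete. A minimal sketch of a retry-safe variant follows; the `MaskHolder` class and its `loader` callable are hypothetical stand-ins, not flarestack's real injector.

```python
class MaskHolder:
    """Hypothetical stand-in for the part of an injector that caches a band mask."""

    def __init__(self, loader):
        self._loader = loader          # callable: index -> mask (may raise)
        self.band_mask_cache = None
        self.band_mask_index = None

    def load_band_mask(self, index):
        # Rebind to None instead of using `del`: the old mask can still be
        # garbage-collected, but the attribute survives a failed load, so a
        # later retry cannot fail with AttributeError.
        self.band_mask_cache = None
        self.band_mask_index = None
        self.band_mask_cache = self._loader(index)
        self.band_mask_index = index
```

With this shape, a load that fails partway leaves the holder in a defined "no mask loaded" state rather than with a missing attribute.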

It may be that, while the jobs are running, the masks are being written by one job while others try to read them at the same time, hence the error.
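If that is the cause, one common mitigation is to write each cache file under a temporary name and rename it into place only once it is complete, so that readers see either the old file or the finished new one. The sketch below assumes a filesystem where `os.replace` is atomic (true for local POSIX filesystems; whether the cluster's shared filesystem gives the same guarantee would need checking). The helper name `atomic_write_bytes` is hypothetical.

```python
import os
import tempfile


def atomic_write_bytes(path, payload):
    """Write payload to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This does not stop several jobs from each rebuilding the same mask, but it should prevent a reader from opening a file that another job is still writing.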


mlincett commented on August 25, 2024

Thanks for updating the report, @sathanas31. A couple of questions, since I am not too familiar with this part of the code:

Would it be possible to run a minimal number of trials locally before launching the jobs, and would that prevent any further issues when running on the cluster? If so, I think this would be the best workaround for the time being.

I think ultimately we should decouple any creation of cached files from the actual minimization process (see also #247).


sathanas31 commented on August 25, 2024

Yes, running 1 trial locally just to get the band masks written is the way to do it.
I have a wrapper script that does just that; we tested it with @JannisNe yesterday and it works. I'm streamlining it today. One thought is to put it in the submitter module as a step that runs before the trials are submitted to the cluster, along these lines in the submit method:

if self.use_cluster:
    if self.mh_dict["mh_name"] == "large_catalogue":
        <run script locally module>
    self.submit_cluster(mh_dict)

Another thought is to change the submissions to DAGMan workflows: first run the script that only writes the band masks, then run the trials in however many jobs are specified, with everything running on the cluster. This would require some changes to the HTCondorSubmitter. It would be a solid fix, but honestly it won't be any faster than running the first step locally.
Any ideas or suggestions on how to move forward with this?
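For reference, the DAGMan variant could be as small as two nodes with one dependency. The sketch below only generates the DAG description text; the node names `prepare` and `trials`, the helper `build_dag` and the submit file names in the usage note are hypothetical, and the trials submit file is assumed to queue the requested number of jobs itself.

```python
def build_dag(prepare_submit, trials_submit):
    """Return DAGMan input text: one preparation node, then the trial jobs.

    Both arguments are paths to HTCondor submit description files.
    """
    lines = [
        f"JOB prepare {prepare_submit}",
        f"JOB trials {trials_submit}",
        # The trials node is released only after the prepare node succeeds.
        "PARENT prepare CHILD trials",
    ]
    return "\n".join(lines) + "\n"
```

Writing `build_dag("prepare.submit", "trials.submit")` to a file and passing that file to `condor_submit_dag` would run the mask-writing step first and the trials afterwards.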

