lpcjobqueue's People

Contributors

agoose77, lgray, nsmith-, yimuchen

lpcjobqueue's Issues

Implement a variation for lxplus

Despite the repo's name, we can probably generalize the code to be useful on lxplus as well. I am not sure what limitations lxplus has, but I assume there are enough that the dask-jobqueue HTCondorCluster is not viable by itself? @maxgalli may be interested.

Fallback to /etc/condor/condor_config

The Schedd class can only use a condor_config file located in the same folder as the package. To make this code friendlier to people running their own variations of condor, the line:

os.environ["CONDOR_CONFIG"] = os.path.join(os.path.dirname(__file__), "condor_config")

should fall back to the default location /etc/condor/condor_config, and fail only if that file is missing as well.

The location of the condor_config could also be a configurable parameter.
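
The lookup order proposed in this issue could be sketched as follows. This is a minimal illustration, not the package's implementation: `resolve_condor_config`, its parameters, and the `explicit` override are hypothetical names. Only the bundled `condor_config` next to the module and the `/etc/condor/condor_config` default come from the issue text.

```python
from pathlib import Path

SYSTEM_DEFAULT = "/etc/condor/condor_config"  # default location named in the issue


def resolve_condor_config(package_dir, explicit=None, system_default=SYSTEM_DEFAULT):
    """Return the first existing condor_config among the candidates.

    Order: explicit path (hypothetical configurable parameter), then the
    file bundled next to the package, then the system default.
    """
    candidates = []
    if explicit is not None:
        candidates.append(Path(explicit))
    candidates.append(Path(package_dir) / "condor_config")
    candidates.append(Path(system_default))

    for candidate in candidates:
        if candidate.is_file():
            return str(candidate)

    # Nothing was found: fail loudly and list everything that was tried.
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"no condor_config found; tried: {tried}")
```

A caller could then assign the result of `resolve_condor_config(os.path.dirname(__file__))` to `os.environ["CONDOR_CONFIG"]` in place of the hard-coded join.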

simple_example fails if processor is imported from another module

I find it useful to define my analysis processor in a separate module so that different scripts can import it.

To replicate an issue I am having with my full analysis processor, I moved MyProcessor from simple_example.py to a new file, simple_processor.py, and then added a few lines to simple_example.py:

if __name__ == "__main__":
    from simple_processor import MyProcessor
    tic = time.time()

and

    cluster = LPCCondorCluster()
    cluster.transfer_input_files = ['simple_processor.py']

This fails with the following:

Singularity> python simple_example.py
WARNING: GSI authentication is enabled by your security configuration! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
/opt/conda/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
  warnings.warn(message, FutureWarning)
Waiting for at least one worker...
WARNING: GSI authentication is being attempted! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
Traceback (most recent call last):
  File "simple_example.py", line 29, in <module>
    hists, metrics = processor.run_uproot_job(
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/__init__.py", line 104, in _run_x_job
    return run(
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1675, in __call__
    wrapped_out = self.run(fileset, processor_instance, treename)
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1825, in run
    wrapped_out, e = executor(chunks, closure, None)
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 971, in __call__
    else _decompress(work.result())
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 280, in result
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 217, in __call__
    out = self.function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1348, in automatic_retries
    raise e
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1333, in automatic_retries
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1547, in _work_function
    processor_instance = cloudpickle.loads(lz4f.decompress(processor_instance))
ModuleNotFoundError: No module named 'simple_processor'
Last-ditch attempt to close HTCondor job 30148539 in finalizer! You should confirm the job exits!
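
The `ModuleNotFoundError` above is consistent with how pickling by reference works: a class defined in an importable module is serialized as its module and class name, so the process that unpickles it must be able to import that module. The sketch below reproduces that mechanism locally. It uses the standard-library `pickle` as a stand-in for `cloudpickle` (which, as far as I know, treats classes from importable modules the same way), and `demo_processor` is a made-up module name. It does not involve condor or dask.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path


def unpickle_without_module():
    """Pickle an instance, make its module unimportable, then unpickle."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for simple_processor.py on the submitting side.
        Path(tmp, "demo_processor.py").write_text("class MyProcessor:\n    pass\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("demo_processor")
            # The payload stores only "demo_processor.MyProcessor", not the code.
            payload = pickle.dumps(module.MyProcessor())
        finally:
            # Emulate a worker that never received the module.
            sys.path.remove(tmp)
            sys.modules.pop("demo_processor", None)

    try:
        pickle.loads(payload)
    except ModuleNotFoundError as exc:
        return str(exc)
    return "unpickled without error"


if __name__ == "__main__":
    print(unpickle_without_module())
```

If this is what happens here, the worker cannot find simple_processor.py on its import path when the processor is unpickled. Newer cloudpickle releases offer `cloudpickle.register_pickle_by_value(module)` to serialize such a module by value instead; the log does not show whether the version in this image has it.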

Turn off dask profiler by default

In https://git.rwth-aachen.de/3pia/cms_analyses/common/-/blob/master/dask.yml#L34-37 the lines

distributed:
  worker:
    memory:
      target: 0.7
      spill: 0.9
      pause: 0.92
      terminate: 0
    profile:
      interval: 1d
      cycle: 2d
      low-level: False

adjust the default dask worker memory limits and profiler settings. The worker memory pause fraction is a frequent source of headaches: jobs on a paused worker stall for a long time, often slowing down the overall processing. @pfackeldey, would you recommend these defaults here as well?
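
For reference, the same overrides could be applied from Python with `dask.config.set`, using dotted keys that mirror the YAML nesting. This is a sketch only: the values are copied from the snippet above, `WORKER_OVERRIDES` and `apply_worker_overrides` are hypothetical names, and the settings only matter if they are in effect in the process where the worker actually runs, which this issue does not address.

```python
# Dotted keys mirror the nesting of the YAML snippet quoted in this issue.
WORKER_OVERRIDES = {
    "distributed.worker.memory.target": 0.7,
    "distributed.worker.memory.spill": 0.9,
    "distributed.worker.memory.pause": 0.92,
    "distributed.worker.memory.terminate": 0,
    "distributed.worker.profile.interval": "1d",
    "distributed.worker.profile.cycle": "2d",
    "distributed.worker.profile.low-level": False,
}


def apply_worker_overrides():
    """Apply the overrides to the dask configuration of the current process."""
    import dask  # imported lazily so the mapping can be inspected without dask

    return dask.config.set(WORKER_OVERRIDES)
```

Setting the profiler interval and cycle to a day or more makes profiling rare rather than switching it off outright, which is how the quoted YAML approaches it too.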

LCG is 100x faster when running pytorch models

For some reason, when I use LCG_102, which has pytorch and coffea, my pytorch model runs at about 200 events/second. If I run the exact same test using the coffea-dask image in singularity, I get ~2 events/second. This speeds up to about 100 events/second if I set executor_args['workers']=1. Unfortunately, nothing I have tried comes close to this performance when using LPCCondorCluster, even when I set cluster_args['cores']=1 and cluster.adapt(minimum=1, maximum=1).

Pytorch is officially supported by Dask, but it seems like the dask/pytorch threading is somehow causing a bottleneck where the CPU is getting absolutely thrashed moving things around while almost nothing gets done.
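
One inexpensive experiment for testing the thread-contention hypothesis is to cap the native thread pools before the model library is imported. The sketch below only sets the widely used `OMP_NUM_THREADS` and `MKL_NUM_THREADS` environment variables. `limit_native_threads` is a hypothetical helper, and nothing here has been verified to recover the LCG performance.

```python
import os

# Environment variables commonly honored by OpenMP- and MKL-backed libraries.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n_threads=1, environ=None):
    """Cap native thread pools; call this before importing the model library."""
    environ = os.environ if environ is None else environ
    for name in THREAD_ENV_VARS:
        environ[name] = str(n_threads)
    # Return what was set so the caller can log it.
    return {name: environ[name] for name in THREAD_ENV_VARS}
```

The call has to happen in the worker process before `import torch`. Within an already running process, `torch.set_num_threads(1)` is the PyTorch call for limiting its intra-op thread pool.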

AttributeError: 'LPCCondorJob' object has no attribute 'env_dict'

I am trying to migrate my condor implementation from work_queue, which doesn't play nicely with LPC temp directories.

In the simple example the line

cluster = LPCCondorCluster()

needs to be

cluster = LPCCondorCluster(shared_temp_directory='/tmp', worker_extra_args=['--worker-port 10000:10070', '--nanny-port 10070:10100', '--no-dashboard'], job_script_prologue=[])

My job starts waiting for workers, and then I get a never-ending loop of:

Singularity> python ZZ4b/nTupleAnalysis/scripts/coffea_lpcjobqueue.py
WARNING: GSI authentication is enabled by your security configuration! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
Waiting for at least one worker...
Task exception was never retrieved
future: <Task finished name='Task-46' coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.8/asyncio/tasks.py:688> exception=AttributeError("'LPCCondorJob' object has no attribute 'env_dict'")>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 63, in _
    await self.start()
  File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 106, in start
    job = self.job_script()
  File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 87, in job_script
    quoted_environment = quote_environment(self.env_dict)
AttributeError: 'LPCCondorJob' object has no attribute 'env_dict'
2022-09-12 11:45:41,124 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa7907636a0>>, <Task finished name='Task-45' coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py:319> exception=AttributeError("'LPCCondorJob' object has no attribute 'env_dict'")>)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
    await w  # for tornado gen.coroutine support
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 63, in _
    await self.start()
  File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 106, in start
    job = self.job_script()
  File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 87, in job_script
    quoted_environment = quote_environment(self.env_dict)
AttributeError: 'LPCCondorJob' object has no attribute 'env_dict'
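
An attribute missing from a subclass like this is often a sign that two installed packages expect different versions of each other, although the traceback alone does not confirm that here. A small diagnostic along the following lines would record the versions in play. The package list is a guess at the relevant distributions and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Distributions that plausibly matter for this traceback (an assumption).
DEFAULT_PACKAGES = ("lpcjobqueue", "dask-jobqueue", "distributed", "dask")


def installed_versions(packages=DEFAULT_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Running this inside the same Singularity shell and virtual environment as the failing script, and attaching the output to the report, makes it easier to check for a mismatch.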

Failing to connect to schedd

I have been unable to get even simple_example.py to work since last week.

I have tried using older coffea-dask images and tried switching to a new interactive node, all with the same error:

Singularity> python simple_example.py
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
/opt/conda/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
  warnings.warn(message, FutureWarning)
Waiting for at least one worker...
Failed to connect to schedd.
Task exception was never retrieved
future: <Task finished name='Task-46' coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.8/asyncio/tasks.py:688> exception=AssertionError()>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 64, in _
    assert self.status == Status.running
AssertionError
