coffeateam / lpcjobqueue
A dask-jobqueue plugin for the LPC Condor queue
License: BSD 3-Clause "New" or "Revised" License
Despite the repo's name, we could probably generalize the code to be useful on lxplus as well. I am not sure what limitations lxplus has, but I assume there are enough that the stock dask-jobqueue HTCondorCluster is not viable by itself? @maxgalli may be interested.
The Schedd class is only capable of using a condor_config file present in the same folder. To make this code friendlier to people running their own variations of condor, consider the line at
lpcjobqueue/src/lpcjobqueue/schedd.py
(line 7 in b08e3c7), which hard-codes the path
/etc/condor/condor_config
On systems where that file is absent, the code will look there and then fail.
The location of the condor_config could also be a configurable parameter.
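As a sketch of one possible approach (not the current implementation), HTCondor itself already honors a CONDOR_CONFIG environment variable, so the hard-coded path could become a fallback that applies only when the variable is unset:

```python
import os

# Sketch only: let users point at their own condor_config via the standard
# CONDOR_CONFIG environment variable, falling back to the system default
# location when it is not set.
condor_config = os.environ.get("CONDOR_CONFIG", "/etc/condor/condor_config")
```

A keyword argument on the Schedd class would work just as well; the environment variable has the advantage of matching HTCondor's own convention.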
I find it useful to define my analysis processor in a separate module that can be imported by different scripts. To replicate an issue I am having with my full analysis processor, I moved MyProcessor from simple_example.py to a new file, simple_processor.py, and then added a few lines to simple_example.py:
if __name__ == "__main__":
    from simple_processor import MyProcessor
    tic = time.time()
and
cluster = LPCCondorCluster()
cluster.transfer_input_files = ['simple_processor.py']
This fails with the following:
Singularity> python simple_example.py
WARNING: GSI authentication is enabled by your security configuration! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
/opt/conda/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
warnings.warn(message, FutureWarning)
Waiting for at least one worker...
WARNING: GSI authentication is being attempted! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
Traceback (most recent call last):
File "simple_example.py", line 29, in <module>
hists, metrics = processor.run_uproot_job(
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/__init__.py", line 104, in _run_x_job
return run(
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1675, in __call__
wrapped_out = self.run(fileset, processor_instance, treename)
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1825, in run
wrapped_out, e = executor(chunks, closure, None)
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 971, in __call__
else _decompress(work.result())
File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 280, in result
raise exc.with_traceback(tb)
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 217, in __call__
out = self.function(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1348, in automatic_retries
raise e
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1333, in automatic_retries
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/coffea/processor/executor.py", line 1547, in _work_function
processor_instance = cloudpickle.loads(lz4f.decompress(processor_instance))
ModuleNotFoundError: No module named 'simple_processor'
Last-ditch attempt to close HTCondor job 30148539 in finalizer! You should confirm the job exits!
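The last frame of the traceback points at the cause: when MyProcessor lived in simple_example.py (the `__main__` script), cloudpickle serialized the class by value, but once it is imported from simple_processor it is serialized by reference (module path plus class name), so the worker must be able to import simple_processor itself. A minimal illustration of by-reference pickling, using the stdlib pickle for simplicity:

```python
import pickle

# Stand-in for the processor class defined in simple_processor.py.
class MyProcessor:
    pass

# A class pickles by reference: the payload records only the defining
# module and the qualified name, not the class's source code. The side
# that calls loads() must therefore be able to import that module, which
# is why a worker without simple_processor.py raises ModuleNotFoundError.
payload = pickle.dumps(MyProcessor)
```

If the installed cloudpickle is recent enough (2.0+), `cloudpickle.register_pickle_by_value(simple_processor)` may be a workaround, forcing the module's contents to travel inside the pickle instead of being re-imported on the worker; otherwise the file itself has to reach the worker.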
In https://git.rwth-aachen.de/3pia/cms_analyses/common/-/blob/master/dask.yml#L34-37 the lines
distributed:
  worker:
    memory:
      target: 0.7
      spill: 0.9
      pause: 0.92
      terminate: 0
    profile:
      interval: 1d
      cycle: 2d
      low-level: False
adjust the default dask worker memory limits and profiler settings. The worker memory pause fraction is a frequent source of headaches: jobs on a paused worker stall for a long time, often slowing down the overall processing. @pfackeldey, would you recommend these defaults here as well?
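To make the fractions concrete, each one is applied against the worker's memory limit; a rough sketch (the 4 GiB limit is an assumed example, not an lpcjobqueue default):

```python
# Assumed example: a worker with a 4 GiB memory limit.
memory_limit = 4 * 2**30

# target:   fraction to stay below (start freeing/spilling data)
# spill:    fraction at which data is spilled to disk
# pause:    fraction at which the worker pauses its task threads
fractions = {"target": 0.7, "spill": 0.9, "pause": 0.92}
thresholds = {name: int(f * memory_limit) for name, f in fractions.items()}
```

With `terminate: 0`, the kill-and-restart behavior is disabled entirely, so a runaway worker is paused rather than recycled.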
For some reason, when I use LCG_102 (which has pytorch and coffea), my pytorch model runs at about 200 events/second. If I run the exact same test using the coffea-dask image in singularity, I get ~2 events/second. This speeds up to about 100 events/second if I set executor_args['workers'] = 1. Unfortunately, nothing I have tried replicates anything close to this performance when using LPCCondorCluster, even when I set cluster_args['cores'] = 1 and cluster.adapt(minimum=1, maximum=1).
Pytorch is officially supported by Dask, but it seems like the dask/pytorch threading is somehow causing a bottleneck: the CPU is getting absolutely thrashed moving things around while almost nothing gets done.
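One thing that may be worth ruling out (an assumption on my part, not a confirmed diagnosis) is oversubscription between dask's worker threads and pytorch's own OpenMP/MKL thread pools. A common mitigation is to pin the numeric libraries to one thread per process, set before torch is imported:

```python
import os

# Assumed mitigation: pin BLAS/OpenMP thread pools to a single thread per
# process so they do not compete with dask's worker threads. These must be
# set before torch/numpy are first imported to take effect.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

# Inside an already-running worker, torch.set_num_threads(1) achieves the
# same thing; torch is deliberately not imported in this sketch.
```

The single-worker speedup you see with executor_args['workers'] = 1 would be consistent with this kind of thread contention, though it does not prove it.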
Trying to migrate my condor implementation from work_queue, which doesn't play nicely with LPC temp directories.
In the simple example, the line
cluster = LPCCondorCluster()
needs to be
cluster = LPCCondorCluster(
    shared_temp_directory='/tmp',
    worker_extra_args=['--worker-port 10000:10070', '--nanny-port 10070:10100', '--no-dashboard'],
    job_script_prologue=[],
)
My job starts waiting for workers, and then I get a never-ending loop of:
Singularity> python ZZ4b/nTupleAnalysis/scripts/coffea_lpcjobqueue.py
WARNING: GSI authentication is enabled by your security configuration! GSI will not work in future releases.
For details, see https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=PlanToReplaceGridCommunityToolkit
Waiting for at least one worker...
Task exception was never retrieved
future: <Task finished name='Task-46' coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.8/asyncio/tasks.py:688> exception=AttributeError("'LPCCondorJob' object has no attribute 'env_dict'")>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 63, in _
await self.start()
File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 106, in start
job = self.job_script()
File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 87, in job_script
quoted_environment = quote_environment(self.env_dict)
AttributeError: 'LPCCondorJob' object has no attribute 'env_dict'
2022-09-12 11:45:41,124 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa7907636a0>>, <Task finished name='Task-45' coro=<SpecCluster._correct_state_internal() done, defined at /opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py:319> exception=AttributeError("'LPCCondorJob' object has no attribute 'env_dict'")>)
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 358, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 63, in _
await self.start()
File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 106, in start
job = self.job_script()
File "/srv/.env/lib/python3.8/site-packages/lpcjobqueue/cluster.py", line 87, in job_script
quoted_environment = quote_environment(self.env_dict)
AttributeError: 'LPCCondorJob' object has no attribute 'env_dict'
I have been unable to get even simple_example.py to work since last week.
I have tried using older coffea-dask images and tried switching to a new interactive node, all with the same error:
Singularity> python simple_example.py
WARNING: GSI authentication is enabled by your security configuration! GSI is no longer supported.
For details, see https://htcondor.org/news/plan-to-replace-gst-in-htcss/
/opt/conda/lib/python3.8/site-packages/coffea/util.py:154: FutureWarning: In coffea version v0.8.0 (target date: 31 Dec 2022), this will be an error.
(Set coffea.deprecations_as_errors = True to get a stack trace now.)
ImportError: coffea.hist is deprecated
warnings.warn(message, FutureWarning)
Waiting for at least one worker...
Failed to connect to schedd.
Task exception was never retrieved
future: <Task finished name='Task-46' coro=<_wrap_awaitable() done, defined at /opt/conda/lib/python3.8/asyncio/tasks.py:688> exception=AssertionError()>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 64, in _
assert self.status == Status.running
AssertionError