quantum-accelerators / covalent-hpc-plugin

Covalent plugin for HPC batch job schedulers (e.g. Slurm, PBS, LSF, Flux, Cobalt) built around PSI/J
License: Apache License 2.0
Currently, when an error is raised, only the error message itself is reported, with no details of the traceback. This makes debugging very difficult, unlike a local error, where a full traceback is available.
A clearer traceback report should be provided in the UI.
This is a mirror of AgnostiqHQ/covalent#1782
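One way to surface the full traceback, rather than just the message, is to capture it at the point where the task's exception is caught. The sketch below is illustrative only (the wrapper name and shape are hypothetical, not Covalent's actual error-handling code); it shows the standard-library pattern with `traceback.format_exc()` that could forward the whole stack to the UI.

```python
import traceback

def run_task(fn, *args, **kwargs):
    """Hypothetical wrapper: capture the full traceback, not just str(exc)."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # traceback.format_exc() preserves the entire stack, which could be
        # reported to the UI instead of only the error message.
        return None, traceback.format_exc()

def boom():
    raise ValueError("bad input")

result, err = run_task(boom)
# err now contains the "Traceback (most recent call last)" block, the
# offending frame in boom(), and the final "ValueError: bad input" line.
```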
From Sankalp at Covalent:
For the first point, the DaskExecutor currently supports task packing, so if you want to give your custom plugin a shot, that would be a good reference.
Steps you can use to check whether task packing is working as expected:
1. Temporarily add `app_log.warning(f"Submitting {len(task_specs)} tasks to dask")` to this line locally.
2. Run the workflow example you showed.
3. Check the Covalent logs and you'll see that 1 task gets submitted to Dask 2 times.
4. Now enable task packing with `ct.set_config("sdk.task_packing", "true")` in your workflow, then restart your notebook and the server.
5. Check the Covalent logs again and you'll see that 2 tasks were submitted to Dask in one go, i.e. the dictionary that was supposed to be created as a separate task now gets packed with the task that uses it directly.
That is essentially how it will work in practice, and we're going to enable task packing by default once enough executors support it. It hasn't been thoroughly tested yet, but it should serve as a reference for executors that do want to support it. (The `send`, `poll`, and `receive` methods of the Dask executor, along with the `SUPPORTS_MANAGED_EXECUTION` class attribute, should help in understanding.)
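The packing behavior described above can be sketched in plain Python. This is a conceptual illustration only (function and variable names are hypothetical, not Covalent's internal API): a producer task whose output feeds exactly one consumer is merged into that consumer, so the pair is submitted to the backend in one go.

```python
def pack_tasks(tasks, deps):
    """Conceptual task-packing sketch.

    tasks: {task_name: callable} in dispatch order.
    deps:  {consumer_name: [producer_names]}.
    Returns groups of task names that would be submitted together.
    """
    # Invert the dependency map: producer -> list of consumers.
    consumers = {}
    for consumer, producers in deps.items():
        for p in producers:
            consumers.setdefault(p, []).append(consumer)

    # A producer with exactly one consumer gets packed into that consumer.
    packed_into = {p: cs[0] for p, cs in consumers.items() if len(cs) == 1}

    groups = []
    for name in tasks:
        if name in packed_into:
            continue  # rides along with its consumer instead
        group = [p for p in deps.get(name, []) if packed_into.get(p) == name]
        group.append(name)
        groups.append(group)
    return groups

# The example from the steps above: the dict-building task gets packed
# with the task that uses it, so both go to the backend in one submission.
tasks = {"make_dict": None, "use_dict": None}
deps = {"use_dict": ["make_dict"]}
print(pack_tasks(tasks, deps))  # [['make_dict', 'use_dict']]
```

A producer consumed by two or more tasks is left unpacked in this sketch, since merging it into one consumer would break the other.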
covalent version: 0.233.1rc0
covalent-hpc-plugin version: 0.0.8

Job submission to Radical Pilot fails with the error below:
```
  File "/pscratch/sd/p/prmantha/covalent-workdir/psij-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-38.py", line 7, in <module>
    job_executor = JobExecutor.get_instance("rp")
  File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/job_executor.py", line 238, in get_instance
    instance = selected.ecls(url=url, config=config)
  File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/executors/rp.py", line 40, in __init__
    super().__init__(url=url, config=config)
  File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/job_executor.py", line 39, in __init__
    assert config
AssertionError
```
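The traceback shows that PSI/J's base `JobExecutor.__init__` asserts a truthy `config`, and the `rp` executor forwards whatever it received, so instantiating it without an explicit config fails. The stdlib-only sketch below mirrors that failure mode with illustrative class names (not PSI/J's actual implementation), along with the defaulting pattern that would avoid it.

```python
# Stdlib-only sketch mirroring the failure mode in the traceback above.
# Class names are illustrative, not PSI/J's actual implementation.
class ExecutorConfig:
    pass

class BaseExecutor:
    def __init__(self, url=None, config=None):
        # Mirrors `assert config` at psij/job_executor.py:39: a None (or
        # otherwise falsy) config raises AssertionError.
        assert config
        self.config = config

class RPLikeExecutor(BaseExecutor):
    def __init__(self, url=None, config=None):
        # The fix pattern: default the config before calling super(),
        # instead of forwarding None unchanged.
        super().__init__(url=url, config=config or ExecutorConfig())
```

In real PSI/J code, the analogous workaround would be passing an explicit executor config to `JobExecutor.get_instance` rather than relying on the default, assuming the `rp` executor accepts one.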
### How can we easily reproduce the issue?
```python
# On NERSC Perlmutter
import os
from pathlib import Path

from psij import Job, JobAttributes, JobExecutor, JobSpec, ResourceSpecV1

job_executor = JobExecutor.get_instance("rp")
job = Job(
    JobSpec(
        name="0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114",
        executable="python",
        environment={},
        launcher="single",
        arguments=[str(Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/script-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.py")).expanduser().resolve())],
        directory=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir")).expanduser().resolve(),
        stdout_path=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/stdout-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.log")).expanduser().resolve(),
        stderr_path=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/stderr-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.log")).expanduser().resolve(),
        pre_launch=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/pre-launch-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.sh")).expanduser().resolve(),
        resources=ResourceSpecV1(**{'node_count': 1, 'processes_per_node': 1, 'gpu_cores_per_process': 0}),
        attributes=JobAttributes(**{'project_name': 'alkf', 'custom_attributes': {'slurm.constraint': 'cpu', 'slurm.qos': 'regular'}}),
    )
)
job_executor.submit(job)
native_id = job.native_id
print(native_id)
```
### If possible, please also upload any relevant files that might help with debugging.
_No response_
Take inspiration from above. AgnostiqHQ/covalent-slurm-plugin#86
covalent version: 0.233.1rc0
covalent-hpc-plugin version: used current pip install

On Perlmutter, I have the command `module load cudatoolkit/12.0`, which emits some output during SSH login. Job submission using the HPC plugin fails with the error: "Making remote directory failed: The following have been reloaded with a version change:
_No response_
i.e. run the server on the HPC machine and submit there without SSH. Linking this to AgnostiqHQ/covalent-slurm-plugin#55.
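Besides running the server on the HPC machine, a common workaround for this class of failure is to keep non-interactive SSH sessions silent. The snippet below is a hedged sketch of a `~/.bashrc` guard (the `module load` line is taken from the report above): noisy commands run only in interactive shells, so the plugin's SSH commands see clean output.

```shell
# Hypothetical ~/.bashrc guard: run commands that print to the terminal
# (like `module load`) only in interactive shells, so non-interactive SSH
# sessions, such as those the plugin opens, produce no stray output.
if [[ $- == *i* ]]; then
    module load cudatoolkit/12.0   # noisy, but safe here: interactive only
fi
```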
Tracking AgnostiqHQ/covalent-slurm-plugin#64. There was an initial attempted implementation in AgnostiqHQ/covalent-slurm-plugin#71.
Have the plugin write them out to a pre_launch/post_launch Bash file and then clean it up. A good example is the loading of modules.
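The write-then-clean-up flow suggested above could look like the following. This is a minimal sketch, not the plugin's actual implementation; the function name is hypothetical, and the real plugin would stage the script to the remote machine rather than the local filesystem.

```python
import os
import tempfile

def write_pre_launch(commands):
    """Write setup commands (e.g. module loads) to a throwaway pre-launch
    Bash script and return its path."""
    fd, path = tempfile.mkstemp(suffix=".sh", prefix="pre-launch-")
    with os.fdopen(fd, "w") as f:
        f.write("#!/bin/bash\n")
        f.write("\n".join(commands) + "\n")
    os.chmod(path, 0o700)  # make it executable by the owner
    return path

path = write_pre_launch(["module load cudatoolkit/12.0"])
# ... hand `path` to the scheduler as the job's pre-launch script ...
os.remove(path)  # clean up once the job has been submitted
```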