
covalent-hpc-plugin's Issues

Include full traceback in error reporting in UI

Currently, when an error is raised, only the error message itself is reported in the UI, with no traceback details. This makes debugging much harder than a local run, where a full traceback is available.

A clearer traceback report should be provided in the UI.
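For reference, Python's standard `traceback` module can capture the full traceback as a string, which is the kind of detail that would be useful to surface in the UI (a generic sketch of the idea, not the plugin's actual reporting code):

```python
import traceback

def format_error(exc: BaseException) -> str:
    """Return the full traceback text for an exception, not just its message."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))

try:
    1 / 0
except ZeroDivisionError as exc:
    report = format_error(exc)

# `report` includes the traceback frames and failing expression,
# whereas str(exc) would only be "division by zero".
```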

This is a mirror of AgnostiqHQ/covalent#1782

Support task packing

What new feature would you like to see?

From Sankalp at Covalent:

For the first point, the DaskExecutor currently supports task packing, so if you want to give your custom plugin a shot, that would be a good reference 🙂.

Steps you can use to check whether task packing is working as expected:

  1. Temporarily add `app_log.warning(f"Submitting {len(task_specs)} tasks to dask")` to this line locally.
  2. Run the workflow example you showed.
  3. Check the Covalent logs and you'll see that 1 task gets submitted to Dask 2 times.
  4. Now, enable task packing with `ct.set_config("sdk.task_packing", "true")` in your workflow, and restart your notebook and the server.
  5. When you check the Covalent logs again, you'll see that 2 tasks were submitted to Dask in 1 go, i.e. the dictionary that was supposed to be created as a separate task now gets packed with the task that uses it directly.

That is essentially how it's going to work in practice, and we're going to enable task packing by default once we have enough executors supporting it. It is still something that we haven't thoroughly tested, but it should help as a reference for the executors that do want to support it. (The `send`, `poll`, and `receive` methods of the Dask executor, and the `SUPPORTS_MANAGED_EXECUTION` class attribute, should help in understanding.)
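The behavior described above can be illustrated with a plain-Python sketch of the packing idea (hypothetical names, not Covalent's internal API): instead of one backend submission per task, compatible task specs are grouped and submitted in a single call.

```python
def submit_packed(task_specs, submit_fn):
    """Submit all task specs in one call instead of len(task_specs) calls."""
    # Without packing: one submit_fn call per spec -> N submissions.
    # With packing: a single submission carrying every spec.
    return submit_fn(task_specs)

# A fake backend that records each submission it receives.
calls = []
def fake_backend_submit(specs):
    calls.append(list(specs))
    return [f"job-{i}" for i, _ in enumerate(specs)]

# Two tasks (e.g. a dictionary-building task and the task that uses it)
# go to the backend in one go, mirroring "2 tasks submitted to dask in 1 go".
job_ids = submit_packed([{"task": "make_dict"}, {"task": "use_dict"}], fake_backend_submit)
```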

Covalent HPC RP plugin fails with error

Details about the Python environment

  • covalent version: 0.233.1rc0
  • covalent-hpc-plugin version: 0.0.8

What is the issue?

Job submission to RADICAL-Pilot fails with the error below:

```
File "/pscratch/sd/p/prmantha/covalent-workdir/psij-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-38.py", line 7, in <module>
  job_executor = JobExecutor.get_instance("rp")
File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/job_executor.py", line 238, in get_instance
  instance = selected.ecls(url=url, config=config)
File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/executors/rp.py", line 40, in __init__
  super().__init__(url=url, config=config)
File "/global/common/software/m4408/pmantha/py39/lib/python3.9/site-packages/psij/job_executor.py", line 39, in __init__
  assert config
AssertionError
```
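The traceback shows the base executor's `assert config` failing, which suggests the RP executor is forwarding a `config` of `None`. The failure pattern can be reproduced with a simplified model in plain Python (hypothetical classes, not psij's actual implementation):

```python
class BaseExecutor:
    def __init__(self, url=None, config=None):
        # Mirrors the `assert config` at job_executor.py line 39:
        # a None (or otherwise falsy) config trips the assertion.
        assert config, "executor requires a config object"
        self.config = config

class RPExecutor(BaseExecutor):
    def __init__(self, url=None, config=None):
        # Forwarding config=None, as in the traceback, raises AssertionError.
        super().__init__(url=url, config=config)

try:
    RPExecutor(config=None)
    failed = False
except AssertionError:
    failed = True
```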


How can we easily reproduce the issue?

On NERSC Perlmutter:

```python

import os
from pathlib import Path
from psij import Job, JobAttributes, JobExecutor, JobSpec, ResourceSpecV1

job_executor = JobExecutor.get_instance("rp")

job = Job(
    JobSpec(
        name="0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114",
        executable="python",
        environment={},
        launcher="single",
        arguments=[str(Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/script-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.py")).expanduser().resolve())],
        directory=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir")).expanduser().resolve(),
        stdout_path=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/stdout-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.log")).expanduser().resolve(),
        stderr_path=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/stderr-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.log")).expanduser().resolve(),
        pre_launch=Path(os.path.expandvars("/pscratch/sd/p/prmantha/covalent-workdir/pre-launch-0692d338-7cb9-4be8-9f3c-a92ef4a1840b-114.sh")).expanduser().resolve(),
        
        resources=ResourceSpecV1(**{'node_count': 1, 'processes_per_node': 1, 'gpu_cores_per_process': 0}),
        attributes=JobAttributes(**{'project_name': 'alkf', 'custom_attributes': {'slurm.constraint': 'cpu', 'slurm.qos': 'regular'}}),
    )
)

job_executor.submit(job)
native_id = job.native_id
print(native_id)
```

If possible, please also upload any relevant files that might help with debugging.

_No response_

Job fails because of .bashrc/profile script loading on the HPC Perlmutter machine

Details about the Python environment

  • covalent version: 0.233.1rc0
  • covalent-hpc-plugin version: used current pip install

What is the issue?

On Perlmutter, my .bashrc contains `module load cudatoolkit/12.0`, which emits some output during SSH login. Job submission using the HPC plugin then fails with the error:

Making remote directory failed: The following have been reloaded with a version change:

  1. cudatoolkit/11.7 => cudatoolkit/12.0

When I commented out the module load statements in my .bashrc, the job started, so the line below should check the actual exit code rather than the stderr output.

https://github.com/Quantum-Accelerators/covalent-hpc-plugin/blob/develop/covalent_hpc_plugin/hpc.py#L685
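The suggested fix can be sketched with the standard library: treat a remote command as failed only on a nonzero exit code, since login scripts may legitimately write notices to stderr (a generic sketch of the check, not the plugin's actual SSH code):

```python
import subprocess

# Simulates a login shell whose .bashrc emits a module-reload notice to
# stderr while the actual command (e.g. mkdir) still succeeds (exit code 0).
proc = subprocess.run(
    ["sh", "-c", "echo 'reloaded cudatoolkit/12.0' >&2; exit 0"],
    capture_output=True,
    text=True,
)

# Deciding success based on stderr output would wrongly report a failure
# here; checking the exit code accepts the harmless warning.
ok = proc.returncode == 0
```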

How can we easily reproduce the issue?


If possible, please also upload any relevant files that might help with debugging.

No response
