csiro-mlai / dl_hpc_starter_pack
pip install the deep learning & HPC starter pack to begin your project.
License: Apache License 2.0
(simplified_22) bracewell-i1 simplified_22$ dlhpcstarter -t cifar10 -c baseline
Initial utilisation on GPU:0 is 0.057.
Initial utilisation on GPU:1 is 0.990.
Initial utilisation on GPU:2 is 0.288.
Initial utilisation on GPU:3 is 0.411.
Traceback (most recent call last):
File "/scratch2/nic261/environments/simplified_22/bin/dlhpcstarter", line 8, in <module>
sys.exit(main())
File "/scratch2/nic261/environments/simplified_22/lib/python3.9/site-packages/dlhpcstarter/__main__.py", line 37, in main
stages_fnc = importer(definition='stages', module='.'.join(['task', args.task, 'stages']))
File "/scratch2/nic261/environments/simplified_22/lib/python3.9/site-packages/dlhpcstarter/utils.py", line 46, in importer
module = importlib.import_module(module)
File "/apps/python/3.9.4/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'task'
(simplified_22) bracewell-i1 simplified_22$
i.e. it cannot see the task directory
(simplified_22) bracewell-i1 simplified_22$ ls
README.md jobs.sh main.py notes.txt requirements.txt task tools
(simplified_22) bracewell-i1 simplified_22$
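One likely cause (an assumption on my part, not confirmed against the package): the installed dlhpcstarter console script does not have the launch directory on sys.path, so importlib.import_module cannot resolve the local task package even though ls shows it in the working directory. A minimal workaround sketch:

```python
import os
import sys

# Put the launch directory (which contains the local ``task`` package) on
# sys.path before dlhpcstarter resolves ``task.<name>.stages``. Console
# scripts get the script's own directory on sys.path, not the CWD.
cwd = os.getcwd()
if cwd not in sys.path:
    sys.path.insert(0, cwd)

# importlib.import_module("task.cifar10.stages") would now search the CWD.
```

If this is the cause, the fix would belong near the top of dlhpcstarter's main() rather than in user code.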
@ashgillman thoughts?
E.g., an entry point could do the following:
dlhpcstarter.create_task cifar10
would create the following layout:
task
task/cifar10
task/cifar10/model
task/cifar10/config
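The proposed entry point could be sketched as follows (the function name create_task and its signature are hypothetical, taken from the proposal above):

```python
import tempfile
from pathlib import Path

def create_task(task: str, root: str) -> None:
    """Hypothetical scaffolding helper: create the task/<task>/model and
    task/<task>/config directories under ``root``, as in the layout above."""
    for sub in ("model", "config"):
        Path(root, "task", task, sub).mkdir(parents=True, exist_ok=True)

# Demonstrate in a throwaway directory rather than a real project root.
with tempfile.TemporaryDirectory() as root:
    create_task("cifar10", root)
    created = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))
```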
Allow the user to give a relative or absolute path to the module and definition of a model, so that it does not have to live in task/TASK/model or task/TASK/config.
Maybe have something like the following alternative variables:
module_path
definition_path
config_path
that can be used instead of
module
definition
config
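A resolver for the proposed path-based keys could be sketched with the standard importlib machinery (the name import_from_path and the key semantics are assumptions from the proposal, not existing package API):

```python
import importlib.util
import tempfile

def import_from_path(definition: str, module_path: str):
    """Hypothetical resolver for the proposed module_path/definition_path keys:
    load ``definition`` from an arbitrary .py file, so a model does not have
    to live under task/TASK/model or task/TASK/config."""
    spec = importlib.util.spec_from_file_location("_user_module", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, definition)

# Demo: load a definition from a file outside any ``task`` package.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("class Model:\n    name = 'demo'\n")
    path = f.name
Model = import_from_path("Model", path)
```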
I had this working previously with Lightning about a year ago; it would be nice to have it again.
# NCCL debug flag
cluster.add_command('export NCCL_DEBUG=INFO')
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster.html
https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html
It may be cleaner to use Lightning's cluster environment support instead of managing this manually; it may also remove the need to reload all the configs in stages.py.
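For the manual route, the NCCL flag can simply be exported in the Slurm submission script; the resource values below are illustrative, not the starter pack's actual defaults:

```shell
#!/bin/bash
#SBATCH --ntasks-per-node=4   # one task per GPU (hypothetical values)
#SBATCH --gres=gpu:4

# Enable verbose NCCL logging to diagnose multi-GPU communication issues.
export NCCL_DEBUG=INFO

# srun python main.py ...    # actual launch command elided
echo "NCCL_DEBUG=$NCCL_DEBUG"
```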
Hello,
Firstly, I would like to extend my gratitude towards the development team for creating the dl_hpc_starter_pack. It's been immensely helpful for kickstarting projects in a high-performance computing context, and the integration with PyTorch Lightning and Hydra for configurations has notably streamlined the process.
I have successfully experimented with the CIFAR10 Baseline model as described in the documentation and looked into implementing customizations based on my project's requirements. While I appreciate the configurational ease provided, I encountered challenges trying to adjust the learning rate scheduling function and explore optimizers beyond the provided examples.
Learning Rate Scheduling:
I understand from the provided examples that optimizers are configured within the model's configure_optimizers method as seen in the Inheritance and Baseline model scripts. However, I am looking for guidance on integrating custom learning rate schedulers (e.g., cosine annealing) within this setup. Could you provide some insights or a template on how to correctly implement this?
Configuring Other Optimizers:
Similarly, I am interested in trying out different optimizers like RMSprop or Adamax. Is the process as straightforward as substituting the optimizer class in the configure_optimizers method, or are there other considerations (especially with respect to the configuration files or scheduler integration)?
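For both questions, the usual Lightning pattern is to return the optimizer together with a scheduler from configure_optimizers. Below is a minimal torch-only sketch (not taken from the starter pack's own scripts; the Linear model and hyperparameters are placeholders) showing Adamax swapped in with cosine annealing attached:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder for the CIFAR10 model

def configure_optimizers():
    """Lightning-style configure_optimizers sketch: swap the optimizer class
    (here Adamax) and attach a CosineAnnealingLR scheduler."""
    optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
    # Lightning accepts this dict form; "interval": "epoch" steps the
    # scheduler once per training epoch.
    return {
        "optimizer": optimizer,
        "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"},
    }

out = configure_optimizers()
```

Swapping to RMSprop is the same one-line change on the optimizer; the scheduler and config files should not need to know about it.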
Lastly, I wanted to ensure that my approach aligns with the package's aim to promote rapid development via configuration files and class inheritance. Therefore, any advice on maintaining or enhancing this aspect while introducing the customizations mentioned above would be greatly appreciated.
Thank you for your support; I look forward to your response.
Best regards, Xiwei Deng
Use this for job submission to Slurm: