Comments (5)
I have reserved an OnDemand session on gpu-c-1 with one gpu, with 32 nodes, 20GB memory per node. The training config I am using is here:
/local/scratch/groups/cmanalysis.grp/microDL_SP/config/lit_test/HEK_dNucl_2023_04_21_15.yml
It seems to have trained for 70 epochs over a day.
70 epochs/day is not too bad IMO. But you can probably get better performance by using more CPU cores (e.g. 64) in data loading.
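As a rough sketch of how a worker count could be derived from the cores actually granted to the session, the helper below uses only the standard library. The function name `suggest_num_workers` and its `reserve` and `cap` parameters are hypothetical, and where the resulting number is set in the training config is not shown in this thread.

```python
import os


def suggest_num_workers(reserve: int = 1, cap: int = 64) -> int:
    """Suggest a data-loading worker count from the CPUs this process may use.

    Hypothetical helper for illustration; not part of microDL.
    """
    try:
        # On Linux this returns the CPU set the process is allowed to run on,
        # which usually reflects the cores granted by the scheduler.
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the machine total.
        available = os.cpu_count() or 1
    # Leave `reserve` cores for the main training process, never exceed `cap`,
    # and always return at least one worker.
    return max(1, min(cap, available - reserve))
```

Calling `suggest_num_workers()` inside the reserved session gives a starting value to try; whether more workers actually help depends on how fast the storage can serve the data.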
How many GPUs do you have? Currently it doesn't work with multiple GPUs.
For a Slurm session check with:
echo $CUDA_VISIBLE_DEVICES
If there's more than one you will need to manually modify this line to use one specific GPU:
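Independently of that line, a generic way to pin a process to a single GPU is to narrow `CUDA_VISIBLE_DEVICES` before any CUDA library is initialized. The sketch below assumes the variable is already set by the scheduler; the helper name `pin_single_gpu` is hypothetical and is not part of microDL.

```python
import os


def pin_single_gpu(index: int = 0) -> str:
    """Restrict this process to one of the GPUs listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; returns the device ID that was kept.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    devices = [d.strip() for d in visible.split(",") if d.strip()]
    if not devices:
        raise RuntimeError("CUDA_VISIBLE_DEVICES is unset or empty")
    if not 0 <= index < len(devices):
        raise IndexError(
            f"index {index} out of range for {len(devices)} visible device(s)"
        )
    chosen = devices[index]
    # This must run before torch (or any other CUDA library) initializes CUDA,
    # otherwise the change has no effect on the current process.
    os.environ["CUDA_VISIBLE_DEVICES"] = chosen
    return chosen
```

The call would go at the very top of the training entry point, before importing the deep learning framework.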
Thanks @ziw-liu! It works with one gpu.
Is it normal for the error to appear a while after the training starts? I am trying to understand if my training is slow. I am working on the scratch space on gpu-c-1.
Is it normal for the error to appear a while after the training starts?
That depends on what you mean by 'start'. By default it first runs several dummy validation iterations as a sanity check, which might not trigger the multi-GPU error.
Slow training should be unrelated to this though. In my experiments it seemed to be I/O bound. Can you provide more information about the hardware resources and training config?
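One way to check whether a run is I/O bound is to time how long the loop waits for each batch versus how long the training step takes. The sketch below is framework-agnostic and uses only the standard library; `time_loading_vs_compute`, `batches`, and `step` are hypothetical names, and with a GPU the `step` callable would need to synchronize the device before returning (for example with `torch.cuda.synchronize()`) for the split to be meaningful.

```python
import time
from typing import Callable, Iterable, Tuple


def time_loading_vs_compute(
    batches: Iterable,
    step: Callable[[object], None],
    max_batches: int = 50,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in `step`)."""
    load_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent here is compute
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s
```

If the first number dominates, the run is limited by data loading rather than by the GPU, and more workers or faster storage are the things to try.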
I have reserved an OnDemand session on gpu-c-1 with one gpu, with 32 nodes, 20GB memory per node. The training config I am using is here:
/local/scratch/groups/cmanalysis.grp/microDL_SP/config/lit_test/HEK_dNucl_2023_04_21_15.yml
It seems to have trained for 70 epochs over a day.
I am also not seeing any logging from the training. The default_root_dir is set to this folder in the config:
/local/scratch/groups/cmanalysis.grp/microDL_SP/logs/lit_test/
Is there any other parameter that needs to be changed in my config to start logging? Or is it not able to log on scratch space?