Comments (5)

ziw-liu commented on September 27, 2024

> I have reserved an OnDemand session on gpu-c-1 with one GPU, with 32 nodes, 20 GB memory per node. The training config I am using is here:
> /local/scratch/groups/cmanalysis.grp/microDL_SP/config/lit_test/HEK_dNucl_2023_04_21_15.yml
> It seems to have trained for 70 epochs over a day.

70 epochs/day is not too bad IMO. But you can probably get better performance by using more CPU cores (e.g. 64) in data loading.
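Before raising the worker count, it can help to check how many cores the job has actually been granted. The snippet below is only a sketch: `available_cpu_count` and `suggest_num_workers` (and its `reserve` parameter) are hypothetical helper names, not part of microDL. It assumes Linux, where `os.sched_getaffinity` reports the cores the process is allowed to use, which is typically what a Slurm allocation restricts.

```python
import os


def available_cpu_count() -> int:
    """Number of CPU cores this process may run on."""
    try:
        # Linux only: respects the affinity mask set by the scheduler.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity.
        return os.cpu_count() or 1


def suggest_num_workers(reserve: int = 1) -> int:
    """Hypothetical heuristic: leave `reserve` cores for the main process."""
    return max(1, available_cpu_count() - reserve)


if __name__ == "__main__":
    print("Cores available to this job:", available_cpu_count())
    print("Suggested data-loading workers:", suggest_num_workers())
```

If the printed core count is well below what was requested, the allocation rather than the config is probably the limiting factor.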

from microdl.

ziw-liu commented on September 27, 2024

How many GPUs do you have? Currently it doesn't work with multiple GPUs.

For a Slurm session, check with:

echo $CUDA_VISIBLE_DEVICES

If there's more than one, you will need to manually modify this line to use one specific GPU:
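A generic alternative to editing the code is to narrow `CUDA_VISIBLE_DEVICES` to a single device before anything initializes CUDA. The sketch below assumes that approach; `pick_single_gpu` is a hypothetical helper and is not the line from the microDL source referred to above.

```python
import os
from typing import Optional


def pick_single_gpu(visible: Optional[str], index: int = 0) -> Optional[str]:
    """Return one device ID from a CUDA_VISIBLE_DEVICES-style string, or None if it is empty."""
    if not visible:
        return None
    devices = [d.strip() for d in visible.split(",") if d.strip()]
    if not devices:
        return None
    # Raises IndexError if `index` is out of range.
    return devices[index]


if __name__ == "__main__":
    chosen = pick_single_gpu(os.environ.get("CUDA_VISIBLE_DEVICES"))
    if chosen is not None:
        # This must happen before the first CUDA call,
        # in practice before importing the training code.
        os.environ["CUDA_VISIBLE_DEVICES"] = chosen
    print("Using GPU:", chosen)
```

For example, `pick_single_gpu("2,3", index=1)` selects device `"3"`.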

Soorya19Pradeep commented on September 27, 2024

Thanks @ziw-liu! It works with one GPU.
Is it normal for the error to appear a while after the training starts? I am trying to understand if my training is slow. I am working on the scratch space on gpu-c-1.

ziw-liu commented on September 27, 2024

> Is it normal for the error to appear a while after the training starts?

It depends on what you mean by 'start'. By default, it first runs several dummy validation iterations as a sanity check, which might not trigger the multi-GPU error.

Slow training should be unrelated to this though. In my experiments it seemed to be I/O bound. Can you provide more information about the hardware resources and training config?
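One way to test whether a run is I/O bound is to time how long the loop waits for each batch versus how long it spends processing it. The following is a minimal sketch under that assumption; `time_loading_vs_compute`, `batches`, and `step` are hypothetical names, and on a GPU the step time is only meaningful if `step` waits for the device to finish, because kernels launch asynchronously.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def time_loading_vs_compute(
    batches: Iterable[Any],
    step: Callable[[Any], Any],
    max_batches: int = 50,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for batches, seconds spent in `step`)."""
    load_s = 0.0
    step_s = 0.0
    it = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent here is compute
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s


if __name__ == "__main__":
    def slow_loader():
        # Stand-in for a data loader that is slow to produce batches.
        for i in range(5):
            time.sleep(0.01)
            yield i

    load_s, step_s = time_loading_vs_compute(slow_loader(), lambda b: None)
    print(f"loading: {load_s:.3f} s, compute: {step_s:.3f} s")
```

If loading time dominates, adding data-loading workers or moving the data to faster storage is more likely to help than changing the model.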

Soorya19Pradeep commented on September 27, 2024

I have reserved an OnDemand session on gpu-c-1 with one GPU, with 32 nodes, 20 GB memory per node. The training config I am using is here:
/local/scratch/groups/cmanalysis.grp/microDL_SP/config/lit_test/HEK_dNucl_2023_04_21_15.yml
It seems to have trained for 70 epochs over a day.

I am also not seeing any logging from the training. The default_root_dir in the config is set to this folder:
/local/scratch/groups/cmanalysis.grp/microDL_SP/logs/lit_test/
Is there any other parameter that needs to be changed in my config to start logging? Or is it not able to log on scratch space?
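A quick way to rule out the scratch-space possibility is to check whether the process can create a file in the log directory. This is a sketch only; `is_writable_dir` is a hypothetical helper, and it tests file-system permissions rather than anything specific to the training code's logger.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is an existing directory where a file can be created."""
    if not os.path.isdir(path):
        return False
    try:
        # Create and immediately discard a temporary file in the directory.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    log_dir = "/local/scratch/groups/cmanalysis.grp/microDL_SP/logs/lit_test/"
    print(log_dir, "writable:", is_writable_dir(log_dir))
```

If this prints `False` from inside the training session, the missing logs point to the directory rather than to the config.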
