
Comments (9)

jrmadsen avatar jrmadsen commented on May 26, 2024

conda activate stemdl

Where does this conda env come from?

Do you know how the PyTorch execution model changes when multiple GPUs are used? Does it fork for each additional GPU? I’m seeing 3 fork calls, which suggests that might be the root cause of the issue.

from omnitrace.

daviteix avatar daviteix commented on May 26, 2024

My mistake, it should have been: conda create stemdl. Yes, it uses fork. Is there a workaround?


jrmadsen avatar jrmadsen commented on May 26, 2024

fork has caused a number of problems in the past, mostly related to perfetto because of a background thread. You might want to try perfetto with the system backend. You will probably want to increase the flush and write periods to match the duration in the perfetto config file (see sample here) because of quirks in how perfetto writes that file and how omnitrace writes some perfetto data. Essentially, once perfetto flushes or writes data, you can’t add any time-stamped data that happened before that point, and a fair amount of the data gathered through sampling isn’t passed to perfetto until finalization. That is because we have to map instruction pointers to line info, and doing so while sampling adds too much overhead at runtime.
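
To make the "flush and write periods equal to the duration" advice concrete, here is a minimal sketch that renders the relevant part of a text-format Perfetto config. The field names (`duration_ms`, `flush_period_ms`, `file_write_period_ms`, `write_into_file`) are taken from Perfetto's TraceConfig; the helper `render_trace_config` is hypothetical, and whether the omnitrace sample config uses exactly these fields is an assumption.

```python
def render_trace_config(duration_ms: int) -> str:
    """Render a text-format Perfetto TraceConfig fragment.

    The flush and file-write periods are set equal to the trace duration,
    so nothing is flushed or written before the trace ends.
    This is only a fragment: buffers and data sources must still be
    supplied by the real config file.
    """
    lines = [
        "write_into_file: true",
        f"duration_ms: {duration_ms}",
        f"flush_period_ms: {duration_ms}",
        f"file_write_period_ms: {duration_ms}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: a 60-second trace.
    print(render_trace_config(60_000), end="")
```

The three values are deliberately derived from a single argument so they cannot drift apart when the duration is changed.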


daviteix avatar daviteix commented on May 26, 2024

Is there a command example when using omnitrace-python? I have tried without success:
export OMNITRACE_PERFETTO_BACKEND=system
omnitrace-perfetto-traced --background
omnitrace-perfetto --out ./omnitrace-perfetto.proto --txt -c ${OMNITRACE_ROOT}/rocm-5.4/share/omnitrace/omnitrace.cfg --background
omnitrace-python-3.8 -- ./stemdl_classification.py --config ./stemdlConfig.yaml
The option --perfetto-backend=system is not valid for omnitrace-python.


jrmadsen avatar jrmadsen commented on May 26, 2024

Update: I’ve tracked down the issue. It’s not related to perfetto, but rather the sys.argv passed to omnitrace’s __main__.py upon re-entry after PyTorch forks. I should have a PR merged with the fix by tomorrow afternoon.


daviteix avatar daviteix commented on May 26, 2024


jrmadsen avatar jrmadsen commented on May 26, 2024

Only difference is I am not using slurm

Ah yeah, I’m running this on Lockhart, and without SLURM I end up with only 1 CPU available to me (e.g. nproc returns 1), whereas srun nproc returns 128. Given all the threads that are created, I figured using SLURM was desirable and maybe just an omission in the instructions. As it turns out, I assumed, incorrectly, that the execution model would be the same.

It appears PyTorch will make even more forks when nproc < ngpu and these forks appear to not retain the variable I stored in #291 to re-patch sys.argv. Storing it in an environment variable in #292 appears to do the trick.
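
For anyone curious about the general idea, here is a minimal sketch of carrying the original argument list through an environment variable so that a re-entered interpreter can recover it. This is not the actual code from #291 or #292: the variable name `EXAMPLE_SAVED_ARGV` and both helper functions are hypothetical.

```python
import json
import os

# Hypothetical variable name, chosen only for this illustration.
SAVED_ARGV_ENV = "EXAMPLE_SAVED_ARGV"


def save_argv(argv, environ=os.environ):
    """Store the argument list as JSON in the environment.

    Child processes inherit the environment by default, so a freshly
    started interpreter can still read the value, unlike a Python
    variable held in the parent's memory.
    """
    environ[SAVED_ARGV_ENV] = json.dumps(list(argv))


def restore_argv(environ=os.environ):
    """Return the saved argument list, or None if nothing was saved."""
    raw = environ.get(SAVED_ARGV_ENV)
    return json.loads(raw) if raw is not None else None


if __name__ == "__main__":
    import sys

    saved = restore_argv()
    if saved is None:
        save_argv(sys.argv)   # first entry: remember the arguments
    else:
        sys.argv = saved      # re-entry: re-patch sys.argv
    print(sys.argv)
```

JSON is used rather than joining on spaces so that arguments containing whitespace survive the round trip.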


jrmadsen avatar jrmadsen commented on May 26, 2024

By the way, if you are also running on Lockhart, I'd highly recommend using srun. PyTorch may try to compensate by forking instead of creating threads, but from watching top while that code was running, all 4 forked processes were sharing the same CPU (i.e. each one's CPU% was roughly 25% instead of the roughly 100% you would see if they were running on separate CPUs).
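
A quick way to check for this situation from Python before launching, as a minimal sketch: it counts the CPUs the current process may actually run on, which is the number that matters here rather than the number installed in the node. The helper names are hypothetical; `os.sched_getaffinity` is only available on some platforms (Linux among them), hence the fallback.

```python
import os


def available_cpus() -> int:
    """Number of CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Respects the affinity mask, e.g. one set by a job launcher.
        return len(os.sched_getaffinity(0))
    # Fallback: total CPUs in the machine (may overestimate).
    return os.cpu_count() or 1


def is_oversubscribed(num_workers: int, cpus=None) -> bool:
    """True if more worker processes are planned than CPUs are available."""
    cpus = available_cpus() if cpus is None else cpus
    return num_workers > cpus


if __name__ == "__main__":
    n = available_cpus()
    print(f"available CPUs: {n}")
    if is_oversubscribed(4, n):
        print("4 workers would share CPUs; consider launching via srun")
```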


daviteix avatar daviteix commented on May 26, 2024

Thanks, #292 fixed the issue.

