Comments (10)

mm-assistant commented on August 29, 2024

We recommend using English, or English and Chinese, for issues so that we can have a broader discussion.

from mmyolo.

hhaAndroid commented on August 29, 2024

@huoshuai-dot Could you try reducing the batch size?

huoshuai-dot commented on August 29, 2024

> @huoshuai-dot Could you try reducing the batch size?

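On a single GPU, bs=16 works fine. With multiple GPUs, however, GPU memory usage stays at around 500M while the job is stuck, even though single-GPU training uses far more memory than that. It looks like something in the data loading is causing the hang. Reducing the batch size does not fix the problem.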

huoshuai-dot commented on August 29, 2024

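@hhaAndroid One more observation: if I comment out the pretrained model (that is, I do not initialize the weights from the ImageNet pretrained model), training proceeds normally. Could the hang be related to loading the pretrained model?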

hhaAndroid commented on August 29, 2024

@huoshuai-dot I have not encountered this situation. Can you upload your training log?

hhaAndroid commented on August 29, 2024

@huoshuai-dot Is it possible that your PyTorch version is too new? Could you try switching to PyTorch 1.9?

huoshuai-dot commented on August 29, 2024

> @huoshuai-dot Is it possible that your PyTorch version is too new? Could you try switching to PyTorch 1.9?

My torch version is 1.12.1, which is indeed fairly new. I can set up the Python environment as described in the README and try again.

huoshuai-dot commented on August 29, 2024

> @huoshuai-dot Is it possible that your PyTorch version is too new? Could you try switching to PyTorch 1.9?

@hhaAndroid I switched to torch 1.10 as described in the README, but the problem is still there. What else could be causing it?

huoshuai-dot commented on August 29, 2024

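@hhaAndroid Hello. Yesterday I installed the Docker image and ran an example, and I hit the same problem. This time, after hanging for a long time, it reported the following error: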
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
  File "./tools/train.py", line 106, in <module>
    main()
  File "./tools/train.py", line 95, in main
    runner = Runner.from_cfg(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 458, in from_cfg
    cfg=cfg,
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 345, in __init__
    self.setup_env(env_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/runner/runner.py", line 644, in setup_env
    init_dist(self.launcher, **dist_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 56, in init_dist
    _init_dist_pytorch(backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmengine/dist/utils.py", line 86, in _init_dist_pytorch
    torch_dist.init_process_group(backend=backend, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1565) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
How should I go about troubleshooting this?
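
One detail in the error above may help narrow things down: the message reports world_size=2 but worker_count=6. As far as I can tell, in the PyTorch versions discussed here the store-based barrier waits until the number of workers registered under the key equals world_size, so a larger count suggests that more processes checked in on the same store than the launch expected. Leftover processes from an earlier run or a reused rendezvous port are possible explanations, but that is an assumption and is not confirmed anywhere in this thread. The sketch below only parses such a message and compares the two counters; `parse_barrier_timeout` and `diagnose` are hypothetical helper names, not part of PyTorch or MMEngine.

```python
import re

# Hypothetical helpers for illustration only (not a PyTorch or MMEngine API).
# They extract the counters from a store-based-barrier timeout message.
_BARRIER_RE = re.compile(
    r"rank: (?P<rank>\d+).*?"
    r"world_size=(?P<world_size>\d+), worker_count=(?P<worker_count>\d+)"
)


def parse_barrier_timeout(message):
    """Return rank, world_size and worker_count as ints, or None if absent."""
    match = _BARRIER_RE.search(message)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def diagnose(info):
    """Give a rough, non-authoritative hint based on the two counters."""
    if info["worker_count"] > info["world_size"]:
        return "more workers registered than expected (stale processes or a reused port?)"
    if info["worker_count"] < info["world_size"]:
        return "some ranks never reached the barrier"
    return "counts match; look elsewhere"


# The message is copied from the log above.
message = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 0, for key: store_based_barrier_key:1 "
    "(world_size=2, worker_count=6, timeout=0:30:00)"
)
info = parse_barrier_timeout(message)
print(info)
print(diagnose(info))
```

If the first branch fires, checking for leftover training processes and trying a different master port before relaunching would be a reasonable next step.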

hhaAndroid commented on August 29, 2024

NCCL driver issue, resolved
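
The thread does not record how the NCCL problem was found or fixed. For anyone hitting a similar hang, one common first step is to rerun with NCCL's own logging turned on. The sketch below builds an environment with `NCCL_DEBUG=INFO` and, optionally, `NCCL_P2P_DISABLE=1`; both are NCCL environment variables, while the helper name `nccl_debug_env` and the idea of passing the result to your own launch command are assumptions made for illustration.

```python
import os


def nccl_debug_env(base_env=None, disable_p2p=False):
    """Return a copy of the environment with NCCL diagnostics enabled.

    Hypothetical helper for illustration; it does not launch anything.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to log initialization and transport details.
    env["NCCL_DEBUG"] = "INFO"
    if disable_p2p:
        # Optionally turn off peer-to-peer transport to see whether the hang
        # is tied to direct GPU-to-GPU communication.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


env = nccl_debug_env({"PATH": "/usr/bin"}, disable_p2p=True)
print(env)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` together with whatever distributed launch command is already in use, so the original shell environment stays untouched.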
