
Comments (12)

Link-Li avatar Link-Li commented on June 22, 2024

I don't quite understand why you set gpu0_bsz = 3500. The gpu0_bsz parameter specifies how many samples to place on GPU 0, and it should normally be smaller than the batch size.
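
For illustration, here is a rough sketch of how such an unbalanced split could assign samples, assuming gpu0_bsz is the number of samples kept on GPU 0 and the remainder is divided evenly among the other GPUs (plain Python, not the repo's actual implementation):

```python
def split_sizes(batch_size, gpu0_bsz, n_gpus):
    """Hypothetical split: gpu0_bsz samples on GPU 0, the rest shared evenly."""
    rest = batch_size - gpu0_bsz
    per_gpu, extra = divmod(rest, n_gpus - 1)
    # Distribute any remainder one sample at a time to the later GPUs.
    return [gpu0_bsz] + [per_gpu + (1 if i < extra else 0) for i in range(n_gpus - 1)]

print(split_sizes(100, 50, 2))  # [50, 50]
```

Note that with two GPUs, batch_size = 100 and gpu0_bsz = 50 gives [50, 50], i.e. a perfectly even split; gpu0_bsz only relieves GPU 0 when it is set well below the even per-GPU share.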

from balanced-dataparallel.

ZorrowHu avatar ZorrowHu commented on June 22, 2024

I've now set:

batch_size = 100
gpu0_bsz = 50

But the GPU usage is still the same... roughly GPU0: 9.4/12GB, GPU1: 1.5/12GB

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81       Driver Version: 456.81       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp           WDDM  | 00000000:AF:00.0  On |                  N/A |
| 50%   81C    P2   253W / 250W |   9340MiB / 12288MiB |     75%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:D8:00.0 Off |                  N/A |
| 26%   49C    P2    61W / 250W |   1649MiB / 12288MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

At the end of the run, the code errors out like this:

len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
        Loss:   55605.012
start predicting:  2020-11-10 15:07:37.934568
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
Traceback (most recent call last):
  File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 104, in <module>    
    main()
  File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 80, in main
    hit, mrr = train_test(model, train_data, test_data)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 278, in train_test 
    targets, scores = forward(model, i, test_data)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 238, in forward    
    return targets, model.module.compute_scores(seq_hidden, mask)
  File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 108, in compute_scores
    scores = torch.sum(a * b, -1)  # b,n
RuntimeError: CUDA out of memory. Tried to allocate 1.61 GiB (GPU 0; 12.00 GiB total capacity; 7.49 GiB already allocated; 820.44 MiB free; 8.73 GiB reserved in total by PyTorch)

Link-Li avatar Link-Li commented on June 22, 2024

I don't know how large your model is. Try setting gpu0_bsz to 1 and batch size to 200.

ZorrowHu avatar ZorrowHu commented on June 22, 2024

With gpu0_bsz set to 1 and batch size set to 200, GPU0's usage is still very high and GPU1 is barely used.

Link-Li avatar Link-Li commented on June 22, 2024

Before running your Python code, prefix the python command with CUDA_VISIBLE_DEVICES=0,1
and try again. It seems that setting os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1' inside the code no longer selects the GPUs.
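
One detail worth checking: CUDA_VISIBLE_DEVICES is only honored if it is set before the CUDA runtime initializes, i.e. before the first CUDA call after `import torch`. A minimal sketch (the value is parsed as a comma-separated list, so the space in '0, 1' is also worth avoiding):

```python
import os

# Must run before CUDA is initialized (i.e. before the first CUDA call),
# otherwise the setting is silently ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # note: no space after the comma

devices = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(devices)  # ['0', '1']
```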

Link-Li avatar Link-Li commented on June 22, 2024

I'd also suggest running on a single GPU first to find the largest batch size one GPU can handle.

ZorrowHu avatar ZorrowHu commented on June 22, 2024

Do you mean entering the commands like this?

set CUDA_VISIBLE_DEVICES=0,1  # set the variable
python main.py                # then run the program

I tried it and the result is still the same...
Interestingly, when I swapped the order of the two GPUs in the code:

os.environ["CUDA_VISIBLE_DEVICES"] = '1, 0'

this time GPU1's usage was very high and GPU0's was low.
I've fiddled with this for a long time without solving it; I'll probably just have to reduce the batch size.

Link-Li avatar Link-Li commented on June 22, 2024

CUDA_VISIBLE_DEVICES=0,1 python main.py
My feeling is that your model is too large and the data itself is small. On a single GPU, how much memory does batch size = 1 use?

ZorrowHu avatar ZorrowHu commented on June 22, 2024

On a single GPU, batch size = 1 uses 0.9/12GB and batch size = 50 uses 5.9/12GB.

Link-Li avatar Link-Li commented on June 22, 2024

I'm not sure what's going on in your case; none of my runs have had this problem. You could try PyTorch's official multi-GPU DataParallel (DP).
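
For reference, a minimal sketch of the built-in nn.DataParallel (the layer sizes are made up, and it falls back to a single device when fewer than two GPUs are visible):

```python
import torch
import torch.nn as nn

# A toy model; nn.DataParallel replicates it on each listed GPU and
# splits the batch along dimension 0 across the replicas.
model = nn.Linear(32, 8)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
    x = torch.randn(16, 32).cuda()  # 16 samples -> 8 per GPU
else:
    x = torch.randn(16, 32)         # CPU fallback for the sketch

out = model(x)
print(out.shape)  # torch.Size([16, 8])
```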

ZorrowHu avatar ZorrowHu commented on June 22, 2024

Coming back to your reply after many days, I think your point that "the model is too large and the data itself is small" is right, since my dataset is only 15.8MB. From what I found online, it may be that the model uses too many nn.Linear() layers, which makes it huge.... So my situation probably isn't something data parallelism can solve, right?
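
As a sanity check on the "too many nn.Linear()" theory, the parameter memory of a Linear stack is easy to estimate: each fp32 parameter costs 4 bytes. A sketch with made-up dimensions:

```python
import torch.nn as nn

# Two hypothetical 4096x4096 Linear layers (dimensions are made up).
layers = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# Weights + biases, at 4 bytes per fp32 parameter.
n_params = sum(p.numel() for p in layers.parameters())
print(n_params, "parameters ->", n_params * 4 / 2**20, "MiB of fp32 weights")
```

Parameters are only part of the story: gradients, optimizer state, and activations typically multiply this several times over, which would be consistent with batch size = 1 already occupying 0.9 GB.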

Link-Li avatar Link-Li commented on June 22, 2024

If a single copy of the model doesn't even fit on one GPU, I'd suggest splitting the model and putting part of it on GPU0 and part on GPU1. But that will be quite inefficient. And is there much point in putting so many Linear layers in a model anyway?
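
A minimal sketch of that kind of model splitting (model parallelism), with hypothetical layer sizes, falling back to CPU when two GPUs are not available:

```python
import torch
import torch.nn as nn

# Pick two devices; fall back to CPU so the sketch runs anywhere.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

class SplitModel(nn.Module):
    """Each stage lives on its own device (hypothetical layer sizes)."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(64, 128).to(dev0)  # lives on GPU 0
        self.part2 = nn.Linear(128, 10).to(dev1)  # lives on GPU 1

    def forward(self, x):
        h = torch.relu(self.part1(x.to(dev0)))
        return self.part2(h.to(dev1))  # move activations between devices

model = SplitModel()
out = model(torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 10])
```

Unlike data parallelism, the two GPUs here hold different halves of the parameters instead of full replicas, at the cost of the devices mostly waiting on each other.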
