Comments (12)
I don't quite understand why you set gpu0_bsz = 3500. The gpu0_bsz parameter specifies how many samples to place on GPU 0, and it should normally be smaller than the batch size.
from balanced-dataparallel.
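As a rough illustration of how a gpu0_bsz-style split is meant to behave (a plain-Python sketch, not the actual BalancedDataParallel code, assuming gpu0_bsz works as described in the comment above): GPU 0 receives gpu0_bsz samples and the remaining GPUs share the rest.

```python
def split_batch(batch_size, gpu0_bsz, n_gpus):
    """Compute per-GPU chunk sizes: GPU 0 gets gpu0_bsz samples,
    the remaining GPUs share the rest as evenly as possible."""
    rest = batch_size - gpu0_bsz
    others = n_gpus - 1
    base, extra = divmod(rest, others)
    # the first `extra` of the remaining GPUs take one extra sample
    return [gpu0_bsz] + [base + (1 if i < extra else 0) for i in range(others)]

# With the settings discussed in this thread:
print(split_batch(100, 50, 2))  # -> [50, 50]
print(split_batch(200, 1, 2))   # -> [1, 199]
```

So with only 2 GPUs, batch_size = 100 and gpu0_bsz = 50 is already an even split; gpu0_bsz only shifts load away from GPU 0 when it is set well below batch_size / n_gpus.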
I have now set:
batch_size = 100
gpu0_bsz = 50
But GPU usage is the same as before... roughly GPU0: 9.4/12 GB, GPU1: 1.5/12 GB
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81 Driver Version: 456.81 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN Xp WDDM | 00000000:AF:00.0 On | N/A |
| 50% 81C P2 253W / 250W | 9340MiB / 12288MiB | 75% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp WDDM | 00000000:D8:00.0 Off | N/A |
| 26% 49C P2 61W / 250W | 1649MiB / 12288MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
At the end of the run the code fails with this error:
len(inputs): 2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
Loss: 55605.012
start predicting: 2020-11-10 15:07:37.934568
len(inputs): 2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs): 2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
Traceback (most recent call last):
File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 104, in <module>
main()
File "e:/pythonFile/chengzuo/TASGC/src/main.py", line 80, in main
hit, mrr = train_test(model, train_data, test_data)
File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 278, in train_test
targets, scores = forward(model, i, test_data)
File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 238, in forward
return targets, model.module.compute_scores(seq_hidden, mask)
File "e:\pythonFile\chengzuo\TASGC\src\model.py", line 108, in compute_scores
scores = torch.sum(a * b, -1) # b,n
RuntimeError: CUDA out of memory. Tried to allocate 1.61 GiB (GPU 0; 12.00 GiB total capacity; 7.49 GiB already allocated; 820.44 MiB free; 8.73 GiB reserved in total by PyTorch)
from balanced-dataparallel.
I'm not sure how large your model is. Try setting gpu0_bsz to 1 and the batch size to 200.
from balanced-dataparallel.
With gpu0_bsz set to 1 and batch size set to 200, GPU0 usage is still very high and GPU1 is barely used.
from balanced-dataparallel.
Before running your Python script, prepend CUDA_VISIBLE_DEVICES=0,1 to the python command
and try again. It seems that setting os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1' inside the code no longer selects the GPUs.
from balanced-dataparallel.
I'd also suggest running on a single GPU to see the largest batch size a single GPU can handle.
from balanced-dataparallel.
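The single-GPU probe suggested above can be sketched as a doubling-then-bisecting search. The training step is abstracted behind a try_step callable (a hypothetical helper: in a real script it would run one forward/backward pass at the given batch size and return False when a CUDA out-of-memory RuntimeError is caught); the lambda used below is only a stand-in for illustration.

```python
def max_batch_size(try_step, start=1, limit=4096):
    """Find the largest batch size for which try_step(bs) succeeds.
    try_step returns True on success and False on OOM."""
    # grow by doubling until the first failure (or the limit)
    bs = start
    while bs <= limit and try_step(bs):
        bs *= 2
    lo, hi = bs // 2, min(bs, limit)  # lo succeeded, hi failed
    # binary-search the success/failure boundary
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if try_step(mid):
            lo = mid
        else:
            hi = mid
    return lo

# stand-in trial: pretend anything above 72 samples runs out of memory
print(max_batch_size(lambda bs: bs <= 72))  # -> 72
```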
Do you mean entering the commands like this?
set CUDA_VISIBLE_DEVICES=0,1  # set the variable
python main.py  # then run the program
I tried it and the result is still the same......
Interestingly, I swapped the order of the two GPUs in the code:
os.environ["CUDA_VISIBLE_DEVICES"] = '1, 0'
and this time GPU1 usage was very high while GPU0 was low.
After fiddling with this for a long time I still couldn't solve it; I'll probably just have to reduce the batch size.
from balanced-dataparallel.
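The swap observed above is consistent with how CUDA_VISIBLE_DEVICES works: it renumbers the physical GPUs, so under '1, 0' the device PyTorch sees as cuda:0 is physical GPU 1. A small plain-Python sketch of that renumbering (no CUDA needed):

```python
def visible_to_physical(cuda_visible_devices, logical_index):
    """Map a logical CUDA device index (what torch sees as cuda:N)
    back to the physical GPU index, given CUDA_VISIBLE_DEVICES."""
    order = [int(x) for x in cuda_visible_devices.replace(" ", "").split(",")]
    return order[logical_index]

# Under '1, 0', cuda:0 is physical GPU 1 -- so a workload that always
# concentrates on cuda:0 now loads physical GPU 1, matching the observation.
print(visible_to_physical("1, 0", 0))  # -> 1
print(visible_to_physical("0,1", 0))   # -> 0
```

The fact that the heavy load simply follows whichever GPU is listed first suggests the imbalance is in where the computation runs, not in which physical card is used.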
CUDA_VISIBLE_DEVICES=0,1 python main.py
It feels like your model is too large and the data itself is too small. On a single GPU, how much memory does a batch size of 1 take?
from balanced-dataparallel.
On a single GPU, batch size = 1 uses 0.9/12 GB and batch size = 50 uses 5.9/12 GB.
from balanced-dataparallel.
I'm not sure what's going on in your case; I've never hit this problem with my setups. You could try the official multi-GPU DataParallel (DP).
from balanced-dataparallel.
Coming back to your reply after many days: I think your comment that "the model is too large and the data itself is too small" is probably right, since my dataset is only 15.8 MB. From what I found online, it may be because the model uses too many nn.Linear() layers, which makes the model very large.... So my situation probably isn't something data parallelism can solve, right?
from balanced-dataparallel.
If a single copy of the model doesn't even fit on one GPU, I'd suggest splitting the model: put part of it on GPU0 and part on GPU1. But that will be quite inefficient. Besides, is there much point in putting so many Linear layers in a model?
from balanced-dataparallel.
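If you do go down the model-splitting route, the first decision is where to cut. A torch-free sketch of choosing the cut point by parameter count (the per-layer sizes below are hypothetical; in PyTorch you would then move the first group of layers to cuda:0 and the second to cuda:1 with .to(), transferring the activation between devices inside forward):

```python
def split_by_params(layer_param_counts):
    """Choose a cut index so the two halves have roughly equal
    total parameter counts (a rough proxy for per-GPU memory use)."""
    total = sum(layer_param_counts)
    running, best_cut, best_diff = 0, 0, total
    for i, n in enumerate(layer_param_counts):
        running += n
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_diff, best_cut = diff, i + 1
    return best_cut  # layers [:cut] -> GPU 0, layers [cut:] -> GPU 1

# hypothetical per-layer parameter counts for a Linear-heavy model
sizes = [1000, 4000, 4000, 500, 500]
cut = split_by_params(sizes)
print(cut, sizes[:cut], sizes[cut:])  # -> 2 [1000, 4000] [4000, 500, 500]
```

Note that parameters are only part of the story: activation memory grows with batch size, which is why the OOM above appears during compute_scores rather than at model construction.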