link-li / balanced-dataparallel
An improved version of PyTorch's DataParallel that balances the memory usage of the first GPU.
Hello, I'd like to ask: after training in parallel with BalancedDataParallel, I want to run real-time inference with batch_size=1, but that call raises an error. Do you know how to fix it? Thanks!
File "/data/lw/2Dxiangao/data_parallel_my_v2.py", line 89, in scatter
bsz = inputs[0].size(self.dim)
AttributeError: 'list' object has no attribute 'size'
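A workaround that is often suggested for this kind of error (an assumption, not a fix from this repo): with batch_size == 1 there is nothing to scatter, so skip the wrapper entirely and call the network that DataParallel-style wrappers keep under `.module`. A sketch using plain `nn.DataParallel` as a stand-in for `BalancedDataParallel`:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
model = nn.DataParallel(net)   # stand-in for the BalancedDataParallel wrapper
model.eval()

x = torch.randn(1, 4)          # a single-sample batch
with torch.no_grad():
    out = model.module(x)      # call the wrapped net directly, bypassing scatter()
print(out.shape)               # torch.Size([1, 2])
```

This avoids `scatter()` altogether, which is where the traceback above originates.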
When gpu0_bsz=0, your code will always raise an error, while the official DataParallel will not.
Please see data_parallel_my.py#L69-L71:
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
if self.gpu0_bsz == 0:
    replicas = replicas[1:]
This leaves replicas one element shorter than needed.
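The off-by-one can be reproduced with plain lists (a sketch of the logic quoted above, not the repo's actual code): with gpu0_bsz == 0 the batch is scattered into len(inputs) non-empty chunks, yet trimming replicas[1:] afterwards leaves one replica too few. Replicating onto one extra device and skipping device 0 keeps the counts equal:

```python
def replicas_buggy(device_ids, n_inputs, gpu0_bsz):
    replicas = list(device_ids[:n_inputs])   # mirrors self.replicate(...)
    if gpu0_bsz == 0:
        replicas = replicas[1:]              # drops one replica too many
    return replicas

def replicas_fixed(device_ids, n_inputs, gpu0_bsz):
    if gpu0_bsz == 0:
        # replicate onto one extra device, then skip device 0 entirely
        return list(device_ids[1:n_inputs + 1])
    return list(device_ids[:n_inputs])

# 4 GPUs, gpu0_bsz == 0: the batch lands on GPUs 1-3, so 3 chunks.
print(replicas_buggy([0, 1, 2, 3], 3, 0))   # [1, 2] -- only 2 replicas
print(replicas_fixed([0, 1, 2, 3], 3, 0))   # [1, 2, 3] -- matches 3 chunks
```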
My code uses two models on four GPUs in total. The code is as follows:
dist_model = BalancedDataParallel(0, model, dim=0)
dist_Disc = BalancedDataParallel(0, model_Disc, dim=0)
I set the first GPU's batch size to zero to make maximum use of the other three GPUs, but the results are very poor. What could be the reason? Thanks!
With batch_size = 32 I get an error that seems related to the model input, but I can't figure out where it goes wrong.
model = BalancedDataParallel(8, model, dim=0).cuda()
Traceback (most recent call last):
File "train.py", line 580, in
main()
File "train.py", line 576, in main
train(model, logger, device, multi_gpu, pad_idx, train_dataset, validate_dataset, args)
File "train.py", line 423, in train
pad_idx=pad_idx, args=args)
File "train.py", line 290, in train_epoch
keyword_ids=keyword_ids)
File "/home/workspace/data_parallel.py", line 63, in forward
inputs, kwargs = self.scatter(inputs, kwargs, device_ids)
File "/home/workspace/data_parallel.py", line 88, in scatter
bsz = inputs[0].size(self.dim)
IndexError: tuple index out of range
torch.nn.modules.module.ModuleAttributeError: 'BalancedDataParallel' object has no attribute 'model'
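This ModuleAttributeError is the usual wrapper-attribute pitfall (a general PyTorch behavior, not specific to this repo): DataParallel-style wrappers store the wrapped network under `.module`, so the original model's attributes must be reached through it. A torch-free sketch of the lookup:

```python
class BalancedDataParallelSketch:
    """Stand-in wrapper: like torch.nn.DataParallel, the wrapped
    network is stored under the .module attribute."""
    def __init__(self, module):
        self.module = module

class MyNet:
    def __init__(self):
        self.model = "backbone"   # the attribute the traceback looks for

wrapped = BalancedDataParallelSketch(MyNet())
# wrapped.model raises AttributeError; reach it through .module instead:
print(wrapped.module.model)       # backbone
```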
bsz = inputs[0].size(self.dim)
IndexError: tuple index out of range
The original version was written as:
model = DataParallel(model, device_ids=[int(i) for i in args.device.split(',')])
Following this repo's description I wrote:
model = BalancedDataParallel(1, model, dim=0).cuda()
and it keeps raising errors. The documentation here is far too sparse; I don't know where to start debugging.
Hello, I want to train on a machine with 4 GPUs but have this program use only 2 of them. How do I set that up?
I have tried modifying self.device_ids, but it still trains on all GPUs. Could you help me with this?
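The usual way to restrict a PyTorch process to a subset of GPUs (a general CUDA mechanism, not specific to this repo) is the CUDA_VISIBLE_DEVICES environment variable, set before CUDA is initialized; the visible devices are then renumbered from 0:

```python
import os

# Must be set before anything initializes CUDA (i.e. before the first
# torch.cuda call). Only physical GPUs 0 and 1 remain visible, and the
# process sees them as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

print(os.environ["CUDA_VISIBLE_DEVICES"])  # 0,1
```

The same can be done from the shell: `CUDA_VISIBLE_DEVICES=0,1 python train.py`. Editing self.device_ids after construction may be too late, since CUDA has already enumerated all devices by then.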
Hi, when using DataParallel, have you ever seen training hang at the dataloader at some epoch?
Thanks for sharing! But I get an error at initialization; could you help me out? Many thanks! Here is my code:
net = buildmodel(args.netname)
if len(args.num_gpus) > 1:
    net = BalancedDataParallel(gpu0_bsz=args.maingpu_bs, net, dim=0)
    # net = torch.nn.DataParallel(net)
net.cuda()
print('init model')
The error:
net = BalancedDataParallel(gpu0_bsz=args.maingpu_bs, net, dim=0)
SyntaxError: positional argument follows keyword argument
How can I fix this? Many thanks.
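The SyntaxError is pure Python call syntax: a positional argument may not follow a keyword argument. Assuming the constructor's signature is (gpu0_bsz, module, dim=0), either an all-positional or an all-keyword call parses fine; a stand-in function shows the rule:

```python
def balanced_dp(gpu0_bsz, module, dim=0):
    """Stand-in with the assumed (gpu0_bsz, module, dim=0) signature."""
    return gpu0_bsz, module, dim

net = "my-net"
# SyntaxError: balanced_dp(gpu0_bsz=8, net, dim=0)
#   -> positional argument follows keyword argument
print(balanced_dp(8, net, dim=0))                  # all positional: OK
print(balanced_dp(gpu0_bsz=8, module=net, dim=0))  # all keyword: OK
```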
bsz: 158
num_dev: 6
gpu0_bsz: 1
bsz_unit: 31
chunk_sizes: [1, 32, 32, 31, 31, 31]
len(inputs): 6
self.device_ids[:len(inputs)] [0, 1, 2, 3, 4, 5]
replicas: 6
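The printed chunk_sizes can be reproduced in a few lines (a sketch inferred from the numbers above, not the repo's exact code): gpu0_bsz samples go to the first device, the rest is divided evenly among the remaining devices, and any leftover is handed out one by one to the first few of them:

```python
def chunk_sizes(bsz, num_dev, gpu0_bsz):
    rest = bsz - gpu0_bsz                         # 158 - 1 = 157
    unit = rest // (num_dev - 1)                  # 157 // 5 = 31 (bsz_unit)
    sizes = [gpu0_bsz] + [unit] * (num_dev - 1)
    for i in range(rest - unit * (num_dev - 1)):  # 2 leftover samples
        sizes[1 + i] += 1
    return sizes

print(chunk_sizes(158, 6, 1))   # [1, 32, 32, 31, 31, 31]
```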
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 13347 C python 10505MiB |
| 1 13347 C python 4991MiB |
| 4 13347 C python 4991MiB |
| 5 13347 C python 4925MiB |
| 6 13347 C python 4925MiB |
| 7 13347 C python 4925MiB |
+-----------------------------------------------------------------------------+
A single GPU can only fit a batch of 1, so I set gpu0_bsz to 0 with a total batch of 1, but then the program fails to run.
Running on one GPU, bsz=1: GPU memory usage 11750/12196.
Running on two GPUs, bsz=1, gpu0_bsz=0: memory usage gpu0: 12179, gpu1: 5245.
I am trying to run the model with BalancedDataParallel, and I initialized my network with:
tt = TT(args, device).to(device)
if (not args.cpu) and (args.num_gpu > 1):
    tt = BalancedDataParallel(1, tt, dim=0)
But I keep getting this error:
data_parallel_my.py", line 79, in scatter
    bsz = inputs[0].size(self.dim)
IndexError: tuple index out of range
I printed inputs out and found it to be empty. My code works with plain DataParallel, so I am not sure what went wrong. Here is where I call the network:
srhd, S, T_lv3, T_lv2, T_lv1 = tt(lr=sr, ref=ref)
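A likely cause (my reading of the traceback, not a confirmed fix): the call `tt(lr=sr, ref=ref)` passes everything by keyword, so the positional `inputs` tuple that `scatter` indexes with `inputs[0]` is empty. Passing at least the batch tensor positionally, e.g. `tt(sr, ref=ref)`, avoids the IndexError. The mechanics in plain Python:

```python
def forward(*inputs, **kwargs):
    # scatter() reads the batch size from inputs[0]; with keyword-only
    # calls this tuple is empty, so inputs[0] raises IndexError.
    return len(inputs), sorted(kwargs)

print(forward(lr="sr", ref="ref"))   # (0, ['lr', 'ref'])
print(forward("sr", ref="ref"))      # (1, ['ref'])
```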
Thanks for sharing the code! I had long been blocked by insufficient GPU memory and could not run experiments on large datasets. After reading about DataParallel I found your Zhihu post, but my implementation still has not solved the out-of-memory problem. The program itself should be correct, since it produces correct results on a small dataset.
I integrated your code into my project, similar to your documentation:
os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1'  # two GPUs, numbered 0 and 1
batch_size = 100
gpu0_bsz = 3500
acc_grad = 1
model = BalancedDataParallel(gpu0_bsz // acc_grad, SessionGraph(opt, n_node), dim=0)  # SessionGraph is my actual model
model = model.cuda()
I set batch_size and gpu0_bsz this way because, when running on a single GPU with batchSize 100 on 7500 samples, memory usage was 9.4/12 GB and training completed; but the subsequent prediction on the test data then raised an out-of-memory error.
While running with BalancedDataParallel I checked the memory usage of both GPUs, and GPU 1 seems barely used:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81 Driver Version: 456.81 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN Xp WDDM | 00000000:AF:00.0 On | N/A |
| 51% 81C P2 103W / 250W | 9294MiB / 12288MiB | 44% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp WDDM | 00000000:D8:00.0 Off | N/A |
| 26% 48C P2 61W / 250W | 1612MiB / 12288MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The model SessionGraph I use looks roughly like this:
class SessionGraph(Module):
    ...
    self.loss_function = nn.CrossEntropyLoss()
    self.optimizer = torch.optim.Adam(self.parameters(), lr=opt.lr, weight_decay=opt.l2)
    self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=opt.lr_dc_step, gamma=opt.lr_dc)
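One pattern worth trying here (a common DataParallel memory trick, an assumption rather than a diagnosis of this model): have forward() return the loss instead of the full logits, so that only per-replica scalars, not the large output tensors, are gathered on GPU 0 during training. A minimal sketch:

```python
import torch
import torch.nn as nn

class NetWithLoss(nn.Module):
    """Wraps a network so that forward() returns the loss directly;
    under DataParallel only the scalar losses are gathered on GPU 0."""
    def __init__(self, net):
        super().__init__()
        self.net = net
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x, target):
        logits = self.net(x)                 # stays on each replica's GPU
        return self.loss_fn(logits, target)  # scalar per replica

net = NetWithLoss(nn.Linear(4, 3))
loss = net(torch.randn(2, 4), torch.tensor([0, 2]))
print(loss.dim())                            # 0 -- a scalar
```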
Is my mistake somewhere else, for example in the training or prediction loop?
The data printed while the program runs is basically always the same, as shown:
...
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs): 2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs): 2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs): 2
...