
balanced-dataparallel's People

Contributors: link-li

balanced-dataparallel's Issues

About batch_size = 1

Hello, after training in parallel with BalancedDataParallel I want to run real-time inference with batch_size = 1, but the function raises an error. Do you know how to solve this? Thanks!

Error:

File "/data/lw/2Dxiangao/data_parallel_my_v2.py", line 89, in scatter
    bsz = inputs[0].size(self.dim)
AttributeError: 'list' object has no attribute 'size'
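For batch_size = 1 inference, one workaround is to skip the parallel wrapper entirely, since there is nothing to scatter. A minimal sketch, assuming the wrapper keeps the standard .module attribute it inherits from torch.nn.DataParallel (plain DataParallel stands in for BalancedDataParallel here):

```python
import torch
import torch.nn as nn

# Toy stand-in; in practice `net` is your trained model and `wrapped`
# would be BalancedDataParallel(gpu0_bsz, net, dim=0).cuda()
net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)

# For batch_size = 1 inference, bypass the scatter step and call the
# underlying module directly via the .module attribute:
wrapped.module.eval()
with torch.no_grad():
    out = wrapped.module(torch.randn(1, 4))
print(out.shape)  # torch.Size([1, 2])
```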

Your code errors when gpu0_bsz = 0

When gpu0_bsz = 0, your code will always raise an error, while the official DataParallel will not.
Please see data_parallel_my.py#L69-L71:

        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        if self.gpu0_bsz == 0:
            replicas = replicas[1:]

this leaves replicas one element shorter than we need.
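The off-by-one can be shown in miniature with plain lists (assuming, as the report implies, that with gpu0_bsz == 0 len(inputs) already counts only the chunks for the non-zero GPUs):

```python
# Device ids stand in for replicas.
device_ids = [0, 1, 2, 3]
inputs = ["chunk_a", "chunk_b", "chunk_c"]  # no chunk for GPU 0

# As written in data_parallel_my.py: replicate over len(inputs)
# devices, then drop GPU 0's replica -> one replica too few.
buggy = device_ids[:len(inputs)][1:]        # [1, 2]

# Possible fix: replicate over one extra device before dropping
# GPU 0's copy, so the count matches len(inputs).
fixed = device_ids[:len(inputs) + 1][1:]    # [1, 2, 3]
print(len(buggy), len(fixed))  # 2 3
```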

Hello, with this code I find the evaluation results are very poor.

My code uses two models and four GPUs in total. The code is:
dist_model = BalancedDataParallel(0, model, dim=0)
dist_Disc = BalancedDataParallel(0, model_Disc, dim=0)
I set the first GPU's batch size to zero in order to make maximal use of the other three GPUs, but the results are very poor. What could be the reason? Thanks!

IndexError: tuple index out of range

With batch_size = 32 an error is raised. It seems related to the model inputs, but I can't figure out where it goes wrong.
model = BalancedDataParallel(8, model, dim=0).cuda()

Traceback (most recent call last):
  File "train.py", line 580, in <module>
    main()
  File "train.py", line 576, in main
    train(model, logger, device, multi_gpu, pad_idx, train_dataset, validate_dataset, args)
  File "train.py", line 423, in train
    pad_idx=pad_idx, args=args)
  File "train.py", line 290, in train_epoch
    keyword_ids=keyword_ids)
  File "/home/workspace/data_parallel.py", line 63, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, device_ids)
  File "/home/workspace/data_parallel.py", line 88, in scatter
    bsz = inputs[0].size(self.dim)
IndexError: tuple index out of range
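For what it's worth, this IndexError suggests every tensor reached forward as a keyword argument, so the positional inputs tuple is empty when scatter reads inputs[0].size(self.dim). A hedged sketch of a guard (a hypothetical helper, not code from this repo):

```python
def batch_size_of(inputs, kwargs, dim=0):
    """Infer the batch size when the model may be called with
    keyword arguments only (then `inputs` is an empty tuple)."""
    if inputs:
        return inputs[0].size(dim)
    for value in kwargs.values():      # fall back to any tensor kwarg
        if hasattr(value, "size"):
            return value.size(dim)
    raise ValueError("no tensor argument found to infer batch size")
```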

How to make BalancedDataParallel use specific GPUs

Hello, I want to train on a machine with 4 GPUs but have this program run on only 2 of them. How do I set that up?
I have already tried modifying self.device_ids, but it still trains on all GPUs. Could you help me solve this?
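One approach that works regardless of the wrapper is to hide the unwanted cards from the process entirely:

```python
import os

# Must be set before torch initializes CUDA (i.e. before any .cuda()
# call, ideally before importing torch): the process then sees only
# cards 0 and 1, renumbered internally as cuda:0 and cuda:1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Alternatively, since BalancedDataParallel forwards extra keyword
# arguments to torch.nn.DataParallel, passing device_ids explicitly
# should also work (untested assumption):
# model = BalancedDataParallel(gpu0_bsz, net, dim=0, device_ids=[0, 1])
```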

parallel_apply error with PyTorch 1.3.0 + py36 + 3090 + CUDA 11.2

Hello, in a single-node 4-GPU training job I used the V2 version of your parallel code. The program errors when it reaches "return parallel_apply(replicas, inputs, kwargs, device_ids[:len(inputs)])".

[screenshot of the error]
Immediately afterwards CUDA raises another error:
[screenshot of the error]
What is the first error about?

How should the arguments be passed?

Thanks for sharing! I get an error at initialization; could you take a look? Many thanks! Here is my code:

net = buildmodel(args.netname)

if len(args.num_gpus) > 1:
    net = BalancedDataParallel(gpu0_bsz=args.maingpu_bs, net, dim=0)
    # net = torch.nn.DataParallel(net)
net.cuda()
print('init model')

The error:

net = BalancedDataParallel(gpu0_bsz=args.maingpu_bs, net, dim=0)

SyntaxError: positional argument follows keyword argument

How can I fix this? Many thanks.
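This error comes from Python's own call syntax, independent of this repo: once one argument is passed by keyword, no later argument may be positional. A minimal sketch with a stand-in signature (the real BalancedDataParallel takes its arguments in the same order):

```python
# Stand-in with the same argument order as the wrapper.
def balanced_data_parallel(gpu0_bsz, module, dim=0):
    return (gpu0_bsz, module, dim)

# bad:  balanced_data_parallel(gpu0_bsz=8, net, dim=0)  -> SyntaxError
# good: pass the module positionally too
print(balanced_data_parallel(8, "net", dim=0))  # (8, 'net', 0)
```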

The code works fine; it only slightly alleviates the imbalance, which still remains.

bsz: 158
num_dev: 6
gpu0_bsz: 1
bsz_unit: 31
chunk_sizes: [1, 32, 32, 31, 31, 31]
len(inputs): 6
self.device_ids[:len(inputs)] [0, 1, 2, 3, 4, 5]
replicas: 6
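For reference, the logged chunk sizes can be reproduced with this splitting rule (a sketch inferred from the printed values, not copied from the repo): the non-GPU-0 part of the batch is split evenly, then the leftover samples are handed out one at a time:

```python
def balanced_chunk_sizes(bsz, num_dev, gpu0_bsz):
    # even share for each of the other devices
    bsz_unit = (bsz - gpu0_bsz) // (num_dev - 1)
    chunk_sizes = [gpu0_bsz] + [bsz_unit] * (num_dev - 1)
    # distribute the remainder one sample at a time
    for i in range(bsz - sum(chunk_sizes)):
        chunk_sizes[i + 1] += 1
    return chunk_sizes

print(balanced_chunk_sizes(158, 6, 1))  # [1, 32, 32, 31, 31, 31]
```

Since the split itself matches the log, the remaining gap on GPU 0 (10505 MiB vs ~4991 MiB below) likely comes from data that lives on the first device regardless of the split, such as gathered outputs and optimizer state.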

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     13347    C   python                                     10505MiB   |
|    1     13347    C   python                                      4991MiB   |
|    4     13347    C   python                                      4991MiB   |
|    5     13347    C   python                                      4925MiB   |
|    6     13347    C   python                                      4925MiB   |
|    7     13347    C   python                                      4925MiB   |
+-----------------------------------------------------------------------------+
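A common way to shrink GPU 0's share further, sketched here under the assumption that much of the gap comes from full-size outputs being gathered back to the first device: compute the loss inside forward, so only per-replica scalars travel back to GPU 0.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Compute the loss inside forward so each replica returns only a
    small (1,)-shaped loss; the gather step then moves those to GPU 0
    instead of the full logits tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, target):
        logits = self.model(x)          # stays on the replica's device
        # unsqueeze so the loss can be gathered along dim 0
        return self.criterion(logits, target).unsqueeze(0)

# Hypothetical usage with the wrapper from this repo:
# model = BalancedDataParallel(gpu0_bsz, ModelWithLoss(net), dim=0).cuda()
# loss = model(inputs, targets).mean()
```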

Still unbalanced after use

A single GPU can only fit a batch of 1, so I set gpu0_bsz to 0, with a total batch of 1. But then it no longer runs.
Running on one GPU, bs = 1: GPU memory usage 11750/12196.
Running on two GPUs, bs = 1, gpu0_bsz = 0: gpu0 12179, gpu1 5245.

inputs is empty

I am trying to run the model with BalancedDataParallel, and I initialized my network with:

tt = TT(args, device).to(device)
if (not args.cpu) and (args.num_gpu > 1):
    tt = BalancedDataParallel(1, tt, dim=0)

But I keep getting this error:

File "data_parallel_my.py", line 79, in scatter
    bsz = inputs[0].size(self.dim)
IndexError: tuple index out of range

I printed inputs out and found it to be empty. My code works with DataParallel itself, so I am not sure what went wrong. Here is where I try to call the network:

srhd, S, T_lv3, T_lv2, T_lv1 = tt(lr=sr, ref=ref)
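A minimal reproduction of why keyword-only calls break here (assuming the scatter reads the batch size from the positional argument tuple, as the traceback suggests). Calling the network positionally, e.g. tt(sr, ref) instead of tt(lr=sr, ref=ref), may avoid it:

```python
def forward_like(*inputs, **kwargs):
    # Mimics the wrapper's scatter: the batch size is read from the
    # first positional input, so keyword-only calls leave `inputs`
    # as an empty tuple and inputs[0] raises IndexError.
    return inputs[0], len(inputs)

# tt(lr=sr, ref=ref)  -> inputs == ()        -> IndexError
# tt(sr, ref)         -> inputs == (sr, ref) -> works
first, n = forward_like("sr", "ref")
print(first, n)  # sr 2
```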

The code runs correctly, but the out-of-memory problem is still unsolved

Thanks for sharing your code! I had long been unable to run experiments on large datasets because of insufficient GPU memory. After learning about DataParallel I found your Zhihu post, but my implementation still has not solved the memory problem. The program itself seems correct: it runs to completion with correct results on a small dataset.
I integrated your code into my own project, similar to the example in your write-up:

os.environ["CUDA_VISIBLE_DEVICES"] = '0, 1'      # two GPUs, numbered 0 and 1
batch_size = 100
gpu0_bsz = 3500
acc_grad = 1
model = BalancedDataParallel(gpu0_bsz // acc_grad, SessionGraph(opt, n_node), dim=0)   # SessionGraph is my actual model
model = model.cuda()

I chose batch_size and gpu0_bsz this way because, when running on a single GPU earlier, 7500 samples with batch size 100 used 9.4/12 GB and the model could finish training, but the prediction on the test data immediately afterwards ran out of memory.
While running with BalancedDataParallel I checked the memory usage of both GPUs, and it seems GPU 1 is barely used at all:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.81       Driver Version: 456.81       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp           WDDM  | 00000000:AF:00.0  On |                  N/A |
| 51%   81C    P2   103W / 250W |   9294MiB / 12288MiB |     44%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:D8:00.0 Off |                  N/A |
| 26%   48C    P2    61W / 250W |   1612MiB / 12288MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+


The SessionGraph model I use looks roughly like this:

class SessionGraph(Module):
        ...
        self.loss_function = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.Adam(self.parameters(), lr=opt.lr, weight_decay=opt.l2)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=opt.lr_dc_step, gamma=opt.lr_dc)

Could my mistake be somewhere else, for example in the training or prediction loop?
The values printed while the program runs are essentially always the same, as shown below:

...
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
self.device_ids[:len(inputs)] [0, 1]
replicas: 2
len(inputs):  2
...
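One plausible reading of the idle GPU 1 (a hypothetical interpretation, not verified against the repo): gpu0_bsz is a per-batch sample count, not a memory figure, and 3500 // 1 far exceeds the batch size of 100, so the GPU-0 chunk swallows the whole batch and nothing is left to scatter to GPU 1:

```python
batch_size = 100
gpu0_bsz = 3500
acc_grad = 1

# The chunk destined for GPU 0 is capped at the whole batch.
gpu0_chunk = min(batch_size, gpu0_bsz // acc_grad)
rest = batch_size - gpu0_chunk
print(gpu0_chunk, rest)  # 100 0  -> nothing left for GPU 1
```

Under that reading, gpu0_bsz should be set to the number of samples (out of each batch of 100) that the first GPU is meant to receive, e.g. something below 50.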
