How long did training take with an SSD? (mcan-vqa issue, closed)

milvlg commented on June 19, 2024

How long did training take for you with an SSD?

from mcan-vqa.

Comments (11)

MIL-VLG commented on June 19, 2024

Which dataset split do you use for training: train, train+val, or train+val+vg? And which model (small or large) do you use?

orchardchang commented on June 19, 2024

I am training on train+val+vg with the large model.

MIL-VLG commented on June 19, 2024

I don't think I/O is the bottleneck for the large model. You can check the GPU usage: if it stays at 100%, most of the time is being spent on model training rather than on data loading.

On our workstation, one epoch takes about 2 hours for the large model on one Titan V GPU.
P.S. Using two GPUs gave little speed-up in our experiments.

We have also tried loading all files into memory. The speed is nearly the same as reading from an SSD if you use more than 8 workers.
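Besides watching GPU usage, the same question (is time going to data loading or to training?) can be checked from inside the loop. The following is a minimal sketch, not code from mcan-vqa: `split_loop_time`, `batches`, and `train_step` are hypothetical names, and it assumes that `train_step` blocks until the step has finished. With CUDA, kernels launch asynchronously, so a real `train_step` would need to call `torch.cuda.synchronize()` before returning for the split to be meaningful.

```python
import time
from typing import Callable, Iterable, Tuple


def split_loop_time(
    batches: Iterable,
    train_step: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # blocks while the loader fetches the next batch
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)  # assumed to return only when the step is done
        t2 = clock()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time
```

If the first number is a large share of the total, the run is I/O-bound; if it is close to zero, adding workers or faster storage is unlikely to help.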

orchardchang commented on June 19, 2024

GPU usage stays at about 96% when I start only one run.py on a single 2080 Ti GPU.

However, if I start two run.py processes simultaneously on two separate GPUs on my workstation, GPU usage is no longer steady at 96% and sometimes drops to 0%. I checked the disk, and the read throughput is about 100 MB/s. I think in this case I/O limits the training speed, because both programs need to load files.

So I increased the number of CPUs and workers: each run.py is set to cpu=10, num_worker=15. I don't know whether this helps. Should I keep increasing the CPUs or workers?

Thank you!

MIL-VLG commented on June 19, 2024

If cpu=10 means changing torch.set_num_threads(2) to 10, I recommend not modifying this parameter. It does not affect data-loading speed (which is why we do not provide an interface to change it). The only relevant parameter is num_worker (NW), and we find that 8 is the optimal value in our environment.
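Since the best worker count depends on the environment, one way to find it is to measure read throughput at a few settings. The sketch below is only an approximation under stated assumptions: it uses a thread pool as a stand-in for data-loader worker processes, and `read_bytes` and `sweep_workers` are hypothetical helpers, not part of mcan-vqa.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, List


def read_bytes(path: Path) -> int:
    """Read one file fully and return its size in bytes."""
    return len(path.read_bytes())


def sweep_workers(paths: List[Path], worker_counts: Iterable[int]) -> Dict[int, float]:
    """Return the measured read throughput in MB/s for each worker count."""
    results: Dict[int, float] = {}
    for n in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            total = sum(pool.map(read_bytes, paths))
        elapsed = time.perf_counter() - start
        # Guard against a zero interval on very small inputs.
        results[n] = total / 1e6 / max(elapsed, 1e-9)
    return results
```

Note that the operating system caches files after the first read, so each worker count should be tested on a different sample of feature files (or on a sample larger than RAM); otherwise later runs look faster than the disk really is.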

Since you are using an HDD, I think the upper bound on read speed is about 100 MB/s, which is nearly saturated by training a single model. That is why your second run stalls. With an SSD, it is possible to train 4 independent large models at the same time without slowing any of them down.
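The bandwidth argument here is simple arithmetic, sketched below. The function name is hypothetical, the per-run demand of roughly 100 MB/s is read off the figures in this thread, and the 500 MB/s SSD figure is an illustrative assumption rather than a measurement.

```python
def max_concurrent_runs(drive_mb_per_s: float, per_run_mb_per_s: float) -> int:
    """How many training runs a drive can feed at full speed."""
    if per_run_mb_per_s <= 0:
        raise ValueError("per-run read rate must be positive")
    return int(drive_mb_per_s // per_run_mb_per_s)


# HDD from this thread: about 100 MB/s, one run needs about the same.
hdd_runs = max_concurrent_runs(100, 100)

# Assumed SSD sustaining 500 MB/s (illustrative figure only).
ssd_runs = max_concurrent_runs(500, 100)
```

Under these numbers the HDD feeds a single run, while the assumed SSD has headroom for the 4 simultaneous runs mentioned above.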

Loading all features into memory can really speed things up if you are using an HDD. However, memory usage needs to be considered carefully: each model takes about 40 GB of memory during training on the train+val+vg split. That is why we do not use this strategy.
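The memory constraint can be budgeted the same way. This is a rough sketch: the 40 GB per-run figure comes from this thread, while `runs_that_fit` and the 16 GB reserve for the OS and other processes are assumptions made for illustration.

```python
def runs_that_fit(total_gb: float, per_run_gb: float = 40.0, reserve_gb: float = 16.0) -> int:
    """Number of in-memory training runs that fit in RAM after a safety reserve."""
    if per_run_gb <= 0:
        raise ValueError("per-run memory must be positive")
    usable = max(total_gb - reserve_gb, 0.0)
    return int(usable // per_run_gb)


# A 64 GB machine fits one in-memory run; a 256 GB machine fits several.
small_box = runs_that_fit(64)
large_box = runs_that_fit(256)
```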

MIL-VLG commented on June 19, 2024

To summarize, using an SSD is the most economical strategy to obtain high I/O speed if you want to train multiple models at the same time.

orchardchang commented on June 19, 2024

Thank you for your kind help and suggestions!

I have 256 GB of memory on my workstation, and I will try to use it for acceleration. Could you provide tips on memory usage, or code to load the files into memory?

By the way, your team is the VQA Challenge 2019 champion, right?

MIL-VLG commented on June 19, 2024

We will soon provide such a memory-based option, which loads all the npz files into memory at the model initialization stage (this may take some time if they are loaded from an HDD).
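Until that option is available, the idea can be sketched roughly as follows. This is not the mcan-vqa implementation: `preload_npz` is a hypothetical helper, and it assumes each npz file stores its feature array under the key `x`, which should be checked against the actual files.

```python
import glob
import os
from typing import Dict

import numpy as np


def preload_npz(feature_dir: str, key: str = "x") -> Dict[str, np.ndarray]:
    """Load one array from every .npz file in feature_dir into a dict."""
    cache: Dict[str, np.ndarray] = {}
    for path in sorted(glob.glob(os.path.join(feature_dir, "*.npz"))):
        name = os.path.splitext(os.path.basename(path))[0]
        with np.load(path) as data:
            # Indexing reads the array into memory; it stays valid after close.
            cache[name] = data[key]
    return cache
```

A dataset class could build this dict once in its constructor and look features up by file name at access time instead of calling `np.load` per sample. With multiple loader workers on platforms that spawn rather than fork processes, the cache may be copied per worker, so memory use should be measured before scaling up.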

Yes, we are the winner of VQA Challenge 2019 : )

orchardchang commented on June 19, 2024

Thank you very much! I'm looking forward to your update.

Congratulations!

MIL-VLG commented on June 19, 2024

This feature has been added. Please see our latest merge.

orchardchang commented on June 19, 2024

I see the update. Thanks a lot.
