
Comments (11)

MIL-VLG avatar MIL-VLG commented on June 19, 2024

Which dataset did you use for training: train, train+val, or train+val+vg? And which model (small or large) did you use?

from mcan-vqa.

orchardchang avatar orchardchang commented on June 19, 2024

I'm training on train+val+vg with the large model.


MIL-VLG avatar MIL-VLG commented on June 19, 2024

I don't think I/O is the bottleneck for the large model. You can check the GPU usage: if it stays at 100%, most of the time is being spent on model training rather than data loading.

On our workstation, one epoch takes about 2 hours for the large model on one TitanV GPU.
P.S. Using two GPUs gave little speedup in our experiments.

We have tried loading all files into memory. On an SSD, the speed is nearly the same if you use more than 8 workers.
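
As a complement to watching GPU usage, here is a minimal sketch for measuring how much of each training step is spent waiting for data. The names `data_wait_fraction`, `batches`, and `train_step` are hypothetical and not part of mcan-vqa. It assumes `train_step` blocks until the step is finished; with CUDA, kernels are launched asynchronously, so the step would need to synchronize before returning for the numbers to be meaningful.

```python
import time


def data_wait_fraction(batches, train_step):
    """Return the fraction of wall-clock time spent waiting for batches.

    `batches` is any iterable (for example a data loader) and `train_step`
    is a callable that consumes one batch. Both are hypothetical stand-ins.
    """
    wait = 0.0     # seconds spent fetching the next batch
    compute = 0.0  # seconds spent inside train_step
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total > 0 else 0.0
```

A value close to 1 suggests the run is limited by data loading, while a value close to 0 suggests it is limited by the model itself.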


orchardchang avatar orchardchang commented on June 19, 2024

GPU usage stays at about 96% when I start only one run.py on a single 2080 Ti GPU.

However, if I start two run.py processes simultaneously on two separate GPUs on my workstation, GPU usage is not always 96% and sometimes drops to 0%. I checked, and the read I/O is about 100 MB/s. I think I/O limits the training speed in this case, because both programs need to load files.

So I increased the number of CPUs and workers: each run.py is set to cpu=10, num_worker=15. I don't know whether this helps with acceleration. Should I keep increasing the CPUs or workers?

Thank you!


MIL-VLG avatar MIL-VLG commented on June 19, 2024

If cpu=10 means changing torch.set_num_threads(2) to 10, I recommend not modifying this parameter. It does not contribute to data loading speed (that's why we don't provide an interface to modify it). The only related parameter is num_worker (NW), and we find that 8 is the optimal value in our environment.

Since you are using an HDD, I think the upper bound on read speed is about 100 MB/s, which is nearly saturated by training one model. That is why your second model stalls. With an SSD, it is possible to train 4 independent large models at the same time with the same training speed for each.

Loading all features into memory can really speed things up if you are using an HDD. However, memory usage must be considered carefully: each model takes about 40 GB of memory during training on the train+val+vg data split. That's the reason we do not use this strategy.
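
To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. The 40 GB per model figure is the one quoted above; the function name `models_that_fit` and the `reserved_gb` headroom for the OS and other processes are assumptions for illustration only.

```python
def models_that_fit(total_ram_gb, per_model_gb=40.0, reserved_gb=16.0):
    """Estimate how many in-memory training runs fit in RAM at once.

    per_model_gb: about 40 GB per model on train+val+vg (figure from the thread).
    reserved_gb: hypothetical headroom left for the OS and other processes.
    """
    usable = total_ram_gb - reserved_gb
    if usable <= 0:
        return 0
    return int(usable // per_model_gb)


if __name__ == "__main__":
    # Example machine sizes, chosen only for illustration.
    for ram in (32, 64, 256):
        print(ram, "GB ->", models_that_fit(ram), "model(s)")
```

This is only an estimate; actual usage depends on the data split and on how the loader shares memory between worker processes.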


MIL-VLG avatar MIL-VLG commented on June 19, 2024

To summarize, using an SSD is the most economical strategy to obtain high I/O speed if you want to train multiple models at the same time.


orchardchang avatar orchardchang commented on June 19, 2024

Thank you for your kind help and suggestions!

I have 256 GB of memory on my workstation, and I will try to use memory for acceleration. Could you provide tips about memory usage, or code to load the files into memory?

By the way, your team is the VQA challenge 2019 champion, right?


MIL-VLG avatar MIL-VLG commented on June 19, 2024

We will provide such a memory-based option soon, which loads all the npz files into memory at the model initialization stage (this may take some time if they are loaded from an HDD).
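
Until then, the general idea can be sketched as follows. This is not the code that was merged into mcan-vqa; `preload_npz`, the `feature_dir` layout, and the array key `"x"` are assumptions for illustration. It reads every npz file once up front and keeps the arrays in a dictionary, so later lookups never touch the disk.

```python
import glob
import os

import numpy as np


def preload_npz(feature_dir, key="x"):
    """Load one array from every .npz file in feature_dir into memory.

    Returns a dict mapping the file name (without extension) to the array.
    The key "x" is an assumed array name; adjust it to match your files.
    """
    cache = {}
    for path in sorted(glob.glob(os.path.join(feature_dir, "*.npz"))):
        name = os.path.splitext(os.path.basename(path))[0]
        # np.load is lazy for .npz archives; indexing reads the array fully.
        with np.load(path) as data:
            cache[name] = data[key]
    return cache
```

A dataset's item lookup could then read `cache[name]` instead of calling `np.load` on every access, at the cost of a slow start-up and the memory footprint discussed earlier.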

Yes, we are the winner of VQA Challenge 2019 : )


orchardchang avatar orchardchang commented on June 19, 2024

Thank you very much! I'm looking forward to your update.

Congratulations!


MIL-VLG avatar MIL-VLG commented on June 19, 2024

This function has been added. Please see our latest merge.


orchardchang avatar orchardchang commented on June 19, 2024

I see the update. Thanks a lot.

