Comments (11)
Which dataset did you use for training: train, train+val, or train+val+vg? And which model did you use, small or large?
from mcan-vqa.
Training was on train+val+vg, and the model is large.
I think the bottleneck for the large model is not I/O. You can check the GPU usage: if it is always at 100%, the majority of the time is being spent on model training rather than data loading.

On our workstation, one epoch takes about 2 hours for the large model on a single Titan V GPU.

P.S. Using two GPUs helped little with acceleration in our experiments. We have also tried loading all files into memory; on an SSD drive the speed is nearly the same if you use more than 8 workers.
GPU usage is always about 96% when I start only one run.py on a single 2080 Ti GPU. However, if I simultaneously start two run.py processes on two separate GPUs on my workstation, GPU usage is no longer steady at 96% and sometimes drops to 0%. I checked, and read I/O is about 100 MB/s. I think in this case I/O limits the training speed, because both programs need to load files.

So I increased the number of CPUs and workers: one run.py is set to cpu=10, num_worker=15. I don't know whether this helps with acceleration. Should I keep increasing CPUs or workers?

Thank you!
If cpu=10 means changing torch.set_num_threads(2) to 10, I recommend not modifying this parameter. It does not contribute to data-loading speed (which is why we don't provide an interface for it). The only related parameter is num_worker (NW), and we find that 8 is optimal in our environment.
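As a minimal illustration of that knob, the sketch below (the dataset class and helper are hypothetical stand-ins, not mcan-vqa's actual code) shows where it enters: num_worker corresponds to the `num_workers` argument of PyTorch's `DataLoader`, which controls how many subprocesses fetch samples in parallel:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyFeatureDataset(Dataset):
    """Hypothetical stand-in for an npz-backed feature dataset."""
    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # In a real loader this would be something like np.load(path)['x'];
        # here we just return a dummy feature tensor.
        return torch.full((4,), float(idx))

def make_loader(dataset, num_workers=8, batch_size=64, pin_memory=True):
    # num_workers is the knob that affects data-loading speed:
    # each worker is a subprocess issuing its own disk reads.
    # 8 was reported as optimal in the authors' environment.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers, pin_memory=pin_memory)
```

With num_workers=8, eight worker processes issue the per-sample file reads concurrently, which is what hides disk latency; torch.set_num_threads only affects intra-op CPU math, not these workers.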
Since you are using an HDD, I think the upper-bound read speed is about 100 MB/s, which is nearly saturated by a single model's training. That is why your second model gets stuck. With an SSD drive, it is possible to train 4 independent large models at the same time, each at the same training speed.
Loading all features into memory can really speed things up if you are using an HDD drive. However, memory usage must be considered: for each model, training on the train+val+vg split takes about 40 GB of memory. That is the reason we do not use this strategy.
To summarize, an SSD is the most economical way to obtain high I/O speed if you want to train multiple models at the same time.
Thank you for your kind help and suggestions! I have 256 GB of memory on my workstation, so I will try using memory for acceleration. Could you provide tips, or code, for loading the files into memory?

By the way, your team is the VQA Challenge 2019 champion, right?
We will provide such a memory-based option soon, by loading all the npz files into memory at the model initialization stage (which may take some time if they are loaded from an HDD).

Yes, we are the winner of VQA Challenge 2019 : )
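A minimal sketch of such a preload step (assuming one `.npz` file per image named `<image_id>.npz`; the helper name and key layout are illustrative, not the repo's actual code):

```python
import glob
import os
import numpy as np

def preload_npz(feat_dir):
    """Read every .npz feature file under feat_dir into RAM once,
    so later lookups never touch the disk.

    Assumes files are named '<image_id>.npz' (illustrative layout,
    not mcan-vqa's actual code).
    """
    cache = {}
    for path in glob.glob(os.path.join(feat_dir, '*.npz')):
        image_id = os.path.splitext(os.path.basename(path))[0]
        with np.load(path) as z:
            # copy() detaches each array from the (soon-closed) file handle
            cache[image_id] = {k: z[k].copy() for k in z.files}
    return cache
```

During training, the dataset's `__getitem__` would then index `cache[image_id]` instead of calling `np.load` per sample, trading RAM (roughly 40 GB for train+val+vg, per the numbers above) for HDD-independent throughput.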
Thank you very much! I'm looking forward to the update.

Congratulations!
This function has been added. Please see our latest merge.
I saw the update. Thanks a lot.