Comments (6)
After I killed the program manually, the traceback was as follows:
^CException ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f87c6c81898>>
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
    self._shutdown_workers()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 685, in _shutdown_workers
    self.done_event.set()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 346, in set
    with self._cond:
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 230, in __enter__
    return self._lock.__enter__()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f01aa828c18>>
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
    self._shutdown_workers()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 685, in _shutdown_workers
    self.done_event.set()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 346, in set
    with self._cond:
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 230, in __enter__
    return self._lock.__enter__()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "search_itm.py", line 721, in <module>
    join=True
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 73, in join
    timeout=timeout,
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
And the output of nvidia-smi is now as follows:
Sat Feb 27 19:02:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40c On | 00000000:02:00.0 Off | 0 |
| 23% 25C P8 20W / 235W | 12MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40c On | 00000000:03:00.0 Off | 0 |
| 23% 42C P0 69W / 235W | 9840MiB / 11441MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K40m On | 00000000:82:00.0 Off | 0 |
| N/A 24C P8 20W / 235W | 12MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K40m On | 00000000:83:00.0 Off | 0 |
| N/A 34C P0 67W / 235W | 9840MiB / 11441MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 24426 C ...naconda3/envs/py36-t101-cu90/bin/python 9827MiB |
| 3 24428 C ...naconda3/envs/py36-t101-cu90/bin/python 9827MiB |
+-----------------------------------------------------------------------------+
After I killed the two processes on GPU 1 and GPU 3 with nvidia-smi | grep 'python' | awk '{ print $3 }' | xargs -n1 kill -9, I got the following output:
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ Process SpawnProcess-2:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ Process SpawnProcess-4:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
^C
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ ^C
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ /home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
It seems that one process goes wrong (maybe out of memory, but there are no hints) while the others keep waiting for it.
I still cannot figure it out.
I'd appreciate it if anyone could help.
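For reference, the hang pattern above (the parent blocked in join() until Ctrl-C, leaving workers holding GPU memory) can be avoided by joining workers with a deadline and force-terminating any stragglers. A minimal sketch with the stdlib multiprocessing module; the helper names are mine, and it uses the 'fork' start method to stay self-contained, whereas torch.multiprocessing.spawn uses 'spawn':

```python
import multiprocessing as mp
import time

def _stuck_worker(rank):
    # stand-in for a worker that hangs (e.g. waiting on a dead peer)
    time.sleep(60)

def join_with_deadline(procs, timeout):
    """Join processes, force-terminating any still alive at the deadline.

    Returns the number of processes that had to be terminated."""
    deadline = time.time() + timeout
    for p in procs:
        p.join(max(0.0, deadline - time.time()))
    terminated = 0
    for p in procs:
        if p.is_alive():
            p.terminate()  # SIGTERM, so no orphan keeps holding GPU memory
            p.join()
            terminated += 1
    return terminated

def run_stuck_demo(nprocs=2, timeout=1.0):
    # 'fork' keeps this demo import-safe; real code would use 'spawn'
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=_stuck_worker, args=(r,)) for r in range(nprocs)]
    for p in procs:
        p.start()
    return join_with_deadline(procs, timeout)
```

With this pattern a single Ctrl-C (or a watchdog timeout) leaves no orphaned worker behind, so the manual kill via nvidia-smi would not be needed.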
from mmnas.
It really looks like facebookresearch/fairseq#708 (comment).
I have just noticed the prerequisites in the README.
I think the 150GB memory requirement for ITM may be the essential cause of the problem above.
I checked the RAM size of my server:
(py36-t041-cu90) zhouxx@gpu79:~/gprojects/mmnas/logs/ckpts$ free -m -h
              total        used        free      shared  buff/cache   available
Mem:            94G         71G        844M         22G         22G        528M
Swap:           29G         29G        1.2M
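The free output above shows only ~94 GB total, well below the stated 150 GB requirement. As a sanity check before launching a long run, available RAM can also be read programmatically from /proc/meminfo (Linux-only sketch; parse_meminfo and has_enough_ram are hypothetical helper names, not part of mmnas):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field -> kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])  # first token is the kB value
    return info

def has_enough_ram(required_gb, meminfo_path="/proc/meminfo"):
    """True if MemAvailable meets the requirement (Linux only)."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] >= required_gb * 1024 * 1024
```

Failing fast on such a check is friendlier than an opaque hang once swap is exhausted mid-training.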
Is there any trick to reduce the memory cost without reducing the batch size?
I also wonder why ITM requires so much memory.
I'd appreciate it if anyone could explain that.
Sorry for the late reply. ITM indeed needs that much memory for a deep model like MMnas. If memory is insufficient, you can try reducing the hidden dimension from 512 to 256.
The reason for the large memory footprint is that we forward the positive samples along with their negative samples through the network, which makes ITM more memory-consuming than the other tasks.
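The scaling described above can be made concrete with a little arithmetic: if every positive image-text pair in a batch is forwarded together with its sampled negatives, the number of pairs the network actually processes (and hence the activation memory) grows in proportion. A hedged sketch; the function names are mine, not the actual train_itm.py code:

```python
def effective_forward_pairs(batch_size, num_neg):
    # each example contributes 1 positive pair plus num_neg negative pairs,
    # all of which must be forwarded through the network together
    return batch_size * (1 + num_neg)

def approx_activation_scale(num_neg):
    # activation memory grows roughly linearly with the forwarded pairs,
    # relative to a task that forwards only the positives
    return 1 + num_neg
```

For example, a nominal batch of 64 with one sampled negative per positive already forwards 128 pairs, which is why ITM hits memory limits that classification-style tasks with the same batch size do not.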
@MIL-VLG Got it! Thanks!