Comments (7)
Could you restart training and check whether the problem happens again?
from softgroup.
One possible problem is that your RAM is not big enough. Currently, all data is prefetched up front (SoftGroup/data/scannetv2_inst.py, line 48 in 253ef54). This can be resolved by loading the data in the trainLoader() function instead.
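As a rough illustration of loading on demand rather than prefetching, here is a minimal sketch. It is not SoftGroup's actual loader: the class name `LazySceneDataset`, the helper `write_demo_scenes`, and the pickle file format are all hypothetical. The idea is that the dataset keeps only file paths in memory and reads each scene when it is requested.

```python
import os
import pickle
import tempfile


class LazySceneDataset:
    """Map-style dataset that reads one scene file per __getitem__ call.

    Hypothetical sketch: the class name and the pickle file format are
    assumptions, not SoftGroup's actual loader.
    """

    def __init__(self, filenames):
        # Keep only the paths in memory; no scene data is read here.
        self.filenames = list(filenames)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        # Load the scene on demand instead of holding every scene in RAM.
        with open(self.filenames[index], "rb") as f:
            return pickle.load(f)


def write_demo_scenes(directory, count):
    """Write small placeholder scenes to disk and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"scene{i:04d}.pkl")
        with open(path, "wb") as f:
            pickle.dump({"scene_id": i, "points": [[0.0, 0.0, float(i)]]}, f)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dataset = LazySceneDataset(write_demo_scenes(tmp, 3))
        print(len(dataset), dataset[1]["scene_id"])
```

Because the class only implements `__len__` and `__getitem__`, an object like this should be usable wherever a map-style dataset is expected. The trade-off is more disk reads per epoch in exchange for a smaller resident memory footprint.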
from softgroup.
I have tried several times, and the problem happened at different epochs, e.g. 170, 230, and 260. The RAM is also sufficient.
In one case, it printed additional messages:
Exception in thread Thread-283:
Traceback (most recent call last):
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
from softgroup.
How much RAM do you have?
from softgroup.
180 GB. The program only uses about 20 GB.
from softgroup.
It seems the current code makes RAM usage increase every epoch. I will remove the prefetch in the next commit.
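One way to confirm per-epoch growth like this is to record memory after each epoch and check whether it keeps rising. The sketch below uses Python's standard `tracemalloc` module. The helpers `run_epochs` and `grows_every_epoch` and the 1024-byte threshold are hypothetical. Note that `tracemalloc` only sees allocations made through Python's allocator, so it is a stand-in for a real process-level measurement and would not capture memory held by native tensor storage.

```python
import tracemalloc


def run_epochs(num_epochs, train_one_epoch):
    """Run epochs and record traced Python memory (bytes) after each one."""
    tracemalloc.start()
    usage = []
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


def grows_every_epoch(usage, min_growth_bytes=1024):
    """Return True if memory rose by at least min_growth_bytes each epoch."""
    if len(usage) < 2:
        return False
    return all(b - a >= min_growth_bytes for a, b in zip(usage, usage[1:]))


if __name__ == "__main__":
    cache = []  # simulated leak: one buffer is retained per epoch

    def leaky_epoch(epoch):
        cache.append(bytearray(100_000))

    usage = run_epochs(4, leaky_epoch)
    print(usage, grows_every_epoch(usage))
```

The threshold is there so that small bookkeeping allocations, such as the list of measurements itself, are not mistaken for a leak.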
from softgroup.
The data prefetch was removed in 91c58d1. Could you check whether the problem happens again?
from softgroup.