
Comments (7)

thangvubk commented on May 25, 2024

Could you restart training and check whether the problem happens again?

from softgroup.

thangvubk commented on May 25, 2024

One possible problem is that your RAM is not big enough. Currently, the data is prefetched into memory:

self.train_files = [torch.load(i) for i in train_file_names]

This can be resolved by loading the data in the trainLoader() function instead.
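As a rough illustration of the difference between prefetching and on-demand loading, here is a minimal sketch. It is not the repository's code: `LazyFileDataset`, `load_pickle`, and `build_demo_dataset` are hypothetical names, and `pickle` stands in for `torch.load` so the example runs with the standard library only. In a PyTorch setup, the same pattern would presumably pass `torch.load` as `load_fn`.

```python
import os
import pickle
import tempfile


class LazyFileDataset:
    """Loads one sample file per __getitem__ call instead of prefetching every file."""

    def __init__(self, file_names, load_fn):
        # Only the paths are stored here; no file is read at construction time.
        self.file_names = list(file_names)
        # Assumption: load_fn takes a path and returns one sample (e.g. torch.load).
        self.load_fn = load_fn

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, index):
        # The file is read on demand, so only requested samples are held in memory.
        return self.load_fn(self.file_names[index])


def load_pickle(path):
    """Stand-in loader used for the demo; replaces torch.load in this sketch."""
    with open(path, "rb") as f:
        return pickle.load(f)


def build_demo_dataset(num_files=3):
    """Write a few small sample files to a temp directory and wrap them lazily."""
    tmp_dir = tempfile.mkdtemp()
    paths = []
    for i in range(num_files):
        path = os.path.join(tmp_dir, f"scene_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump({"scene_id": i, "points": [i] * 4}, f)
        paths.append(path)
    return LazyFileDataset(paths, load_pickle)


demo = build_demo_dataset()
print(len(demo), demo[1]["scene_id"])
```

The trade-off is extra disk reads every epoch in exchange for a memory footprint that no longer scales with the size of the whole training set.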


cshizhe commented on May 25, 2024

I have tried several times, and the problem happened at different epochs, e.g. 170, 230, 260. The RAM is also sufficient.
In one case, it printed more messages:

Exception in thread Thread-283:
Traceback (most recent call last):
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/shichen/miniconda3/envs/obj3d/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer


thangvubk commented on May 25, 2024

How much RAM do you have?


cshizhe commented on May 25, 2024

180 GB. The program only uses about 20 GB.


thangvubk commented on May 25, 2024

It seems the current code makes RAM usage increase every epoch. I will remove the prefetch in the next commit.
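One way to check for this kind of per-epoch growth is to record memory use after every epoch. The sketch below uses the standard-library `tracemalloc` module; `run_epochs`, `leaky_epoch`, and `clean_epoch` are hypothetical stand-ins, not functions from the repository. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory allocated natively by other libraries may not show up.

```python
import tracemalloc


def run_epochs(train_one_epoch, num_epochs):
    """Run the epochs and return the traced Python memory (bytes) after each one."""
    tracemalloc.start()
    usage = []
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
    finally:
        tracemalloc.stop()
    return usage


# Hypothetical epoch that leaks: every "batch" stays referenced after the epoch ends.
_leaked = []


def leaky_epoch(epoch):
    _leaked.append(bytearray(1_000_000))


# Hypothetical epoch that releases its "batch" before returning.
def clean_epoch(epoch):
    batch = bytearray(1_000_000)
    del batch


leaky = run_epochs(leaky_epoch, 5)
clean = run_epochs(clean_epoch, 5)
print("leaky growth (bytes):", leaky[-1] - leaky[0])
print("clean growth (bytes):", clean[-1] - clean[0])
```

A steadily rising series, as in the leaky case, suggests that something is keeping data alive across epochs.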


thangvubk commented on May 25, 2024

The data prefetch was removed in 91c58d1. Could you check whether the problem happens again?
