unable to start train (elf, CLOSED)

grypes commented on May 22, 2024

unable to start train

Comments (17)

yuandong-tian commented on May 22, 2024

@git-hcLee It works fine on my side. What are your OS and gcc versions?

qiqiguaitm commented on May 22, 2024

This looks like my situation in #14. You can try my workaround there, which runs without the random seed.

EasyHard commented on May 22, 2024

Could you post the backtrace of the dump? For me, rebuilding PyTorch from source with gcc 5.4.0-1 made it work.

grypes commented on May 22, 2024

@yuandong-tian
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

grypes commented on May 22, 2024

@EasyHard Thanks, I'll try it.

LinZichuan commented on May 22, 2024

I hit the same problem: a segmentation fault. I use gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 and cannot start training with run.py.

EasyHard commented on May 22, 2024

Could any of you post a backtrace of the dump? Just for more information.
gdb python
r run.py
bt
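
The three gdb commands above can also be run non-interactively, which makes the backtrace easier to paste into a comment. A minimal sketch, assuming gdb is on the PATH and the script is run.py as in this thread. The helper name `gdb_backtrace_argv` is hypothetical; `-batch`, `-ex` and `--args` are standard gdb options.

```python
import shlex
import sys


def gdb_backtrace_argv(script, python=sys.executable):
    """Build a gdb command line that runs `python script` and prints a backtrace."""
    return [
        "gdb", "-batch",          # exit once the listed commands have finished
        "-ex", "run",             # same as typing `r` at the (gdb) prompt
        "-ex", "bt",              # same as typing `bt` after the crash
        "--args", python, script,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be pasted into the
    # same shell (and environment) that is used for training.
    print(" ".join(shlex.quote(arg) for arg in gdb_backtrace_argv("run.py")))
```

The printed command should be run from the ELF checkout with the same environment variables (game=..., model=...) as a normal training run.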

Liujiachen commented on May 22, 2024

Hi, I can run the standalone backend game_MC successfully, but when I try to train I get the message below:
Traceback (most recent call last):
  File "run.py", line 142, in <module>
    game = load_module(os.environ["game"]).Loader()
  File "/home/myubuntu/ELF-master/rlpytorch/utils.py", line 510, in load_module
    module = __import__(os.path.basename(mod))
  File "./rts/game_MC/game.py", line 8, in <module>
    import minirts
ImportError: /home/myubuntu/anaconda3/lib/python3.5/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./rts/game_MC/minirts.so)

yuandong-tian commented on May 22, 2024

@Liujiachen: Could you check your gcc and libstdc++ versions?
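
The ImportError above says the libstdc++.so.6 that was loaded does not export GLIBCXX_3.4.20. One way to check is to list the GLIBCXX version tags embedded in a given library file, much like running `strings` on it and filtering for GLIBCXX. A minimal sketch; the function names `glibcxx_versions` and `provides` are hypothetical, and scanning raw bytes is a quick heuristic rather than a proper ELF symbol-version parse.

```python
import re


def glibcxx_versions(blob):
    """Return the GLIBCXX version tags found in raw library bytes, sorted numerically."""
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", blob))
    return sorted(
        (tag.decode() for tag in tags),
        key=lambda version: tuple(int(part) for part in version.split(".")),
    )


def provides(path, wanted="3.4.20"):
    """Report whether the library file at `path` mentions GLIBCXX_<wanted>."""
    with open(path, "rb") as handle:
        return wanted in glibcxx_versions(handle.read())
```

In the traceback, the path ending in `torch/lib/../../../../libstdc++.so.6` resolves to `/home/myubuntu/anaconda3/lib/libstdc++.so.6`, so that is the file to pass to `provides`. Comparing the result with the system copy (on Ubuntu usually under `/usr/lib/x86_64-linux-gnu/`, an assumption here) shows whether the Anaconda copy is the older one.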

gchlodzinski commented on May 22, 2024

Hi, I am also getting a segmentation fault.
Here is what I am using:
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

And here is what EasyHard was asking for:
(gdb) r run.py
Starting program: /usr/bin/python3 run.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3a17700 (LWP 11218)]
[New Thread 0x7ffff1216700 (LWP 11219)]
[New Thread 0x7ffff0a15700 (LWP 11220)]
[Thread 0x7ffff0a15700 (LWP 11220) exited]
[Thread 0x7ffff1216700 (LWP 11219) exited]
[Thread 0x7ffff3a17700 (LWP 11218) exited]
Namespace(T=6, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=1, fs_ai=50, fs_opponent=50, game_multi=None, gpu=None, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=1000, latest_start_decay=0.7, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=False, verbose_collector=False, verbose_comm=False, wait_per_group=False)

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0) at iofread.c:37
37 iofread.c: No such file or directory.
(gdb) bt
#0 __GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0)
at iofread.c:37
#1 0x00007fffd103ea4e in std::random_device::_M_getval() ()
from /usr/local/lib/python3.5/dist-packages/torch/lib/libTHC.so.1
#2 0x00007fffbac01ffb in GameContext::GameContext(ContextOptions const&, PythonOptions const&) () from ./rts/game_MC/minirts.so
#3 0x00007fffbac03b6f in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class_, , 0>(pybind11::class_&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class_&&, void (*)(GameContext*, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const () from ./rts/game_MC/minirts.so
#4 0x00007fffbac03c9e in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class_, , 0>(pybind11::class_&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class_&&, void (*)(GameContext*, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from ./rts/game_MC/minirts.so
#5 0x00007fffbabe9f7d in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from ./rts/game_MC/minirts.so
#6 0x00000000004e9bc7 in PyCFunction_Call ()
#7 0x00000000005b7167 in PyObject_Call ()
#8 0x00000000004f413e in ?? ()
#9 0x00000000005b7167 in PyObject_Call ()
#10 0x000000000054d359 in ?? ()
#11 0x000000000055d17c in ?? ()
#12 0x00000000005b7167 in PyObject_Call ()
#13 0x0000000000528d06 in PyEval_EvalFrameEx ()
#14 0x0000000000528814 in PyEval_EvalFrameEx ()
#15 0x000000000052d2e3 in ?? ()
#16 0x000000000052dfdf in PyEval_EvalCode ()
#17 0x00000000005fd2c2 in ?? ()
#18 0x00000000005ff76a in PyRun_FileExFlags ()
#19 0x00000000005ff95c in PyRun_SimpleFileExFlags ()
#20 0x000000000063e7d6 in Py_Main ()
#21 0x00000000004cfe41 in main ()

EasyHard commented on May 22, 2024

@gchlodzinski Your stack looks similar to what I encountered. Compiling PyTorch from source with gcc-5.4 fixed it for me. I haven't had a chance to figure out why this happens, though.

gchlodzinski commented on May 22, 2024

@EasyHard Thanks, that got things started.
But now the sample training only gets to step 147 before failing with this error (at the end of the traceback):

RuntimeError: input and target have different number of elements: input[128 x 1] has 128 elements, while target[128 x 128] has 16384 elements at /home/grzegorz/pytorch/torch/lib/THCUNN/generic/SmoothL1Criterion.cu:12

Edit: I get the same result even after reinstalling the whole system from scratch, this time using conda for Python and the packages. It still crashes when I change the batch size to various other values (all powers of 2), just at a different iteration number.
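
The counts in the RuntimeError are consistent with the reported shapes: 128 x 1 is 128 elements and 128 x 128 is 16384. The thread does not identify where the 128 x 128 target comes from, so the following is only a sketch of a guard that could be called on the two shapes just before the loss, to fail with a clearer message at the point where they diverge. The helper names `numel` and `check_same_numel` are hypothetical and not part of ELF or PyTorch.

```python
from functools import reduce
from operator import mul


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)


def check_same_numel(input_shape, target_shape):
    """Raise early if input and target cannot be compared element-wise."""
    n_input, n_target = numel(input_shape), numel(target_shape)
    if n_input != n_target:
        raise ValueError(
            "input %s has %d elements, target %s has %d"
            % (tuple(input_shape), n_input, tuple(target_shape), n_target)
        )
```

For the shapes in the error, `check_same_numel((128, 1), (128, 128))` raises, while `check_same_numel((128, 1), (128,))` passes.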

LinZichuan commented on May 22, 2024

@gchlodzinski Hi, have you solved the above problem?

LinZichuan commented on May 22, 2024

@yuandong-tian
I updated the repo to the latest version and recompiled everything, but training still cannot start.

Version: 99b9e219b9e23bdc7c5e710c0aec531219d5e9e0_
Num Actions: 9
Num unittype: 6
#recv_thread = 4
0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 194, in <module>
    runner.run()
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 179, in run
    self.GC.Run()
  File "/home/ziclin/ELF/elf/utils_elf.py", line 254, in Run
    res = self._call(self.infos)
  File "/home/ziclin/ELF/elf/utils_elf.py", line 245, in _call
    reply = self._cb[infos.gid](sel, sel_gpu)
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 109, in actor
    self.stats.feed_batch(sel)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 188, in feed_batch
    return self.collector.feed_batch(batch, hist_idx=hist_idx)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 68, in feed_batch
    ids = batch["id"][hist_idx]
  File "/home/ziclin/ELF/elf/utils_elf.py", line 84, in __getitem__
    raise KeyError("Batch(): specified key: %s or %s not found!" % (key, key_with_last))
KeyError: 'Batch(): specified key: id or last_id not found!'

./script.sh: line 1: 18981 Segmentation fault (core dumped) game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model python3 run.py --num_games 1024 --batchsize 128 --freq_update 50 --fs_opponent 20 --latest_start 500 --latest_start_decay 0.99 --opponent_type AI_SIMPLE --tqdm --gpu 0 --T 20
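
The KeyError message suggests that the batch lookup tries the requested key and then the same key with a `last_` prefix, and fails only when neither field was registered. A minimal sketch of that behaviour: only the error message is taken from the traceback, while the class body, the lookup order and the field names in the usage note are assumptions, not the code in elf/utils_elf.py.

```python
class Batch:
    """Hypothetical stand-in for the batch wrapper named in the traceback."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # Assumed fallback: look for "id" first, then "last_id".
        key_with_last = "last_" + key
        if key in self._fields:
            return self._fields[key]
        if key_with_last in self._fields:
            return self._fields[key_with_last]
        raise KeyError(
            "Batch(): specified key: %s or %s not found!" % (key, key_with_last)
        )
```

With this sketch, `Batch(last_id=[3, 7])["id"]` returns the `last_id` field, while `Batch(reward=[0.0])["id"]` raises the same KeyError as above. That would mean the batch handed to the stats collector was built without either field.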

gchlodzinski commented on May 22, 2024

@LinZichuan, I was not able to find a solution to my runtime error. I also tried to run ELF on macOS, but it failed there as well (with a strange CUDA error message).
Edit:
@LinZichuan, after pulling the new set of sources I now have the same problem as you.
But it is solved by changing the command line.

yuandong-tian commented on May 22, 2024

@LinZichuan See #45

yuandong-tian commented on May 22, 2024

@LinZichuan @gchlodzinski @git-hcLee This commit f268feb might address your issue.
