Comments (17)
@git-hcLee It works on my side. What are your OS and gcc versions?
from elf.
This looks like the same situation as mine in #14. You can try my workaround, which does without the random seed.
from elf.
Could you post the backtrace of the dump? For me, rebuilding pytorch from source with gcc 5.4.0-1 made it work fine.
from elf.
@yuandong-tian
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
from elf.
@EasyHard Thanks, I'll try it.
from elf.
I hit the same problem: I also get a segmentation fault. I use gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0. I cannot start training with run.py.
from elf.
Could any of you post a backtrace of the dump? Just for more information.
gdb python
r run.py
bt
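If gdb is not at hand, Python's standard `faulthandler` module can at least show which Python line was executing when the process received SIGSEGV. A minimal sketch, assuming the helper below (a hypothetical name, not part of ELF) is called at the top of run.py:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Dump the Python traceback of all threads to `stream` on a fatal signal.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Typical use, at the very top of the script being debugged:
#     enable_crash_traceback()
```

The same effect is available without editing the script via `python3 -X faulthandler run.py`. It reports Python frames only, so for a crash inside a C++ extension the gdb backtrace remains the more informative of the two.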
from elf.
Hi, I can run the standalone backend game_MC successfully, but when I try to train, I get the message below:
Traceback (most recent call last):
  File "run.py", line 142, in <module>
    game = load_module(os.environ["game"]).Loader()
  File "/home/myubuntu/ELF-master/rlpytorch/utils.py", line 510, in load_module
    module = __import__(os.path.basename(mod))
  File "./rts/game_MC/game.py", line 8, in <module>
    import minirts
ImportError: /home/myubuntu/anaconda3/lib/python3.5/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by ./rts/game_MC/minirts.so)
from elf.
@Liujiachen: Could you check your gcc and libstdc++ versions?
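One way to do that check is to list the `GLIBCXX_*` version tags embedded in a given `libstdc++.so.6`, which is roughly what `strings libstdc++.so.6 | grep GLIBCXX` prints. The sketch below is a plain byte scan, not an ELF parser. The function names are made up for illustration, and it cannot tell versions a library defines from versions it merely requires, so it is only meaningful when pointed at libstdc++ itself.

```python
import re


def glibcxx_versions(lib_path):
    """Return the GLIBCXX version numbers found in a shared library, sorted numerically."""
    with open(lib_path, "rb") as f:
        data = f.read()
    tags = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    # Sort by numeric components so that 3.4.19 comes after 3.4.9.
    return sorted(
        (t.decode() for t in tags),
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )


def provides(lib_path, wanted):
    """True if the version string, e.g. '3.4.20', appears in the library."""
    return wanted in glibcxx_versions(lib_path)
```

Running `provides(path, "3.4.20")` on the Anaconda copy named in the ImportError and on the system copy (its location depends on the distribution) would show whether the Anaconda copy is the older one.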
from elf.
Hi, I am also getting a segmentation fault.
Here is what I am using:
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
And here is what EasyHard was asking for:
(gdb) r run.py
Starting program: /usr/bin/python3 run.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3a17700 (LWP 11218)]
[New Thread 0x7ffff1216700 (LWP 11219)]
[New Thread 0x7ffff0a15700 (LWP 11220)]
[Thread 0x7ffff0a15700 (LWP 11220) exited]
[Thread 0x7ffff1216700 (LWP 11219) exited]
[Thread 0x7ffff3a17700 (LWP 11218) exited]
Namespace(T=6, actor_only=False, additional_labels=None, ai_type='AI_NN', batchsize=128, discount=0.99, entropy_ratio=0.01, epsilon=0.0, eval=False, freq_update=1, fs_ai=50, fs_opponent=50, game_multi=None, gpu=None, grad_clip_norm=None, greedy=False, handicap_level=0, latest_start=1000, latest_start_decay=0.7, load=None, max_tick=30000, mcts_threads=64, min_prob=1e-06, num_episode=10000, num_games=1024, num_minibatch=5000, opponent_type='AI_SIMPLE', ratio_change=0, record_dir='./record', sample_node='pi', sample_policy='epsilon-greedy', save_dir=None, save_prefix='save', seed=0, simple_ratio=-1, tqdm=False, verbose_collector=False, verbose_comm=False, wait_per_group=False)
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0) at iofread.c:37
37 iofread.c: No such file or directory.
(gdb) bt
#0 __GI__IO_fread (buf=0x7fffffffc75c, size=4, count=1, fp=0x0)
at iofread.c:37
#1 0x00007fffd103ea4e in std::random_device::_M_getval() ()
from /usr/local/lib/python3.5/dist-packages/torch/lib/libTHC.so.1
#2 0x00007fffbac01ffb in GameContext::GameContext(ContextOptions const&, PythonOptions const&) () from ./rts/game_MC/minirts.so
#3 0x00007fffbac03b6f in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call) const () from ./rts/game_MC/minirts.so
#4 0x00007fffbac03c9e in void pybind11::cpp_function::initialize<void pybind11::detail::init<ContextOptions const&, PythonOptions const&>::execute<pybind11::class, , 0>(pybind11::class&)::{lambda(GameContext*, ContextOptions const&, PythonOptions const&)#1}, void, GameContext*, ContextOptions const&, PythonOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::class_&&, void ()(GameContext, ContextOptions const&, PythonOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from ./rts/game_MC/minirts.so
#5 0x00007fffbabe9f7d in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from ./rts/game_MC/minirts.so
#6 0x00000000004e9bc7 in PyCFunction_Call ()
#7 0x00000000005b7167 in PyObject_Call ()
#8 0x00000000004f413e in ?? ()
#9 0x00000000005b7167 in PyObject_Call ()
#10 0x000000000054d359 in ?? ()
#11 0x000000000055d17c in ?? ()
#12 0x00000000005b7167 in PyObject_Call ()
#13 0x0000000000528d06 in PyEval_EvalFrameEx ()
#14 0x0000000000528814 in PyEval_EvalFrameEx ()
#15 0x000000000052d2e3 in ?? ()
#16 0x000000000052dfdf in PyEval_EvalCode ()
#17 0x00000000005fd2c2 in ?? ()
#18 0x00000000005ff76a in PyRun_FileExFlags ()
#19 0x00000000005ff95c in PyRun_SimpleFileExFlags ()
#20 0x000000000063e7d6 in Py_Main ()
#21 0x00000000004cfe41 in main ()
from elf.
@gchlodzinski Your stack looks similar to what I encountered. Compiling pytorch from source with gcc-5.4 fixed it for me. I haven't had a chance to figure out why this happens, though.
from elf.
@EasyHard Thanks, it helped to get things started.
But now the sample training only gets to step 147 before failing with this error (at the end of the traceback):
RuntimeError: input and target have different number of elements: input[128 x 1] has 128 elements, while target[128 x 128] has 16384 elements at /home/grzegorz/pytorch/torch/lib/THCUNN/generic/SmoothL1Criterion.cu:12
Edit: Moreover, I get the same result even after reinstalling the whole system from scratch, this time using conda for Python and packages. It still crashes when I change the batch size to various other values (all powers of 2), just at a different iteration number.
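The numbers in that RuntimeError are just the products of the two shapes: 128 x 1 gives 128 elements and 128 x 128 gives 16384. A minimal diagnostic sketch, using plain tuples in place of tensors and hypothetical helper names, that could be adapted to log shapes right before the loss is computed:

```python
from math import prod


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return prod(shape)


def check_same_numel(input_shape, target_shape):
    """Raise ValueError if the two shapes hold a different number of elements."""
    n_in, n_tgt = numel(input_shape), numel(target_shape)
    if n_in != n_tgt:
        raise ValueError(
            "input %s has %d elements, while target %s has %d elements"
            % (tuple(input_shape), n_in, tuple(target_shape), n_tgt)
        )
```

This only surfaces the mismatch earlier and with both shapes in one message. It does not explain where the 128 x 128 target comes from, which the thread leaves open.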
from elf.
@gchlodzinski Hi, have you solved the above problem?
from elf.
@yuandong-tian
I updated the repo to the latest version and recompiled everything, but training still cannot start.
Version: 99b9e219b9e23bdc7c5e710c0aec531219d5e9e0_
Num Actions: 9
Num unittype: 6
#recv_thread = 4
0%| | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run.py", line 194, in <module>
    runner.run()
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 179, in run
    self.GC.Run()
  File "/home/ziclin/ELF/elf/utils_elf.py", line 254, in Run
    res = self._call(self.infos)
  File "/home/ziclin/ELF/elf/utils_elf.py", line 245, in _call
    reply = self._cb[infos.gid](sel, sel_gpu)
  File "/home/ziclin/ELF/rlpytorch/trainer.py", line 109, in actor
    self.stats.feed_batch(sel)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 188, in feed_batch
    return self.collector.feed_batch(batch, hist_idx=hist_idx)
  File "/home/ziclin/ELF/rlpytorch/stats.py", line 68, in feed_batch
    ids = batch["id"][hist_idx]
  File "/home/ziclin/ELF/elf/utils_elf.py", line 84, in __getitem__
    raise KeyError("Batch(): specified key: %s or %s not found!" % (key, key_with_last))
KeyError: 'Batch(): specified key: id or last_id not found!'
./script.sh: line 1: 18981 Segmentation fault (core dumped) game=./rts/game_MC/game model=actor_critic model_file=./rts/game_MC/model python3 run.py --num_games 1024 --batchsize 128 --freq_update 50 --fs_opponent 20 --latest_start 500 --latest_start_decay 0.99 --opponent_type AI_SIMPLE --tqdm --gpu 0 --T 20
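The KeyError names two keys because the `raise` line in the traceback formats both `key` and `key_with_last`. The sketch below is a hypothetical stand-in for that kind of lookup over a plain dict, not ELF's actual `Batch` class. It assumes `key_with_last` is `"last_" + key`, which is inferred from the message only.

```python
def lookup(batch, key):
    """Return batch[key], falling back to batch['last_' + key] (assumed naming)."""
    key_with_last = "last_" + key
    if key in batch:
        return batch[key]
    if key_with_last in batch:
        return batch[key_with_last]
    raise KeyError(
        "Batch(): specified key: %s or %s not found!" % (key, key_with_last)
    )


def missing_keys(batch, required):
    """List the required keys that are present under neither name."""
    return [k for k in required if k not in batch and "last_" + k not in batch]
```

Printing `missing_keys(batch, ["id"])` along with the keys the batch does contain, just before the failing line, would show whether `id` was never requested from the game side or is stored under a different name.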
from elf.
@LinZichuan, I was not able to find a solution to my runtime error. I also tried running ELF on macOS, but it failed there as well (with a strange CUDA error message).
Edit:
@LinZichuan, right now I am having the same problem as you after pulling the new sources.
But it gets solved by changing the command line.
from elf.
@LinZichuan See #45
from elf.
@LinZichuan @gchlodzinski @git-hcLee This commit f268feb might address your issue.
from elf.