junhyukoh / self-imitation-learning
ICML 2018 Self-Imitation Learning
License: MIT License
Hello.
I first changed the default policy in run_atari_sil.py to lstm:
parser.add_argument('--policy', help='Policy architecture', choices=['cnn', 'lstm', 'lnlstm'], default='lstm')
Then I ran A2C+SIL on an Atari game:
python baselines/a2c/run_atari_sil.py --env BreakoutNoFrameskip-v4
I got this error:
Logging to /tmp/a2c
2018-12-25 14:46:34.107377: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\common\distributions.py:148: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
Traceback (most recent call last):
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1628, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension size must be evenly divisible by 15 but is 8192 for 'model_2/Reshape_1' (op: 'Reshape') with input shapes: [16,512], [3] and with input tensors computed as partial shapes: input[1] = [3,5,?].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "baselines/a2c/run_atari_sil.py", line 38, in <module>
main()
File "baselines/a2c/run_atari_sil.py", line 35, in main
num_env=16)
File "baselines/a2c/run_atari_sil.py", line 20, in train
sil_update=sil_update, sil_beta=sil_beta)
File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\a2c_sil.py", line 161, in learn
max_grad_norm=max_grad_norm, lr=lr, alpha=alpha, epsilon=epsilon, total_timesteps=total_timesteps, lrschedule=lrschedule, sil_update=sil_update, sil_beta=sil_beta)
File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\a2c_sil.py", line 35, in __init__
sil_model = policy(sess, ob_space, ac_space, nenvs, nsteps, reuse=True)
File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\policies.py", line 66, in __init__
xs = batch_to_seq(h, nenv, nsteps)
File "e:\output\python_output\hardrlwithyoutube\self-imitation-learning-master\baselines\a2c\utils.py", line 74, in batch_to_seq
h = tf.reshape(h, [nbatch, nsteps, -1])
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 7759, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
op_def=op_def)
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1792, in __init__
control_input_ops)
File "E:\Output\Python_output\HardRLWithYoutube\venv_self-imitation-learning-master\lib\site-packages\tensorflow\python\framework\ops.py", line 1631, in _create_c_op
raise ValueError(str(e))
ValueError: Dimension size must be evenly divisible by 15 but is 8192 for 'model_2/Reshape_1' (op: 'Reshape') with input shapes: [16,512], [3] and with input tensors computed as partial shapes: input[1] = [3,5,?].
What can I do to fix this? Thank you very much!
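The numbers in the error message can be checked by hand. The following is a minimal sketch, not the repository's code: it takes the shapes `[16,512]` and `[3,5,?]` from the traceback and assumes that the LSTM policy derives its environment count as `nbatch // nsteps`, which would turn the `nenvs=16` passed in a2c_sil.py into 3. That derivation is an assumption that happens to match the reported target shape.

```python
import numpy as np

# Values read from the traceback: the hidden tensor h has shape [16, 512]
# and the reshape target is [3, 5, ?].
nbatch, hidden = 16, 512
nsteps = 5

# Assumption: the policy computes nenv by integer division, which gives
# 16 // 5 = 3 and would explain the target shape [3, 5, ?].
nenv = nbatch // nsteps

h = np.zeros((nbatch, hidden))
total = h.size  # 16 * 512 = 8192 elements

# The reshape only works if the element count divides evenly by nenv * nsteps.
reshape_ok = total % (nenv * nsteps) == 0

try:
    h.reshape(nenv, nsteps, -1)
    error = None
except ValueError as exc:
    error = str(exc)

print(nenv, total, reshape_ok)
```

Under that assumption, the reshape in batch_to_seq can only succeed when the number of rows fed to the recurrent policy is a multiple of nsteps, which 16 is not.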
Is there a reason that SIL uses np.sign(reward) for all of the training, rather than the raw rewards themselves?
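For context, here is a small illustration of what `np.sign` does to a reward sequence. The reward values are hypothetical. Clipping rewards to their sign is a common convention in Atari experiments; whether SIL strictly requires it is not something this sketch can answer.

```python
import numpy as np

# Hypothetical raw rewards with very different magnitudes.
raw_rewards = np.array([0.0, 1.0, 4.0, 7.0, -3.0, 400.0])

# np.sign keeps only the direction of each reward: -1, 0, or 1.
clipped = np.sign(raw_rewards)

print(clipped)
```

After the transform, a reward of 400 and a reward of 1 contribute equally to the return, so returns stay on a similar scale across games.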
Thanks for this paper.
In the third section (the last line of the right column on the second page), you say that
$\pi_{\theta}$ and $V_{\theta}(s)$ are the policy (i.e. actor) and the value function parameterized by $\theta$.
I would like to know how the policy and the value function can use the same parameters.
Looking forward to your answer. Thanks in advance.
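One common reading of "parameterized by $\theta$" in actor-critic methods is that $\theta$ denotes the full parameter set of a single network with a shared trunk and two output heads. The sketch below illustrates that reading only; the layer sizes and the names `W_shared`, `W_pi`, and `W_v` are hypothetical and are not taken from the paper or the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, hidden_dim, n_actions = 4, 8, 3  # hypothetical sizes

# theta is one parameter set: a shared trunk plus a policy head and a value head.
theta = {
    "W_shared": rng.normal(size=(obs_dim, hidden_dim)),
    "W_pi": rng.normal(size=(hidden_dim, n_actions)),
    "W_v": rng.normal(size=(hidden_dim, 1)),
}

def forward(theta, s):
    """Return (action probabilities, state value) for one observation s."""
    h = np.tanh(s @ theta["W_shared"])   # features used by both heads
    logits = h @ theta["W_pi"]
    pi = np.exp(logits - logits.max())   # numerically stable softmax
    pi /= pi.sum()
    v = (h @ theta["W_v"]).item()        # scalar value estimate
    return pi, v

s = rng.normal(size=obs_dim)
pi, v = forward(theta, s)
print(pi, v)
```

In this arrangement both outputs depend on `W_shared`, so a gradient step on either the policy loss or the value loss changes parameters that the other head also uses.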
In the paper, the SIL value loss is defined as 0.5 * max(0, (R-V))^2. However, in the code the value loss is defined as follows:
self.vf_loss = tf.reduce_sum(self.W * v_estimate * tf.stop_gradient(delta)) / self.num_samples
which means that the value loss is 0.5 * V * clip((V-R), -5, 0).
What's the advantage of this implementation? Thanks.
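The two forms can be compared through their gradients with respect to V. The sketch below is a numerical check under simplifying assumptions: it ignores the weights `self.W` and the normalization by `self.num_samples`, treats `delta` as clip(V-R, -5, 0) as described above, and uses hypothetical values for V and R. Since the gradient of V * stop_gradient(delta) with respect to V is delta itself, that is what `code_grad` returns.

```python
import numpy as np

def paper_value_loss(v, r):
    # Value loss as written in the paper: 0.5 * max(0, R - V)^2
    return 0.5 * np.maximum(0.0, r - v) ** 2

def paper_grad(v, r, eps=1e-6):
    # Central finite difference of the paper loss with respect to V.
    return (paper_value_loss(v + eps, r) - paper_value_loss(v - eps, r)) / (2 * eps)

def code_grad(v, r, clip=5.0):
    # Gradient of V * stop_gradient(delta) with respect to V is delta,
    # where delta = clip(V - R, -clip, 0).
    return np.clip(v - r, -clip, 0.0)

v = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical value estimates
r = np.array([2.0, 1.5, 1.0, 10.0])  # hypothetical returns

g_paper = paper_grad(v, r)
g_code = code_grad(v, r)
print(g_paper, g_code)
```

For the first three entries the gradients agree, so within the clipping range the code's surrogate gives the same update direction as the paper's loss, up to a constant factor. In the last entry R-V is 7, and the code's gradient is capped at -5 while the paper's is -7, so the visible difference in this sketch is the bound on gradient magnitude.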
The equation for the SIL policy loss in the paper has no entropy term. Why does the code include one?
self.loss = self.pg_loss - entropy * self.w_entropy
I do not see a way to replicate the grid world experiment from the paper using the code that is available in the repository. Is there a way? If not, could you please publish the code?