nagataka / read-a-paper


Survey


read-a-paper's Issues

Multi-Level Discovery of Deep Options

Summary

Link

Multi-Level Discovery of Deep Options

Author/Institution

Roy Fox, Sanjay Krishnan, Ion Stoica, Ken Goldberg
UC Berkeley

What is this

Proposed the "Discovery of Deep Options (DDO)" algorithm

A policy-gradient algorithm
Options are learned from a set of demonstration trajectories
- based on Hierarchical Behavioral Cloning

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Visual Semantic Planning using Deep Successor Representations

Summary

Link

https://arxiv.org/abs/1705.08080

Author/Institution

Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, Ali Farhadi
Allen Institute for Artificial Intelligence, CMU, Stanford, and UW

Survey

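What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?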

World Models

Summary

Link

Author/Institution

David Ha, Jürgen Schmidhuber

What is this

Proposed a novel reinforcement learning agent architecture consisting of a world model, which learns a compressed spatial and temporal representation of the environment in an unsupervised manner, and a controller.

Network architecture is the following:

VAE (Vision)
- compress what the agent sees at each time frame
- encodes the high-dimensional observation into a low-dimensional latent vector $z$
MDN-RNN (Memory)
- compress what happens over time
- The M model serves as a predictive model of the future z vectors that V is expected to produce
- MDN: Mixture Density Network combined with a RNN
  - MDN outputs the parameters of a mixture of Gaussian distribution used to sample a prediction of the next latent vector z
- integrates the historical codes to create a representation that can predict future states
Controller model
- select good actions using the representations from both V and M
- $a_t = W_c [z_t h_t] + b_c$
  - a very compact representation, since most of the model's complexity and parameters reside in V and M
- Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) (Hansen, 2016; Hansen & Ostermeier, 2001) is used to optimize the parameters of C since it is known to work well for solution spaces of up to a few thousand parameters

Comparison with previous researches. What are the novelties/good points?

Most existing model-based approaches learn a model of the RL environment, but still train on the actual environment. In this work, they explored fully replacing the actual RL environment with a generated one: the agent's controller is trained only inside the environment generated by its own internal world model, and the resulting policy is then transferred back into the actual environment.

Key points

Collect 10,000 rollouts from a random policy
Train VAE(V) to encode each frame into a latent vector $z \in R^{32}$
Train MDN-RNN(M) to model $P(z_{t+1}|a_t, z_t, h_t)$
Define Controller(C) as $a_t = W_c[z_t h_t] + b_c$
Use CMA-ES to solve for a $W_c$ and $b_c$ that maximizes the expected cumulative reward

In this experiment, the world model (V and M) has no knowledge about the actual reward signals from the environment. Its task is simply to compress and predict the sequence of image frames observed. Only the Controller (C) Model has access to the reward information from the environment.
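To make the size of the controller concrete, here is a minimal sketch of the linear controller $a_t = W_c [z_t h_t] + b_c$ with its parameters stored as one flat vector, which is the form an evolution strategy such as CMA-ES would search over. Only the latent size of 32 comes from the notes above; the hidden-state size of 256, the action size of 3, and all function names are assumptions made for illustration.

```python
import numpy as np

# Z_DIM follows the notes (z in R^32); H_DIM and ACTION_DIM are assumed values.
Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3


def n_params(z_dim=Z_DIM, h_dim=H_DIM, action_dim=ACTION_DIM):
    """Number of entries in W_c plus b_c for a single linear layer."""
    return (z_dim + h_dim) * action_dim + action_dim


def controller_action(params, z_t, h_t, action_dim=ACTION_DIM):
    """Compute a_t = W_c [z_t h_t] + b_c from a flat parameter vector."""
    x = np.concatenate([z_t, h_t])                      # [z_t h_t]
    w_c = params[: x.size * action_dim].reshape(action_dim, x.size)
    b_c = params[x.size * action_dim:]
    return w_c @ x + b_c


rng = np.random.default_rng(0)
params = rng.standard_normal(n_params())                # one candidate solution
action = controller_action(
    params, rng.standard_normal(Z_DIM), rng.standard_normal(H_DIM)
)
print(n_params(), action.shape)
```

With these assumed sizes the controller has fewer than a thousand parameters, which fits the remark above that CMA-ES works well for solution spaces of up to a few thousand parameters.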

How the author proved effectiveness of the proposal?

Any discussions?

Scalability?
How far into the future can this model predict?

What should I read next?

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Summary

Link

Author/Institution

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
UC Berkeley, OpenAI

What is this

Present a simple modification to the generative adversarial network objective that encourages it to learn interpretable and meaningful representations.
- Maximizes the mutual information between a fixed small subset of the GAN's noise variables and the observations

Comparison with previous researches. What are the novelties/good points?

No supervision of any kind

Key points

A standard GAN uses a single unstructured noise vector z. In this paper, they decompose the input noise vector into two parts:

z, which is treated as a source of incompressible noise
c, which they call the latent code and which targets the salient structured semantic features of the data distribution

GAN's minimax game formalization

InfoGAN's formulation where I is a mutual information
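For reference, the two objectives named above can be written as follows, using the notation of the GAN and InfoGAN papers; $\lambda$ is the weight on the mutual-information term.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim \text{noise}}[\log(1 - D(G(z)))]$$

$$\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c))$$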

How the author proved effectiveness of the proposal?

Experiments on MNIST, CelebA, and SVHN
Two goals
- To check whether mutual information can be maximized efficiently
- To evaluate whether InfoGAN can learn disentangled and interpretable representations
  - Confirmed by using the generator to vary only one latent factor at a time and assessing whether varying that factor results in only one type of semantic variation in the generated images

Any discussions?

What should I read next?

Representation learning: A review and new perspectives

The Option-Critic Architecture

Summary

Link

The Option-Critic Architecture

Author/Institution

Pierre-Luc Bacon, Jean Harb, Doina Precup
McGill University

What is this

Derived policy gradient theorems for options and proposed a new option-critic architecture
- capable of learning both the internal policies and the termination conditions of options
- no need to provide any additional rewards or subgoals

Comparison with previous researches. What are the novelties/good points?

The proposed method enables gradual learning of the intra-option policies and termination functions, simultaneously with the policy over them.

Existing work has two problems
- The majority of the existing work has focused on finding subgoals (useful states that an agent should reach) and subsequently learning policies to achieve them. This idea has led to interesting methods but ones which are also difficult to scale up given their “combinatorial” flavor.
- Additionally, learning policies associated with subgoals can be expensive in terms of data and computation time; in the worst case, it can be as expensive as solving the entire task.

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Model-based RL

Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization

AN IMPROVED MINIMUM ERROR ENTROPY CRITERION WITH SELF ADJUSTING STEP-SIZE

Summary

Link

An Improved Minimum Error Entropy Criterion with Self Adjusting Step-Size

Author/Institution

Seungju Han, Sudhir Rao, D. Erdogmus, J. Principe
University of Florida and Oregon Health and Science University

What is this

Propose Minimum Error Entropy with self adjusting step-size (MEE-SAS)
- Adapts the step size, taking large steps when far away from the optimal solution
- and small steps when near the solution
- Converges faster than Minimum Error Entropy (MEE)

Comparison with previous researches. What are the novelties/good points?

Comparison with MEE

Key points

MEE-SAS: $J(e) = \min_{\mathbf{w}} [V(0) - V(e)]^2$

where $V(e)$ can be approximated by

$V(e) \approx \frac{1}{L} \sum_{i=k-L}^{k-1}\mathbf{K}_{\sigma\sqrt{2}}(e_k-e_i)$

See the section "INFORMATION THEORETIC CRITERIA" for derivation.
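As a rough numerical illustration of the two formulas above, the sketch below estimates $V(e)$ from a window of errors and evaluates the MEE-SAS cost $[V(0) - V(e)]^2$. It assumes a Gaussian kernel, in which case $V(0)$ is simply the kernel evaluated at zero; the function names and the choice of window are illustrative and not taken from the paper.

```python
import numpy as np


def gaussian_kernel(x, size):
    """Gaussian kernel with standard deviation `size` (assumed kernel choice)."""
    return np.exp(-x ** 2 / (2.0 * size ** 2)) / (size * np.sqrt(2.0 * np.pi))


def information_potential(errors, sigma):
    """V(e) ~= (1/L) * sum_i K_{sigma*sqrt(2)}(e_k - e_i).

    `errors` holds the L past errors followed by the current error e_k.
    """
    errors = np.asarray(errors, dtype=float)
    e_k, past = errors[-1], errors[:-1]
    return gaussian_kernel(e_k - past, sigma * np.sqrt(2.0)).mean()


def mee_sas_cost(errors, sigma):
    """J(e) = [V(0) - V(e)]^2, where V(0) is the kernel value at zero."""
    v0 = gaussian_kernel(0.0, sigma * np.sqrt(2.0))
    return (v0 - information_potential(errors, sigma)) ** 2


print(mee_sas_cost([0.0, 0.0, 0.0, 0.0], sigma=1.0))   # identical errors
print(mee_sas_cost([0.0, 2.0, -1.0, 3.0], sigma=1.0))  # spread-out errors
```

The cost shrinks toward zero as the errors in the window become identical, so its gradient also shrinks near the solution, which is one way to read the self-adjusting step size.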

How the author proved effectiveness of the proposal?

Tested the performance on two classic problems: system identification and prediction

Any discussions?

The following statement from the paper suggests an important point to consider about adaptive step sizes:
"However, we show that MEE performs better than MEE-SAS in situations where tracking ability of the optimal solution is required like in the case of non-stationary signals."

What should I read next?

A new normalized minimum-error entropy algorithm with reduced computational complexity
- "A new normalized minimum-error entropy (NMEE) algorithm is proposed as an alternative to the minimum-error entropy (MEE) and the minimum-error entropy with self-adjusting step size (MEE-SAS) algorithms. The proposed NMEE algorithm requires fewer iterations and less computation to converge and yields lower misadjustment as compared to those of the MEE and the MEE-SAS algorithms."

Model-Based Reinforcement Learning for Atari

Summary

Link

Author/Institution

What is this

Simulated Policy Learning (SimPLe)
- Video prediction techniques + policy training within the learned model
Looks like a Dyna-style variant of World Models

Comparison with previous researches. What are the novelties/good points?

Key points

The world model consists of:
a skip-connected convolutional encoder and decoder, which outputs the next predicted frame and expected reward
a convolutional inference network, which approximates the posterior given the next frame
an LSTM-based network, which is trained to approximate each bit of the latent given the previous ones

How the author proved effectiveness of the proposal?

Experiments on Atari games
- with a budget restricted to 100K time steps, roughly two hours of play time
- outperforms state-of-the-art model-free algorithms (Rainbow)

Any discussions?

What should I read next?

β-VAE: LEARNING BASIC VISUAL CONCEPTS WITH A CONSTRAINED VARIATIONAL FRAMEWORK

Summary

Link

Paper: beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
Dataset: dsprites-dataset

Author/Institution

What is this

Introduce β-VAE, a new state-of-the-art framework for automated discovery of interpretable factorized latent representations from raw image data in a completely unsupervised manner

Main contributions are the following:

we propose β-VAE, a new unsupervised approach for learning disentangled representations of independent visual data generative factors;
we devise a protocol to quantitatively compare the degree of disentanglement learnt by different models;
we demonstrate both qualitatively and quantitatively that our β-VAE approach achieves state-of-the-art disentanglement performance compared to various baselines on a variety of complex datasets

Comparison with previous researches. What are the novelties/good points?

Introduces an adjustable hyperparameter β that balances latent channel capacity and independence constraints with reconstruction accuracy

Unlike InfoGAN, β-VAE is stable to train, makes few assumptions about the data and relies on tuning a single hyperparameter β, which can be directly optimised through a hyperparameter search using weakly labelled data or through heuristic visual inspection for purely unsupervised data

Key points

"With β > 1 the model is pushed to learn a more efficient latent representation of the data, which is disentangled if the data contains at least some underlying factors of variation that are independent."

Assume ground-truth data generative factors V and W (see chapter 2): conditionally independent factors v and conditionally dependent factors w
Data generation is governed by these factors as p(x|v,w), which is called Sim(v,w), where 'Sim' comes from "the true world Simulator"

Relationship with a set of generative latent factors z: p(x|z) approximates p(x|v,w)

Learn z with a posterior q(z|x)

the prior distribution is an isotropic unit Gaussian (p(z) = N(0,I))
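The pieces listed above combine into the β-VAE objective, which weights the KL term between the posterior $q(z|x)$ and the prior $p(z)$ by β: maximise $\mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta\, D_{KL}(q(z|x)\,\|\,p(z))$. The sketch below writes this as a loss to minimise, assuming a diagonal-Gaussian posterior parameterised by `mu` and `logvar`; the function names, the default `beta=4.0`, and the idea of passing in a precomputed reconstruction term are illustrative choices.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=-1)


def beta_vae_loss(recon_nll, mu, logvar, beta=4.0):
    """Negative beta-VAE objective: reconstruction NLL + beta * KL.

    `recon_nll` is -log p(x|z), computed elsewhere by the decoder (assumed given).
    `beta` is a hypothetical default; beta = 1 gives the usual VAE bound.
    """
    return recon_nll + beta * gaussian_kl(mu, logvar)


mu = np.array([1.0, 1.0])
logvar = np.zeros(2)
print(gaussian_kl(mu, logvar))            # KL of a unit-variance posterior shifted by 1
print(beta_vae_loss(2.0, mu, logvar))     # reconstruction term plus weighted KL
```

Raising `beta` above 1 penalises the posterior more strongly for deviating from the isotropic prior, which is the pressure toward a more efficient latent representation described in the quote above.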

Disentanglement Metric

A linear classifier based approach.
Fix one of the generative factors and randomly sample all others. The goal of the classifier is to predict the index y of the generative factor that was kept fixed.

How the author proved effectiveness of the proposal?

Qualitative evaluation on CelebA, chairs and faces
Quantitative evaluation, compared with ICA, PCA, VAE, DC-IGN, and InfoGAN

Any discussions?

What should I read next?

Understanding disentangling in β-VAE

Finish "recommended-reading" list by the end of 2018

Introduction to Computational Neuroscience

Friston, Karl. "The free-energy principle: a unified brain theory?." Nature Reviews Neuroscience 11.2 (2010): 127-138. http://www.nature.com/nrn/journal/v11/n2/pdf/nrn2787.pdf
Rafal Bogacz, A tutorial on the free-energy framework for modelling perception and learning, Journal of Mathematical Psychology, Volume 76, Part B, February 2017, Pages 198-211, ISSN 0022-2496. https://doi.org/10.1016/j.jmp.2015.11.003
Douglas, Rodney J., and Kevan AC Martin. "Mapping the matrix: the ways of neocortex." Neuron 56.2 (2007): 226-238. http://www.sciencedirect.com/science/article/pii/S0896627307007787#
Chklovskii, Dmitri B., B. W. Mel, and K. Svoboda. "Cortical rewiring and information storage." Nature 431.7010 (2004): 782. https://search.proquest.com/docview/204560208?pq-origsite=gscholar
Poggio, Tomaso, and Emilio Bizzi. "Generalization in vision and motor control." Nature 431.7010 (2004): 768. http://cbcl.csail.mit.edu/publications/ps/nature03014.pdf
[ON] Dayan, Peter, and Laurence F. Abbott. Theoretical neuroscience. Vol. 806. Cambridge, MA: MIT Press, 2001.
[ON] Gerstner, Wulfram, et al. Neuronal dynamics: From single neurons to networks and models of cognition. Cambridge University Press, 2014. (Available online: http://neuronaldynamics.epfl.ch/)

Machine learning and Neuroscience

[ON] Hassabis, Demis, et al. "Neuroscience-Inspired Artificial Intelligence." Neuron 95.2 (2017): 245-258. http://www.sciencedirect.com/science/article/pii/S0896627317305093
* Marblestone, A. H., Wayne, G., & Kording, K. P. (2016). Toward an integration of deep learning and neuroscience. Frontiers in computational neuroscience, 10. http://journal.frontiersin.org/article/10.3389/fncom.2016.00094/full
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building machines that learn and think like people. Behavioral and Brain Sciences, 1-101. https://arxiv.org/pdf/1604.00289

Neuromorphic Engineering:

Douglas, Rodney, Misha Mahowald, and Carver Mead. "Neuromorphic analogue VLSI." Annual review of neuroscience 18.1 (1995): 255-281. http://authors.library.caltech.edu/1497/1/DOUarn95.pdf
A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses http://journal.frontiersin.org/article/10.3389/fnins.2015.00141/full
P. Lichtsteiner, C. Posch and T. Delbruck, "A 128× 128 120 dB 15 μs Latency Asynchronous Temporal Contrast Vision Sensor," in IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566-576, Feb. 2008. http://ieeexplore.ieee.org/abstract/document/4444573/
A million spiking-neuron integrated circuit with a scalable communication network and interface http://www.sciencemag.org/content/345/6197/668.short
Memory and information processing in neuromorphic systems http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7159144
Indiveri, Giacomo, et al. "Neuromorphic silicon neuron circuits." Frontiers in neuroscience 5 (2011). http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3130465/
Azghadi, Mostafa Rahimi, et al. "Spike-based synaptic plasticity in silicon: design, implementation, application, and challenges." Proceedings of the IEEE 102.5 (2014): 717-737. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130465/pdf/fnins-05-00073.pdf
Finding a roadmap to achieve large neuromorphic hardware systems http://journal.frontiersin.org/article/10.3389/fnins.2013.00118/full
Furber, Steve B., et al. "The spinnaker project." Proceedings of the IEEE 102.5 (2014): 652-665. http://ieeexplore.ieee.org/iel7/5/6807530/06750072.pdf

Memristors and Nanotechnologies for Neuromorphic Hardware

Querlioz, Damien, et al. "Bioinspired programming of memory devices for implementing an inference engine." Proceedings of the IEEE 103.8 (2015): 1398-1416. https://pdfs.semanticscholar.org/f791/b5bd796eaa10fd623fc842db5db561f45b2b.pdf

Spiking neural networks:

Vreeken, Jilles. "Spiking neural networks, an introduction." (2003). https://dspace.library.uu.nl/bitstream/handle/1874/24416/vreeken_03_spikingneuralnetworks.pdf?sequence=2
Neftci, Emre O., et al. "Event-Driven Random Back-Propagation: Enabling Neuromorphic Deep Learning Machines." Frontiers in Neuroscience 11 (2017). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5478701/
STICK: Spike Time Interval Computational Kernel, a Framework for General Purpose Computation Using Neurons, Precise Timing, Delays, and Synchrony http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00783#.VlYQAnVBTEU
Nengo and the neural engineering framework: A large-scale model of the functioning brain http://science.sciencemag.org/content/338/6111/1202
Real-time classification and sensor fusion with a spiking deep belief network. http://www.frontiersin.org/neuromorphic_engineering/10.3389/fnins.2013.00178/abstract.
Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing http://arxiv.org/pdf/1603.08270
Vogels, Tim P., and Larry F. Abbott. "Signal propagation and logic gating in networks of integrate-and-fire neurons." Journal of neuroscience 25.46 (2005): 10786-10795. http://www.jneurosci.org/content/25/46/10786.full.pdf
Abbott, L. F., and Wade G. Regehr. "Synaptic computation." Nature 431.7010 (2004): 796. http://www.ee.columbia.edu/~aurel/nature%20insight04/synaptic%20computation04.pdf

Learning in Neural Circuits

Song, Sen, Kenneth D. Miller, and Larry F. Abbott. "Competitive Hebbian learning through spike-timing-dependent synaptic plasticity." Nature neuroscience 3.9 (2000): 919-926. https://www.nature.com/neuro/journal/v3/n9/abs/nn0900_919.html
Abbott, Larry F., and Sacha B. Nelson. "Synaptic plasticity: taming the beast." Nature neuroscience 3.11s (2000): 1178. http://search.proquest.com/openview/6779c1043e8fa25f397a1cdedfb67051/1?pq-origsite=gscholar&cbl=44706
Bi, Guo-qiang, and Mu-ming Poo. "Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type." Journal of neuroscience 18.24 (1998): 10464-10472. http://www.jneurosci.org/content/18/24/10464.short
Graupner, Michael, and Nicolas Brunel. "Calcium-based plasticity model explains sensitivity of synaptic changes to spike pattern, rate, and dendritic location." Proceedings of the National Academy of Sciences 109.10 (2012): 3991-3996. http://www.pnas.org/content/109/10/3991.short
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. Vol. 1. No. 1. Cambridge: MIT press, 1998. http://www.academia.edu/download/38529120/9780262257053_index.pdf
Markram, Henry, Wulfram Gerstner, and Per Jesper Sjöström. "Spike-timing-dependent plasticity: a comprehensive overview." Frontiers in synaptic neuroscience 4 (2012). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3395004/
Fusi, Stefano, and L. F. Abbott. "Limits on the memory storage capacity of bounded synapses." Nature neuroscience 10.4 (2007): 485-493. https://www.nature.com/articles/nn1859
Gütig, Robert, and Haim Sompolinsky. "The tempotron: a neuron that learns spike timing-based decisions." Nature neuroscience 9.3 (2006): 420. https://capocaccia.ethz.ch/capo/raw-attachment/wiki/2010/ifslwta10/Gutig_Sompolinsky06.pdf

Neural Circuit Dynamics

Wong, K., & Wang, X. (2006). A recurrent network mechanism of time integration in perceptual decisions. J. Neurosci., 26 (4), 1314. http://www.jneurosci.org/content/26/4/1314.long
Durstewitz, Daniel, and Gustavo Deco. "Computational significance of transient dynamics in cortical networks." European Journal of Neuroscience 27.1 (2008): 217-227. http://onlinelibrary.wiley.com/doi/10.1111/j.1460-9568.2007.05976.x/full

Bayesian Brain

Rao, Rajesh PN, and Dana H. Ballard. "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects." Nature neuroscience 2.1 (1999).
Shriki, Oren, David Hansel, and Haim Sompolinsky. "Rate models for conductance-based cortical neuronal networks." Neural computation 15.8 (2003): 1809-1841.
Rao, Rajesh PN. "Bayesian computation in recurrent neural circuits." Neural computation 16.1 (2004): 1-38.
Ma, Wei Ji, et al. "Bayesian inference with probabilistic population codes." Nature neuroscience 9.11 (2006): 1432-1438.
Nessler, Bernhard, et al. "Bayesian computation emerges in generic cortical microcircuits through spike-timing-dependent plasticity." PLoS computational biology 9.4 (2013): e1003037.

Artificial Neural Networks and Deep Learning

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016. http://www.deeplearningbook.org/
These resources are excellent starting points:
The Google TensorFlow Tutorial is a great one too: https://www.tensorflow.org/tutorials/
For high-level deep learning, check out Keras: https://keras.io/
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation Learning: A Review and New Perspectives." University of Montreal, 2013.
- This is a review paper written by the deep learning trio Hinton, Lecun, Bengio:
Glorot and Bengio, "Understanding the difficulty of training deep feedforward neural networks", JMLR 2010 http://proceedings.mlr.press/v9/glorot10a.html

Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions

Summary

Link

Contrastive Explanations for Reinforcement Learning via Embedded Self Predictions

Author/Institution

Zhengxian Lin, Kin-Ho Lam, Alan Fern
Oregon State

What is this

One-sentence
The authors propose an architecture that can explain why an agent prefers one action over another by using integrated gradients to explain the association between the value function and its input GVF features.

Full
This work proposes an architecture that can explain why an agent prefers one action over another. First, they use general value functions (GVFs) to learn accumulated, manually designed features that are easier for humans to understand. Second, the value function is based on the GVF features rather than on the raw state representation, so to understand why Q(a) is higher than Q(b), we can associate Q(a) - Q(b) with the GVF features. Third, they use integrated gradients to find the importance of each GVF feature to the difference between Q(a) and Q(b). Briefly, integrated gradients is a local explanation method that approximates the non-linear function with a linear function; the weights of the approximating linear function can be used directly to indicate the influence or importance of the input features. Finally, they use minimal sufficient explanations (MSX) to handle the large number of GVF features.
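To make the integrated-gradients step more concrete, here is a small generic sketch: it attributes the change in a scalar function between a baseline input and the actual input to each input feature by averaging gradients along the straight path between them. The toy `value_head`, its weights, the zero baseline, and the finite-difference gradient are all assumptions for illustration and are not the networks or baselines used in the paper.

```python
import numpy as np


def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = [numerical_grad(f, baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(grads, axis=0)


# Hypothetical stand-in for a value head that maps GVF features to a Q-value.
w = np.array([0.5, -1.0, 2.0])
value_head = lambda feats: float(np.tanh(w @ feats))

x = np.array([1.0, 0.2, 0.3])        # made-up feature vector
baseline = np.zeros(3)               # made-up baseline
attr = integrated_gradients(value_head, x, baseline)
print(attr, attr.sum(), value_head(x) - value_head(baseline))
```

The attributions sum (approximately) to the difference in the function's output between the input and the baseline, which is what lets each feature's share of a value difference be read off directly.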

Comparison with previous researches. What are the novelties/good points?

No benchmark I think

Key points

GVF, Integrated gradients, RL explainability

How the author proved effectiveness of the proposal?

Demonstrates how their approach helps people understand the agent's behavior on several simple but interesting tasks. Good experiments and demonstration. It's understandable that the tasks are relatively simple.

Any discussions?

It's a very interesting method, but it may be hard to make it work on more complicated tasks. One thing we can learn from it is that they provide some proofs for the tabular case. It's not hard, but you know, math is a trick for top conferences.

What should I read next?

I will probably not do extended reading since I don't see much potential for future work or for application to more complex tasks. But I'm happy to see they use integrated gradients. I might also read up on MSX if I have a chance to use it.

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

Summary

Link

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
Official Code

Author/Institution

Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine
UC Berkeley

What is this

"We propose a new algorithm called probabilistic ensembles with trajectory sampling (PETS) that combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation."
- "Model-based reinforcement learning (RL) algorithms can attain excellent sample efficiency, but often lag behind the best model-free algorithms in terms of asymptotic performance. "
- "Our comparison to state-of-the-art model-based and model-free deep RL algorithms shows that our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task)."

Comparison with previous researches. What are the novelties/good points?

"While a number of prior works have explored uncertainty-aware deep neural network models [Neal, 1995, Lakshminarayanan et al., 2017], including in the context of RL [Gal et al., 2016, Depeweg et al., 2016], our work is, to our knowledge, the first to bring these components together in a deep MBRL framework that reaches the asymptotic performance of state-of-the-art model-free RL methods on benchmark control tasks."
- "these components" == ensembling and outputting Gaussian distribution parameters

Key points

Two types of uncertainty:

Aleatoric uncertainty
- Arises from inherent stochasticities of a system (e.g. observation noise and process noise)
- Aleatoric uncertainty can be captured by outputting the parameters of a parameterized distribution
Epistemic uncertainty
- corresponds to subjective uncertainty about the dynamics function, due to a lack of sufficient data to uniquely determine the underlying system exactly.
- In the limit of infinite data, epistemic uncertainty should vanish
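One common way to separate the two kinds of uncertainty with an ensemble of probabilistic models is sketched below: each member predicts a mean and a variance for the next state, the average predicted variance is read as aleatoric, and the disagreement between the predicted means is read as epistemic. This is a generic decomposition shown for intuition, not necessarily the exact procedure in the paper, and the numbers are made up.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split ensemble predictions into aleatoric and epistemic parts.

    means, variances: arrays of shape (n_members, state_dim), one row per
    ensemble member's predicted Gaussian over the next state.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    aleatoric = variances.mean(axis=0)   # average predicted noise
    epistemic = means.var(axis=0)        # disagreement between members
    return aleatoric, epistemic


# Hypothetical predictions from 5 ensemble members for a 2-D next state.
means = np.array([[1.0, 0.0], [1.1, 0.0], [0.9, 0.0], [1.0, 0.0], [1.0, 0.0]])
variances = np.full((5, 2), 0.04)
aleatoric, epistemic = decompose_uncertainty(means, variances)
print(aleatoric, epistemic)
```

In this toy case the members agree exactly on the second state dimension, so its epistemic term is zero, while the aleatoric term stays at the predicted noise level regardless of agreement.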

Planning and control

Algorithm

How the author proved effectiveness of the proposal?

Comparison to state-of-the-art model-based and model-free deep RL algorithms
- Showed that "our approach matches the asymptotic performance of model-free algorithms on several challenging benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125 times fewer samples than Soft Actor Critic and Proximal Policy Optimization respectively on the half-cheetah task)."

Any discussions?

What should I read next?

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

Summary

Link

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

Author/Institution

What is this

Show that the use of dropout (and its variants) in NNs can be interpreted as a Bayesian approximation of a well known probabilistic model: the Gaussian process (GP)

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Neural Network Memory Architectures for Autonomous Robot Navigation

Summary

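The paper makes the following three contributions:

Investigated the relationship between memory, separability, and generalization ability
Proposed a method for evaluating a network's generalization performance by estimating its Vapnik-Chervonenkis (VC) dimension
Developed a new training algorithm for supervised learning on sequential prediction problems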

Link

https://arxiv.org/abs/1705.08049

Author/Institution

Steven W Chen, Nikolay Atanasov, Arbaaz Khan, Konstantinos Karydis, Daniel D. Lee, Vijay Kumar
UPenn

Survey

What is this

Comparison with previous researches. What are the novelties/good points?

Unlike traditional feedback motion planning approaches that rely on accurate global maps, our approach can infer appropriate actions directly from sensed information by using a neural network policy representation

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Role Playing Learning for Socially Concomitant Mobile Robot Navigation

Summary

Link

https://arxiv.org/abs/1705.10092

Author/Institution

Mingming Li, Rui Jiang, Shuzhi Sam Ge, Tong Heng Lee
NUS

Survey

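What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?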

Learning a Prior over Intent via Meta-Inverse Reinforcement Learning

https://arxiv.org/abs/1805.12573

Deep Reinforcement Learning with Double Q-learning

Summary

Link

Deep Reinforcement Learning with Double Q-learning

Author/Institution

Hado van Hasselt, Arthur Guez, David Silver
Google DeepMind

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Learning deep structured semantic models for web search using clickthrough data

Summary

Link

Learning deep structured semantic models for web search using clickthrough data

Author/Institution

Po-Sen Huang (UIUC), Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, Larry Heck(MSR)

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Sim-to-Real

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Summary

Link

DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Author/Institution

What is this

Proposes DARLA (DisentAngled Representation Learning Agent)
Problems/Motivation
- Learning good internal representations with both source and target domain data
  - The reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain
- Learning exclusively on the source domain using a deep RL approach
  - Poor domain adaptation performance
- DARLA tackles both issues by focusing on learning an underlying low-dimensional factorised representation of the world
Demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios
- The theoretical utility of disentangled representations for reinforcement learning has been described before, but it has not been empirically validated
RL algorithms
- DQN
- A3C
- Model-Free Episodic Control

Comparison with previous researches. What are the novelties/good points?

Key points

Consists of a three stage pipeline
1. learning to see
2. learning to act
3. transfer
Replaces the pixel-based reconstruction loss in the VAE objective with a reconstruction loss in the high-level feature space of a model J
- J is a denoising autoencoder
"the disentangled model used for DARLA was trained with a β hyperparameter value of 1"
- "Note that by replacing the pixel based reconstruction loss in Eq. 1 with high-level feature reconstruction loss in Eq. 2 we are no longer optimising the variational lower bound, and β-VAE_DAE with β = 1 loses its equivalence to the Variational Autoencoder (VAE) framework as proposed by (Kingma & Welling, 2014; Rezende et al., 2014)."
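For orientation, the modified objective described in these notes can be written roughly as follows. This is reconstructed from the description here (β-VAE notation with the pixel reconstruction term swapped for a distance in the feature space of J, where $\hat{x}$ denotes the decoder's reconstruction), so the exact notation should be checked against Eq. 2 of the paper.

$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)} \lVert J(\hat{x}) - J(x) \rVert_2^2 - \beta\, D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$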

How the author proved effectiveness of the proposal?

Experiments
- DeepMind Lab
- Jaco robotic arm (including sim2real set-up: Mujoco simulation is the source domain and the real robotic arm is the target domain)

Any discussions?

What should I read next?

Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

Summary

Link

Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

https://medium.com/rkeramati/towards-reinforcement-learning-inspired-by-humans-without-human-demonstrations-a7c111a4d0de

This review is also informative.
https://openreview.net/forum?id=HygS7n0cFQ

Author/Institution

Ramtin Keramati, Jay Whang, Patrick Cho, Emma Brunskill
Stanford

What is this

Investigate how to perform strategic exploration when exact planning is not feasible and empirically show that optimistic Monte Carlo Tree Search outperforms posterior sampling methods
Show how to learn simple deterministic models to support fast learning using object representation. Introduced a novel algorithm: Strategic Object Oriented Reinforcement Learning (SOORL)

Comparison with previous researches. What are the novelties/good points?

Key points

optimistic MCTS
- a variant of MCTS
SOORL

How the author proved effectiveness of the proposal?

Achieves positive reward in the notoriously difficult Atari game Pitfall! within 50 episodes. Almost no RL methods have achieved positive reward on Pitfall! without human demonstrations, and even with demonstrations, such approaches often take hundreds of millions of frames to learn (Aytar et al., 2018; Hester et al., 2017).

Any discussions?

we assume two forms of prior knowledge–a predefined object representation and a class of potential model features

What should I read next?

Consider "A simple neural network module for relational reasoning", as one of the reviewers suggested.

Overcoming catastrophic forgetting in neural networks

Summary

Link

Overcoming catastrophic forgetting in neural networks

Author/Institution

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell
DeepMind

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Learning Latent Dynamics for Planning from Pixels

Summary

Link

Learning Latent Dynamics for Planning from Pixels

Official repo: google-research/planet

Author/Institution

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, James Davidson
Google Brain, University of Toronto, DeepMind, Google Research, University of Michigan

What is this

Proposed the Deep Planning Network (PlaNet)
- a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space.
Proposed a multi-step variational inference objective that we name latent overshooting.
Showed that the agent solved continuous control tasks, partial observability, and sparse rewards problems using only pixel observations
Achieved close to or sometimes higher performance than strong model-free algorithms
Control: Model Predictive Control (MPC)
- replan at each step (sounds computationally expensive)
Planning algorithm: Cross entropy method (CEM) to search for the best action sequence under the model
- Why CEM?: "We decided on this algorithm because of its robustness and because it solved all considered tasks when given the true dynamics for planning"
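The MPC + CEM loop above can be sketched roughly as follows. This is a minimal illustration, not the PlaNet implementation: `cem_plan`, its hyperparameters, and the `dynamics` / `reward` callables are hypothetical stand-ins for the learned latent models.

```python
import numpy as np

def cem_plan(dynamics, reward, state, horizon=12, action_dim=1,
             iterations=10, candidates=1000, top_k=100, seed=0):
    """Return the first action of the best action sequence found by CEM.

    `dynamics(states, actions)` and `reward(states)` are assumed to be
    batched callables standing in for the learned models.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian belief.
        noise = rng.standard_normal((candidates, horizon, action_dim))
        actions = mean + std * noise
        # Roll every candidate out under the model and sum predicted rewards.
        states = np.repeat(state[None], candidates, axis=0)
        returns = np.zeros(candidates)
        for t in range(horizon):
            states = dynamics(states, actions[:, t])
            returns += reward(states)
        # Refit the belief to the top-k sequences.
        elite = actions[np.argsort(returns)[-top_k:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
    # MPC: execute only the first action, then replan at the next step.
    return mean[0]
```

For example, with toy dynamics `s + a` and reward `-(s - 1)^2`, the planner should return a first action that moves the state towards 1. Replanning at every step is what makes this expensive: each call evaluates `iterations * candidates * horizon` model steps.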

"Model" in this architecture refers to three things:

transition model $p(s_t|s_{t-1}, a_{t-1})$
- Gaussian with mean and variance parameterized by a feed-forward neural network
observation model $p(o_t|s_t)$
- Gaussian with mean parameterized by a deconvolutional neural network and identity covariance
reward model $p(r_t|s_t)$
- scalar Gaussian with mean parameterized by a feed-forward neural network and unit variance

and the policy $p(a_t|o_{\le t}, a_{<t})$ aims to maximize the expected sum of rewards.

Comparison with previous researches. What are the novelties/good points?

The robotics community focuses on video prediction models for planning (Agrawal et al., 2016; Finn & Levine, 2017; Ebert et al., 2018; Zhang et al., 2018) that deal with the visual complexity of the real world and solve tasks with a simple gripper, such as grasping or pushing objects.
- In comparison, we focus on simulated environments, where we leverage latent planning to scale to larger state and action spaces, longer planning horizons, as well as sparse reward tasks
E2C (Watter et al., 2015) and RCE (Banijamali et al., 2017) embed images into a latent space, where they learn local-linear latent transitions and plan for actions using LQR. These methods balance simulated cartpoles and control 2-link arms from images, but have been difficult to scale up.
- We lift the Markov assumption of these models, making our method applicable under partial observability, and present results on more challenging environments that include longer planning horizons, contact dynamics, and sparse rewards.

Key points

Regarding recurrent network for planning, they claim the following:

our experiments show that both stochastic and deterministic paths in the transition model are crucial for successful planning

and the network architecture looks like Figure 2(c), which is called the recurrent state-space model (RSSM)
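To make the "deterministic plus stochastic path" idea concrete, here is a rough sketch of one transition step. It assumes a plain tanh recurrent cell and random linear heads purely for illustration; the class name `TinyRSSM`, the layer sizes, and the weights are all hypothetical and not taken from the paper or the official repo.

```python
import numpy as np

class TinyRSSM:
    """Toy transition with a deterministic path h_t and a stochastic path s_t."""

    def __init__(self, state_dim=4, hidden_dim=8, action_dim=2, seed=0):
        self.rng = np.random.default_rng(seed)
        in_dim = hidden_dim + state_dim + action_dim
        # Randomly initialised weights; a real model would learn these.
        self.w_h = self.rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.w_mean = self.rng.normal(0.0, 0.1, (hidden_dim, state_dim))
        self.w_logstd = self.rng.normal(0.0, 0.1, (hidden_dim, state_dim))

    def step(self, h, s, a):
        # Deterministic path: h_t = f(h_{t-1}, s_{t-1}, a_{t-1}).
        h_next = np.tanh(np.concatenate([h, s, a]) @ self.w_h)
        # Stochastic path: s_t ~ N(mean(h_t), std(h_t)^2).
        mean = h_next @ self.w_mean
        std = np.exp(h_next @ self.w_logstd)
        s_next = mean + std * self.rng.standard_normal(mean.shape)
        return h_next, s_next
```

The deterministic state lets information persist over many steps, while the sampled state lets the model represent several possible futures.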

How the author proved the effectiveness of the proposal?

Experiments in continuous control tasks:
Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, and Walker Walk from DeepMind control suite

Confirmed that the proposed model achieved comparable performance to the best model-free algorithms while using 200× fewer episodes and similar or less computation time.

Any discussions?

What should I read next?

Broader contextual review:

Hierarchical Reinforcement Learning

Dayan and Hinton, 1993 Feudal RL
Parr and Russell, 1998 RL with Hierarchies of Machines
Sutton, Precup, Singh, 1999 Options
Precup, 2000 Temporal abstraction in reinforcement learning
- Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
Dietterich et al, 2000 MaxQ
Fox, Moshkovitz, Tishby, 2016 Principled Option Learning
Heess et al, 2016 Learning and transfer of modulated locomotor controllers
Vezhnevets et al, 2017 Feudal Networks for HRL
Bacon, Harb, Precup, 2017 Option-Critic
Florensa, Duan, Abbeel, 2017 SNNs for HRL
Andreas, Klein, Levine, 2017 Policy Sketches
Frans, Ho, Chen, Abbeel, Schulman, 2017 Meta Learning Shared Hierarchies

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Summary

Link

Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition

Author/Institution

Thomas G. Dietterich
Oregon State University

What is this

Proposed the MAXQ decomposition, which expresses the value function in a hierarchical manner:

decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs
decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs

In addition, proposed the MAXQ Q-learning algorithm based on MAXQ, which is an online model-free algorithm. The author proved that this algorithm is guaranteed to converge to a "recursively optimal policy", which is a kind of locally optimal policy.

Comparison with previous researches. What are the novelties/good points?

Three approaches to express subtasks

1. to define each subtask in terms of a fixed policy that is provided by the programmer: Options
2. to define each subtask in terms of a non-deterministic finite-state controller: HAM
3. to define each subtask in terms of a termination predicate and a local reward function

The approach in this paper corresponds to 3.

In this context, these are the previous works

Key points

First, introduce $Q(p,s,a)$, which is the expected total reward of performing subtask $p$ starting in state $s$, executing action $a$, and then following the optimal policy thereafter.

Then, decompose it as follows:

$Q(p,s,a) = V(a,s) + C(p,s,a)$

$V(a,s)$: The value function for the subtask $a$
- the expected total reward received while executing action $a$
$C(p,s,a)$: Completion function
- the expected total reward of completing parent task $p$ after $a$ has returned
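The decomposition $Q(p,s,a) = V(a,s) + C(p,s,a)$ can be evaluated recursively down the task hierarchy. The sketch below assumes tabular values stored in plain dictionaries; the function names and the table layout are hypothetical, and it takes $V(p,s) = \max_a Q(p,s,a)$ for composite tasks.

```python
def v(task, state, children, primitive_v, completion):
    """V(task, state): value of executing `task` from `state`."""
    if task not in children:
        # Primitive action: expected immediate reward, read from a table.
        return primitive_v[(task, state)]
    # Composite subtask: value of its best child under the decomposition.
    return max(q(task, state, a, children, primitive_v, completion)
               for a in children[task])

def q(parent, state, action, children, primitive_v, completion):
    """Q(p, s, a) = V(a, s) + C(p, s, a)."""
    return (v(action, state, children, primitive_v, completion)
            + completion[(parent, state, action)])
```

With a toy hierarchy `children = {"root": ["get", "put"], "get": ["north", "pickup"]}`, the value of `root` is assembled from the primitive values at the leaves plus one completion term per level, so no single table ever stores the full value function.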

How the author proved effectiveness of the proposal?

The taxi problem

Any discussions?

What should I read next?

MAXQ is not a method for learning the structure of the hierarchy itself. Techniques like those developed for Bayesian belief networks (Pearl, 1988) could be one of the keys, as the author wrote in the paper.

Some readers may be disappointed that MAXQ provides no way of learning the structure of the hierarchy. Our philosophy in developing MAXQ (which we share with other reinforcement learning researchers, notably Parr and Russell) has been to draw inspiration from the development of Belief Networks (Pearl, 1988). Belief networks were first introduced as a formalism in which the knowledge engineer would describe the structure of the networks and domain experts would provide the necessary probability estimates. Subsequently, methods were developed for learning the probability values directly from observational data. Most recently, several methods have been developed for learning the structure of the belief networks from data, so that the dependence on the knowledge engineer is reduced.

In terms of the termination condition of a subtask, Dean and Lin (1995) could be a good reference

the termination predicate method requires the programmer to guess the relative desirability of the different states in which the subtask might terminate. This can also be difficult, although Dean and Lin show how these guesses can be revised automatically by the learning algorithm

Model-based reinforcement learning for biological sequence design

Summary

Link

Model-based reinforcement learning for biological sequence design

Author/Institution

Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, Lucy Colwell

What is this

Ref: Algorithm 1: DyNA PPO

Comparison with previous researches. What are the novelties/good points?

Key points

Our method updates the policy’s parameters using sequences x generated by the current policy πθ(x), but evaluated using a learned surrogate f'(x), instead of the true, but unknown, oracle reward function f(x).

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Visual Semantic Planning using Deep Successor Representations

Summary

Link

https://arxiv.org/abs/1705.08080

Author/Institution

Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, Ali Farhadi
Allen Institute for Artificial Intelligence, CMU, Stanford, University of Washington

Survey

What is this

address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state.

Four challenges

Performing each of actions in a visual dynamic environment requires deep visual understanding of that environment
- Trained in an interactive learning environment built on the THOR framework
The variability of visual observations and possible actions makes naive exploration intractable
- Treated it as a policy learning problem, as is done in RL
- Used imitation learning to kick-start RL
Emit a sequence of actions such that the agent ends in the goal state and the effects of the preceding actions meet the preconditions of the proceeding one
- Same as above
Previous knowledge about one task should make it easier to learn the next one
- develop a deep predictive model based on successor representation

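Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

As background, I would like to follow up on the following in a little more detail.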

Neuroscience-Inspired Artificial Intelligence

Summary

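A survey that looks back on the research and development of neuroscience and so-called AI to date, organized into Past / Present / Future.
In AI systems there are two approaches: one, like neuromorphic computing, that imitates the mechanisms of the human body (the brain), and one that does not need to be strictly faithful to biology as long as it delivers the desired results. (The authors' work at DeepMind is mainly of the latter kind, and the paper itself says "biological plausibility is a guide". Of course, DeepMind has plenty of serious neuroscience people, and everyone, Hassabis included, is presumably aiming for progress in both directions and a fine balance/blend of the two.)

The authors cite the following two benefits of neuroscience for the AI field.

neuroscience provides a rich source of inspiration for new types of algorithms and architectures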
neuroscience can provide validation of AI techniques that already exist

Link

http://www.cell.com/neuron/pdf/S0896-6273(17)30509-3.pdf

Author/Institution

Demis Hassabis (1, 2), Dharshan Kumaran (1, 3), Christopher Summerfield (1, 4), and Matthew Botvinick (1, 2)

1. DeepMind
2. Gatsby Computational Neuroscience Unit
3. Institute of Cognitive Neuroscience, University College London
4. Department of Experimental Psychology, University of Oxford

Survey

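What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Personally, the following, among others, caught my attention.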

Learning Invariant Representations for Reinforcement Learning without Reconstruction

Summary

Link

Learning Invariant Representations for Reinforcement Learning without Reconstruction

Author/Institution

Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, Sergey Levine
UCB, FAIR, McGill, Oxford

What is this

One-sentence
It proposes an approach of representation learning for RL to focus on task-relevant features while ignoring task-irrelevant ones based on the idea of bisimulation.
Full
An appropriate representation can help RL agents learn faster and also bring other benefits such as improved generalization. In this work, the authors propose Deep Bisimulation for Control (DBC) to learn RL control and the representation at the same time. The bisimulation metric measures the similarity of two states based on their rewards and state transition dynamics. The idea of this work is that the l1 distance between two latent state representations should approximate the bisimulation metric of the two states. One nice thing about this approach is that, since the bisimulation metric considers task-relevant information only, this constraint (or regularization) on latent state representation learning drives the representation to ignore task-irrelevant features. This is well demonstrated by the CARLA tasks.
Another thing to notice is that they used iterative updates (policy, environment model and representation) in the implementation. Sometimes engineering tricks are important to make your fancy idea really work.
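As a rough sketch of the idea that the l1 distance between latent codes should match a bisimulation-style target, the loss for a batch of state pairs might look like the following. It assumes diagonal-Gaussian latent dynamics, so the 2-Wasserstein distance has a closed form; the function name and argument layout are hypothetical, and in a real implementation the target would be treated as a constant (no gradient).

```python
import numpy as np

def bisim_loss(z_i, z_j, r_i, r_j, mu_i, sigma_i, mu_j, sigma_j, gamma=0.99):
    """Mean squared gap between latent l1 distance and the bisimulation target.

    z_*: latent codes, shape (batch, dim); r_*: rewards, shape (batch,);
    mu_*, sigma_*: predicted next-latent mean and std, shape (batch, dim).
    """
    # l1 distance between the two latent representations.
    z_dist = np.abs(z_i - z_j).sum(axis=-1)
    # Reward difference term of the bisimulation metric.
    r_dist = np.abs(r_i - r_j)
    # Closed-form 2-Wasserstein distance between diagonal Gaussians.
    w2 = np.sqrt(((mu_i - mu_j) ** 2).sum(axis=-1)
                 + ((sigma_i - sigma_j) ** 2).sum(axis=-1))
    target = r_dist + gamma * w2
    return np.mean((z_dist - target) ** 2)
```

Two observations that differ only in task-irrelevant details (same reward, same predicted dynamics) have a target of zero, so the loss pushes their latent codes together.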

Comparison with previous researches. What are the novelties/good points?

Compared with other representation learning approaches such as reconstruction-based or contrastive learning based approaches.

Key points

Bisimulation metric

How the author proved effectiveness of the proposal?

Use MuJoCo to show their approach leads to higher rewards or faster convergence
Use CARLA to show the generalization advantage of their approach

Any discussions?

It's actually related to our work, Domain Adaptation In Reinforcement Learning Via Latent Unified State Representation. Our approach could be categorized as a reconstruction-based state representation learning approach. I agree representation learning could matter a lot for RL.
Bisimulation is also interesting and has good potential for further research. It could find more uses in RL.
Their idea is simple but interesting. Their section 5 of proof could be a good plus. Their experiments are also valid.

What should I read next?

Learning continuous latent space models for representation learning
Scalable methods for computing state similarity in deterministic Markov decision processes.

Disentangling Visual Embeddings for Attributes and Objects

Summary

Link

Author/Institution

Nirat Saini, Khoi Pham, Abhinav Shrivastava
UMD

What is this

The question "How Can We Better Capture Subtly Distinct Features Associated with Attributes?"

They propose a novel approach, OADis, to disentangle attribute and object visual features, where visual embedding for peeled is distinct and independent of embedding for apple.
They compose unseen pairs in the visual space using the disentangled features.
- Following Compositional Zero-shot Learning (CZSL) setup, they show competitive improvement over prior works on standard datasets.
- Datasets: MIT-States and UT-Zappos are commonly used in compositional learning task
They propose a new large-scale benchmark for CZSL using an existing attribute dataset VAW, and show that OADis outperforms existing baselines.

Comparison with previous researches. What are the novelties/good points?

Prior works employ supervision from the linguistic space, and use pre-trained word embeddings to better separate and compose attribute-object pairs for recognition.

Most prior works of compositional learning employ supervision from the linguistic space:
- use pre-trained word embeddings to better separate and compose attribute-object pairs for recognition
In this paper, they shift the focus back to the visual space
- (Probably because the number of possible attribute-object pairs is large, and learning to disentangle them visually could bring scalability and better generalizability..?)
GANs, CLIP, DALL-E?
- They require a large dataset and high computational power
- This work aims for a smaller setup

Key points

Components
1. Image Encoder (IE)
2. Object-Conditioned Network (OCN)
3. Label Embedder (LE)
4. Cosine Classifier (CosCls)
5. Attribute Affinity Network (AAN) and Object Affinity Network

Compositional Zero-shot Learning (CZSL)

How the author proved effectiveness of the proposal?

MIT-States [24] and UT-Zappos [56] are commonly used to study this task

Any discussions?

Not sure how well the object affinity network learns the attr? (v’_attr)
Same for the attribute affinity network. How well this network learns obj? (v’_obj)
Not only L_attr and L_obj, but also L_seen and L_unseen work together to shape these embeddings.

What should I read next?

Unifying Count-Based Exploration and Intrinsic Motivation

Summary

Link

https://arxiv.org/abs/1606.01868

Author/Institution

Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, Remi Munos
DeepMind

What is this

Proposed a 'pseudo-count' approach to make an agent explore an unknown environment.

Comparison with previous researches. What are the novelties/good points?

Advantage compared to count-based approach:

Key points

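Solve the following equations for the pseudo-count:

$\rho_n(x) = \frac{\hat{N}_n(x)}{\hat{n}}, \quad \rho'_n(x) = \frac{\hat{N}_n(x) + 1}{\hat{n} + 1}$

where $\rho_n(x)$ is the probability assigned to $x$ by the density model after $n$ observations, and $\rho'_n(x)$ is the probability assigned to $x$ by the density model after observing a new occurrence of $x$. The density model is Context Tree Switching (Bellemare et al 2014)

$\hat{N}_n(x)$ is a pseudo-count function and
$\hat{n}$ is a pseudo-count total.

Add a bonus which is $R^+_n(x, a) = \beta (\hat{N}_n(x) + 0.01)^{-1/2}$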
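As a small worked example: if $\rho$ is the density model's probability of $x$ and $\rho'$ is its probability after one more observation of $x$, solving $\rho = \hat{N}/\hat{n}$ and $\rho' = (\hat{N}+1)/(\hat{n}+1)$ gives $\hat{N} = \rho(1-\rho')/(\rho'-\rho)$. The sketch below uses hypothetical function names, and the default `beta` is just an illustrative value.

```python
import math

def pseudo_count(rho, rho_prime):
    """Pseudo-count implied by the density before (rho) and after (rho_prime)
    observing x once more."""
    if rho_prime <= rho:
        raise ValueError("the density model must increase its estimate after seeing x")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(n_hat, beta=0.05):
    """Count-based bonus that decays with the inverse square root of the pseudo-count."""
    return beta / math.sqrt(n_hat + 0.01)
```

Sanity check: with an empirical-frequency density model, a state seen 3 times out of 10 has `rho = 3/10` and `rho_prime = 4/11`, and `pseudo_count` recovers 3 (up to floating-point error), i.e. the pseudo-count reduces to the real count.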

How the author proved effectiveness of the proposal?

Atari 2600 games using the Arcade Learning Environment (ALE). Montezuma's Revenge in particular showed a novel result.

Any discussions?

Quote from the paper:

Induced metric

We did not address the question of where the generalization comes from. Clearly, the choice of density model induces a particular metric over the state space. A better understanding of this metric should allow us to tailor the density model to the problem of exploration.

Compatible value function

There may be a mismatch in the learning rates of the density model and the value function: DQN learns much more slowly than our CTS model. As such, it should be beneficial to design value functions compatible with density models (or vice-versa).

The continuous case

Although we focused here on countable state spaces, we can as easily define a pseudo-count in terms of probability density functions. At present it is unclear whether this provides us

What should I read next?

Exploration by Random Network Distillation

Reinforcement Learning with Unsupervised Auxiliary Tasks

Summary

Link

Reinforcement Learning with Unsupervised Auxiliary Tasks

Author/Institution

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
DeepMind

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Autoencoding beyond pixels using a learned similarity metric

Summary

Link

Autoencoding beyond pixels using a learned similarity metric
Official implementation
- They use DeepPy, which seems to be the authors' own deep learning framework.

Author/Institution

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther
Technical University of Denmark, University of Copenhagen, Twitter

What is this

Combine VAEs and GANs.

Propose to use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective.
Thereby, replace element-wise errors with feature-wise errors.

Moreover, show that the network is able to disentangle factors of variation in the input data distribution and discover visual attributes in the high-level representation of the latent space.

Comparison with previous researches. What are the novelties/good points?

Why not just use a VAE?
- A pixel-wise metric (e.g. MSE) is not appropriate for images. The discriminator offers a different approach here
Why is a GAN alone not enough?
- A GAN has no encoder, so a GAN by itself is not enough for some purposes

Key points

Collapse the VAE decoder and the GAN generator into one
- Share parameters
- Train jointly
Replace element-wise reconstruction metric with a feature-wise metric expressed in the discriminator

The loss consists of these three different losses

$L_{prior}$: KL from VAE
$L^{Dis_l}_{llike}$: reconstruction error expressed in the GAN discriminator
$L_{GAN}$: standard GAN loss function = $\log(Dis(x)) + \log(1-Dis(Gen(z)))$
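The three terms can be written down directly. This sketch assumes a diagonal-Gaussian encoder and a squared-error form of the feature-wise reconstruction term (a Gaussian likelihood with identity covariance, up to a constant); the function names are hypothetical and the inputs stand in for network outputs.

```python
import numpy as np

def l_prior(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def l_llike(dis_features_x, dis_features_recon):
    """Reconstruction error measured in the discriminator's l-th layer features."""
    return 0.5 * np.sum((dis_features_x - dis_features_recon) ** 2)

def l_gan(dis_real, dis_fake):
    """Standard GAN objective: log(Dis(x)) + log(1 - Dis(Gen(z)))."""
    return np.log(dis_real) + np.log(1.0 - dis_fake)
```

The key difference from a plain VAE is that `l_llike` compares discriminator features rather than raw pixels, so the reconstruction metric is itself learned during training.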

Algorithm

How the author proved effectiveness of the proposal?

Conducted experiments with CelebA dataset and showed that the generative models trained with learned similarity measures produced better image samples than models trained with element-wise error measures.

Any discussions?

How is performance in terms of computational cost?
How to determine when to finish GANs training? (maybe need to check the code)

What should I read next?

Note

How is the performance? Might it be faster or computationally less expensive compared to MSE?

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient

Summary

Link

SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
Code

Author/Institution

Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu
Shanghai Jiao Tong University, University College London

What is this

Proposed a sequence generation framework called SeqGAN

Comparison with previous researches. What are the novelties/good points?

GAN has limitations when the goal is for generating sequences of discrete tokens
- the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model
- the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is non-trivial to balance its current score and the future one once the entire sequence has been generated

Key points

Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update (REINFORCE).
Apply Monte Carlo search with a roll-out policy to sample the unknown last tokens
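The Monte Carlo search step can be sketched as follows: to score a partial sequence, complete it several times with a roll-out policy and average the discriminator's score over the completed sequences. The names `rollout_reward`, `sample_token`, and `discriminator` are hypothetical placeholders for the generator's roll-out policy and the trained discriminator.

```python
import random

def rollout_reward(prefix, seq_len, sample_token, discriminator,
                   n_rollouts=16, seed=0):
    """Estimate the reward of a partial sequence by Monte Carlo roll-outs.

    `sample_token(seq, rng)` returns the next token given the sequence so far;
    `discriminator(seq)` returns a score in [0, 1] for a complete sequence.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        # Fill in the unknown remaining tokens with the roll-out policy.
        while len(seq) < seq_len:
            seq.append(sample_token(seq, rng))
        # Only complete sequences are scored by the discriminator.
        total += discriminator(seq)
    return total / n_rollouts
```

The averaged score then plays the role of the action value in the REINFORCE update, which gives every intermediate token a learning signal even though the discriminator only judges full sequences.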

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
How (not) to train your generative model: Scheduled sampling, likelihood, adversary?

Insights on representational similarity in neural networks with canonical correlation

https://ai.googleblog.com/2018/06/how-can-neural-network-similarity-help.html

Auto-Encoding Variational Bayes

Summary

Link

Auto-Encoding Variational Bayes

Author/Institution

Diederik P. Kingma, Max Welling
Universiteit van Amsterdam

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Adversarial Feature Matching for Text Generation

Summary

Link

Adversarial Feature Matching for Text Generation

Author/Institution

Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, Lawrence Carin
Duke University

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Data-Efficient Hierarchical Reinforcement Learning

https://arxiv.org/abs/1805.08296

Exploration by Random Network Distillation

Summary

Link

Exploration by Random Network Distillation

Author/Institution

Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov
OpenAI & Univ. of Edinburgh

What is this

Comparison with previous researches. What are the novelties/good points?

Comparison with Pseudo-count? (Bellemare et al., 2016)

Key points

Define exploration bonus as the prediction error for a problem related to the agent's transitions.
The prediction error is expected to be higher for novel states dissimilar to the ones the predictor has been trained on. This characteristic is exactly what we expect to see for exploration bonus.

Combine intrinsic rewards with extrinsic rewards. This paper also suggests using a non-episodic intrinsic reward.
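A minimal sketch of the prediction-error bonus: a fixed, randomly initialised target network defines the prediction problem, and a predictor network is trained to match it on the observations the agent visits. The network sizes, the helper names, and the choice to train only the predictor's output layer are simplifying assumptions made for this illustration.

```python
import numpy as np

def make_net(rng, in_dim, out_dim, hidden=32):
    """Two-layer ReLU network stored as a pair of weight matrices."""
    w1 = rng.standard_normal((in_dim, hidden)) / np.sqrt(in_dim)
    w2 = rng.standard_normal((hidden, out_dim)) / np.sqrt(hidden)
    return w1, w2

def forward(net, obs):
    w1, w2 = net
    return np.maximum(obs @ w1, 0.0) @ w2

def intrinsic_reward(target, predictor, obs):
    """Per-observation squared prediction error, used as the exploration bonus."""
    return ((forward(predictor, obs) - forward(target, obs)) ** 2).mean(axis=-1)

def train_predictor(target, predictor, obs, lr=0.1, steps=200):
    """Gradient descent on the predictor's output layer only (a simplification)."""
    w1, w2 = predictor
    hidden = np.maximum(obs @ w1, 0.0)
    goal = forward(target, obs)  # the target network is never updated
    for _ in range(steps):
        error = hidden @ w2 - goal
        w2 = w2 - lr * 2.0 * hidden.T @ error / error.size
    return w1, w2
```

After training on a batch of visited observations, the bonus on those observations shrinks, while observations unlike the training data are expected to keep a larger error. The total reward used by the agent would then combine this bonus with the extrinsic reward.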

How the author proved effectiveness of the proposal?

Any discussions?

How to incentivize long-horizon decisions is one of the key directions for future work, as stated in the paper.

global exploration that involves coordinated decisions over long time horizons is beyond the reach of our method

(e.g. in Montezuma's Revenge, the agent needs to save some keys for the future rather than consuming them right away)

What should I read next?

Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning

Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning
https://openreview.net/forum?id=SJJQVZW0b

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

Summary

Link

PILCO: a model-based and data-efficient approach to policy search
PILCO - 1st Takahashi Lab Model-Based Reinforcement Learning Study Group (slides, in Japanese)

Author/Institution

What is this

Abstract Quote

In this paper, we introduce pilco, a practical, data-efficient model-based policy search method. Pilco reduces model bias, one of the key problems of model-based reinforcement learning, in a principled way. By learning a probabilistic dynamics model and explicitly incorporating model uncertainty into long-term planning, pilco can cope with very little data and facilitates learning from scratch in only a few trials. Policy evaluation is performed in closed form using state-of-the-art approximate inference. Furthermore, policy gradients are computed analytically for policy improvement. We report unprecedented learning efficiency on challenging and high-dimensional control tasks.

Comparison with previous researches. What are the novelties/good points?

Key points

The framework consists of the dynamics model, analytic approximate policy evaluation, and gradient-based policy improvement.

Compute the probability distribution at time step $t$ as $p_\theta(x_t)$, then compute the cost function $J^\pi(\theta)$

The expected cost $c(x)$ can be computed analytically (eq. 25)

Analytic derivatives of $J$ can be computed, and "standard gradient-based non-convex optimization methods, e.g., CG or L-BFGS" are used to update the parameter $\theta$

How the author proved effectiveness of the proposal?

Cart-Pole (real)
Unicycle (simulation)

Any discussions?

What should I read next?

[RL] Learning to Predict by the Methods of Temporal Differences

Sutton, R.S. Machine Learning (1988) 3: 9. https://doi.org/10.1023/A:1022633531479

Universal Value Function Approximators

Summary

Link

Universal Value Function Approximators (ICML'15)

Author/Institution

Tom Schaul, Dan Horgan, Karol Gregor, and David Silver
Google DeepMind

What is this

Comparison with previous researches. What are the novelties/good points?

Key points

How the author proved effectiveness of the proposal?

Any discussions?

What should I read next?

Relational inductive biases, deep learning, and graph networks

https://arxiv.org/abs/1806.01261

Selective Dyna-Style Planning Under Limited Model Capacity

Summary

Link

Selective Dyna-Style Planning Under Limited Model Capacity, ICML2020

Author/Institution

Zaheer Abbas, Samuel Sokota, Erin J. Talvitie, Martha White

What is this

Investigated the idea of selective planning: the agent should plan only in parts of the state space where the model is accurate
Showed that incorporating learned variance into planning can outperform the equivalent model-free method
Offered evidence that ensembling and heteroscedastic regression have complementary strengths
- suggesting that their combination is a more robust selective planning mechanism than either in isolation

Comparison with previous researches. What are the novelties/good points?

To make the weight on an h-step target inversely related to the cumulative uncertainty, the weight of an individual target is obtained by computing the softmax:
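One way to realize such a weighting is a softmax over the negative cumulative predicted variance, so that rollout horizons with more accumulated uncertainty receive less weight. The exact form used in the paper is not reproduced in these notes; the function below, including the `temperature` parameter, is an assumption made for illustration.

```python
import numpy as np

def target_weights(step_variances, temperature=1.0):
    """Softmax weights over h-step targets, decreasing in cumulative uncertainty.

    step_variances[h] is the predicted variance of the (h+1)-th model step.
    """
    cumulative = np.cumsum(step_variances)
    logits = -cumulative / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```

For instance, `target_weights([0.1, 0.5, 2.0])` gives the 1-step target the largest weight and the 3-step target the smallest. A lower temperature makes the weighting more selective.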

Key points

Heteroscedastic regression to determine predictive uncertainty arising from model inadequacy
- Model inadequacy refers to the model's hypothesis class being unable to express the underlying function generating the data
- The loss: refer to eq. 1
Other methods detect parameter uncertainty
Aleatoric uncertainty, which comes from the randomness of the environment, is irreducible and is not the main focus of this work
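Heteroscedastic regression is commonly trained with a Gaussian negative log-likelihood in which the network predicts both a mean and an input-dependent variance. The sketch below shows that common form; whether it matches the paper's eq. 1 term for term is an assumption, and the function name is hypothetical.

```python
import numpy as np

def gaussian_nll(y, mean, log_var):
    """Gaussian negative log-likelihood (up to a constant), averaged over samples.

    Predicting log-variance keeps the variance positive without constraints.
    """
    return np.mean(0.5 * (log_var + (y - mean) ** 2 / np.exp(log_var)))
```

Where the model cannot fit the data (large residual), the loss is lowered by predicting a larger variance. For a residual of 2, predicting variance 4 gives a loss of about 1.19, versus 2.0 for variance 1. This is why the learned variance can serve as a signal of where the model is unreliable.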

How the author proved effectiveness of the proposal?

To show that heteroscedastic regression can estimate predictive uncertainty due to model inadequacy, the authors conducted an experiment comparing it with four other methods:

Monte Carlo Dropout
Ensemble of NN
Randomized Prior Functions
Randomized Prior Functions with Bootstrapping

Any discussions?

What should I read next?

Estimating the mean and variance of the target probability distribution
- To dive deeper into the regression loss used to predict variances
Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning
- To understand model-based value expansion (MVE)

Adaptive Information Gathering via Imitation Learning

Summary

Formulates the information gathering problem as a POMDP and proposes a policy learning method that imitates a "clairvoyant oracle", which has full information.

Link

https://arxiv.org/abs/1705.07834

Author/Institution

Sanjiban Choudhury, Ashish Kapoor, Gireeja Ranade, Sebastian Scherer, and Debadeepta Dey
CMU and MSR

Note: apparently, "This work was conducted by Sanjiban Choudhury as part of a summer internship at Microsoft Research, Redmond, USA"

Survey

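What is this

Comparison with previous researches. What are the novelties/good points?

Key points

clairvoyant oracle: an oracle that at train time has full knowledge about the world map and can compute maximally informative sensing locations.
The policy imitates this information.

How the author proved effectiveness of the proposal?

Any discussions?

Personally, I still don't quite see what makes this method so impressive.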