### Arxiv on Feb. 16th

16Feb18

Title: Reinforcement Learning from Imperfect Demonstrations
Authors: Yang Gao, Huazhe (Harry) Xu, Ji Lin, Fisher Yu, Sergey Levine, Trevor
Darrell
Categories: cs.AI cs.LG stat.ML

Robust real-world learning should benefit from both demonstrations and
interactions with the environment. Current approaches to learning from
demonstration and reward perform supervised learning on expert demonstration
data and use reinforcement learning to further improve performance based on the
reward received from the environment. These two objectives are difficult to
jointly optimize, and such methods can be very sensitive to noisy
demonstrations. We propose a unified reinforcement learning algorithm,
Normalized Actor-Critic (NAC), that effectively normalizes the Q-function,
reducing the Q-values of actions unseen in the demonstration data. NAC learns
an initial policy network from demonstrations and refines the policy in the
environment, surpassing the demonstrator’s performance. Crucially, both
learning from demonstration and interactive refinement use the same objective,
unlike prior approaches that combine distinct supervised and reinforcement
losses. This makes NAC robust to suboptimal demonstration data since the method
is not forced to mimic all of the examples in the dataset. We show that our
unified reinforcement learning algorithm can learn robustly and outperform
existing baselines when evaluated on several realistic driving games.
https://arxiv.org/abs/1802.05313 (7631kb)

Title: From Gameplay to Symbolic Reasoning: Learning SAT Solver Heuristics in
the Style of Alpha(Go) Zero
Authors: Fei Wang, Tiark Rompf
Categories: cs.AI

Despite the recent successes of deep neural networks in various fields such
as image and speech recognition, natural language processing, and reinforcement
learning, we still face big challenges in bringing the power of numeric
optimization to symbolic reasoning. Researchers have proposed different avenues
such as neural machine translation for proof synthesis, vectorization of
symbols and expressions for representing symbolic patterns, and coupling of
neural back-ends for dimensionality reduction with symbolic front-ends for
decision making. However, these initial explorations are still only point
solutions, and bear other shortcomings such as lack of correctness guarantees.
In this paper, we present our approach of casting symbolic reasoning as games,
and directly harnessing the power of deep reinforcement learning in the style
of Alpha(Go) Zero on symbolic problems. Using the Boolean Satisfiability (SAT)
problem as a showcase, we demonstrate the feasibility of our method, and the
advantages of modularity, efficiency, and correctness guarantees.
https://arxiv.org/abs/1802.05340 (52kb)

Title: Mean Field Multi-Agent Reinforcement Learning
Authors: Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, Jun Wang
Categories: cs.MA cs.AI cs.LG

Existing multi-agent reinforcement learning methods are typically limited to
a small number of agents. As the number of agents grows large, learning
becomes intractable due to the curse of dimensionality and the exponential
growth of agent interactions. In this paper, we present Mean Field Reinforcement
Learning where the interactions within the population of agents are
approximated by those between a single agent and the average effect from the
overall population or neighboring agents; the interplay between the two
entities is mutually reinforced: the learning of the individual agent’s optimal
policy depends on the dynamics of the population, while the dynamics of the
population change according to the collective patterns of the individual
policies. We develop practical mean field Q-learning and mean field
Actor-Critic algorithms and analyze the convergence of the solution.
Experiments on resource allocation, Ising model estimation, and battle game
tasks verify the learning effectiveness of our mean field approaches in
handling many-agent interactions in population.
https://arxiv.org/abs/1802.05438 (729kb)
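
The mean-field backup at the heart of the paper can be sketched as follows. This is a minimal illustrative version for discrete states and actions, using a greedy backup and a crude rounded mean action; the paper instead averages one-hot action vectors and backs up with a Boltzmann policy.

```python
import numpy as np

def mean_field_q_update(Q, s, a, r, s_next, neighbor_actions,
                        alpha=0.1, gamma=0.95):
    """One mean-field Q-learning backup: the many-agent interaction is
    collapsed into (own action, mean action of the neighbors), so the
    Q-table is indexed as Q[state, action, mean_neighbor_action]."""
    a_bar = int(round(np.mean(neighbor_actions)))  # crude discretized mean action
    target = r + gamma * np.max(Q[s_next, :, a_bar])
    Q[s, a, a_bar] += alpha * (target - Q[s, a, a_bar])
    return Q
```

The point of the factorization is that the Q-table grows with the action space squared rather than exponentially in the number of agents.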

### Arxiv on Feb. 15th

15Feb18

No paper today in the digest about Deep Reinforcement Learning.

### Arxiv on Feb. 14th

14Feb18

Title: Efficient Exploration through Bayesian Deep Q-Networks
Authors: Kamyar Azizzadenesheli and Emma Brunskill and Animashree Anandkumar
Categories: cs.AI cs.LG stat.ML

We propose Bayesian Deep Q-Network (BDQN), a practical Thompson sampling
based Reinforcement Learning (RL) Algorithm. Thompson sampling allows for
targeted exploration in high dimensions through posterior sampling but is
usually computationally expensive. We address this limitation by introducing
uncertainty only at the output layer of the network through a Bayesian Linear
Regression (BLR) model. This layer can be trained with fast closed-form updates
and its samples can be drawn efficiently through the Gaussian distribution. We
apply our method to a wide range of Atari games in the Arcade Learning
Environment. Since BDQN carries out more efficient exploration, it is able to
reach higher rewards substantially faster than a key baseline, the double deep
Q network (DDQN).
https://arxiv.org/abs/1802.04412 (2915kb)
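
The "uncertainty only at the output layer" idea can be sketched as Bayesian linear regression over penultimate-layer features plus a Thompson-sampled action choice. The noise and prior variances below are hypothetical; the paper's exact priors and update schedule differ.

```python
import numpy as np

def blr_posterior(Phi, y, noise_var=1.0, prior_var=10.0):
    """Closed-form Gaussian posterior over last-layer weights, given
    penultimate-layer features Phi (n x d) and regression targets y (n,)."""
    d = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi.T @ y) / noise_var
    return mean, cov

def thompson_action(rng, feats, posteriors):
    """Thompson sampling: draw one weight vector per action from its
    posterior and act greedily w.r.t. the sampled Q-values."""
    q = [feats @ rng.multivariate_normal(m, C) for m, C in posteriors]
    return int(np.argmax(q))
```

Because the posterior is Gaussian with a closed form, the exploration step costs one matrix inverse per update rather than a full approximate-inference pass.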

Title: Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
Authors: Zhang-Wei Hong, Tzu-Yun Shann, Shih-Yang Su, Yi-Hsiang Chang, Chun-Yi
Lee
Categories: cs.AI stat.ML

Efficient exploration remains a challenging research problem in reinforcement
learning, especially when an environment contains large state spaces, deceptive
local optima, or sparse rewards. To tackle this problem, we present a
diversity-driven approach for exploration, which can be easily combined with
both off- and on-policy reinforcement learning algorithms. We show that by
simply adding a distance measure to the loss function, the proposed methodology
significantly enhances an agent’s exploratory behaviors, thus preventing
the policy from being trapped in local optima. We further propose an adaptive
scaling method for stabilizing the learning process. Our experimental results
in Atari 2600 show that our method outperforms baseline approaches in several
tasks in terms of mean scores and exploration efficiency.
https://arxiv.org/abs/1802.04564 (5800kb)
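
The "distance measure added to the loss" can be sketched for a discrete policy. Here the mean KL divergence from a buffer of recent policies is subtracted from the task loss; KL is only one illustrative choice among the distance measures the paper considers.

```python
import numpy as np

def diversity_augmented_loss(task_loss, pi, recent_policies, alpha=0.1):
    """Subtract a diversity bonus (mean KL from recent policies) from the
    task loss, pushing the current policy away from its predecessors."""
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    bonus = np.mean([kl(pi, p_old) for p_old in recent_policies])
    return task_loss - alpha * bonus
```

An adaptive schedule for `alpha`, in the spirit of the paper's scaling method, would shrink the bonus once exploration has paid off.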

Title: Progressive Reinforcement Learning with Distillation for Multi-Skilled
Motion Control
Authors: Glen Berseth, Cheng Xie, Paul Cernek, Michiel Van de Panne
Categories: cs.AI cs.LG stat.ML

Deep reinforcement learning has demonstrated increasing capabilities for
continuous control problems, including agents that can move with skill and
agility through their environment. An open problem in this setting is that of
developing good strategies for integrating or merging policies for multiple
skills, where each individual policy is a specialist in a specific skill and its
associated state distribution. We extend policy distillation methods to the
continuous action setting and leverage this technique to combine expert
policies, as evaluated in the domain of simulated bipedal locomotion across
different classes of terrain. We also introduce an input injection method for
augmenting an existing policy network to exploit new input features. Lastly,
our method uses transfer learning to assist in the efficient acquisition of new
skills. The combination of these methods allows a policy to be incrementally
augmented with new skills. We compare our progressive learning and integration
via distillation (PLAID) method against three alternative baselines.
https://arxiv.org/abs/1802.04765 (4972kb)
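
Policy distillation in its simplest discrete form reduces to cross-entropy matching of the expert's action distribution; a toy single-state sketch is below. The paper works in continuous action spaces, where the student instead regresses onto expert actions, so this is only the discrete analogue.

```python
import numpy as np

def distill_step(student_logits, expert_probs, lr=0.5):
    """One gradient step of policy distillation: descend the cross-entropy
    between the student's softmax policy and the expert distribution. For a
    softmax, the gradient w.r.t. the logits is (student_probs - expert_probs)."""
    p = np.exp(student_logits - np.max(student_logits))  # stable softmax
    p /= p.sum()
    return student_logits - lr * (p - expert_probs)
```

Iterating this step drives the student's policy toward the expert's distribution, which is the mechanism PLAID relies on when merging specialist policies.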

Title: Efficient Model-Based Deep Reinforcement Learning with Variational State
Tabulation
Authors: Dane Corneil, Wulfram Gerstner and Johanni Brea
Categories: cs.LG cs.AI stat.ML

Modern reinforcement learning algorithms reach super-human performance in
many board and video games, but they are sample inefficient, i.e. they
typically require significantly more playing experience than humans to reach an
equal performance level. To improve sample efficiency, an agent may build a
model of the environment and use planning methods to update its policy. In this
article we introduce VaST (Variational State Tabulation), which maps an
environment with a high-dimensional state space (e.g. the space of visual
inputs) to an abstract tabular environment. Prioritized sweeping with small
backups, a highly efficient planning method, can then be used to update
state-action values. We show how VaST can rapidly learn to maximize reward in
tasks such as 3D navigation, and efficiently adapt to sudden changes in rewards or
transition probabilities.
https://arxiv.org/abs/1802.04325 (6651kb)
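
Once the environment is abstracted to a table, planning becomes cheap. A toy sketch of prioritized sweeping over a small deterministic model is below; the reward and transition arrays are made up, and this is plain prioritized value iteration rather than the "small backups" variant the paper uses.

```python
import heapq
import numpy as np

def prioritized_sweep(R, T, gamma=0.95, theta=1e-5, max_backups=5000):
    """Prioritized value iteration over a small deterministic tabular model.
    R[s, a] is the reward and T[s, a] the successor state of action a in s."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    pq = [(-1.0, s) for s in range(n_states)]   # (negative priority, state)
    heapq.heapify(pq)
    for _ in range(max_backups):
        if not pq:
            break
        _, s = heapq.heappop(pq)
        v_new = max(R[s, a] + gamma * V[T[s, a]] for a in range(n_actions))
        delta = abs(v_new - V[s])
        V[s] = v_new
        if delta > theta:
            for pred in range(n_states):        # naively re-queue everything
                heapq.heappush(pq, (-delta, pred))
    return V
```

A real implementation would only re-queue the predecessors of `s` under the learned model, which is what makes the sweep efficient at scale.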

### Arxiv on Feb. 13th

13Feb18

Title: More Robust Doubly Robust Off-policy Evaluation
Categories: cs.AI

We study the problem of off-policy evaluation (OPE) in reinforcement learning
(RL), where the goal is to estimate the performance of a policy from the data
generated by another policy(ies). In particular, we focus on the doubly robust
(DR) estimators that consist of an importance sampling (IS) component and a
performance model, and utilize the low (or zero) bias of IS and low variance of
the model at the same time. Although the accuracy of the model has a huge
impact on the overall performance of DR, most of the work on using the DR
estimators in OPE has been focused on improving the IS part, and not much on
how to learn the model. In this paper, we propose alternative DR estimators,
called more robust doubly robust (MRDR), that learn the model parameter by
minimizing the variance of the DR estimator. We first present a formulation for
learning the DR model in RL. We then derive formulas for the variance of the DR
estimator in both contextual bandits and RL, such that their gradients
w.r.t. the model parameters can be estimated from the samples, and propose
methods to efficiently minimize the variance. We prove that the MRDR estimators
are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR
in bandits and RL benchmark problems, and compare its performance with the
existing methods.
https://arxiv.org/abs/1802.03493 (939kb)
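
For the contextual bandit case, the DR estimator the paper builds on combines a reward model with an importance-weighted correction on the logged action. A minimal sketch follows; MRDR itself additionally fits `q_hat` by minimizing this estimator's variance, which is not shown here.

```python
import numpy as np

def doubly_robust_value(rewards, actions, mu, pi, q_hat):
    """DR off-policy estimate of the target policy pi's value from data
    logged under behavior policy mu. mu, pi: per-sample action-probability
    vectors; q_hat: per-sample reward-model predictions per action."""
    n = len(rewards)
    est = 0.0
    for i in range(n):
        a = actions[i]
        rho = pi[i][a] / mu[i][a]                  # importance weight
        baseline = float(np.dot(pi[i], q_hat[i]))  # model-based term
        est += baseline + rho * (rewards[i] - q_hat[i][a])
    return est / n
```

When the reward model is exact the correction term vanishes (zero variance from IS); when the model is wrong the IS term keeps the estimate unbiased, which is the "doubly robust" trade-off the paper optimizes.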

Title: Beyond the One Step Greedy Approach in Reinforcement Learning
Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Categories: cs.AI

The famous Policy Iteration algorithm alternates between policy improvement
and policy evaluation. Implementations of this algorithm with several variants
of the latter evaluation stage, e.g., $n$-step and trace-based returns, have
been analyzed in previous works. However, the case of multiple-step lookahead
policy improvement, despite the recent increase in empirical evidence of its
strength, has to our knowledge not been carefully analyzed yet. In this work,
we introduce the first such analysis. Namely, we formulate variants of
multiple-step policy improvement, derive new algorithms using these definitions
and prove their convergence. Moreover, we show that recent prominent
Reinforcement Learning algorithms are, in fact, instances of our framework. We
thus shed light on their empirical success and give a recipe for deriving new
algorithms for future study.
https://arxiv.org/abs/1802.03654 (68kb)

Title: State Representation Learning for Control: An Overview
Authors: Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François
Goudou, David Filliat
Categories: cs.AI

Representation learning algorithms are designed to learn abstract features
that characterize data. State representation learning (SRL) focuses on a
particular kind of representation learning where learned features are in low
dimension, evolve through time, and are influenced by actions of an agent. As
the representation learned captures the variation in the environment generated
by agents, this kind of representation is particularly suitable for robotics
and control scenarios. In particular, the low dimension helps to overcome the
curse of dimensionality, provides easier interpretation and utilization by
humans and can help improve performance and speed in policy learning algorithms
such as reinforcement learning.
This survey aims at covering the state-of-the-art on state representation
learning in the most recent years. It reviews different SRL methods that
involve interaction with the environment, their implementations and their
applications in robotics control tasks (simulated or real). In particular, it
highlights how generic learning objectives are differently exploited in the
reviewed algorithms. Finally, it discusses evaluation methods to assess the
representation learned and summarizes current and future lines of research.
https://arxiv.org/abs/1802.04181 (199kb)

Title: Deep Reinforcement Learning for Solving the Vehicle Routing Problem
Authors: Mohammadreza Nazari, Afshin Oroojlooy, Lawrence V. Snyder, Martin
Takáč
Categories: cs.AI cs.LG stat.ML

We present an end-to-end framework for solving Vehicle Routing Problem (VRP)
using deep reinforcement learning. In this approach, we train a single model
that finds near-optimal solutions for problem instances sampled from a given
distribution, only by observing the reward signals and following feasibility
rules. Our model represents a parameterized stochastic policy, and by applying
a policy gradient algorithm to optimize its parameters, the trained model
produces the solution as a sequence of consecutive actions in real time,
without the need to re-train for every new problem instance. Our method is
faster in both training and inference than a recent method that solves the
Traveling Salesman Problem (TSP), with nearly identical solution quality. On
the more general VRP, our approach outperforms classical heuristics on
medium-sized instances in both solution quality and computation time (after
training). Our proposed framework can be applied to variants of the VRP such as
the stochastic VRP, and has the potential to be applied more generally to
combinatorial optimization problems.
https://arxiv.org/abs/1802.04240 (650kb)

Title: Sample Efficient Deep Reinforcement Learning for Dialogue Systems with
Large Action Spaces
Authors: Gellért Weisz, Paweł Budzianowski, Pei-Hao Su, Milica Gašić
Categories: cs.CL cs.AI cs.LG stat.ML

In spoken dialogue systems, we aim to deploy artificial intelligence to build
automated dialogue agents that can converse with humans. A part of this effort
is the policy optimisation task, which attempts to find a policy describing how
to respond to humans, in the form of a function taking the current state of the
dialogue and returning the response of the system. In this paper, we
investigate deep reinforcement learning approaches to solve this problem.
Particular attention is given to actor-critic methods, off-policy reinforcement
learning with experience replay, and various methods aimed at reducing the bias
and variance of estimators. When combined, these methods result in the
previously proposed ACER algorithm that gave competitive results in gaming
environments. These environments, however, are fully observable and have a
relatively small action set, so in this paper we examine the application of ACER
to dialogue policy optimisation. We show that this method beats the current
state-of-the-art in deep learning approaches for spoken dialogue systems. This
not only leads to a more sample efficient algorithm that can train faster, but
also allows us to apply the algorithm in more difficult environments than
before. We thus experiment with learning in a very large action space, which
has two orders of magnitude more actions than previously considered. We find
that ACER trains significantly faster than the current state-of-the-art.
https://arxiv.org/abs/1802.03753 (1351kb)

Title: ADC: Automated Deep Compression and Acceleration with Reinforcement
Learning
Authors: Yihui He, Song Han
Categories: cs.CV

Model compression is an effective technique facilitating the deployment of
neural network models on mobile devices that have limited computation resources
and a tight power budget. However, conventional model compression techniques
use hand-crafted features and require domain experts to explore the large
design space trading off model size, speed, and accuracy, which is usually
sub-optimal and time-consuming. In this paper, we propose Automated Deep
Compression (ADC) that leverages reinforcement learning in order to efficiently
sample the design space and greatly improve the model compression quality. We
achieved state-of-the-art model compression results in a fully automated way
without any human efforts. Under 4x FLOPs reduction, we achieved 2.7% better
accuracy than the hand-crafted model compression method for VGG-16 on ImageNet. We
applied this automated, push-the-button compression pipeline to MobileNet and
achieved a 2x reduction in FLOPs, and a speedup of 1.49x on Titan Xp and 1.65x
on an Android phone (Samsung Galaxy S7), with negligible loss of accuracy.
https://arxiv.org/abs/1802.03494 (833kb)

Title: A note on reinforcement learning with Wasserstein distance
regularisation, with applications to multipolicy learning
Authors: Mohammed Amin Abdullah, Aldo Pacchiano, Moez Draief
Categories: cs.LG cs.AI

In this note we describe an application of Wasserstein distance to
Reinforcement Learning. The Wasserstein distance in question is between the
distribution of mappings of trajectories of a policy into some metric space,
and some other fixed distribution (which may, for example, come from another
policy). Different policies induce different distributions, so given an
underlying metric, the Wasserstein distance quantifies how different policies
are. This can be used to learn multiple policies which differ in terms of
such Wasserstein distances by using a Wasserstein regulariser. By changing the
sign of the regularisation parameter, one can instead learn a policy whose
trajectory mapping distribution is attracted to a given fixed distribution.
https://arxiv.org/abs/1802.03976 (13kb)
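
On the real line, the Wasserstein-1 distance between two equal-size empirical samples reduces to an average gap between sorted values, which makes the regulariser easy to sketch. The scalar trajectory embedding here is a toy assumption; the note allows mappings into a general metric space.

```python
import numpy as np

def w1_empirical(x, y):
    """W1 between equal-size empirical distributions on R: sort both samples
    and average the pointwise absolute gaps."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def regularised_objective(expected_return, traj_embed, ref_embed, beta=0.5):
    """beta > 0 rewards being far from the reference distribution (diverse
    policies); beta < 0 attracts the policy's trajectory distribution to it."""
    return expected_return + beta * w1_empirical(traj_embed, ref_embed)
```

Flipping the sign of `beta` is exactly the repulsion-versus-attraction switch described in the abstract.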

### Arxiv on Feb. 12th

12Feb18

Title: A Unified Approach for Multi-step Temporal-Difference Learning with
Eligibility Traces in Reinforcement Learning
Authors: Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, Gang Pan
Categories: cs.AI cs.LG stat.ML

Recently, a new multi-step temporal-difference learning algorithm, called
$Q(\sigma)$, unifies $n$-step Tree-Backup (when $\sigma=0$) and $n$-step Sarsa
(when $\sigma=1$) by introducing a sampling parameter $\sigma$. However, like
other multi-step temporal-difference learning algorithms, $Q(\sigma)$ requires
substantial memory and computation time. The eligibility trace is an important
mechanism for transforming off-line updates into efficient on-line ones that
consume less memory and computation time. In this paper, we further develop the
original $Q(\sigma)$, combine it with eligibility traces and propose a new
algorithm, called $Q(\sigma,\lambda)$, in which $\lambda$ is the trace-decay
parameter. This idea unifies Sarsa$(\lambda)$ (when $\sigma =1$) and
$Q^{\pi}(\lambda)$ (when $\sigma =0$). Furthermore, we give an upper error
bound of $Q(\sigma ,\lambda)$ policy evaluation algorithm. We prove that
$Q(\sigma,\lambda)$ control algorithm can converge to the optimal value
function exponentially. We also empirically compare it with conventional
temporal-difference learning methods. Results show that, with an intermediate
value of $\sigma$, $Q(\sigma ,\lambda)$ creates a mixture of the existing
algorithms that can learn the optimal value significantly faster than the
extreme end ($\sigma=0$, or $1$).
https://arxiv.org/abs/1802.03171 (627kb)
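
The one-step form of the backup that $Q(\sigma)$ interpolates can be sketched as below; the full $Q(\sigma,\lambda)$ algorithm then folds this target into an eligibility-trace update, which is omitted here.

```python
import numpy as np

def q_sigma_target(r, q_next, pi_next, a_next, sigma, gamma=0.95):
    """One-step Q(sigma) target: sigma=1 gives the sampled (Sarsa) backup,
    sigma=0 the expected (Tree-Backup-style) backup, and intermediate
    values mix the two."""
    expected = float(np.dot(pi_next, q_next))
    sampled = float(q_next[a_next])
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```

The abstract's empirical finding is precisely that an intermediate `sigma` can outperform both extremes.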

Title: Balancing Two-Player Stochastic Games with Soft Q-Learning
Authors: Jordi Grau-Moya and Felix Leibfried and Haitham Bou-Ammar
Categories: cs.AI

Within the context of video games the notion of perfectly rational agents can
be undesirable as it leads to uninteresting situations, where humans face tough
adversarial decision makers. Current frameworks for stochastic games and
reinforcement learning prohibit tuneable strategies as they seek optimal
performance. In this paper, we enable such tuneable behaviour by generalising
soft Q-learning to stochastic games, where multiple agents interact
strategically. We contribute both theoretically and empirically. On the theory
side, we show that games with soft Q-learning exhibit a unique value and
generalise team games and zero-sum games far beyond these two extremes to cover
a continuous spectrum of gaming behaviour. Experimentally, we show how tuning
agents’ constraints affect performance and demonstrate, through a neural
network architecture, how to reliably balance games with high-dimensional
representations.
https://arxiv.org/abs/1802.03216 (1650kb)
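
The tuneable behaviour rests on the soft (log-sum-exp) state value, where an inverse-temperature-style parameter interpolates between fully rational max-backups and indifferent averaging. A single-agent sketch is below; the paper's two-player formulation adds a second, adversarial agent, which is omitted.

```python
import numpy as np

def soft_value(q, beta):
    """Soft state value V = (1/beta) * log sum_a exp(beta * Q(s, a)),
    computed via a stabilized log-sum-exp. As beta -> infinity this tends
    to max_a Q; small beta flattens the preference over actions."""
    m = np.max(q)
    return float(m + np.log(np.sum(np.exp(beta * (q - m)))) / beta)
```

Tuning `beta` per agent is what lets the framework trade off optimality against interesting, beatable behaviour.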

Title: Learning Robust Options
Authors: Daniel J. Mankowitz, Timothy A. Mann, Pierre-Luc Bacon, Doina Precup
and Shie Mannor
Categories: cs.AI cs.LG stat.ML

Robust reinforcement learning aims to produce policies that have strong
guarantees even in the face of environments/transition models whose parameters
have strong uncertainty. Existing work uses value-based methods and the usual
primitive action setting. In this paper, we propose robust methods for learning
temporally abstract actions, in the framework of options. We present a Robust
Options Policy Iteration (ROPI) algorithm with convergence guarantees, which
learns options that are robust to model uncertainty. We utilize ROPI to learn
robust options with the Robust Options Deep Q Network (RO-DQN) that solves
multiple tasks and mitigates model misspecification due to model uncertainty.
We present experimental results which suggest that policy iteration with linear
features may have an inherent form of robustness when using coarse feature
representations. In addition, we present experimental results which demonstrate
that robustness helps policy iteration implemented on top of deep neural
networks to generalize over a much broader range of dynamics than non-robust
policy iteration.
https://arxiv.org/abs/1802.03236 (1636kb)

### Arxiv on Feb. 9th

09Feb18

Title: Deep Reinforcement Learning for Image Hashing
Authors: Jian Zhang, Yuxin Peng and Zhaoda Ye
Categories: cs.CV
Comments: 18 pages, submitted to ACM Transactions on Multimedia Computing,
Communications, and Applications (TOMM). arXiv admin note: text overlap with
arXiv:1612.02541

Deep hashing methods have received much attention recently, which achieve
promising results by taking advantage of the strong representation power of
deep networks. However, most existing deep hashing methods learn a whole set of
hashing functions independently and directly, while ignoring the correlation
between different hashing functions that can promote the retrieval accuracy
greatly. Inspired by the sequential decision ability of deep reinforcement
learning, in this paper, we propose a new Deep Reinforcement Learning approach
for Image Hashing (DRLIH). Our proposed DRLIH models the hashing learning
problem as a Markov Decision Process (MDP), which learns each hashing function
by correcting the errors imposed by previous ones and promotes retrieval
accuracy. To the best of our knowledge, this is the first work that tries to
address the hashing problem from a deep reinforcement learning perspective. The main
contributions of our proposed DRLIH approach can be summarized as follows: (1)
We propose a deep reinforcement learning hashing network. In our proposed DRLIH
approach, we utilize recurrent neural networks (RNNs) as agents to model the
hashing functions, which take actions of projecting images into binary codes
sequentially, so that current hashing function learning can take previous
hashing functions’ error into account. (2) We propose a sequential learning
strategy based on the proposed DRLIH. We define the state as a tuple of internal
features of RNN’s hidden layers and image features, which can well reflect
history decisions made by the agents. We also propose an action group method to
enhance the correlation of the hash functions in the same group. Experiments on
three widely-used datasets demonstrate the effectiveness of our proposed DRLIH
approach.
https://arxiv.org/abs/1802.02904 (751kb)

Title: Efficient collective swimming by harnessing vortices through deep
reinforcement learning
Authors: Siddhartha Verma and Guido Novati and Petros Koumoutsakos
Categories: physics.flu-dyn cs.AI physics.comp-ph

Fish in schooling formations navigate complex flow-fields replete with
mechanical energy in the vortex wakes of their companions. Their schooling
behaviour has been associated with evolutionary advantages including collective
energy savings. How fish harvest energy from their complex fluid environment,
and the underlying physical mechanisms governing energy extraction during
collective swimming, remain unknown. Here we show that fish can improve their
sustained propulsive efficiency by actively following, and judiciously
intercepting, vortices in the wake of other swimmers. This swimming strategy
leads to collective energy-savings and is revealed through the first ever
combination of deep reinforcement learning with high-fidelity flow simulations.
We find that a ‘smart-swimmer’ can adapt its position and body deformation to
synchronise with the momentum of the oncoming vortices, improving its average
swimming-efficiency at no cost to the leader. The results show that fish may
harvest energy deposited in vortices produced by their peers, and support the
conjecture that swimming in formation is energetically advantageous. Moreover,
this study demonstrates that deep reinforcement learning can produce navigation
algorithms for complex flow-fields, with promising implications for energy
savings in autonomous robotic swarms.
https://arxiv.org/abs/1802.02674 (9281kb)

### Arxiv on Feb. 8th

08Feb18

Title: A Critical Investigation of Deep Reinforcement Learning for Navigation
Authors: Vikas Dhiman, Shurjo Banerjee, Brent Griffin, Jeffrey M Siskind, Jason
J Corso
Categories: cs.RO cs.AI

The navigation problem is classically approached in two steps: an exploration
step, where map-information about the environment is gathered; and an
exploitation step, where this information is used to navigate efficiently. Deep
reinforcement learning (DRL) algorithms, alternatively, approach the problem of
navigation in an end-to-end fashion. Inspired by the classical approach, we ask
whether DRL algorithms are able to inherently explore, gather and exploit
map-information over the course of navigation. We build upon the work of
Mirowski et al. [2017] and introduce a systematic suite of experiments that vary three
parameters: the agent’s starting location, the agent’s target location, and the
maze structure. We choose evaluation metrics that explicitly measure the
algorithm’s ability to gather and exploit map-information. Our experiments show
that when trained and tested on the same maps, the algorithm successfully
gathers and exploits map-information. However, when trained and tested on
different sets of maps, the algorithm fails to transfer the ability to gather
and exploit map-information to unseen maps. Furthermore, we find that when the
goal location is randomized and the map is kept static, the algorithm is able
to gather and exploit map-information but the exploitation is far from optimal.
We open-source our experimental suite in the hopes that it serves as a
framework for the comparison of future algorithms and leads to the discovery of
robust alternatives to classical navigation methods.
https://arxiv.org/abs/1802.02274 (8585kb)

Title: Evaluation of Deep Reinforcement Learning Methods for Modular Robots
Authors: Risto Kojcev, Nora Etxezarreta, Alejandro Hernández and Víctor
Mayoral
Categories: cs.RO

We propose a novel framework for Deep Reinforcement Learning (DRL) in modular
robotics using traditional robotic tools that extend state-of-the-art DRL
implementations and provide an end-to-end approach which trains a robot
directly from joint states. Moreover, we present a novel technique to transfer
these DRL methods into the real robot, aiming to close the simulation-reality
gap. We demonstrate the robustness of the performance of state-of-the-art DRL
methods for continuous action spaces in modular robots, with an empirical study
both in simulation and in the real robot where we also evaluate how
accelerating the simulation time affects the robot’s performance. Our results
show that extending the modular robot from 3 degrees-of-freedom (DoF) to 4
DoF does not affect the robot’s learning. This paves the way towards training
modular robots using DRL techniques.
https://arxiv.org/abs/1802.02395 (3128kb)

Title: From Game-theoretic Multi-agent Log Linear Learning to Reinforcement
Learning
Categories: cs.LG cs.MA

Multi-agent Systems (MASs) have found a variety of industrial applications
from economics to robotics, owing to their high adaptability, scalability and
applicability. However, with the increasing complexity of MASs, multi-agent
control has become a challenging problem to solve. Among different approaches
to deal with this complex problem, game theoretic learning recently has
received researchers’ attention as a possible solution. In such learning
scheme, by playing a game, each agent eventually discovers a solution on its
own. The main focus of this paper is on enhancement of two types of
game-theoretic learning algorithms: log linear learning and reinforcement
learning. Each algorithm proposed in this paper relaxes and imposes different
assumptions to fit a class of MAS problems. Numerical experiments are also
conducted to verify each algorithm’s robustness and performance.
https://arxiv.org/abs/1802.02277 (919kb)

### Arxiv on Feb. 7th

07Feb18

Title: Shared Autonomy via Deep Reinforcement Learning
Authors: Siddharth Reddy, Sergey Levine, Anca Dragan
Categories: cs.LG cs.HC cs.RO

In shared autonomy, user input is combined with semi-autonomous control to
achieve a common goal. The goal is often unknown ex-ante, so prior work enables
agents to infer the goal from user input and assist with the task. Such methods
tend to assume some combination of knowledge of the dynamics of the
environment, the user’s policy given their goal, and the set of possible goals
the user might target, which limits their application to real-world scenarios.
We propose a deep reinforcement learning framework for model-free shared
autonomy that lifts these assumptions. We use human-in-the-loop reinforcement
learning with neural network function approximation to learn an end-to-end
mapping from environmental observation and user input to agent action, with
task reward as the only form of supervision. Controlled studies with users (n =
16) and synthetic pilots playing a video game and flying a real quadrotor
demonstrate the ability of our algorithm to assist users with real-time control
tasks in which the agent cannot directly access the user’s private information
through observations, but receives a reward signal and user input that both
depend on the user’s intent. The agent learns to assist the user without access
to this private information, implicitly inferring it from the user’s input.
This allows the assisted user to complete the task more effectively than the
user or an autonomous agent could on their own. This paper is a proof of
concept that illustrates the potential for deep reinforcement learning to
enable flexible and practical assistive systems.
https://arxiv.org/abs/1802.01744 (2078kb)

### Arxiv on Feb. 6th

06Feb18

Title: Coordinated Exploration in Concurrent Reinforcement Learning
Authors: Maria Dimakopoulou, Benjamin Van Roy
Categories: cs.AI

We consider a team of reinforcement learning agents that concurrently learn
to operate in a common environment. We identify three properties – adaptivity,
commitment, and diversity – which are necessary for efficient coordinated
exploration and demonstrate that straightforward extensions to single-agent
optimistic and posterior sampling approaches fail to satisfy them. As an
alternative, we propose seed sampling, which extends posterior sampling in a
manner that meets these requirements. Simulation results investigate how
per-agent regret decreases as the number of agents grows, establishing
substantial advantages of seed sampling over alternative exploration schemes.
https://arxiv.org/abs/1802.01282 (441kb)
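
The core of seed sampling is that each agent commits to an intrinsic random seed that maps the shared posterior to its own deterministic sample, giving diversity (seeds differ across agents) and commitment (each seed is fixed over time). A toy sketch with a hypothetical independent-Gaussian posterior over Q-values:

```python
import numpy as np

def seeded_q_sample(seed, post_mean, post_std):
    """Map an agent's fixed seed to a deterministic draw from the current
    (per-entry Gaussian) posterior over Q-values: re-invoking with the same
    seed and posterior reproduces the same sample."""
    rng = np.random.default_rng(seed)
    return post_mean + post_std * rng.standard_normal(post_mean.shape)
```

As the shared posterior concentrates with more data, all seeds map to nearly the same sample, so the agents adapt while still exploring distinct regions early on.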

Title: IMPALA: Scalable Distributed Deep-RL with Importance Weighted
Actor-Learner Architectures
Authors: Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir
Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane
Legg, Koray Kavukcuoglu
Categories: cs.LG cs.AI

In this work we aim to solve a large collection of tasks using a single
reinforcement learning agent with a single set of parameters. A key challenge
is to handle the increased amount of data and extended training time, which is
already a problem in single task learning. We have developed a new distributed
agent IMPALA (Importance-Weighted Actor Learner Architecture) that can scale to
thousands of machines and achieve a throughput rate of 250,000 frames per
second. We achieve stable learning at high throughput by combining decoupled
acting and learning with a novel off-policy correction method called V-trace,
which was critical for achieving learning stability. We demonstrate the
effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a
set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and
Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare
et al., 2013a)). Our results show that IMPALA is able to achieve better
performance than previous agents with less data, and crucially exhibits positive
transfer between tasks as a result of its multi-task approach.
https://arxiv.org/abs/1802.01561 (4027kb)
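
The V-trace correction can be sketched via its backward recursion; a minimal sketch of the targets $v_s$ with clipped importance weights (the learner update and the policy-gradient use of these targets are omitted):

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Backward recursion for V-trace targets:
    v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})),
    with delta_s = min(rho_s, rho_bar) * (r_s + gamma * V(x_{s+1}) - V(x_s))."""
    T = len(rewards)
    vs = np.zeros(T)
    next_vs, next_value = bootstrap, bootstrap
    for t in reversed(range(T)):
        delta = min(rhos[t], rho_bar) * (rewards[t] + gamma * next_value - values[t])
        vs[t] = values[t] + delta + gamma * min(rhos[t], c_bar) * (next_vs - next_value)
        next_vs, next_value = vs[t], values[t]
    return vs
```

Clipping `rho` bounds the fixed point of the targets, while clipping `c` controls the variance of the trace, which is what lets decoupled actors lag behind the learner without destabilizing training.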

Title: Multimodal Sentiment Analysis with Word-Level Fusion and Reinforcement
Learning
Authors: Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir
Categories: cs.LG cs.AI cs.CL
Comments: ICMI 2017 Oral Presentation, Honorable Mention Award

With the increasing popularity of video sharing websites such as YouTube and
Facebook, multimodal sentiment analysis has received increasing attention from
the scientific community. Contrary to previous works in multimodal sentiment
analysis which focus on holistic information in speech segments such as bag of
words representations and average facial expression intensity, we develop a
novel deep architecture for multimodal sentiment analysis that performs
modality fusion at the word level. In this paper, we propose the Gated
Multimodal Embedding LSTM with Temporal Attention (GME-LSTM(A)) model that is
composed of 2 modules. The Gated Multimodal Embedding alleviates the
difficulties of fusion when there are noisy modalities. The LSTM with Temporal
Attention performs word level fusion at a finer fusion resolution between input
modalities and attends to the most important time steps. As a result, the
GME-LSTM(A) is able to better model the multimodal structure of speech through
time and perform better sentiment comprehension. We demonstrate the
effectiveness of this approach on the publicly-available Multimodal Corpus of
Sentiment Intensity and Subjectivity Analysis (CMU-MOSI) dataset by achieving
state-of-the-art sentiment classification and regression results. Qualitative
analysis on our model emphasizes the importance of the Temporal Attention Layer
in sentiment prediction because the additional acoustic and visual modalities
are noisy. We also demonstrate the effectiveness of the Gated Multimodal
Embedding in selectively filtering these noisy modalities out. Our results and
analysis open new areas in the study of sentiment analysis in human
communication and provide new models for multimodal fusion.
https://arxiv.org/abs/1802.00924 (3382kb)

### Arxiv on Feb. 5th

05Feb18

Well, I will cheat a little today since there is no paper related to Reinforcement Learning on Arxiv. As a consequence, I browsed the digest from Feb. 2nd…

Title: Elements of Effective Deep Reinforcement Learning towards Tactical
Driving Decision Making
Authors: Jingchu Liu, Pengfei Hou, Lisen Mu, Yinan Yu, Chang Huang
Categories: cs.AI cs.LG