### Arxiv on Mar. 2nd

02Mar18

Title: Towards Cooperation in Sequential Prisoner’s Dilemmas: a Deep Multiagent
Reinforcement Learning Approach
Authors: Weixun Wang, Jianye Hao, Yixi Wang, Matthew Taylor
Categories: cs.AI cs.GT cs.LG cs.MA

The Iterated Prisoner’s Dilemma has guided research on social dilemmas for
decades. However, it distinguishes between only two atomic actions: cooperate
and defect. In real-world prisoner’s dilemmas, these choices are temporally
extended and different strategies may correspond to sequences of actions,
reflecting grades of cooperation. We introduce a Sequential Prisoner’s Dilemma
(SPD) game to better capture the aforementioned characteristics. In this work,
we propose a deep multiagent reinforcement learning approach that investigates
the evolution of mutual cooperation in SPD games. Our approach consists of two
phases. The first phase is offline: it synthesizes policies with different
cooperation degrees and then trains a cooperation degree detection network. The
second phase is online: an agent adaptively selects its policy based on the
detected degree of opponent cooperation. The effectiveness of our approach is
demonstrated in two representative SPD 2D games: the Apple-Pear game and the
Fruit Gathering game. Experimental results show that our strategy can avoid
being exploited by exploitative opponents and achieve cooperation with
cooperative opponents.
https://arxiv.org/abs/1803.00162 ,  6332kb)
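The online phase above amounts to a policy lookup keyed by the detected cooperation degree; a minimal sketch, assuming a nearest-degree selection rule (the rule and names are illustrative, not taken from the paper):

```python
def select_policy(policies_by_degree, detected_degree):
    """Online phase (sketch): given policies synthesized offline at different
    cooperation degrees, pick the one whose degree is closest to the
    cooperation degree detected for the opponent."""
    best_degree = min(policies_by_degree, key=lambda d: abs(d - detected_degree))
    return policies_by_degree[best_degree]
```

With policies at degrees 0.0, 0.5, and 1.0, a detected degree of 0.9 selects the fully cooperative policy, which is how the strategy reciprocates cooperative opponents while avoiding exploitation.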

Title: Deep Reinforcement Learning for Sponsored Search Real-time Bidding
Authors: Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, Xiaofei He
Categories: cs.AI

Bidding optimization is one of the most critical problems in online
advertising. Sponsored search (SS) auction, due to the randomness of user query
behavior and platform nature, usually adopts keyword-level bidding strategies.
In contrast, the display advertising (DA), as a relatively simpler scenario for
auction, has taken advantage of real-time bidding (RTB) to boost the
performance for advertisers. In this paper, we consider the RTB problem in
sponsored search auction, named SS-RTB. SS-RTB has a much more complex dynamic
environment, due to stochastic user query behavior and more complex bidding
policies based on multiple keywords of an ad. Most previous methods for DA
cannot be applied. We propose a reinforcement learning (RL) solution for
handling the complex dynamic environment. Although some RL methods have been
proposed in this domain, they do not address the “environment
changing” problem: the state transition probabilities vary between two days.
Motivated by the observation that auction sequences of two days share similar
transition patterns at a proper aggregation level, we formulate a robust MDP
model at hour-aggregation level of the auction data and propose a
control-by-model framework for SS-RTB. Rather than generating bid prices
directly, we decide a bidding model for impressions of each hour and perform
real-time bidding accordingly. We also extend the method to handle the
multi-agent problem. We deployed the SS-RTB system in the e-commerce search
auction platform of Alibaba. Empirical experiments of offline evaluation and
online A/B test demonstrate the effectiveness of our method.
https://arxiv.org/abs/1803.00259 ,  1972kb)

Title: Model-Based Value Estimation for Efficient Model-Free Reinforcement
Learning
Authors: Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E.
Gonzalez, Sergey Levine
Categories: cs.LG cs.AI stat.ML

Recent model-free reinforcement learning algorithms have proposed
incorporating learned dynamics models as a source of additional data with the
intention of reducing sample complexity. Such methods hold the promise of
incorporating imagined data coupled with a notion of model uncertainty to
accelerate the learning of continuous control tasks. Unfortunately, they rely
on heuristics that limit usage of the dynamics model. We present model-based
value expansion, which controls for uncertainty in the model by only allowing
imagination to fixed depth. By enabling wider use of learned dynamics models
within a model-free reinforcement learning algorithm, we improve value
estimation, which, in turn, reduces the sample complexity of learning.
https://arxiv.org/abs/1803.00101 ,  4856kb)
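The fixed-depth imagination idea can be sketched as a k-step rollout of the learned model, with the model-free value estimate applied only at the horizon (all function names here are illustrative placeholders, not the authors' API):

```python
def mve_target(state, dynamics, reward_fn, value_fn, policy, horizon, gamma=0.99):
    """Model-based value expansion (sketch): imagine `horizon` steps with the
    learned dynamics model, then fall back on the model-free value estimate.
    Capping the depth limits how far model error can compound."""
    total, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward_fn(s, a)
        s = dynamics(s, a)  # imagined transition
        discount *= gamma
    return total + discount * value_fn(s)
```

The improved target is then used wherever the model-free learner would otherwise bootstrap from `value_fn(state)` directly.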

Title: Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal
Modeling
Authors: Adrian Šošić, Elmar Rueckert, Jan Peters, Abdelhak M.
Zoubir, Heinz Koeppl
Categories: stat.ML cs.AI cs.LG cs.RO cs.SY

Recent advances in the field of inverse reinforcement learning (IRL) have
yielded sophisticated frameworks which relax the original modeling assumption
that the behavior of an observed agent reflects only a single intention.
Instead, the demonstration data is typically divided into parts, to account for
the fact that different trajectories may correspond to different intentions,
e.g., because they were generated by different domain experts. In this work, we
go one step further: using the intuitive concept of subgoals, we build upon the
premise that even a single trajectory can be explained more efficiently locally
within a certain context than globally, enabling a more compact representation
of the observed behavior. Based on this assumption, we build an implicit
intentional model of the agent’s goals to forecast its behavior in unobserved
situations. The result is an integrated Bayesian prediction framework which
provides smooth policy estimates that are consistent with the expert’s plan and
significantly outperform existing IRL solutions. Most notably, our framework
naturally handles situations where the intentions of the agent change with time
and classical IRL algorithms fail. In addition, due to its probabilistic
nature, the model can be straightforwardly applied in an active learning
setting to guide the demonstration process of the expert.
https://arxiv.org/abs/1803.00444 ,  1789kb)

### Arxiv on Mar. 1st

02Mar18

Title: Selective Experience Replay for Lifelong Learning
Authors: David Isele, Akansel Cosgun
Categories: cs.AI
Comments: Presented in 32nd Conference on Artificial Intelligence (AAAI 2018)

Deep reinforcement learning has emerged as a powerful tool for a variety of
learning tasks; however, deep nets typically exhibit forgetting when learning
multiple tasks in sequence. To mitigate forgetting, we propose an experience
replay process that augments the standard FIFO buffer and selectively stores
experiences in a long-term memory. We explore four strategies for selecting
which experiences will be stored: favoring surprise, favoring reward, matching
the global training distribution, and maximizing coverage of the state space.
We show that distribution matching successfully prevents catastrophic
forgetting, and is consistently the best approach on all domains tested. While
distribution matching has better and more consistent performance, we identify
one case in which coverage maximization is beneficial – when tasks that receive
less training are more important. Overall, our results show that selective
experience replay, when suitable selection algorithms are employed, can prevent
catastrophic forgetting.
https://arxiv.org/abs/1802.10269 ,  4709kb)
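Of the four selection strategies, matching the global training distribution is commonly realized with reservoir sampling; a minimal sketch (the paper's exact mechanism may differ):

```python
import random

class ReservoirBuffer:
    """Long-term memory whose contents approximate a uniform sample of the
    whole training stream (distribution matching via reservoir sampling)."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.storage = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, experience):
        self.n_seen += 1
        if len(self.storage) < self.capacity:
            self.storage.append(experience)
        else:
            # Replace a stored item with probability capacity / n_seen, which
            # keeps every experience seen so far equally likely to be retained.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.storage[j] = experience
```

Because retention probability is uniform over the stream, tasks trained long ago remain represented, which is exactly what prevents the forgetting described above.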

Title: Model-Ensemble Trust-Region Policy Optimization
Authors: Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter
Abbeel
Categories: cs.AI cs.LG cs.RO

Model-free reinforcement learning (RL) methods are succeeding in a growing
number of tasks, aided by recent advances in deep learning. However, they tend
to suffer from high sample complexity, which hinders their use in real-world
domains. Alternatively, model-based reinforcement learning promises to reduce
sample complexity, but tends to require careful tuning and to date has
succeeded mainly in restrictive domains where simple models are sufficient for
learning. In this paper, we analyze the behavior of vanilla model-based
reinforcement learning methods when deep neural networks are used to learn both
the model and the policy, and show that the learned policy tends to exploit
regions where insufficient data is available for the model to be learned,
causing instability in training. To overcome this issue, we propose to use an
ensemble of models to maintain the model uncertainty and regularize the
learning process. We further show that the use of likelihood ratio derivatives
yields much more stable learning than backpropagation through time. Altogether,
our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO)
significantly reduces the sample complexity compared to model-free deep RL
methods on challenging continuous control benchmark tasks.
https://arxiv.org/abs/1802.10592 ,  7192kb)
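The core regularization idea, letting imagined rollouts alternate between ensemble members so the policy cannot exploit any single model's errors, can be sketched as follows (illustrative, not the authors' code):

```python
import random

import numpy as np

class DynamicsEnsemble:
    """Sketch of the model-ensemble trick: every imagined step is taken with a
    randomly chosen member, so a policy that exploits one model's errors in
    data-poor regions is penalized by the other members."""
    def __init__(self, models, seed=0):
        self.models = models
        self.rng = random.Random(seed)

    def step(self, state, action):
        return self.rng.choice(self.models)(state, action)

    def disagreement(self, state, action):
        """Member disagreement; high where training data was scarce."""
        preds = np.array([m(state, action) for m in self.models])
        return float(preds.std(axis=0).mean())
```

Disagreement can also serve as a validation signal for when to stop policy improvement against the imagined data.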

Title: Deep Reinforcement Learning for Vision-Based Robotic Grasping: A
Simulated Comparative Evaluation of Off-Policy Methods
Authors: Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz,
Sergey Levine
Categories: cs.RO cs.LG stat.ML

In this paper, we explore deep reinforcement learning algorithms for
vision-based robotic grasping. Model-free deep reinforcement learning (RL) has
been successfully applied to a range of challenging environments, but the
proliferation of algorithms makes it difficult to discern which particular
approach would be best suited for a rich, diverse task like grasping. To answer
this question, we propose a simulated benchmark for robotic grasping that
emphasizes off-policy learning and generalization to unseen objects. Off-policy
learning enables utilization of grasping data over a wide variety of objects,
and diversity is important to enable the method to generalize to new objects
that were not seen during training. We evaluate the benchmark tasks against a
variety of Q-function estimation methods, a method previously proposed for
robotic grasping with deep neural network models, and a novel approach based on
a combination of Monte Carlo return estimation and an off-policy correction.
Our results indicate that several simple methods provide a surprisingly strong
competitor to popular algorithms such as double Q-learning, and our analysis of
stability sheds light on the relative tradeoffs between the algorithms.
https://arxiv.org/abs/1802.10264 ,  6883kb)

Title: DiGrad: Multi-Task Reinforcement Learning with Shared Actions
Authors: Parijat Dewangan, S Phaniteja, K Madhava Krishna, Abhishek Sarkar,
Balaraman Ravindran
Categories: cs.LG cs.AI cs.RO stat.ML

Most reinforcement learning algorithms are inefficient for learning multiple
tasks in complex robotic systems, where different tasks share a set of actions.
In such environments, a compound policy with shared neural network parameters
may be learnt to perform multiple tasks concurrently. However, in such a
compound policy the gradient updates for different tasks may negate each
other, making the learning unstable and sometimes less data efficient. In this
paper, we propose a new approach for simultaneous training
of multiple tasks sharing a set of common actions in continuous action spaces,
learning in a single actor-critic network. We also propose a simple heuristic
in the differential policy gradient update to further improve the learning. The
proposed architecture was tested on 8 link planar manipulator and 27 degrees of
freedom(DoF) Humanoid for learning multi-goal reachability tasks for 3 and 2
end effectors respectively. We show that our approach supports efficient
multi-task learning in complex robotic systems, outperforming related methods
in continuous action spaces.
https://arxiv.org/abs/1802.10463 ,  442kb)

Title: Learning by Playing – Solving Sparse Reward Tasks from Scratch
Authors: Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas
Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, Jost Tobias
Springenberg
Categories: stat.ML cs.LG cs.RO
Comments: A video of the rich set of learned behaviours can be found at
https://youtu.be/mPKyvocNe_M

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in
the context of Reinforcement Learning (RL). SAC-X enables learning of complex
behaviors – from scratch – in the presence of multiple sparse reward signals.
To this end, the agent is equipped with a set of general auxiliary tasks, that
it attempts to learn simultaneously via off-policy RL. The key idea behind our
method is that active (learned) scheduling and execution of auxiliary policies
allows the agent to efficiently explore its environment – enabling it to excel
at sparse reward RL. Our experiments in several challenging robotic
manipulation settings demonstrate the power of our approach.
https://arxiv.org/abs/1802.10567 ,  8663kb)

### Arxiv on Feb. 28th

28Feb18

Title: Modeling Others using Oneself in Multi-Agent Reinforcement Learning
Authors: Roberta Raileanu, Emily Denton, Arthur Szlam, Rob Fergus
Categories: cs.AI cs.LG
Comments: 9 pages, 9 figures, submitted to ICML 2018

We consider the multi-agent reinforcement learning setting with imperfect
information in which each agent is trying to maximize its own utility. The
reward function depends on the hidden state (or goal) of both agents, so the
agents must infer the other players’ hidden goals from their observed behavior
in order to solve the tasks. We propose a new approach for learning in these
domains: Self Other-Modeling (SOM), in which an agent uses its own policy to
predict the other agent’s actions and update its belief of their hidden state
in an online manner. We evaluate this approach on three different tasks and
show that the agents are able to learn better policies using their estimate of
the other players’ hidden states, in both cooperative and adversarial settings.
https://arxiv.org/abs/1802.09640 ,  1597kb)
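The inference step of SOM can be sketched as scoring candidate hidden goals by how likely one's own goal-conditioned policy makes the opponent's observed actions; the enumeration over a discrete set of candidate goals is a simplifying assumption here, whereas the paper updates the belief online:

```python
import numpy as np

def infer_goal(own_policy, observed_state_actions, candidate_goals):
    """Self Other-Modeling, in spirit: score each candidate hidden goal by the
    log-likelihood that *my own* policy, conditioned on that goal, would have
    produced the opponent's observed actions; keep the best."""
    def log_lik(goal):
        return sum(np.log(own_policy(s, goal)[a])
                   for s, a in observed_state_actions)
    return max(candidate_goals, key=log_lik)
```

The inferred goal then conditions the agent's own policy when choosing its next action.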

Title: Reinforcement and Imitation Learning for Diverse Visuomotor Skills
Authors: Yuke Zhu, Ziyu Wang, Josh Merel, Andrei Rusu, Tom Erez, Serkan Cabi,
Saran Tunyasuvunakool, János Kramár, Raia Hadsell, Nando de Freitas,
Nicolas Heess
Categories: cs.RO cs.AI cs.LG

We propose a model-free deep reinforcement learning method that leverages a
small amount of demonstration data to assist a reinforcement learning agent. We
apply this approach to robotic manipulation tasks and train end-to-end
visuomotor policies that map directly from RGB camera inputs to joint
velocities. We demonstrate that our approach can solve a wide variety of
visuomotor tasks, for which engineering a scripted controller would be
laborious. Our experiments indicate that our reinforcement and imitation agent
achieves significantly better performance than agents trained with
reinforcement learning or imitation learning alone. We also illustrate that
these policies, trained with large visual and dynamics variations, can achieve
preliminary successes in zero-shot sim2real transfer. A brief visual
description of this work can be viewed in https://youtu.be/EDl8SQUNjj0
https://arxiv.org/abs/1802.09564 ,  7740kb)

Title: Real-Time Bidding with Multi-Agent Reinforcement Learning in Display
Advertising
Authors: Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, Weinan Zhang
Categories: stat.ML cs.AI cs.LG

Real-time advertising allows advertisers to bid for each impression for a
visiting user. To optimize a specific goal such as maximizing the revenue led
by ad placements, advertisers not only need to estimate the relevance between
the ads and user’s interests, but most importantly require a strategic response
with respect to other advertisers bidding in the market. In this paper, we
formulate bidding optimization with multi-agent reinforcement learning. To deal
with a large number of advertisers, we propose a clustering method and assign
each cluster with a strategic bidding agent. A practical Distributed
Coordinated Multi-Agent Bidding (DCMAB) has been proposed and implemented to
balance the tradeoff between competition and cooperation among advertisers.
The empirical study on our industry-scaled real-world data has demonstrated the
effectiveness of our modeling methods. Our results show that a cluster based
bidding would largely outperform single-agent and bandit approaches, and the
coordinated bidding achieves better overall objectives than the purely
self-interested bidding agents.
https://arxiv.org/abs/1802.09756 ,  1469kb)

### Arxiv on Feb. 27th

28Feb18

Title: Reinforcement Learning on Web Interfaces Using Workflow-Guided
Exploration
Authors: Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy
Liang
Categories: cs.AI
Comments: International Conference on Learning Representations (ICLR), 2018

Reinforcement learning (RL) agents improve through trial-and-error, but when
reward is sparse and the agent cannot discover successful action sequences,
learning stagnates. This has been a notable problem in training deep RL agents
to perform web-based tasks, such as booking flights or replying to emails,
where a single mistake can ruin the entire sequence of actions. A common remedy
is to “warm-start” the agent by pre-training it to mimic expert demonstrations,
but this is prone to overfitting. Instead, we propose to constrain exploration
using demonstrations. From each demonstration, we induce high-level “workflows”
which constrain the allowable actions at each time step to be similar to those
in the demonstration (e.g., “Step 1: click on a textbox; Step 2: enter some
text”). Our exploration policy then learns to identify successful workflows and
samples actions that satisfy these workflows. Workflows prune out bad
exploration directions and accelerate the agent’s ability to discover rewards.
We use our approach to train a novel neural policy designed to handle the
semi-structured nature of websites, and evaluate on a suite of web tasks,
including the recent World of Bits benchmark. We achieve new state-of-the-art
results, and show that workflow-guided exploration improves sample efficiency
over behavioral cloning by more than 100x.
https://arxiv.org/abs/1802.08802 ,  362kb)
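The pruning idea can be sketched as restricting the sampled action set to those matching the current workflow step; the predicate representation and names are illustrative:

```python
import random

def workflow_guided_sample(candidate_actions, workflow_step, rng):
    """Workflow-guided exploration (sketch): only actions that satisfy the
    current high-level step (e.g. "click on a textbox") may be sampled,
    pruning bad exploration directions; fall back to unconstrained sampling
    if nothing matches."""
    allowed = [a for a in candidate_actions if workflow_step(a)]
    return rng.choice(allowed) if allowed else rng.choice(candidate_actions)
```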

Title: Variance Reduction Methods for Sublinear Reinforcement Learning
Authors: Sham Kakade, Mengdi Wang, Lin F. Yang
Categories: cs.AI cs.LG stat.ML

This work considers the problem of provably optimal reinforcement learning
for (episodic) finite horizon MDPs, i.e. how an agent learns to maximize
his/her (long term) reward in an uncertain environment. The main contribution
is in providing a novel algorithm — Variance-reduced Upper Confidence
Q-learning (vUCQ) — which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT}
+ H^5SA)$, where $T$ is the number of time steps the agent acts in the MDP,
$S$ is the number of states, $A$ is the number of actions, and $H$ is the
(episodic) horizon time.
This is the first regret bound that is both sub-linear in the model size and
asymptotically optimal. The algorithm is sub-linear in that the time to achieve
$\epsilon$-average regret (for any constant $\epsilon$) is $O(SA)$, which is a
number of samples that is far less than that required to learn any
(non-trivial) estimate of the transition model (the transition model is
specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is
largely the motivation for algorithms such as $Q$-learning and other “model
free” approaches. vUCQ algorithm also enjoys minimax optimal regret in the long
run, matching the $\Omega(\sqrt{HSAT})$ lower bound.
Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive
refinement method in which the algorithm reduces the variance in $Q$-value
estimates and couples this estimation scheme with an upper confidence based
algorithm. Technically, the coupling of both of these techniques is what leads
to the algorithm enjoying both the sub-linear regret property and the
(asymptotically) optimal regret.
https://arxiv.org/abs/1802.09184 ,  539kb)

Title: The AdobeIndoorNav Dataset: Towards Deep Reinforcement Learning based
Real-world Indoor Robot Visual Navigation
Authors: Kaichun Mo, Haoxiang Li, Zhe Lin and Joon-Young Lee
Categories: cs.RO

Deep reinforcement learning (DRL) demonstrates its potential in learning a
navigation policy from raw visual inputs. However, this data-demanding
algorithm relies on a large number of navigation trajectories in
training. Existing datasets supporting training such robot navigation
algorithms consist of either 3D synthetic scenes or reconstructed scenes.
Synthetic data suffers from domain gap to the real-world scenes while visual
inputs rendered from 3D reconstructed scenes have undesired holes and
artifacts. In this paper, we present a new dataset collected in real-world to
facilitate the research in DRL based visual navigation. Our dataset includes 3D
reconstruction for real-world scenes as well as densely captured real 2D images
from the scenes. It provides high-quality visual inputs with real-world scene
complexity to the robot at dense grid locations. We further study and benchmark
one recent DRL based navigation algorithm and present our attempts and thoughts
on improving its generalizability to unseen test targets in the scenes.
https://arxiv.org/abs/1802.08824 ,  7213kb)

Title: Fully Decentralized Multi-Agent Reinforcement Learning with Networked
Agents
Authors: Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar
Categories: cs.LG cs.AI cs.MA math.OC stat.ML

We consider the problem of *fully decentralized* multi-agent
reinforcement learning (MARL), where the agents are located at the nodes of a
time-varying communication network. Specifically, we assume that the reward
functions of the agents might correspond to different tasks, and are only known
to the corresponding agent. Moreover, each agent makes individual decisions
based on both the information observed locally and the messages received from
its neighbors over the network. Within this setting, the collective goal of the
agents is to maximize the globally averaged return over the network through
exchanging information with their neighbors. To this end, we propose two
decentralized actor-critic algorithms with function approximation, which are
applicable to large-scale MARL problems where both the number of states and the
number of agents are massively large. Under the decentralized structure, the
actor step is performed individually by each agent with no need to infer the
policies of others. For the critic step, we propose a consensus update via
communication over the network. Our algorithms are fully incremental and can be
implemented in an online fashion. Convergence analyses of the algorithms are
provided when the value functions are approximated within the class of linear
functions. Extensive simulation results with both linear and nonlinear function
approximations are presented to validate the proposed algorithms. Our work
appears to be the first study of fully decentralized MARL algorithms for
networked agents with function approximation, with provable convergence
guarantees.
https://arxiv.org/abs/1802.08757 ,  1867kb)
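The critic's consensus update can be sketched as one mixing step over the communication network, here assuming a doubly-stochastic mixing matrix (a standard construction; the details are illustrative):

```python
import numpy as np

def consensus_step(params, mixing_matrix):
    """One consensus update: every agent (row) replaces its critic parameter
    vector with a weighted average of its neighbours'. Repeated application
    with a doubly-stochastic mixing matrix drives all rows toward the
    network-wide mean, without any agent sharing its policy."""
    return mixing_matrix @ params
```

Interleaving this step with each agent's local TD update is what lets the critics agree on the globally averaged return while actors stay fully local.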

Title: Multi-Goal Reinforcement Learning: Challenging Robotics Environments and
Request for Research
Authors: Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen
Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter
Welinder, Vikash Kumar, Wojciech Zaremba
Categories: cs.LG cs.AI cs.RO

The purpose of this technical report is two-fold. First of all, it introduces
a suite of challenging continuous control tasks (integrated with OpenAI Gym)
based on currently existing robotics hardware. The tasks include pushing,
sliding and pick & place with a Fetch robotic arm as well as in-hand object
manipulation with a Shadow Dexterous Hand. All tasks have sparse binary rewards
and follow a Multi-Goal Reinforcement Learning (RL) framework in which an agent
is told what to do using an additional input.
The second part of the paper presents a set of concrete research ideas for
improving RL algorithms, most of which are related to Multi-Goal RL and
Hindsight Experience Replay.
https://arxiv.org/abs/1802.09464 ,  1502kb)
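Since the report pairs sparse binary rewards with a Multi-Goal framing, its tasks are natural targets for Hindsight Experience Replay; the relabeling step can be sketched as follows (the dictionary keys are illustrative):

```python
def hindsight_relabel(episode, reward_fn):
    """Hindsight Experience Replay relabeling (sketch): replay a failed
    episode as if the intended goal had been the state actually achieved at
    the end, turning a trajectory with no reward into a useful one."""
    achieved = episode[-1]["achieved_goal"]
    relabeled = []
    for step in episode:
        new = dict(step)
        new["goal"] = achieved
        new["reward"] = reward_fn(step["achieved_goal"], achieved)
        relabeled.append(new)
    return relabeled
```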

### Arxiv on Feb. 26th

28Feb18

Title: Budget Constrained Bidding by Model-free Reinforcement Learning in
Display Advertising
Authors: Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Kun
Gai
Categories: cs.AI

Real-time bidding (RTB) is almost the most important mechanism in online
display advertising, where proper bid for each page view plays a vital and
essential role for good marketing results. Budget constrained bidding is a
typical scenario in RTB mechanism where the advertisers hope to maximize total
value of winning impressions under a pre-set budget constraint. However, the
optimal strategy is hard to be derived due to complexity and volatility of the
auction environment. To address the challenges, in this paper, we formulate
budget constrained bidding as a Markov Decision Process. Quite different from
prior model-based work, we propose a novel framework based on model-free
reinforcement learning which sequentially regulates the bidding parameter
rather than directly producing bid. Along this line, we further innovate a
reward function which deploys a deep neural network to learn appropriate reward
and thus leads the agent to deliver the optimal policy effectively; we also
design an adaptive $\epsilon$-greedy strategy which adjusts the exploration
behaviour dynamically and further improves the performance. Experimental
results on real dataset demonstrate the effectiveness of our framework.
https://arxiv.org/abs/1802.08365 ,  1343kb)
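The abstract does not spell out the adaptive $\epsilon$-greedy rule, but one common variant shrinks exploration while performance improves and grows it again when learning stalls; a sketch under that assumption:

```python
class AdaptiveEpsilonGreedy:
    """Illustrative adaptive exploration schedule (one common variant, not
    necessarily the paper's rule): decrease epsilon while recent performance
    improves, increase it when performance stalls."""
    def __init__(self, eps=1.0, eps_min=0.01, eps_max=1.0, step=0.05):
        self.eps, self.eps_min, self.eps_max, self.step = eps, eps_min, eps_max, step

    def update(self, improved):
        if improved:
            self.eps = max(self.eps_min, self.eps - self.step)  # exploit more
        else:
            self.eps = min(self.eps_max, self.eps + self.step)  # explore more
        return self.eps
```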

Title: Ranking Sentences for Extractive Summarization with Reinforcement
Learning
Authors: Shashi Narayan, Shay B. Cohen, Mirella Lapata
Categories: cs.CL
Comments: NAACL 2018, Accepted version, 11 pages

Single document summarization is the task of producing a shorter version of a
document while preserving its principal information content. In this paper we
conceptualize extractive summarization as a sentence ranking task and propose a
novel training algorithm which globally optimizes the ROUGE evaluation metric
through a reinforcement learning objective. We use our algorithm to train a
neural summarization model on the CNN and DailyMail datasets and demonstrate
experimentally that it outperforms state-of-the-art extractive and abstractive
systems when evaluated automatically and by humans.
https://arxiv.org/abs/1802.08636 ,  50kb)

Title: Weighted Double Deep Multiagent Reinforcement Learning in Stochastic
Cooperative Environments
Authors: Yan Zheng, Jianye Hao, Zongzhang Zhang
Categories: cs.MA cs.AI cs.LG

Although single-agent deep reinforcement learning has achieved significant
success, due in part to the experience replay mechanism, its assumptions must
be reconsidered in multiagent environments. This work focuses on stochastic
cooperative environments. We apply a specific adaptation to a recently
proposed weighted
double estimator and propose a multiagent deep reinforcement learning
framework, named Weighted Double Deep Q-Network (WDDQN). To achieve efficient
cooperation, *Lenient Reward Network* and *Mixture Replay
Strategy* are introduced. By utilizing the deep neural network and the weighted
double estimator, WDDQN can not only reduce the bias effectively but also be
extended to many deep RL scenarios with only raw pixel images as input.
Empirically, the WDDQN outperforms the existing DRL algorithm (double DQN) and
multiagent RL algorithm (lenient Q-learning) in terms of performance and
convergence within stochastic cooperative environments.
https://arxiv.org/abs/1802.08534 ,  1614kb)

Title: Structured Control Nets for Deep Reinforcement Learning
Authors: Mario Srouji, Jian Zhang, Ruslan Salakhutdinov
Categories: cs.LG cs.AI cs.RO
Comments: First two authors contributed equally

In recent years, deep reinforcement learning has made impressive advances in
solving several important benchmark problems for sequential decision making.
Many control applications use a generic multilayer perceptron (MLP) for
non-vision parts of the policy network. In this work, we propose a new neural
network architecture for the policy network representation that is simple yet
effective. The proposed Structured Control Net (SCN) splits the generic MLP
into two separate sub-modules: a nonlinear control module and a linear control
module. Intuitively, the nonlinear control is for forward-looking and global
control, while the linear control stabilizes the local dynamics around the
residual of global control. We hypothesize that this will bring together the
benefits of both linear and nonlinear policies: improve training sample
efficiency, final episodic reward, and generalization of learned policy, while
requiring a smaller network and being generally applicable to different
training methods. We validated our hypothesis with competitive results on
simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban
driving environment, with various ablation and generalization tests, trained
with multiple black-box and policy gradient training methods. The proposed
architecture has the potential to improve upon broader control tasks by
incorporating problem specific priors into the architecture. As a case study,
we demonstrate much improved performance for locomotion tasks by emulating the
biological central pattern generators (CPGs) as the nonlinear part of the
architecture.
https://arxiv.org/abs/1802.08311 ,  2770kb)
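The split into a linear and a nonlinear sub-module is simple enough to sketch directly; the action is the sum of the two terms (sizes and initialization here are illustrative):

```python
import numpy as np

class StructuredControlNet:
    """Structured Control Net (sketch): the action is the sum of a linear
    term (stabilizing the local dynamics) and a small nonlinear MLP term
    (forward-looking, global control)."""
    def __init__(self, obs_dim, act_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.K = rng.normal(scale=0.1, size=(act_dim, obs_dim))    # linear module
        self.W1 = rng.normal(scale=0.1, size=(hidden, obs_dim))    # nonlinear module
        self.W2 = rng.normal(scale=0.1, size=(act_dim, hidden))

    def __call__(self, s):
        nonlinear = self.W2 @ np.tanh(self.W1 @ s)
        linear = self.K @ s
        return nonlinear + linear
```

The case study in the abstract swaps the MLP for a central pattern generator while keeping the same additive structure.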

### Arxiv on Feb. 23rd

28Feb18

Title: Convergent Actor-Critic Algorithms Under Off-Policy Training and
Function Approximation
Authors: Hamid Reza Maei
Categories: cs.AI

We present the first class of policy-gradient algorithms that work with both
state-value and policy function-approximation, and are guaranteed to converge
under off-policy training. Our solution targets problems in reinforcement
learning where the action representation adds to the-curse-of-dimensionality;
that is, with continuous or large action sets, thus making it infeasible to
estimate state-action value functions (Q functions). Using state-value
functions helps to lift the curse and as a result naturally turn our
policy-gradient solution into classical Actor-Critic architecture whose Actor
uses state-value function for the update. Our algorithms, Gradient Actor-Critic
and Emphatic Actor-Critic, are derived based on the exact gradient of averaged
state-value function objective and thus are guaranteed to converge to its
optimal solution, while maintaining all the desirable properties of classical
Actor-Critic methods with no additional hyper-parameters. To our knowledge,
this is the first time that convergent off-policy learning methods have been
extended to classical Actor-Critic methods with function approximation.
https://arxiv.org/abs/1802.07842 ,  286kb)

Title: Variational Inference for Policy Gradient
Authors: Tianbing Xu
Categories: cs.LG cs.AI stat.ML

Inspired by the seminal work on Stein Variational Inference and Stein
Variational Policy Gradient, we derived a method to generate samples from the
posterior variational parameter distribution by *explicitly* minimizing
the KL divergence to match the target distribution in an amortized fashion.
Consequently, we applied this variational inference technique to vanilla
policy gradient, TRPO, and PPO with Bayesian Neural Network parameterizations
for reinforcement learning problems.
https://arxiv.org/abs/1802.07833 ,  7kb)

### Arxiv on Feb. 22nd

28Feb18

Title: Clipped Action Policy Gradient
Authors: Yasuhiro Fujita and Shin-ichi Maeda
Categories: cs.LG cs.AI stat.ML

Many continuous control tasks have bounded action spaces and clip
out-of-bound actions before execution. Policy gradient methods often optimize
policies as if actions were not clipped. We propose the clipped action policy
gradient (CAPG) estimator, which exploits the knowledge of actions being
clipped to reduce the variance in estimation. We
prove that CAPG is unbiased and achieves lower variance than the original
estimator that ignores action bounds. Experimental results demonstrate that
CAPG generally outperforms the original estimator, indicating its promise as a
better policy gradient estimator.
https://arxiv.org/abs/1802.07564 ,  445kb)

### Arxiv on Feb. 21st

21Feb18

Title: Continual Reinforcement Learning with Complex Synapses
Authors: Christos Kaplanis, Murray Shanahan, Claudia Clopath
Categories: cs.AI cs.LG cs.NE

Unlike humans, who are capable of continual learning over their lifetimes,
artificial neural networks have long been known to suffer from a phenomenon
known as catastrophic forgetting, whereby new learning can lead to abrupt
erasure of previously acquired knowledge. Whereas in a neural network the
parameters are typically modelled as scalar values, an individual synapse in
the brain comprises a complex network of interacting biochemical components
that evolve at different timescales. In this paper, we show that by equipping
tabular and deep reinforcement learning agents with a synaptic model that
incorporates this biological complexity (Benna & Fusi, 2016), catastrophic
forgetting can be mitigated at multiple timescales. In particular, we find that
as well as enabling continual learning across sequential training of two simple
tasks, it can also be used to overcome within-task forgetting by reducing the
need for an experience replay database.
https://arxiv.org/abs/1802.07239 ,  1794kb)
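
A minimal sketch of the multi-timescale synapse idea (after Benna & Fusi, 2016): the visible weight is coupled to a chain of hidden variables whose interaction strengths shrink geometrically, so new learning only slowly leaks into the long-timescale variables and old knowledge decays gradually. The discrete-time update and all constants here are illustrative, not the paper's exact model.

```python
class BennaFusiSynapse:
    """Chain of coupled variables u_1..u_n with geometrically decaying
    coupling, giving memory decay on multiple timescales."""

    def __init__(self, n=4, g=0.5):
        self.u = [0.0] * n
        # coupling between neighbouring variables halves at each depth
        self.g = [g / (2 ** k) for k in range(n - 1)]

    @property
    def weight(self):
        return self.u[0]  # the visible synaptic weight

    def step(self, dw=0.0):
        # an external update (e.g. an RL gradient step) enters at u_1,
        # then diffuses down the chain
        u, g = self.u, self.g
        new = list(u)
        new[0] += dw + g[0] * (u[1] - u[0])
        for k in range(1, len(u) - 1):
            new[k] += g[k - 1] * (u[k - 1] - u[k]) + g[k] * (u[k + 1] - u[k])
        new[-1] += g[-1] * (u[-2] - u[-1])
        self.u = new
```

After sustained potentiation the hidden variables store a slow copy of the weight, so when updates stop the visible weight relaxes toward that stored value instead of being freely overwritten.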

Title: Meta-Reinforcement Learning of Structured Exploration Strategies
Authors: Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey
Levine
Categories: cs.LG cs.AI cs.NE

Exploration is a fundamental challenge in reinforcement learning (RL). Many
of the current exploration methods for deep RL use task-agnostic objectives,
such as information gain or bonuses based on state visitation. However, many
practical applications of RL involve learning more than a single task, and
prior tasks can be used to inform how exploration should be performed in new
tasks, allowing the agent to explore effectively in new situations. We
introduce a novel gradient-based
fast adaptation algorithm — model agnostic exploration with structured noise
(MAESN) — to learn exploration strategies from prior experience. The prior
experience is used both to initialize a policy and to acquire a latent
exploration space that can inject structured stochasticity into a policy,
producing exploration strategies that are informed by prior knowledge and are
more effective than random action-space noise. We show that MAESN is more
effective at learning exploration strategies when compared to prior meta-RL
methods, RL without learned exploration strategies, and task-agnostic
exploration methods. We evaluate our method on a variety of simulated tasks:
locomotion with a wheeled robot, locomotion with a quadrupedal walker, and
object manipulation.
https://arxiv.org/abs/1802.07245 ,  6738kb)
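
The contrast between structured, temporally coherent exploration and plain action-space noise can be seen in a toy sketch: MAESN-style exploration samples one latent per episode and holds it fixed, so the perturbation is consistent across time steps. `base_policy`, the noise scale, and the mode names are illustrative stand-ins, not the paper's interface.

```python
import random

def rollout_actions(base_policy, noise="structured", steps=5, scale=0.5):
    """Roll out one episode of a toy policy under two noise regimes.

    "structured": a single latent z is drawn once and reused every step
    (the MAESN-style structured stochasticity).
    anything else: a fresh perturbation is drawn at every step
    (ordinary action-space noise)."""
    z = random.gauss(0.0, scale)  # one latent per episode
    acts = []
    for t in range(steps):
        if noise == "structured":
            acts.append(base_policy(t, z))
        else:
            acts.append(base_policy(t, random.gauss(0.0, scale)))
    return acts
```

With a policy that simply echoes its noise input, the structured rollout produces one consistent behaviour for the whole episode, while per-step noise jitters independently at every step.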

### Arxiv on Feb. 20th

21Feb18

Title: Reactive Reinforcement Learning in Asynchronous Environments
Authors: Jaden B. Travnik, Kory W. Mathewson, Richard S. Sutton, Patrick M.
Pilarski
Categories: cs.AI cs.LG
Comments: 11 pages, 7 figures, currently under journal peer review

The relationship between a reinforcement learning (RL) agent and an
asynchronous environment is often ignored. Frequently used models of the
interaction between an agent and its environment, such as Markov Decision
Processes (MDP) or Semi-Markov Decision Processes (SMDP), do not capture the
fact that, in an asynchronous environment, the state of the environment may
change during computation performed by the agent. In an asynchronous
environment, minimizing reaction time—the time it takes for an agent to react
to an observation—also minimizes the time in which the state of the
environment may change following observation. In many environments, the
reaction time of an agent directly impacts task performance by permitting the
environment to transition into either an undesirable terminal state or a state
where performing the chosen action is inappropriate. We propose a class of
reactive reinforcement learning algorithms that address this problem of
asynchronous environments by immediately acting after observing new state
information. We compare a reactive SARSA learning algorithm with the
conventional SARSA learning algorithm on two asynchronous robotic tasks
(emergency stopping and impact prevention), and show that the reactive RL
algorithm reduces the reaction time of the agent by approximately the duration
of the algorithm’s learning update. This new class of reactive algorithms may
facilitate safer control and faster decision making without any change to
standard learning guarantees.
https://arxiv.org/abs/1802.06139 ,  1681kb)
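
The reordering can be sketched in a few lines (a sketch of the idea, not the paper's exact algorithm): the next action is selected and emitted as soon as the new observation arrives, and the SARSA update runs afterwards, off the critical reaction path. `ToyEnv` and its `emit` method are hypothetical stand-ins for an asynchronous robot interface.

```python
import random
from collections import defaultdict

class ToyEnv:
    """Two-state chain: acting with a=1 while in state 1 earns reward."""

    def __init__(self):
        self.s = 0

    def step(self, a):
        r = 1.0 if (self.s == 1 and a == 1) else 0.0
        self.s = 1 - self.s
        return r, self.s

    def emit(self, a):
        pass  # in a real system this would actuate immediately

def epsilon_greedy(Q, s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def reactive_sarsa_step(Q, env, s, a, actions, alpha=0.1, gamma=0.99):
    r, s_next = env.step(a)                  # act, then observe
    a_next = epsilon_greedy(Q, s_next, actions)
    env.emit(a_next)                         # react immediately
    # the learning update is deferred until after the action is emitted
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return s_next, a_next

# brief usage: learn on the toy task
random.seed(0)
Q = defaultdict(float)
env, s, a = ToyEnv(), 0, 0
for _ in range(500):
    s, a = reactive_sarsa_step(Q, env, s, a, [0, 1])
```

Conventional SARSA would perform the Q-update between observing `s_next` and emitting `a_next`; moving the update after `emit` is what removes the learning computation from the reaction time.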

Title: Sim-To-Real Optimization Of Complex Real World Mobile Network with
Imperfect Information via Deep Reinforcement Learning from Self-play
Authors: Yongxi Tan, Jin Yang, Xin Chen, Qitao Song, Yunjun Chen, Zhangxiang
Ye, Zhenqiang Su
Categories: cs.AI cs.LG stat.ML

The mobile network that millions of people use every day is one of the most
complex systems in the real world. Optimizing a mobile network to meet
exploding customer demand and reduce CAPEX/OPEX poses greater challenges than
in prior works. Learning to solve complex real-world problems to benefit
everyone and make the world better has long been the ultimate goal of AI.
However, it remains an unsolved problem for deep reinforcement learning (DRL),
given imperfect information in the real world, huge state/action spaces, the
large amount of data needed for training, the associated time and cost,
multi-agent interactions, potential negative impact on the real world, etc. To
bridge this reality gap, we propose a DRL framework that directly transfers an
optimal policy learned from multiple tasks in a source domain to unseen
similar tasks in a target domain without any further training in either
domain. First, we distill temporal-spatial relationships between cells and
mobile users into a scalable 3D image-like tensor to best characterize the
partially observed mobile network. Second, inspired by AlphaGo, we use a novel
self-play mechanism to empower the DRL agent to gradually improve its
intelligence by competing for the best record on multiple tasks. Third, a
decentralized DRL method is proposed to coordinate multiple agents to compete
and cooperate as a team to maximize the global reward and minimize potential
negative impact. Using 7693 unseen test tasks over 160 unseen simulated mobile
networks and 6 field trials over 4 commercial mobile networks in the real
world, we demonstrate the capability of our approach to directly transfer
learning from one simulator to another, and from simulation to the real world.
This is the first time that a DRL agent has successfully transferred its
learning directly from simulation to very complex real-world problems with
incomplete and imperfect information, huge state/action spaces and multi-agent
interactions.
https://arxiv.org/abs/1802.06416 ,  949kb)

Title: Accelerated Primal-Dual Policy Optimization for Safe Reinforcement
Learning
Authors: Qingkai Liang, Fanyu Que, Eytan Modiano
Categories: cs.AI cs.LG stat.ML

Constrained Markov Decision Process (CMDP) is a natural framework for
reinforcement learning tasks with safety constraints, where agents learn a
policy that maximizes the long-term reward while satisfying the constraints on
the long-term cost. A canonical approach for solving CMDPs is the primal-dual
method which updates parameters in primal and dual spaces in turn. Existing
methods for CMDPs only use on-policy data for dual updates, which results in
sample inefficiency and slow convergence. In this paper, we propose a policy
search method for CMDPs called Accelerated Primal-Dual Optimization (APDO),
which incorporates an off-policy trained dual variable in the dual update
procedure while updating the policy in primal space with on-policy likelihood
ratio. Experimental results demonstrate that APDO achieves better sample
efficiency and faster convergence than state-of-the-art approaches for CMDPs.
https://arxiv.org/abs/1802.06480 ,  72kb)
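
The dual step of a primal-dual CMDP method can be sketched in a few lines (illustrative, not the authors' exact update): the Lagrange multiplier rises when the estimated long-term cost exceeds the limit and is projected back to the non-negative orthant. The APDO-specific twist in the abstract is that `cost_estimate` comes from an off-policy-trained estimator rather than fresh on-policy rollouts.

```python
def apdo_dual_update(lmbda, cost_estimate, cost_limit, lr=0.05):
    # gradient ascent on the dual of the Lagrangian
    #   L(theta, lambda) = J_r(theta) - lambda * (J_c(theta) - limit):
    # dL/d(lambda) = -(J_c - limit), so the dual variable increases
    # while the constraint is violated
    lmbda += lr * (cost_estimate - cost_limit)
    return max(lmbda, 0.0)  # project onto lambda >= 0
```

The primal step (not shown) would then update the policy parameters against the reward shaped by the current multiplier, `J_r - lmbda * J_c`, using on-policy likelihood ratios as the abstract describes.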

Title: Recommendations with Negative Feedback via Pairwise Deep Reinforcement
Learning
Authors: Xiangyu Zhao and Liang Zhang and Zhuoye Ding and Long Xia and Jiliang
Tang and Dawei Yin
Categories: cs.IR cs.LG stat.ML

Recommender systems play a crucial role in mitigating the problem of
information overload by suggesting personalized items or services to users. The
vast majority of traditional recommender systems consider the recommendation
procedure as a static process and make recommendations following a fixed
strategy. In this paper, we propose a novel recommender system with the
capability of continuously improving its strategies during the interactions
with users. We model the sequential interactions between users and a
recommender system as a Markov Decision Process (MDP) and leverage
Reinforcement Learning (RL) to automatically learn the optimal strategies via
recommending trial-and-error items and receiving reinforcements of these items
from users’ feedback. Users’ feedback can be positive and negative and both
types of feedback have great potential to boost recommendations. However,
negative feedback is far more frequent than positive feedback; incorporating
both simultaneously is thus challenging, since positive feedback could be
buried by negative feedback. In this paper, we develop a novel approach to
incorporate them into the proposed deep recommender system (DEERS) framework.
The experimental results based on real-world e-commerce data demonstrate the
effectiveness of the proposed framework. Further experiments have been
conducted to understand the importance of both positive and negative feedback
in recommendations.
https://arxiv.org/abs/1802.06501 ,  378kb)

Title: Modeling the Formation of Social Conventions in Multi-Agent Populations
Authors: Ismael T. Freire, Clement Moulin-Frier, Marti Sanchez-Fibla, Xerxes D.
Arsiwalla, Paul Verschure
Categories: cs.MA cs.AI cs.GT q-bio.NC stat.ML

In order to understand the formation of social conventions we need to know
the specific role of control and learning in multi-agent systems. To advance in
this direction, we propose, within the framework of the Distributed Adaptive
Control (DAC) theory, a novel Control-based Reinforcement Learning architecture
(CRL) that can account for the acquisition of social conventions in multi-agent
populations that are solving a benchmark social decision-making problem. Our
new CRL architecture, as a concrete realization of DAC multi-agent theory,
implements a low-level sensorimotor control loop handling the agent’s reactive
behaviors (pre-wired reflexes), along with a layer based on model-free
reinforcement learning that maximizes long-term reward. We apply CRL in a
multi-agent game-theoretic task in which coordination must be achieved in order
to find an optimal solution. We show that our CRL architecture is able to both
find optimal solutions in discrete and continuous time and reproduce human
experimental data on standard game-theoretic metrics such as efficiency in
acquiring rewards, fairness in reward distribution and stability of convention
formation.
https://arxiv.org/abs/1802.06108 ,  750kb)

Title: Efficient Large-Scale Fleet Management via Multi-Agent Deep
Reinforcement Learning
Authors: Kaixiang Lin, Renyu Zhao, Zhe Xu and Jiayu Zhou
Categories: cs.MA cs.AI

Large-scale online ride-sharing platforms have substantially transformed our
lives by reallocating transportation resources to alleviate traffic congestion
and promote transportation efficiency. An efficient fleet management strategy
can not only significantly improve the utilization of transportation resources
but also increase revenue and customer satisfaction. It is a challenging
task to design an effective fleet management strategy that can adapt to an
environment involving complex dynamics between demand and supply. Existing
studies usually work on a simplified problem setting that can hardly capture
the complicated stochastic demand-supply variations in high-dimensional space.
In this paper, we propose to tackle the large-scale fleet management problem
using reinforcement learning, and propose a contextual multi-agent
reinforcement learning framework including two concrete algorithms, namely
contextual deep Q-learning and contextual multi-agent actor-critic, to achieve
explicit coordination among a large number of agents adaptive to different
contexts. We show significant improvements of the proposed framework over
state-of-the-art approaches through extensive empirical studies.
https://arxiv.org/abs/1802.06444 ,  2383kb)

Title: A Deep Q-Learning Agent for the L-Game with Variable Batch Training
Authors: Petros Giannakopoulos, Yannis Cotronis
Categories: cs.LG cs.AI

We employ the Deep Q-Learning algorithm with Experience Replay to train an
agent capable of achieving a high level of play in the L-Game while
self-learning from low-dimensional states. We also employ variable batch size
for training in order to mitigate the loss of the rare reward signal and
significantly accelerate training. Despite the large action space due to the
number of possible moves, the low-dimensional state space and the rarity of
rewards, which only come at the end of a game, DQL is successful in training an
agent capable of strong play without the use of any search methods or domain
knowledge.
https://arxiv.org/abs/1802.06225 ,  424kb)

Title: Improving Mild Cognitive Impairment Prediction via Reinforcement
Learning and Dialogue Simulation
Authors: Fengyi Tang, Kaixiang Lin, Ikechukwu Uchendu, Hiroko H. Dodge, Jiayu
Zhou
Categories: cs.LG cs.CL stat.ML
Comments: 9 pages, 4 figures, 4 tables

Mild cognitive impairment (MCI) is a prodromal phase in the progression from
normal aging to dementia, especially Alzheimer's disease. Even though MCI
patients show mild cognitive decline, their overall cognition is normal, making
them challenging to distinguish from normal aging. Using transcribed data
obtained from recorded conversational interactions between participants and
trained interviewers, and applying supervised learning models to these data, a
recent clinical trial has shown a promising result in differentiating MCI from
normal aging. However, the substantial amount of interactions with medical
staff can still incur significant medical care expenses in practice. In this
paper, we propose a novel reinforcement learning (RL) framework to train an
efficient dialogue agent on existing transcripts from clinical trials.
Specifically, the agent is trained to sketch disease-specific lexical
probability distribution, and thus to converse in a way that maximizes the
diagnosis accuracy and minimizes the number of conversation turns. We evaluate
the performance of the proposed reinforcement learning framework on the MCI
diagnosis from a real clinical trial. The results show that while using only a
few turns of conversation, our framework can significantly outperform
state-of-the-art supervised learning approaches.
https://arxiv.org/abs/1802.06428 ,  617kb)

### Arxiv on Feb. 19th

21Feb18

Title: Monte Carlo Q-learning for General Game Playing
Authors: Hui Wang, Michael Emmerich, Aske Plaat
Categories: cs.AI

Recently, the interest in reinforcement learning in game playing has been
renewed. This is evidenced by the groundbreaking results achieved by AlphaGo.
General Game Playing (GGP) provides a good testbed for reinforcement learning,
currently one of the hottest fields of AI. In GGP, a specification of the
game's rules is given. The description specifies a reinforcement learning
problem, leaving programs to find strategies for playing well. Q-learning is
one of the canonical reinforcement learning methods, used as a baseline in
previous work (Banerjee & Stone, IJCAI 2007). We implement Q-learning in GGP
for three small board games (Tic-Tac-Toe, Connect-Four, Hex). We find that
Q-learning converges, and thus that this general reinforcement learning method
is indeed applicable to General Game Playing. However, convergence is slow, in
comparison to MCTS (a reinforcement learning method reported to achieve good
results). We enhance Q-learning with Monte Carlo Search. This enhancement
improves the performance of pure Q-learning, although it does not yet
outperform MCTS. Future work is needed on the relation between MCTS and
Q-learning, and on larger problem instances.
https://arxiv.org/abs/1802.05944 ,  1865kb)
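
The enhancement can be sketched as follows (a toy sketch of the idea, not the paper's exact algorithm): when the learned Q-values do not yet discriminate between moves, cheap random rollouts stand in for the missing value estimates. The two-action chain environment is a hypothetical stand-in for a GGP game.

```python
import random
from collections import defaultdict

def chain_step(s, a):
    # toy 3-state chain: action 1 moves right, reaching state 2 wins
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == 2 else 0.0), s2, s2 == 2

def mc_rollout_value(env_step, s, a, depth=5):
    # estimate an action's value with one random playout
    r, s, done = env_step(s, a)
    total = r
    for _ in range(depth):
        if done:
            break
        r, s, done = env_step(s, random.choice([0, 1]))
        total += r
    return total

def qm_select(Q, env_step, s, eps=0.1, rollouts=3):
    """Q-learning action selection that falls back to Monte Carlo
    search when the learned Q-values are still (near-)tied."""
    if random.random() < eps:
        return random.choice([0, 1])
    q0, q1 = Q[(s, 0)], Q[(s, 1)]
    if abs(q0 - q1) > 1e-6:
        return 0 if q0 > q1 else 1
    est = [sum(mc_rollout_value(env_step, s, a) for _ in range(rollouts))
           for a in (0, 1)]
    return 0 if est[0] >= est[1] else 1
```

With an untrained Q-table every state is tied, so selection is driven entirely by the rollouts, which favour the rewarding action; as Q-learning converges, the learned values take over and the rollouts are no longer consulted.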