arXiv on Feb. 13th


Title: More Robust Doubly Robust Off-policy Evaluation
Authors: Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh
Categories: cs.AI

We study the problem of off-policy evaluation (OPE) in reinforcement learning
(RL), where the goal is to estimate the performance of a policy from the data
generated by another policy(ies). In particular, we focus on the doubly robust
(DR) estimators that consist of an importance sampling (IS) component and a
performance model, and utilize the low (or zero) bias of IS and low variance of
the model at the same time. Although the accuracy of the model has a huge
impact on the overall performance of DR, most of the work on using the DR
estimators in OPE has been focused on improving the IS part, and not much on
how to learn the model. In this paper, we propose alternative DR estimators,
called more robust doubly robust (MRDR), that learn the model parameter by
minimizing the variance of the DR estimator. We first present a formulation for
learning the DR model in RL. We then derive formulas for the variance of the DR
estimator in both contextual bandits and RL, such that their gradients
w.r.t. the model parameters can be estimated from the samples, and propose
methods to efficiently minimize the variance. We prove that the MRDR estimators
are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR
in bandits and RL benchmark problems, and compare its performance with the
existing methods.
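
To make the IS-plus-model structure described above concrete, here is a minimal sketch of the standard doubly robust estimator in the contextual bandit setting. It is not the MRDR method from the paper (which additionally chooses the model to minimize the estimator's variance); the function name, argument names, and toy data are assumptions made for illustration only.

```python
import numpy as np

def dr_estimate(rewards, actions, behavior_probs, target_probs, q_hat):
    """Standard doubly robust value estimate for a contextual bandit.

    rewards:        (n,)   observed rewards for the logged actions
    actions:        (n,)   indices of the logged actions
    behavior_probs: (n,)   behavior-policy probability of each logged action
    target_probs:   (n, k) target-policy probabilities over all k actions
    q_hat:          (n, k) reward-model predictions for all k actions
    (All names are hypothetical, chosen for this sketch.)
    """
    idx = np.arange(len(rewards))
    # Model component: expected predicted reward under the target policy.
    model_term = np.sum(target_probs * q_hat, axis=1)
    # IS component: importance-weighted residual of the model on logged data.
    weights = target_probs[idx, actions] / behavior_probs
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(model_term + correction))

# Toy data: two actions with deterministic rewards 1.0 and 3.0.
rewards = np.array([1.0, 3.0, 1.0, 3.0])
actions = np.array([0, 1, 0, 1])
behavior_probs = np.full(4, 0.5)
target_probs = np.tile([0.2, 0.8], (4, 1))

perfect_model = np.tile([1.0, 3.0], (4, 1))  # correction term vanishes
zero_model = np.zeros((4, 2))                # reduces to plain IS
print(dr_estimate(rewards, actions, behavior_probs, target_probs, perfect_model))
print(dr_estimate(rewards, actions, behavior_probs, target_probs, zero_model))
```

With a zero model the estimate reduces to plain importance sampling, and with a perfect model the correction term vanishes, which is the sense in which the estimator combines the two components.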

Title: Beyond the One Step Greedy Approach in Reinforcement Learning
Authors: Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Categories: cs.AI
Comments: 13 pages, 1 figure

The famous Policy Iteration algorithm alternates between policy improvement
and policy evaluation. Implementations of this algorithm with several variants
of the latter evaluation stage, e.g., $n$-step and trace-based returns, have
been analyzed in previous works. However, the case of multiple-step lookahead
policy improvement, despite the recent increase in empirical evidence of its
strength, has to our knowledge not been carefully analyzed yet. In this work,
we introduce the first such analysis. Namely, we formulate variants of
multiple-step policy improvement, derive new algorithms using these definitions
and prove their convergence. Moreover, we show that recent prominent
Reinforcement Learning algorithms are, in fact, instances of our framework. We
thus shed light on their empirical success and give a recipe for deriving new
algorithms for future study.
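
As a rough illustration of what multiple-step lookahead policy improvement can mean, the sketch below computes an h-step greedy policy on a small tabular MDP: it takes the first action of an optimal h-horizon plan whose terminal value is the current estimate v. The abstract does not give the paper's exact formulation, so this definition, the function name, and the toy MDP are assumptions.

```python
import numpy as np

def h_step_greedy(P, R, v, gamma, h):
    """Return the h-step greedy policy w.r.t. value estimate v.

    P: (A, S, S) transition probabilities, P[a, s, t] = Pr(t | s, a)
    R: (S, A)    expected immediate rewards
    v: (S,)      current value estimate
    (Hypothetical helper written for this sketch.)
    """
    # Apply the Bellman optimality operator h-1 times to v ...
    for _ in range(h - 1):
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v = q.max(axis=1)
    # ... then act greedily for one more step.
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return q.argmax(axis=1)

# Toy chain: action 0 stays in place, action 1 moves right; state 2 is absorbing.
P = np.array([
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # stay
    [[0, 1, 0], [0, 0, 1], [0, 0, 1]],  # move right
], dtype=float)
R = np.array([[0.1, 0.0],   # state 0: small reward for staying
              [0.0, 1.0],   # state 1: large reward for moving
              [0.0, 0.0]])  # state 2: no reward
v0 = np.zeros(3)

print(h_step_greedy(P, R, v0, gamma=0.9, h=1))
print(h_step_greedy(P, R, v0, gamma=0.9, h=2))
```

In this toy chain the one-step greedy policy stays in state 0 to collect the small immediate reward, while the two-step lookahead moves right toward the larger reward.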

Title: State Representation Learning for Control: An Overview
Authors: Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François
Goudou, David Filliat
Categories: cs.AI

Representation learning algorithms are designed to learn abstract features
that characterize data. State representation learning (SRL) focuses on a
particular kind of representation learning where learned features are in low
dimension, evolve through time, and are influenced by actions of an agent. As
the representation learned captures the variation in the environment generated
by agents, this kind of representation is particularly suitable for robotics
and control scenarios. In particular, the low dimension helps to overcome the
curse of dimensionality, provides easier interpretation and utilization by
humans and can help improve performance and speed in policy learning algorithms
such as reinforcement learning.
This survey aims at covering the state-of-the-art on state representation
learning in the most recent years. It reviews different SRL methods that
involve interaction with the environment, their implementations and their
applications in robotics control tasks (simulated or real). In particular, it
highlights how generic learning objectives are differently exploited in the
reviewed algorithms. Finally, it discusses evaluation methods to assess the
representation learned and summarizes current and future lines of research.

Title: Deep Reinforcement Learning for Solving the Vehicle Routing Problem
Authors: Mohammadreza Nazari, Afshin Oroojlooy, Lawrence V. Snyder, Martin
Categories: cs.AI cs.LG stat.ML

We present an end-to-end framework for solving the Vehicle Routing Problem (VRP)
using deep reinforcement learning. In this approach, we train a single model
that finds near-optimal solutions for problem instances sampled from a given
distribution, only by observing the reward signals and following feasibility
rules. Our model represents a parameterized stochastic policy, and by applying
a policy gradient algorithm to optimize its parameters, the trained model
produces the solution as a sequence of consecutive actions in real time,
without the need to re-train for every new problem instance. Our method is
faster in both training and inference than a recent method that solves the
Traveling Salesman Problem (TSP), with nearly identical solution quality. On
the more general VRP, our approach outperforms classical heuristics on
medium-sized instances in both solution quality and computation time (after
training). Our proposed framework can be applied to variants of the VRP such as
the stochastic VRP, and has the potential to be applied more generally to
combinatorial optimization problems.

Title: Sample Efficient Deep Reinforcement Learning for Dialogue Systems with
  Large Action Spaces
Authors: Gellért Weisz, Paweł Budzianowski, Pei-Hao Su, Milica Gašić
Categories: cs.CL cs.AI cs.LG stat.ML

In spoken dialogue systems, we aim to deploy artificial intelligence to build
automated dialogue agents that can converse with humans. A part of this effort
is the policy optimisation task, which attempts to find a policy describing how
to respond to humans, in the form of a function taking the current state of the
dialogue and returning the response of the system. In this paper, we
investigate deep reinforcement learning approaches to solve this problem.
Particular attention is given to actor-critic methods, off-policy reinforcement
learning with experience replay, and various methods aimed at reducing the bias
and variance of estimators. When combined, these methods result in the
previously proposed ACER algorithm that gave competitive results in gaming
environments. These environments, however, are fully observable and have a
relatively small action set, so in this paper we examine the application of ACER
to dialogue policy optimisation. We show that this method beats the current
state-of-the-art in deep learning approaches for spoken dialogue systems. This
not only leads to a more sample efficient algorithm that can train faster, but
also allows us to apply the algorithm in more difficult environments than
before. We thus experiment with learning in a very large action space, which
has two orders of magnitude more actions than previously considered. We find
that ACER trains significantly faster than the current state-of-the-art.

Title: ADC: Automated Deep Compression and Acceleration with Reinforcement
Authors: Yihui He, Song Han
Categories: cs.CV

Model compression is an effective technique facilitating the deployment of
neural network models on mobile devices that have limited computation resources
and a tight power budget. However, conventional model compression techniques
use hand-crafted features and require domain experts to explore the large
design space trading off model size, speed, and accuracy, which is usually
sub-optimal and time-consuming. In this paper, we propose Automated Deep
Compression (ADC) that leverages reinforcement learning in order to efficiently
sample the design space and greatly improve the model compression quality. We
achieved state-of-the-art model compression results in a fully automated way
without any human effort. Under 4x FLOPs reduction, we achieved 2.7% better
accuracy than a hand-crafted model compression method for VGG-16 on ImageNet. We
applied this automated, push-the-button compression pipeline to MobileNet and
achieved a 2x reduction in FLOPs, and a speedup of 1.49x on Titan Xp and 1.65x
on an Android phone (Samsung Galaxy S7), with negligible loss of accuracy.

Title: A note on reinforcement learning with Wasserstein distance
  regularisation, with applications to multipolicy learning
Authors: Mohammed Amin Abdullah, Aldo Pacchiano, Moez Draief
Categories: cs.LG cs.AI

In this note we describe an application of Wasserstein distance to
Reinforcement Learning. The Wasserstein distance in question is between the
distribution of mappings of trajectories of a policy into some metric space,
and some other fixed distribution (which may, for example, come from another
policy). Different policies induce different distributions, so given an
underlying metric, the Wasserstein distance quantifies how different policies
are. This can be used to learn multiple policies which are different in terms of
such Wasserstein distances by using a Wasserstein regulariser. Changing the
sign of the regularisation parameter, one can learn a policy for which its
trajectory mapping distribution is attracted to a given fixed distribution.
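
The role of the regularisation sign can be sketched in the simplest possible case: trajectories mapped to scalars, with equally sized samples, where the 1-Wasserstein distance between the two empirical distributions is the mean absolute difference of the sorted samples. The objective below, its sign convention, and all names are assumptions for illustration; the note itself works with general metric spaces.

```python
import numpy as np

def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equally sized 1-D samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized samples")
    # In 1-D the optimal coupling matches sorted samples in order.
    return float(np.mean(np.abs(x - y)))

def regularised_objective(expected_return, embeddings, reference, lam):
    """Return plus a Wasserstein term (hypothetical sign convention).

    lam > 0 rewards distance from the reference distribution (diversity);
    lam < 0 penalises it, attracting the policy toward the reference.
    """
    return expected_return + lam * wasserstein_1d(embeddings, reference)

policy_samples = [0.0, 1.0, 2.0]     # scalar trajectory embeddings (toy data)
reference_samples = [1.0, 2.0, 3.0]  # samples from the fixed distribution

print(wasserstein_1d(policy_samples, reference_samples))
print(regularised_objective(1.0, policy_samples, reference_samples, lam=0.5))
print(regularised_objective(1.0, policy_samples, reference_samples, lam=-0.5))
```

Maximising this objective with a positive coefficient pushes the policy's embedding distribution away from the reference, and with a negative coefficient pulls it toward the reference.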
