### arXiv on Feb. 27th

28Feb18

Title: Reinforcement Learning on Web Interfaces Using Workflow-Guided
Exploration
Authors: Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy
Liang
Categories: cs.AI
Comments: International Conference on Learning Representations (ICLR), 2018

Reinforcement learning (RL) agents improve through trial-and-error, but when
reward is sparse and the agent cannot discover successful action sequences,
learning stagnates. This has been a notable problem in training deep RL agents
to perform web-based tasks, such as booking flights or replying to emails,
where a single mistake can ruin the entire sequence of actions. A common remedy
is to “warm-start” the agent by pre-training it to mimic expert demonstrations,
but this is prone to overfitting. Instead, we propose to constrain exploration
using demonstrations. From each demonstration, we induce high-level “workflows”
which constrain the allowable actions at each time step to be similar to those
in the demonstration (e.g., “Step 1: click on a textbox; Step 2: enter some
text”). Our exploration policy then learns to identify successful workflows and
samples actions that satisfy these workflows. Workflows prune out bad
exploration directions and accelerate the agent’s ability to discover rewards.
We use our approach to train a novel neural policy designed to handle the
semi-structured nature of websites, and evaluate on a suite of web tasks,
including the recent World of Bits benchmark. We achieve new state-of-the-art
results, and show that workflow-guided exploration improves sample efficiency
over behavioral cloning by more than 100x.
https://arxiv.org/abs/1802.08802
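
The idea of a workflow constraining the allowable actions at each step can be illustrated with a minimal sketch. Everything here is a hypothetical toy and not the authors' implementation: the `PAGE` layout, the `(action_type, tag)` step format, and the function names are assumptions, and actions are sampled uniformly rather than by the learned exploration policy the abstract describes.

```python
import random

# Hypothetical toy page: element id -> element tag.
PAGE = {"to": "textbox", "body": "textbox", "send": "button", "cancel": "button"}

# Hypothetical workflow induced from one demonstration:
# each step constrains the action type and the tag of the target element.
WORKFLOW = [("click", "textbox"), ("type", "textbox"), ("click", "button")]


def all_actions(page):
    """Enumerate every (action_type, element_id) pair available on the page."""
    actions = []
    for element_id, tag in page.items():
        actions.append(("click", element_id))
        if tag == "textbox":
            actions.append(("type", element_id))
    return actions


def allowed_actions(page, step):
    """Keep only the actions that satisfy one workflow step."""
    action_type, tag = step
    return [a for a in all_actions(page)
            if a[0] == action_type and page[a[1]] == tag]


def sample_episode(page, workflow, rng):
    """Sample one action per step, restricted to what the workflow allows."""
    return [rng.choice(allowed_actions(page, step)) for step in workflow]


rng = random.Random(0)
episode = sample_episode(PAGE, WORKFLOW, rng)

# Size of the search space with and without the workflow constraint.
unconstrained = len(all_actions(PAGE)) ** len(WORKFLOW)
constrained = 1
for step in WORKFLOW:
    constrained *= len(allowed_actions(PAGE, step))
```

In this toy, the workflow shrinks the space of three-step action sequences from 216 to 8, which is the pruning effect the abstract attributes to workflows.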

Title: Variance Reduction Methods for Sublinear Reinforcement Learning
Authors: Sham Kakade, Mengdi Wang, Lin F. Yang
Categories: cs.AI cs.LG stat.ML

This work considers the problem of provably optimal reinforcement learning
for (episodic) finite-horizon MDPs, i.e., how an agent learns to maximize
its (long-term) reward in an uncertain environment. The main contribution
is a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ),
which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where $T$
is the number of time steps the agent acts in the MDP,
$S$ is the number of states, $A$ is the number of actions, and $H$ is the
(episodic) horizon time.
This is the first regret bound that is both sub-linear in the model size and
asymptotically optimal. The algorithm is sub-linear in that the time to achieve
$\epsilon$-average regret (for any constant $\epsilon$) is $O(SA)$, which is a
number of samples that is far less than that required to learn any
(non-trivial) estimate of the transition model (the transition model is
specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is
largely the motivation for algorithms such as $Q$-learning and other “model
free” approaches. The vUCQ algorithm also enjoys minimax-optimal regret in the long
run, matching the $\Omega(\sqrt{HSAT})$ lower bound.
Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive
refinement method in which the algorithm reduces the variance in $Q$-value
estimates and couples this estimation scheme with an upper confidence based
algorithm. Technically, the coupling of both of these techniques is what leads
to the algorithm enjoying both the sub-linear regret property and the
(asymptotically) optimal regret.
https://arxiv.org/abs/1802.09184
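
A short worked step, using only the bound quoted in the abstract and ignoring constants and logarithmic factors, shows when the $\sqrt{HSAT}$ term dominates the additive $H^5SA$ term:

$$\sqrt{HSAT} \ge H^5SA \iff HSAT \ge H^{10}S^2A^2 \iff T \ge H^9SA.$$

Dividing the bound by $T$ gives an average regret of order $\sqrt{HSA/T} + H^5SA/T$. Each term is at most $\epsilon/2$ once

$$T \gtrsim \max\left(\frac{4HSA}{\epsilon^2},\ \frac{2H^5SA}{\epsilon}\right),$$

which is linear in $SA$ when $\epsilon$ and $H$ are treated as constants. Treating $H$ as fixed is an assumption made here to match the $O(SA)$ statement; the abstract does not spell out the dependence on $H$.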

Title: The AdobeIndoorNav Dataset: Towards Deep Reinforcement Learning based
Real-world Indoor Robot Visual Navigation
Authors: Kaichun Mo, Haoxiang Li, Zhe Lin and Joon-Young Lee
Categories: cs.RO

Deep reinforcement learning (DRL) demonstrates its potential in learning a
model-free navigation policy for robot visual navigation. However, the
data-demanding algorithm relies on a large number of navigation trajectories in
training. Existing datasets that support training such robot navigation
algorithms consist of either 3D synthetic scenes or reconstructed scenes.
Synthetic data suffers from a domain gap to real-world scenes, while visual
inputs rendered from 3D reconstructed scenes have undesired holes and
artifacts. In this paper, we present a new dataset collected in the real world
to facilitate research in DRL-based visual navigation. Our dataset includes 3D
reconstruction for real-world scenes as well as densely captured real 2D images
from the scenes. It provides high-quality visual inputs with real-world scene
complexity to the robot at dense grid locations. We further study and benchmark
one recent DRL based navigation algorithm and present our attempts and thoughts
on improving its generalizability to unseen test targets in the scenes.
https://arxiv.org/abs/1802.08824

Title: Fully Decentralized Multi-Agent Reinforcement Learning with Networked
Agents
Authors: Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar
Categories: cs.LG cs.AI cs.MA math.OC stat.ML

We consider the problem of *fully decentralized* multi-agent
reinforcement learning (MARL), where the agents are located at the nodes of a
time-varying communication network. Specifically, we assume that the reward
functions of the agents might correspond to different tasks, and are only known
to the corresponding agent. Moreover, each agent makes individual decisions
based on both the information observed locally and the messages received from
its neighbors over the network. Within this setting, the collective goal of the
agents is to maximize the globally averaged return over the network through
exchanging information with their neighbors. To this end, we propose two
decentralized actor-critic algorithms with function approximation, which are
applicable to large-scale MARL problems where both the number of states and the
number of agents are massively large. Under the decentralized structure, the
actor step is performed individually by each agent with no need to infer the
policies of others. For the critic step, we propose a consensus update via
communication over the network. Our algorithms are fully incremental and can be
implemented in an online fashion. Convergence analyses of the algorithms are
provided when the value functions are approximated within the class of linear
functions. Extensive simulation results with both linear and nonlinear function
approximations are presented to validate the proposed algorithms. Our work
appears to be the first study of fully decentralized MARL algorithms for
networked agents with function approximation, with provable convergence
guarantees.
https://arxiv.org/abs/1802.08757
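
The consensus update for the critic can be sketched as repeated weighted averaging of each agent's parameters with those of its neighbors. This is a simplified illustration under stated assumptions, not the paper's algorithm: the four-agent ring, the fixed weight matrix `WEIGHTS` (the abstract describes a time-varying network), and the starting parameters are all hypothetical, and the local temporal-difference step that would precede each consensus round is omitted.

```python
def consensus_step(params, weights):
    """One consensus round.

    Each agent i replaces its parameter vector with a weighted average of
    all agents' vectors; weights[i][j] is 0 when j is not a neighbor of i.
    """
    n = len(params)
    dim = len(params[0])
    return [
        [sum(weights[i][j] * params[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]


# Hypothetical ring of 4 agents: each agent talks only to its two neighbors.
# Rows and columns both sum to 1 (a doubly stochastic matrix).
WEIGHTS = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

# Hypothetical critic parameters, one 2-dimensional vector per agent.
params = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

for _ in range(50):
    params = consensus_step(params, WEIGHTS)
```

With a doubly stochastic weight matrix on a connected graph, repeated rounds drive every agent toward the network-wide average of the initial parameters, here `[4.0, 3.0]`, even though no agent ever communicates with more than two others.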

Title: Multi-Goal Reinforcement Learning: Challenging Robotics Environments and
Request for Research
Authors: Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen
Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter
Welinder, Vikash Kumar, Wojciech Zaremba
Categories: cs.LG cs.AI cs.RO

The purpose of this technical report is two-fold. First of all, it introduces
a suite of challenging continuous control tasks (integrated with OpenAI Gym)
based on currently existing robotics hardware. The tasks include pushing,
sliding and pick & place with a Fetch robotic arm as well as in-hand object
manipulation with a Shadow Dexterous Hand. All tasks have sparse binary rewards
and follow a Multi-Goal Reinforcement Learning (RL) framework in which an agent
is told what to do using an additional input.
The second part of the paper presents a set of concrete research ideas for
improving RL algorithms, most of which are related to Multi-Goal RL and
Hindsight Experience Replay.
https://arxiv.org/abs/1802.09464
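
A minimal sketch of a goal-conditioned sparse binary reward and of hindsight relabeling may help make the setup concrete. The details are assumptions for illustration rather than the report's code: the -1/0 reward convention, the distance tolerance `tol`, the dictionary layout of a transition, and the "final" relabeling strategy (replacing the goal with whatever was achieved at the last step) are all chosen here, and the function names are hypothetical.

```python
import math


def sparse_reward(achieved, goal, tol=0.05):
    """Binary reward: 0.0 if the achieved goal is within tol of the goal, else -1.0."""
    return 0.0 if math.dist(achieved, goal) <= tol else -1.0


def relabel_final(episode):
    """Hindsight relabeling with the "final" strategy.

    Pretend the goal was the one actually achieved at the last step,
    then recompute the reward of every transition under that goal.
    """
    new_goal = episode[-1]["achieved"]
    return [
        {**t, "goal": new_goal, "reward": sparse_reward(t["achieved"], new_goal)}
        for t in episode
    ]


# A failed toy episode: the object ends at (0.3, 0.0) but the goal was (1.0, 1.0).
goal = (1.0, 1.0)
positions = [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
episode = [
    {"achieved": p, "goal": goal, "reward": sparse_reward(p, goal)}
    for p in positions
]

relabeled = relabel_final(episode)
original_rewards = [t["reward"] for t in episode]     # every step fails
hindsight_rewards = [t["reward"] for t in relabeled]  # the last step succeeds
```

The original episode carries no success signal, while the relabeled copy ends in a success for the substituted goal, which is what lets an agent learn from otherwise unrewarded trajectories.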