### Arxiv on Feb. 27th

Title: **Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration**

Authors: Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, Percy Liang

Categories: cs.AI

Comments: International Conference on Learning Representations (ICLR), 2018

Reinforcement learning (RL) agents improve through trial-and-error, but when reward is sparse and the agent cannot discover successful action sequences, learning stagnates. This has been a notable problem in training deep RL agents to perform web-based tasks, such as booking flights or replying to emails, where a single mistake can ruin the entire sequence of actions. A common remedy is to “warm-start” the agent by pre-training it to mimic expert demonstrations, but this is prone to overfitting. Instead, we propose to constrain exploration using demonstrations. From each demonstration, we induce high-level “workflows” which constrain the allowable actions at each time step to be similar to those in the demonstration (e.g., “Step 1: click on a textbox; Step 2: enter some text”). Our exploration policy then learns to identify successful workflows and samples actions that satisfy these workflows. Workflows prune out bad exploration directions and accelerate the agent’s ability to discover rewards. We use our approach to train a novel neural policy designed to handle the semi-structured nature of websites, and evaluate on a suite of web tasks, including the recent World of Bits benchmark. We achieve new state-of-the-art results, and show that workflow-guided exploration improves sample efficiency over behavioral cloning by more than 100x.

( https://arxiv.org/abs/1802.08802 , 362kb)
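
To make the idea of workflow-constrained exploration in the abstract above concrete, here is a minimal sketch. It is not the authors' implementation: the `Action` type, the `click_on` and `type_into` helpers, and uniform sampling among allowed actions are all assumptions made for illustration. The abstract only says that each workflow step restricts which actions are allowed at that time step.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """Hypothetical web action: what to do and on which kind of element."""
    kind: str  # e.g. "click" or "type"
    tag: str   # HTML tag of the target element, e.g. "input"


# A workflow step is modeled as a predicate that accepts or rejects an action.
WorkflowStep = Callable[[Action], bool]


def click_on(tag: str) -> WorkflowStep:
    """Step constraint: click on any element with the given tag."""
    return lambda a: a.kind == "click" and a.tag == tag


def type_into(tag: str) -> WorkflowStep:
    """Step constraint: enter text into any element with the given tag."""
    return lambda a: a.kind == "type" and a.tag == tag


def sample_episode(
    workflow: Sequence[WorkflowStep],
    candidates_per_step: Sequence[Sequence[Action]],
    rng: random.Random,
) -> List[Action]:
    """Sample one action per step, restricted to actions the step allows."""
    episode: List[Action] = []
    for step, candidates in zip(workflow, candidates_per_step):
        allowed = [a for a in candidates if step(a)]
        if not allowed:
            break  # the workflow cannot be followed on this page
        episode.append(rng.choice(allowed))  # assumption: uniform sampling
    return episode


# "Step 1: click on a textbox; Step 2: enter some text"
workflow = [click_on("input"), type_into("input")]
page_actions = [
    Action("click", "input"),
    Action("click", "button"),
    Action("type", "input"),
]
episode = sample_episode(workflow, [page_actions, page_actions], random.Random(0))
```

In this toy page only one action satisfies each step, so the sampled episode is forced. With several matching elements the constraint still prunes everything that does not fit the step, which is the sense in which workflows "prune out bad exploration directions".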

Title: **Variance Reduction Methods for Sublinear Reinforcement Learning**

Authors: Sham Kakade, Mengdi Wang, Lin F. Yang

Categories: cs.AI cs.LG stat.ML

Comments: 33 pages

This work considers the problem of provably optimal reinforcement learning for (episodic) finite horizon MDPs, i.e. how an agent learns to maximize his/her (long term) reward in an uncertain environment. The main contribution is in providing a novel algorithm, Variance-reduced Upper Confidence Q-learning (vUCQ), which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time.

This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret (for any constant $\epsilon$) is $O(SA)$, which is a number of samples that is far less than that required to learn any (non-trivial) estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other “model free” approaches. The vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound.

Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the (asymptotically) optimal regret.

( https://arxiv.org/abs/1802.09184 , 539kb)
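
A quick way to read the regret bound in the abstract above is to ask when the $\sqrt{HSAT}$ term overtakes the $H^5SA$ term. Ignoring the constants and logarithmic factors hidden by $\widetilde{O}$, setting $\sqrt{HSAT} = H^5SA$ gives $T = H^9SA$. The sketch below just evaluates that arithmetic. The function names and the example values of $H$, $S$, $A$ are illustrative assumptions, not numbers from the paper.

```python
import math


def regret_terms(H: int, S: int, A: int, T: int) -> tuple:
    """Return the two terms of sqrt(HSAT) + H^5 SA (constants and logs dropped)."""
    leading = math.sqrt(H * S * A * T)
    lower_order = H**5 * S * A
    return leading, lower_order


def crossover_T(H: int, S: int, A: int) -> int:
    """Smallest T at which sqrt(HSAT) >= H^5 SA, i.e. T = H^9 SA."""
    return H**9 * S * A


# Illustrative sizes only (assumed, not taken from the paper).
H, S, A = 10, 100, 4
T_star = crossover_T(H, S, A)
leading, lower_order = regret_terms(H, S, A, T_star)

# Model-free vs. model-based bookkeeping mentioned in the abstract:
q_table_entries = S * A          # O(SA) values in a Q-table
transition_params = S * S * A    # O(S^2 A) entries in the transition model
```

For these assumed sizes the Q-table has `S * A` entries while the transition model has `S * S * A`, which is the gap the abstract refers to as "sub-linear in the model size".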

Title: **The AdobeIndoorNav Dataset: Towards Deep Reinforcement Learning based Real-world Indoor Robot Visual Navigation**

Authors: Kaichun Mo, Haoxiang Li, Zhe Lin and Joon-Young Lee

Categories: cs.RO

Deep reinforcement learning (DRL) demonstrates its potential in learning a model-free navigation policy for robot visual navigation. However, the data-demanding algorithm relies on a large number of navigation trajectories in training. Existing datasets that support training such robot navigation algorithms consist of either 3D synthetic scenes or reconstructed scenes. Synthetic data suffers from a domain gap to real-world scenes, while visual inputs rendered from 3D reconstructed scenes have undesired holes and artifacts. In this paper, we present a new dataset collected in the real world to facilitate research in DRL-based visual navigation. Our dataset includes 3D reconstructions of real-world scenes as well as densely captured real 2D images from the scenes. It provides high-quality visual inputs with real-world scene complexity to the robot at dense grid locations. We further study and benchmark one recent DRL-based navigation algorithm and present our attempts and thoughts on improving its generalizability to unseen test targets in the scenes.

( https://arxiv.org/abs/1802.08824 , 7213kb)

Title: **Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents**

Authors: Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar

Categories: cs.LG cs.AI cs.MA math.OC stat.ML

We consider the problem of *fully decentralized* multi-agent reinforcement learning (MARL), where the agents are located at the nodes of a time-varying communication network. Specifically, we assume that the reward functions of the agents might correspond to different tasks, and are only known to the corresponding agent. Moreover, each agent makes individual decisions based on both the information observed locally and the messages received from its neighbors over the network. Within this setting, the collective goal of the agents is to maximize the globally averaged return over the network through exchanging information with their neighbors. To this end, we propose two decentralized actor-critic algorithms with function approximation, which are applicable to large-scale MARL problems where both the number of states and the number of agents are massively large. Under the decentralized structure, the actor step is performed individually by each agent with no need to infer the policies of others. For the critic step, we propose a consensus update via communication over the network. Our algorithms are fully incremental and can be implemented in an online fashion. Convergence analyses of the algorithms are provided when the value functions are approximated within the class of linear functions. Extensive simulation results with both linear and nonlinear function approximations are presented to validate the proposed algorithms. Our work appears to be the first study of fully decentralized MARL algorithms for networked agents with function approximation, with provable convergence guarantees.

( https://arxiv.org/abs/1802.08757 , 1867kb)
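
The abstract above mentions a consensus update for the critic without giving its form. One common form of consensus is weighted averaging with neighbors, sketched below. This is an assumption for illustration, not the paper's exact update: the network here is fixed rather than time-varying, the weight matrix is a hand-picked doubly stochastic matrix for a three-agent line graph, and the local temporal-difference step that would normally precede each averaging round is omitted.

```python
from typing import List

Vector = List[float]


def consensus_step(params: List[Vector], weights: List[List[float]]) -> List[Vector]:
    """One round of averaging: agent i mixes the parameters of its neighbors.

    weights[i][j] is the weight agent i puts on agent j's parameters and is
    zero when i and j are not connected.
    """
    n = len(params)
    dim = len(params[0])
    return [
        [sum(weights[i][j] * params[j][k] for j in range(n)) for k in range(dim)]
        for i in range(n)
    ]


def run_consensus(params: List[Vector], weights: List[List[float]], rounds: int) -> List[Vector]:
    """Apply consensus_step repeatedly."""
    for _ in range(rounds):
        params = consensus_step(params, weights)
    return params


# Line graph 0 - 1 - 2: agents 0 and 2 never communicate directly.
# Rows and columns each sum to 1 (doubly stochastic), so the average is preserved.
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
local_critic_params = [[0.0], [3.0], [6.0]]  # one scalar parameter per agent
agreed = run_consensus(local_critic_params, W, rounds=100)
```

With a doubly stochastic weight matrix on a connected graph, repeated averaging drives every agent toward the network-wide mean of the initial parameters, even though the two end agents only ever talk to the middle one.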

Title: **Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research**

Authors: Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, Vikash Kumar, Wojciech Zaremba

Categories: cs.LG cs.AI cs.RO

The purpose of this technical report is two-fold. First of all, it introduces a suite of challenging continuous control tasks (integrated with OpenAI Gym) based on currently existing robotics hardware. The tasks include pushing, sliding and pick & place with a Fetch robotic arm as well as in-hand object manipulation with a Shadow Dexterous Hand. All tasks have sparse binary rewards and follow a Multi-Goal Reinforcement Learning (RL) framework in which an agent is told what to do using an additional input.

The second part of the paper presents a set of concrete research ideas for improving RL algorithms, most of which are related to Multi-Goal RL and Hindsight Experience Replay.

( https://arxiv.org/abs/1802.09464 , 1502kb)
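
The combination of sparse binary rewards and Hindsight Experience Replay mentioned in the abstract above can be sketched in a few lines. This is not the report's code or the Gym API. The `Transition` type, the 2D goals, the tolerance of 0.05, the 0 / -1 reward values, and the choice to relabel with the final achieved goal are all assumptions made for illustration.

```python
import math
from typing import List, NamedTuple, Tuple

Goal = Tuple[float, float]


class Transition(NamedTuple):
    """Hypothetical record of one step in a goal-conditioned episode."""
    achieved: Goal  # where the object ended up after this step
    desired: Goal   # the goal the agent was told to reach
    reward: float


def sparse_reward(achieved: Goal, desired: Goal, tol: float = 0.05) -> float:
    """Binary reward: 0 when within tol of the goal, -1 otherwise (assumed values)."""
    return 0.0 if math.dist(achieved, desired) <= tol else -1.0


def relabel_with_final(episode: List[Transition]) -> List[Transition]:
    """Hindsight relabeling: pretend the finally achieved state was the goal."""
    final = episode[-1].achieved
    return [
        Transition(t.achieved, final, sparse_reward(t.achieved, final))
        for t in episode
    ]


# A failed episode: the object moves toward (2, 0) but never gets there.
target: Goal = (2.0, 0.0)
path = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
episode = [Transition(p, target, sparse_reward(p, target)) for p in path]
hindsight = relabel_with_final(episode)
```

Every reward in the original episode is -1, so it carries no signal about how to succeed. After relabeling, the last transition earns the success reward for the substituted goal, which gives the learner at least one positive example per episode.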
