### arXiv on Feb. 12th

12Feb18

Title: A Unified Approach for Multi-step Temporal-Difference Learning with
Eligibility Traces in Reinforcement Learning
Authors: Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, Gang Pan
Categories: cs.AI cs.LG stat.ML

Recently, a new multi-step temporal-difference learning algorithm, called $Q(\sigma)$,
unified $n$-step Tree-Backup (when $\sigma=0$) and $n$-step Sarsa (when
$\sigma=1$) by introducing a sampling parameter $\sigma$. However, like
other multi-step temporal-difference learning algorithms, $Q(\sigma)$ requires
considerable memory and computation time. Eligibility traces are an important
mechanism for transforming off-line updates into efficient on-line ones that
consume less memory and computation time. In this paper, we further develop the
original $Q(\sigma)$, combine it with eligibility traces and propose a new
algorithm, called $Q(\sigma ,\lambda)$, in which $\lambda$ is the trace-decay
parameter. This idea unifies Sarsa$(\lambda)$ (when $\sigma =1$) and
$Q^{\pi}(\lambda)$ (when $\sigma =0$). Furthermore, we give an upper error
bound for the $Q(\sigma ,\lambda)$ policy evaluation algorithm. We prove that
the $Q(\sigma,\lambda)$ control algorithm can converge to the optimal value
function exponentially fast. We also empirically compare it with conventional
temporal-difference learning methods. Results show that, with an intermediate
value of $\sigma$, $Q(\sigma ,\lambda)$ creates a mixture of the existing
algorithms that can learn the optimal value significantly faster than
either extreme ($\sigma=0$ or $\sigma=1$).
https://arxiv.org/abs/1802.03171
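
To make the role of $\sigma$ concrete, here is a minimal sketch of the one-step $Q(\sigma)$ backup target in its standard textbook form. It is not the paper's $Q(\sigma,\lambda)$ algorithm, which additionally maintains eligibility traces. The function name `q_sigma_target` and its arguments are hypothetical, chosen only for illustration.

```python
def q_sigma_target(reward, gamma, sigma, q_next, pi_next, next_action):
    """One-step Q(sigma) backup target (illustrative sketch only).

    q_next      -- action values Q(s', a), one entry per action
    pi_next     -- target-policy probabilities pi(a | s'), same order
    next_action -- index of the action actually sampled in s'
    """
    # Sampled part: the value of the action that was actually taken (Sarsa).
    sampled = q_next[next_action]
    # Expected part: the policy-weighted average over all actions (Tree-Backup).
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    # sigma interpolates linearly between the two.
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


q_next = [1.0, 3.0]
pi_next = [0.5, 0.5]
sarsa = q_sigma_target(0.0, 1.0, 1.0, q_next, pi_next, next_action=1)  # sigma = 1
tree = q_sigma_target(0.0, 1.0, 0.0, q_next, pi_next, next_action=1)   # sigma = 0
mixed = q_sigma_target(0.0, 1.0, 0.5, q_next, pi_next, next_action=1)  # in between
```

With $\sigma=1$ the target uses only the sampled action value, with $\sigma=0$ only the expectation under the target policy, and intermediate values blend the two.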

Title: Balancing Two-Player Stochastic Games with Soft Q-Learning
Authors: Jordi Grau-Moya and Felix Leibfried and Haitham Bou-Ammar
Categories: cs.AI

Within the context of video games, the notion of perfectly rational agents can
be undesirable as it leads to uninteresting situations, where humans face tough
adversarial decision makers. Current frameworks for stochastic games and
reinforcement learning prohibit tuneable strategies as they seek optimal
performance. In this paper, we enable such tuneable behaviour by generalising
soft Q-learning to stochastic games, in which more than one agent interacts
strategically. We contribute both theoretically and empirically. On the theory
side, we show that games with soft Q-learning exhibit a unique value and
generalise team games and zero-sum games far beyond these two extremes to cover
a continuous spectrum of gaming behaviour. Experimentally, we show how tuning
agents’ constraints affects performance and demonstrate, through a neural
network architecture, how to reliably balance games with high-dimensional
representations.
https://arxiv.org/abs/1802.03216
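
As a rough illustration of how a single tuning parameter can span a continuum between cooperative and adversarial behaviour, the sketch below uses one common form of the soft (log-sum-exp) value operator from soft Q-learning. The paper's exact operator may differ. The names `soft_value`, `prior` and `beta` are assumptions made for this example.

```python
import math


def soft_value(q_values, prior, beta):
    """Soft value (1/beta) * log sum_a prior(a) * exp(beta * Q(a)).

    Illustrative sketch: beta is an inverse-temperature parameter and
    prior is a reference distribution over actions.
    """
    if beta == 0.0:
        # Limit beta -> 0: plain expectation under the prior.
        return sum(p * q for p, q in zip(prior, q_values))
    scaled = [beta * q for q in q_values]
    m = max(scaled)  # subtract the maximum for numerical stability
    total = sum(p * math.exp(s - m) for p, s in zip(prior, scaled))
    return (m + math.log(total)) / beta


q = [1.0, 3.0]
prior = [0.5, 0.5]
near_max = soft_value(q, prior, 50.0)    # large positive beta: close to max(q)
near_min = soft_value(q, prior, -50.0)   # large negative beta: close to min(q)
average = soft_value(q, prior, 0.0)      # beta = 0: prior-weighted mean
```

Large positive `beta` approaches a maximising (helpful) choice, large negative `beta` approaches a minimising (adversarial) choice, and values in between give softer, tuneable behaviour.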

Title: Learning Robust Options
Authors: Daniel J. Mankowitz, Timothy A. Mann, Pierre-Luc Bacon, Doina Precup
and Shie Mannor
Categories: cs.AI cs.LG stat.ML

Robust reinforcement learning aims to produce policies that retain strong
guarantees even when the parameters of the environment or transition model
are highly uncertain. Existing work uses value-based methods and the usual
primitive action setting. In this paper, we propose robust methods for learning
temporally abstract actions, in the framework of options. We present a Robust
Options Policy Iteration (ROPI) algorithm with convergence guarantees, which
learns options that are robust to model uncertainty. We utilize ROPI to learn
robust options with the Robust Options Deep Q Network (RO-DQN) that solves
multiple tasks and mitigates model misspecification due to model uncertainty.
We present experimental results which suggest that policy iteration with linear
features may have an inherent form of robustness when using coarse feature
representations. In addition, we present experimental results which demonstrate
that robustness helps policy iteration implemented on top of deep neural
networks to generalize over a much broader range of dynamics than non-robust
policy iteration.
https://arxiv.org/abs/1802.03236
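
The idea of robustness to model uncertainty can be sketched with a generic worst-case Bellman backup over a finite set of candidate transition models. This is the standard robust-MDP backup for a single state-action pair, not the ROPI or RO-DQN procedure from the paper. The name `robust_backup` and the finite uncertainty set are assumptions for illustration.

```python
def robust_backup(reward, gamma, values, transition_models):
    """Worst-case one-step backup over a finite uncertainty set (sketch).

    values            -- current value estimate V(s') for each next state
    transition_models -- list of candidate next-state distributions,
                         each a list of probabilities aligned with values
    """
    # Expected next-state value under each candidate model; keep the worst.
    worst = min(
        sum(p * v for p, v in zip(model, values))
        for model in transition_models
    )
    return reward + gamma * worst


values = [0.0, 10.0]
models = [[0.5, 0.5], [0.9, 0.1]]  # two hypothetical transition models
robust = robust_backup(1.0, 0.9, values, models)
nominal = robust_backup(1.0, 0.9, values, models[:1])  # first model only
```

The robust estimate is never higher than the one computed under any single model in the set, which is the price paid for guarantees that hold across all of them.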