arXiv on Feb. 23rd


Title: Convergent Actor-Critic Algorithms Under Off-Policy Training and
  Function Approximation
Authors: Hamid Reza Maei
Categories: cs.AI

We present the first class of policy-gradient algorithms that work with both
state-value and policy function approximation and are guaranteed to converge
under off-policy training. Our solution targets problems in reinforcement
learning where the action representation adds to the curse of dimensionality;
that is, problems with continuous or large action sets, which make it
infeasible to estimate state-action value functions (Q functions). Using
state-value functions helps to lift the curse and, as a result, naturally turns
our policy-gradient solution into a classical Actor-Critic architecture whose
Actor uses the state-value function for its update. Our algorithms, Gradient
Actor-Critic and Emphatic Actor-Critic, are derived from the exact gradient of
the averaged state-value function objective and are thus guaranteed to converge
to its optimal solution, while maintaining all the desirable properties of
classical Actor-Critic methods with no additional hyper-parameters. To our
knowledge, this is the first time that convergent off-policy learning methods
have been extended to classical Actor-Critic methods with function
approximation.

Title: Variational Inference for Policy Gradient
Authors: Tianbing Xu
Categories: cs.LG cs.AI stat.ML
Comments: 7 pages

Inspired by the seminal work on Stein Variational Inference and Stein
Variational Policy Gradient, we derived a method to generate samples from the
posterior variational parameter distribution by \textit{explicitly} minimizing
the KL divergence to match the target distribution in an amortized fashion. We
then applied this variational inference technique to vanilla policy gradient,
TRPO, and PPO with Bayesian Neural Network parameterizations for reinforcement
learning problems.

