### Arxiv on Feb. 26th

28Feb18

Title: Budget Constrained Bidding by Model-free Reinforcement Learning in
Authors: Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Kun
Gai
Categories: cs.AI

Real-time bidding (RTB) is arguably the most important mechanism in online
display advertising, where a proper bid for each page view plays a vital role
in achieving good marketing results. Budget constrained bidding is a typical
scenario in RTB, where advertisers hope to maximize the total value of winning
impressions under a pre-set budget constraint. However, the optimal strategy is
hard to derive due to the complexity and volatility of the auction environment.
To address these challenges, we formulate budget constrained bidding as a
Markov Decision Process. Unlike prior model-based work, we propose a novel
framework based on model-free reinforcement learning which sequentially
regulates the bidding parameter rather than directly producing bids. Along this
line, we further design a reward function which deploys a deep neural network
to learn an appropriate reward and thus leads the agent to deliver the optimal
policy effectively; we also design an adaptive $\epsilon$-greedy strategy which
adjusts the exploration behaviour dynamically and further improves the
performance. Experimental results on a real dataset demonstrate the
effectiveness of our framework.
https://arxiv.org/abs/1802.08365 (1343kb)
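
The idea of regulating a bidding parameter instead of emitting bids directly can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the linear bid rule `value / lam`, the `ADJUSTMENTS` action set, the Q-values, and the use of plain (non-adaptive) epsilon-greedy in place of the paper's adaptive variant.

```python
import random

# Hypothetical action set: relative adjustments applied to the bidding parameter.
ADJUSTMENTS = [-0.08, -0.03, 0.0, 0.03, 0.08]


def bid_price(value, lam):
    """Assumed linear bid rule: predicted impression value divided by lam."""
    return value / lam


def choose_action(q_values, epsilon, rng):
    """Plain epsilon-greedy over action indices (not the adaptive variant)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def regulate(lam, action_index):
    """Apply the chosen relative adjustment to the bidding parameter."""
    return lam * (1.0 + ADJUSTMENTS[action_index])


rng = random.Random(0)
q_values = [0.1, 0.4, 0.2, 0.9, 0.3]      # made-up Q-values for one state
action = choose_action(q_values, 0.0, rng)  # epsilon = 0 means purely greedy
lam = regulate(1.0, action)                 # agent nudges lam, not the bid itself
bid = bid_price(2.06, lam)                  # every bid in the next step uses lam
```

In this sketch the agent acts once per time step on `lam`, and the per-impression bids follow from the fixed rule, which keeps the action space small compared with choosing a price for every auction.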

Title: Ranking Sentences for Extractive Summarization with Reinforcement
Learning
Authors: Shashi Narayan, Shay B. Cohen, Mirella Lapata
Categories: cs.CL
Comments: NAACL 2018, Accepted version, 11 pages

Single document summarization is the task of producing a shorter version of a
document while preserving its principal information content. In this paper we
conceptualize extractive summarization as a sentence ranking task and propose a
novel training algorithm which globally optimizes the ROUGE evaluation metric
through a reinforcement learning objective. We use our algorithm to train a
neural summarization model on the CNN and DailyMail datasets and demonstrate
experimentally that it outperforms state-of-the-art extractive and abstractive
systems when evaluated automatically and by humans.
https://arxiv.org/abs/1802.08636 (50kb)
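
A rough sketch of how an evaluation metric can serve as a reinforcement learning reward for sentence selection follows. It assumes a REINFORCE-style objective and uses a simplified unigram-overlap F1 as a stand-in for ROUGE; the function names, the toy sentences, and the fixed selection probabilities are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 on whitespace tokens (not official ROUGE)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def reinforce_loss(select_probs, selected, reward):
    """Negative reward-weighted log-likelihood of the sampled extract."""
    log_prob = 0.0
    for p, chosen in zip(select_probs, selected):
        log_prob += math.log(p if chosen else 1.0 - p)
    return -reward * log_prob


sentences = ["the cat sat on the mat", "stocks fell sharply", "the cat slept"]
reference = "the cat sat on the mat"
selected = [True, False, False]            # one sampled extract
summary = " ".join(s for s, keep in zip(sentences, selected) if keep)
reward = unigram_f1(summary, reference)    # metric score of the whole extract
loss = reinforce_loss([0.5, 0.5, 0.5], selected, reward)
```

Because the reward is computed on the assembled summary rather than on individual sentence labels, minimizing this loss pushes probability toward extracts that score well as a whole.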

Title: Weighted Double Deep Multiagent Reinforcement Learning in Stochastic
Cooperative Environments
Authors: Yan Zheng, Jianye Hao, Zongzhang Zhang
Categories: cs.MA cs.AI cs.LG

Although single-agent deep reinforcement learning has achieved significant
success due to the experience replay mechanism, some concerns need to be
reconsidered in multiagent environments. This work focuses on the stochastic cooperative
environment. We apply a specific adaptation to one recently proposed weighted
double estimator and propose a multiagent deep reinforcement learning
framework, named Weighted Double Deep Q-Network (WDDQN). To achieve efficient
cooperation, \textit{Lenient Reward Network} and \textit{Mixture Replay
Strategy} are introduced. By utilizing the deep neural network and the weighted
double estimator, WDDQN can not only reduce the bias effectively but also be
extended to many deep RL scenarios with only raw pixel images as input.
Empirically, the WDDQN outperforms the existing DRL algorithm (double DQN) and
multiagent RL algorithm (lenient Q-learning) in terms of performance and
convergence within stochastic cooperative environments.
https://arxiv.org/abs/1802.08534 (1614kb)
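
The abstract does not spell out the weighted double estimator, so the following is only a sketch of one common tabular formulation of weighted double Q-learning, in which the bootstrap target blends a single-estimator value and a double-estimator value. The function name, the constant `c`, and the toy Q-values are assumptions, and whether WDDQN uses exactly this weighting is not confirmed by the text above.

```python
def weighted_double_target(reward, gamma, q_u_next, q_v_next, c=1.0):
    """Blend two Q estimates for the next state into one bootstrap target.

    q_u_next and q_v_next are lists of action values from two estimators.
    """
    actions = range(len(q_u_next))
    a_star = max(actions, key=lambda a: q_u_next[a])  # greedy action under U
    a_low = min(actions, key=lambda a: q_u_next[a])   # worst action under U
    gap = abs(q_v_next[a_star] - q_v_next[a_low])
    beta = gap / (c + gap)                            # weight in [0, 1)
    blended = beta * q_u_next[a_star] + (1.0 - beta) * q_v_next[a_star]
    return reward + gamma * blended


# Toy values: U is optimistic about action 1, V is more conservative.
target = weighted_double_target(0.0, 1.0, [1.0, 3.0], [0.0, 1.0], c=1.0)
```

With a large `c` the weight `beta` approaches 0 and the target tends toward the double estimator, which can underestimate; with a small `c` it tends toward the single estimator, which can overestimate. The weighting is meant to sit between the two.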

Title: Structured Control Nets for Deep Reinforcement Learning
Authors: Mario Srouji, Jian Zhang, Ruslan Salakhutdinov
Categories: cs.LG cs.AI cs.RO
Comments: First two authors contributed equally

In recent years, deep reinforcement learning has made impressive advances in
solving several important benchmark problems for sequential decision making.
Many control applications use a generic multilayer perceptron (MLP) for
non-vision parts of the policy network. In this work, we propose a new neural
network architecture for the policy network representation that is simple yet
effective. The proposed Structured Control Net (SCN) splits the generic MLP
into two separate sub-modules: a nonlinear control module and a linear control
module. Intuitively, the nonlinear control is for forward-looking and global
control, while the linear control stabilizes the local dynamics around the
residual of global control. We hypothesize that this will bring together the
benefits of both linear and nonlinear policies: improve training sample
efficiency, final episodic reward, and generalization of learned policy, while
requiring a smaller network and being generally applicable to different
training methods. We validated our hypothesis with competitive results on
simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban
driving environment, with various ablation and generalization tests, trained
with multiple black-box and policy gradient training methods. The proposed