
A simple actor-critic algorithm for continuous environments

2003

Abstract

In reference to methods analyzed recently by Sutton et al. and by Konda & Tsitsiklis, we propose a modification called the Randomized Policy Optimizer (RPO). The algorithm has a modular structure and is based on the value function rather than on the action-value function. The modules include neural approximators and a parameterized distribution of control actions. The distribution must belong to a family of smoothly exploring distributions, which makes it possible to sample from the control action set in order to approximate a certain gradient. A pre-action-value function is introduced analogously to the action-value function, with the first action replaced by the parameter of the first action distribution. The paper contains an experimental comparison of this approach to reinforcement learning with model-free Adaptive Critic Designs, specifically with the Action-Dependent Adaptive Heuristic Critic. The comparison is favorable for our algorithm.
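The abstract only sketches the algorithm, so the following minimal Python sketch illustrates the general actor-critic scheme it builds on: a state-value critic combined with a parameterized, smoothly exploring action distribution whose sampled actions provide a gradient estimate. This is not the authors' RPO; the Gaussian policy, the linear approximators, the toy one-dimensional environment, and all names (features, alpha_v, alpha_pi, sigma) are illustrative assumptions.

```python
import numpy as np

# A minimal actor-critic sketch (not the authors' RPO): a state-value critic
# plus a Gaussian ("smoothly exploring") action distribution on a 1-D control.
rng = np.random.default_rng(0)

def features(state):
    """Toy state features; a neural approximator would replace this."""
    s = np.atleast_1d(state)
    return np.concatenate(([1.0], s, s ** 2))

n = features(0.0).size
w = np.zeros(n)        # critic weights: V(s) ~= w @ features(s)
theta = np.zeros(n)    # actor weights: policy mean mu(s) = theta @ features(s)
sigma = 0.3            # fixed spread of the Gaussian action distribution
gamma, alpha_v, alpha_pi = 0.95, 0.05, 0.01

def step(state, action):
    """Toy continuous environment: reward for driving the state toward zero."""
    next_state = 0.9 * state + action + 0.05 * rng.standard_normal()
    reward = -(next_state ** 2) - 0.1 * action ** 2
    return next_state, reward

state = rng.standard_normal()
for t in range(5000):
    phi = features(state)
    mu = theta @ phi
    action = rng.normal(mu, sigma)            # sample from the action distribution
    next_state, reward = step(state, action)

    # TD error from the state-value critic (no action-value function needed)
    delta = reward + gamma * (w @ features(next_state)) - w @ phi

    # Critic: move V(s) toward the one-step return
    w += alpha_v * delta * phi
    # Actor: likelihood-ratio gradient of log N(a; mu, sigma) w.r.t. theta
    theta += alpha_pi * delta * ((action - mu) / sigma ** 2) * phi

    state = next_state
```

A Gaussian is a natural member of the "smoothly exploring" family mentioned in the abstract: its log-density is differentiable in the distribution parameter, which is what lets the sampled action yield the likelihood-ratio gradient estimate used in the actor update.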

References

  1. A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834-846, Sept.-Oct. 1983.
  2. D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Massachusetts, 1997.
  3. K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, pp. 243-269, 2000.
  4. V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143-1166, 2003.
  5. D. Liu, X. Xiong, and Y. Zhang, "Action-Dependent Adaptive Critic Designs," Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, Washington, DC, July 2001, pp. 990-995.
  6. D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Networks, vol. 8, pp. 997-1007, Sept. 1997.
  7. J. Si and Y.-T. Wang, "On-line learning control by association and reinforcement," IEEE Transactions on Neural Networks, vol. 12, pp. 264-276, Mar. 2001.
  8. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
  9. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," Advances in Neural Information Processing Systems 12, pp. 1057-1063, MIT Press, 2000.
  10. P. Wawrzynski, "Reinforcement Learning in Control Systems," PhD Thesis, Institute of Control and Computation Engineering, Warsaw University of Technology, forthcoming.
  11. C. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.
  12. R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.