
A simple actor-critic algorithm for continuous environments

2003

Abstract

In reference to methods analyzed recently by Sutton et al. and by Konda & Tsitsiklis, we propose a modification called the Randomized Policy Optimizer (RPO). The algorithm has a modular structure and is based on the value function rather than on the action-value function. The modules include neural approximators and a parameterized distribution of control actions. The distribution must belong to a family of smoothly exploring distributions, which makes it possible to sample from the control action set in order to approximate a certain gradient. A pre-action-value function is introduced analogously to the action-value function, with the first action replaced by the parameter of the first action distribution. The paper contains an experimental comparison of this approach to reinforcement learning with model-free Adaptive Critic Designs, specifically with the Action-Dependent Adaptive Heuristic Critic. The comparison is favorable for our algorithm.
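The abstract only sketches the algorithm, so the following minimal Python sketch illustrates the general actor-critic scheme it builds on: a state-value critic combined with a parameterized, smoothly exploring action distribution whose sampled actions provide a gradient estimate. This is not the authors' RPO; the Gaussian policy, the linear approximators, the toy one-dimensional environment, and all names (features, alpha_v, alpha_pi, sigma) are illustrative assumptions.

```python
import numpy as np

# A minimal actor-critic sketch (not the authors' RPO): a state-value critic
# plus a Gaussian ("smoothly exploring") action distribution on a 1-D control.
rng = np.random.default_rng(0)

def features(state):
    """Toy state features; a neural approximator would replace this."""
    s = np.atleast_1d(state)
    return np.concatenate(([1.0], s, s ** 2))

n = features(0.0).size
w = np.zeros(n)        # critic weights: V(s) ~= w @ features(s)
theta = np.zeros(n)    # actor weights: policy mean mu(s) = theta @ features(s)
sigma = 0.3            # fixed spread of the Gaussian action distribution
gamma, alpha_v, alpha_pi = 0.95, 0.05, 0.01

def step(state, action):
    """Toy continuous environment: reward for driving the state toward zero."""
    next_state = 0.9 * state + action + 0.05 * rng.standard_normal()
    reward = -(next_state ** 2) - 0.1 * action ** 2
    return next_state, reward

state = rng.standard_normal()
for t in range(5000):
    phi = features(state)
    mu = theta @ phi
    action = rng.normal(mu, sigma)            # sample from the action distribution
    next_state, reward = step(state, action)

    # TD error from the state-value critic (no action-value function needed)
    delta = reward + gamma * (w @ features(next_state)) - w @ phi

    # Critic: move V(s) toward the one-step return
    w += alpha_v * delta * phi
    # Actor: likelihood-ratio gradient of log N(a; mu, sigma) w.r.t. theta
    theta += alpha_pi * delta * ((action - mu) / sigma ** 2) * phi

    state = next_state
```

A Gaussian is a natural member of the "smoothly exploring" family mentioned in the abstract: its log-density is differentiable in the distribution parameter, which is what lets the sampled action yield the likelihood-ratio gradient estimate used in the actor update.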

References

  1. A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834-846, Sept.-Oct. 1983.
  2. D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, Massachusetts, 1997.
  3. K. Doya, "Reinforcement learning in continuous time and space," Neural Computation, vol. 12, pp. 243-269, 2000.
  4. V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143-1166, 2003.
  5. D. Liu, X. Xiong, and Y. Zhang, "Action-Dependent Adaptive Critic Designs," Proceedings of the INNS-IEEE International Joint Conference on Neural Networks, Washington, DC, July 2001, pp. 990-995.
  6. D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Networks, vol. 8, pp. 997-1007, Sept. 1997.
  7. J. Si and Y.-T. Wang, "On-line learning control by association and reinforcement," IEEE Transactions on Neural Networks, vol. 12, pp. 264-276, Mar. 2001.
  8. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
  9. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," Advances in Neural Information Processing Systems 12, pp. 1057-1063, MIT Press, 2000.
  10. P. Wawrzynski, "Reinforcement Learning in Control Systems," PhD Thesis, Institute of Control and Computation Engineering, Warsaw University of Technology, forthcoming.
  11. C. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, pp. 279-292, 1992.
  12. R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.