Reward Shaping with Dynamic Trajectory Aggregation
2021, arXiv (Cornell University)
https://doi.org/10.48550/ARXIV.2104.06163

Abstract
Reinforcement learning, which acquires a policy that maximizes long-term reward, has been studied actively. Unfortunately, it is often too slow to use in practical situations because the state-action space becomes huge in real environments. The essential factor for learning efficiency is the reward signal. Potential-based reward shaping is a basic method for enriching rewards; it requires a real-valued "potential function" to be defined for every domain, and this function is often difficult to specify directly. SARSA-RS learns the potential function instead of requiring it in advance, but it can only be applied to simple environments: its bottleneck is the aggregation of states into abstract states, since it is almost impossible for a designer to build an aggregation function covering all states. We propose a dynamic trajectory aggregation that uses a subgoal series. The method aggregates the states visited in an episode during trial and error, using only the subgoal series and a subgoal identification function. This keeps the designer's effort minimal and makes the method applicable to environments with high-dimensional observations. For the experiments, we obtained subgoal series from human participants. We conducted experiments in three domains: four-rooms (discrete states and actions), pinball (continuous states, discrete actions), and picking (continuous states and actions). We compared our method with a baseline reinforcement learning algorithm and with other subgoal-based methods, including random subgoals and naive subgoal-based reward shaping. Our reward shaping outperformed all other methods in learning efficiency.
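To make the idea concrete, the sketch below shows one way such subgoal-based shaping could be wired up. It is a minimal illustration under stated assumptions, not the paper's implementation: the class `SubgoalShaper`, the callback `is_subgoal`, and the use of "number of subgoals achieved so far" as the abstract state are all assumptions; only the shaping term F = γΦ(z') − Φ(z) follows the standard potential-based form.

```python
# Minimal sketch of potential-based reward shaping driven by a subgoal series.
# SubgoalShaper, is_subgoal, and the update rule below are illustrative
# assumptions, not the paper's exact algorithm.

class SubgoalShaper:
    def __init__(self, subgoals, is_subgoal, gamma=0.99, lr=0.1):
        self.subgoals = subgoals        # ordered subgoal series (e.g. from a participant)
        self.is_subgoal = is_subgoal    # identification function: (state, subgoal) -> bool
        self.gamma = gamma
        self.lr = lr
        # One learned potential value per abstract state; here the abstract
        # state is simply the number of subgoals achieved so far in the episode.
        self.potential = [0.0] * (len(subgoals) + 1)
        self.z = 0                      # current abstract state

    def reset(self):
        """Call at the start of each episode."""
        self.z = 0

    def shape(self, next_state, env_reward):
        """Return env_reward plus the shaping term F = gamma * Phi(z') - Phi(z)."""
        z = self.z
        z_next = z
        # Dynamic aggregation: advance the abstract state when the next
        # subgoal in the series is identified in the observed state.
        if z < len(self.subgoals) and self.is_subgoal(next_state, self.subgoals[z]):
            z_next = z + 1
        shaping = self.gamma * self.potential[z_next] - self.potential[z]
        # Learn the potential online over abstract states (SARSA-RS-style);
        # using the environment reward as the abstract-level signal is one
        # simple choice, not necessarily the authors'.
        td_error = env_reward + self.gamma * self.potential[z_next] - self.potential[z]
        self.potential[z] += self.lr * td_error
        self.z = z_next
        return env_reward + shaping
```

In use, a learner would call `shaper.reset()` at the start of each episode and train on `shaper.shape(next_state, reward)` for every transition; because the shaping term has the potential-based form over abstract states, the agent receives denser feedback along the subgoal series without anyone hand-designing a potential function over the raw state space.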