Scalable Planning and Learning for Multiagent POMDPs
Abstract
Online, sample-based planning algorithms for POMDPs have shown great promise in scaling to problems with large state spaces, but they become intractable for large action and observation spaces. This is particularly problematic in multiagent POMDPs, where the action and observation spaces grow exponentially with the number of agents. To combat this intractability, we propose a novel, scalable approach based on sample-based planning and factored value functions that exploits structure present in many multiagent settings. The approach applies not only to planning but also to the Bayesian reinforcement learning setting. Experimental results show that it provides high-quality solutions to large multiagent planning and learning problems.
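To make the factored value-function idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's implementation): the joint Q-value is approximated as a sum of local components Q(a) ≈ Σ_e Q_e(a_e), each depending only on a small subset of agents, and the maximizing joint action is recovered by variable elimination on the coordination graph rather than by enumerating the exponentially large joint action space. The factor scopes, tables, and the function name `argmax_factored_q` are illustrative assumptions.

```python
import itertools
from typing import Dict, List, Tuple

# A factor is (scope, table): `scope` is a tuple of agent indices and `table`
# maps each local joint action of those agents to a local Q-value.
# (Hypothetical representation for illustration only.)
Factor = Tuple[Tuple[int, ...], Dict[Tuple[int, ...], float]]


def argmax_factored_q(factors: List[Factor], n_agents: int, n_actions: int):
    """Maximize sum_e Q_e(a_e) over the joint action by eliminating one
    agent at a time on the coordination graph, then backtracking."""
    best_responses = []  # (agent, conditioning scope, best-response table)
    for agent in range(n_agents):
        involved = [f for f in factors if agent in f[0]]
        remaining = [f for f in factors if agent not in f[0]]
        # The new factor conditions on every other agent sharing a factor with `agent`.
        new_scope = tuple(sorted({i for scope, _ in involved for i in scope if i != agent}))
        new_table: Dict[Tuple[int, ...], float] = {}
        best_resp: Dict[Tuple[int, ...], int] = {}
        for ctx in itertools.product(range(n_actions), repeat=len(new_scope)):
            assignment = dict(zip(new_scope, ctx))
            best_val, best_act = float("-inf"), 0
            for act in range(n_actions):
                assignment[agent] = act
                val = sum(table[tuple(assignment[i] for i in scope)]
                          for scope, table in involved)
                if val > best_val:
                    best_val, best_act = val, act
            new_table[ctx] = best_val
            best_resp[ctx] = best_act
        factors = remaining + [(new_scope, new_table)]
        best_responses.append((agent, new_scope, best_resp))

    # The last remaining factor has an empty scope and holds the maximal value.
    max_value = factors[0][1][()]
    # Recover the maximizing joint action in reverse elimination order.
    joint: Dict[int, int] = {}
    for agent, scope, best_resp in reversed(best_responses):
        joint[agent] = best_resp[tuple(joint[i] for i in scope)]
    return [joint[i] for i in range(n_agents)], max_value


if __name__ == "__main__":
    # Toy 3-agent chain: Q(a) = Q_01(a0, a1) + Q_12(a1, a2), two actions per agent.
    q01 = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 3.0}
    q12 = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.5}
    print(argmax_factored_q([((0, 1), q01), ((1, 2), q12)], n_agents=3, n_actions=2))
    # -> ([1, 1, 1], 5.5)
```

In a sample-based planner, such local tables would be estimated from simulated returns during search; the key point the sketch shows is that action selection touches only local factors, so its cost scales with the factor sizes rather than with the full joint action space.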