Stochastic dynamic programming with factored representations
2000, Artificial Intelligence
Abstract
Markov decision processes (MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation: states are aggregated at the leaves of these trees, and computation is required only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classical regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems, showing significant savings for certain types of problems. We also identify classes of problems for which the technique performs poorly and suggest extensions and related ideas that may prove useful for such problems. Finally, we briefly describe an approximation scheme based on this approach.
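To make the central regression operation concrete, the following is a minimal Python sketch of a one-step decision-theoretic backup over tree-structured representations. It is an illustration under simplifying assumptions (boolean state variables, an action DBN with no synchronic arcs), not the authors' implementation; the `Tree` class, the helper names, and the toy coffee-delivery domain are all invented for the example.

```python
# Minimal sketch (not the authors' code) of a decision-theoretic backup
# over decision-tree representations. All names and the toy domain are
# hypothetical illustrations of the idea.

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Tree:
    """Decision tree over boolean state variables: a leaf stores a value
    (a reward, a probability, or an expected value); an internal node
    tests one variable and branches on its truth value."""
    var: Optional[str] = None          # None => this node is a leaf
    value: float = 0.0                 # leaf value
    hi: Optional["Tree"] = None        # branch taken when var is True
    lo: Optional["Tree"] = None        # branch taken when var is False

    def eval(self, state: Dict[str, bool]) -> float:
        if self.var is None:
            return self.value
        return (self.hi if state[self.var] else self.lo).eval(state)

def leaf(v: float) -> Tree:
    return Tree(value=v)

def split(var: str, hi: Tree, lo: Tree) -> Tree:
    return Tree(var=var, hi=hi, lo=lo)

def expected_next_value(v: Tree, cpts: Dict[str, Tree],
                        state: Dict[str, bool]) -> float:
    """E[V(s') | s, a], recursing on the *value tree's* structure.

    Only variables the value tree actually tests are marginalized over;
    every other variable is irrelevant to predicting value, which is the
    core saving of decision-theoretic regression. Assumes the action's
    DBN has no synchronic arcs, so post-action variables are independent
    given the pre-action state."""
    if v.var is None:
        return v.value
    p = cpts[v.var].eval(state)        # P(v.var = True after action | state)
    return (p * expected_next_value(v.hi, cpts, state)
            + (1 - p) * expected_next_value(v.lo, cpts, state))

def q_value(state: Dict[str, bool], reward: Tree, cpts: Dict[str, Tree],
            v: Tree, gamma: float = 0.9) -> float:
    """One-step backup: Q(s, a) = R(s) + gamma * E[V(s') | s, a]."""
    return reward.eval(state) + gamma * expected_next_value(v, cpts, state)

if __name__ == "__main__":
    # Toy domain: reward and value depend only on HasCoffee; the action
    # "deliver" makes HasCoffee true w.p. 0.9 and never undoes it.
    reward = split("HasCoffee", leaf(10.0), leaf(0.0))
    v0 = reward                        # initial value function V0 = R
    deliver = {"HasCoffee": split("HasCoffee", leaf(1.0), leaf(0.9))}
    s = {"HasCoffee": False, "Raining": True}  # Raining is never consulted
    print(q_value(s, reward, deliver, v0))     # 0 + 0.9 * (0.9 * 10) = 8.1
```

For clarity the sketch evaluates the backup at one explicit state; the algorithms in the paper instead construct the regressed tree symbolically, merging the action's CPT trees into the value tree so that every state reaching the same leaf is handled in a single computation.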