The multi-armed bandit, with constraints
2012, Annals of Operations Research
https://doi.org/10.1007/S10479-012-1250-Y

Abstract
The early sections of this paper present an analysis of a Markov decision model known as the multi-armed bandit under the assumption that the utility function of the decision maker is either linear or exponential. The analysis includes efficient procedures for computing the expected utility associated with the use of a priority policy and for identifying a priority policy that is optimal. The methodology in these sections is novel, building on the use of elementary row operations. In the later sections of this paper, the analysis is adapted to accommodate constraints that link the bandits.

It was demonstrated in [12, 10] that, given each multi-state, it is optimal to play any Markov chain (bandit) whose current state has the largest index (lowest label). Following [12, 10], the multi-armed bandit problem has stimulated research in control theory, economics, probability, and operations research. A sampling of noteworthy papers includes Bergemann and Välimäki [2], Bertsimas and Niño-Mora [4], El Karoui and Karatzas [8], Katehakis and Veinott [15], Schlag [17], Sonin [18], Tsitsiklis [19], Varaiya, Walrand and Buyukkoc [20], Weber [22], and Whittle [24]. Books on the subject (that list many references) include Berry and Fristedt [3], Gittins [11], and Gittins, Glazebrook and Weber [13]. The last and most recent of these books provides a nearly up-to-date status report on the multi-armed bandit.

An implication of the analysis in [12, 10] is that the largest of all of the indices equals the maximum over all states of the ratio r(i)/(1 − c), where r(i) denotes the expectation of the reward that is earned if state i's bandit is played once while state i is observed and where c is the discount factor. In 1994, Tsitsiklis [19] gave a short proof of the Gittins index theorem.
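To make the index identity concrete, here is a minimal Python sketch; the reward values and discount factor are hypothetical and do not appear in the paper. Because the denominator 1 − c is the same for every state, the maximum of r(i)/(1 − c) is attained by the state with the largest one-step expected reward.

```python
# Minimal sketch (hypothetical data): by the identity in the abstract, the
# largest Gittins index over all states is max_i r(i) / (1 - c).  The
# denominator 1 - c is common to every state, so the maximizer is simply
# the state with the largest one-step expected reward.
c = 0.9                                  # discount factor, 0 < c < 1 (assumed)
r = {"s1": 4.0, "s2": 7.0, "s3": 5.5}    # expected one-step rewards (assumed)

top_state = max(r, key=r.get)            # state with the largest r(i)
top_index = r[top_state] / (1.0 - c)     # largest index across all bandits
print(top_state, top_index)              # -> s2, approximately 70.0
```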
References
- Altman, E. 1999. Constrained Markov Decision Processes. Chapman & Hall/CRC, Boca Raton, USA.
- Bergemann, D., J. Välimäki. 2008. Bandit problems. S. Durlauf, L. Blume, eds. The New Palgrave Dictionary of Economics (2nd edition).
- Berry, D.A., B. Fristedt. 1985. Bandit Problems. Chapman and Hall.
- Bertsimas, D., J. Niño-Mora. 1996. Conservation laws, extended polymatroids and multi-armed bandit problems: a polyhedral approach to indexable systems. Mathematics of Operations Research 21, 257-306.
- Denardo, E.V. 1967. Contraction mappings in the theory underlying dynamic programming. SIAM Review 9, 165-177.
- Denardo, E.V., H. Park, U.G. Rothblum. 2007. Risk-sensitive and risk-neutral multiarmed bandits. Mathematics of Operations Research 32, 374-394.
- Denardo, E.V., U.G. Rothblum. 2006. A turnpike theorem for a risk-sensitive Markov decision problem with stopping. SIAM J. Control Optim. 45, 414-431.
- El Karoui, N., I. Karatzas. 1994. Dynamic allocation indices in continuous time. Annals of Applied Probability 4, 255-286.
- Feinberg, E.A., U.G. Rothblum. 2011. Splitting randomized stationary policies in total-reward Markov decision processes. Mathematics of Operations Research, to appear.
- Gittins, J.C. 1979. Bandit problems and dynamic allocation indices (with discussion). Journal of the Royal Statistical Society B. 41, 148-177.
- Gittins, J.C. 1989. Multi-armed bandit allocation indices. John Wiley and Sons Inc.
- Gittins, J.C., D.M. Jones. 1974. A dynamic allocation index for the sequential design of experiments. J. Gani, K. Sarkadi, I. Vincze, eds. Progress in Statistics, European Meeting of Statisticians I, North Holland, Amsterdam, 241-266.
- Gittins, J.C., K. Glazebrook, R. Weber. 2011. Multi-armed bandit allocation indices (2nd edition). John Wiley and Sons Inc.
- Kaspi, H., A. Mandelbaum. 1998. Multi-armed bandits in discrete and continuous time. Annals of Applied Probability 8, 1270-1290.
- Katehakis, M., A.F. Veinott, Jr. 1987. The multiarmed bandit problem: Decomposition and computation. Mathematics of Operations Research 12, 262-268.
- Niño-Mora, J. 2007. A (2/3)n^3 fast-pivoting algorithm for the Gittins index and optimal stopping of a Markov chain. INFORMS Journal on Computing 19, 596-606.
- Schlag, K. 1998. Why imitate, and if so, how? A bounded rational approach to multi-armed bandits. Journal of Economic Theory 78, 130-156.
- Sonin, I. 2008. A generalized Gittins index for Markov chains and its recursive calculation. Statistics and Probability Letters 78, 1526-1533.
- Tsitsiklis, J. 1994. A short proof of the Gittins index theorem. Annals of Applied Probability 4, 194-199.
- Varaiya, P., J. Walrand, C. Buyukkoc. 1985. Extensions of the multi-armed bandit problem: The discounted case. IEEE Trans. Automat. Control AC-30, 426-439.
- Veinott, A.F., Jr. 1969. Discrete dynamic programming with sensitive discount optimality criteria. Ann. Math. Statist. 40, 1635-1660.
- Weber, R. 1992. On the Gittins index for multiarmed bandits. Annals of Applied Probability 2, 1024-1033.
- Weiss, G. 1988. Branching bandit processes. Probability in the Engineering and Informational Sciences 2, 269-278.
- Whittle, P. 1980. Multi-armed bandits and the Gittins index. J. Roy. Statist. Soc. B 42, 143-149.