The multi-armed bandit problem with covariates

2013, The Annals of Statistics

https://doi.org/10.1214/13-AOS1101

Abstract

We consider a multi-armed bandit problem in a setting where each arm produces a noisy reward realization which depends on an observable random covariate. As opposed to the traditional static multi-armed bandit problem, this setting allows for dynamically changing rewards that better describe applications where side information is available. We adopt a nonparametric model where the expected rewards are smooth functions of the covariate and where the hardness of the problem is captured by a margin parameter. To maximize the expected cumulative reward, we introduce a policy called Adaptively Binned Successive Elimination (ABSE) that adaptively decomposes the global problem into suitably "localized" static bandit problems. This policy constructs an adaptive partition using a variant of the Successive Elimination (SE) policy. Our results include sharper regret bounds for the SE policy in a static bandit problem and minimax optimal regret bounds for the ABSE policy in the dynamic problem.
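To make the mechanism concrete, here is a minimal Python sketch of the Successive Elimination building block the abstract describes, wrapped in a statically binned policy over a one-dimensional covariate. This is a sketch under stated assumptions, not the paper's exact construction: the confidence radius, the Gaussian noise model, the bin count, and the names `SuccessiveElimination` and `binned_policy` are illustrative choices, and the adaptive bin splitting that distinguishes ABSE from static binning is omitted.

```python
import math
import random

class SuccessiveElimination:
    """One static Successive Elimination instance (one bin of the covariate
    space). Arms whose upper confidence bound falls below the best lower
    confidence bound among surviving arms are eliminated."""

    def __init__(self, n_arms, delta=0.05):
        self.n_arms = n_arms
        self.delta = delta
        self.active = list(range(n_arms))   # surviving arms
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self._cursor = 0

    def select_arm(self):
        # Play the surviving arms in round-robin order.
        arm = self.active[self._cursor % len(self.active)]
        self._cursor += 1
        return arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        if all(self.counts[i] > 0 for i in self.active):
            self._eliminate()

    def _radius(self, arm):
        # One standard confidence radius; the paper's exact constants differ.
        n = self.counts[arm]
        return math.sqrt(math.log(max(2.0, self.n_arms * n / self.delta)) / (2 * n))

    def _eliminate(self):
        best_lcb = max(self.means[i] - self._radius(i) for i in self.active)
        self.active = [i for i in self.active
                       if self.means[i] + self._radius(i) >= best_lcb]

def binned_policy(arm_means, horizon, n_bins=8, noise=0.1):
    """Statically binned SE on a covariate x in [0, 1]: each bin runs its own
    local SE instance. `arm_means` maps x to the vector of expected rewards.
    ABSE would additionally refine a bin into children as its local problem
    gets resolved; that adaptive step is omitted here."""
    bins = [SuccessiveElimination(len(arm_means(0.0))) for _ in range(n_bins)]
    total = 0.0
    for _ in range(horizon):
        x = random.random()                    # observed covariate
        b = min(int(x * n_bins), n_bins - 1)   # locate its bin
        arm = bins[b].select_arm()
        reward = arm_means(x)[arm] + random.gauss(0.0, noise)
        bins[b].update(arm, reward)
        total += reward
    return total

if __name__ == "__main__":
    # Two arms whose expected rewards cross at x = 0.5, so the identity of
    # the best arm depends on the covariate -- the setting of the abstract.
    cumulative = binned_policy(lambda x: [x, 1.0 - x], horizon=20000)
    print(f"cumulative reward: {cumulative:.1f}")
```

In the example the best arm switches at x = 0.5, so no single arm is globally optimal; within each bin, however, the rewards are nearly constant and the local SE instance faces an essentially static problem, which is the localization idea behind the ABSE policy.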

Authors

Vianney Perchet
LPMA, UMR 7599, Université Paris Diderot, 175 rue du Chevaleret, 75013 Paris, France
E-mail: vianney.perchet@normalesup.org

Philippe Rigollet
Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA
E-mail: rigollet@princeton.edu