
Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy

2021, Cornell University - arXiv

https://doi.org/10.48550/ARXIV.2102.07800

Abstract

Motivated by modern applications such as online advertisement and recommender systems, we study the top-k eXtreme contextual bandits problem, where the total number of arms can be enormous and the learner is allowed to select k arms and observe all or some of the rewards for the chosen arms. We first propose an algorithm for the non-eXtreme realizable setting, utilizing the Inverse Gap Weighting strategy for selecting multiple arms. We show that our algorithm has a regret guarantee of O(k√((A − k + 1)T log(|F|T))), where A is the total number of arms and F is the class containing the regression function, while requiring only Õ(A) computation per time step. In the eXtreme setting, where the total number of arms can be in the millions, we propose a practically motivated arm hierarchy model that induces a certain structure in the mean rewards to ensure statistical and computational efficiency. The hierarchical structure allows for an exponential reduction in the number of relevant arms for each context, resulting in a regret guarantee of O(k√((log A − k + 1)T log(|F|T))). Finally, we implement our algorithm using a hierarchical linear function class and show superior performance with respect to well-known benchmarks in simulated bandit feedback experiments on eXtreme multi-label classification datasets. On a dataset with three million arms, our reduction scheme has an average inference time of only 7.9 milliseconds, a roughly 100x improvement.
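For reference, the single-arm form of the Inverse Gap Weighting rule that the paper's multi-arm selection builds on works as follows (the notation ŷ_t for the current reward estimate and γ_t for the exploration parameter is ours; the paper's actual top-k rule differs in its details). Given a context x_t, let â_t = argmax_a ŷ_t(x_t, a) be the greedy arm. Every other arm is played with probability inversely proportional to its estimated gap from the greedy arm, p_t(a) = 1 / (A + γ_t (ŷ_t(x_t, â_t) − ŷ_t(x_t, a))) for a ≠ â_t, and the greedy arm receives the remaining mass p_t(â_t) = 1 − Σ_{a≠â_t} p_t(a). Larger values of γ_t shift the distribution toward exploitation of the greedy arm.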

FAQs

What are the computational requirements of the proposed top-k algorithms?

The study shows that the top-k algorithms require only Õ(A) computation per time step, leveraging the additive structure of the total reward across the chosen arms. This keeps the method efficient despite the combinatorial nature of the arm selection.

How does the arm hierarchy impact performance in eXtreme contextual bandits?

Implementing an arm hierarchy allows the algorithm to reduce the problem size, achieving performance with only O(log A) effective arms while retaining robust regret guarantees. This results in up to a 100x improvement in inference times on large datasets.
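As a rough illustration of the kind of reduction this enables (a minimal sketch under our own assumptions: the Node class, route_score function, and beam_width parameter are hypothetical and not the paper's construction), routing a context down a tree of arm clusters means only a few nodes are scored per level before a small set of candidate arms is returned:

    # Illustrative sketch only: route a context down a tree of arm clusters so
    # that roughly O(log A) nodes are scored per context instead of all A arms.
    # Node, route_score, and beam_width are hypothetical, not the paper's API.
    import numpy as np

    class Node:
        def __init__(self, arms=(), children=()):
            self.arms = list(arms)          # arm ids held by a leaf
            self.children = list(children)  # sub-clusters of an internal node

        def is_leaf(self):
            return not self.children

    def candidate_arms(x, root, route_score, beam_width=2):
        """Keep the best-scoring clusters at each level and return the arms
        stored in the leaves that survive the descent."""
        frontier = [root]
        while any(not node.is_leaf() for node in frontier):
            children = [child for node in frontier
                        for child in (node.children if node.children else [node])]
            scores = np.array([route_score(x, child) for child in children])
            keep = np.argsort(scores)[-beam_width:]   # indices of top clusters
            frontier = [children[i] for i in keep]
        return [arm for leaf in frontier for arm in leaf.arms]

With a roughly balanced hierarchy the descent depth is logarithmic in A, so the per-context work grows with log A rather than A, which is consistent with the millisecond-scale inference reported for the three-million-arm dataset.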

What novel insights does the paper provide regarding arm correlations and reward structures?

The research observes that rewards for correlated arms tend to vary only slightly, which suggests that exploration can be structured rather than carried out independently over all arms. This is modeled through a context-dependent decomposition of the arm space, which sharpens decision-making during learning.

What is the regret performance of the proposed algorithms under realizability assumptions?

Algorithms in the study demonstrate a top-k regret bound of O(k√((A − k + 1)T log(|F|T))) under the realizability assumption. The bound grows only with the square root of the time horizon, indicating theoretical viability for practical applications.

How does the Inverse Gap Weighting strategy improve decision-making in bandit settings?

The Inverse Gap Weighting (IGW) strategy allows efficient arm selection, balancing exploration and exploitation by playing arms with probability inversely related to how far their estimated rewards fall below the current best estimate. This leads to improved regret performance and makes IGW a competitive choice for top-k contextual bandits.
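A minimal sketch of the single-arm probability assignment may help make this concrete (the function name igw_probabilities and the gamma parameter below are ours for illustration; the paper's algorithm extends the idea to selecting k arms, which is not reproduced here):

    # Single-arm Inverse Gap Weighting: each non-greedy arm is played with
    # probability inversely proportional to its estimated reward gap, and the
    # greedy arm receives all of the remaining probability mass.
    import numpy as np

    def igw_probabilities(estimated_rewards, gamma):
        y = np.asarray(estimated_rewards, dtype=float)
        num_arms = len(y)
        best = int(np.argmax(y))
        p = 1.0 / (num_arms + gamma * (y[best] - y))
        p[best] = 0.0
        p[best] = 1.0 - p.sum()   # remaining mass goes to the greedy arm
        return p

    # Larger gamma concentrates mass on the greedy arm (more exploitation).
    probs = igw_probabilities([0.9, 0.4, 0.1, 0.7], gamma=20.0)
    arm = np.random.default_rng(0).choice(len(probs), p=probs)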
