Coding against stragglers in distributed computation scenarios

2020

Abstract

Data and analytics capabilities have advanced rapidly in recent years, and the volume of available data has grown exponentially. This data must be transferred and stored with extremely high reliability. Coded computing, a distributed computing paradigm that uses coding theory to inject and exploit data and computation redundancy in distributed computing systems, mitigates the fundamental performance bottlenecks of large-scale data analytics. This dissertation first studies a distributed computing framework for input files stored in a distributed manner on the uplink of a cloud radio access network architecture. The focus is on the setting in which decoding at the cloud takes place via network function virtualization on commercial off-the-shelf servers. To mitigate the impact of straggling decoders on this platform, a novel coding strategy is proposed, whereby the cloud re-encodes the received frames via a linear code before distributing ...
