Delay-Resistant Geo-Distributed Analytics
IEEE Transactions on Network and Service Management
https://doi.org/10.1109/TNSM.2022.3192710
Abstract
Big data analytics platforms have played a critical role in the unprecedented success of data-driven applications. However, real-time and streaming data applications, together with recent legislation such as the GDPR in Europe, have imposed constraints on exchanging and analyzing data, especially personal data, across geographic regions. To address such constraints, data has to be processed and analyzed in situ, and only aggregated results are exchanged among the different sites for further processing. Because the sites are geographically distributed, this introduces additional network delays, which can degrade the performance of analytics platforms that are designed to operate in datacenters with low network delays. In this paper, we show that the three most popular big data analytics systems (Apache Storm, Apache Spark, and Apache Flink) fail to tolerate round-trip times greater than 30 milliseconds, even when the input data rate is low. The execution time of distributed big data analytics tasks degrades substantially beyond this threshold, and some of the systems are more sensitive than others. A closer examination of the design of these systems shows that no single system wins in all wide-area settings. However, we show that it is possible to improve the performance of all these popular big data analytics systems significantly, even under transcontinental delays (where inter-node delay is more than 30 milliseconds), and to achieve performance comparable to that within a datacenter for the same load.
Index Terms: Wide-area analytics, big data analytics, geo-distributed systems, networked systems.
I. INTRODUCTION
Big data analytics platforms [1], [2], [3], [4], [5], [6] have played a critical role in the unprecedented success of data-driven applications. Such platforms are typically deployed
References (67)
- "Apache storm." 2020. [Online]. Available: https://storm.apache.org/
- A. Toshniwal et al., "Storm@twitter," in Proc. ACM SIGMOD, 2014, pp. 147-156.
- "Apache spark." 2020. [Online]. Available: https://spark.apache.org/
- M. Zaharia et al., "Spark: Cluster computing with working sets," in Proc. HotCloud, vol. 10, 2010, p. 95.
- "Apache Flink." 2020. [Online]. Available: https://flink.apache.org/
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, "Apache flink: Stream and batch processing in a single engine," Bull. IEEE Comput. Soc. Tech. Committee Data Eng., vol. 36, no. 4, pp. 28-38, 2015.
- "Keystone real-time stream processing platform." 2021. [Online]. Available: https://bit.ly/3ngPoeF (Accessed: Jan. 6, 2021).
- "Video access log processing with apache Flink." 2021. [Online]. Available: https://bit.ly/3ol4Vvz (Accessed: Jan. 6, 2021).
- "2020 identity fraud study: Genesis of the identity fraud crisis." 2021. [Online]. Available: https://bit.ly/3hSTkBn (Accessed: Jan. 6, 2021).
- "The state of online retail performance, spring 2017, Akamai." 2021. [Online]. Available: https://bit.ly/3hSVp09 (Accessed: Jan. 6, 2021).
- "Data protection in the EU, the general data protection regulation (GDPR); regulation (EU) 2016/679." 2016. [Online]. Available: http://bit.ly/3qdVUVo
- L. Kalman, "New European data privacy and cyber security laws: One year later," Commun. ACM, vol. 62, no. 4, p. 38, 2019.
- S. Greengard, "Weighing the impact of GDPR," Commun. ACM, vol. 61, no. 11, pp. 16-18, 2018.
- State of California. "California consumer privacy act-Assembly bill no. 375." 2018. [Online]. Available: http://bit.ly/2K5qOjo
- Office of the Privacy Commissioner of Canada. "Amended act on the personal information protection and electronic documents act." 2018. [Online]. Available: https://bit.ly/3oPDSJ7
- The Privacy Protection Authority of Israel. "Protection of privacy regulations (data security) 5777-2017." 2018. [Online]. Available: https://bit.ly/2LMwcIA
- Personal Information Protection Commission, Japan. "Amended act on the protection of personal information." 2017. [Online]. Available: https://www.ppc.go.jp/en/
- Office of the Australian Information Commissioner. "Australian privacy principles guidelines; Australian privacy principle 5-Notification of the collection of personal information." 2018. [Online]. Available: https://bit.ly/38BBrUP
- A. Rabkin, M. Arye, S. Sen, V. S. Pai, and M. J. Freedman, "Aggregation and degradation in JetStream: Streaming Analytics in the wide area," in Proc. NSDI, 2014, pp. 275-288.
- B. Zhang, X. Jin, S. Ratnasamy, J. Wawrzynek, and E. A. Lee, "AWStream: Adaptive wide-area streaming analytics," in Proc. SIGCOMM, 2018, pp. 236-252.
- R. Viswanathan, G. Ananthanarayanan, and A. Akella, "CLARINET: WAN-aware optimization for analytics queries," in Proc. OSDI, 2016, pp. 435-450.
- Q. Pu et al., "Low latency geo-distributed data analytics," in Proc. ACM SIGCOMM, 2015, pp. 421-434.
- C.-C. Hung, G. Ananthanarayanan, L. Golubchik, M. Yu, and M. Zhang, "Wide-area analytics with multiple resources," in Proc. EuroSys, 2018, pp. 1-16.
- W. Xiao, W. Bao, X. Zhu, and L. Liu, "Cost-aware big data processing across geo-distributed datacenters," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 11, pp. 3114-3127, Nov. 2017.
- A. Vulimiri, C. Curino, B. Godfrey, K. Karanasos, and G. Varghese, "WANalytics: Analytics for a Geo-distributed data-intensive world," in Proc. CIDR, 2015, pp. 1-7.
- D. Kumar, J. Li, A. Chandra, and R. Sitaraman, "A TTL-based approach for data aggregation in Geo-distributed streaming Analytics," Proc. ACM Meas. Anal. Comput. Syst., vol. 3, no. 2, pp. 1-27, 2019.
- W. Li, D. Niu, Y. Liu, S. Liu, and B. Li, "Wide-area spark streaming: Automated routing and batch sizing," IEEE Trans. Parallel Distrib. Syst., vol. 30, no. 6, pp. 1434-1448, Jun. 2019.
- A. Jonathan, A. Chandra, and J. Weissman, "Multi-query optimization in wide-area streaming analytics," in Proc. SoCC, 2018, pp. 412-425.
- F. Lai, J. You, X. Zhu, H. V. Madhyastha, and M. Chowdhury, "Sol: Fast distributed computation over slow networks," in Proc. NSDI, 2020, pp. 273-288.
- B. Heintz, A. Chandra, and R. K. Sitaraman, "Optimizing timeliness and cost in Geo-distributed streaming Analytics," IEEE Trans. Cloud Comput., vol. 8, no. 1, pp. 232-245, Jan.-Mar. 2020.
- A. Jonathan, A. Chandra, and J. Weissman, "Rethinking adaptability in wide-area stream processing systems," in Proc. HotCloud, 2018, pp. 1-9.
- H. Mostafaei, S. Afridi, and J. Abawajy, "Network-aware worker placement for wide-area streaming analytics," Future Gener. Comput. Syst., vol. 136, pp. 270-281, Nov. 2022.
- L. Chen, S. Liu, and B. Li, "Optimizing network transfers for data analytic jobs across Geo-distributed Datacenters," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 2, pp. 403-414, Feb. 2022.
- Y. Chen, L. Luo, D. Guo, O. Rottenstreich, and J. Wu, "SDTP: Accelerating wide-area data analytics with simultaneous data transfer and processing," IEEE Trans. Cloud Comput., early access, Oct. 15, 2021, doi: 10.1109/TCC.2021.3119991.
- K. Beedkar, D. Brekardin, J.-A. Quiané-Ruiz, and V. Markl, "Compliant geo-distributed data processing in action," Proc. VLDB Endow., vol. 14, no. 12, pp. 2843-2846, Jul. 2021.
- A. Jonathan, A. Chandra, and J. Weissman, "WASP: Wide-area adaptive stream processing," in Proc. ACM/IFIP/USENIX Middleware, 2020, pp. 221-235.
- Y. Jin et al., "Zooming in on wide-area latencies to a global cloud provider," in Proc. SIGCOMM, 2019, pp. 104-116.
- W. Reda et al., "Path persistence in the cloud: A study of the effects of inter-region traffic engineering in a large cloud provider's network," SIGCOMM Comput. Commun. Rev., vol. 50, no. 2, pp. 11-23, 2020.
- C.-Y. Hong et al., "Achieving high utilization with software-driven WAN," in Proc. ACM SIGCOMM, 2013, pp. 15-26.
- S. Chintapalli et al., "Benchmarking streaming computation engines: Storm, Flink and spark streaming," in Proc. IPDPSW, 2016, pp. 1789-1792.
- J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl, "Benchmarking distributed stream data processing systems," in Proc. ICDE, 2018, pp. 1507-1518.
- M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proc. GLOBECOM, 2016, pp. 1-6.
- "Apache Zookeeper." 2020. [Online]. Available: https://zookeeper.apache.org/
- "Apache Hadoop." 2020. [Online]. Available: https://hadoop.apache.org/
- "What is/are the main difference(s) between Flink and storm?" 2015. [Online]. Available: https://bit.ly/3qiaQSF
- S. Saxena and S. Gupta, Practical Real-Time Data Processing and Analytics: Distributed Computing and Event Processing Using Apache Spark, Flink, Storm, and Kafka. London, U.K.: Packt, 2017.
- "Trident API overview." 2021. [Online]. Available: https://bit.ly/3win8NC
- S. Zeuch et al., "Analyzing efficient stream processing on modern hardware," Proc. VLDB Endow., vol. 12, no. 5, pp. 516-530, 2019.
- "Extending the Yahoo streaming benchmarks." 2020. [Online]. Available: https://github.com/dataArtisans/yahoo-streaming-benchmark
- "Apache kafka." 2020. [Online]. Available: https://kafka.apache.org/
- "Redis." 2020. [Online]. Available: https://redis.io/
- J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, pp. 74-80, 2013.
- "Traffic control (TC)." 2020. [Online]. Available: https://wiki.debian.org/TrafficControl
- "The state of the Internet." 2020. [Online]. Available: https://bit.ly/39gKAS4
- "Apache storm: Performance tuning." 2021. [Online]. Available: https://bit.ly/3bLzYfM
- "Equinix." 2020. [Online]. Available: https://www.equinix.com/data-centers/
- "Amazon global infrastructure." 2020. [Online]. Available: http://amzn.to/38zsFq4
- "Data residency in azure." 2020. [Online]. Available: http://bit.ly/3qdWimV
- "Google Datacenters." 2020. [Online]. Available: https://about.google/ locations/
- "AT&T network delay." 2020. [Online]. Available: https://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html (Accessed: Apr. 11, 2020).
- "Global ping statistics: Ping times between Wonder Network servers." 2020. [Online]. Available: https://wondernetwork.com/pings (Accessed: Sep. 10, 2020).
- "Analysis of network flow control and back pressure: Flink advanced tutorials." 2020. [Online]. Available: https://bit.ly/340lvZ7
- "Spark streaming programming guide." 2020. [Online]. Available: http://bit.ly/3bw3bfb (Accessed: Dec. 28, 2020).
- "Socket statistics (SS)." 2020. [Online]. Available: http://bit.ly/39nj2dm
- "Can apache spark use TCP listener as input?" 2019. [Online]. Available: https://bit.ly/3bJIUSV
- "vmStat-Report virtual memory statistics." 2020. [Online]. Available: http://bit.ly/38BlTAd
- A. Uta et al., "Is big data performance reproducible in modern cloud networks?" in Proc. NSDI, 2020, pp. 513-527.