Academia.eduAcademia.edu

Outline

State management in Apache Flink®

Proceedings of the VLDB Endowment

https://doi.org/10.14778/3137765.3137777

Abstract

Stream processors are emerging in industry as an apparatus that drives analytical but also mission critical services handling the core of persistent application logic. Thus, apart from scalability and low-latency, a rising system need is first-class support for application state together with strong consistency guarantees, and adaptivity to cluster reconfigurations, software patches and partial failures. Although prior systems research has addressed some of these specific problems, the practical challenge lies on how such guarantees can be materialized in a transparent, non-intrusive manner that relieves the user from unnecessary constraints. Such needs served as the main design principles of state management in Apache Flink, an open source, scalable stream processor. We present Flink's core pipelined, in-flight mechanism which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution. Co...

Key takeaways
sparkles

AI

  1. Apache Flink integrates state management with computation, ensuring strong consistency and low-latency processing.
  2. Flink's snapshotting mechanism, introduced in version 0.9, captures consistent application state without interrupting execution.
  3. The system handles partial failures and reconfigurations seamlessly, maintaining operational continuity during state changes.
  4. Flink supports various state types (Keyed-State and Operator-State) to manage application-specific requirements effectively.
  5. Large-scale deployments demonstrate Flink's capability to handle hundreds of gigabytes of state across thousands of nodes.

References (47)

  1. REFERENCES
  2. AthenaX : Ubers stream processing platform on Flink. http: //sf.flink-forward.org/kb sessions/athenax- ubers-streaming-processing-platform-on-flink/.
  3. Blink: How Alibaba Uses Apache Flink. http://data- artisans.com/blink-flink-alibaba-search/, 2016.
  4. Introduction to Spark's Structured Streaming. https://www.oreilly.com/learning/apache-spark- 2-0--introduction-to-structured-streaming, 2016.
  5. Rbea: Scalable Real-Time Analytics at King. https://techblog.king.com/rbea-scalable-real- time-analytics-king/, 2016.
  6. Apache Beam. https://beam.apache.org/, 2017.
  7. Apache Flink. http://flink.apache.org/, 2017.
  8. Apache Storm. http://storm.apache.org/, 2017.
  9. Flink Forward. http://flink-forward.org/, 2017.
  10. Flink Survey. http://data-artisans.com/flink-user- survey-2016-part-1/, http://data-artisans.com/ flink-user-survey-2016-part-2/, 2017.
  11. Google Cloud Dataflow. https://cloud.google.com/dataflow/, 2017.
  12. Real-time monitoring with Flink, Kafka and HB. http: //2016.flink-forward.org/kb sessions/a-brief- history-of-time-with-apache-flink-real-time- monitoring-and-analysis-with-flink-kafka-hb/, 2017.
  13. Rockdb. http://rocksdb.org/, 2017.
  14. Stream processing with Flink at Netflix. http: //sf.flink-forward.org/kb sessions/keynote- stream-processing-with-flink-at-netflix/, 2017.
  15. StreamING models, how ING adds models at runtime to catch fraudsters. http://sf.flink-forward.org/kb sessions/ streaming-models-how-ing-adds-models-at- runtime-to-catch-fraudsters/, 2017.
  16. The Trident Stream Processing Programming Model. http://storm.apache.org/releases/0.10.0/Trident- tutorial.html, 2017.
  17. D. J. Abadi, D. Carney, U. C ¸etintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. VLDBJ, 2003.
  18. T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. In VLDB, 2013.
  19. T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. VLDB, 2015.
  20. A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, et al. The Stratosphere platform for big data analytics. The VLDB Journal - The International Journal on Very Large Data Bases, 2014.
  21. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. Stream: The stanford data stream management system. Book chapter, 2004.
  22. D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM symposium on Cloud computing, pages 119-130. ACM, 2010.
  23. P. Carbone, S. Ewen, S. Haridi, A. Katsifodimos, V. Markl, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. IEEE Data Engineering Bulletin, page 28, 2015.
  24. R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In Proceedings of the 2013 ACM SIGMOD international conference on Management of data, pages 725-736. ACM, 2013.
  25. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In ACM Sigplan Notices. ACM, 2010.
  26. S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 668-668. ACM, 2003.
  27. K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS), 3(1):63-75, 1985.
  28. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
  29. M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. B. Zdonik. Scalable distributed stream processing. In CIDR, volume 3, pages 257-268, 2003.
  30. G. De Francisci Morales and A. Bifet. Samoa: Scalable advanced massive online analysis. The Journal of Machine Learning Research, 16(1):149-153, 2015.
  31. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
  32. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. ACM SIGOPS operating systems review, 41(6):205-220, 2007.
  33. E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375-408, 2002.
  34. B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: batched stream processing for data intensive distributed computing. In Proceedings of the 1st ACM symposium on Cloud computing, pages 63-74. ACM, 2010.
  35. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, 2011.
  36. M. Hirzel, R. Soulé, S. Schneider, B. Gedik, and R. Grimm. A catalog of stream processing optimizations. ACM Computing Surveys (CSUR), 46(4):46, 2014.
  37. P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: Wait-free coordination for internet-scale systems. In USENIX annual technical conference, volume 8, page 9, 2010.
  38. G. Jacques-Silva, F. Zheng, D. Debrunner, K.-L. Wu, V. Dogaru, E. Johnson, M. Spicer, and A. E. Sariyüce. Consistent regions: guaranteed tuple processing in ibm streams. Proceedings of the VLDB Endowment, 9(13):1341-1352, 2016.
  39. J. Kreps, N. Narkhede, J. Rao, et al. Kafka: A distributed messaging system for log processing. NetDB, 2011.
  40. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716-727, 2012.
  41. N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications Co., 2015.
  42. D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In ACM SOSP, 2013.
  43. V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing, page 5. ACM, 2013.
  44. H. Wang, L.-S. Peh, E. Koukoumidis, S. Tao, and M. C. Chan. Meteor shower: A reliable stream processing system for commodity data centers. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012.
  45. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language.
  46. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 2010.
  47. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing, pages 10-10. USENIX Association, 2012.