Druid: A Real-time Analytical Data Store

https://doi.org/10.1145/2588555.2595631

Abstract

Druid is an open source data store designed for real-time exploratory analytics on large data sets. The system combines a column-oriented storage layout, a distributed, shared-nothing architecture, and an advanced indexing structure to allow for the arbitrary exploration of billion-row tables with sub-second latencies. In this paper, we describe Druid's architecture and detail how it supports fast aggregations, flexible filters, and low-latency data ingestion.
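
The abstract does not spell out the storage format or the index itself. As a rough illustration only, the sketch below shows how a column-oriented layout and a per-value bitmap index can combine to answer a filtered aggregation. The table, the column names, and the helpers `build_bitmap_index` and `sum_where` are all hypothetical, and the choice of a bitmap index is an assumption made for this example, not a description of Druid's implementation.

```python
from collections import defaultdict

# Hypothetical toy table stored column by column (made-up data,
# not Druid's actual on-disk format).
columns = {
    "page": ["home", "home", "docs", "docs"],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "added": [10, 20, 30, 40],
}

def build_bitmap_index(values):
    """Map each distinct value to an int whose set bits mark matching rows."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def sum_where(bitmap, metric):
    """Sum a metric column over the rows selected by a bitmap."""
    return sum(v for row, v in enumerate(metric) if (bitmap >> row) & 1)

page_index = build_bitmap_index(columns["page"])
city_index = build_bitmap_index(columns["city"])

# Filter: page == "docs" OR city == "Lima", evaluated as a bitwise OR
# over the two bitmaps; only the "added" column is then scanned.
selected = page_index["docs"] | city_index["Lima"]
total_added = sum_where(selected, columns["added"])
```

In this sketch the filter touches only the index bitmaps and the aggregation reads only the one metric column it needs, which is the general appeal of pairing columnar storage with an index when filtering and aggregating.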
