Druid: A Real-time Analytical Data Store
https://doi.org/10.1145/2588555.2595631
Abstract
Druid is an open source data store designed for real-time exploratory analytics on large data sets. The system combines a column-oriented storage layout, a distributed, shared-nothing architecture, and an advanced indexing structure to allow for the arbitrary exploration of billion-row tables with sub-second latencies. In this paper, we describe Druid's architecture, and detail how it supports fast aggregations, flexible filters, and low latency data ingestion.