Cassandra

Avinash Lakshman; Prashant Malik

doi:10.1145/1773912.1773922

Cassandra - A Decentralized Structured Storage System

jamie luis zegarra tovar

https://doi.org/10.1145/1773912.1773922


6 pages

Abstract

Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies with one, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. The Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
