Distributed Lock Management with RDMA

Dong Young Yoon; Mosharaf Chowdhury; Barzan Mozafari

doi:10.1145/3183713.3196890

2018, Proceedings of the 2018 International Conference on Management of Data

https://doi.org/10.1145/3183713.3196890

16 pages

Abstract

Lock managers are a crucial component of modern distributed systems. However, with the increasing availability of fast RDMA-enabled networks, traditional lock managers can no longer keep up with the latency and throughput requirements of modern systems. Centralized lock managers can ensure fairness and prevent starvation using global knowledge of the system, but are themselves single points of contention and failure. Consequently, they fall short in leveraging the full potential of RDMA networks. On the other hand, decentralized (RDMA-based) lock managers either completely sacrifice global knowledge to achieve higher throughput at the risk of starvation and higher tail latencies, or they resort to costly communications in order to maintain global knowledge, which can result in significantly lower throughput. In this paper, we show that it is possible for a lock manager to be fully decentralized and yet exchange the partial knowledge necessary for preventing starvation and thereby reducing tail latencies. Our main observation is that we can design a lock manager primarily using RDMA's fetch-and-add (FA) operations, which always succeed, rather than compare-and-swap (CAS) operations, which only succeed if a given condition is satisfied. While this requires us to rethink the locking mechanism from the ground up, it enables us to sidestep the performance drawbacks of the previous CAS-based proposals that relied solely on blind retries upon lock conflicts. Specifically, we present DSLR (Decentralized and Starvation-free Lock management with RDMA), a decentralized lock manager that targets distributed systems running on RDMA-enabled networks. We demonstrate that, despite being fully decentralized, DSLR prevents starvation and blind retries by guaranteeing first-come-first-serve (FCFS) scheduling without maintaining explicit queues. We adapt Lamport's bakery algorithm [36] to an RDMA-enabled environment with multiple bakers, utilizing only one-sided READ and atomic FA operations. Our experiments show that, on average, DSLR delivers 1.8× (and up to 2.8×) higher throughput than all existing RDMA-based lock managers, while reducing their mean and 99.9% latencies by 2.0× and 18.3× (and up to 2.5× and 47×), respectively.
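
The contrast the abstract draws between FA and CAS can be illustrated with a small, purely local sketch. It is not DSLR's protocol: it is a generic ticket lock in which a single fetch-and-add fixes each caller's place in line, so requests are served first-come-first-serve without an explicit queue and without retries. The names `RemoteWord`, `TicketLock`, and `run_demo` are hypothetical, and a `threading.Lock` stands in for the atomicity that an RDMA NIC would provide on remote memory.

```python
import threading
import time


class RemoteWord:
    """Local stand-in for a word in remote memory (hypothetical name).

    A threading.Lock emulates the atomicity an RDMA NIC would provide.
    """

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_add(self, delta):
        # FA always succeeds: it adds delta and returns the previous value.
        with self._guard:
            old = self._value
            self._value += delta
            return old

    def compare_and_swap(self, expected, new):
        # CAS succeeds only if the current value equals `expected`.
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True

    def read(self):
        with self._guard:
            return self._value


class TicketLock:
    """Exclusive lock that grants access in ticket order (FCFS)."""

    def __init__(self):
        self.next_ticket = RemoteWord()
        self.now_serving = RemoteWord()

    def acquire(self):
        # One FA fixes this caller's place in line; no retry is needed.
        ticket = self.next_ticket.fetch_and_add(1)
        while self.now_serving.read() != ticket:
            time.sleep(0)  # yield while polling, in the spirit of repeated READs
        return ticket

    def release(self):
        self.now_serving.fetch_and_add(1)


def run_demo(num_workers=4, rounds=25):
    """Return the tickets in the order they entered the critical section."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)  # executed inside the critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return served


if __name__ == "__main__":
    order = run_demo()
    print(order == list(range(len(order))))
```

A CAS-based design would instead call `compare_and_swap` in a loop and retry blindly whenever another client changed the word first, which is the behavior the abstract says DSLR avoids. This sketch also omits shared locks, counter overflow, and failure handling.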

Key takeaways

On average, DSLR delivers 1.8× (and up to 2.8×) higher throughput than existing RDMA-based lock managers.
The proposed decentralized lock manager prevents starvation and ensures fairness using fetch-and-add operations.
DSLR adapts Lamport's bakery algorithm to an RDMA-enabled environment with multiple bakers, using only one-sided READ and atomic FA operations.
The system can handle transaction failures using leases, enabling fault tolerance in decentralized settings.
Experiments show DSLR reduces mean and 99.9% latencies by 2.0× and 18.3× on average (and by up to 2.5× and 47×), respectively, compared to existing RDMA-based lock managers.

References (64)

2016. RDMA over Converged Ethernet. http://www.roceinitiative.org/. (2016).
2017. APT. https://www.aptlab.net/. (2017).
2017. Druid | Interactive Analytics at Scale. http://druid.io/. (2017).
2017. InfiniBand Architecture Specification, Release 1.3. https://cw.infinibandta.org/document/dl/7859. (2017).
2017. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. (2017).
2017. Teradata: Business Analytics, Hybrid Cloud & Consulting. http://www.teradata.com/. (2017).
2018. Perftest Package | Mellanox Interconnect Community. https://community.mellanox.com/docs/DOC-2802. (2018).
Peter Bailis et al. 2014. Coordination avoidance in database systems. PVLDB.
Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Kossmann. 2015. Rack-scale in-memory join processing using RDMA. In SIGMOD.
Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, and Erfan Zamanian. 2016. The end of slow networks: it's time for a redesign. PVLDB.
Pablo Brenner. 1997. A technical tutorial on the IEEE 802.11 protocol. BreezeCom Wireless Communications (1997).
Mike Burrows. 2006. The Chubby lock service for loosely-coupled distributed systems. In USENIX OSDI.
Yeounoh Chung and Erfan Zamanian. 2015. Using RDMA for Lock Management. arXiv preprint arXiv:1507.03274 (2015).
Crispin Cowan, F Wagle, Calton Pu, Steve Beattie, and Jonathan Walpole. 2000. Buffer overflows: Attacks and defenses for the vulnerability of the decade. In DARPA Information Survivability Conference and Exposition, 2000. DISCEX'00. Proceedings.
Peter B Danzig, Katia Obraczka, and Anant Kumar. 1992. An analysis of wide-area name server traffic: a study of the Internet Domain Name System. ACM SIGCOMM Computer Communication (1992).
Ananth Devulapalli and Pete Wyckoff. 2005. Distributed queue-based locking using advanced network features. In ICPP.
Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: fast remote memory. In USENIX NSDI.
Steven Fitzgerald et al. 1997. A directory service for configuring high-performance distributed computations. In High Performance Distributed Computing, 1997. Proceedings. The Sixth IEEE International Symposium on.
Kristen Gardner, Samuel Zbarsky, Sherwin Doroudi, Mor Harchol-Balter, and Esa Hyytia. 2015. Reducing latency via redundant requests: Exact analysis. ACM SIGMETRICS Performance Evaluation Review (2015).
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In ACM SIGOPS operating systems review.
Kishore Gopalakrishna et al. 2012. Untangling cluster management with Helix. In Proceedings of the Third ACM Symposium on Cloud Computing.
Cary Gray and David Cheriton. 1989. Leases: An efficient fault-tolerant mechanism for distributed file cache consistency.
Andrew B Hastings. 1990. Distributed lock management in a transaction processing environment. In Reliable Distributed Systems, 1990. Proceedings., Ninth Symposium on.
Jiamin Huang, Barzan Mozafari, Grant Schoenebeck, and Thomas Wenisch. 2017. A Top-Down Approach to Achieving Performance Predictability in Database Systems. In SIGMOD.
Jiamin Huang, Barzan Mozafari, and Thomas Wenisch. 2017. Statistical Analysis of Latency Through Semantic Profiling. In EuroSys.
Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In USENIX ATC.
Prasad Jayanti, King Tan, Gregory Friedland, and Amir Katz. 2001. Bounding Lamport's bakery algorithm. In International Conference on Current Trends in Theory and Practice of Computer Science.
Horatiu Jula et al. 2008. Deadlock Immunity: Enabling Systems to Defend Against Deadlocks. In OSDI.
Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In ACM SIGCOMM Computer Communication Review.
Anuj Kalia, Michael Kaminsky, and David G Andersen. 2016. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In USENIX OSDI.
Rishi Kapoor, George Porter, Malveeka Tewari, Geoffrey M Voelker, and Amin Vahdat. 2012. Chronos: Predictable low latency for data center applications. In Proceedings of the Third ACM Symposium on Cloud Computing.
Nancy P Kronenberg, Henry M Levy, and William D Strecker. 1986. VAXcluster: a closely-coupled distributed system. ACM Transactions on Computer Systems.
Byung-Jae Kwak, Nah-Oak Song, and Leonard E Miller. 2005. Performance analysis of exponential backoff. IEEE/ACM Transactions on Networking.
Leslie Lamport. 1974. A new solution of Dijkstra's concurrent programming problem. Commun. ACM (1974).
Leslie Lamport et al. 2001. Paxos made simple. ACM Sigact News (2001).
Feng Li, Sudipto Das, Manoj Syamala, and Vivek R Narasayya. 2016. Accelerating relational databases by leveraging remote memory and rdma. In SIGMOD.
Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing.
Jialin Li, Naveen Kr Sharma, Dan RK Ports, and Steven D Gribble. 2014. Tales of the tail: Hardware, os, and application-level sources of tail latency. In Proceedings of the ACM Symposium on Cloud Computing.
Gang Luo, Jeffrey F Naughton, Curt J Ellmann, and Michael W Watzke. 2010. Transaction reordering. Data & Knowledge Engineering (2010).
Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX ATC.
Barzan Mozafari, Carlo Curino, Alekh Jindal, and Samuel Madden. 2013. Performance and resource modeling in highly-concurrent OLTP workloads. In SIGMOD.
Barzan Mozafari, Carlo Curino, and Samuel Madden. 2013. DBSeer: Resource and Performance Prediction for Building a Next Generation Database Cloud. In CIDR.
Barzan Mozafari, Eugene Zhen Ye Goh, and Dong Young Yoon. 2015. CliffGuard: A Principled Framework for Finding Robust Database Designs. In SIGMOD.
Barzan Mozafari, Jags Ramnarayan, Sudhir Menon, Yogesh Mahajan, Soubhik Chakraborty, Hemant Bhanawat, and Kishor Bachhav. 2017. SnappyData: A Unified Cluster for Streaming, Transactions, and Interactive Analytics. In CIDR.
Sundeep Narravula, A Marnidala, Abhinav Vishnu, Karthikeyan Vaidyanathan, and Dhabaleswar K Panda. 2007. High performance distributed lock management services using network-based remote atomic operations. In Seventh IEEE International Symposium on Cluster Computing and the Grid.
Jacob Nelson et al. 2015. Latency-tolerant software distributed shared memory. In USENIX ATC.
Ravi Rajwar and James R Goodman. 2002. Transactional lock-free execution of lock-based programs. In ACM SIGOPS Operating Systems Review.
Jags Ramnarayan, Barzan Mozafari, Sudhir Menon, Sumedh Wale, Neeraj Kumar, Hemant Bhanawat, Soubhik Chakraborty, Yogesh Mahajan, Rishitesh Mishra, and Kishor Bachhav. 2016. SnappyData: A hybrid transactional analytical store built on Spark. In SIGMOD.
KV Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding. In USENIX OSDI.
Ro Recio, P Culley, D Garcia, J Hilland, and B Metzler. 2005. An RDMA protocol specification. Technical Report. IETF Internet-draft draft-ietf-rddp-rdmap-03.txt (work in progress).
Kun Ren, Alexander Thomson, and Daniel J Abadi. 2015. VLL: a lock manager redesign for main memory database systems. The VLDB Journal (2015).
Wolf Rödiger, Tobias Mühlbauer, Alfons Kemper, and Thomas Neumann. 2015. High-speed query processing over high-speed networks. PVLDB.
Dongin Shin et al. 2013. Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory. In HotStorage.
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The hadoop distributed file system. In 2010 IEEE 26th symposium on mass storage systems and technologies (MSST).
Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. (2013).
Gadi Taubenfeld. 2004. The black-white bakery algorithm and related bounded-space, adaptive, local-spinning and FIFO algorithms. Distributed Computing (2004).
Boyu Tian, Jiamin Huang, Barzan Mozafari, Grant Schoenebeck, and Thomas Wenisch. 2018. Contention-aware lock scheduling for transactional databases. PVLDB (2018).
Ashish Vulimiri, Oliver Michel, P Godfrey, and Scott Shenker. 2012. More is less: reducing latency via redundancy. In Proceedings of the 11th ACM Workshop on Hot Topics in Networks.
Carl A Waldspurger. 1995. Lottery and stride scheduling: Flexible proportional-share resource management. (1995).
Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast in-memory transaction processing using RDMA and HTM. In SOSP.
Cong Yan and Alvin Cheung. 2016. Leveraging lock contention to improve OLTP application performance. PVLDB.
Dong Young Yoon, Ning Niu, and Barzan Mozafari. 2016. DBSherlock: A Performance Diagnostic Tool for Transactional Databases. In SIGMOD.
Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. HotCloud (2010).
Lei Zhang, Yu Chen, Yaozu Dong, and Chao Liu. 2012. Lock-Visor: An efficient transitory co-scheduling for MP guest. In ICPP.

Computer systems are designed to make resources available to users, and users may be interested in some resources more than others; therefore, a coordination scheme is required to satisfy the users' requirements. This scheme may implement certain policies such as "never allocate more than X units of resource Z". One policy of particular interest is that users must not access a single resource at the same time, which is called the problem of mutual exclusion. Resource management concerns the coordination and collaboration of users, and it is usually based on making a decision. In the case of mutual exclusion, that decision is about granting access to a resource. Therefore, mutual exclusion is useful for supporting resource access management. The first true solution to the mutual exclusion problem is known as the Bakery algorithm, which does not rely on any lower-level mutual exclusion. We examine the problem of register overflow in real-world implementations of the Bakery algorithm and present a variant algorithm named Bakery++ that prevents overflows from ever happening. Bakery++ avoids overflows without allowing a process to write into other processes' memory and without using additional memory or complex arithmetic or redefining the operations and functions used in Bakery. Bakery++ is almost as simple as Bakery and it is straightforward to implement in real systems. With Bakery++, there is no reason to keep implementing Bakery in real computers because Bakery++ eliminates the possibility of overflows and hence is more practical than Bakery. Previous approaches to circumvent the problem of register overflow included introducing new variables or redefining the operations or functions used in the original Bakery algorithm, while Bakery++ avoids overflows by using simple conditional statements. The result is a new mutual exclusion algorithm that is guaranteed never to allow an overflow and is simple, correct, and easy to implement. Bakery++ has the same temporal and spatial complexities as the original Bakery. We have specified Bakery++ in PlusCal and we have used the TLC model checker to assert that Bakery++ maintains the mutual exclusion property and that it never allows an overflow.

CCS Concepts: Software and its engineering → Software organization and properties → Contextual software domains → Operating systems → Process management → Mutual exclusion
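
As a rough illustration of the Bakery algorithm this abstract builds on, the following sketch implements Lamport's original algorithm (not Bakery++, whose overflow-avoidance rules are not given here) for a few Python threads. The names `Bakery` and `run_bakery_demo` are hypothetical. The sketch assumes sequentially consistent reads and writes of list elements, which CPython's interpreter lock provides in practice. Tickets are unbounded Python integers, so the register overflow the abstract discusses cannot occur here, although the same logic with fixed-width registers would eventually overflow under sustained contention.

```python
import threading
import time


class Bakery:
    """Lamport's original bakery lock for n threads (unbounded tickets)."""

    def __init__(self, n):
        self.choosing = [False] * n
        self.number = [0] * n

    def lock(self, i):
        # Doorway: take a ticket larger than every ticket currently visible.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(len(self.number)):
            # Wait until thread j has finished picking its ticket.
            while self.choosing[j]:
                time.sleep(0)
            # Wait while thread j holds a smaller (ticket, id) pair.
            while self.number[j] != 0 and (self.number[j], j) < (self.number[i], i):
                time.sleep(0)

    def unlock(self, i):
        self.number[i] = 0


def run_bakery_demo(n=3, rounds=20):
    """Return (final counter, largest ticket observed)."""
    bakery = Bakery(n)
    shared = [0]
    largest = [0]  # bookkeeping only, not part of the algorithm

    def worker(i):
        for _ in range(rounds):
            bakery.lock(i)
            largest[0] = max(largest[0], bakery.number[i])
            # Deliberately non-atomic update: safe only under mutual exclusion.
            tmp = shared[0]
            time.sleep(0)
            shared[0] = tmp + 1
            bakery.unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0], largest[0]


if __name__ == "__main__":
    print(run_bakery_demo())
```

If mutual exclusion holds, the final counter equals `n * rounds`; the largest ticket observed depends on scheduling and is reported only to show how ticket values can grow while the bakery stays busy.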
