Academia.edu

State Machine Replication

39 papers
1 follower
About this topic
State Machine Replication is a method in distributed computing that ensures consistency and fault tolerance by replicating a state machine across multiple nodes. It guarantees that all replicas execute the same sequence of operations, thereby maintaining a consistent state despite failures or network partitions.
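The core idea can be illustrated with a minimal sketch. The key-value state machine, the operation format, and the names `KVStateMachine`, `replay`, and `agreed_log` are all hypothetical and chosen only for illustration; the agreed log is hard-coded here, whereas a real system would obtain it from a consensus protocol.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """A deterministic key-value state machine (hypothetical example)."""
    state: dict = field(default_factory=dict)

    def apply(self, op: tuple) -> None:
        # Each operation is a tuple: ("set", key, value) or ("delete", key).
        kind, key, *rest = op
        if kind == "set":
            self.state[key] = rest[0]
        elif kind == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")


def replay(log: list) -> dict:
    """Build a fresh replica and apply every operation in log order."""
    replica = KVStateMachine()
    for op in log:
        replica.apply(op)
    return replica.state


# Assumption: this log stands in for the output of a consensus protocol.
agreed_log = [("set", "x", 1), ("set", "y", 2), ("delete", "x"), ("set", "y", 3)]

# Three replicas replay the same log independently and reach the same state.
replica_states = [replay(agreed_log) for _ in range(3)]
all_equal = all(s == replica_states[0] for s in replica_states)
```

Because `apply` is deterministic and every replica sees the operations in the same order, the replicas cannot diverge; the hard part in practice is agreeing on that order.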

Key research themes

1. How can state machine replication protocols balance concurrency and consistency for multi-core system scalability?

This theme investigates the tension between achieving deterministic execution necessary for state machine replication (SMR) and exploiting concurrency provided by modern multi-core architectures. It matters as traditional SMR approaches relying on total order execution hinder scalability on multi-core servers. Solutions here explore partial ordering of executions and replay techniques to preserve consistency while enabling concurrent execution, thus optimizing throughput and latency.

Key finding: Rex introduces an execute-agree-follow model that departs from total request ordering by recording non-deterministic decisions during primary execution as partial-order traces agreed upon via consensus. This enables...
Key finding: This work focuses on incremental optimization of PBFT-family Byzantine fault-tolerant SMR protocols by adapting batching parameters dynamically using feedback control. Variable batch sizes and timeouts are tuned based on...
Key finding: The paper analyzes improvements in replica control algorithms that reduce conflict unavailability and improve autonomy and scalability, which complements concurrency concerns in SMR. It categorizes techniques refining...

2. What verification frameworks and formal methods ensure safety and correctness of Byzantine fault-tolerant state machine replication protocols?

Ensuring correctness and security of Byzantine fault-tolerant SMR is critical due to complex failure modes including malicious replicas. This research theme focuses on providing machine-checked formal verification of SMR protocols, capturing Byzantine behavior, and verifying key protocol properties like safety, liveness, and agreement. It addresses the challenge of bridging formal correctness proofs and practical implementations that are robust against arbitrary faults.

Key finding: Velisarios is a Coq-based logic-of-events framework enabling mechanized verification of Byzantine fault-tolerant SMR protocols. It provides reusable epistemic knowledge models and proof tactics, exemplified by the first...
Key finding: ThreatAdaptive presents a novel reconfiguration protocol that automatically adapts the number of replicas and fault thresholds of a BFT-SMR system based on threat-level detectors. It formally characterizes safe...
Key finding: BFT-TO is an asynchronous Byzantine SMR algorithm achieving the lower bound of 2f+1 replicas by using a trusted component, reducing complexity compared to the traditional 3f+1 bound. It offers resilience to arbitrary faults...
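
To make the 2f+1 versus 3f+1 bounds mentioned above concrete, here is a small sketch that computes how many replicas each bound requires to tolerate f Byzantine faults. The function name `replicas_required` and its `trusted_component` flag are hypothetical; the code only evaluates the two formulas and says nothing about how any particular protocol works.

```python
def replicas_required(f: int, trusted_component: bool = False) -> int:
    """Return the replica count needed to tolerate f Byzantine faults.

    Uses the traditional 3f+1 bound, or the 2f+1 bound that applies
    when a trusted component is assumed (hypothetical flag).
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if trusted_component else 3 * f + 1


# Compare the two bounds for a few fault thresholds.
comparison = {
    f: (replicas_required(f), replicas_required(f, trusted_component=True))
    for f in (1, 2, 3)
}
```

For f = 1 the traditional bound needs four replicas and the trusted-component bound needs three, so the saving is exactly f replicas at every fault threshold.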

3. How do replication and consistency control protocols optimize availability, latency, and fault tolerance in distributed state machines?

This theme encompasses the design and evaluation of data replication and replica control protocols that maintain consistency across replicas while balancing performance factors such as latency, network communication overhead, and system availability. It is fundamental for distributed state machine replication systems where state synchronization impacts throughput and fault tolerance. The research examines various consistency, quorum, and replication control strategies, including voting protocols and hybrid schemes optimizing for different operational constraints.

Key finding: The paper evaluates available copy protocols and variants (naive and optimistic) that ensure consistency of replicated data without requiring instantaneous failure detection. Using Markov models, it shows these protocols...
Key finding: Simulation results demonstrate that voting protocols augmented with volatile regenerable witnesses reduce the number of required replicas and improve availability. Volatile witnesses, though stored in memory, can be quickly...
Key finding: Proposes a hybrid data replication strategy combining voting quorums with adaptive replication factors tailored to workload and failure scenarios. The approach balances tradeoffs between read-write availability and operation...

All papers in State Machine Replication

This paper studies the lattice agreement problem in asynchronous systems and explores its application to building linearizable replicated state machines (RSM). First, we propose an algorithm to solve the lattice agreement problem in... more
Castro and Liskov proposed in 1999 a successful solution for Byzantine fault-tolerant replication, named PBFT, which overcame performance drawbacks of earlier Byzantine fault-tolerant replication protocols. Other proposals extended PBFT... more
We show that, with k-set consensus, any number of processes can emulate k state machines of which at least one progresses. This generalizes the celebrated universality of consensus, which makes it possible to build a state machine that always... more
Our increasing dependence on complex and critical information infrastructures and the emerging threat of sophisticated attacks call for extended efforts to ensure the correctness and security of these systems. Byzantine fault-tolerant... more
State-machine replication (SMR) is a fundamental technique to implement fault-tolerant services. Recently, various works have aimed at enhancing the scalability of SMR by exploiting partial replication techniques. By sharding the state... more
The high replication cost of Byzantine fault-tolerance (BFT) methods has been a major barrier to their widespread adoption in commercial distributed applications. We present ZZ, a new approach that reduces the replication cost of BFT... more
We present a new technique for designing distributed protocols for building reliable stateful services called Visigoth Fault Tolerance (VFT). VFT introduces the Visigoth model, which makes it possible to calibrate the timing assumptions... more
Many services used in large scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high... more
This paper presents the first hierarchical Byzantine tolerant replication architecture suitable to systems that span multiple wide area sites. The architecture confines the effects of any malicious replica to its local site, reduces... more
Available-Partition-tolerant (AP) geo-replicated systems trade consistency for availability. They allow replicas to serve clients' requests without prior synchronization. Potential conflicts due to concurrent operations can then be... more
In a large distributed system such as the Grid, ensuring data integrity is of particular importance. Since honest users and potentially malicious entities coexist in the same network, the risks of unauthorized alterations of data and... more
One of the unusual challenges faced by peer-to-peer algorithms as opposed to classical distributed algorithms is that these have to compute with data non-determinism where there is no guarantee that data from a particular node will be... more
Paxos is a prominent theory of state-machine replication. Recent data-intensive systems that implement state-machine replication generally require high throughput. Earlier versions of Paxos, among them classical Paxos, fast Paxos,... more
This paper describes an asynchronous state-machine replication system that tolerates Byzantine faults, which can be caused by malicious attacks or software errors. Our system is the first to recover Byzantine-faulty replicas proactively... more
Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service... more
This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors... more
The paper assumes the reader is familiar with I/O automata, invariant assertions, and simulation relations. Lynch's book [8] provides a good description of the formalism and the two proof techniques.
In this brief announcement, we propose a protocol-agnostic approach to improve the design of primary-backup consensus protocols. At the core of our approach is a novel wait-free design of running several instances of the underlying... more
Applications with internal substructure are common in the cloud, where many systems are organized as independently logged and replicated subsystems that interact via flows of objects or some form of RPC. Restarting such an application is... more
Data replication is playing an increasingly important role in the design of parallel information systems. In particular, the widespread use of cluster architectures in high-performance computing has created many opportunities for applying... more
We propose a specification for weak consistency in the context of a replicated service that tolerates Byzantine faults. We define different levels of consistency for the replies that can be obtained from such a service; we use a real world... more
Causal consistency is a widely used weak consistency model and there are plenty of research prototypes and industrial deployments of causally consistent distributed systems. However, none of them consider Byzantine faults, except Byz-RCM... more
Cloud computing has nowadays become the most popular and reliable computing technique for organizations and individuals. In cloud environments, data availability and backup replication are critical and complex issues in the an... more
State-machine replication is a popular approach to building fault-tolerant systems, which relies on the sequential execution of commands to guarantee strong consistency. Sequential execution, however, threatens performance. Recently,... more
MapReduce is a programming model and a runtime environment designed by Google for processing large data sets in its warehouse-scale machines (WSM) with hundreds to thousands of servers [2, 4]. MapReduce is becoming increasingly popular... more
The performance and scalability of Byzantine fault-tolerant (BFT) protocols for state machine replication (SMR) have recently come under scrutiny due to their application in the consensus mechanism of blockchain implementations. This led... more
The overall correctness of large-scale systems composed of many groups of replicas executing BFT protocols scales poorly with the number of groups. This is because the probability of at least one group being compromised (more than 1/3... more
Firewalls play a crucial role in assuring the security of today's critical infrastructures, forming a first line of defense by being placed strategically at the front-end of the networks. Sometimes, however, they have exploitable... more
Nowadays, one of the major concerns about the services provided over the Internet is related to their availability. Replication is a well known way to increase the availability of a service. However, replication has some associated costs,... more
Nowadays, organizations are resorting to Security Information and Event Management (SIEM) systems to monitor and manage their network infrastructures. SIEMs employ a data collection capability based on many sensors placed in critical... more
The recent surge of blockchain systems has renewed the interest in traditional Byzantine fault-tolerant consensus protocols. Many such consensus protocols have a primary-backup design in which an assigned replica, the primary, is... more
We study how to efficiently diffuse updates to a large distributed system of data replicas, some of which may exhibit arbitrary (Byzantine) failures. We assume that strictly fewer than t replicas fail, and that each update is initially... more
In large-scale environments such as Computing Grids, data replication is used to make better use of network bandwidth. Nevertheless, high latency exposes the replica management protocols to potential performance... more
This paper presents OHT, a hierarchical distributed hash table, which improves the performance of practical ZHT. When n-to-n connection is needed, every single node has to keep a large number of socket connections. This might be quite... more
The replicated state machine is a fundamental concept for obtaining fault-tolerant distributed computation. Legacy distributed computational architectures (such as Hadoop or Zookeeper) are designed to tolerate crashes of individual... more
The article addresses the criteria for siting green areas within the urban structure, which concern environmental attributes and the population's leisure needs. The discussion covers the divergences and similarities between the objectives of... more
Recent work on intrusion-tolerance has shown that resilience to sophisticated network attacks requires system replicas to be deployed across at least three geographically distributed sites. While commodity data centers offer an attractive... more