State Machine Replication

About this topic

State Machine Replication is a method in distributed computing that ensures consistency and fault tolerance by replicating a state machine across multiple nodes. It guarantees that all replicas execute the same sequence of operations, thereby maintaining a consistent state despite failures or network partitions.
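
The definition above can be illustrated with a minimal sketch. It assumes consensus has already produced an agreed log, and it uses a hypothetical key-value state machine (`KVReplica`, `replay`, and the command format are illustrative names, not taken from any paper listed here). Because every replica applies the same deterministic commands in the same order, all replicas end in the same state.

```python
from dataclasses import dataclass, field


@dataclass
class KVReplica:
    """A deterministic key-value state machine (hypothetical example)."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def apply(self, command):
        # Each command is a tuple (operation, key, value).
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {op}")
        self.applied += 1


def replay(log, replicas):
    """Apply every not-yet-applied log entry, in log order, on each replica."""
    for replica in replicas:
        for command in log[replica.applied:]:
            replica.apply(command)


# The log stands in for the output of a consensus protocol (assumed, not shown).
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 7)]
replicas = [KVReplica() for _ in range(3)]
replay(log, replicas)
```

A replica that falls behind can catch up by calling `replay` again with the longer log, since `applied` records how far it has progressed.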

Key research themes

1. How can state machine replication protocols balance concurrency and consistency for multi-core system scalability?

This theme investigates the tension between achieving deterministic execution necessary for state machine replication (SMR) and exploiting concurrency provided by modern multi-core architectures. It matters as traditional SMR approaches relying on total order execution hinder scalability on multi-core servers. Solutions here explore partial ordering of executions and replay techniques to preserve consistency while enabling concurrent execution, thus optimizing throughput and latency.
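
One way to picture the tension described above is a conflict-aware scheduler: commands that touch disjoint keys may run concurrently, while conflicting commands keep their agreed order. The sketch below is an illustration under simplifying assumptions (every access is treated as a write, and `schedule` and the command format are hypothetical). It is not the trace-based mechanism of Rex or the scheduler of any specific paper listed here.

```python
def schedule(commands):
    """Group totally ordered commands into batches of mutually independent commands.

    commands: list of (name, keys) pairs in the agreed order.
    Returns a list of batches. Commands inside one batch touch disjoint keys
    and could run in parallel; batches must run one after another.
    """
    last_batch = {}  # key -> index of the latest batch that touches it
    batches = []
    for name, keys in commands:
        # Place the command right after the latest batch it conflicts with.
        level = 1 + max((last_batch[k] for k in keys if k in last_batch), default=-1)
        if level == len(batches):
            batches.append([])
        batches[level].append(name)
        for k in keys:
            last_batch[k] = level
    return batches


commands = [("a", {"x"}), ("b", {"y"}), ("c", {"x", "y"}), ("d", {"z"})]
batches = schedule(commands)
```

Because the function depends only on the agreed command order, every replica that runs it derives the same batches, which keeps execution deterministic across replicas.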

Rex: replication at the speed of multi-core

by Chuntao HONG

2023

Key finding: Rex introduces an execute-agree-follow model that departs from total request ordering by recording non-deterministic decisions during primary execution as partial-order traces agreed upon via consensus. This enables... Read more

Adaptive request batching for byzantine replication

by Alirio Sá

2024, ACM SIGOPS Operating Systems Review

Key finding: This work focuses on incremental optimization of PBFT-family byzantine fault-tolerant SMR protocols by adapting batching parameters dynamically using feedback control. Variable batch sizes and timeouts are tuned based on... Read more
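
The idea of tuning batch parameters with feedback can be sketched with a simple proportional controller. This is an assumption-laden illustration, not the controller from the paper: the function name, the gain, the bounds, and the choice of latency as the controlled variable are all hypothetical.

```python
def next_batch_size(current, observed_latency_ms, target_latency_ms,
                    gain=0.5, min_size=1, max_size=1024):
    """Return the batch size for the next round (hypothetical proportional rule).

    If observed latency is below the target, the batch grows to gain throughput.
    If it is above the target, the batch shrinks to cut queueing delay.
    """
    error = (target_latency_ms - observed_latency_ms) / target_latency_ms
    proposed = round(current * (1 + gain * error))
    # Clamp to the configured bounds so the controller cannot run away.
    return max(min_size, min(max_size, proposed))
```

For example, with a 100 ms target, a batch of 100 requests observed at 50 ms would grow to 125, and the same batch observed at 200 ms would shrink to 50.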

An analysis of replica control

by Shu-Wie Chen

2023, [1992 Proceedings] Second Workshop on the Management of Replicated Data

Key finding: The paper analyzes improvements in replica control algorithms that reduce conflict unavailability and improve autonomy and scalability, which complements concurrency concerns in SMR. It categorizes techniques refining... Read more

2. What verification frameworks and formal methods ensure safety and correctness of Byzantine fault-tolerant state machine replication protocols?

Ensuring correctness and security of Byzantine fault-tolerant SMR is critical due to complex failure modes including malicious replicas. This research theme focuses on providing machine-checked formal verification of SMR protocols, capturing Byzantine behavior, and verifying key protocol properties like safety, liveness, and agreement. It addresses the challenge of bridging formal correctness proofs and practical implementations that are robust against arbitrary faults.

Velisarios: Byzantine Fault-Tolerant Protocols Powered by Coq

by Marcus Völp

2024, Programming Languages and Systems

Key finding: Velisarios is a Coq-based logic-of-events framework enabling mechanized verification of Byzantine fault-tolerant SMR protocols. It provides reusable epistemic knowledge models and proof tactics, exemplified by the first... Read more

Threat Adaptive Byzantine Fault Tolerant State-Machine Replication

by Marcus Völp

2024, 2021 40th International Symposium on Reliable Distributed Systems (SRDS)

Key finding: ThreatAdaptive presents a novel reconfiguration protocol that automatically adapts the number of replicas and fault thresholds of a BFT-SMR system based on threat-level detectors. It formally characterizes safe... Read more

This work was first developed towards the end of the Malicious and Accidental Fault Tolerance (MAFTIA)

by Miguel Correia

2022

Key finding: BFT-TO is an asynchronous Byzantine SMR algorithm achieving the lower bound of 2f+1 replicas by using a trusted component, reducing complexity compared to the traditional 3f+1 bound. It offers resilience to arbitrary faults... Read more
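
The two replica bounds mentioned above translate into concrete deployment sizes. The helper below is a small sketch: `min_replicas` is a hypothetical name, and the `trusted_component` flag simply selects between the 3f+1 and 2f+1 bounds stated in the summary.

```python
def min_replicas(f, trusted_component=False):
    """Minimum number of replicas needed to tolerate f Byzantine faults.

    Uses 3f + 1 for the traditional bound and 2f + 1 when a trusted
    component is available, as stated in the summary above.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if trusted_component else 3 * f + 1


# Replica counts for f = 1..3 under both bounds.
table = [(f, min_replicas(f), min_replicas(f, trusted_component=True))
         for f in range(1, 4)]
```

For f = 1 this gives 4 replicas under the traditional bound and 3 with a trusted component, so each additional tolerated fault costs two replicas instead of three.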

3. How do replication and consistency control protocols optimize availability, latency, and fault tolerance in distributed state machines?

This theme encompasses the design and evaluation of data replication and replica control protocols that maintain consistency across replicas while balancing performance factors such as latency, network communication overhead, and system availability. It is fundamental for distributed state machine replication systems where state synchronization impacts throughput and fault tolerance. The research examines various consistency, quorum, and replication control strategies, including voting protocols and hybrid schemes optimizing for different operational constraints.
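
For the voting protocols mentioned above, the classic quorum conditions can be checked in a few lines. The sketch assumes one vote per replica and uses a hypothetical function name. It encodes the standard requirements that every read quorum intersects every write quorum (r + w > n) and that any two write quorums intersect (2w > n).

```python
def quorums_are_consistent(n, read_quorum, write_quorum):
    """Check the intersection conditions for simple majority-style voting.

    n: total number of replicas, one vote each (an assumption of this sketch).
    """
    if not (1 <= read_quorum <= n and 1 <= write_quorum <= n):
        return False
    reads_see_latest_write = read_quorum + write_quorum > n
    writes_are_serialized = 2 * write_quorum > n
    return reads_see_latest_write and writes_are_serialized
```

With five replicas, quorums of (3, 3) and (1, 5) satisfy both conditions, while (2, 3) does not because a read quorum could miss the latest write. Smaller read quorums lower read latency at the cost of write availability, which is the kind of tradeoff this theme examines.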

The Performance of Available Copy Protocols for the Management of Replicated Data

by Darrell D E Long and

2016

Key finding: The paper evaluates available copy protocols and variants (naive and optimistic) that ensure consistency of replicated data without requiring instantaneous failure detection. Using Markov models, it shows these protocols... Read more

A Simulation Study of Replication Control Protocols Using Volatile Witnesses

by Darrell D E Long

2016

Key finding: Simulation results demonstrate that voting protocols augmented with volatile regenerable witnesses reduce the number of required replicas and improve availability. Volatile witnesses, though stored in memory, can be quickly... Read more

A Flexible Hybrid Approach to Data Replication in Distributed Systems

by Syed Mohtashim Abbas Bokhari

2023, Bokhari, Syed Mohtashim Abbas, and Oliver Theel. "A flexible hybrid approach to data replication in distributed systems." Intelligent Computing: Proceedings of the 2020 Computing Conference, Volume 1. Springer International Publishing, 2020

Key finding: Proposes a hybrid data replication strategy combining voting quorums with adaptive replication factors tailored to workload and failure scenarios. The approach balances tradeoffs between read-write availability and operation... Read more

All papers in State Machine Replication

Linearizable Replicated State Machines with Lattice Agreement

by Vijay Garg

2024, arXiv (Cornell University)

This paper studies the lattice agreement problem in asynchronous systems and explores its application to building linearizable replicated state machines (RSM). First, we propose an algorithm to solve the lattice agreement problem in... more

Adaptive request batching for byzantine replication

by Alirio Sá

2024, ACM SIGOPS Operating Systems Review

Castro and Liskov proposed in 1999 a successful solution for byzantine fault-tolerant replication, named PBFT, which overcame performance drawbacks of earlier byzantine fault-tolerant replication protocols. Other proposals extended PBFT... more

Generalizing State Machine Replication (Preliminary Version)

by Eli Gafni

2024

We show that, with k-set consensus, any number of processes can emulate k state machines of which at least one progresses. This generalizes the celebrated universality of consensus which enables to build a state machine that always... more

Velisarios: Byzantine Fault-Tolerant Protocols Powered by Coq

by Marcus Völp

2024, Programming Languages and Systems

Our increasing dependence on complex and critical information infrastructures and the emerging threat of sophisticated attacks call for extended efforts to ensure the correctness and security of these systems. Byzantine fault-tolerant... more

Enhancing throughput of partially replicated state machines via multi-partition operation scheduling

by Peter Van Roy

2024, 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA)

State-machine replication (SMR) is a fundamental technique to implement fault-tolerant services. Recently, various works have aimed at enhancing the scalability of SMR by exploiting partial replication techniques. By sharding the state... more

Enhancing throughput of partially replicated state machines via multi-partition operation scheduling

by Paolo Romano

2024, 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA)

ZZ and the art of practical BFT execution

by Rahul Singh

2023, Proceedings of the sixth conference on Computer systems

The high replication cost of Byzantine fault-tolerance (BFT) methods has been a major barrier to their widespread adoption in commercial distributed applications. We present ZZ, a new approach that reduces the replication cost of BFT... more

Visigoth fault tolerance

by Joao Leitao

2023, Proceedings of the Tenth European Conference on Computer Systems

We present a new technique for designing distributed protocols for building reliable stateful services called Visigoth Fault Tolerance (VFT). VFT introduces the Visigoth model, which makes it possible to calibrate the timing assumptions... more

Enhancing throughput of partially replicated state machines via multi-partition operation scheduling

by Peter Van Roy

2023

Efficient and Deterministic Scheduling for Parallel State Machine Replication

by Odorico Machado Mendizabal

2023

Many services used in large scale web applications should be able to tolerate faults without impacting their performance. State machine replication is a well-known approach to implementing fault-tolerant services, providing high... more

Scaling Byzantine Fault-Tolerant Replication to Wide Area Networks

by yair amir

2023, International Conference on Dependable Systems and Networks (DSN'06)

This paper presents the first hierarchical Byzantine tolerant replication architecture suitable to systems that span multiple wide area sites. The architecture confines the effects of any malicious replica to its local site, reduces... more

ASPAS: As Secure as Possible Available Systems

by Houssam Yactine

2023, Springer eBooks

Available-Partition-tolerant (AP) geo-replicated systems trade consistency for availability. They allow replicas to serve clients' requests without prior synchronization. Potential conflicts due to concurrent operations can then be... more

Data and code integrity in Grid environments

by Olivier Markowitch

2023, Grid Computing

In a large distributed system such as the Grid, ensuring data integrity is of particular importance. Since honest users and possibly malicious entities coexist in the same network, the risks of unauthorized alterations of data and... more

Computing with data non-determinism: Wait time management for peer-to-peer systems

by Asrar Haque

2023, Computer Communications

One of the unusual challenges faced by peer-to-peer algorithms as opposed to classical distributed algorithms is that these have to compute with data non-determinism where there is no guarantee that data from a particular node will be... more

HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers

by Ajay Agarwal

2023, The Scientific World Journal

Paxos is a prominent theory of state-machine replication. Recent data intensive systems that implement state-machine replication generally require high throughput. Earlier versions of Paxos, such as classical Paxos, fast Paxos,... more

Proactive recovery in a Byzantine-fault-tolerant system

by Miguel Castro

2023, Operating Systems Design and Implementation

This paper describes an asynchronous state-machine replication system that tolerates Byzantine faults, which can be caused by malicious attacks or software errors. Our system is the first to recover Byzantine-faulty replicas proactively... more

Practical byzantine fault tolerance and proactive recovery

by Miguel Castro

2023, ACM Transactions on Computer Systems

Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions. Software bugs, operator mistakes, and malicious attacks are a major cause of service... more

Fig. 2. View-change protocol: the primary for view v (replica 0) fails causing a view change to view v+1.

Fig. 5. Relationship between the window of vulnerability T, and other time intervals.

subpartitions. Figure 6 depicts a partition tree with three levels. We call the leaf partitions pages and the interior ones metadata. For example, the experiments described in Section 8 were run with a hierarchy with four levels, s equal to 256, and 4-KB pages.

Fig. 8. BFS: replicated file system architecture.

Fig. 9. Latency with varying result sizes: absolute times and slowdown relative to NO-REP.

Fig. 10. Latency with varying argument sizes: absolute times and slowdown relative to NO-REP.

Fig. 11. Throughput for operations 0/0, 0/4, and 4/0.

Fig. 12. Latency with varying argument and result sizes with f = 2.

Fig. 13. Checkpoint cost with a varying number of modified pages per checkpoint epoch.

Fig. 14. State transfer latency and throughput.

Fig. 15. Andrew100 and Andrew500: elapsed time in seconds.

Fig. 16. PostMark: throughput in transactions per second.

Fig. 17. Andrew: elapsed time in seconds with and without proactive recoveries.

Table II. Andrew: Maximum Recovery Time (seconds)

Practical Byzantine Fault Tolerance

by Miguel Castro

2023, Operating Systems Design and Implementation

This paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be increasingly important in the future because malicious attacks and software errors... more

A Correctness proof for a practical byzantine-fault-tolerant replication algorithm

by Miguel Castro

2023

The paper assumes the reader is familiar with I/O automata, invariant assertions, and simulation relations. Lynch's book [8] provides a good description of the formalism and the two proof techniques.

Brief Announcement: Revisiting Consensus Protocols through Wait-Free Parallelization

by SUYASH GUPTA

2023

In this brief announcement, we propose a protocol-agnostic approach to improve the design of primary-backup consensus protocols. At the core of our approach is a novel wait-free design of running several instances of the underlying... more

Reliable, Efficient Recovery for Complex Services with Replicated Subsystems

by Sagar Jha

2023, 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Applications with internal substructure are common in the cloud, where many systems are organized as independently logged and replicated subsystems that interact via flows of objects or some form of RPC. Restarting such an application is... more

Fig. 2: Total time to start or restart a service. Error bars represent 1 standard deviation.

Fig. 3: Total metadata sent/received during the restart process.

Fig. 4: Breakdown of time spent in each phase of starting or restarting a service, when 1 node per shard is out of date upon restart. Upper bars show fresh start, lower bars show restart.

Fig. 5: Data downloaded by each out-of-date node, in a system with 3 shards of 3 members each.

Fig. 6: Time to restart a service with 3 shards of 3 members each, with 1 out-of-date node per shard. Error bars represent 1 standard deviation.

How to select a replication protocol according to scalability, availability and communication overhead

by Ricardo Jimenez-peris

2023, Proceedings 20th IEEE Symposium on Reliable Distributed Systems

Data replication is playing an increasingly important role in the design of parallel information systems. In particular, the widespread use of cluster architectures in high-performance computing has created many opportunities for applying... more

ZZ and the art of practical BFT execution

by Rahul Singh JADON

2023, Proceedings of the sixth conference on Computer systems

Defining weakly consistent Byzantine fault-tolerant services

by PEDRO GABRIEL FONSECA

2023, ACM International Conference Proceeding Series

We propose a specification for weak consistency in the context of a replicated service that tolerates Byzantine faults. We define different levels of consistency for the replies that can be obtained from such a service; we use a real world... more

Byz-GentleRain: An Efficient Byzantine-Tolerant Causal Consistency Protocol

by Hengfeng Wei

2023, Lecture Notes in Computer Science

Causal consistency is a widely used weak consistency model and there are plenty of research prototypes and industrial deployments of causally consistent distributed systems. However, none of them consider Byzantine faults, except Byz-RCM... more

A Replica Distribution Based Fault Tolerance Management For Cloud Computing

by Ajitabh Mahalkari

2023

Cloud computing has nowadays become the most popular and reliable computing technique for organizations and individuals. In cloud environments, data availability and backup replication are critical and complex issues in the an... more

FIGURE 3: TYPES OF FAULT TOLERANCE APPROACHES AVAILABLE.

Here we suggest an adaptive mechanism for replica distribution for effective fault tolerance in cloud computing, which can be used to achieve a higher level of data availability. The approach overcomes the issues connected with settled The approach gives a dynamic nature to distribution and retrieval to ensure the client's data backup in any circumstance. Creating a solution requires a complete understanding of the issues, and for cloud computing the issue is continuous data availability for the end client. In some circumstances the cloud data fails to load on end clients' machines, and in this situation the proposed replica distribution mechanism identifies the machine to take the backup or do the retrieval. four measurements which are performance, throughput, response time and overhead. clients working, data sorts, data sizes, gadgets supportability, power choices, offering and so on. This data may give a deeper research the framework. In the wake of bringing the data, some change if gave which changes over the typical data into measurements structures from which some choice could be taken. The data is passed predominantly into

The Paxos Register

by Allen Yumba

2023, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007)

Recovery in Parallel State-Machine Replication

by Fernando Dotti

2022

State-machine replication is a popular approach to building fault-tolerant systems, which relies on the sequential execution of commands to guarantee strong consistency. Sequential execution, however, threatens performance. Recently,... more

Making Hadoop MapReduce Byzantine Fault-Tolerant

by Fabricio Silva

2022, Universidade de Lisboa, Portugal

MapReduce is a programming model and a runtime environment designed by Google for processing large data sets in its warehouse-scale machines (WSM) with hundreds to thousands of servers [2, 4]. MapReduce is becoming increasingly popular... more

A Comparison of Message Exchange Patterns in BFT Protocols

by Fábio Silva

2022, Distributed Applications and Interoperable Systems

The performance and scalability of byzantine fault-tolerant (BFT) protocols for state machine replication (SMR) have recently come under scrutiny due to their application in the consensus mechanism of blockchain implementations. This led... more

Efficient and Deterministic Scheduling for Parallel State Machine Replication

by Odorico Machado Mendizabal

2022, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Classical SMR Atomic Broadcast Application ( b ) Parallel SMR RequestResponse Client Service Execution Service Execution Parallelizer Atomic Broadcast Application Request Client Response Workers

by Odorico Machado Mendizabal

2022

Large-scale Byzantine fault tolerance: Safe but not always live

by Petr Kuznetsov

2022

The overall correctness of large-scale systems composed of many groups of replicas executing BFT protocols scales poorly with the number of groups. This is because the probability of at least one group being compromised (more than 1/3... more

SieveQ: A Layered BFT Protection System for Critical Services

by Nuno Neves

2022, IEEE Transactions on Dependable and Secure Computing

Firewalls play a crucial role in assuring the security of today's critical infrastructures, forming a first line of defense by being placed strategically at the front-end of the networks. Sometimes, however, they have exploitable... more

Resilient state machine replication

by Nuno Neves

2022, Proceedings - 11th Pacific Rim International Symposium on Dependable Computing, PRDC 2005

Nowadays, one of the major concerns about the services provided over the Internet is related to their availability. Replication is a well known way to increase the availability of a service. However, replication has some associated costs,... more

An intrusion-tolerant firewall design for protecting SIEM systems

by Nuno Neves

2022, Proceedings of the International Conference on Dependable Systems and Networks

Nowadays, organizations are resorting to Security Information and Event Management (SIEM) systems to monitor and manage their network infrastructures. SIEMs employ a data collection capability based on many sensors placed in critical... more

Revisiting consensus protocols through wait-free parallelization

by suyash gupta

2022

The recent surge of blockchain systems has renewed the interest in traditional Byzantine fault-tolerant consensus protocols. Many such consensus protocols have a primary-backup design in which an assigned replica, the primary, is... more

Recovery in Parallel State-Machine Replication

by Odorico Machado Mendizabal

2022

ZZ and the art of practical BFT execution

by Rahul Singh

2022, Proceedings of the sixth conference on Computer systems - EuroSys '11

On diffusing updates in a Byzantine environment

by Yishay Mansour

2022, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems

We study how to efficiently diffuse updates to a large distributed system of data replicas, some of which may exhibit arbitrary (Byzantine) failures. We assume that strictly fewer than t replicas fail, and that each update is initially... more

Diffusion without false rumors: on propagating updates in a Byzantine environment

by Yishay Mansour

2022, Theoretical Computer Science

We study how to efficiently diffuse updates to a large distributed system of data replicas, some of which may exhibit arbitrary (Byzantine) failures. We assume that strictly fewer than t replicas fail, and that each update is initially... more

How to Improve the Scalability of Read/Write Operations with Dynamic Reconfiguration of a Tree-Structured Coterie

by Ivan Frain

2022, 2006 International Conference on Parallel Processing Workshops (ICPPW'06)

In large-scale environments such as Computing Grids, data replication is used to make better use of network bandwidth. Nevertheless, high latency exposes the replica management protocols to potential performance... more

OHT : Hierarchical Distributed Hash Tables

by Kun Feng

2022

This paper presents OHT, a hierarchical distributed hash table, which improves the performance of practical ZHT. When n-to-n connection is needed, every single node has to keep a large number of socket connections. This might be quite... more

Self-stabilizing Byzantine-Tolerant Distributed Replicated State Machine

by Shlomi Dolev

2022, Lecture Notes in Computer Science

Replicated state machine is a fundamental concept used for obtaining fault tolerant distributed computation. Legacy distributed computational architectures (such as Hadoop or Zookeeper) are designed to tolerate crashes of individual... more

Prototype and replicated Hadoop. We have implemented the prototype as a Java module, replicating the distributed computation system Hadoop [44]. The client sends job requests to Hadoop; its master node performs dispatching functions. It accumulates job requests in a job queue. The Job Tracker agent at Master finds an available computational facility (a slave node) and assigns a job to it. When this job finishes, the result is tunnelled back to Client through Master. Job Tracker is responsible for maintaining the state of an individual job, which can be one of the following values: Accepted, Running, Finished or Failed. The dispatcher state is the set of the states of its jobs. Figure 4 illustrates the Hadoop scenario. To stage replicas, we use Docker [42] virtual machines instead of physical hosts to reduce the expenses. A Docker machine has its own IP, disk and memory space but the host kernel is shared among all virtual machines. In such lightweight and efficient virtual machines we can reproduce most of the scenarios occurring in real life. The only implementation issue that is relevant for this situation is that we launch the Stabilization Manager as a system service, not as a Linux Kernel Module (LKM). This is because injecting LKMs into a Docker machine applies to the host kernel; therefore only the first attempt will succeed. In the replicated Hadoop a job request is sent asynchronously to every replica. Then a cluster finds an available slave and assigns it to this request. At this

Fig. 5. Submitting a job to the replicated Hadoop

Qualidade dos espaços verdes urbanos: o papel dos parques de lazer e de preservação

by Maria do Carmo De lima Bezerra

2022, Arq.urb. Revista eletrônica de arquitetura

The article addresses the criteria for locating green areas within the urban structure, which relate to environmental attributes and to the leisure needs of the population. The discussion deals with the divergences and similarities between the objectives of... more

Toward Intrusion Tolerance as a Service: Confidentiality in Partially Cloud-Based BFT Systems

by Maher Khan

2022, 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Recent work on intrusion-tolerance has shown that resilience to sophisticated network attacks requires system replicas to be deployed across at least three geographically distributed sites. While commodity data centers offer an attractive... more

ASPAS: As Secure as Possible Available Systems

by Houssam Yactine

2022, Distributed Applications and Interoperable Systems
