How to improve performance of BFT systems?
We design and evaluate Byzantine Fault Tolerant (BFT) state machine replication systems, both on theoretical and implementation ground. We ask:
- What affects performance of such systems in different environments,
- What are the inherent limitations and trade-offs,
- How to discover optimal working conditions for each algorithm, and
- Which algorithms (or their combinations) perform best given working conditions and workload.
BFT state machine replication − A Short Introduction
State machine replication (SMR) is a software technique for tolerating failures using commodity hardware. The critical service to be made fault-tolerant is modeled by a state machine. Several, possibly different, copies of the state machine are then placed on different nodes. Clients of the service access the replicas through a SMR protocol which ensures that, despite contention and failures, replicas perform client requests in the same order.
Two objectives underly the design and implementation of a SMR protocol: robustness and performance. Robustness conveys the ability to ensure availability (liveness) and one-copy semantics (safety) despite failures and asynchrony. On the other hand, performance measures the time it takes to respond to a request (latency) and the number of requests that can be processed per time unit (throughput). The most robust protocols are those that tolerate
- arbitrarily large periods of asynchrony, during which communication delays and process relative speeds are unbounded, and
- arbitrary (Byzantine) failures of any client as well as up to one-third of the replicas (this is the theoretical lower bound).
These are called Byzantine-Fault-Tolerance SMR protocols, or simply BFT protocols, e.g., PBFT, QU, HQ and Zyzzyva. The ultimate goal of the designer of a BFT protocol is to exhibit comparable performance to a non-replicated server under “common” circumstances that are considered the most frequent in practice. The notion of “common” circumstance might depend on the application and underlying network, but it usually means network synchrony, rare failures, and sometimes also the absence of contention.
Not surprisingly, even under the same notion of “common” case, there is no “one size that fits all” BFT protocol. According to our own experience, the performance differences among the protocols can be heavily impacted by the actual network, the size of the messages, the very nature of the “common” case (e.g, contention or not); the actual number of clients, the total number of replicas as well as the cost of the cryptographic libraries being used. This echoes which concluded for instance that “PBFT offers more predictable performance and scales better with payload size compared to Zyzzyva; in contrast, Zyzzyva offers greater absolute throughput in wider-area, lossy networks''. So determining the best protocol seems clearly impossible. In fact, besides all BFT protocols mentioned above, there are good reasons to believe that we could design new protocols outperforming all others under specific circumstances. We do indeed present an example of a such protocols on these pages.
To deploy a BFT solution, a system designer will hence certainly be tempted to adapt a state-of-the-art BFT protocol to the specific application/network setting, and possibly keep adapting it whenever the setting changes. But this can rapidly turn into a nightmare. All protocol implementations we looked at involve around 20.000 lines of (non-trivial) C++ code, e.g., PBFT and Zyzzyva. Any change to an existing protocol, although algorithmically intuitive, is very painful. The changes of the protocol needed to optimize for the “common” case have sometimes strong impacts on the part of the protocol used in other cases (e.g., “view-change” in Zyzzyva). The only complete proof of a BFT protocol we knew of is that of PBFT and it involves 35 pages (even without using any formal language). This difficulty, together with the impossibility of exhaustively testing distributed protocols would rather plead for never changing a protocol that was widely tested, e.g., PBFT.
We propose a way to have the cake and eat a big chunk of it. We present ABsTRACT (Abortable Byzantine faulT-toleRant stAte maChine replicaTion): a new abstraction that significantly reduces the development cost of BFT protocols and makes it significantly easier to develop efficient ones (we simply write Abstract). Abstract looks like state machine replication and it can be used to make any shared service fault-tolerant, with one exception: it may sometimes abort a client request. The (non-triviality) condition under which Abstract cannot abort is a generic parameter. From this perspective, Abstract can be viewed as a virtual type; Each specification of the non-triviality parameter defines a concrete type. 1
At one extreme, one can for example specify a (useless) Abstract instance that could abort in every execution. At the other extreme, one can prevent Abstract from ever aborting: this is exactly BFT. Interesting instances are those in between, e.g.,
- an Abstract instance that cannot abort if there is no concurrency, asynchrony or failures, or
- one that can abort only if there is asynchrony or failures.
When a particular instance of Abstract aborts a client request, Abstract returns an unforgeable request history that can be used by the client to “recover” using another instance of Abstract. This paves the path to composability of Abstract. Any composition of Abstract instances is possible; we expect many of these to lead to interesting flexible BFT protocols. In fact, and to illustrate Abstract composability, we present Modular BFT: a BFT protocol built using two Abstract instances:
- the first, which we denote any Abstract, would typically be an Abstract with a weak non-triviality condition that can be implemented very efficiently in a speculative and optimistic manner, whereas
- the second is a stronger Abstract with a non-triviality property that guarantees to commit a certain number of requests k; this can easily be implemented on top of any BFT protocol (e.g., PBFT, Aardvark, …).
Such a modular approach allows for “black-box” code reuse and can significantly reduce the development cost of new BFT protocols.
We offer master, semester and internship student projects:
Guerraoui, Rachid ; Knezevic, Nikola ; Quéma, Vivien ; Vukolic, Marko (2009) The Next 700 BFT Protocols Technical report.
Vukolic, Marko ; Guerraoui, Rachid (2008) Abstractions for asynchronous distributed computing with malicious players PhD. Thesis
Guerraoui, Rachid ; Levy, Ron R. ; Pochon, Bastian ; Quéma, Vivien (2006) High Throughput Total Order Broadcast for Cluster Environments IEEE International Conference on Dependable Systems and Networks (DSN '06), 2006