Enable Fault Tolerance and Failure Detection

Learn how to make a key-value store fault tolerant and able to detect failure.

Handle temporary failures#

Typically, distributed systems use a quorum-based approach to handle failures. A quorum is the minimum number of votes required for a distributed transaction to proceed with an operation. If a server that is part of the consensus is down, we can’t perform the required operation, which hurts the availability and durability of our system.
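
As a rough illustration of the idea, the following Python sketch (with hypothetical replica counts `N`, `W`, and `R`) shows the condition a strict quorum imposes before a write can proceed:

```python
# A rough illustration with hypothetical replica counts: a strict quorum only
# proceeds when enough replicas acknowledge the operation.
N = 3  # total replicas that store a given key
W = 2  # write quorum: acknowledgments needed before a write succeeds
R = 2  # read quorum: replicas that must answer before a read returns

def write_succeeds(ack_count: int) -> bool:
    """A write commits only if at least W replicas acknowledge it."""
    return ack_count >= W

# With W + R > N, every read overlaps with at least one replica that saw the
# latest successful write.
assert W + R > N
print(write_succeeds(2))  # True: quorum reached
print(write_succeeds(1))  # False: a failed replica blocks the operation
```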

We’ll use a sloppy quorum instead of strict quorum membership. Usually, a leader manages the communication among the participants of the consensus. The participants send an acknowledgment after committing a successful write, and upon receiving these acknowledgments, the leader responds to the client. The drawback is that the participants are easily affected by network outages: if the leader is temporarily unreachable and the participants can’t reach it, they declare the leader dead, and a new leader has to be elected. Such frequent elections negatively impact performance because the system spends more time electing a leader than accomplishing any actual work.

In the sloppy quorum, the first n healthy nodes from the preference list handle all read and write operations. These n healthy nodes may not always be the first n nodes discovered when moving clockwise in the consistent hash ring.

Let’s consider the following configuration with n = 3. If node A is briefly unavailable or unreachable during a write operation, the request is sent to the next healthy node from the preference list, which is node D in this case. It ensures the desired availability and durability. After processing the request, node D includes a hint as to which node was the intended receiver (in this case, A). Once node A is up and running again, node D sends the request information to A so it can update its data. Upon completion of the transfer, D removes this item from its local storage without affecting the total number of replicas in the system.
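
The snippet below is a minimal sketch (hypothetical node names and health check) of how the first n healthy nodes might be chosen from the preference list; with node A down, node D joins the set of targets in its place:

```python
# Minimal sketch with hypothetical node names: a sloppy quorum picks the first
# n healthy nodes from the preference list instead of strictly the first n.
def first_n_healthy(preference_list, is_healthy, n):
    """Return the first n nodes from the preference list that are reachable."""
    return [node for node in preference_list if is_healthy(node)][:n]

preference_list = ["A", "B", "C", "D", "E"]
down = {"A"}  # node A is temporarily unreachable

targets = first_n_healthy(preference_list, lambda node: node not in down, n=3)
print(targets)  # ['B', 'C', 'D'] -- node D stands in for node A
```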

The following slides illustrate this process:

  1. Suppose we have seven nodes in our ring and a preference list of the nodes.
  2. A request comes in, and node A processes it since it’s the next node encountered when moving clockwise in the ring.
  3. Node A goes down due to some failure.
  4. Another request comes in. Node A should process it but can’t, so we check the next node in the preference list.
  5. The request is sent to node D because node A is down.

This approach is called a hinted handoff. Using it, we can ensure that reads and writes are fulfilled even when a node faces a temporary failure.
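
The following Python sketch (hypothetical in-memory structures) shows the core of hinted handoff: the stand-in node stores the write together with a hint naming the intended replica, and replays it once that replica is reachable again.

```python
# Minimal sketch with hypothetical structures: node D accepts a write meant
# for node A, tags it with a hint, and hands it off once A recovers.
hinted_store = []  # writes accepted on behalf of other nodes

def accept_hinted_write(key, value, intended_node):
    """Store the write locally, remembering which node it was meant for."""
    hinted_store.append({"key": key, "value": value, "hint": intended_node})

def handoff(recovered_node, send):
    """Replay hinted writes to the recovered node, then drop them locally."""
    remaining = []
    for item in hinted_store:
        if item["hint"] == recovered_node:
            send(recovered_node, item["key"], item["value"])
        else:
            remaining.append(item)
    hinted_store[:] = remaining

accept_hinted_write("user:42", "session-data", intended_node="A")
handoff("A", send=lambda node, k, v: print(f"replaying {k} to node {node}"))
```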

Note: A highly available storage system must handle data center failures caused by power outages, cooling failures, network failures, or natural disasters. For this, we should ensure replication across data centers, so if one data center goes down, the data can still be served from another.

Point to Ponder

Question

What are the limitations of using hinted handoff?


Hinted handoff works best when system membership churn is minimal and node failures are transient. However, in certain circumstances, hinted replicas may become unavailable before they can be returned to the originating replica node.

Handle permanent failures#

In the event of permanent node failures, we should keep our replicas synchronized to make our system more durable. We need to speed up the detection of inconsistencies between replicas and reduce the quantity of transferred data. We’ll use Merkle trees for that.

In a Merkle tree, the values of individual keys are hashed and used as the leaves of the tree. Parent nodes higher up the tree contain the hashes of their respective children. Each branch of the Merkle tree can be verified independently without the need to download the complete tree or the entire dataset, so Merkle trees reduce the amount of data that must be exchanged while checking for inconsistencies across copies. If, for example, the hash values of the roots of two trees are equal, their leaf nodes are also equal, and no synchronization is needed. Otherwise, the nodes exchange the hash values of children and continue until the process reaches the tree leaves, at which point the hosts can identify the keys that are out of sync. The Merkle tree is a mechanism to implement anti-entropy, which means keeping all the replicas consistent. It reduces the data transmitted for synchronization and the number of disk accesses during the anti-entropy process.

The following slides explain how Merkle trees work:

  1. Calculate the hashes for all keys. These hashes become the leaf nodes.
  2. The hashes of the H1 and H2 nodes are calculated and stored as their parent node.
  3. The hashes of the H3 and H4 nodes are calculated and stored as their parent node.
  4. The hashes of all other pairs of nodes are calculated and stored as their parent nodes.
  5. The hash of nodes H12 and H34 is calculated and stored as their parent node. The same is done for H56 and H78.
  6. We duplicate the odd node.
  7. The hash of both nodes (the actual node and its duplicate) is calculated and stored as their parent node.
  8. The hash of these nodes is calculated and stored as their parent node.
  9. The hash of the last two nodes is calculated and stored as the root node.
  10. Suppose the value of K2 is updated. Its hash is recalculated.
  11. The hashes of all parents along the path to the root, including the root node itself, are also recalculated and updated.
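
To make the construction concrete, here is a minimal Python sketch (hypothetical values; SHA-256 chosen arbitrarily as the hash function) that builds the tree bottom-up, duplicating the odd node as shown in the slides:

```python
import hashlib

def h(data: str) -> str:
    """SHA-256 hex digest used for both leaves and internal nodes."""
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(values):
    """Build the tree bottom-up from the hashed values and return the root."""
    level = [h(v) for v in values]  # leaves: hashes of the individual values
    while len(level) > 1:
        if len(level) % 2 == 1:     # duplicate the odd node, as in the slides
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Changing a single value changes the root, so two replicas can compare roots
# first and descend only into the subtrees that differ.
print(merkle_root(["v1", "v2", "v3", "v4"]))
print(merkle_root(["v1", "v2-updated", "v3", "v4"]))
```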

Anti-entropy with Merkle trees#

Each node keeps a separate Merkle tree for each key range (the range covered by a virtual node) that it hosts. This allows the nodes to determine whether the keys in a given range are consistent. Two nodes exchange the roots of the Merkle trees corresponding to the key ranges they host in common and then make the following comparison (a code sketch follows the list):

  1. Compare the hashes of the root nodes of the Merkle trees.
  2. If they’re the same, the replicas are in sync and nothing further needs to be done.
  3. Otherwise, traverse the left and right children recursively. The nodes identify the ranges that differ and perform the necessary synchronization.
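
A minimal sketch of this comparison, assuming a simple dictionary-based tree representation with hypothetical key ranges and hash values (a real tree would hash the children together rather than concatenate them):

```python
# Minimal sketch with a hypothetical tree representation: leaves carry the key
# range they cover, internal nodes carry their children. The parent "hash" is
# a toy concatenation of the child hashes for illustration only.
def leaf(keys, digest):
    return {"hash": digest, "keys": keys}

def node(left, right):
    return {"hash": left["hash"] + right["hash"], "left": left, "right": right}

def find_out_of_sync(a, b, diffs):
    """Recursively collect key ranges whose hashes differ between replicas."""
    if a["hash"] == b["hash"]:
        return                      # identical subtrees: nothing to synchronize
    if "left" not in a:             # reached a pair of leaves that disagree
        diffs.append(a["keys"])
        return
    find_out_of_sync(a["left"], b["left"], diffs)
    find_out_of_sync(a["right"], b["right"], diffs)

tree_a = node(leaf("K1-K2", "aa"), leaf("K3-K4", "bb"))
tree_b = node(leaf("K1-K2", "aa"), leaf("K3-K4", "cc"))  # K3-K4 differs

diffs = []
find_out_of_sync(tree_a, tree_b, diffs)
print(diffs)  # ['K3-K4'] -- only this range needs to be exchanged
```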

The following slides explain more about how Merkle trees work.

Note: We assume the ranges defined are hypothetical for illustration purposes.

  1. Let’s suppose we have the virtual nodes A and B in the ring.
  2. The ranges that each virtual node covers are defined.
  3. The ranges that each virtual node covers are shown in table form.
  4. The Merkle tree for node A.
  5. The Merkle tree for node B.
  6. Suppose a new virtual node, N8, of A is added, and the ranges are updated accordingly.
  7. The updated ranges that each virtual node covers are shown in table form.
  8. The updated Merkle tree for node A.
  9. The updated Merkle tree for node B.

The advantage of using Merkle trees is that each branch can be examined independently without requiring nodes to download the entire tree or the complete dataset. This reduces the quantity of data that must be exchanged for synchronization and the number of disk accesses required during the anti-entropy procedure.

The disadvantage is that when a node joins or departs the system, the tree’s hashes must be recalculated because multiple key ranges are affected.

We want our nodes to detect the failure of other nodes in the ring, so let’s see how we can add it to our proposed design.

Promote membership in the ring to detect failures#

Nodes can be offline for short periods, but they may also go offline indefinitely. We shouldn’t rebalance partition assignments or repair unreachable replicas every time a single node goes down, because such an outage is rarely a permanent departure. Therefore, the addition and removal of nodes from the ring should be done carefully.

Planned commissioning and decommissioning of nodes results in membership changes. These changes form a membership history, which is recorded persistently on each node’s storage and reconciled among the ring members using a gossip protocol. The gossip-based protocol maintains an eventually consistent view of membership: when two nodes randomly choose each other as gossip peers, they efficiently reconcile their persisted membership histories.

Let’s learn how a gossip-based protocol works by considering the following example. Say node A starts up for the first time, and it randomly adds nodes B and E to its token set. The token set contains the virtual nodes in the consistent hash space and maps nodes to their respective token sets. This information is stored locally on the node’s disk.

Now, node A handles a request that results in a change, so it communicates this to B and E. Another node, D, has C and E in its token set; it makes a change and tells C and E. The other nodes do the same. This way, every node eventually learns about every other node’s information. It’s an efficient way to share information asynchronously, and it doesn’t take up much bandwidth.
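
A minimal sketch of one gossip round (hypothetical node IDs, membership views, and version numbers): a node picks a random peer from its token set, and the two merge their views by keeping the newer entry for every member.

```python
import random

# Minimal sketch with hypothetical membership views: each node keeps a version
# number per known member; a gossip exchange merges two views by keeping the
# newer version of every entry.
views = {
    "A": {"A": 3, "B": 1, "E": 1},
    "B": {"B": 2, "A": 1},
    "E": {"E": 4, "D": 2},
}
token_sets = {"A": ["B", "E"]}  # node A gossips with nodes from its token set

def gossip_round(node):
    """Pick a random peer and reconcile both membership views."""
    peer = random.choice(token_sets[node])
    merged = {m: max(views[node].get(m, 0), views[peer].get(m, 0))
              for m in views[node].keys() | views[peer].keys()}
    views[node] = dict(merged)
    views[peer] = dict(merged)

gossip_round("A")
print(views["A"])  # A's view now also includes whatever its peer knew
```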

  1. A set of nodes in a ring.
  2. Node A processes a request. Its token set contains B and E.
  3. Node A gossips membership information to nodes B and E after a few requests.
  4. Node D processes a request. Its token set contains C and E.
  5. Node D gossips membership information to nodes C and E after a few requests.

Points to Ponder

Question 1

Keeping in mind our consistent hashing approach, can the gossip-based protocol fail?


Decentralized failure detection protocols use a gossip-based protocol that allows each node to learn about the addition or removal of other nodes. Explicit join and leave methods notify the nodes about permanent node additions and removals, while individual nodes detect temporary failures when they fail to communicate with another node. If a node fails to reach one of the nodes in its token set for the allotted time, it reports to the administrators that the node is dead.
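
As a rough illustration (hypothetical timeout value and peer timestamps), a node can track when it last heard from each peer in its token set and report a peer as failed once the silence exceeds the allowed window:

```python
import time

# Rough illustration with a hypothetical timeout: a peer is reported as failed
# once it has been silent for longer than the allowed window.
FAILURE_TIMEOUT = 10.0  # seconds without contact before reporting a failure
last_heard = {"B": time.time(), "E": time.time() - 30}  # E has gone quiet

def detect_failures(now=None):
    """Return the peers that have not been heard from within the timeout."""
    now = now if now is not None else time.time()
    return [peer for peer, t in last_heard.items() if now - t > FAILURE_TIMEOUT]

print(detect_failures())  # ['E'] -- this would be reported to the administrators
```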

Conclusion#

A key-value store provides flexibility and allows us to scale applications that have unstructured data. Web applications can use key-value stores to hold information about a user’s session and preferences; all of that data is accessible through the user’s key, and key-value stores are ideal for rapid read and write operations. Key-value stores can also power real-time recommendations and advertising because they can swiftly access and serve fresh recommendations.
