A Java implementation using sockets

Raft Whitepaper »

Docs · Report Bug · Request Feature
Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent sub-problems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.
Consensus is a fundamental problem in fault-tolerant distributed systems. Consensus involves multiple servers agreeing on values. Once they reach a decision on a value, that decision is final. Typical consensus algorithms make progress when any majority of their servers is available; for example, a cluster of 5 servers can continue to operate even if 2 servers fail. If more servers fail, they stop making progress (but will never return an incorrect `stateMachineResult`).
Consensus typically arises in the context of replicated state machines, a general approach to building fault-tolerant systems. Each server has a state machine and a log. The state machine is the component that we want to make fault-tolerant, such as a hash table. It will appear to clients that they are interacting with a single, reliable state machine, even if a minority of the servers in the cluster fail. Each state machine takes as input commands from its log. In our hash table example, the log would include commands like set x to 3. Each state machine processes the same series of commands and thus produces the same series of results and arrives at the same series of states.
Java implementation:
- Consensus Module
- Persistent State Management
- Candidate
- Leader
- Follower
- Vote Handling
- Client
- Server
- State Machine
- Thread Management
- Log Management
- Entities
- Persistence
- Truncation
- Snapshots
Testbed environment:
- Network Topology
- Vagrant configuration
- Router
- Main Switch
- Sub-Switches
- Nodes
- Scripts for testing (link failure simulation and the like)
- GUI
- Adding `tc` rules
- Deleting `tc` rules
- Network partition support
See the open issues for a full list of proposed features (and known issues).
Since the algorithm is robust against any non-Byzantine failure (links can have omissions and delays, and processes can stop at any time), we decided to deploy 5 hosts and simulate the different behaviors our implementation could exhibit in a controlled environment. The network topology is shown in the diagram below.
To work in a controlled environment and properly assess the correctness of the implementation, we decided to virtualize the 5 hosts, plus a router and a switch to network them together. The challenge was to test the above-mentioned failures:
- Links can have omissions;
- Links can have delays;
- Processes can stop at any time.
The software of choice for creating a suitable testing lab was Vagrant (together with VirtualBox): both tools are open source and offer the required capabilities for simulating link failure as well as process failure. This is explained in more depth in the following sections.
```
RAFT
└── Testing Environment/
    ├── vagrant/
    │   ├── nodes/
    │   │   ├── node1.sh
    │   │   ├── node2.sh
    │   │   ├── node3.sh
    │   │   ├── node4.sh
    │   │   └── node5.sh
    │   ├── client.sh
    │   ├── raft-node.sh
    │   ├── router.sh
    │   └── switch.sh
    └── Vagrantfile
```
As shown in the box above, the VMs are created following what is defined in the `Vagrantfile` and configured by the respective bash scripts contained in `vagrant/`.
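For instance, the whole testbed can be brought up and inspected with the standard Vagrant workflow (the machine name `node1` is an assumption based on the provisioning scripts above; the actual names are defined in the `Vagrantfile`):

```bash
# Bring up the whole testbed (router, switches, nodes, client)
vagrant up

# Check the state of every VM
vagrant status

# Open a shell on a single machine, e.g. the first Raft node
vagrant ssh node1
```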
Network failure is simulated using netem, a traffic-shaping tool built into the Linux kernel. It allows us to simulate:
- Delay: `tc qdisc change dev eth0 root netem delay 100ms 20ms distribution normal`
- (Random) Packet Loss: `tc qdisc change dev eth0 root netem loss 0.1%`
- Packet Duplication: `tc qdisc change dev eth0 root netem duplicate 1%`
- Packet Corruption: `tc qdisc change dev eth0 root netem corrupt 0.1%`
Another useful feature of netem is the capability to control the bandwidth using its `rate` option. Another option could have been playing with `iptables`.
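Note that the `change` commands above assume a netem qdisc is already attached to the interface. A minimal, generic sketch of the full lifecycle (standard `tc` usage, not taken from the repository's scripts; the interface name matches the examples above):

```bash
# Attach a netem qdisc to the interface before it can be changed
tc qdisc add dev eth0 root netem delay 100ms 20ms distribution normal

# Update the existing qdisc, e.g. adding a bandwidth cap with the rate option
tc qdisc change dev eth0 root netem delay 100ms 20ms rate 1mbit

# Inspect and remove the rules when done
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```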
The following table maps each VM to its respective IP address(es):
| Name | IP |
|---|---|
| Client | 10.0.1.2 |
| Router | 10.0.1.1, 10.0.0.1 |
| Switches | - |
| Node 1 | 10.0.0.11 |
| Node 2 | 10.0.0.12 |
| Node 3 | 10.0.0.13 |
| Node 4 | 10.0.0.14 |
| Node 5 | 10.0.0.15 |
Each node machine is identical to the others (every one of them is configured by the same bash script, `vagrant/raft-node.sh`).
As shown in the diagram above, the network is divided into 2 portions. This choice was made to allow network partitioning and to test the behavior of our Raft implementation in a particular case: the network can fail partially. Moreover, one portion of the network (the orange one) can re-establish consensus, while the remaining part (the blue one) shouldn't be able to reach a quorum, since it contains only 2 nodes.
The switch layering is managed by the open source tool Open vSwitch, which allowed us to create 3 virtual bridges connected together to simulate a somewhat realistic scenario. The nodes in the green zone of the diagram are virtual switches and will be used in the failure simulation (see the following section).
The setup is handled by the `vagrant/switch.sh` script, which shows exactly what is going on in terms of configuration.
The internal configuration of the above-mentioned virtual bridges can be visualized in the following diagram. The three bridges are connected to each other using the `patchN` ports and, as the names suggest, `vSwitchRouter` is connected to the router, `vSwitchPart1` connects the hosts in the upper network partition, and `vSwitchPart2` connects the remaining 3 hosts of the lower network partition.
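As a rough illustration of how such a topology can be built with Open vSwitch (a sketch only: the bridge names come from the diagram, the patch-port names are assumptions, and the actual commands live in `vagrant/switch.sh`):

```bash
# Create the three bridges shown in the diagram
ovs-vsctl add-br vSwitchRouter
ovs-vsctl add-br vSwitchPart1
ovs-vsctl add-br vSwitchPart2

# Link vSwitchRouter and vSwitchPart1 with a pair of patch ports
# (port names are illustrative; the remaining bridge is wired the same way)
ovs-vsctl add-port vSwitchRouter patch1 -- set interface patch1 type=patch options:peer=patch2
ovs-vsctl add-port vSwitchPart1 patch2 -- set interface patch2 type=patch options:peer=patch1
```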
This configuration adds complexity to the testbed, but allows us to test the partition tolerance of our implementation.
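For example, a partition between two bridges can be simulated by removing the patch port that links them and later re-adding it to heal the partition (again a sketch with the assumed port names above; the repository's scripts may implement this differently):

```bash
# Cut the link between vSwitchRouter and vSwitchPart1 (network partition)
ovs-vsctl del-port vSwitchRouter patch1

# Heal the partition by re-creating the patch port
ovs-vsctl add-port vSwitchRouter patch1 -- set interface patch1 type=patch options:peer=patch2
```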
As mentioned above, link failure is handled using the traffic control tool (`tc`) that comes built into Linux. To make testing easier, we created a script that runs on the host machine and, using `vagrant ssh`, adds traffic control rules to the switches' interfaces, using `netem` to simulate the different behaviors listed in the previous section. Moreover, the user can halt VMs and kill processes in order to simulate process crashes.
Said script is `control-center.sh`; the idea behind it is sketched below.
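A minimal sketch of the approach (hypothetical machine and interface names, not the actual contents of `control-center.sh`):

```bash
#!/bin/bash
# Illustrative only: drive tc/netem on a switch VM via vagrant ssh
# and simulate a process crash. "switch", "node1" and "eth1" are assumptions.

case "$1" in
  add-delay)
    # Add a 100ms +/- 20ms delay on one of the switch's interfaces
    vagrant ssh switch -c "sudo tc qdisc add dev eth1 root netem delay 100ms 20ms distribution normal"
    ;;
  del-rules)
    # Remove all traffic control rules from that interface
    vagrant ssh switch -c "sudo tc qdisc del dev eth1 root"
    ;;
  crash-node)
    # Simulate a process crash by halting a node VM
    vagrant halt node1
    ;;
  *)
    echo "usage: $0 {add-delay|del-rules|crash-node}"
    ;;
esac
```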
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the GPLv3 License. See `LICENSE` for more information.
Giovanni Baccichet - @Giovanni_Bacci - giovanni.baccichet[at]polimi.it
Chiara Magri - chiara1.magri[at]mail.polimi.it