Replication

Tupl's built-in replication subsystem follows a single-master, Raft-based design, with a few key differences (explained later):

  • Not strictly durable except during elections and membership changes.
  • Native data processing is stream-oriented, not message-oriented.
  • Data replication burden is shared by all group members.

To create a database with replication support, a ReplicatorConfig object must be filled in:

// Semi-unique identifier for the replication group, to
// guard against accidental cross-group pollution.
long groupToken = ...

// Local socket port to listen on.
int localPort = ...

// Replication configuration for the primordial group member.
ReplicatorConfig replConfig = new ReplicatorConfig()
    .groupToken(groupToken)
    .localPort(localPort);

Then pass the replication config to the usual DatabaseConfig object, and open the database:

DatabaseConfig dbConfig = ...

// Enable replication.
dbConfig.replicate(replConfig);

// Enable message logging, through java.util.logging.
dbConfig.eventListener(new EventLogger());

// Open the database.
Database db = Database.open(dbConfig);

This creates the first group member, which logs messages like so:

INFO: REPLICATION: Local member added: NORMAL
INFO: CACHE_INIT: Initializing 239694 cached nodes
INFO: CACHE_INIT: Cache initialization completed in 0.218 seconds
INFO: RECOVERY: Database recovery begin
INFO: RECOVERY: Recovery completed in 0.024 seconds
INFO: REPLICATION: Local member is the leader: newTerm=1, index=0

Instead of the usual "redo" log files, "repl" files are created in the base directory:

  • /var/lib/tupl/myapp.db – primary data file
  • /var/lib/tupl/myapp.info – text file describing the database configuration
  • /var/lib/tupl/myapp.lock – lock file to ensure that at most one process can have the database open
  • /var/lib/tupl/myapp.repl.1.0.0 – first replication log file: term 1, index 0, previous term 0
  • /var/lib/tupl/myapp.repl.group – text file containing the group membership
  • /var/lib/tupl/myapp.repl.md – binary replication metadata file
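
The "myapp" base name in these paths comes from the base file path given to the DatabaseConfig; a minimal sketch of such a configuration (the path and cache size are illustrative, not required values):

// Illustrative configuration; the base file path determines the file
// names shown above.
DatabaseConfig dbConfig = new DatabaseConfig()
    .baseFilePath("/var/lib/tupl/myapp")
    .minCacheSize(100_000_000);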

Additional members must be configured with one or more "seeds". Once a member has joined, the seeds are no longer necessary.

// Add the known primordial member as a seed.
replConfig.addSeed("leaderhost", port);

The joining member logs messages like so:

INFO: REPLICATION: Remote member added: leaderhost:3001, NORMAL
INFO: REPLICATION: Local member added: OBSERVER
INFO: REPLICATION: Receiving snapshot: 3,371,008 bytes from leaderhost:3001
INFO: REPLICATION: Remote member is the leader: leaderhost:3001, newTerm=1
INFO: CACHE_INIT: Initializing 239694 cached nodes
INFO: CACHE_INIT: Cache initialization completed in 0.268 seconds
INFO: RECOVERY: Database recovery begin
INFO: RECOVERY: Loading undo logs
INFO: RECOVERY: Recovery completed in 0.052 seconds
INFO: REPLICATION: Local member role changed: OBSERVER to NORMAL

Only the leader can write to the database. Writing to a replica causes an UnmodifiableReplicaException to be thrown, rolling back the transaction.
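
If a write might land on a replica, the exception can be caught and the operation retried or redirected. A minimal sketch, assuming an index named "test" already exists and db is the open Database:

// Attempt a write, which only succeeds on the current leader.
Index ix = db.openIndex("test");
try {
    ix.store(null, "hello".getBytes(), "world".getBytes());
} catch (UnmodifiableReplicaException e) {
    // This member is currently a replica; the transaction was rolled back.
    // Retry later, or redirect the write to the leader.
}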

Differences from Raft

Raft requires that all changes be durable before they are replicated. This ensures that once consensus is reached, a majority of hosts would need to fail completely for a change to potentially roll back. By default, Tupl's replication system only requires that a majority have reached consensus in volatile memory. If a majority of hosts are restarted within a short period of time (a few seconds), some committed changes might roll back. The trade-off is roughly an order-of-magnitude increase in performance. Changes become truly durable following a checkpoint or a sync, and checkpoints run automatically once a second by default.
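
When stronger durability is needed sooner than the next automatic checkpoint, it can be requested explicitly. A minimal sketch using the Database sync method:

// Durably persist all changes committed so far, instead of waiting
// for the next automatic checkpoint.
db.sync();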

During elections and membership changes, Tupl's replication system follows the strict durability rules of the Raft specification. Term and membership changes are infrequent and critical, so relaxing their durability offers no benefit.

Raft describes a replication log consisting of indexed, variable-length messages. Tupl replicates data in a stream-oriented fashion, which in a sense means that every message is one byte in length. Replication is performed in batches of bytes, as is typical of stream-oriented designs. This design simplifies how the log is encoded, and it also improves performance. Applications which use the replication system directly, without a database, do have a message-oriented option available, which is layered over the stream-oriented system.
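
A minimal sketch of the message-oriented option, assuming the MessageReplicator interface and its open method from the org.cojen.tupl.repl package; the base file path shown here is illustrative:

// Open the message-oriented replication layer directly, with no
// database attached. (Assumed API; consult the org.cojen.tupl.repl javadocs.)
ReplicatorConfig config = new ReplicatorConfig()
    .groupToken(groupToken)
    .localPort(localPort)
    .baseFilePath("/var/lib/tupl/myapp.repl");

MessageReplicator repl = MessageReplicator.open(config);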

In Raft, all changes must flow from the leader, in strict sequential order. If a replica is missing any data, Raft requires the leader to backtrack one message at a time until the replica catches up. This approach is simple, although not terribly efficient. In Tupl, replicas discover that they're missing data and then request it from any available peers. In the meantime, new messages arriving from the leader are accepted (out of order), but they aren't applied until all the gaps in the log are filled in.
