TokuMX Replication Threading
This document is mainly for Zardosht to get his thoughts on TokuMX replication threading down on paper.
At a high level, replication gets started up as follows:
- The main thread calls startReplication
- Necessary variables in txn_context.cpp and oplog.cpp are initialized (namely, pointers to oplog related functions)
- spawns a thread to run startReplSets
- The background thread takes over, via startReplSets. Everything below happens on a background thread.
- creates an instance of ReplSet theReplSet,
- construction of this hangs until we can load a config, which it does by calling loadConfig in the constructor. Either a config is already saved in the local.system.replset collection and is loaded immediately, or loadConfig sleeps and retries every 10 seconds until it finds one.
- after construction, ReplSetImpl::_go runs. What follows below is from ReplSetImpl::_go.
- GTIDManager is initialized based on what is found in the oplog (if anything).
- change state to RS_STARTUP2
- Manager is started up. The manager is responsible for the following:
- processing heartbeats from other machines in the config
- running elections
- processing new configs
All work that the manager does is serialized on a queue, so there is no race where, say, two different heartbeats each kick off an election.
- if this machine is the only machine in the config, it starts up as primary
- otherwise, it runs initial sync. Initial sync may do the following, based on the state of the machine:
- do a full initial sync, if the oplog is empty
- attempt to fill in gaps at the end of the oplog, if the oplog is non-empty and fastsync was passed. This is the case where we are making a secondary from a backup
- look at the end (or some entries near the end) of the oplog to see if anything in the oplog has yet to be applied to collections.
- start the following background threads:
- opSync/producer (naming in code is not consistent) <-- responsible for reading data off another machine and writing it to the oplog.
- applier <-- responsible for applying data just written to the oplog to collections, and updating the oplog to state that the data has been applied
- updateReplInfo <-- every second, writes the minLiveGTID and minUnappliedGTID to local.replInfo
- purgeOplog <-- responsible for periodically removing data from the oplog
- ghost <-- responsible for percolating the location of other secondaries that are tailing this machine to satisfy write concern
- Now, we are ready to start up the machine as a secondary. We start it up.
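To make the ordering above concrete, here is a rough sketch of the startup flow in C++. The function names mirror the ones mentioned in this document (startReplication, startReplSets, loadConfig, ReplSetImpl::_go); the bodies are illustrative stand-ins, not the actual TokuMX code.

```cpp
#include <chrono>
#include <thread>

void initOplogPointers() { /* pointers in txn_context.cpp / oplog.cpp */ }
bool tryLoadConfig()     { return true; /* look in local.system.replset */ }
void initGTIDManager()   { /* based on whatever is already in the oplog */ }
void startManager()      { /* heartbeats, elections, reconfigs */ }
void runInitialSync()    { /* full sync, fastsync gap-fill, or catch-up */ }
void startBackgroundThreads() { /* opsync/producer, applier, updateReplInfo,
                                   purgeOplog, ghost */ }

void startReplSets() {
    // ReplSet construction: block until a config can be loaded, retrying
    // every 10 seconds if none is found (loadConfig).
    while (!tryLoadConfig()) {
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }
    // ReplSetImpl::_go:
    initGTIDManager();
    // state -> RS_STARTUP2
    startManager();
    runInitialSync();
    startBackgroundThreads();
    // state -> RS_SECONDARY (or RS_PRIMARY if we are the only member)
}

void startReplication() {
    initOplogPointers();
    std::thread(startReplSets).detach();  // everything else runs here
}
```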
Arbiters do a subset of this work; basically, they never do ANYTHING necessary for secondaries, including starting these background threads or writing any data. Once a machine has become an arbiter, it must remain an arbiter. A reconfig cannot change a machine from arbiter to secondary or vice versa.
To recap, here are the background threads:
- manager
- opsync/producer
- applier
- updateReplinfo
- purgeOplog
- ghost
Note that the combination of opsync and applier makes up "replication". If those threads are running, then we consider replication to be running. If those threads are sleeping, then replication is not running. So, when a machine is a primary, replication is not running, meaning those two threads are sleeping and not doing anything.
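A minimal sketch of that idea, assuming "replication running" is modeled as a single flag guarded by a condition variable (the real code is more involved, and the class and method names here are made up): the opsync and applier loops simply sleep until the flag is set.

```cpp
#include <condition_variable>
#include <mutex>

class ReplRunning {
    std::mutex _m;
    std::condition_variable _cv;
    bool _running = false;            // true only while we are a secondary
public:
    void start() {                    // becoming a secondary
        std::lock_guard<std::mutex> lk(_m);
        _running = true;
        _cv.notify_all();
    }
    void stop() {                     // becoming primary, reconfig, recovery
        std::lock_guard<std::mutex> lk(_m);
        _running = false;
    }
    void waitUntilRunning() {         // called by the opsync and applier loops
        std::unique_lock<std::mutex> lk(_m);
        _cv.wait(lk, [this] { return _running; });
    }
};
```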
Now let's review the states a machine can be in, and how they may get into those states, starting with the easy ones:
- RS_STARTUP <-- when creating an instance of ReplSet
- RS_STARTUP2 <-- during initial sync on startup, mentioned above
- RS_ARBITER <-- This is the state arbiters are in, as long as the machine is in the replica set. We transition from RS_STARTUP2 -> RS_ARBITER for arbiter machines, and always stay that way unless a reconfig takes us out of the replica set.
Now the more complicated ones:
- RS_SHUNNED <-- if a reconfig shows we are no longer in the replica set, we change the state to RS_SHUNNED and wait for a new config (via loadConfig) to put us back into the replica set.
- RS_PRIMARY <-- The machine is a primary. This can happen as follows:
- during startup, we notice we are the only machine in the set
- an election, run on the manager thread.
- during a reconfig, we step down to do the reconfig, and then we take back the primary if the reconfig was simple. See ReplSetImpl::initFromConfig
- RS_SECONDARY <-- we are a machine running replication (with the exception of a temporary stepdown in loadFromConfig; not a big deal, but I probably ought to add a new state for that, as it really is temporary). Outside of that small screw case, here are the times we can become a secondary:
- after startup
- when a primary relinquishes itself, as it can happen in these cases:
- via a command
- manager decides we should relinquish, for a number of reasons (msgCheckNewState)
- consensus.cpp, if another machine is trying to elect itself and we agree (this is a weird case, and I would not be surprised if it goes away)
- when a machine is no longer in recovery mode. This can happen when:
- manager calls blockSync(false), when it does not see an auth issue
- the replSetMaintenance command sets _maintenanceMode, a counter, to 0.
- when a machine is no longer shunned, and a new config is set that brings the machine back into the replica set, we transition to RS_SECONDARY (or RS_RECOVERING)
- RS_RECOVERING <-- a machine is in RECOVERING mode if either it has been put into maintenance mode via the replSetMaintenance command, or the manager has noticed an auth issue and wants to block replication. The idea of RECOVERING mode is that replication should not and does not run. So, when a machine wants to transition from RS_STARTUP2 or RS_SHUNNED to some other state, it cannot blindly convert to a secondary. We must check if the machine ought to be in RECOVERING. The threads that may transition a machine from RS_SECONDARY to RS_RECOVERING (a primary can never transition to recovering) are the following:
- replSetMaintenance command
- manager, as it sees an auth issue
- RS_ROLLBACK <-- this can only run when replication is running. So, anybody that needs to wait for replication to stop cannot see a machine in RS_ROLLBACK after replication has successfully stopped.
- RS_FATAL <-- done in the following scenarios:
- while replication is running, if we see a machine cannot sync off of anyone because it is too stale (this is different from vanilla MongoDB, which goes into recovering mode when too stale), or if rollback fails
- if something bad happens when loading a config.
- if something bad happens during startup
The following states are defined, but a machine cannot be in these states itself. These are states that other machines in the system can be in:
- RS_DOWN <-- the machine is down or inaccessible
- RS_UNKNOWN <-- we do not know the machine's state
A machine never thinks its own state is RS_DOWN or RS_UNKNOWN, but it may think that the state of some other machine in the replica set is down or unknown.
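For reference, here are all of the states discussed above collected into a plain enum; the names match the ones used in this document, and the comments just summarize the descriptions given above.

```cpp
enum MemberState {
    RS_STARTUP,    // constructing the ReplSet instance
    RS_STARTUP2,   // initial sync on startup
    RS_PRIMARY,    // accepting writes; the replication threads are sleeping
    RS_SECONDARY,  // replication (opsync + applier) is running
    RS_RECOVERING, // maintenance mode or auth issue; replication stopped
    RS_ROLLBACK,   // entered only while replication is running
    RS_SHUNNED,    // no longer in the config; waiting for a new one
    RS_ARBITER,    // arbiters stay here unless reconfigured out of the set
    RS_FATAL,      // too stale, failed rollback, or a bad config/startup
    RS_DOWN,       // only used for other members we cannot reach
    RS_UNKNOWN     // only used for other members whose state we don't know
};
```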
The following threads can stop replication:
- threads that want to become the primary
- manager when it notices we have a new config
- the replSetMaintenance command (setMaintenanceMode)
- manager when it calls blockSync due to an authIssue
Places where we check whether we are fatal:
- opsync thread, before starting to do something
- tryToGoLiveAsASecondary does a no-op if we are fatal
So now let's talk about the threading of the system, and let's disregard arbiters. The system just knows if it is an arbiter and disables the background threads. There seem to be four "steady" states:
- running as primary (RS_PRIMARY)
- running as secondary (RS_SECONDARY, RS_ROLLBACK)
- loading a config (RS_SHUNNED)
- running in recovery mode
Because we are not idempotent, and we want to run replication assuming there are no other writes happening in the system, we need to cleanly transition between these states. I don't think vanilla MongoDB does so. At the bottom of this document are possible vanilla MongoDB bugs I see with state transitions. The bugs are probably benign, but because we are not idempotent, I don't want to risk any screwy behavior.
What we want is as follows. When a machine is running as a secondary, replication is running. When a machine is loading a config, a primary, or in recovery mode, replication is not running. We want these state transitions to be clean.
To do so, I added a new mutex, called the "stateChangeMutex". This mutex is used to cleanly transition between these steady states. The following hold the stateChangeMutex to run operations:
- trying to set a machine to recovery.
- loading a config
- running an election on the manager thread
- any command that wants to change these states (such as step down)
Note that the transition from RS_SECONDARY -> RS_ROLLBACK -> RS_SECONDARY is NOT covered by this mutex, because this is only done while running replication, which should be encapsulated in the "running as secondary" steady state. What this does mean is that other threads must check the state of the machine AFTER stopping replication, and not before, because before, we may be in secondary/rollback, but after, who knows. We may be fatal if rollback fails.
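A rough sketch of that rule, under the assumption that the transition is driven by some function like the hypothetical stepDownForTransition below (only stateChangeMutex and the state names come from this document; everything else is a stand-in): stop replication first, and only then look at the state.

```cpp
#include <mutex>

enum State { SECONDARY, ROLLBACK, FATAL };

std::mutex stateChangeMutex;                 // name from this document
State g_state = SECONDARY;                   // stand-in for the real state
void stopReplication() { /* wait for opsync/applier to pause */ }

void stepDownForTransition() {
    std::lock_guard<std::mutex> lk(stateChangeMutex);
    // While replication was running we may have bounced between SECONDARY
    // and ROLLBACK; the state is only meaningful to inspect once replication
    // has fully stopped.
    stopReplication();
    if (g_state == FATAL) {
        return;          // rollback failed while we were stopping; bail out
    }
    // ... safe to perform the transition (reconfig, step-down, etc.) here ...
}
```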
To protect against client threads running normal commands/queries/writes, these important state transitions should also hold a global write lock, and they do.
Here are other locks of note:
- lock for multi-statement transactions. If we want to do a transition at a time when we know there are no multi-statement transactions being created, we use this lock. Needed for transitioning from primary to secondary, because the machine may have some write transactions that have yet to commit.
- repl set lock. This is used to serialize some repl set operations.
The order of locks necessary:
- stateChangeMutex
- rslock
- multiStmtTransactionLock
- global write lock
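As an illustration only (the real locks are not plain std::mutexes, and transitionWithAllLocks is a name I made up), a transition that needs all four locks would acquire them in exactly that order:

```cpp
#include <mutex>

std::mutex stateChangeMutex;
std::mutex rslock;
std::mutex multiStmtTransactionLock;
std::mutex globalWriteLock;

void transitionWithAllLocks() {
    // Always acquire in this order to avoid deadlocking with other threads
    // that take a prefix of the same sequence.
    std::lock_guard<std::mutex> a(stateChangeMutex);
    std::lock_guard<std::mutex> b(rslock);
    std::lock_guard<std::mutex> c(multiStmtTransactionLock);
    std::lock_guard<std::mutex> d(globalWriteLock);
    // ... perform the state transition ...
}
```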
The fatal state is a funny state. The replication threads can go fatal, and loading a bad config can go fatal. Threads just need to be aware that a machine can be fatal and act accordingly. That means checking the state of the system after stopping replication.
These are POSSIBLE bugs I see from reviewing vanilla MongoDB's code. This is as of May 27th, 2013, vanilla MongoDB master. I don't have proof as to whether they are actual bugs.
From an email I sent to the mongodb-dev group: In Command::execCommand, the code that checks if a command can run on the given machine seems to be:

```cpp
bool canRunHere =
    isMaster( dbname.c_str() ) ||
    c->slaveOk() ||
    ( c->slaveOverrideOk() && ( queryOptions & QueryOption_SlaveOk ) ) ||
    fromRepl;
```
This bool is evaluated before any locking occurs for the command (assuming locking is needed). ReplSetImpl::relinquish() protects the transition of a machine from RS_PRIMARY to RS_SECONDARY with a global write lock.
What is to stop the following from happening:
- a command foo executes canRunHere while the machine is still the primary
- canRunHere evaluates to true
- the machine then transitions from primary to secondary
- command foo then grabs its appropriate lock (read or write), and proceeds to run the command on a secondary, even though it may not be allowed to do so.
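A standalone sketch of that suspected time-of-check/time-of-use window; weArePrimary and execCommandSketch are stand-ins I introduced, and only the ordering (check before any lock is taken) reflects the quoted vanilla code.

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> weArePrimary{true};   // stand-in for isMaster()
std::mutex globalWriteLock;             // relinquish() holds this to step down

void execCommandSketch() {
    bool canRunHere = weArePrimary.load();            // time of check
    // <-- window: another thread can take globalWriteLock, flip
    //     weArePrimary to false (relinquish), and release it right here.
    std::lock_guard<std::mutex> lk(globalWriteLock);  // time of use
    if (canRunHere) {
        // a primary-only command runs even though we may now be a secondary
    }
}
```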
There seem to be races between the replSetMaintenance command and machines transitioning from secondary->primary. Take the following example:
- assumePrimary() called on thread 1
- before verify( iAmPotentiallyHot() ) runs, a replSetMaintenance command executes, transitioning the secondary to recovering. This should cause a crash.
- after the verify, but before the global lock is grabbed, the machine may still go into recovering mode. Seems like a bug according to their rules.
In ReplSetImpl::assumePrimary(), they call replset::BackgroundSync::get()->stopReplicationAndFlushBuffer() to stop replication. This stops replication by setting the _pause variable to true. What is to stop the producer thread from continuing and setting _pause back to false (via BackgroundSync::_producerThread())? There is a window in between when they stop replication and flush buffer and before they assume the primary where it looks like replication could kick start again.
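A sketch of that suspected window; the function names echo the vanilla code discussed above, but the bodies here are stand-ins that only show the interleaving.

```cpp
#include <atomic>

std::atomic<bool> _pause{false};   // stand-in for BackgroundSync's _pause

void stopReplicationAndFlushBuffer() {
    _pause = true;                 // assumePrimary() asks the producer to stop
    // ... flush the apply buffer ...
    // <-- window: nothing prevents the producer from clearing _pause right
    //     here, before assumePrimary() has actually taken over.
}

void producerThreadIteration(bool conditionsLookGood) {
    if (_pause && conditionsLookGood) {
        _pause = false;            // the producer un-pauses itself
    }
    // ... continue reading ops from the sync source ...
}
```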
In ReplSetImpl::setMaintenanceMode, they simply increment _maintenanceMode and do not wait for replication to stop. SyncTail will eventually notice that they are not a secondary and stop trying to apply data. I think that means that replication can still be running for a bit even when the machine is in the recovery state. Ops may still be getting applied. I don't think ops are flushed either. I don't know what impact this has on the system.
Take the following scenario:
- A is primary, B is secondary, and a bunch of other secondaries exist to ensure that a majority can elect someone.
- A disconnects, but has yet to notice it's disconnected, so it does not relinquish
- B gets voted primary
- B accepts a write and replicates it to a majority of the set, hence satisfying write concern of majority
- A reconnects
- B recognizes that there are two primaries and relinquishes itself
- A continues being primary
- when B, and all other secondaries, start syncing from A, they rollback the operation that was accepted and replicated by B
We now have a write that has theoretically been replicated to a majority of the set, but has been rolled back.
The problem is that the resolution of seeing multiple primaries is a bit arbitrary. I don't have a good solution to this.
Similar to above, but suppose A and B cannot see each other. They can see everyone else in the set, but not each other. I don't see any code that will force A or B to step down. This is because the manager will not send a stepdown command to one of the primaries. The manager will only step itself down if it sees that it is a primary, another machine is also a primary, and its id is greater than the other machine's id.
This is likely not a bug.
When rollback runs, a global write lock is grabbed to place the machine in the rollback state, but existing cursors are not invalidated. Maybe this invalidation automatically happens as they modify the oplog?
The method Consensus::electCmdReceived is responsible for doing an actual election, after a round of replSetFresh commands tells us "this machine would make a good primary".
I see in Consensus::electCmdReceived that a bunch of sanity checks are done before doing the actual election. Here is my question: why are the following sanity checks not also performed here, as they were in the replSetFresh command:
- that the hopeful primary's oplog position is ahead of theReplSet->lastOpTimeWritten
- that the hopeful primary's oplog position is ahead of theReplSet->lastOtherOpTime()
Can't this information change between the replSetFresh command and this call?