Cluster Metadata Internals
The following describes Cluster Metadata from an implementor's perspective. It is intended for those wishing to familiarize themselves with the implementation in order to make changes or to support the subsystem. This document is not aimed at those who simply wish to use Cluster Metadata, although it may still be helpful to them.
The Cluster Metadata subsystem is divided into several parts spread across multiple modules. These parts can be divided into three categories: representation, storage and broadcast. Unifying them is the API, which presents the subsystem as one cohesive unit to clients.
Cluster Metadata stores and represents values internally in a different form than the one presented to clients. Most notably, clients are not exposed to the logical clocks stored alongside values, while internally they are used heavily.
The riak_core_metadata_object module is responsible for defining the internal representation, referred to as the "Metadata Object". In addition, it provides a set of operations that can be performed on Metadata Objects. These operations include conflict resolution, reconciling differing versions of an object, and accessing the values, the clock, and additional information about the object. Outside this module the structure of the Metadata Object is opaque, and all operations on objects should go through it.
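As a rough illustration of the kind of object and operations involved, the hypothetical sketch below models a Metadata Object as a logical clock plus a list of sibling values; the record and function names here are invented for illustration and do not mirror the real riak_core_metadata_object API.
-module(metadata_object_sketch).   %% hypothetical module, not part of riak_core
-export([values/1, resolve/2]).
%% A Metadata Object pairs a logical clock with one value, or with several
%% sibling values after concurrent writes (hypothetical representation).
-record(meta_obj, {clock, values = []}).
%% Accessor: return the (possibly conflicting) values without exposing the clock.
values(#meta_obj{values = Vs}) -> Vs.
%% Conflict resolution: collapse siblings into a single value using a
%% caller-supplied resolver function, keeping the existing clock.
resolve(#meta_obj{values = [_Single]} = Obj, _Resolver) ->
    Obj;
resolve(#meta_obj{values = Siblings} = Obj, Resolver) when is_function(Resolver, 1) ->
    Obj#meta_obj{values = [Resolver(Siblings)]}.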
The logical clocks used by Cluster Metadata are Dotted Version Vectors. The implementation is contained within dvvset.erl.
Data is stored both in-memory, using ets, and on-disk, using dets. Each node locally manages its own storage via the riak_core_metadata_manager, or "Metadata Manager", process running on it.
The Metadata Manager is not responsible for replicating data, nor is it aware of other nodes. However, it is aware that writes are causally related by their logical clocks, and it uses this information to make local storage decisions, e.g. ignoring a write whose object is causally older than the value already stored. Data read out of the Metadata Manager is a Metadata Object, while data written is either a Metadata Object or a raw value and a view of the logical clock (the view contains enough details to reconstruct the information in the original clock).
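A minimal sketch of a local write and read through the Metadata Manager, with placeholder prefix, key, and value names; passing undefined as the clock view writes a value with no causal context, as the sibling example later in this document shows:
FullPrefix = {<<"my_prefix">>, <<"my_subprefix">>},   %% placeholder prefix pair
%% Write a raw value with no causal context (clock view).
riak_core_metadata_manager:put({FullPrefix, my_key}, undefined, my_value),
%% Read back the opaque Metadata Object, logical clock and all.
Obj = riak_core_metadata_manager:get({FullPrefix, my_key}).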
In addition to storing Metadata Objects, the subsystem maintains Hash Trees. These are the basis for determining differences between the data stored on two nodes and can be used to detect when individual keys or groups of keys change.
The Hash Trees that are used are implemented in the hashtree_tree module. More details on its implementation can be found there.
Each node maintains its own hashtree_tree via the riak_core_metadata_hashtree, or Metadata Hashtree, process. This process is responsible for safely applying different operations on the tree, in addition to exposing information needed elsewhere in the subsystem and by clients. It also periodically updates the tree, when it is not in use by other parts of the subsystem, so that clients using it to detect changed keys can do so without forcing the update themselves.
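As a sketch of how a client might use this, the function below polls a hash covering a prefix and reports when it changes; prefix_hash/1 is assumed here to be exported by riak_core_metadata_hashtree and to return the current hash for the given prefix, so verify the exact function against the module before relying on it.
%% Compare the current hash for a prefix against a previously seen one
%% (prefix_hash/1 is an assumption about the riak_core_metadata_hashtree API).
detect_change(Prefix, LastHash) ->
    case riak_core_metadata_hashtree:prefix_hash(Prefix) of
        LastHash -> unchanged;
        NewHash  -> {changed, NewHash}
    end.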
Metadata Hashtrees are only maintained for the lifetime of a running node. They are rebuilt each time a node starts and are never rebuilt for as long as the node continues to run. The initial design of this part of the system called for memory-backed Hash Trees. However, they are currently backed by LevelDB and are cleared on graceful shutdown and each time the node starts. The entire tree is stored within one LevelDB instance.
The Metadata Manager updates the hashes for keys using riak_core_metadata_hashtree in its write path. The upper levels of the tree will only reflect those changes when the tree is updated, either periodically or when needed by another part of the subsystem.
The above details focus on Cluster Metadata from the point of view of a single node. It is the broadcast mechanism that puts the "Cluster" in "Cluster Metadata". This mechanism was built to be general and is in and of itself a separate subsystem. More details about how it works can be found in the documentation for riak_core_broadcast. The following describes how Cluster Metadata uses this subsystem.
The riak_core_broadcast_handler behaviour for Cluster Metadata is implemented by the Metadata Manager. In addition to its simple get/put interface, it implements all the required broadcast functions.
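For orientation, a riak_core_broadcast_handler module exports roughly the callbacks sketched below; the callback names follow riak_core_broadcast, but the module name and trivial bodies here are hypothetical, so check the behaviour definition in your riak_core version for the exact specs.
-module(example_broadcast_handler).   %% hypothetical module name
-behaviour(riak_core_broadcast_handler).
-export([broadcast_data/1, merge/2, is_stale/1, graft/1, exchange/1]).
%% Return {MessageId, Payload} for an outgoing broadcast.
broadcast_data(Msg) -> {make_ref(), Msg}.
%% Merge an incoming message locally; return true if it was new.
merge(_MsgId, _Payload) -> false.
%% Return true if the message identified by MsgId is already known.
is_stale(_MsgId) -> false.
%% Return {ok, Payload} | stale | {error, Reason} for a requested message.
graft(_MsgId) -> {error, not_found}.
%% Start a repair with Peer; return {ok, Pid} | {error, Reason}.
exchange(_Peer) -> {error, not_implemented}.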
The majority of the riak_core_broadcast_handler functions are well-defined; however, exchange/1 allows a lot of leeway in how to implement the repair of data between the local node and a remote peer. To implement exchange/1, Cluster Metadata uses the Metadata Hashtrees on both nodes to determine the differences between the data stored on each, and the existing riak_core_metadata_manager interface to repair those differences. The implementation of this logic can be found in riak_core_metadata_exchange_fsm.
The API, contained within the riak_core_metadata module, ties together the above parts and is the only interface that should be used by clients. It is responsible not only for hiding the logical clock and the internal representation of the object, but also for performing writes via the Metadata Manager and submitting broadcasts of updates and resolved values.
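A minimal usage sketch of the client-facing API, with a placeholder prefix pair and key; put/3, get/2, and delete/2 are the basic entry points, and the extended arities take an options list. The return values noted in the comments should be verified against the module.
FullPrefix = {<<"my_app">>, <<"settings">>},   %% placeholder {Prefix, SubPrefix} pair
riak_core_metadata:put(FullPrefix, max_connections, 128),
Value = riak_core_metadata:get(FullPrefix, max_connections),   %% 128 here; undefined if missing
riak_core_metadata:delete(FullPrefix, max_connections).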
Known outstanding work and issues can be found under the riak_core repository's issue list. All issues pertaining to this subsystem have (and should be created with) the Cluster Metadata tag.
The following is a list of debugging and testing tips as well as ways to resolve unforeseen issues. It is by no means a complete list but it is, hopefully, a growing one.
It may be useful to know which nodes are being communicated with when replicating updates across the cluster for a given host. riak_core_broadcast's debug_get_peers/2,3 and debug_get_tree/2 expose this information without needing to inspect the process state of riak_core_broadcast. The documentation of those functions has more details.
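For example, from an Erlang shell on a running node (the argument order shown, the node to query plus the root of the broadcast tree for debug_get_peers/2 and a root plus a list of nodes for debug_get_tree/2, is an assumption; confirm it against the function documentation):
%% Peers of the broadcast tree as seen by this node (argument order assumed).
riak_core_broadcast:debug_get_peers(node(), node()),
%% The tree rooted at this node, across the listed nodes (assumed signature).
riak_core_broadcast:debug_get_tree(node(), [node() | nodes()]).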
The API hides the internal representation of objects from the client. However, when debugging and testing, it is sometimes useful to see what the Metadata Manager actually has stored, especially the logical clock. In this case, riak_core_metadata_manager:get/1,2 can be used to retrieve objects locally or from the Metadata Manager on another node.
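For example (get/1 takes the {FullPrefix, Key} pair and reads locally; the node-qualified form shown for get/2 is an assumption about its argument order):
%% Locally stored object, including its logical clock:
Obj = riak_core_metadata_manager:get({FullPrefix, Key}),
%% From the Metadata Manager on another node (argument order assumed):
RemoteObj = riak_core_metadata_manager:get('other@host', {FullPrefix, Key}).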
Although not typically a good thing to do, for testing purposes it's useful to know how to force the writing of siblings. Using the Metadata Manager, this can be done easily by writing values with no logical clock:
riak_core_metadata_manager:put({FullPrefix, Key}, undefined, sibling1value),
riak_core_metadata_manager:put({FullPrefix, Key}, undefined, sibling2value)
The two calls can be run on the same or different nodes. If run on different nodes the siblings won't appear until after an exchange has completed.
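Once siblings exist, reading through the client API with a resolver collapses them. The sketch below reuses FullPrefix and Key from the example above and assumes riak_core_metadata:get/3 accepts a {resolver, Fun} option whose fun is applied pairwise to conflicting values; verify both assumptions against the module before relying on them.
%% Pick one of two conflicting values (resolver shape assumed).
Resolver = fun(Sibling1, _Sibling2) -> Sibling1 end,
riak_core_metadata:get(FullPrefix, Key, [{resolver, Resolver}]).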
Most of the following is general to the broadcast mechanism but is useful within the context of Cluster Metadata.
The functions riak_core_broadcast:exchanges/0,1 can be used to determine which exchanges are running that were started on the current or a given node, respectively. These functions return a list of 4-tuples. Each 4-tuple contains the name of the module used as the riak_core_broadcast_handler for the exchange (always riak_core_metadata_manager for Cluster Metadata), the node that is the peer of the exchange, a unique reference representing the exchange, and the process id of the exchange fsm doing the work.
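For example, pattern matching on the head of the returned list (this assumes at least one exchange is currently running; the tuple order follows the description above):
[{Mod, Peer, Ref, Pid} | _] = riak_core_broadcast:exchanges().
%% Mod  : the riak_core_broadcast_handler module (riak_core_metadata_manager here)
%% Peer : the remote node participating in the exchange
%% Ref  : a unique reference identifying the exchange
%% Pid  : the exchange fsm process doing the work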
Running exchanges may be aborted using riak_core_broadcast:cancel_exchanges/1. The argument to cancel_exchanges/1 takes several forms.
To cancel all Cluster Metadata exchanges started by the node where this command is run:
riak_core_broadcast:cancel_exchanges({mod, riak_core_metadata_manager}).
To cancel all exchanges with a given peer that were started by the node where this command is run:
riak_core_broadcast:cancel_exchanges({peer, Peer}).
where Peer is the Erlang node name of the remote node.
To cancel a specific exchange started by the node where this command is run:
riak_core_broadcast:cancel_exchanges(PidOrRef).
where PidOrRef is a process id or unique reference found in a 4-tuple returned from riak_core_broadcast:exchanges/0.
If it is necessary to ensure that the data stored in Cluster Metadata is the same between two nodes, an exchange can be triggered manually. These exchanges are triggered by the broadcast mechanism so, typically, this should not be necessary. However, it can be done by running the following from an Erlang node running Cluster Metadata:
riak_core_metadata_manager:exchange(Peer)
Where Peer is the Erlang node name for another node in the cluster. This will trigger an exchange between the node the command was run on and Peer. This function does not block until the completion of the exchange; it only starts it. If the exchange was started successfully, {ok, Pid} will be returned, where Pid is the process id of the exchange fsm doing the work in the background. If the exchange could not be started, {error, Reason} will be returned.
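A small wrapper around the call, following the return values described above:
case riak_core_metadata_manager:exchange(Peer) of
    {ok, Pid} ->
        %% The exchange fsm was started and runs in the background.
        {started, Pid};
    {error, Reason} ->
        {not_started, Reason}
end.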
DETS files may need to be repaired if a node exits abnormally and the Metadata Manager was not allowed to properly close its files. When the Metadata Manager restarts it will repair the files when they are opened (this is the default DETS behavior).
More details on DETS repair can be found in the DETS documentation.
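If a file needs to be repaired explicitly, opening it with dets and forcing a repair is one option; the table name and file path below are placeholders, since the location of the Metadata Manager's DETS files depends on the node's data directory configuration.
%% Force a repair pass on a single DETS file (name and path are placeholders).
{ok, Ref} = dets:open_file(cluster_meta_file,
                           [{file, "/path/to/cluster_meta/file.dets"},
                            {repair, force}]),
ok = dets:close(Ref).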
Metadata Hashtrees are meant to be ephemeral, so they may be deleted without affecting a cluster. The node must first be shut down and then restarted. This ensures that the Metadata Hashtree process does not crash when the files are deleted and that the tree is rebuilt when the node starts back up.