TokuMX Monitoring Single Server
This document is part of a series on monitoring a TokuMX database installation.
This document lists a set of important metrics in TokuMX, explains their significance, and suggests some policies for interpreting and acting on their values. This document does not address the collection or reporting of those metrics beyond capturing them from the database itself; there are a number of tools and services that address this need. Additionally, this document is focused only on metrics important to a single server.
Most values below are found in TokuMX extensions to `serverStatus`, which can be accessed from the mongo shell as `db.serverStatus()`, or by running the `serverStatus` command with any MongoDB driver. In general, this is the best thing to watch (particularly the `ft` section, which contains all TokuMX extensions) for a basic understanding of how the system is performing.
In addition, unless otherwise specified, all values below are not persistent across server reboot. Most metrics in TokuMX are ephemeral counters that are zeroed on server startup. Most monitoring tools can interpret such counters by tracking their deltas and will understand a counter reset.
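As a sketch of that delta tracking, here is a minimal Python helper (the sample values are hypothetical) that computes the per-interval delta of such a counter and treats a decrease as a restart-induced reset:

```python
def counter_delta(prev, curr):
    """Delta between two samples of an ephemeral TokuMX counter.

    Counters are zeroed at server startup, so if the current sample is
    smaller than the previous one, the server restarted and the counter
    reset; in that case the current value is the whole delta.
    """
    return curr - prev if curr >= prev else curr

# Two hypothetical samples of serverStatus.ft.cachetable.miss.count:
print(counter_delta(1500, 1800))  # normal interval: 300
print(counter_delta(1800, 40))    # counter reset after a restart: 40
```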
TokuMX provides the same "opcounters" that MongoDB provides, and this gives the best high-level understanding of performance: raw throughput.
TokuMX also supports the same profiling framework and explain information that MongoDB provides, which are useful for diagnosing problematic queries.
In addition, TokuMX provides per-index counters of inserts, deletes, and queries, accessible through extensions in `db.collection.stats()` (or the `collStats` command from any driver):

- `collStats.indexDetails[i].{queries,inserts,deletes}`: The number of queries, inserts, and deletes that have touched the *i*th index.
- `collStats.indexDetails[i].{nscanned,nscannedObjects}`: The same `nscanned`/`nscannedObjects` information available through `explain()`, but aggregated over all queries that used the *i*th index.
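For instance, these per-index counters can flag indexes whose queries scan many entries per query returned. A minimal sketch; the stats document and its numbers below are hypothetical, shaped like the `indexDetails` extension described above:

```python
# Hypothetical db.collection.stats() output, shaped like the
# indexDetails extension described above.
coll_stats = {
    "indexDetails": [
        {"name": "_id_", "queries": 9200, "inserts": 1000, "deletes": 50,
         "nscanned": 9500, "nscannedObjects": 9300},
        {"name": "ts_1", "queries": 400, "inserts": 1000, "deletes": 50,
         "nscanned": 88000, "nscannedObjects": 87000},
    ],
}

# A high nscanned-per-query ratio suggests queries using that index scan
# far more entries than they return; worth a closer look with explain().
ratios = {idx["name"]: idx["nscanned"] / max(idx["queries"], 1)
          for idx in coll_stats["indexDetails"]}
for name, per_query in ratios.items():
    print(f"{name}: {per_query:.1f} entries scanned per query")
```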
For writes, the primary consumer of CPU time is usually compression. For in-memory queries it's usually tree searches, but for >RAM queries, decompression and deserialization can begin to impact performance. Serialization/deserialization and compression/decompression times are reported in `serverStatus`:

- `serverStatus.serializeTime.{leaf,nonleaf}.serialize`: The time (in seconds) spent serializing leaf and nonleaf nodes before writing them to disk (for a checkpoint, or when evicted while dirty).
- `serverStatus.serializeTime.{leaf,nonleaf}.compress`: The time (in seconds) spent compressing leaf and nonleaf nodes before writing them to disk (for a checkpoint, or when evicted while dirty).
- `serverStatus.serializeTime.{leaf,nonleaf}.decompress`: The time (in seconds) spent decompressing leaf and nonleaf nodes and their partitions after reading them off disk.
- `serverStatus.serializeTime.{leaf,nonleaf}.deserialize`: The time (in seconds) spent deserializing leaf and nonleaf nodes and their partitions after reading them off disk.
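One way to read these four counters together is to compare time on the write path (serialize + compress) against the read path (decompress + deserialize). A sketch over a hypothetical `serverStatus.serializeTime` sample:

```python
# Hypothetical serverStatus.serializeTime sample, in seconds.
serialize_time = {
    "leaf":    {"serialize": 12.0, "compress": 40.0,
                "decompress": 25.0, "deserialize": 8.0},
    "nonleaf": {"serialize": 3.0,  "compress": 9.0,
                "decompress": 6.0,  "deserialize": 2.0},
}

# Write path dominates on insert-heavy workloads; a large read path
# suggests a >RAM working set paying for decompression on misses.
write_path = sum(t["serialize"] + t["compress"] for t in serialize_time.values())
read_path = sum(t["decompress"] + t["deserialize"] for t in serialize_time.values())
print(f"write path (serialize+compress): {write_path:.1f} s")
print(f"read path (decompress+deserialize): {read_path:.1f} s")
```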
Unlike basic MongoDB, TokuMX manages the memory it uses directly. You should not rely on the `mem` section; instead, the `ft.cachetable` section describes TokuMX's memory usage in detail:

- `serverStatus.ft.cachetable.size.current`: The amount of resident memory TokuMX is using for the uncompressed data cache (the "cachetable").
- `serverStatus.ft.cachetable.size.limit`: The amount of memory TokuMX was configured to use (`--cacheSize`).
- `serverStatus.ft.cachetable.size.writing`: The amount of memory TokuMX is currently writing out in order to evict it and make more room in the cachetable. If `size.writing` is often high, your workload may be better served by more memory.
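A quick way to watch these three fields together is to express them as fractions of the configured limit; the sample numbers below are hypothetical:

```python
# Hypothetical serverStatus.ft.cachetable.size sample, in bytes.
size = {"current": 7_200_000_000, "limit": 8_000_000_000,
        "writing": 900_000_000}

# Persistent high "writing" share means eviction is constantly flushing
# dirty data to make room, a hint the working set exceeds --cacheSize.
utilization = 100 * size["current"] / size["limit"]
writing_share = 100 * size["writing"] / size["limit"]
print(f"cachetable utilization: {utilization:.0f}% of --cacheSize")
print(f"memory being written out for eviction: {writing_share:.2f}% of --cacheSize")
```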
In the following, it is important to note the distinction between "full" and "partial" cachetable operations. Index tree nodes in TokuMX have two parameters, `pageSize` and `readPageSize`, which can be set at collection and index creation time, on a per-index basis. All writes are done in `pageSize` chunks, but reads from disk can be done in `readPageSize` chunks, which are smaller (individual segments of the whole node). "Full" evictions and misses evict or page in an entire node, while "partial" evictions and misses evict or page in just one of a node's segments; partial operations are generally preferred because they are less expensive.
TokuMX tracks the work done to evict old data from the cachetable when it reaches its memory limit (`--cacheSize`) as "evictions". Clean evictions are usually cheap, but dirty evictions are much more expensive, as they require that data be written out before it can be released. If a page is clean, it can be partially evicted, which evicts only the less recently used parts of the page and does not cause a full miss later on. Furthermore, TokuMX distinguishes between leaves and nonleaves (internal tree nodes) when presenting eviction data. If too many nonleaf nodes are getting evicted, this is a sign that the workload could benefit from more memory.
- `serverStatus.ft.cachetable.evictions.{partial,full}.{leaf,nonleaf}.clean.count`: The number of partial and full evictions of clean leaf and nonleaf nodes.
- `serverStatus.ft.cachetable.evictions.{partial,full}.{leaf,nonleaf}.clean.bytes`: The number of bytes released as a result of the partial and full evictions of clean leaf and nonleaf nodes.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.count`: The number of full evictions of dirty leaf and nonleaf nodes. These are included in the counts of evictions of the same type.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.bytes`: The number of bytes released as a result of the full evictions of dirty leaf and nonleaf nodes. These are included in the sizes of evictions of the same type.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.time`: The time (in seconds) spent writing out full nodes for evictions of dirty leaf and nonleaf nodes.
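Since dirty evictions also report the time spent writing, you can derive an average cost per dirty eviction. A sketch with hypothetical numbers for full dirty leaf evictions:

```python
# Hypothetical serverStatus.ft.cachetable.evictions.full.leaf.dirty sample.
dirty = {"count": 2000, "bytes": 18_000_000_000, "time": 95.0}

# Average write time and bytes released per dirty leaf eviction; rising
# averages suggest eviction is doing heavier and heavier writes.
avg_write_ms = 1000 * dirty["time"] / dirty["count"]
avg_bytes = dirty["bytes"] / dirty["count"]
print(f"avg write time per dirty leaf eviction: {avg_write_ms:.1f} ms")
print(f"avg bytes released per dirty leaf eviction: {avg_bytes:,.0f}")
```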
Additionally, `ft.locktree` also has `size.current` and `size.limit` fields, which track the memory used for document-level locking. The locktree is configured to have a size equal to 10% of `--cacheSize` by default, but this can be changed with `--locktreeMaxMemory`. It is uncommon for the locktree to use large amounts of memory in most workloads.
TokuMX uses files on disk for three things:

- Collection data, including the primary key indexes and secondary indexes (`--dbpath`). Typically, a secondary index is considerably smaller than the primary key index; however, clustering secondary indexes are typically about the same size as the primary key index. Note that, just as in basic MongoDB, the oplog data is considered to be a collection.
- Transaction logging data, similar to basic MongoDB's journal files (`--logDir`).
- Temporary files used while building bulk indexes, for collections restored with `mongorestore` and for foreground `ensureIndex` operations (`--tmpDir`).
For collections, including the oplog, you can use `db.collection.stats()` (or the `collStats` command from any driver) to learn about their uncompressed and compressed (on-disk) sizes:

- `collStats.size`: An estimate of the sum of the sizes of the uncompressed BSON documents in the primary key index for the collection.
- `collStats.storageSize`: The size of the compressed file on disk for the primary key index for the collection.
- `collStats.totalIndexSize`: An estimate of the sum of the sizes of the uncompressed BSON data in all secondary indexes for the collection.
- `collStats.totalIndexStorageSize`: The sum of the sizes of the compressed files on disk for all the secondary indexes for the collection.
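From these four fields you can derive per-collection compression ratios (uncompressed size divided by on-disk size). A sketch with hypothetical numbers:

```python
# Hypothetical collStats output for one collection, in bytes.
coll_stats = {
    "size": 12_000_000_000,              # uncompressed BSON in the PK index
    "storageSize": 3_000_000_000,        # compressed PK index file on disk
    "totalIndexSize": 4_000_000_000,     # uncompressed secondary index data
    "totalIndexStorageSize": 1_000_000_000,
}

pk_ratio = coll_stats["size"] / coll_stats["storageSize"]
idx_ratio = coll_stats["totalIndexSize"] / coll_stats["totalIndexStorageSize"]
print(f"primary key compression: {pk_ratio:.1f}x")
print(f"secondary index compression: {idx_ratio:.1f}x")
```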
In addition, you can see a process-wide estimate of compression:

- `serverStatus.ft.compressionRatio.{leaf,nonleaf,overall}`: The compression ratio achieved for leaf nodes, for nonleaf nodes, and across both node types, over all data blocks written to disk since the server was last started.
Transaction log files are capped at about 100MB each. An old log file is removed once its data is no longer needed: that is, after all transactions represented in the log file have completed and a subsequent checkpoint has completed. Log files are also removed when performing a clean shutdown.
Temporary files for the bulk loader are compressed with quicklz if `--loaderCompressTmp` is set.
Each of the things TokuMX uses disk for has its own access pattern. The primary use of disk is reading and writing collection data, whose access patterns are driven by the workload. Pure insert workloads will not exhibit the random write I/O patterns of traditional B-tree databases, but random inserts into unique indexes, as well as update workloads, can induce random read I/O, since those operations contain implicit "hidden queries".
TokuMX tracks the disk reads it does as "cachetable misses" and reports them in `serverStatus`:

- `serverStatus.ft.cachetable.miss.count`: The total number of cachetable misses since the server last started.
- `serverStatus.ft.cachetable.miss.time`: The total time (in seconds) spent fetching data (including decompression) for cachetable misses, since the server last started.
- `serverStatus.ft.cachetable.miss.full.{count,time}`: The same information, restricted to full misses, which are typically more expensive than partial misses, but less common.
- `serverStatus.ft.cachetable.miss.partial.{count,time}`: The same information, restricted to partial misses.
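Dividing `time` by `count` gives an average miss latency, which is often more telling than raw counts. A sketch over a hypothetical sample:

```python
# Hypothetical serverStatus.ft.cachetable.miss sample (times in seconds).
miss = {
    "count": 125_000, "time": 250.0,
    "full":    {"count": 5_000,   "time": 75.0},
    "partial": {"count": 120_000, "time": 175.0},
}

# Full misses should be rarer but slower than partial misses; a jump in
# either average can indicate a saturated or degraded disk.
latency_ms = {kind: 1000 * miss[kind]["time"] / miss[kind]["count"]
              for kind in ("full", "partial")}
for kind, ms in latency_ms.items():
    print(f"avg {kind} miss latency: {ms:.2f} ms")
```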
The TokuMX transaction log normally receives many small sequential writes, and periodic fsyncs. TokuMX tracks both the bytes written to the log, and the number of and time spent doing fsyncs (across all files, though the transaction log is by far the most frequently synced file):

- `serverStatus.ft.log.count`: Number of individual writes to the log file.
- `serverStatus.ft.log.time`: Time (in seconds) spent doing writes to the log file.
- `serverStatus.ft.log.bytes`: Bytes written to the log file.
- `serverStatus.ft.fsync.count`: Number of fsync operations.
- `serverStatus.ft.fsync.time`: Time (in microseconds) spent doing fsync operations.
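Note that the units differ: `log.time` is in seconds but `fsync.time` is in microseconds. A sketch deriving average log write size and fsync latency from hypothetical numbers:

```python
# Hypothetical serverStatus.ft sample (log.time in seconds,
# fsync.time in microseconds, per the fields described above).
ft = {
    "log":   {"count": 2_000_000, "time": 80.0, "bytes": 6_000_000_000},
    "fsync": {"count": 150_000, "time": 450_000_000},
}

avg_write = ft["log"]["bytes"] / ft["log"]["count"]
avg_fsync_ms = ft["fsync"]["time"] / ft["fsync"]["count"] / 1000  # µs -> ms
print(f"avg log write size: {avg_write:.0f} bytes")
print(f"avg fsync latency: {avg_fsync_ms:.1f} ms")
```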
TokuMX's main data structure, the Fractal Tree, usually writes most user data to disk during checkpoints, which are triggered once every 60 seconds by default (`--checkpointPeriod`, similar to `--syncdelay` or `storage.syncPeriodSecs` in MongoDB 2.6.0). During a checkpoint, all tree data that has changed is written durably to disk, which allows the system to trim old data from the tail of the transaction log. The system reports timing information about checkpoints in `serverStatus`:
- `serverStatus.ft.checkpoint.count`: Number of completed checkpoints.
- `serverStatus.ft.checkpoint.time`: Time (in seconds) spent doing checkpoints.
- `serverStatus.ft.checkpoint.lastComplete`: Begin and end timestamps for, and the time spent during, the last complete checkpoint.
A checkpoint is triggered 60 seconds (`--checkpointPeriod`) after the previous checkpoint was triggered, or immediately after the last checkpoint if it took longer than 60 seconds. For example, if every checkpoint takes 6 seconds, there should be 54 seconds between checkpoints, and `serverStatus.ft.checkpoint.time` should be about 10% of the total system uptime. Extremely long checkpoints can cause a system to back up over time; if checkpoints are taking too long, it may mean that your system needs more I/O bandwidth for node writes and/or more CPU power for compression. Disk writes for checkpoint are tracked in `serverStatus.ft.checkpoint.write`:

- `serverStatus.ft.checkpoint.write.{leaf,nonleaf}.{count,time}`: Number of disk writes of leaf and nonleaf nodes, and the time (in seconds) spent doing those writes.
- `serverStatus.ft.checkpoint.write.{leaf,nonleaf}.bytes.{uncompressed,compressed}`: Uncompressed and compressed sizes of leaf and nonleaf nodes written for checkpoint.
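The ~10% duty-cycle rule of thumb above can be checked directly from `serverStatus` samples. A sketch with hypothetical numbers:

```python
# Hypothetical serverStatus sample: one day of uptime (seconds) and the
# ft.checkpoint totals described above.
uptime = 86_400.0
checkpoint = {"count": 1_400, "time": 9_100.0}

# With the default 60s --checkpointPeriod, a duty cycle well above the
# average-checkpoint-length / 60s ratio suggests checkpoints are backing up.
duty_cycle = 100 * checkpoint["time"] / uptime
avg_len = checkpoint["time"] / checkpoint["count"]
print(f"checkpoint duty cycle: {duty_cycle:.1f}% of uptime")
print(f"avg checkpoint length: {avg_len:.1f} s")
```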
TokuMX also tracks some anomalous events, which will appear in `serverStatus` if any such events are detected:

- `serverStatus.ft.alerts.longWaitEvents.logBufferWait`: Number of times a writing client had to wait more than 100ms for access to the log buffer.
- `serverStatus.ft.alerts.longWaitEvents.fsync.{count,time}`: The same information as `serverStatus.ft.fsync.{count,time}`, but only for fsync operations that took more than 1 second.
- `serverStatus.ft.alerts.longWaitEvents.cachePressure.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second for evictions to create space in the cachetable so it could page in data it needed.
- `serverStatus.ft.alerts.longWaitEvents.locktreeWait.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second to acquire a document-level lock in the locktree.
- `serverStatus.ft.alerts.longWaitEvents.locktreeWaitEscalation.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second to acquire a document-level lock because the locktree was at its memory limit (`--locktreeMaxMemory`) and needed to run escalation.