TokuMX Monitoring Single Server
This document is part of a series on monitoring a TokuMX database installation.
This document lists a set of important metrics in TokuMX, explains their significance, and suggests some policies for interpreting and acting on their values. This document does not address the collection or reporting of those metrics beyond capturing them from the database itself; there are a number of tools and services that address this need. Additionally, this document is focused only on metrics important to a single server.
Most values below are found in TokuMX extensions to `serverStatus`, which can be accessed from the mongo shell as `db.serverStatus()`, or by running the `serverStatus` command with any MongoDB driver. In general, this is the best thing to watch (particularly the `ft` section, which contains all TokuMX extensions) for a basic understanding of how the system is performing.
In addition, unless otherwise specified, all values below are not persistent across server reboot. Most metrics in TokuMX are ephemeral counters that are zeroed on server startup. Most monitoring tools can interpret such counters by tracking their deltas and will understand a counter reset.
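As a sketch of that delta tracking, here is a minimal Python helper (the sample values are hypothetical) that computes the per-interval delta of such a counter and treats a decrease as a restart-induced reset:

```python
def counter_delta(prev, curr):
    """Delta between two samples of an ephemeral TokuMX counter.

    Counters are zeroed at server startup, so if the current sample is
    smaller than the previous one, the server restarted and the counter
    reset; in that case the current value is the whole delta.
    """
    return curr - prev if curr >= prev else curr

# Two hypothetical samples of serverStatus.ft.cachetable.miss.count:
print(counter_delta(1500, 1800))  # normal interval: 300
print(counter_delta(1800, 40))    # counter reset after a restart: 40
```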
TokuMX provides the same "opcounters" that MongoDB provides, and this gives the best high-level understanding of performance: raw throughput.
TokuMX also supports the same profiling framework and explain information that MongoDB provides, which are useful for diagnosing problematic queries.
In addition, TokuMX provides per-index counters of inserts, deletes, and queries, accessible through extensions in `db.collection.stats()` (or the `collStats` command from any driver):

- `collStats.indexDetails[i].{queries,inserts,deletes}`: The number of queries, inserts, and deletes that have touched the *i*th index.
- `collStats.indexDetails[i].{nscanned,nscannedObjects}`: The same `nscanned`/`nscannedObjects` information available through `explain()`, but aggregated over all queries that used the *i*th index.
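For instance, these per-index counters can flag indexes whose queries scan many entries per query returned. A minimal sketch; the stats document and its numbers below are hypothetical, shaped like the `indexDetails` extension described above:

```python
# Hypothetical db.collection.stats() output, shaped like the
# indexDetails extension described above.
coll_stats = {
    "indexDetails": [
        {"name": "_id_", "queries": 9200, "inserts": 1000, "deletes": 50,
         "nscanned": 9500, "nscannedObjects": 9300},
        {"name": "ts_1", "queries": 400, "inserts": 1000, "deletes": 50,
         "nscanned": 88000, "nscannedObjects": 87000},
    ],
}

# A high nscanned-per-query ratio suggests queries using that index scan
# far more entries than they return; worth a closer look with explain().
ratios = {idx["name"]: idx["nscanned"] / max(idx["queries"], 1)
          for idx in coll_stats["indexDetails"]}
for name, per_query in ratios.items():
    print(f"{name}: {per_query:.1f} entries scanned per query")
```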
For writes, the primary consumer of CPU time is usually compression. For in-memory queries it's usually tree searches, but for >RAM queries, decompression and deserialization can begin to impact performance. Serialization/deserialization and compression/decompression times are reported in `serverStatus`:

- `serverStatus.serializeTime.{leaf,nonleaf}.serialize`: The time (in seconds) spent serializing leaf and nonleaf nodes before writing them to disk (for a checkpoint, or when evicted while dirty).
- `serverStatus.serializeTime.{leaf,nonleaf}.compress`: The time (in seconds) spent compressing leaf and nonleaf nodes before writing them to disk (for a checkpoint, or when evicted while dirty).
- `serverStatus.serializeTime.{leaf,nonleaf}.decompress`: The time (in seconds) spent decompressing leaf and nonleaf nodes and their partitions after reading them off disk.
- `serverStatus.serializeTime.{leaf,nonleaf}.deserialize`: The time (in seconds) spent deserializing leaf and nonleaf nodes and their partitions after reading them off disk.
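One way to read these four counters together is to compare time on the write path (serialize + compress) against the read path (decompress + deserialize). A sketch over a hypothetical `serverStatus.serializeTime` sample:

```python
# Hypothetical serverStatus.serializeTime sample, in seconds.
serialize_time = {
    "leaf":    {"serialize": 12.0, "compress": 40.0,
                "decompress": 25.0, "deserialize": 8.0},
    "nonleaf": {"serialize": 3.0,  "compress": 9.0,
                "decompress": 6.0,  "deserialize": 2.0},
}

# Write path dominates on insert-heavy workloads; a large read path
# suggests a >RAM working set paying for decompression on misses.
write_path = sum(t["serialize"] + t["compress"] for t in serialize_time.values())
read_path = sum(t["decompress"] + t["deserialize"] for t in serialize_time.values())
print(f"write path (serialize+compress): {write_path:.1f} s")
print(f"read path (decompress+deserialize): {read_path:.1f} s")
```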
Unlike basic MongoDB, TokuMX manages the memory it uses directly. You should not rely on the `mem` section; instead, the `ft.cachetable` section describes TokuMX's memory usage in detail:

- `serverStatus.ft.cachetable.size.current`: The amount of resident memory TokuMX is using for the uncompressed data cache (the "cachetable").
- `serverStatus.ft.cachetable.size.limit`: The amount of memory TokuMX was configured to use (`--cacheSize`).
- `serverStatus.ft.cachetable.size.writing`: The amount of memory TokuMX is currently writing out in order to evict it and make more room in the cachetable. If `size.writing` is often high, your workload may be better served by more memory.
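A quick way to watch these three fields together is to express them as fractions of the configured limit; the sample numbers below are hypothetical:

```python
# Hypothetical serverStatus.ft.cachetable.size sample, in bytes.
size = {"current": 7_200_000_000, "limit": 8_000_000_000,
        "writing": 900_000_000}

# Persistent high "writing" share means eviction is constantly flushing
# dirty data to make room, a hint the working set exceeds --cacheSize.
utilization = 100 * size["current"] / size["limit"]
writing_share = 100 * size["writing"] / size["limit"]
print(f"cachetable utilization: {utilization:.0f}% of --cacheSize")
print(f"memory being written out for eviction: {writing_share:.2f}% of --cacheSize")
```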
In the following, it is important to note the distinction between "full" and "partial" cachetable operations. Index tree nodes in TokuMX have two parameters, `pageSize` and `readPageSize`, which can be set at collection and index creation time, on a per-index basis. All writes are done in `pageSize` chunks, but reads from disk can be done in `readPageSize` chunks, which are smaller (individual segments of the whole node). "Full" evictions and misses evict or page in an entire node, while "partial" evictions and misses evict or page in just one of a node's segments; partial operations are generally preferred because they are less expensive.
TokuMX tracks the work done to evict old data from the cachetable when it reaches its memory limit (`--cacheSize`) as "evictions". Clean evictions are usually cheap, but dirty evictions are much more expensive, as they require that data be written out before it can be released. If a page is clean, it can be partially evicted, which evicts only the less recently used parts of the page and does not cause a full miss later on. Furthermore, TokuMX distinguishes between leaves and nonleaves (internal tree nodes) when presenting eviction data. If too many nonleaf nodes are getting evicted, this is a sign that the workload could benefit from more memory.
- `serverStatus.ft.cachetable.evictions.{partial,full}.{leaf,nonleaf}.clean.count`: The number of partial and full evictions of clean leaf and nonleaf nodes.
- `serverStatus.ft.cachetable.evictions.{partial,full}.{leaf,nonleaf}.clean.bytes`: The number of bytes released as a result of the partial and full evictions of clean leaf and nonleaf nodes.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.count`: The number of full evictions of dirty leaf and nonleaf nodes. These are included in the counts of evictions of the same type.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.bytes`: The number of bytes released as a result of the full evictions of dirty leaf and nonleaf nodes. These are included in the sizes of evictions of the same type.
- `serverStatus.ft.cachetable.evictions.full.{leaf,nonleaf}.dirty.time`: The time (in seconds) spent writing out full nodes for evictions of dirty leaf and nonleaf nodes.
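Since dirty evictions also report the time spent writing, you can derive an average cost per dirty eviction. A sketch with hypothetical numbers for full dirty leaf evictions:

```python
# Hypothetical serverStatus.ft.cachetable.evictions.full.leaf.dirty sample.
dirty = {"count": 2000, "bytes": 18_000_000_000, "time": 95.0}

# Average write time and bytes released per dirty leaf eviction; rising
# averages suggest eviction is doing heavier and heavier writes.
avg_write_ms = 1000 * dirty["time"] / dirty["count"]
avg_bytes = dirty["bytes"] / dirty["count"]
print(f"avg write time per dirty leaf eviction: {avg_write_ms:.1f} ms")
print(f"avg bytes released per dirty leaf eviction: {avg_bytes:,.0f}")
```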
Additionally, `ft.locktree` also has `size.current` and `size.limit` fields, which track the memory used for document-level locking. The locktree is configured to have a size equal to 10% of `--cacheSize` by default, but this can be changed with `--locktreeMaxMemory`. It is uncommon for the locktree to use large amounts of memory in most workloads.
TokuMX uses files on disk for three things:

- Collection data, including the primary key indexes and secondary indexes (`--dbpath`). Typically, a secondary index is considerably smaller than the primary key index; however, clustering secondary indexes are typically about the same size as the primary key index. Note that, just as in basic MongoDB, the oplog data is considered to be a collection.
- Transaction logging data, similar to basic MongoDB's journal files (`--logDir`).
- Temporary files used while building bulk indexes, for collections restored with `mongorestore` and for foreground `ensureIndex` operations (`--tmpDir`).
For collections, including the oplog, you can use `db.collection.stats()` (or the `collStats` command from any driver) to learn about their uncompressed and compressed (on-disk) sizes:

- `collStats.size`: An estimate of the sum of the sizes of the uncompressed BSON documents in the primary key index for the collection.
- `collStats.storageSize`: The size of the compressed file on disk for the primary key index for the collection.
- `collStats.totalIndexSize`: An estimate of the sum of the sizes of the uncompressed BSON data in all secondary indexes for the collection.
- `collStats.totalIndexStorageSize`: The sum of the sizes of the compressed files on disk for all the secondary indexes for the collection.
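From these four fields you can derive per-collection compression ratios (uncompressed size divided by on-disk size). A sketch with hypothetical numbers:

```python
# Hypothetical collStats output for one collection, in bytes.
coll_stats = {
    "size": 12_000_000_000,              # uncompressed BSON in the PK index
    "storageSize": 3_000_000_000,        # compressed PK index file on disk
    "totalIndexSize": 4_000_000_000,     # uncompressed secondary index data
    "totalIndexStorageSize": 1_000_000_000,
}

pk_ratio = coll_stats["size"] / coll_stats["storageSize"]
idx_ratio = coll_stats["totalIndexSize"] / coll_stats["totalIndexStorageSize"]
print(f"primary key compression: {pk_ratio:.1f}x")
print(f"secondary index compression: {idx_ratio:.1f}x")
```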
In addition, you can see a process-wide estimate of compression:

- `serverStatus.ft.compressionRatio.{leaf,nonleaf,overall}`: The compression ratio achieved for leaf nodes, for nonleaf nodes, and across both node types, over all data blocks written to disk since the server was last started.
Transaction log files are capped at about 100MB each. An old log file is removed once its data is no longer needed: that is, after all transactions represented in the log file have completed and a subsequent checkpoint has completed. Log files are also removed when performing a clean shutdown.
Temporary files for the bulk loader are compressed with quicklz if `--loaderCompressTmp` is set.
Each of the things TokuMX uses disk for has its own access pattern. The primary use of disk is reading and writing collection data, whose access patterns are driven by the workload. Pure insert workloads will not exhibit the random write I/O patterns of traditional B-tree databases, but random inserts into unique indexes, as well as update workloads, can induce random read I/O, since those operations contain implicit "hidden queries".
TokuMX tracks the disk reads it does as "cachetable misses" and reports them in `serverStatus`:

- `serverStatus.ft.cachetable.miss.count`: The total number of cachetable misses since the server last started.
- `serverStatus.ft.cachetable.miss.time`: The total time (in seconds) spent fetching data (including decompression) for cachetable misses, since the server last started.
- `serverStatus.ft.cachetable.miss.full.{count,time}`: The same information, restricted to full misses, which are typically more expensive than partial misses, but less common.
- `serverStatus.ft.cachetable.miss.partial.{count,time}`: The same information, restricted to partial misses.
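Dividing `time` by `count` gives an average miss latency, which is often more telling than raw counts. A sketch over a hypothetical sample:

```python
# Hypothetical serverStatus.ft.cachetable.miss sample (times in seconds).
miss = {
    "count": 125_000, "time": 250.0,
    "full":    {"count": 5_000,   "time": 75.0},
    "partial": {"count": 120_000, "time": 175.0},
}

# Full misses should be rarer but slower than partial misses; a jump in
# either average can indicate a saturated or degraded disk.
latency_ms = {kind: 1000 * miss[kind]["time"] / miss[kind]["count"]
              for kind in ("full", "partial")}
for kind, ms in latency_ms.items():
    print(f"avg {kind} miss latency: {ms:.2f} ms")
```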
The TokuMX transaction log normally receives many small sequential writes, and periodic fsyncs. TokuMX tracks both the bytes written to the log, and the number of and time spent doing fsyncs (across all files, though the transaction log is by far the most frequently synced file):

- `serverStatus.ft.log.count`: Number of individual writes to the log file.
- `serverStatus.ft.log.time`: Time (in seconds) spent doing writes to the log file.
- `serverStatus.ft.log.bytes`: Bytes written to the log file.
- `serverStatus.ft.fsync.count`: Number of fsync operations.
- `serverStatus.ft.fsync.time`: Time (in microseconds) spent doing fsync operations.
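Note that the units differ: `log.time` is in seconds but `fsync.time` is in microseconds. A sketch deriving average log write size and fsync latency from hypothetical numbers:

```python
# Hypothetical serverStatus.ft sample (log.time in seconds,
# fsync.time in microseconds, per the fields described above).
ft = {
    "log":   {"count": 2_000_000, "time": 80.0, "bytes": 6_000_000_000},
    "fsync": {"count": 150_000, "time": 450_000_000},
}

avg_write = ft["log"]["bytes"] / ft["log"]["count"]
avg_fsync_ms = ft["fsync"]["time"] / ft["fsync"]["count"] / 1000  # µs -> ms
print(f"avg log write size: {avg_write:.0f} bytes")
print(f"avg fsync latency: {avg_fsync_ms:.1f} ms")
```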
TokuMX's main data structure, the Fractal Tree, usually writes most user data to disk during checkpoints, which are triggered once every 60 seconds by default (`--checkpointPeriod`, similar to `--syncdelay` or `storage.syncPeriodSecs` in MongoDB 2.6.0). During a checkpoint, all tree data that has changed is written durably to disk, which allows the system to trim old data from the tail of the transaction log. The system reports timing information about checkpoints in `serverStatus`:
- `serverStatus.ft.checkpoint.count`: Number of completed checkpoints.
- `serverStatus.ft.checkpoint.time`: Time (in seconds) spent doing checkpoints.
- `serverStatus.ft.checkpoint.lastComplete`: Begin and end timestamps for, and the time spent during, the last complete checkpoint.
A checkpoint is triggered 60 seconds (`--checkpointPeriod`) after the previous checkpoint was triggered, or immediately after the last checkpoint if it took longer than 60 seconds. For example, if every checkpoint takes 6 seconds, there should be 54 seconds between checkpoints, and `serverStatus.ft.checkpoint.time` should be about 10% of the total system uptime. Extremely long checkpoints can cause a system to back up over time; if checkpoints are taking too long, it may mean that your system needs more I/O bandwidth for node writes and/or more CPU power for compression. Disk writes for checkpoint are tracked in `serverStatus.ft.checkpoint.write`:

- `serverStatus.ft.checkpoint.write.{leaf,nonleaf}.{count,time}`: Number of disk writes of leaf and nonleaf nodes, and the time (in seconds) spent doing those writes.
- `serverStatus.ft.checkpoint.write.{leaf,nonleaf}.bytes.{uncompressed,compressed}`: Uncompressed and compressed sizes of leaf and nonleaf nodes written for checkpoint.
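The ~10% duty-cycle rule of thumb above can be checked directly from `serverStatus` samples. A sketch with hypothetical numbers:

```python
# Hypothetical serverStatus sample: one day of uptime (seconds) and the
# ft.checkpoint totals described above.
uptime = 86_400.0
checkpoint = {"count": 1_400, "time": 9_100.0}

# With the default 60s --checkpointPeriod, a duty cycle well above the
# average-checkpoint-length / 60s ratio suggests checkpoints are backing up.
duty_cycle = 100 * checkpoint["time"] / uptime
avg_len = checkpoint["time"] / checkpoint["count"]
print(f"checkpoint duty cycle: {duty_cycle:.1f}% of uptime")
print(f"avg checkpoint length: {avg_len:.1f} s")
```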
TokuMX also tracks some anomalous events, which will appear in `serverStatus` if any such events are detected:

- `serverStatus.ft.alerts.longWaitEvents.logBufferWait`: Number of times a writing client had to wait more than 100ms for access to the log buffer.
- `serverStatus.ft.alerts.longWaitEvents.fsync.{count,time}`: The same information as `serverStatus.ft.fsync.{count,time}`, but only for fsync operations that took more than 1 second.
- `serverStatus.ft.alerts.longWaitEvents.cachePressure.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second for evictions to create space in the cachetable so it could page in data it needed.
- `serverStatus.ft.alerts.longWaitEvents.locktreeWait.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second to acquire a document-level lock in the locktree.
- `serverStatus.ft.alerts.longWaitEvents.locktreeWaitEscalation.{count,time}`: The number of times, and the time spent (in microseconds), that a thread had to wait more than 1 second to acquire a document-level lock because the locktree was at its memory limit (`--locktreeMaxMemory`) and needed to run escalation.