2.1.0
Grafana Labs is excited to announce version 2.1 of Grafana Mimir, the most scalable, most performant open source time series database in the world.
Below we highlight the top features, enhancements and bugfixes in this release, as well as relevant callouts for those upgrading from Grafana Mimir 2.0. The complete list of changes is recorded in the Changelog.
Features and enhancements
- Mimir on ARM: We now publish Docker images for both amd64 and arm64, making it easier for those on ARM-based machines to develop and run Mimir. Multiplatform images are available from the Mimir Docker registry. Note that our existing integration test suite only uses the amd64 images, which means we cannot make any functional or performance guarantees about the arm64 images.
- Remote ruler mode for improved rule evaluation performance: We've added a remote mode for the Grafana Mimir ruler, in which the ruler delegates rule evaluation to the query-frontend rather than evaluating rules directly within the ruler process itself. This allows recording and alerting rules to benefit from the query parallelization techniques implemented in the query-frontend (like query sharding). Remote mode is considered experimental and is off by default. To enable it, see remote ruler.
- Per-tenant custom trackers for monitoring cardinality: In Grafana Mimir 2.0, we introduced a custom tracker feature that allows you to track the count of active series over time that match a specific label matcher. In Grafana Mimir 2.1, we've made it possible to configure custom trackers via the runtime configuration file. This means you can now define different trackers for each tenant in your cluster and modify those trackers without an ingester restart.
- Reduce cardinality of Grafana Mimir's /metrics endpoint: While Grafana Mimir does a good job of exposing a relatively small number of series about its own state, this number can tick up when running Grafana Mimir clusters with high tenant counts or high active series counts. To reduce this number (and the accompanying cost of scraping and storing these time series), we made several optimizations which decreased the series count on the /metrics endpoint by more than 10%.
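As an illustration of the per-tenant custom trackers described above, a runtime configuration file might look roughly like the following. This is a minimal sketch, not a definitive reference: the tenant IDs, tracker names and label matchers are hypothetical, and the key name active_series_custom_trackers under overrides is an assumption that is not stated in these release notes, so check it against the configuration reference for your version.

```yaml
# Runtime configuration file (sketch).
# Tenant IDs, tracker names and matchers below are hypothetical examples.
# The key name `active_series_custom_trackers` is an assumption.
overrides:
  tenant-a:
    active_series_custom_trackers:
      # tracker name: label matcher whose active series are counted
      dev: '{namespace=~"dev-.*"}'
      prod: '{namespace=~"prod-.*"}'
  tenant-b:
    active_series_custom_trackers:
      checkout: '{service="checkout"}'
```

Because the runtime configuration file is reloaded while Mimir is running, editing a tenant's trackers here should not require an ingester restart.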
Upgrade considerations
We've updated the default values for 2 parameters in Grafana Mimir to give users better out-of-the-box performance:
- We've changed the default for -blocks-storage.tsdb.isolation-enabled from true to false. We've marked this flag as deprecated and will remove it completely in 2 releases. TSDB isolation is a feature inherited from Prometheus that didn't provide any benefit given Grafana Mimir's distributed architecture, and in our 1 billion series load test we found it actually hurt performance. Disabling it reduced our ingester 99th percentile latency by 90%.
- The store-gateway attributes cache is now enabled by default (achieved by updating the default for -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items from 0 to 50000). This in-memory cache makes it faster to look up object attributes for chunk data. We've been running this optional cache internally for a while and, upon a recent configuration audit, realized it made sense to do the same for all users. The increase in store-gateway memory utilization from enabling this cache is negligible and easily justified given the performance gains.
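If you need to keep the pre-2.1 behavior while you validate the upgrade, the two defaults above can be pinned explicitly. The sketch below assumes that the YAML option names mirror the CLI flag names (dashes replaced by underscores, dots by nesting); that mapping is an assumption here and should be confirmed against the configuration reference. Keep in mind that the isolation flag is deprecated and scheduled for removal.

```yaml
# Sketch: restore the Grafana Mimir 2.0 defaults for the two changed parameters.
# YAML key names are assumed to mirror the CLI flags named in these notes.
blocks_storage:
  tsdb:
    # 2.1 default is false; the option is deprecated and will be removed.
    isolation_enabled: true
  bucket_store:
    chunks_cache:
      # 2.1 default is 50000; 0 disables the in-memory attributes cache.
      attributes_in_memory_max_items: 0
```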
Bug fixes
2.1.0 bug fixes
- PR 1704: Fixed a bug that previously caused Grafana Mimir to crash on startup when trying to run in monolithic mode with the results cache enabled due to duplicate metric names.
- PR 1835: Fixed a bug that caused Grafana Mimir to crash when an invalid Alertmanager configuration was set even though the Alertmanager component was disabled. After this fix, the Alertmanager configuration is only validated if the Alertmanager component is loaded.
- PR 1836: The ability to run Alertmanager with local storage broke in Grafana Mimir 2.0 when we removed the ability to run the Alertmanager without sharding. With this bugfix, we've made it possible to again run Alertmanager with local storage. However, for production use, we still recommend using an external store, since this is needed to persist Alertmanager state (e.g. silences) between replicas.
- PR 1715: Restored Grafana Mimir's ability to use CNAME DNS records to reach memcached servers. The bug was inherited from an upstream change to Thanos; we contributed a fix to Thanos and subsequently updated our Thanos version.
CHANGELOG
Grafana Mimir
- [CHANGE] Compactor: No longer upload debug meta files to object storage. #1257
- [CHANGE] Default values have changed for the following settings: #1547
-alertmanager.alertmanager-client.grpc-max-recv-msg-size now defaults to 100 MiB (previously was not configurable and set to 16 MiB)
-alertmanager.alertmanager-client.grpc-max-send-msg-size now defaults to 100 MiB (previously was not configurable and set to 4 MiB)
-alertmanager.max-recv-msg-size now defaults to 100 MiB (previously was 16 MiB)
- [CHANGE] Ingester: Add user label to metrics cortex_ingester_ingested_samples_total and cortex_ingester_ingested_samples_failures_total. #1533
- [CHANGE] Ingester: Changed -blocks-storage.tsdb.isolation-enabled default from true to false. The config option has also been deprecated and will be removed in 2 minor versions. #1655
- [CHANGE] Query-frontend: results cache keys are now versioned. This will cause the cache to be re-filled when rolling out this version. #1631
- [CHANGE] Store-gateway: enabled attributes in-memory cache by default. New default configuration is -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items=50000. #1727
- [CHANGE] Compactor: Removed the metric cortex_compactor_garbage_collected_blocks_total since it duplicates cortex_compactor_blocks_marked_for_deletion_total. #1728
- [CHANGE] All: Logs that used the org_id label now use the user label. #1634 #1758
- [CHANGE] Alertmanager: the following metrics are not exported for a given user and integration when the metric value is zero: #1783
cortex_alertmanager_notifications_total
cortex_alertmanager_notifications_failed_total
cortex_alertmanager_notification_requests_total
cortex_alertmanager_notification_requests_failed_total
cortex_alertmanager_notification_rate_limited_total
- [CHANGE] Removed the following metrics exposed by the Mimir hash rings: #1791
cortex_member_ring_tokens_owned
cortex_member_ring_tokens_to_own
cortex_ring_tokens_owned
cortex_ring_member_ownership_percent
- [CHANGE] Querier / Ruler: removed the following metrics tracking the number of query requests sent to each ingester. You can use cortex_request_duration_seconds_count{route=~"/cortex.Ingester/(QueryStream|QueryExemplars)"} instead. #1797
cortex_distributor_ingester_queries_total
cortex_distributor_ingester_query_failures_total
- [CHANGE] Distributor: removed the following metrics tracking the number of requests from a distributor to ingesters: #1799
cortex_distributor_ingester_appends_total
cortex_distributor_ingester_append_failures_total
- [CHANGE] Distributor / Ruler: deprecated -distributor.extend-writes. Now Mimir always behaves as if this setting was set to false, which we expect to be safe for every Mimir cluster setup. #1856
- [FEATURE] Querier: Added support for streaming remote read. Note that the benefits of chunking the response are partial here, since in a typical query-frontend setup responses will be buffered until they've been completed. #1735
- [FEATURE] Ruler: Allow setting evaluation_delay for each rule group via the rules group configuration file. #1474
- [FEATURE] Ruler: Added support for expression remote evaluation. #1536 #1818
- The following CLI flags (and their respective YAML config options) have been added:
-ruler.query-frontend.address
-ruler.query-frontend.grpc-client-config.grpc-max-recv-msg-size
-ruler.query-frontend.grpc-client-config.grpc-max-send-msg-size
-ruler.query-frontend.grpc-client-config.grpc-compression
-ruler.query-frontend.grpc-client-config.grpc-client-rate-limit
-ruler.query-frontend.grpc-client-config.grpc-client-rate-limit-burst
-ruler.query-frontend.grpc-client-config.backoff-on-ratelimits
-ruler.query-frontend.grpc-client-config.backoff-min-period
-ruler.query-frontend.grpc-client-config.backoff-max-period
-ruler.query-frontend.grpc-client-config.backoff-retries
-ruler.query-frontend.grpc-client-config.tls-enabled
-ruler.query-frontend.grpc-client-config.tls-ca-path
-ruler.query-frontend.grpc-client-config.tls-cert-path
-ruler.query-frontend.grpc-client-config.tls-key-path
-ruler.query-frontend.grpc-client-config.tls-server-name
-ruler.query-frontend.grpc-client-config.tls-insecure-skip-verify
- [FEATURE] Distributor: Added the ability to forward specific metrics to alternative remote_write API endpoints. #1052
- [FEATURE] Ingester: Active series custom trackers now support runtime tenant-specific overrides. The configuration has been moved to the limit config; the ingester config has been deprecated. #1188
- [ENHANCEMENT] Alertmanager API: Concurrency limit for GET requests is now configurable using -alertmanager.max-concurrent-get-requests-per-tenant. #1547
- [ENHANCEMENT] Alertmanager: Added the ability to configure additional gRPC client settings for the Alertmanager distributor: #1547
-alertmanager.alertmanager-client.backoff-max-period
-alertmanager.alertmanager-client.backoff-min-period
-alertmanager.alertmanager-client.backoff-on-ratelimits
-alertmanager.alertmanager-client.backoff-retries
-alertmanager.alertmanager-client.grpc-client-rate-limit
-alertmanager.alertmanager-client.grpc-client-rate-limit-burst
-alertmanager.alertmanager-client.grpc-compression
-alertmanager.alertmanager-client.grpc-max-recv-msg-size
-alertmanager.alertmanager-client.grpc-max-send-msg-size
- [ENHANCEMENT] Ruler: Add more detailed query information to ruler query stats logging. #1411
- [ENHANCEMENT] Admin: Admin API now has some styling. #1482 #1549 #1821 #1824
- [ENHANCEMENT] Alertmanager: added insight=true field to alertmanager dispatch logs. #1379
- [ENHANCEMENT] Store-gateway: Add the experimental ability to run index header operations in a dedicated thread pool. This feature can be configured using -blocks-storage.bucket-store.index-header-thread-pool-size and is disabled by default. #1660
- [ENHANCEMENT] Store-gateway: don't drop all blocks if instance finds itself as unhealthy or missing in the ring. #1806 #1823
- [ENHANCEMENT] Querier: wait until inflight queries are completed when shutting down queriers. #1756 #1767
- [BUGFIX] Query-frontend: do not shard queries with a subquery unless the subquery is inside a shardable aggregation function call. #1542
- [BUGFIX] Query-frontend: added component=query-frontend label to results cache memcached metrics to fix a panic when Mimir is running in single binary mode and results cache is enabled. #1704
- [BUGFIX] Mimir: services' status content-type is now correctly set to text/html. #1575
- [BUGFIX] Multikv: Fix panic when using runtime config to set primary KV store used by multi KV. #1587
- [BUGFIX] Multikv: Fix watching for runtime config changes in multi KV store in ruler and querier. #1665
- [BUGFIX] Memcached: allow the use of CNAME DNS records for the memcached backend addresses. #1654
- [BUGFIX] Querier: fixed temporary partial query results when shuffle sharding is enabled and hash ring backend storage is flushed / reset. #1829
- [BUGFIX] Alertmanager: prevent more file traversal cases related to template names. #1833
- [BUGFIX] Alertmanager: Allow usage with -alertmanager-storage.backend=local. Note that when using this storage type, the Alertmanager is not able to persist state remotely, so it is not recommended for production use. #1836
- [BUGFIX] Alertmanager: Do not validate alertmanager configuration if it's not running. #1835
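To make the remote evaluation flags listed in the changelog above more concrete, the fragment below sketches how the ruler might be pointed at a query-frontend. It assumes the YAML options mirror the -ruler.query-frontend.* flag names; the hostname and port are hypothetical placeholders, and the dns:/// prefix is an assumption about the expected address format. Only a subset of the gRPC client options is shown.

```yaml
# Sketch: ruler remote evaluation settings (experimental feature).
# YAML key names are assumed to mirror the -ruler.query-frontend.* flags.
ruler:
  query_frontend:
    # Hypothetical address; replace with your own query-frontend gRPC endpoint.
    address: dns:///query-frontend.example.svc.cluster.local:9095
    grpc_client_config:
      # Optional TLS settings, shown with illustrative values.
      tls_enabled: false
```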
Mixin
- [CHANGE] Dashboards: Remove per-user series legends from Tenants dashboard. #1605
- [CHANGE] Dashboards: Show in-memory series and the per-user series limit on Tenants dashboard. #1613
- [CHANGE] Dashboards: Slow-queries dashboard now uses user label from logs instead of org_id. #1634
- [CHANGE] Dashboards: changed all Grafana dashboards UIDs to not conflict with Cortex ones, to let people install both while migrating from Cortex to Mimir: #1801 #1808
- Alertmanager from a76bee5913c97c918d9e56a3cc88cc28 to b0d38d318bbddd80476246d4930f9e55
- Alertmanager Resources from 68b66aed90ccab448009089544a8d6c6 to a6883fb22799ac74479c7db872451092
- Compactor from 9c408e1d55681ecb8a22c9fab46875cc to 1b3443aea86db629e6efdb7d05c53823
- Compactor Resources from df9added6f1f4332f95848cca48ebd99 to 09a5c49e9cdb2f2b24c6d184574a07fd
- Config from 61bb048ced9817b2d3e07677fb1c6290 to 5d9d0b4724c0f80d68467088ec61e003
- Object Store from d5a3a4489d57c733b5677fb55370a723 to e1324ee2a434f4158c00a9ee279d3292
- Overrides from b5c95fee2e5e7c4b5930826ff6e89a12 to 1e2c358600ac53f09faea133f811b5bb
- Queries from d9931b1054053c8b972d320774bb8f1d to b3abe8d5c040395cc36615cb4334c92d
- Reads from 8d6ba60eccc4b6eedfa329b24b1bd339 to e327503188913dc38ad571c647eef643
- Reads Networking from c0464f0d8bd026f776c9006b05910000 to 54b2a0a4748b3bd1aefa92ce5559a1c2
- Reads Resources from 2fd2cda9eea8d8af9fbc0a5960425120 to cc86fd5aa9301c6528986572ad974db9
- Rollout Progress from 7544a3a62b1be6ffd919fc990ab8ba8f to 7f0b5567d543a1698e695b530eb7f5de
- Ruler from 44d12bcb1f95661c6ab6bc946dfc3473 to 631e15d5d85afb2ca8e35d62984eeaa0
- Scaling from 88c041017b96856c9176e07cf557bdcf to 64bbad83507b7289b514725658e10352
- Slow queries from e6f3091e29d2636e3b8393447e925668 to 6089e1ce1e678788f46312a0a1e647e6
- Tenants from 35fa247ce651ba189debf33d7ae41611 to 35fa247ce651ba189debf33d7ae41611
- Top Tenants from bc6e12d4fe540e4a1785b9d3ca0ffdd9 to bc6e12d4fe540e4a1785b9d3ca0ffdd9
- Writes from 0156f6d15aa234d452a33a4f13c838e3 to 8280707b8f16e7b87b840fc1cc92d4c5
- Writes Networking from 681cd62b680b7154811fe73af55dcfd4 to 978c1cb452585c96697a238eaac7fe2d
- Writes Resources from c0464f0d8bd026f776c9006b0591bb0b to bc9160e50b52e89e0e49c840fea3d379
- [FEATURE] Alerts: added the following alerts on mimir-continuous-test tool: #1676
MimirContinuousTestNotRunningOnWrites
MimirContinuousTestNotRunningOnReads
MimirContinuousTestFailed
- [ENHANCEMENT] Added per_cluster_label support to allow changing the label name used to differentiate between Kubernetes clusters. #1651
- [ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696
- [ENHANCEMENT] Playbooks: Add Alertmanager suggestions for MimirRequestErrors and MimirRequestLatency. #1702
- [ENHANCEMENT] Dashboards: Allow custom datasources. #1749
- [ENHANCEMENT] Dashboards: Add config option gateway_enabled (defaults to true) to disable gateway panels from dashboards. #1761
- [ENHANCEMENT] Dashboards: Extend Top tenants dashboard with queries for tenants with highest sample rate, discard rate, and discard rate growth. #1842
- [ENHANCEMENT] Dashboards: Show ingestion rate limit and rule group limit on Tenants dashboard. #1845
- [ENHANCEMENT] Dashboards: Add "last successful run" panel to compactor dashboard. #1628
- [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629
- [BUGFIX] Honor the configured per_instance_label in all dashboards and alerts. #1697
Jsonnet
- [FEATURE] Added support for mimir-continuous-test. To deploy mimir-continuous-test you can use the following configuration: #1675 #1850
    _config+: {
      continuous_test_enabled: true,
      continuous_test_tenant_id: 'type-tenant-id',
      continuous_test_write_endpoint: 'http://type-write-path-hostname',
      continuous_test_read_endpoint: 'http://type-read-path-hostname/prometheus',
    },
- [ENHANCEMENT] Ingester anti-affinity can now be disabled by using the ingester_allow_multiple_replicas_on_same_node configuration key. #1581
- [ENHANCEMENT] Added node_selector configuration option to select Kubernetes nodes where Mimir should run. #1596
- [ENHANCEMENT] Alertmanager: Added a PodDisruptionBudget of withMaxUnavailable = 1, to ensure we maintain quorum during rollouts. #1683
- [ENHANCEMENT] Store-gateway anti-affinity can now be enabled/disabled using the store_gateway_allow_multiple_replicas_on_same_node configuration key. #1730
- [ENHANCEMENT] Added store_gateway_zone_a_args, store_gateway_zone_b_args and store_gateway_zone_c_args configuration options. #1807
- [BUGFIX] Pass primary and secondary multikv stores via CLI flags. Introduced new multikv_switch_primary_secondary config option to flip primary and secondary in runtime config.
Mimirtool
- [BUGFIX] config convert: Retain Cortex defaults for blocks_storage.backend, ruler_storage.backend, alertmanager_storage.backend, auth.type, activity_tracker.filepath, alertmanager.data_dir, blocks_storage.filesystem.dir, compactor.data_dir, ruler.rule_path, ruler_storage.filesystem.dir, and graphite.querier.schemas.backend. #1626 #1762
Tools
- [FEATURE] Added a markblocks tool that creates no-compact and delete marks for the blocks. #1551
- [FEATURE] Added mimir-continuous-test tool to continuously run smoke tests on live Mimir clusters. #1535 #1540 #1653 #1603 #1630 #1691 #1675 #1676 #1692 #1706 #1709 #1775 #1777 #1778 #1795
- [FEATURE] Added mimir-rules-action GitHub action, located at operations/mimir-rules-action/, used to lint, prepare, verify, diff, and sync rules to a Mimir cluster. #1723