2.1.0

@johannaratliff johannaratliff released this 26 May 17:54
· 5030 commits to main since this release

Grafana Labs is excited to announce version 2.1 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Below we highlight the top features, enhancements and bugfixes in this release, as well as relevant callouts for those upgrading from Grafana Mimir 2.0. The complete list of changes is recorded in the Changelog.

Features and enhancements

  • Mimir on ARM: We now publish Docker images for both amd64 and arm64, making it easier for those on ARM-based machines to develop and run Mimir. Multiplatform images are available from the Mimir Docker registry. Note that our existing integration test suite only uses the amd64 images, which means we cannot make any functional or performance guarantees about the arm64 images.

  • Remote ruler mode for improved rule evaluation performance: We've added a remote mode for the Grafana Mimir ruler, in which the ruler delegates rule evaluation to the query-frontend rather than evaluating rules directly within the ruler process itself. This allows recording and alerting rules to benefit from the query parallelization techniques implemented in the query-frontend (like query sharding). Remote mode is considered experimental and is off by default. To enable, see remote ruler.

  • Per-tenant custom trackers for monitoring cardinality: In Grafana Mimir 2.0, we introduced a custom tracker feature that allows you to track the count of active series over time that match a specific label matcher. In Grafana Mimir 2.1, we've made it possible to configure custom trackers via the runtime configuration file. This means you can now define different trackers for each tenant in your cluster and modify those trackers without an ingester restart.

  • Reduce cardinality of Grafana Mimir's /metrics endpoint: While Grafana Mimir does a good job of exposing a relatively small number of series about its own state, this number can tick up when running Grafana Mimir clusters with high tenant counts or high active series counts. To reduce this number (and the accompanying cost of scraping and storing these time series), we made several optimizations which decreased series count on the /metrics endpoint by more than 10%.
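
To make the per-tenant custom tracker idea concrete, here is a minimal Python sketch of what a tracker conceptually computes: for each tenant, the number of active series matching each named matcher. This is an illustration only, not Mimir's implementation or configuration format. The tenant names, tracker names, and the `TRACKERS` structure are hypothetical, and each tracker is simplified to a single label with one fully anchored regular expression.

```python
import re

# Hypothetical per-tenant tracker definitions: tracker name -> (label, regex).
# Real trackers are label matchers; a single regex per tracker is a
# simplification made for this sketch.
TRACKERS = {
    "tenant-a": {"dev": ("namespace", r"dev-.*"), "prod": ("namespace", r"prod-.*")},
    "tenant-b": {"api": ("job", r"api")},
}


def count_tracked_series(tenant, active_series):
    """Return {tracker name: number of matching active series} for one tenant.

    active_series is an iterable of label dicts, one dict per active series.
    """
    trackers = TRACKERS.get(tenant, {})
    counts = {name: 0 for name in trackers}
    for labels in active_series:
        for name, (label, pattern) in trackers.items():
            # Match the whole label value, the way anchored regex matchers behave.
            if re.fullmatch(pattern, labels.get(label, "")):
                counts[name] += 1
    return counts


series = [
    {"__name__": "up", "namespace": "dev-1"},
    {"__name__": "up", "namespace": "dev-2"},
    {"__name__": "up", "namespace": "prod-eu"},
]
print(count_tracked_series("tenant-a", series))
```

Because the definitions are keyed by tenant, changing one tenant's trackers does not affect any other tenant, which is the property the runtime configuration support provides.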

Upgrade considerations

We've updated the default values for 2 parameters in Grafana Mimir to give users better out-of-the-box performance:

  • We've changed the default for -blocks-storage.tsdb.isolation-enabled from true to false. We've marked this flag as deprecated and will remove it completely in 2 releases. TSDB isolation is a feature inherited from Prometheus that didn't provide any benefit given Grafana Mimir's distributed architecture, and in our 1 billion series load test we found it actually hurt performance. Disabling it reduced our ingester 99th percentile latency by 90%.

  • The store-gateway attributes cache is now enabled by default (achieved by updating the default for -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items from 0 to 50000). This in-memory cache makes it faster to look up object attributes for chunk data. We've been running this optional cache internally for a while and upon a recent configuration audit, realized it made sense to do the same for all users. The increase in store-gateway memory utilization from enabling this cache is negligible and easily justified given the performance gains.
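
If you need to keep the pre-2.1 behavior temporarily while you validate the upgrade, the two flags above can be set explicitly. The Python sketch below only assembles the command-line arguments. The flag names and previous default values are taken from the list above, while the helper name `to_cli_args` and the way you pass the result to your deployment tooling are assumptions.

```python
# Pre-2.1 defaults for the two flags whose defaults changed in this release.
PRE_2_1_DEFAULTS = {
    "blocks-storage.tsdb.isolation-enabled": "true",
    "blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items": "0",
}


def to_cli_args(flags):
    """Render a {flag: value} mapping as single-dash '-flag=value' arguments."""
    return [f"-{name}={value}" for name, value in flags.items()]


for arg in to_cli_args(PRE_2_1_DEFAULTS):
    print(arg)
```

Keep in mind that -blocks-storage.tsdb.isolation-enabled is deprecated, so pinning it is only a short-term measure.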

Bug fixes

2.1.0 bug fixes

  • PR 1704: Fixed a bug that previously caused Grafana Mimir to crash on startup when trying to run in monolithic mode with the results cache enabled due to duplicate metric names.
  • PR 1835: Fixed a bug that caused Grafana Mimir to crash when an invalid Alertmanager configuration was set even though the Alertmanager component was disabled. After this fix, the Alertmanager configuration is only validated if the Alertmanager component is loaded.
  • PR 1836: The ability to run Alertmanager with local storage broke in Grafana Mimir 2.0 when we removed the ability to run the Alertmanager without sharding. With this bugfix, we've made it possible to again run Alertmanager with local storage. However, for production use, we still recommend using an external store, since this is needed to persist Alertmanager state (e.g. silences) between replicas.
  • PR 1715: Restored Grafana Mimir's ability to use CNAME DNS records to reach memcached servers. The bug was inherited from an upstream change to Thanos; we contributed a fix to Thanos and subsequently updated our Thanos version.

CHANGELOG

Grafana Mimir

  • [CHANGE] Compactor: No longer upload debug meta files to object storage. #1257
  • [CHANGE] Default values have changed for the following settings: #1547
    • -alertmanager.alertmanager-client.grpc-max-recv-msg-size now defaults to 100 MiB (previously was not configurable and set to 16 MiB)
    • -alertmanager.alertmanager-client.grpc-max-send-msg-size now defaults to 100 MiB (previously was not configurable and set to 4 MiB)
    • -alertmanager.max-recv-msg-size now defaults to 100 MiB (previously was 16 MiB)
  • [CHANGE] Ingester: Add user label to metrics cortex_ingester_ingested_samples_total and cortex_ingester_ingested_samples_failures_total. #1533
  • [CHANGE] Ingester: Changed -blocks-storage.tsdb.isolation-enabled default from true to false. The config option has also been deprecated and will be removed in 2 minor versions. #1655
  • [CHANGE] Query-frontend: results cache keys are now versioned; this will cause the cache to be refilled when rolling out this version. #1631
  • [CHANGE] Store-gateway: enabled attributes in-memory cache by default. New default configuration is -blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items=50000. #1727
  • [CHANGE] Compactor: Removed the metric cortex_compactor_garbage_collected_blocks_total since it duplicates cortex_compactor_blocks_marked_for_deletion_total. #1728
  • [CHANGE] All: Logs that used the org_id label now use the user label. #1634 #1758
  • [CHANGE] Alertmanager: the following metrics are not exported for a given user and integration when the metric value is zero: #1783
    • cortex_alertmanager_notifications_total
    • cortex_alertmanager_notifications_failed_total
    • cortex_alertmanager_notification_requests_total
    • cortex_alertmanager_notification_requests_failed_total
    • cortex_alertmanager_notification_rate_limited_total
  • [CHANGE] Removed the following metrics exposed by the Mimir hash rings: #1791
    • cortex_member_ring_tokens_owned
    • cortex_member_ring_tokens_to_own
    • cortex_ring_tokens_owned
    • cortex_ring_member_ownership_percent
  • [CHANGE] Querier / Ruler: removed the following metrics tracking the number of query requests sent to each ingester. You can use cortex_request_duration_seconds_count{route=~"/cortex.Ingester/(QueryStream|QueryExemplars)"} instead. #1797
    • cortex_distributor_ingester_queries_total
    • cortex_distributor_ingester_query_failures_total
  • [CHANGE] Distributor: removed the following metrics tracking the number of requests from a distributor to ingesters: #1799
    • cortex_distributor_ingester_appends_total
    • cortex_distributor_ingester_append_failures_total
  • [CHANGE] Distributor / Ruler: deprecated -distributor.extend-writes. Now Mimir always behaves as if this setting was set to false, which we expect to be safe for every Mimir cluster setup. #1856
  • [FEATURE] Querier: Added support for streaming remote read. Note that the benefits of chunking the response are only partial here, since in a typical query-frontend setup responses are buffered until they have completed. #1735
  • [FEATURE] Ruler: Allow setting evaluation_delay for each rule group via rules group configuration file. #1474
  • [FEATURE] Ruler: Added support for expression remote evaluation. #1536 #1818
    • The following CLI flags (and their respective YAML config options) have been added:
      • -ruler.query-frontend.address
      • -ruler.query-frontend.grpc-client-config.grpc-max-recv-msg-size
      • -ruler.query-frontend.grpc-client-config.grpc-max-send-msg-size
      • -ruler.query-frontend.grpc-client-config.grpc-compression
      • -ruler.query-frontend.grpc-client-config.grpc-client-rate-limit
      • -ruler.query-frontend.grpc-client-config.grpc-client-rate-limit-burst
      • -ruler.query-frontend.grpc-client-config.backoff-on-ratelimits
      • -ruler.query-frontend.grpc-client-config.backoff-min-period
      • -ruler.query-frontend.grpc-client-config.backoff-max-period
      • -ruler.query-frontend.grpc-client-config.backoff-retries
      • -ruler.query-frontend.grpc-client-config.tls-enabled
      • -ruler.query-frontend.grpc-client-config.tls-ca-path
      • -ruler.query-frontend.grpc-client-config.tls-cert-path
      • -ruler.query-frontend.grpc-client-config.tls-key-path
      • -ruler.query-frontend.grpc-client-config.tls-server-name
      • -ruler.query-frontend.grpc-client-config.tls-insecure-skip-verify
  • [FEATURE] Distributor: Added the ability to forward specific metrics to alternative remote_write API endpoints. #1052
  • [FEATURE] Ingester: Active series custom trackers now support runtime tenant-specific overrides. The configuration has been moved to the limits config, and the ingester config has been deprecated. #1188
  • [ENHANCEMENT] Alertmanager API: Concurrency limit for GET requests is now configurable using -alertmanager.max-concurrent-get-requests-per-tenant. #1547
  • [ENHANCEMENT] Alertmanager: Added the ability to configure additional gRPC client settings for the Alertmanager distributor #1547
    • -alertmanager.alertmanager-client.backoff-max-period
    • -alertmanager.alertmanager-client.backoff-min-period
    • -alertmanager.alertmanager-client.backoff-on-ratelimits
    • -alertmanager.alertmanager-client.backoff-retries
    • -alertmanager.alertmanager-client.grpc-client-rate-limit
    • -alertmanager.alertmanager-client.grpc-client-rate-limit-burst
    • -alertmanager.alertmanager-client.grpc-compression
    • -alertmanager.alertmanager-client.grpc-max-recv-msg-size
    • -alertmanager.alertmanager-client.grpc-max-send-msg-size
  • [ENHANCEMENT] Ruler: Add more detailed query information to ruler query stats logging. #1411
  • [ENHANCEMENT] Admin: Admin API now has some styling. #1482 #1549 #1821 #1824
  • [ENHANCEMENT] Alertmanager: added insight=true field to alertmanager dispatch logs. #1379
  • [ENHANCEMENT] Store-gateway: Add the experimental ability to run index header operations in a dedicated thread pool. This feature can be configured using -blocks-storage.bucket-store.index-header-thread-pool-size and is disabled by default. #1660
  • [ENHANCEMENT] Store-gateway: don't drop all blocks if instance finds itself as unhealthy or missing in the ring. #1806 #1823
  • [ENHANCEMENT] Querier: wait until inflight queries are completed when shutting down queriers. #1756 #1767
  • [BUGFIX] Query-frontend: do not shard queries with a subquery unless the subquery is inside a shardable aggregation function call. #1542
  • [BUGFIX] Query-frontend: added component=query-frontend label to results cache memcached metrics to fix a panic when Mimir is running in single binary mode and results cache is enabled. #1704
  • [BUGFIX] Mimir: services' status content-type is now correctly set to text/html. #1575
  • [BUGFIX] Multikv: Fix panic when using runtime config to set primary KV store used by multi KV. #1587
  • [BUGFIX] Multikv: Fix watching for runtime config changes in multi KV store in ruler and querier. #1665
  • [BUGFIX] Memcached: allow using CNAME DNS records for the memcached backend addresses. #1654
  • [BUGFIX] Querier: fixed temporary partial query results when shuffle sharding is enabled and hash ring backend storage is flushed / reset. #1829
  • [BUGFIX] Alertmanager: prevent more file traversal cases related to template names. #1833
  • [BUGFIX] Alertmanager: Allow usage with -alertmanager-storage.backend=local. Note that when using this storage type, the Alertmanager is not able to persist state remotely, so it is not recommended for production use. #1836
  • [BUGFIX] Alertmanager: Do not validate alertmanager configuration if it's not running. #1835
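
Several [CHANGE] entries above remove metrics outright, which can silently break alerts, recording rules, or dashboards that still query them. The following Python sketch shows one possible way to scan rule or dashboard text for those names before upgrading. The metric names and replacements come from the changelog entries above. The function name and the idea of scanning raw text are assumptions for illustration, not part of any Mimir tool.

```python
import re

# Replacement selector named in the changelog for the removed ingester query metrics.
INGESTER_QUERY_REPLACEMENT = (
    'cortex_request_duration_seconds_count'
    '{route=~"/cortex.Ingester/(QueryStream|QueryExemplars)"}'
)

# Metrics removed in 2.1.0 -> replacement named in the changelog,
# or None where the changelog names no replacement.
REMOVED_METRICS = {
    "cortex_compactor_garbage_collected_blocks_total": "cortex_compactor_blocks_marked_for_deletion_total",
    "cortex_member_ring_tokens_owned": None,
    "cortex_member_ring_tokens_to_own": None,
    "cortex_ring_tokens_owned": None,
    "cortex_ring_member_ownership_percent": None,
    "cortex_distributor_ingester_queries_total": INGESTER_QUERY_REPLACEMENT,
    "cortex_distributor_ingester_query_failures_total": INGESTER_QUERY_REPLACEMENT,
    "cortex_distributor_ingester_appends_total": None,
    "cortex_distributor_ingester_append_failures_total": None,
}


def find_removed_metrics(text):
    """Return {removed metric: replacement or None} for metrics referenced in text."""
    found = {}
    for metric, replacement in REMOVED_METRICS.items():
        # \b keeps a longer metric name from matching as a shorter one.
        if re.search(rf"\b{re.escape(metric)}\b", text):
            found[metric] = replacement
    return found


rule_expr = "sum(rate(cortex_distributor_ingester_appends_total[5m]))"
print(find_removed_metrics(rule_expr))
```

Running this over exported rule files and dashboard JSON gives a quick list of queries to review. A replacement of None means the changelog does not suggest a substitute.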

Mixin

  • [CHANGE] Dashboards: Remove per-user series legends from Tenants dashboard. #1605
  • [CHANGE] Dashboards: Show in-memory series and the per-user series limit on Tenants dashboard. #1613
  • [CHANGE] Dashboards: Slow-queries dashboard now uses user label from logs instead of org_id. #1634
  • [CHANGE] Dashboards: changed all Grafana dashboards UIDs to not conflict with Cortex ones, to let people install both while migrating from Cortex to Mimir: #1801 #1808
    • Alertmanager from a76bee5913c97c918d9e56a3cc88cc28 to b0d38d318bbddd80476246d4930f9e55
    • Alertmanager Resources from 68b66aed90ccab448009089544a8d6c6 to a6883fb22799ac74479c7db872451092
    • Compactor from 9c408e1d55681ecb8a22c9fab46875cc to 1b3443aea86db629e6efdb7d05c53823
    • Compactor Resources from df9added6f1f4332f95848cca48ebd99 to 09a5c49e9cdb2f2b24c6d184574a07fd
    • Config from 61bb048ced9817b2d3e07677fb1c6290 to 5d9d0b4724c0f80d68467088ec61e003
    • Object Store from d5a3a4489d57c733b5677fb55370a723 to e1324ee2a434f4158c00a9ee279d3292
    • Overrides from b5c95fee2e5e7c4b5930826ff6e89a12 to 1e2c358600ac53f09faea133f811b5bb
    • Queries from d9931b1054053c8b972d320774bb8f1d to b3abe8d5c040395cc36615cb4334c92d
    • Reads from 8d6ba60eccc4b6eedfa329b24b1bd339 to e327503188913dc38ad571c647eef643
    • Reads Networking from c0464f0d8bd026f776c9006b05910000 to 54b2a0a4748b3bd1aefa92ce5559a1c2
    • Reads Resources from 2fd2cda9eea8d8af9fbc0a5960425120 to cc86fd5aa9301c6528986572ad974db9
    • Rollout Progress from 7544a3a62b1be6ffd919fc990ab8ba8f to 7f0b5567d543a1698e695b530eb7f5de
    • Ruler from 44d12bcb1f95661c6ab6bc946dfc3473 to 631e15d5d85afb2ca8e35d62984eeaa0
    • Scaling from 88c041017b96856c9176e07cf557bdcf to 64bbad83507b7289b514725658e10352
    • Slow queries from e6f3091e29d2636e3b8393447e925668 to 6089e1ce1e678788f46312a0a1e647e6
    • Tenants from 35fa247ce651ba189debf33d7ae41611 to 35fa247ce651ba189debf33d7ae41611
    • Top Tenants from bc6e12d4fe540e4a1785b9d3ca0ffdd9 to bc6e12d4fe540e4a1785b9d3ca0ffdd9
    • Writes from 0156f6d15aa234d452a33a4f13c838e3 to 8280707b8f16e7b87b840fc1cc92d4c5
    • Writes Networking from 681cd62b680b7154811fe73af55dcfd4 to 978c1cb452585c96697a238eaac7fe2d
    • Writes Resources from c0464f0d8bd026f776c9006b0591bb0b to bc9160e50b52e89e0e49c840fea3d379
  • [FEATURE] Alerts: added the following alerts on mimir-continuous-test tool: #1676
    • MimirContinuousTestNotRunningOnWrites
    • MimirContinuousTestNotRunningOnReads
    • MimirContinuousTestFailed
  • [ENHANCEMENT] Added per_cluster_label support to allow changing the label name used to differentiate between Kubernetes clusters. #1651
  • [ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696
  • [ENHANCEMENT] Playbooks: Add Alertmanager suggestions for MimirRequestErrors and MimirRequestLatency #1702
  • [ENHANCEMENT] Dashboards: Allow custom datasources. #1749
  • [ENHANCEMENT] Dashboards: Add config option gateway_enabled (defaults to true) to disable gateway panels from dashboards. #1761
  • [ENHANCEMENT] Dashboards: Extend Top tenants dashboard with queries for tenants with highest sample rate, discard rate, and discard rate growth. #1842
  • [ENHANCEMENT] Dashboards: Show ingestion rate limit and rule group limit on Tenants dashboard. #1845
  • [ENHANCEMENT] Dashboards: Add "last successful run" panel to compactor dashboard. #1628
  • [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629
  • [BUGFIX] Honor the configured per_instance_label in all dashboards and alerts. #1697
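
Because the dashboard UIDs changed, saved links and bookmarks that embed an old UID will no longer resolve once the new dashboards are installed. As a rough sketch, the Python snippet below rewrites such links using a subset of the old-to-new mapping listed above. It assumes links follow a `/d/<uid>/<slug>` path shape, and the hostname in the example is a placeholder. Extend `UID_CHANGES` with the remaining pairs as needed.

```python
import re

# Old UID -> new UID, copied from the UID list in this changelog (subset).
UID_CHANGES = {
    "a76bee5913c97c918d9e56a3cc88cc28": "b0d38d318bbddd80476246d4930f9e55",  # Alertmanager
    "9c408e1d55681ecb8a22c9fab46875cc": "1b3443aea86db629e6efdb7d05c53823",  # Compactor
    "44d12bcb1f95661c6ab6bc946dfc3473": "631e15d5d85afb2ca8e35d62984eeaa0",  # Ruler
}


def rewrite_dashboard_url(url):
    """Swap an old dashboard UID for its new one in a '/d/<uid>/...' URL."""

    def swap(match):
        uid = match.group(1)
        # Unknown UIDs are left untouched.
        return "/d/" + UID_CHANGES.get(uid, uid)

    return re.sub(r"/d/([0-9a-f]{32})", swap, url)


old_link = "https://grafana.example.com/d/a76bee5913c97c918d9e56a3cc88cc28/alertmanager?orgId=1"
print(rewrite_dashboard_url(old_link))
```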

Jsonnet

  • [FEATURE] Added support for mimir-continuous-test. To deploy mimir-continuous-test you can use the following configuration: #1675 #1850
    _config+: {
      continuous_test_enabled: true,
      continuous_test_tenant_id: 'type-tenant-id',
      continuous_test_write_endpoint: 'http://type-write-path-hostname',
      continuous_test_read_endpoint: 'http://type-read-path-hostname/prometheus',
    },
  • [ENHANCEMENT] Ingester anti-affinity can now be disabled by using ingester_allow_multiple_replicas_on_same_node configuration key. #1581
  • [ENHANCEMENT] Added node_selector configuration option to select Kubernetes nodes where Mimir should run. #1596
  • [ENHANCEMENT] Alertmanager: Added a PodDisruptionBudget of withMaxUnavailable = 1, to ensure we maintain quorum during rollouts. #1683
  • [ENHANCEMENT] Store-gateway anti-affinity can now be enabled/disabled using store_gateway_allow_multiple_replicas_on_same_node configuration key. #1730
  • [ENHANCEMENT] Added store_gateway_zone_a_args, store_gateway_zone_b_args and store_gateway_zone_c_args configuration options. #1807
  • [BUGFIX] Pass primary and secondary multikv stores via CLI flags. Introduced new multikv_switch_primary_secondary config option to flip primary and secondary in runtime config.

Mimirtool

  • [BUGFIX] config convert: Retain Cortex defaults for blocks_storage.backend, ruler_storage.backend, alertmanager_storage.backend, auth.type, activity_tracker.filepath, alertmanager.data_dir, blocks_storage.filesystem.dir, compactor.data_dir, ruler.rule_path, ruler_storage.filesystem.dir, and graphite.querier.schemas.backend. #1626 #1762

Tools

  • [FEATURE] Added a markblocks tool that creates no-compact and delete marks for the blocks. #1551
  • [FEATURE] Added mimir-continuous-test tool to continuously run smoke tests on live Mimir clusters. #1535 #1540 #1653 #1603 #1630 #1691 #1675 #1676 #1692 #1706 #1709 #1775 #1777 #1778 #1795
  • [FEATURE] Added mimir-rules-action GitHub action, located at operations/mimir-rules-action/, used to lint, prepare, verify, diff, and sync rules to a Mimir cluster. #1723