Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAI-5048: Some enhancements for node level metrics #220

Merged
merged 27 commits into from
Oct 2, 2024

Conversation

patsonluk
Copy link
Collaborator

@patsonluk patsonluk commented Sep 6, 2024

Description

This is built on top of #216 proposed by Noble. For diff please refer to this

Some enhancements:

  1. More explicit dependency between AggregateMetricsApiCaller and CoreMetricsApiCaller. CoreMetricsApiCaller would make the api call based on whatever is not covered in the AggregateMetricsApiCaller
  2. Added capability to get property other than counter (for median duration etc) - Added support of key and expr on top of prefix/property for metrics api querying Turns out the expr is very slow on playpen, we are going to support extra property by using multiple &property=... instead
  3. Added p50, p95 and p99 for duration of select/get/update.
  4. For soft and hard autocommit, we no longer use gauges, instead we now use meters just like other metrics, this allows node aggregated metrics for them
  5. Some fixes to label and unit test cases
  6. Added metricType to CoreMetric to retain correct Prometheus metrics type
  7. Metrics obtained from the node level values (solr.node) will have suffix [node aggregated] appended to the # HELP line . for example # HELP top_level_requests_get_duration_p99 top-level gets p99 duration[node aggregated]. This makes it easier to confirm the metrics are obtained from the expected source

Test

Deployed on to playpen:

  1. Confirmed that the metrics in the detailed metrics dashboard is correct after the update
  2. Confirmed that metrics Qtime is comparable to before the change
  3. Confirmed that p50,p95 and p99 duration are now available for various ops

Remarks

  1. Using aggregated node metrics introduced by this change would only work if all the configsets are setting aggregateNodeLevelMetricsEnabled consistently. Otherwise, we might get partial results
  2. Similar Metric Qtime with this change. On playpen, Metrics QTime mean goes from 101ms -> 102ms
  3. In order to get most aggregated node metrics, we will need to enable <indexConfig aggregateNodeLevelMetricsEnabled="true">, and <updateHandler class="solr.DirectUpdateHandler2" aggregateNodeLevelMetricsEnabled="true"> as well
  4. Currently, there are few metrics not available for node level metrics as they are "gauges"
INDEX.merge.major.running.docs
INDEX.merge.minor.running.docs

Other than the above, we have changed old gauges UPDATE.updateHandler.autoCommits and UPDATE.updateHandler.softAutoCommits to meters to enable node aggregated metrics. However for the remaining ones, gauge makes more sense and it's okay to leave them as is, as we do not currently use them in our grafana dashboards anyway.
5. There was an attempt to combine the 2 callers into one and make only one API call. However. it's actually taking longer as we asking for both core and solr.node with several properties (+p50, p95 and p99). That QTime increased from 101ms -> 130ms during our playpen tests.

noblepaul and others added 12 commits August 6, 2024 03:35
…nd CoresMetricsApiCaller

2. CoresMetricsApiCaller will only fetch metrics that are missing from AggregateMetricsApiCaller
3. Added capability to get property other than counter (for median duration etc) - Added support of key and expr on top of prefix/property for metrics api querying
4. Some fixes to label and unit test cases
5. Added metricType to CoreMetric to retain correct Prometheus metrics type
@patsonluk patsonluk changed the base branch from noble/aggregate_prometheus_metrics to fs/branch_9_3 September 6, 2024 02:50
patsonluk and others added 11 commits September 5, 2024 19:59
* Combine both core and solr.node metrics into AggregateMetricsApiCaller to speed things up

* Fixed metrics api property query param

* Fixed metrics api property query param

* Avoid NPE

* javadoc
code cleanup
re-arranged CoreMetric enum ordering to fit the test case
…s_metrics' into noble/patson/aggregate_prometheus_metrics
* Try making 2 calls and see if it's faster

* Try making 2 calls and see if it's faster

* Fixed unit test cases
@patsonluk
Copy link
Collaborator Author

@magibney would you mind to take a look at this PR? The changes built on top of the another PR from noble (which I did review and applied some fixes/changes to this branch).

If you want to see only my changes, it can be found in here

@patsonluk patsonluk changed the title Some enhancements for node level metrics SAI-5048: Some enhancements for node level metrics Sep 9, 2024
Copy link
Collaborator

@nginthfs nginthfs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry for the delay, but LGTM! Tested it locally myself and performance / metrics look good!

@patsonluk patsonluk merged commit ad302ee into fs/branch_9_3 Oct 2, 2024
7 checks passed
magibney pushed a commit that referenced this pull request Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants