Add cluster manager task throttling stats in nodes stats API #5790

Merged (9 commits) on Jan 17, 2023

dhwanilpatel
Contributor

Signed-off-by: Dhwanil Patel [email protected]

Description

Add throttling stats to the _nodes/stats API.

These stats are populated only in the active cluster manager (master) node's stats; for all other nodes the counts are 0. The active cluster manager's stats therefore give visibility into how many tasks are being throttled.

Below are sample stats for a two-node cluster.

curl "localhost:9200/_nodes/stats/cluster_manager_throttling?pretty"
{
  "_nodes" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "cluster_name" : "opensearch",
  "nodes" : {
    "PZmL3O7dRBOg3d8NTRsSOQ" : {
      "timestamp" : 1673358595082,
      "name" : "a483e760c3eb.ant.amazon.com",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "cluster_manager_throttling" : {
        "cluster_manager_stats" : {
          "TotalThrottledTasks" : 18,
          "ThrottledTasksPerTaskType" : {
            "put-mapping" : 18
          }
        }
      }
    },
    "BPwWJ9vTSb2KJzPL9VZNcg" : {
      "timestamp" : 1673358595083,
      "name" : "a483e760c3eb.ant.amazon.com",
      "transport_address" : "127.0.0.1:9301",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9301",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "cluster_manager_throttling" : {
        "cluster_manager_stats" : {
          "TotalThrottledTasks" : 0,
          "ThrottledTasksPerTaskType" : { }
        }
      }
    }
  }
}

Issues Resolved

#479

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Dhwanil Patel <[email protected]>
private Map<String, CounterMetric> throttledTasksCount = new ConcurrentHashMap<>();
private Map<String, CounterMetric> throttledTasksCount;

public ClusterManagerThrottlingStats() {
Collaborator

We might need additional metrics on throttle_time

Contributor Author

Can you please elaborate on this?
What are we expecting from the throttle_time metric?

Collaborator

Time since we last throttled.

Contributor Author

Time since the cluster manager last throttled any task? Or per task type as well?

I think we can achieve it, since we can also record the time of the last throttling on the cluster manager node.

I am just wondering how this additional metric would help.
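
To make the discussion above concrete, here is a minimal sketch of per-task-type throttling counters extended with a "last throttled" timestamp. It is not the class added in this PR: the class name ThrottlingStatsSketch, its method names, and the explicit nowMillis parameter are all hypothetical, and java.util.concurrent.atomic.LongAdder stands in for OpenSearch's CounterMetric so the example compiles on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch for discussion; not the class from this PR. */
final class ThrottlingStatsSketch {
    // Per-task-type counters. LongAdder is a stand-in for OpenSearch's CounterMetric.
    private final Map<String, LongAdder> throttledTasksCount = new ConcurrentHashMap<>();

    // Timestamp (millis) of the most recent throttling event; -1 means "never throttled".
    private final AtomicLong lastThrottledMillis = new AtomicLong(-1L);

    /** Records that {@code count} tasks of {@code taskType} were throttled at {@code nowMillis}. */
    void onThrottle(String taskType, int count, long nowMillis) {
        throttledTasksCount.computeIfAbsent(taskType, k -> new LongAdder()).add(count);
        // Keep the largest timestamp seen, even if calls arrive out of order.
        lastThrottledMillis.accumulateAndGet(nowMillis, Math::max);
    }

    /** Throttled task count for one task type; 0 if that type was never throttled. */
    long getThrottlingCount(String taskType) {
        LongAdder counter = throttledTasksCount.get(taskType);
        return counter == null ? 0L : counter.sum();
    }

    /** Sum of throttled tasks across all task types. */
    long getTotalThrottledTaskCount() {
        return throttledTasksCount.values().stream().mapToLong(LongAdder::sum).sum();
    }

    /** Millis elapsed since the last throttling event, or -1 if nothing was throttled yet. */
    long millisSinceLastThrottle(long nowMillis) {
        long last = lastThrottledMillis.get();
        return last < 0 ? -1L : nowMillis - last;
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a real implementation would more likely read the time from the node's own clock. A per-task-type timestamp would need a second map keyed by task type, which is the trade-off raised in the question above.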

Signed-off-by: Dhwanil Patel <[email protected]>

@dhwanilpatel
Contributor Author

Precommit is passing, but gradle check is failing.

It is failing because of a connection issue while fetching Ubuntu packages during the Docker image build. I will retry it later.

* What went wrong:
Execution failed for task ':test:fixtures:gcs-fixture:composeUp'.
> Exit-code 1 when calling /usr/bin/docker-compose, stdout: Step 1/11 : FROM ubuntu:18.04
  18.04: Pulling from library/ubuntu
  Digest: sha256:c1d0baf2425ecef88a2f0c3543ec43690dc16cc80d3c4e593bb95e4f45390e45
  Status: Downloaded newer image for ubuntu:18.04
   ---> e28a50f651f9
  Step 2/11 : RUN apt-get update -qqy
   ---> Running in 7c3adacdf556
  W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
  W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease  Unable to connect to archive.ubuntu.com:http:
  W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease  Unable to connect to archive.ubuntu.com:http:
  W: Some index files failed to download. They have been ignored, or old ones used instead.

Signed-off-by: Dhwanil Patel <[email protected]>

@dhwanilpatel
Contributor Author

Note on gradle check:

The backward-compatibility tests are failing: ./gradlew ':qa:mixed-cluster:v2.6.0#mixedClusterTest'.

This is expected, since we have added a 2.6.0 version check for the stats. It will be fixed once the 2.x backport PR (#5871) is merged. Gradle check is passing in the backport PR.

Similar behavior has been observed in other stats PRs that add a version check (#4932 (comment)).
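
As a rough illustration of what a version check on serialized stats looks like, and one plausible reading of why a mixed-cluster test can fail until the backport lands, here is a self-contained sketch. It uses plain java.io streams rather than OpenSearch's StreamOutput/StreamInput; the class VersionGatedStatsSketch, the numeric version id, and the method names are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of version-gated serialization; not OpenSearch code. */
final class VersionGatedStatsSketch {
    // Assumed numeric id for 2.6.0; OpenSearch has its own Version class.
    static final int V_2_6_0 = 20600;

    /** Always writes the node name; writes the throttling total only for peers on 2.6.0 or later. */
    static byte[] write(int peerVersion, String nodeName, long totalThrottled) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(nodeName);
            if (peerVersion >= V_2_6_0) {
                out.writeLong(totalThrottled);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads with the same gate as the writer; returns -1 when the field was not sent. */
    static long readTotalThrottled(int peerVersion, byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF(); // skip the node name
            return peerVersion >= V_2_6_0 ? in.readLong() : -1L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Bytes left over for a reader that only knows about the node name. */
    static int unreadBytesForOldReader(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF();
            return in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

When both sides apply the same gate, the payload round-trips cleanly. If the sender treats the peer as 2.6.0 but the peer's build does not yet read the new field, the extra 8 bytes of the long are left unread, which is the kind of mismatch a mixed-cluster test would be expected to surface until both branches carry the change.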

@Bukhtawar changed the title from "Add throttling stats in nodes stats API" to "Add cluster manager task throttling stats in nodes stats API" on Jan 17, 2023
@Bukhtawar merged commit 6da16e1 into opensearch-project:main on Jan 17, 2023
@andrross added the "backport 2.x" (Backport to 2.x branch) label on Jan 19, 2023
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-5790-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6da16e1e60760f630c8f7fe7ec1088feb13ff216
# Push it to GitHub
git push --set-upstream origin backport/backport-5790-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-5790-to-2.x.
