
[BUG] Cluster Health API call can get tripped by circuit breaker #631

Open
Bukhtawar opened this issue Apr 28, 2021 · 13 comments
Labels: bug (Something isn't working), Cluster Manager

Comments

@Bukhtawar
Collaborator

Describe the bug
When JVM memory pressure is high, calls to cluster health might fail with:

[2021-04-05T17:37:46,637][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 1
[2021-04-05T17:37:46,631][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
[2021-04-05T17:37:44,838][INFO ][c.a.c.e.logger           ] [cc0fd770314ce44c33bedf35605e9c4d] GET /_cluster/health local=true 429 TOO_MANY_REQUESTS 865 0
{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
                "bytes_wanted": 2029039272,
                "bytes_limit": 2023548518,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [2029039272/1.8gb], which is larger than the limit of [2023548518/1.8gb], real usage: [2029039272/1.8gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=5285/5.1kb, in_flight_requests=0/0b, accounting=50225284/47.8mb]",
        "bytes_wanted": 2029039272,
        "bytes_limit": 2023548518,
        "durability": "PERMANENT"
    },
    "status": 429
}
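
For anyone debugging an occurrence like this, a minimal way to see which breaker is under pressure (a sketch, assuming the node is still reachable over plain HTTP on localhost:9200):

curl -s "localhost:9200/_nodes/stats/breaker?pretty"                              # per-breaker limits and estimated usage
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"   # per-node heap pressure

The first call reports each breaker's configured limit and current estimate; the second shows heap usage, which is what the parent breaker tracks when real-memory accounting is enabled.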

Expected behavior
Cluster health calls shouldn't get tripped by the circuit breaker: they are important, informative, and represent the state of the system.

Bukhtawar added the bug label on Apr 28, 2021
@tlfeng
Collaborator

tlfeng commented May 4, 2021

Hi @Bukhtawar,

Could you explain more about how to reproduce the issue?
It looks like this was fixed in Elasticsearch 5.0 (elastic/elasticsearch@f32b700); in addition, requests to / were also whitelisted from the circuit breaking exception in Elasticsearch 6.5 (elastic/elasticsearch@027a22a).

During my own testing, I didn't find the Cluster Health API call being tripped by the circuit breaker.
My steps (also collected into a single runnable snippet after the list):

  1. Start OpenSearch beta1 in Ubuntu with default setting.
  2. Set the parent circuit breaker to a low limit: curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent" : {"indices.breaker.total.limit" : "5%"}}'
  3. Check the heap usage with curl "localhost:9200/_cat/nodes?h=heap*&v"; found "circuit_breaking_exception" in the response.
  4. Check the cluster health with curl "localhost:9200/_cluster/health?pretty"; got the expected response without error.
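
The same steps collected into a single snippet (same commands as above, assuming a local single-node setup reachable over plain HTTP on localhost:9200):

# Step 2: lower the parent circuit breaker so it trips easily
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent" : {"indices.breaker.total.limit" : "5%"}}'

# Step 3: check heap usage; this response is expected to contain circuit_breaking_exception
curl "localhost:9200/_cat/nodes?h=heap*&v"

# Step 4: check cluster health; this returned the normal health response without error
curl "localhost:9200/_cluster/health?pretty"

# Cleanup: restore the default parent breaker limit (this request may itself be rejected
# with 429 while heap usage is above the lowered limit)
curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent" : {"indices.breaker.total.limit" : null}}'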

@anshul291995

Looking into reproducing this issue. Will update.

@dblock
Member

dblock commented Jul 16, 2021

@anshul291995 @Bukhtawar any updates here? What should we do with this?

@Bukhtawar
Collaborator Author

We'll need to try to repro here. I'll see if I can pick this up; any help from the community would be greatly appreciated too.

@reta
Collaborator

reta commented Aug 18, 2021

@Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

@reta
Collaborator

reta commented Aug 18, 2021

So far I'm confirming @tlfeng's findings: this is not reproducible for /_cluster/health. The health checks are configured to bypass all circuit breakers, and this applies to both REST and transport actions. More details would certainly help (a quick way to pull them is shown after the list):

  • OpenSearch version
  • installed plugins?
  • where are the logs coming from? (they do not look like OpenSearch server logs)
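
For reference, a quick way to pull the first two of those details from a running node (a sketch, assuming plain HTTP access on localhost:9200):

curl -s "localhost:9200/"                 # reports the OpenSearch version number and distribution
curl -s "localhost:9200/_cat/plugins?v"   # lists the plugins installed on each node

The origin of the quoted c.a.c.e.logger lines would still need to be identified separately, since they do not appear to come from the OpenSearch server itself.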

@dblock
Member

dblock commented Aug 31, 2021

@Bukhtawar @dblock would you mind if I try to reproduce and (hopefully) fix it? thanks

No need to ask for permission! Thank you for contributing.

@minalsha
Contributor

minalsha commented Sep 7, 2021

@Bukhtawar could you please help with the details that @reta is asking for? Thanks

@Bukhtawar
Collaborator Author

I'll try to see if I can repro.

@anasalkouz
Member

Closing this issue. @Bukhtawar, please feel free to reopen in case you are able to reproduce it.

@rramachand21
Member

Reopening as this is an issue that needs to be fixed.

@andrross
Member

andrross commented May 8, 2024

[Triage]
@rramachand21 Do you have any additional information about reproducing this? The findings above suggest that this API should be configured to bypass all circuit breakers.

@ashking94
Member

From what I have seen, the underlying issue may also cause a node to fail to join the cluster, since the node-join call also gets tripped by the circuit breaking exception (CBE), leading to persistent node drops.
