[BUG] Shards stuck initializing #12398
Comments
I got the same on an empty cluster with the OpenSearch operator :\ |
@vchirikov The shards did eventually initialize overnight. To recover all other shards immediately I had to manually increase the concurrency limit to be higher than the number of stuck shards:

curl -k -XPUT "${CLUSTER_ADDRESS}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.node_initial_primaries_recoveries":20}}'

You can replace 20 with whatever value fits your cluster. I found the stuck shards with this Python snippet:

import pandas as pd
from opensearchpy import OpenSearch

# cluster_option holds the connection settings for your cluster
client = OpenSearch(**cluster_option)

# List every shard, then keep primaries that are not yet STARTED
shards = client.cat.shards(format='json')
shards_df = pd.DataFrame(shards)
shards_df[(shards_df['state'] != 'STARTED') & (shards_df['prirep'] == 'p')]
|
Yep, I tried this, but it didn't help:

PUT /_cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.cluster_concurrent_rebalance": 20,
        "allocation.node_concurrent_recoveries": 20,
        "allocation.enable": "all"
      }
    }
  }
}

I still saw only the previous value when I checked, and I gave up; since the cluster was in fact empty, I just recreated it from scratch. After this I tried scale-up / scale-down and it was OK. Btw, to see shard status you can use -- |
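A quick way to confirm whether a transient setting was actually applied is to read the effective settings back. This is a minimal sketch using the opensearch-py client; the connection settings are placeholders, not anything specified in the thread:

from opensearchpy import OpenSearch

# Placeholder connection settings - adjust for your cluster
client = OpenSearch(hosts=["https://localhost:9200"], verify_certs=False)

# flat_settings returns dotted keys, which makes the allocation settings easy to filter;
# include_defaults also shows values that were never overridden.
settings = client.cluster.get_settings(flat_settings=True, include_defaults=True)
for scope in ("transient", "persistent", "defaults"):
    for key, value in settings.get(scope, {}).items():
        if key.startswith("cluster.routing.allocation."):
            print(scope, key, value)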
[Triage - attendees 1 2 3 4 5] |
@peternied Here is how I reached the state twice in a row (same cluster)
At a minimum I'd expect some dialog regarding things to try to figure out why it's stuck. Or a "Can't reproduce, add more logging to shard recovery" |
@mvanderlee Thanks for following up - we'd need more information to reproduce the issue. Can you share your operating system, how you are running the distribution, and then write out each action to get into this state? |
AWS Linux 2 on EC2 r5.4xlarge. All we did was enable a Windows detector in Security Analytics. |
@mvanderlee One last critical piece of information: please review the opensearch.log, and if you see unhandled exceptions / errors, include them with context. That would make it clear whether this is an OpenSearch issue and what area of the product is impacted. Note: I'd recommend reviewing all log entries before publishing logs directly, as they could contain data you consider sensitive |
@peternied I really wish there was something useful in the logs, but as I mentioned there isn't. The only exceptions/errors in the logs is
If you have a non-public way for me to share the entire logs, I'd gladly do so. |
@mvanderlee Thanks for taking a look, sounds like an ugly issue. Reach out to me, Peter Nied, on our Slack instance - https://opensearch.org/slack.html - and we can discuss next steps. |
@peternied I ran into this again with the opensearch operator; it looks like it restarts nodes too quickly, which causes an unrecoverable split-brain problem. Currently I have a cluster with 2 nodes, but I saw this on a 3-node cluster as well. Logs
3h+ stuck in recovery of a 5 kB index.

GET /_cluster/allocation/explain?pretty
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
}
],
"type": "illegal_argument_exception",
"reason": "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
},
"status": 400
} |
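That error is what the allocation explain API returns when no shard is strictly UNASSIGNED: without a request body it picks an arbitrary unassigned shard to explain, so a shard stuck in INITIALIZING has to be named explicitly. A minimal sketch with the opensearch-py client; the index name, shard number, and connection settings are placeholders:

import json
from opensearchpy import OpenSearch

# Placeholder connection settings - adjust for your cluster
client = OpenSearch(hosts=["https://localhost:9200"], verify_certs=False)

# Name one of the shards reported as INITIALIZING by _cat/shards
explanation = client.cluster.allocation_explain(
    body={"index": "my-stuck-index", "shard": 0, "primary": True},
    include_disk_info=True,
)
print(explanation["current_state"])        # e.g. "initializing"
print(json.dumps(explanation, indent=2))   # full decision output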
@vchirikov Could you open a new issue with your reproduction in the OpenSearch operator repository? |
@mvanderlee Thanks for reaching out offline on Slack, I've taken a look at the logs. I get a picture of the cluster state from the following exceptions - it looks like the SecurityAnalytics plugin (see messages with SecurityAnalyticsException.java) and the alerting plugin (see messages with DestinationMigrationCoordinator) are in a very tight request/retry loop that is trying to fetch data and being rejected because there are too many requests in flight and too much memory in use (see messages with SearchBackpressureService) - from the inner cause rejection, the completed tasks number is 25M items. You can troubleshoot further, and I'd recommend working through those errors with forum.opensearch.org. It seems like there could be a 'task' explosion due to the index management plugin or one of the other plugins. This is the bounds of my expertise - best of luck. |
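The comment above does not name a specific tool, but one way to look for the kind of task pile-up it describes is the tasks API - an assumption on my part, sketched minimally with the opensearch-py client (connection settings are placeholders):

from opensearchpy import OpenSearch

# Placeholder connection settings - adjust for your cluster
client = OpenSearch(hosts=["https://localhost:9200"], verify_certs=False)

# Count in-flight tasks per action; a tight retry loop shows up as one
# action with a disproportionately large count.
tasks = client.tasks.list(detailed=True)
counts = {}
for node in tasks.get("nodes", {}).values():
    for task in node.get("tasks", {}).values():
        counts[task["action"]] = counts.get(task["action"], 0) + 1
for action, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(count, action)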
Okay, so it appears then that the cluster is reporting itself as ready and starts to ingest before it's fully recovered. This will require a custom sidecar application that does a detailed healthcheck on the cluster and provides an HTTP endpoint for the proxy or load balancer to call. |
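A minimal sketch of what such a sidecar healthcheck could look like, under assumptions the thread does not specify (the port, the choice to treat yellow as ready, and the connection settings are all placeholders): it answers 200 only when cluster health is at least yellow and no shards are still initializing or relocating.

from http.server import BaseHTTPRequestHandler, HTTPServer
from opensearchpy import OpenSearch

# Placeholder connection settings - adjust for your cluster
client = OpenSearch(hosts=["https://localhost:9200"], verify_certs=False)

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            health = client.cluster.health()
            ready = (
                health["status"] in ("green", "yellow")
                and health["initializing_shards"] == 0
                and health["relocating_shards"] == 0
            )
        except Exception:
            ready = False
        # 200 tells the proxy / load balancer to route traffic, 503 to hold off
        self.send_response(200 if ready else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()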
@mvanderlee |
Describe the bug
13 shards are still stuck in 'initializing' 3 hours after node restart (single node cluster)
Initially this caused all indices to be unavailable until I increased cluster.routing.allocation.node_initial_primaries_recoveries to 20.
I cannot find any logs or other information to try and debug this issue; any guidance would be appreciated.
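For reference, the workaround described above - raising the per-node primary recovery concurrency - would look like this through the opensearch-py client; the value 20 mirrors the curl command earlier in the thread, and the connection settings are placeholders:

from opensearchpy import OpenSearch

# Placeholder connection settings - adjust for your cluster
client = OpenSearch(hosts=["https://localhost:9200"], verify_certs=False)

# Transient bump of the per-node primary recovery concurrency
client.cluster.put_settings(body={
    "transient": {"cluster.routing.allocation.node_initial_primaries_recoveries": 20}
})

# Watch recovery progress; active_only limits the output to recoveries still running
print(client.cat.recovery(active_only=True, v=True))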
Related component
Other
To Reproduce
I have no idea.
We have a single node cluster and enabled windows detectors yesterday. Nothing but trouble since then.
Expected behavior
Cluster should be able to restart and have all indices come back online.
At a minimum there should be logging and/or timeouts when recovering indices.
Additional Details
OpenSearch 2.11.1