[BUG] Confirm whether "Tragic failure of primary marks replicas as stale" can occur for OpenSearch #16817

Open
karenyrx opened this issue Dec 9, 2024 · 4 comments

Comments

@karenyrx

karenyrx commented Dec 9, 2024

Describe the bug

Similar to elastic/elasticsearch#101180, which occurred on an Elasticsearch cluster we were running, we would like to understand whether the same issue can occur for OpenSearch.

Logs collected for the primary node and master node during the incident: https://docs.google.com/spreadsheets/d/1EkZeoGpMM_fDywgx0keOV-UQwWOaXZHYUuW2hstURv4/edit?gid=1305019963#gid=1305019963 Please keep in mind that it is unfortunately unclear whether these stack traces/logs are in chronological order.

Related component

Search:Resiliency

To Reproduce

  1. Start a long-running, heavy ingestion job
  2. Set the filesystem of the primary node's data volume to read-only, via:
sudo mount -o remount,ro /dev/mapper/<volume_name>

Note: We were unable to reproduce this on our end because we could not execute step 2 so far; the remount always failed with the mount point in use (translog writes?). Theoretically, however, these are the steps that should reproduce the issue (a short sketch for diagnosing the busy mount point follows below).
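
For reference, a minimal sketch of how to see what is holding the volume busy: a remount to read-only fails while any file on the filesystem is open for writing (e.g. translog files). The mount path below is a placeholder; fuser and lsof are standard Linux utilities:

# List processes accessing files on the data volume (placeholder mount point)
sudo fuser -vm /path/to/data_volume
# Or list the open files themselves (works when the path is the mount point)
sudo lsof /path/to/data_volume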

Expected behavior

A replica should have been auto-promoted when the primary node had an issue, rather than the entire shard becoming unavailable.

Additional Details

Plugins
Please list all plugins currently enabled.

  • org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin
  • org.elasticsearch.analysis.common.CommonAnalysisPlugin
  • org.elasticsearch.script.mustache.MustachePlugin
  • org.elasticsearch.painless.PainlessPlugin
  • org.elasticsearch.index.mapper.MapperExtrasPlugin
  • org.elasticsearch.xpack.versionfield.VersionFieldPlugin
  • org.elasticsearch.join.ParentJoinPlugin
  • org.elasticsearch.percolator.PercolatorPlugin
  • org.elasticsearch.index.rankeval.RankEvalPlugin
  • org.elasticsearch.index.reindex.ReindexPlugin
  • org.elasticsearch.xpack.repositories.metering.RepositoriesMeteringPlugin
  • org.elasticsearch.plugin.repository.url.URLRepositoryPlugin
  • org.elasticsearch.xpack.constantkeyword.ConstantKeywordMapperPlugin
  • org.elasticsearch.xpack.searchbusinessrules.SearchBusinessRules
  • org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshots
  • org.elasticsearch.xpack.spatial.SpatialPlugin
  • org.elasticsearch.xpack.transform.Transform
  • org.elasticsearch.transport.Netty4Plugin
  • org.elasticsearch.xpack.unsignedlong.UnsignedLongMapperPlugin
  • org.elasticsearch.xpack.vectors.Vectors
  • org.elasticsearch.xpack.wildcard.Wildcard
  • org.elasticsearch.xpack.analytics.AnalyticsPlugin
  • org.elasticsearch.xpack.async.AsyncResultsIndexPlugin
  • org.elasticsearch.xpack.flattened.FlattenedMapperPlugin
  • org.elasticsearch.xpack.search.AsyncSearch
  • org.elasticsearch.xpack.autoscaling.Autoscaling
  • org.elasticsearch.xpack.ccr.Ccr
  • org.elasticsearch.xpack.core.XPackPlugin
  • org.elasticsearch.xpack.datastreams.DataStreamsPlugin
  • org.elasticsearch.xpack.deprecation.Deprecation
  • org.elasticsearch.xpack.enrich.EnrichPlugin
  • org.elasticsearch.xpack.eql.plugin.EqlPlugin
  • org.elasticsearch.xpack.graph.Graph
  • org.elasticsearch.xpack.idp.IdentityProviderPlugin
  • org.elasticsearch.xpack.frozen.FrozenIndices
  • org.elasticsearch.xpack.ilm.IndexLifecycle
  • org.elasticsearch.xpack.logstash.Logstash
  • org.elasticsearch.xpack.ml.MachineLearning
  • org.elasticsearch.xpack.monitoring.Monitoring
  • org.elasticsearch.xpack.ql.plugin.QlPlugin
  • org.elasticsearch.xpack.rollup.Rollup
  • org.elasticsearch.xpack.security.Security
  • org.elasticsearch.xpack.sql.plugin.SqlPlugin
  • org.elasticsearch.xpack.stack.StackPlugin
  • org.elasticsearch.cluster.coordination.VotingOnlyNodePlugin
  • org.elasticsearch.ingest.common.IngestCommonPlugin
  • org.elasticsearch.xpack.watcher.Watcher
  • org.elasticsearch.ingest.geoip.IngestGeoIpPlugin
  • org.elasticsearch.ingest.useragent.IngestUserAgentPlugin
  • org.elasticsearch.kibana.KibanaPlugin
  • org.elasticsearch.script.expression.ExpressionPlugin

Screenshots
If applicable, add screenshots to help explain your problem.

Logs from the primary node hosting the shard and the active master node: https://docs.google.com/spreadsheets/d/1EkZeoGpMM_fDywgx0keOV-UQwWOaXZHYUuW2hstURv4/edit?gid=1305019963#gid=1305019963 Please keep in mind that it is unfortunately unclear whether these stack traces/logs are in chronological order.

Host/Environment (please complete the following information):

  • OS: Linux
  • Version: ES v7.10.2

Additional context
Timeline:

18:40 Cluster was in yellow state due to ongoing ingestion / replication in the cluster.

18:41 The primary shard tried to write metadata but failed with "Exception occurred when storing metadata". The disk most likely became read-only at this point.

18:41 The cluster turned red, implying no available copies of the shard remained in the cluster.

18:41:37 - Master node logs showed that the current state of the shard was closed, with no available replicas.

org.elasticsearch.index.shard.IndexShardClosedException: CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed

18:41:37 - Master node logs showed the replica node failed to perform replication for a bulk write request:

[master_node_name_placeholder] failing shard [failed shard, shard [index_name_placeholder][6], node[F8KJuNpISluhjrzZImpLFw], [R], s[STARTED], a[id=FrkhgR1WTrK9C6innxeYLw], message [failed to perform indices:data/write/bulk[s] on replica [index_name_placeholder][6], node[F8KJuNpISluhjrzZImpLFw], [R], s[STARTED], a[id=FrkhgR1WTrK9C6innxeYLw]], failure [IndexShardClosedException[CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed]], markAsStale [true]]

19:10 An engineer attempted a graceful manual reroute with the cluster reroute API, but it failed because the replica was unavailable, with an IndexShardClosedException:

  "6": [ - 
              { - 
                "state": "UNASSIGNED",
                "primary": true,
                "node": null,
                "relocating_node": null,
                "shard": 6,
                "index": "index_name_placeholder",
                "recovery_source": { - 
                  "type": "EXISTING_STORE",
                  "bootstrap_new_history_uuid": false
                },
                "unassigned_info": { - 
                  "reason": "MANUAL_ALLOCATION",
                  "delayed": false,
                  "details": "failed shard on node [4NuDZt9TSueczPGeM1ANLQ]: failed to perform indices:data/write/bulk[s] on replica [index_name_placeholder][6], node[4NuDZt9TSueczPGeM1ANLQ], [R], s[STARTED], a[id=pGGuYwM7SLWOFu1ZZXy8ig], failure IndexShardClosedException[CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed]",
                  "allocation_status": "no_valid_shard_copy"
                }
              },
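
For reference, a possible way to pull the same per-shard routing entry (this is an assumption about how the snippet above was collected; its shape matches the cluster-state routing table, and the index name is a placeholder):

GET _cluster/state/routing_table/index_name_placeholder

The allocation explain API also reports why the primary stays unassigned:

GET _cluster/allocation/explain
{
  "index": "index_name_placeholder",
  "shard": 6,
  "primary": true
}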

19:19 Added "accept_data_loss": true to the command, to ungracefully promote a "stale" replica shard to primary:

POST _cluster/reroute
{
 "commands": [
  {
   "allocate_stale_primary": {
    "index": "index_name_placeholder",
    "shard": 6,
    "node": "4NuDZt9TSueczPGeM1ANLQ",
    "accept_data_loss": true
   }
  }
 ]
}
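
Afterwards, shard state and cluster health can be checked to confirm the stale copy was promoted (standard _cat and cluster health endpoints; the index name is a placeholder):

GET _cat/shards/index_name_placeholder?v&h=index,shard,prirep,state,node
GET _cluster/health?wait_for_status=green&timeout=60s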

19:19 Cluster 5xx failures subsided, and the cluster state went back to green.

@shwetathareja
Member

Thanks @karenyrx for raising the issue. Yes, the ideal behavior should be for the replica to be promoted to primary.
Feel free to take a stab at the fix.

@karenyrx
Author

It seems similar to #803, which was fixed in #4133 by @andrross.

@andrross @msfroh would you be able to share any insight into whether the above bug was fixed for OpenSearch in that case?

@andrross
Member

@karenyrx It does look like #4133 may have fixed this case. Does the test case in CorruptedFileIT added in that PR look similar to the situation you observed?

@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]
@karenyrx - thanks for filing the issue

Projects
Status: 🆕 New
Development

No branches or pull requests

4 participants