[BUG] Confirm whether "Tragic failure of primary marks replicas as stale" can occur for OpenSearch #16817

Open
karenyrx opened this issue Dec 9, 2024 · 4 comments

Comments

@karenyrx

karenyrx commented Dec 9, 2024

Describe the bug

Similar to elastic/elasticsearch#101180, which occurred on an Elasticsearch cluster we were running, we would like to understand whether the same issue can occur for OpenSearch.

Logs collected for the primary node and master node during the incident: https://docs.google.com/spreadsheets/d/1EkZeoGpMM_fDywgx0keOV-UQwWOaXZHYUuW2hstURv4/edit?gid=1305019963#gid=1305019963 Please keep in mind that it is unfortunately unclear whether these stack traces/logs are in chronological order.

Related component

Search:Resiliency

To Reproduce

  1. Start a long-running, heavy ingestion job
  2. Set the filesystem of the primary node's data volume to read-only, via:
sudo mount -o remount,ro /dev/mapper/<volume_name>

Note: We were unable to reproduce this on our end because we could not execute step 2 so far; the remount always failed with the mount point in use (translog writes?). Theoretically, however, these are the steps that should reproduce the issue (a short sketch for diagnosing the busy mount point follows below).
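
For reference, a minimal sketch of how to see what is holding the volume busy: a remount to read-only fails while any file on the filesystem is open for writing (e.g. translog files). The mount path below is a placeholder; fuser and lsof are standard Linux utilities:

# List processes accessing files on the data volume (placeholder mount point)
sudo fuser -vm /path/to/data_volume
# Or list the open files themselves (works when the path is the mount point)
sudo lsof /path/to/data_volume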

Expected behavior

A replica should have been auto-promoted when the primary node had an issue, rather than the entire shard becoming unavailable.

Additional Details

Plugins
Please list all plugins currently enabled.

  • org.elasticsearch.search.aggregations.matrix.MatrixAggregationPlugin
  • org.elasticsearch.analysis.common.CommonAnalysisPlugin
  • org.elasticsearch.script.mustache.MustachePlugin
  • org.elasticsearch.painless.PainlessPlugin
  • org.elasticsearch.index.mapper.MapperExtrasPlugin
  • org.elasticsearch.xpack.versionfield.VersionFieldPlugin
  • org.elasticsearch.join.ParentJoinPlugin
  • org.elasticsearch.percolator.PercolatorPlugin
  • org.elasticsearch.index.rankeval.RankEvalPlugin
  • org.elasticsearch.index.reindex.ReindexPlugin
  • org.elasticsearch.xpack.repositories.metering.RepositoriesMeteringPlugin
  • org.elasticsearch.plugin.repository.url.URLRepositoryPlugin
  • org.elasticsearch.xpack.constantkeyword.ConstantKeywordMapperPlugin
  • org.elasticsearch.xpack.searchbusinessrules.SearchBusinessRules
  • org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshots
  • org.elasticsearch.xpack.spatial.SpatialPlugin
  • org.elasticsearch.xpack.transform.Transform
  • org.elasticsearch.transport.Netty4Plugin
  • org.elasticsearch.xpack.unsignedlong.UnsignedLongMapperPlugin
  • org.elasticsearch.xpack.vectors.Vectors
  • org.elasticsearch.xpack.wildcard.Wildcard
  • org.elasticsearch.xpack.analytics.AnalyticsPlugin
  • org.elasticsearch.xpack.async.AsyncResultsIndexPlugin
  • org.elasticsearch.xpack.flattened.FlattenedMapperPlugin
  • org.elasticsearch.xpack.search.AsyncSearch
  • org.elasticsearch.xpack.autoscaling.Autoscaling
  • org.elasticsearch.xpack.ccr.Ccr
  • org.elasticsearch.xpack.core.XPackPlugin
  • org.elasticsearch.xpack.datastreams.DataStreamsPlugin
  • org.elasticsearch.xpack.deprecation.Deprecation
  • org.elasticsearch.xpack.enrich.EnrichPlugin
  • org.elasticsearch.xpack.eql.plugin.EqlPlugin
  • org.elasticsearch.xpack.graph.Graph
  • org.elasticsearch.xpack.idp.IdentityProviderPlugin
  • org.elasticsearch.xpack.frozen.FrozenIndices
  • org.elasticsearch.xpack.ilm.IndexLifecycle
  • org.elasticsearch.xpack.logstash.Logstash
  • org.elasticsearch.xpack.ml.MachineLearning
  • org.elasticsearch.xpack.monitoring.Monitoring
  • org.elasticsearch.xpack.ql.plugin.QlPlugin
  • org.elasticsearch.xpack.rollup.Rollup
  • org.elasticsearch.xpack.security.Security
  • org.elasticsearch.xpack.sql.plugin.SqlPlugin
  • org.elasticsearch.xpack.stack.StackPlugin
  • org.elasticsearch.cluster.coordination.VotingOnlyNodePlugin
  • org.elasticsearch.ingest.common.IngestCommonPlugin
  • org.elasticsearch.xpack.watcher.Watcher
  • org.elasticsearch.ingest.geoip.IngestGeoIpPlugin
  • org.elasticsearch.ingest.useragent.IngestUserAgentPlugin
  • org.elasticsearch.kibana.KibanaPlugin
  • org.elasticsearch.script.expression.ExpressionPlugin

Screenshots
If applicable, add screenshots to help explain your problem.

Logs from the primary node hosting the shard and the active master node: https://docs.google.com/spreadsheets/d/1EkZeoGpMM_fDywgx0keOV-UQwWOaXZHYUuW2hstURv4/edit?gid=1305019963#gid=1305019963 Please keep in mind that it is unfortunately unclear whether these stack traces/logs are in chronological order.

Host/Environment (please complete the following information):

  • OS: Linux
  • Version: ES v7.10.2

Additional context
Timeline:

18:40 Cluster was in yellow state due to ongoing ingestion / replication in the cluster.

18:41 The primary shard tried to write metadata but failed with "Exception occurred when storing metadata". The disk most likely became read-only at this point.

18:41 The cluster turned red, implying no available copies of the shard remained in the cluster.

18:41:37 - Master node logs showed that the current state of the shard was closed, with no available replicas.

org.elasticsearch.index.shard.IndexShardClosedException: CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed

18:41:37 - Master node logs showed the replica node failed to perform replication for a bulk write request:

[master_node_name_placeholder] failing shard [failed shard, shard [index_name_placeholder][6], node[F8KJuNpISluhjrzZImpLFw], [R], s[STARTED], a[id=FrkhgR1WTrK9C6innxeYLw], message [failed to perform indices:data/write/bulk[s] on replica [index_name_placeholder][6], node[F8KJuNpISluhjrzZImpLFw], [R], s[STARTED], a[id=FrkhgR1WTrK9C6innxeYLw]], failure [IndexShardClosedException[CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed]], markAsStale [true]]

19:10 An engineer attempted a graceful manual reroute with the cluster reroute API, but it failed because the replica was unavailable, with an IndexShardClosedException:

  "6": [ - 
              { - 
                "state": "UNASSIGNED",
                "primary": true,
                "node": null,
                "relocating_node": null,
                "shard": 6,
                "index": "index_name_placeholder",
                "recovery_source": { - 
                  "type": "EXISTING_STORE",
                  "bootstrap_new_history_uuid": false
                },
                "unassigned_info": { - 
                  "reason": "MANUAL_ALLOCATION",
                  "delayed": false,
                  "details": "failed shard on node [4NuDZt9TSueczPGeM1ANLQ]: failed to perform indices:data/write/bulk[s] on replica [index_name_placeholder][6], node[4NuDZt9TSueczPGeM1ANLQ], [R], s[STARTED], a[id=pGGuYwM7SLWOFu1ZZXy8ig], failure IndexShardClosedException[CurrentState[CLOSED] Replica unavailable - replica could have left ReplicationGroup or IndexShard might have closed]",
                  "allocation_status": "no_valid_shard_copy"
                }
              },
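
For reference, a possible way to pull the same per-shard routing entry (this is an assumption about how the snippet above was collected; its shape matches the cluster-state routing table, and the index name is a placeholder):

GET _cluster/state/routing_table/index_name_placeholder

The allocation explain API also reports why the primary stays unassigned:

GET _cluster/allocation/explain
{
  "index": "index_name_placeholder",
  "shard": 6,
  "primary": true
}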

19:19 Added "accept_data_loss": true to the command, to ungracefully promote a "stale" replica shard to primary:

POST _cluster/reroute
{
 "commands": [
  {
   "allocate_stale_primary": {
    "index": "index_name_placeholder",
    "shard": 6,
    "node": "4NuDZt9TSueczPGeM1ANLQ",
    "accept_data_loss": true
   }
  }
 ]
}
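
Afterwards, shard state and cluster health can be checked to confirm the stale copy was promoted (standard _cat and cluster health endpoints; the index name is a placeholder):

GET _cat/shards/index_name_placeholder?v&h=index,shard,prirep,state,node
GET _cluster/health?wait_for_status=green&timeout=60s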

19:19 Cluster 5xx failures subsided, and the cluster state went back to green.

@shwetathareja
Member

Thanks @karenyrx for raising the issue. Yes, the ideal behavior should be for the replica to be promoted to primary.
Feel free to take a stab at the fix.

@karenyrx
Author

It seems similar to #803, which was fixed in #4133 by @andrross.

@andrross @msfroh would you be able to share any insight into whether the above bug was fixed for OpenSearch in that case?

@andrross
Member

@karenyrx It does look like #4133 may have fixed this case. Does the test case in CorruptedFileIT added in that PR look similar to the situation you observed?

@rajiv-kv
Contributor

[Triage Attendees - 1, 2, 3]
@karenyrx - thanks for filing the issue

Projects
Status: 🆕 New
Development

No branches or pull requests

4 participants