
Migrations fail on single node clusters due to unavailable shards exception #157968

Closed · rudolf opened this issue May 17, 2023 · 5 comments

Labels: Feature:Migrations, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)


rudolf commented May 17, 2023

We are seeing frequent migration-related failures on CI with a message like:

Not enough active copies to meet shard count of [ALL] (have 1, needed 2)

E.g. #156117 (comment)

After speaking to the Elasticsearch team this appears to be a race condition in Elasticsearch that only happens on single node clusters. We create indices with "auto_expand_replicas": "0-1" and wait for a shards_acknowledged=true response. On a single node cluster this creates an index with 0 replicas.
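For context, `"auto_expand_replicas": "0-1"` means the replica count tracks the cluster size. A rough sketch of the expansion rule (an illustrative helper, not Elasticsearch source):

```typescript
// Illustrative only: roughly how an "auto_expand_replicas" range such as
// "0-1" resolves to a concrete replica count for a given number of data nodes.
function resolveAutoExpandReplicas(range: string, dataNodes: number): number {
  const [minStr, maxStr] = range.split('-');
  const min = Number(minStr);
  // "all" expands replicas to every other node in the cluster
  const max = maxStr === 'all' ? dataNodes - 1 : Number(maxStr);
  // A replica cannot be allocated to the node holding the primary,
  // so the effective count is capped at dataNodes - 1
  return Math.max(min, Math.min(max, dataNodes - 1));
}

console.log(resolveAutoExpandReplicas('0-1', 1)); // 0 — single node cluster, no replica
console.log(resolveAutoExpandReplicas('0-1', 3)); // 1
```

With one data node the range `0-1` bottoms out at 0 replicas, matching the behaviour described above.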

info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.8.0_reindex_temp] creating index, cause [api], templates [], shards [1]/[1]

However, even if the create index API responds that all shards are active, there is a brief window during which the index is configured with 1 replica that cannot be assigned on a single node cluster. ES then immediately adjusts the replica count down to 0:

info [o.e.c.r.a.AllocationService] [node-01] updating number_of_replicas to [0] for indices [.kibana_8.8.0_reindex_temp]

If Kibana indexes any data or searches against the index during the brief window between these two log messages, we get "Not enough active copies to meet shard count of [ALL] (have 1, needed 2)" errors.

This should be fixed upstream by ES, but in the meantime we can work around this problem by always waiting for the index status to turn "green".
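A minimal sketch of that workaround (illustrative, not the actual Kibana code). The health check is injected as a function so the polling logic is self-contained; real code would call something like `esClient.cluster.health({ index, wait_for_status: 'green' })` via the dedicated `waitForIndexStatus` action:

```typescript
type IndexStatus = 'red' | 'yellow' | 'green';

// Poll the index health until it reports "green" instead of trusting
// the shards_acknowledged flag of the create index response.
async function waitForGreen(
  getStatus: () => Promise<IndexStatus>,
  { retries = 10, delayMs = 100 }: { retries?: number; delayMs?: number } = {}
): Promise<void> {
  for (let attempt = 0; attempt < retries; attempt++) {
    // A "green" status guarantees all shards (primary and any replicas) are started
    if ((await getStatus()) === 'green') return;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`index did not turn green after ${retries} attempts`);
}
```

On a single node cluster the status turns green as soon as ES has adjusted the replica count down to 0, which is exactly the signal the migration needs before indexing or searching.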

Note: the error message is similar to #127136 but this is a different issue.

@rudolf rudolf added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Migrations labels May 17, 2023
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@rudolf

rudolf commented May 17, 2023

To do this we'd need to remove the `if (res.acknowledged && res.shardsAcknowledged)` check, something like:

--- a/packages/core/saved-objects/core-saved-objects-migration-server-internal/src/actions/create_index.ts
+++ b/packages/core/saved-objects/core-saved-objects-migration-server-internal/src/actions/create_index.ts
@@ -146,25 +146,20 @@ export const createIndex = ({
       AcknowledgeResponse,
       'create_index_succeeded'
     >((res) => {
-      if (res.acknowledged && res.shardsAcknowledged) {
-        // If the cluster state was updated and all shards started we're done
-        return TaskEither.right('create_index_succeeded');
-      } else {
-        // Otherwise, wait until the target index has a 'green' status meaning
-        // the primary (and on multi node clusters) the replica has been started
-        return pipe(
-          waitForIndexStatus({
-            client,
-            index: indexName,
-            timeout: DEFAULT_TIMEOUT,
-            status: 'green',
-          }),
-          TaskEither.map(() => {
-            /** When the index status is 'green' we know that all shards were started */
-            return 'create_index_succeeded';
-          })
-        );
-      }
+      // Wait until the target index has a 'green' status, meaning that
+      // the primary (and, on multi node clusters, the replica) has been started
+      return pipe(
+        waitForIndexStatus({
+          client,
+          index: indexName,
+          timeout: DEFAULT_TIMEOUT,
+          status: 'green',
+        }),
+        TaskEither.map(() => {
+          /** When the index status is 'green' we know that all shards were started */
+          return 'create_index_succeeded';
+        })
+      );
     })
   );
 };

gsoldevila added a commit that referenced this issue May 17, 2023

[Migrations] Systematically wait for newly created indices to turn green (#157973)

Tackles #157968

When creating new indices during SO migrations, we used to rely on the
`res.acknowledged && res.shardsAcknowledged` of the
`esClient.indices.create(...)` to determine that the indices are ready
to use.

However, we believe that due to certain race conditions, this can cause
Kibana migrations to fail (refer to the [related
issue](#157968)).

This PR aims at fixing recent CI failures by adding a systematic
`waitForIndexStatus` after creating an index.
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue May 17, 2023

[Migrations] Systematically wait for newly created indices to turn green (elastic#157973)

(cherry picked from commit 71125b1)
kibanamachine referenced this issue May 17, 2023

[Migrations] Systematically wait for newly created indices to turn green (#157973) (#157993)

# Backport

This will backport the following commits from `main` to `8.8`:
- [[Migrations] Systematically wait for newly created indices to turn green (#157973)](#157973)

Co-authored-by: Gerard Soldevila <[email protected]>
@pgayvallet

pgayvallet commented May 22, 2023

@gsoldevila I guess #157973 should have closed this one?

EDIT: or maybe not given the last messages from @dmlemeshko on slack

@rudolf

rudolf commented May 24, 2023

Second attempt to fix this: #158182

@rudolf

rudolf commented Apr 15, 2024

Closing, as we have not seen further failures on CI.

@rudolf rudolf closed this as completed Apr 15, 2024