7.17 standalone to 8.15 upgrade results in unmanaged indices #14565

Closed
graphaelli opened this issue Nov 6, 2024 · 6 comments

@graphaelli
Member

graphaelli commented Nov 6, 2024

APM Server version (apm-server version):

7.17.25 to 8.15.3. This does not manifest with managed 7.17.25 (integration server).

Description of the problem including expected versus actual behavior:

Steps to reproduce:

  1. launch 7.17.25 stack with standalone apm-server
  2. upgrade to 8.15.3
  3. note that ingestion does not break, e.g. traces are written to traces-apm-default
  4. note that traces-apm-default is not managed by ILM

For both setups, GET traces-apm-default/_settings?filter_path=*.settings.index.lifecycle returns:

{
  ".ds-traces-apm-default-2024.11.06-000001": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "traces-apm.traces-default_policy",
          "prefer_ilm": "false"
        }
      }
    }
  }
}

GET .ds-traces-apm-default-2024.11.06-000001/_ilm/explain

Got:

{
  "indices": {
    ".ds-traces-apm-default-2024.11.06-000001": {
      "index": ".ds-traces-apm-default-2024.11.06-000001",
      "managed": false
    }
  }
}

Expected:

{
  "indices": {
    ".ds-traces-apm-default-2024.11.06-000001": {
      "index": ".ds-traces-apm-default-2024.11.06-000001",
      "managed": true,
      "policy": "traces-apm.traces-default_policy",
      "index_creation_date_millis": 1730923122077,
      "time_since_index_creation": "47.41m",
      "lifecycle_date_millis": 1730923399128,
      "age": "42.79m",
      "phase": "hot",
      "phase_time_millis": 1730923122131,
      "action": "complete",
      "action_time_millis": 1730923903331,
      "step": "complete",
      "step_time_millis": 1730923903331,
      "phase_execution": {
        "policy": "traces-apm.traces-default_policy",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "set_priority": {
              "priority": 100
            },
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_docs": 200000000,
              "min_docs": 1,
              "max_size": "50gb"
            }
          }
        },
        "version": 2,
        "modified_date_in_millis": 1730923391472
      }
    }
  }
}
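The Got/Expected comparison above can be checked mechanically. The following is a minimal Python sketch, assuming the `_ilm/explain` response body has already been loaded into a dict; `unmanaged_indices` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def unmanaged_indices(explain_response):
    """Return the sorted names of indices that ILM explain reports as not managed."""
    indices = explain_response.get("indices", {})
    return sorted(
        name for name, info in indices.items() if not info.get("managed", False)
    )


# The "Got" response from above, written as a Python dict.
got = {
    "indices": {
        ".ds-traces-apm-default-2024.11.06-000001": {
            "index": ".ds-traces-apm-default-2024.11.06-000001",
            "managed": False,
        }
    }
}

print(unmanaged_indices(got))
```

An empty list would mean every index in the response is reported as managed by ILM.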

GET _ilm/policy/traces-apm.traces-default_policy

Got:

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "Lifecycle policy not found: traces-apm.traces-default_policy"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "Lifecycle policy not found: traces-apm.traces-default_policy"
  },
  "status": 404
}

Expected:

{
  "traces-apm.traces-default_policy": {
    "version": 2,
    "modified_date": "2024-11-06T20:03:11.472Z",
    "policy": {
      "phases": {
        "hot": {
          "min_age": "0ms",
          "actions": {
            "set_priority": {
              "priority": 100
            },
            "rollover": {
              "max_age": "30d",
              "max_size": "50gb"
            }
          }
        }
      },
      "_meta": {
        "package": {
          "name": "apm"
        },
        "managed_by": "fleet",
        "managed": true
      }
    },
    "in_use_by": {
      "indices": [
        ".ds-traces-apm-default-2024.11.06-000002",
        ".ds-traces-apm-default-2024.11.06-000001"
      ],
      "data_streams": [
        "traces-apm-default"
      ],
      "composable_templates": [
        "traces-apm",
        "traces-apm@template"
      ]
    }
  }
}

The missing traces-apm.traces-default_policy is the main issue.

Originally reported by @bvader
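To tell the "policy not found" case apart from other failures, the 404 body shown above can be inspected. This is a rough Python sketch, assuming the response of `GET _ilm/policy/<name>` has been parsed into a dict; `is_missing_policy` is a hypothetical name.

```python
def is_missing_policy(response):
    """True if the body looks like the 404 returned for an unknown ILM policy."""
    error = response.get("error")
    if not isinstance(error, dict):
        return False
    return (
        response.get("status") == 404
        and error.get("type") == "resource_not_found_exception"
    )


# The "Got" response from above, trimmed to the fields that are checked.
got = {
    "error": {
        "type": "resource_not_found_exception",
        "reason": "Lifecycle policy not found: traces-apm.traces-default_policy",
    },
    "status": 404,
}

print(is_missing_policy(got))
```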

graphaelli added the bug label Nov 6, 2024
@bvader

bvader commented Nov 7, 2024

@graphaelli
@pius
@lucabelluccini

Looking closer, the issue is more widespread and is unfortunately NOT limited to standalone, so this changes the scope of the changes required as well as any interim fix.

When I upgrade 7.17 -> 8.14.3 -> 8.15.3, all of the following policies exist and are used as data ages.

Test Path 1

OOTB 7.17.25 with standalone APM Server to 8.15.3 with standalone APM Server: the following ILM policies are missing.
Per the analysis below, they are still needed. tl;dr: they are set as the ILM policies for existing indices that can still be populated with documents.

GET _ilm/policy/metrics-apm.internal_metrics-default_policy
GET _ilm/policy/metrics-apm.service_destination_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_destination_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_destination_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_60m_metrics-default_policy
GET _ilm/policy/traces-apm.traces-default_policy
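The 14 policy names above follow a regular pattern, so the list can be generated and compared against whatever the cluster reports. This is a small Python sketch; `expected_policy_names` and `missing_policies` are hypothetical helpers, and the set of existing policies is assumed to have been collected separately (for example from `GET _ilm/policy`).

```python
METRIC_SETS = ["service_destination", "service_summary", "service_transaction", "transaction"]
INTERVALS = ["10m", "1m", "60m"]


def expected_policy_names():
    """Build the policy names in the same order as the list above."""
    names = ["metrics-apm.internal_metrics-default_policy"]
    for metric_set in METRIC_SETS:
        for interval in INTERVALS:
            names.append(f"metrics-apm.{metric_set}_{interval}_metrics-default_policy")
    names.append("traces-apm.traces-default_policy")
    return names


def missing_policies(existing):
    """Return the expected policy names that are not in `existing`."""
    existing = set(existing)
    return [name for name in expected_policy_names() if name not in existing]


# With no policies present, all 14 are reported missing.
print(len(missing_policies([])))
```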

Test Path 2

OOTB 7.17.25 with Managed APM Server (Agent) to 8.15.3 with Managed APM Server (Agent)

Good: these exist before the transition to managed APM Server and are still available in 8.15.3

GET _ilm/policy/traces-apm.traces-default_policy
GET _ilm/policy/metrics-apm.internal_metrics-default_policy

Do not exist at the transition to managed APM Server in 7.17.25 and still do not exist in 8.15.3

GET _ilm/policy/metrics-apm.service_destination_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_destination_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_destination_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_summary_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.service_transaction_60m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_10m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_1m_metrics-default_policy
GET _ilm/policy/metrics-apm.transaction_60m_metrics-default_policy

I see all these data streams, and I want all the default ILM behavior. To be honest, I am not sure what the expected behavior is.

Image

Image

When I look closely at one of the data streams versus the actual index:

GET _data_stream/metrics-apm.service_destination.10m-default
{
  "data_streams" : [
    {
      "name" : "metrics-apm.service_destination.10m-default",
      "timestamp_field" : {
        "name" : "@timestamp"
      },
      "indices" : [
        {
          "index_name" : ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001",
          "index_uuid" : "FZLd2TUzRbW2W0sukNqf5g",
          "prefer_ilm" : false,
          "ilm_policy" : "metrics-apm.service_destination_10m_metrics-default_policy",
          "managed_by" : "Data stream lifecycle"
        }
      ],
      "generation" : 1,
      "_meta" : {
        "description" : "Index template for metrics-apm.service_destination.10m-*",
        "managed" : true
      },
      "status" : "GREEN",
      "template" : "metrics-apm.service_destination.10m@template",
      "lifecycle" : {
        "enabled" : true,
        "data_retention" : "180d"
      },
      "ilm_policy" : "metrics-apm.service_destination_10m_metrics-default_policy",
      "next_generation_managed_by" : "Data stream lifecycle",
      "prefer_ilm" : false,
      "hidden" : true,
      "system" : false,
      "allow_custom_routing" : false,
      "replicated" : false,
      "rollover_on_write" : false
    }
  ]
}

This indicates that this index looks for the ILM policy metrics-apm.service_destination_10m_metrics-default_policy

GET .ds-metrics-apm.service_destination.10m-default-2024.11.07-000001/_settings
{
  ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001" : {
    "settings" : {
      "index" : {
        "mapping" : {
          "total_fields" : {
            "ignore_dynamic_beyond_limit" : "true"
          },
          "ignore_malformed" : "false"
        },
        "hidden" : "true",
        "provided_name" : ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001",
        "final_pipeline" : "metrics-apm@pipeline",
        "creation_date" : "1730949001080",
        "sort" : {
          "field" : "@timestamp",
          "order" : "desc"
        },
        "number_of_replicas" : "1",
        "uuid" : "FZLd2TUzRbW2W0sukNqf5g",
        "version" : {
          "created" : "8512000"
        },
        "lifecycle" : {
          "name" : "metrics-apm.service_destination_10m_metrics-default_policy",
          "prefer_ilm" : "false"
        },
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_hot"
            }
          }
        },
        "number_of_shards" : "1",
        "default_pipeline" : "metrics-apm.service_destination@default-pipeline"
      }
    }
  }
}

And it does not exist, like the others above:

GET _ilm/policy/metrics-apm.service_destination_10m_metrics-default_policy
{
  "error" : {
    "root_cause" : [
      {
        "type" : "resource_not_found_exception",
        "reason" : "Lifecycle policy not found: metrics-apm.service_destination_10m_metrics-default_policy"
      }
    ],
    "type" : "resource_not_found_exception",
    "reason" : "Lifecycle policy not found: metrics-apm.service_destination_10m_metrics-default_policy"
  },
  "status" : 404
}

And

GET .ds-metrics-apm.service_destination.10m-default-2024.11.07-000001/_ilm/explain
{
  "indices" : {
    ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001" : {
      "index" : ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001",
      "managed" : false
    }
  }
}
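The cross-check above (index settings name a policy, but the policy does not exist) can be sketched in Python. This assumes the `GET <index>/_settings` response is loaded as a dict and that the names of existing policies were gathered separately; `dangling_policy_references` is a hypothetical helper.

```python
def dangling_policy_references(settings_response, existing_policies):
    """Map index name -> ILM policy named in its settings but absent from the cluster."""
    existing = set(existing_policies)
    dangling = {}
    for index_name, body in settings_response.items():
        lifecycle = body.get("settings", {}).get("index", {}).get("lifecycle", {})
        policy = lifecycle.get("name")
        if policy is not None and policy not in existing:
            dangling[index_name] = policy
    return dangling


# The settings response from above, trimmed to the lifecycle block.
settings = {
    ".ds-metrics-apm.service_destination.10m-default-2024.11.07-000001": {
        "settings": {
            "index": {
                "lifecycle": {
                    "name": "metrics-apm.service_destination_10m_metrics-default_policy",
                    "prefer_ilm": "false",
                }
            }
        }
    }
}

print(dangling_policy_references(settings, existing_policies=[]))
```

A dangling reference is not necessarily a functional problem if something else, such as data stream lifecycle, manages the index.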

@simitt
Contributor

simitt commented Nov 7, 2024

I conducted the following tests with SNAPSHOT versions, in which any rollover issues were already fixed, and could not find a functional problem in lifecycle management. However, I understand that the mix of Data Stream Lifecycle and Index Lifecycle Management introduced in 8.15 can be quite confusing.

The most important point to keep in mind is that any new data streams will now be managed by Data Stream Lifecycle (instead of Index Lifecycle Management). The confusing part is that data management is now mixed (Data Stream Lifecycle and Index Lifecycle Management) for upgraded deployments, and that an ilm_policy is still shown for data streams under Data Stream Lifecycle management although it is irrelevant there.

TestScenario A: Upgrade standalone from 7.17.26-SNAPSHOT to 8.15.4-SNAPSHOT:

After ingesting data, the expected data streams are created. Looking at one of the examples, defined as problematic in the comment above, everything is as expected:

GET _data_stream/metrics-apm.service_destination.1m-default*
{
  "data_streams": [
    {
      "name": "metrics-apm.service_destination.1m-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-metrics-apm.service_destination.1m-default-2024.11.07-000001",
          "index_uuid": "HwK0ATGkRqOt2kHJA-Epjw",
          "prefer_ilm": false,
          "ilm_policy": "metrics-apm.service_destination_1m_metrics-default_policy",
          "managed_by": "Data stream lifecycle"
        }
      ],
      "generation": 1,
      "_meta": {
        "managed": true,
        "description": "Index template for metrics-apm.service_destination.1m-*"
      },
      "status": "YELLOW",
      "template": "metrics-apm.service_destination.1m@template",
      "lifecycle": {
        "enabled": true,
        "data_retention": "90d"
      },
      "ilm_policy": "metrics-apm.service_destination_1m_metrics-default_policy",
      "next_generation_managed_by": "Data stream lifecycle",
      "prefer_ilm": false,
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false,
      "rollover_on_write": false,
      "failure_store": {
        "enabled": false,
        "rollover_on_write": true,
        "indices": []
      }
    }
  ]
}

Note that the result shows managed_by: "Data stream lifecycle". Hence, querying for the data stream lifecycle:

GET _data_stream/metrics-apm.service_destination.1m-default/_lifecycle

results in

{
  "data_streams": [
    {
      "name": "metrics-apm.service_destination.1m-default",
      "lifecycle": {
        "enabled": true,
        "data_retention": "90d"
      }
    }
  ]
}
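The retention applied by the data stream lifecycle can be read out of this response programmatically. This is a minimal Python sketch, assuming the `_lifecycle` response is loaded as a dict; `retention_by_data_stream` is a hypothetical name.

```python
def retention_by_data_stream(lifecycle_response):
    """Map data stream name -> data_retention for streams with an enabled lifecycle."""
    result = {}
    for ds in lifecycle_response.get("data_streams", []):
        lifecycle = ds.get("lifecycle") or {}
        if lifecycle.get("enabled", False):
            result[ds["name"]] = lifecycle.get("data_retention")
    return result


# The response from above as a Python dict.
response = {
    "data_streams": [
        {
            "name": "metrics-apm.service_destination.1m-default",
            "lifecycle": {"enabled": True, "data_retention": "90d"},
        }
    ]
}

print(retention_by_data_stream(response))
```

Data streams without an enabled lifecycle are left out of the result, which makes them easy to spot by comparing against the full list of data streams.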

For traces the result is the same:

GET _data_stream/traces-apm-default
{
  "data_streams": [
    {
      "name": "traces-apm-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-traces-apm-default-2024.11.07-000001",
          "index_uuid": "xKbCmq1kSWS6xOrPR4zMuw",
          "prefer_ilm": false,
          "ilm_policy": "traces-apm.traces-default_policy",
          "managed_by": "Data stream lifecycle"
        }
      ],
      "generation": 1,
      "_meta": {
        "managed": true,
        "description": "Index template for traces-apm-*"
      },
      "status": "YELLOW",
      "template": "traces-apm@template",
      "lifecycle": {
        "enabled": true,
        "data_retention": "10d"
      },
      "ilm_policy": "traces-apm.traces-default_policy",
      "next_generation_managed_by": "Data stream lifecycle",
      "prefer_ilm": false,
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false,
      "rollover_on_write": false,
      "failure_store": {
        "enabled": false,
        "rollover_on_write": true,
        "indices": []
      }
    }
  ]
}

GET _data_stream/traces-apm-default/_lifecycle
{
  "data_streams": [
    {
      "name": "traces-apm-default",
      "lifecycle": {
        "enabled": true,
        "data_retention": "10d"
      }
    }
  ]
}

Data ingestion works as expected - I cannot see a functional problem with this setup.

TestScenario B: Migrate standalone deployment on version 8.15.4-SNAPSHOT to managed:

After manually switching to managed mode, I repeated the tests from Scenario A and confirmed that everything works the same as before with respect to data management.

TestScenario C: Upgrade managed from 7.17.26-SNAPSHOT to 8.15.4-SNAPSHOT:
Data management and ingestion work:

GET _data_stream/metrics-apm.service_destination.1m-default*
{
  "data_streams": [
    {
      "name": "metrics-apm.service_destination.1m-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-metrics-apm.service_destination.1m-default-2024.11.07-000001",
          "index_uuid": "0HPEvPi8RqeypiSzfJqpdA",
          "prefer_ilm": false,
          "ilm_policy": "metrics-apm.service_destination_1m_metrics-default_policy",
          "managed_by": "Data stream lifecycle"
        }
      ],
      "generation": 1,
      "_meta": {
        "description": "Index template for metrics-apm.service_destination.1m-*",
        "managed": true
      },
      "status": "YELLOW",
      "template": "metrics-apm.service_destination.1m@template",
      "lifecycle": {
        "enabled": true,
        "data_retention": "90d"
      },
      "ilm_policy": "metrics-apm.service_destination_1m_metrics-default_policy",
      "next_generation_managed_by": "Data stream lifecycle",
      "prefer_ilm": false,
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false,
      "rollover_on_write": false,
      "failure_store": {
        "enabled": false,
        "rollover_on_write": true,
        "indices": []
      }
    }
  ]
}

Querying the lifecycle:

GET _data_stream/metrics-apm.service_destination.1m-default/_lifecycle
{
  "data_streams": [
    {
      "name": "metrics-apm.service_destination.1m-default",
      "lifecycle": {
        "enabled": true,
        "data_retention": "90d"
      }
    }
  ]
}

For the traces-apm* datastream, which already existed before, the lifecycle management is still handled by Index Lifecycle Management:

GET _data_stream/traces-apm-default
{
  "data_streams": [
    {
      "name": "traces-apm-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-traces-apm-default-2024.11.07-000001",
          "index_uuid": "DMZ6PCv6TRGQFRcakc4LdQ",
          "prefer_ilm": true,
          "ilm_policy": "traces-apm.traces-default_policy",
          "managed_by": "Index Lifecycle Management"
        },
        {
          "index_name": ".ds-traces-apm-default-2024.11.07-000002",
          "index_uuid": "c_nB-5L6S62DQb-0btLmoQ",
          "prefer_ilm": false,
          "ilm_policy": "traces-apm.traces-default_policy",
          "managed_by": "Index Lifecycle Management"
        }
      ],
      "generation": 3,
      "_meta": {
        "package": {
          "name": "apm"
        },
        "managed_by": "ingest-manager",
        "managed": true
      },
      "status": "YELLOW",
      "template": "traces-apm@template",
      "ilm_policy": "traces-apm.traces-default_policy",
      "next_generation_managed_by": "Index Lifecycle Management",
      "prefer_ilm": false,
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false,
      "rollover_on_write": false,
      "failure_store": {
        "enabled": false,
        "rollover_on_write": false,
        "indices": []
      }
    }
  ]
}

Since the data stream is managed by Index Lifecycle Management, the policy can be queried:

GET _ilm/policy/traces-apm.traces-default_policy
{
  "traces-apm.traces-default_policy": {
    "version": 2,
    "modified_date": "2024-11-07T08:25:05.905Z",
    "policy": {
      "phases": {
        "hot": {
          "min_age": "0ms",
          "actions": {
            "set_priority": {
              "priority": 100
            },
            "rollover": {
              "max_age": "30d",
              "max_size": "50gb"
            }
          }
        }
      },
      "_meta": {
        "package": {
          "name": "apm"
        },
        "managed_by": "fleet",
        "managed": true
      }
    },
    "in_use_by": {
      "indices": [
        ".ds-traces-apm-default-2024.11.07-000001",
        ".ds-traces-apm-default-2024.11.07-000002"
      ],
      "data_streams": [
        "traces-apm-default"
      ],
      "composable_templates": [
        "traces-apm",
        "traces-apm@template"
      ]
    }
  }
}
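Because upgraded deployments can mix both mechanisms, it may help to group backing indices by the `managed_by` value reported in the `_data_stream` responses above. This is a Python sketch under the assumption that the response is loaded as a dict; `indices_by_manager` is a hypothetical helper, and the "(not reported)" label is a placeholder of this sketch, not an Elasticsearch value.

```python
from collections import defaultdict


def indices_by_manager(data_stream_response):
    """Group backing index names by the managed_by value of each index entry."""
    grouped = defaultdict(list)
    for ds in data_stream_response.get("data_streams", []):
        for idx in ds.get("indices", []):
            # Placeholder label when the field is absent from the response.
            manager = idx.get("managed_by", "(not reported)")
            grouped[manager].append(idx["index_name"])
    return dict(grouped)


# The traces-apm-default response from the managed upgrade, trimmed.
response = {
    "data_streams": [
        {
            "name": "traces-apm-default",
            "indices": [
                {
                    "index_name": ".ds-traces-apm-default-2024.11.07-000001",
                    "prefer_ilm": True,
                    "managed_by": "Index Lifecycle Management",
                },
                {
                    "index_name": ".ds-traces-apm-default-2024.11.07-000002",
                    "prefer_ilm": False,
                    "managed_by": "Index Lifecycle Management",
                },
            ],
        }
    ]
}

print(indices_by_manager(response))
```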

@simitt
Contributor

simitt commented Nov 11, 2024

@endorama @lahsivjar please update this with the testing results from the latest 8.15.4 and 8.16.0 BCs. As shown in my comment above, in my rudimentary testing setup, data streams weren't unmanaged, but they were managed by DSL instead of ILM.

@endorama
Member

@bvader let me recap: what you primarily observed is that the ILM policies were not present after upgrade.

From our tests we can confirm what @simitt found out and explained above: data streams created in 8.15+ will be managed only by DSL, unless prefer_ilm is set to true. This is not the case in the data stream JSON you posted, where we see "managed_by" : "Data stream lifecycle" and "prefer_ilm": false. The ilm_policy setting is not relevant unless prefer_ilm is true.
We understand this is a source of confusion, especially since the Kibana UI shows the policy greyed out.
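The rule described above can be written down as a small decision function. This is only a sketch inferred from the responses posted in this thread, not Elasticsearch's actual implementation; `effective_manager` is a hypothetical name, and the "Unmanaged" label for the fallback case is an assumption.

```python
def effective_manager(ilm_policy, prefer_ilm, dsl_enabled):
    """Guess which mechanism manages a backing index, based on the thread's examples.

    ilm_policy: policy name from the index settings, or None
    prefer_ilm: value of the index's prefer_ilm setting
    dsl_enabled: whether the data stream has an enabled lifecycle
    """
    # ILM applies when a policy is set and either preferred or the only option.
    if ilm_policy and (prefer_ilm or not dsl_enabled):
        return "Index Lifecycle Management"
    if dsl_enabled:
        return "Data stream lifecycle"
    return "Unmanaged"  # assumed label for the case where neither applies


# The data stream created in 8.15+: policy named, prefer_ilm false, DSL enabled.
print(effective_manager(
    "metrics-apm.service_destination_10m_metrics-default_policy",
    prefer_ilm=False,
    dsl_enabled=True,
))
```

Under these assumptions, the ilm_policy value only matters when prefer_ilm is true or when the data stream has no lifecycle of its own.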

We documented the prefer_ilm: true requirement under 8.15 Known issues: https://www.elastic.co/guide/en/observability/8.15/apm-known-issues.html#_prefer_ilm_required_in_component_templates_to_create_custom_lifecycle_policies

Does this clarify your findings? If you query the lifecycle endpoint GET _data_stream/<data stream>/_lifecycle, do you observe the lifecycle applied? Do you observe any indices to which neither ILM nor DSL is applied?

@endorama
Member

8.16.0 is out and our testing didn't find any unmanaged index. @bvader @graphaelli do you have any additional feedback on this? If not I would close this as completed.

@graphaelli
Member Author

no objections here
