
[plugin/apm-data] Set fallback to legacy ILM policies #112028

Merged
merged 2 commits into elastic:main from lahsivjar:apm-data-fix-dlm on Aug 21, 2024

Conversation

lahsivjar
Contributor

@lahsivjar lahsivjar commented Aug 20, 2024

Fixes the fallback to legacy ILM policies when a data stream is updated. Without this PR, the indices created after an update would be unmanaged, without any lifecycle. After this PR:

  1. Any data stream created while the apm-data plugin is active (>= v8.15.0) will be managed by the data stream lifecycle (DSL).
  2. Any data stream created before the apm-data plugin was active (< v8.15.0) and migrated to a version on or after v8.15.0 will be managed by ILM policies until it is explicitly migrated to use DSL.

The PR doesn't add the ILM lifecycle policies themselves: when/if they are required, they should already be available via the previously installed APM integration.
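For context, the fallback mechanism can be pictured as index templates that carry both a DSL lifecycle and the legacy ILM settings, with `index.lifecycle.prefer_ilm` set to `false` so that DSL wins whenever the data stream has a lifecycle, while pre-existing ILM-only data streams keep their policy. A minimal sketch with assumed names (not the PR's actual templates):

```
# Sketch only: the template name and index pattern are made up for illustration.
# The empty "lifecycle" object enables DSL with default retention.
PUT _index_template/traces-apm-fallback-sketch
{
  "index_patterns": ["traces-apm-sketch-*"],
  "data_stream": {},
  "template": {
    "lifecycle": {},
    "settings": {
      "index.lifecycle.name": "traces-apm.traces-default_policy",
      "index.lifecycle.prefer_ilm": false
    }
  }
}
```

With `prefer_ilm: false`, backing indices of a DSL-managed data stream report `"managed_by": "Data stream lifecycle"`, while indices of data streams that pre-date DSL fall back to the ILM policy.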

Testing locally

  1. Create a stack (Elasticsearch, Kibana, APM Server) with data persistence enabled for ES, using version 8.14.3. We use 8.14.3 because it is the latest available version that uses the APM integration package and thus configures ILM policies.

Example `docker-compose.yaml`:

```yaml
version: '3.9'
x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "1g"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.14.3
    ports:
      - 9200:9200
    healthcheck:
      # Quote the URL so the shell does not treat '&' as a background operator
      test: ["CMD-SHELL", "curl -s 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=500ms'"]
      retries: 300
      interval: 1s
    environment:
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
      - "network.host=0.0.0.0"
      - "transport.host=127.0.0.1"
      - "http.host=0.0.0.0"
      - "cluster.routing.allocation.disk.threshold_enabled=false"
      - "discovery.type=single-node"
      - "xpack.security.authc.anonymous.roles=remote_monitoring_collector"
      - "xpack.security.authc.realms.file.file1.order=0"
      - "xpack.security.authc.realms.native.native1.order=1"
      - "xpack.security.enabled=true"
      - "xpack.license.self_generated.type=trial"
      - "xpack.security.authc.token.enabled=true"
      - "xpack.security.authc.api_key.enabled=true"
      - "logger.org.elasticsearch=${ES_LOG_LEVEL:-error}"
      - "action.destructive_requires_name=false"
    volumes:
      - "./testing/docker/elasticsearch/roles.yml:/usr/share/elasticsearch/config/roles.yml"
      - "./testing/docker/elasticsearch/users:/usr/share/elasticsearch/config/users"
      - "./testing/docker/elasticsearch/users_roles:/usr/share/elasticsearch/config/users_roles"
      - "./testing/docker/elasticsearch/ingest-geoip:/usr/share/elasticsearch/config/ingest-geoip"
      - "/Users/lahsivjar/Projects/elastic/tmp/esdata2:/usr/share/elasticsearch/data"
    logging: *default-logging

  kibana:
    image: docker.elastic.co/kibana/kibana:8.14.3
    ports:
      - 5601:5601
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:5601/api/status | grep -q 'All services are available'"]
      retries: 300
      interval: 1s
    environment:
      ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
      ELASTICSEARCH_USERNAME: "${KIBANA_ES_USER:-kibana_system_user}"
      ELASTICSEARCH_PASSWORD: "${KIBANA_ES_PASS:-changeme}"
      XPACK_FLEET_AGENTS_ELASTICSEARCH_HOSTS: '["http://elasticsearch:9200"]'
    depends_on:
      elasticsearch: { condition: service_healthy }
    volumes:
      - "./testing/docker/kibana/kibana.yml:/usr/share/kibana/config/kibana.yml"
    logging: *default-logging

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.14.3
    ports:
      - 8200:8200
    healthcheck:
      test: ["CMD-SHELL", "bash -c 'echo -n > /dev/tcp/127.0.0.1/8200'"]
      retries: 300
      interval: 1s
    depends_on:
      elasticsearch: { condition: service_healthy }
    volumes:
      - "./testing/docker/apm-server/apm-server.yml:/usr/share/apm-server/apm-server.yml"
    logging: *default-logging
```
NOTE: The config files used in the example docker-compose are [available here](https://github.com/elastic/apm-server/tree/main/testing/docker). The `apm-server.yml` file used in the docker-compose can be a simple config file:

```yaml
apm-server:
  host: "0.0.0.0:8200"
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  username: "admin"
  password: "changeme"
logging.level: info
logging.to_stderr: true
```
  2. Install the APM integration in the cluster.

  3. Send some data, for example by using apmsoak: `go run ./cmd/apmsoak/ run --file cmd/apmsoak/scenarios.yml --scenario apm-server --server-url http://localhost:8200`

  4. Assert that the APM indices created are managed by ILM, for example by running `GET /_data_stream/traces-apm-default` to check the trace indices.

  5. Build an Elasticsearch docker image using the branch in this PR: `./gradlew buildAarch64DockerImage`

  6. Update the versions used in the stack created in step 1 to 8.16.0-SNAPSHOT; for ES, use the docker image built in step 5.

  7. Send some more data as we did in step 3.

  8. Assert that all the APM indices are still managed by ILM.

  9. Roll over the data stream.

  10. Assert that all the APM indices, including the one created by the rollover in step 9, are still managed by ILM.

Also, test that the setup works by itself, i.e. that a cluster created using the latest version (with the changes in this PR) works as expected, and that the APM indices created in this case are managed by DSL (data stream lifecycle).

NOTE: Any indices created while APM was on version 8.15.0, for data streams created before 8.15.0 (i.e. with ILM), will remain unmanaged even after this fix. To fix them, we would need to update them explicitly OR use the PUT lifecycle API on the data stream to set DSL, as sketched below.
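For reference, that explicit migration can be done with the data stream lifecycle API. A hedged sketch (the data stream name and retention value here are only examples):

```
# Example only: enables DSL on an existing data stream; the retention value is illustrative
PUT _data_stream/traces-apm-default/_lifecycle
{
  "data_retention": "10d"
}

# Afterwards, managed_by should report "Data stream lifecycle"
GET /_data_stream/traces-apm-default
```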

Fixes: elastic/apm-server#13898

@lahsivjar
Contributor Author

[For reviewers] I was planning to add integration tests to validate the ILM<>DSL fallback behavior, but since the apm-data plugin has ALWAYS used DSL, such tests are not possible. Previously the ILM policies were installed by the APM integration, and I don't think it would be a good idea to somehow hack the integration installation into the tests -- ideas/suggestions are welcome.

Another point for discussion is that I have not added the ILM policies in this PR. My reasoning is that the policies will be present in the cluster if it has been upgraded from an older version, and we can use them. If the policies are not present, it means the integration was never installed, and we would be fine using DSL anyway. Let me know if my reasoning here is incorrect or lacking.
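For what it's worth, one quick way to check whether the legacy policies are present in an upgraded cluster (the policy name is taken from the test outputs below; this is only a suggested check, not part of the PR):

```
# Returns 404 if the integration never installed the legacy policy
GET _ilm/policy/traces-apm.traces-default_policy
```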

@lahsivjar lahsivjar marked this pull request as ready for review August 20, 2024 15:27
@lahsivjar lahsivjar requested a review from a team as a code owner August 20, 2024 15:27
@elasticsearchmachine elasticsearchmachine added the needs:triage label Aug 20, 2024
@lahsivjar lahsivjar added the >bug and :Data Management/Data streams labels Aug 20, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Data Management label and removed the needs:triage label Aug 20, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@endorama
Member

I was planning to add integration tests to validate the ILM<>DSL fallback behavior, but since the apm-data plugin has ALWAYS used DSL, such tests are not possible. Previously the ILM policies were installed by the APM integration, and I don't think it would be a good idea to somehow hack the integration installation into the tests -- ideas/suggestions are welcome.

Is it possible in the test to manually create an ILM policy and an index using it? It adds some duplication with the functionality provided by the integration, but it is not something we would need to change/update.
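For illustration, such a manual fixture could look roughly like this; all names here are hypothetical:

```
# Hypothetical stand-in for the integration-installed ILM policy
PUT _ilm/policy/test-legacy-policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_age": "30d" } } }
    }
  }
}

# Template that pins the data stream's backing indices to the ILM policy
PUT _index_template/test-legacy-template
{
  "index_patterns": ["traces-test-*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "test-legacy-policy" }
  }
}

# Create the data stream so it starts out ILM-managed
PUT _data_stream/traces-test-default
```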

@endorama
Member

endorama commented Aug 21, 2024

What I wrote here was due to bad testing: I did not have persistence enabled, so starting the version of ES from this PR produced new data streams and appeared to "fix" the issue. With persistence enabled I reproduced what we were expecting. Thanks Vishal for the guidance.

Moved the previous content here to avoid confusion:

Details

I completed the test as described. The only difference I found is on step 8, "Assert that all the APM indices are still managed by ILM".

On step 4, "Assert that the APM indices created are managed by ILM":

```
GET /_data_stream/traces-apm-default
{
  "index_name": ".ds-traces-apm-default-2024.08.21-000001",
  "index_uuid": "YrMcukDWTymjeiA7qpB_sw",
  "prefer_ilm": true,
  "ilm_policy": "traces-apm.traces-default_policy",
  "managed_by": "Index Lifecycle Management"
}
```

After starting ES built from this PR, the indices were already managed by DSL. This is the result for some of them (but they all show the same):

```
GET /_data_stream/traces-apm-default
{
  "index_name": ".ds-traces-apm-default-2024.08.21-000001",
  "index_uuid": "blw7b56sShKFI2ACmCrfdQ",
  "prefer_ilm": false,
  "ilm_policy": "traces-apm.traces-default_policy",
  "managed_by": "Data stream lifecycle"
}
GET /_data_stream/metrics-apm.internal-default
{
  "index_name": ".ds-metrics-apm.internal-default-2024.08.21-000001",
  "index_uuid": "6hZE1gH4S6qlwGFgT3itdQ",
  "prefer_ilm": false,
  "ilm_policy": "metrics-apm.internal_metrics-default_policy",
  "managed_by": "Data stream lifecycle"
}
GET /_data_stream/logs-apm.error-default
{
  "index_name": ".ds-logs-apm.error-default-2024.08.21-000001",
  "index_uuid": "EzIN-JmfSsGlnsAa8tpA7w",
  "prefer_ilm": false,
  "ilm_policy": "logs-apm.error_logs-default_policy",
  "managed_by": "Data stream lifecycle"
}
```

After the rollover we see 2 indices, both managed by DSL:

```
POST /logs-apm.error-default/_rollover/
GET /_data_stream/logs-apm.error-default
{
  "index_name": ".ds-logs-apm.error-default-2024.08.21-000001",
  "index_uuid": "EzIN-JmfSsGlnsAa8tpA7w",
  "prefer_ilm": false,
  "ilm_policy": "logs-apm.error_logs-default_policy",
  "managed_by": "Data stream lifecycle"
},
{
  "index_name": ".ds-logs-apm.error-default-2024.08.21-000002",
  "index_uuid": "NqGsBZbbTNKghyYQZh8Cpw",
  "prefer_ilm": false,
  "ilm_policy": "logs-apm.error_logs-default_policy",
  "managed_by": "Data stream lifecycle"
}
```

The same happens with multiple rollovers:

```
POST /traces-apm-default/_rollover/
POST /traces-apm-default/_rollover/
{
  "index_name": ".ds-traces-apm-default-2024.08.21-000001",
  "index_uuid": "blw7b56sShKFI2ACmCrfdQ",
  "prefer_ilm": false,
  "ilm_policy": "traces-apm.traces-default_policy",
  "managed_by": "Data stream lifecycle"
},
{
  "index_name": ".ds-traces-apm-default-2024.08.21-000002",
  "index_uuid": "eHUB_aNtSvK1DCbBYUZNHg",
  "prefer_ilm": false,
  "ilm_policy": "traces-apm.traces-default_policy",
  "managed_by": "Data stream lifecycle"
},
{
  "index_name": ".ds-traces-apm-default-2024.08.21-000003",
  "index_uuid": "2cRTD22pTHKeAau1DqR_hg",
  "prefer_ilm": false,
  "ilm_policy": "traces-apm.traces-default_policy",
  "managed_by": "Data stream lifecycle"
}
```

@lahsivjar
Contributor Author

After starting ES built from this PR, the indices were already managed by DSL. This is the result for some of them (but they all show the same):

Hmm, this is unexpected. If a data stream is created in a prior version, it should not be managed by the data stream lifecycle -- IIUC, this is what causes the issue in the first place. I wonder if the persistence didn't work as expected, causing your new setup to use DSL from the get-go?

@endorama
Member

Updated my previous comment; further testing revealed a persistence issue in my setup, as mentioned by Vishal.

TIL that Dev Console content is not persisted in Elasticsearch but via localStorage/cookies, so it is not a reliable indicator of correctly working persistence.
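A more reliable check is to look at the backing indices directly; if persistence worked, the pre-upgrade indices should still be present after the restart (suggested check only):

```
# Hidden .ds-* backing indices created before the upgrade should still be listed
GET _cat/indices/.ds-*?v&expand_wildcards=all
```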

Member

@endorama endorama left a comment

Tests covered:

  • from 8.14.3 to 8.16.0-SNAPSHOT (with this PR applied) with data streams created in 8.14.3. Some data streams were manually rolled over.
  • from 8.14.3 to 8.15.0 and then to 8.16.0-SNAPSHOT (with this PR applied) with data streams created in 8.14.3. Some data streams were manually rolled over and some did it automatically upon reaching 8.16.0-SNAPSHOT.

Data streams created in 8.15.0, as mentioned, did not receive any update to their ILM policy and required a manual API call (as documented). Once DSL was applied, the previously unmanaged data streams were updated to use DSL.

@lahsivjar lahsivjar merged commit fd37ef8 into elastic:main Aug 21, 2024
15 checks passed
@lahsivjar lahsivjar deleted the apm-data-fix-dlm branch August 21, 2024 23:12
lahsivjar added a commit to lahsivjar/elasticsearch that referenced this pull request Aug 22, 2024
lahsivjar added a commit that referenced this pull request Aug 22, 2024
lahsivjar added a commit that referenced this pull request Aug 22, 2024
* Revert "[plugin/apm-data] Set fallback to legacy ILM policies (#112028)"

This reverts commit fd37ef8.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Revert "[plugin/apm-data] Set fallback to legacy ILM policies" (elastic#112112)

* Revert "[plugin/apm-data] Set fallback to legacy ILM policies (elastic#112028)"

This reverts commit fd37ef8.
Labels
>bug, :Data Management/Data streams, Team:Data Management, v8.15.1, v8.16.0
Development

Successfully merging this pull request may close these issues.

New indexes created for datastreams after update to 8.15.0 are without lifecycle policies
4 participants