Add functional tests for ILM & Data Stream Lifecycle #14100

axw · 2024-09-17T00:47:46Z

We should implement functional tests that verify ILM or Data Stream Lefecycle Management is used as expected according to user configuration, and depending on whether a cluster is created new or upgraded. See #13898 (comment)

endorama · 2024-10-29T23:46:19Z

I've been thinking how to implement functional tests. I explored 3 possible approaches:

as part of smoke tests
as part of system tests
with a new approach independent of the 2 above

TL;DR: my conclusion is that we may want to proceed with option 3.

Option 1 heavy leverages ESS, creating infrastructure using terraform, on top of a custom built Bash framework. We do not need leveraging ESS for functional testing and the framework looks built with that purpose. Moreover Bash flexibility comes at a great readability costs. Orchestration is also not clear to me, and documentation is lacking.

Option 2 uses Go tests, but is strictly focused on testing specific APM server behaviors. As we now rely on apm-data plugin we may need to test interactions with different Elasticsearch versions and more complex interactions, which do not fit the current model (that starts a single stack against which runs all tests).

A third option would be to build a framework that looks similar to Option 2, with guarantees provided by Go, but with a scope more similar to Option 1. I created a simple stub in https://github.com/endorama/apm-server/blob/af3ca3744e4746d4c6d7f65a162927d6c9e19331/functionaltests/main_test.go#L31 using testcontainers library and starting a specific Elasticsearch version in a container. The idea here would be to build the freedom to leverage single stack components, allowing us to express complex upgrade and assertion scenarios.
I have some concerns on this approach:

we would duplicate packages already existing in systemtests, but that looks like are lacking the flexibility we would need to express complex cases
we would duplicate some logic of smoke tests, but relying on docker would allow faster testing and using Go easier maintenance in the long term

The third option looks the best to me if I think at future use cases (es adding further cases to this logic #13678 is more difficult in Bash than in Go), and I think a test framework must be ergonomic enough to encourage use.

I see potential for convergence in the long run, but is out of scope and not sure how much weight should have in the decision.

inge4pres · 2024-10-30T17:25:50Z

depending on whether a cluster is created new or upgraded.

The upgrade part is going to be especially tricky IMO, because IIANM Elasticsearch will never allow an upgrade between a released version (e.g. 8.15.3) and an un-released one (SNAPSHOT).
If this is correct, I have no idea on how we could run the test for an upgrade before a release is created.

endorama · 2024-10-30T18:43:20Z

Can we use BCs in cloud first region? I'm not sure is possible to upgrade to those though.

endorama · 2024-10-30T18:46:22Z

I'll recap the discussions from today about how to move forward. I discussed this with @axw, @1pkg and @inge4pres.

My current stub uses testcontainers, but I was not aware that we had flakiness issues with it in systemtests, so it does not look a great path forward.
Additionally, Andrew noticed that our customers mostly use ESS, not some Docker/Compose stack, and there are benefits in testing there.

Leveraging current smoke tests does not look the preferred path forward, as they are mostly Bash + CI and this greatly limits both expressiveness sin tests and reproducibility.

The current proposal would be to implement a new testing framework built on these principals:

we separate the 3 main layers we need: orchestration, fixtures and assertions
orchestration will be Terraform + Bash based; transient modifications for testing will happen outside Terraform, produce some configuration drift but that's not considered relevant (as infrastructure is expected to be cleaned up after tests)
fixtures and assertions will be Go bases, and developed as reusable components we can use for both apm-server and serverless.

This approach would go towards the convergence mentioned in my previous comment, and not using testcontainers would help with converging earlier. The layers mentioned about should "swappable", so that we can mix infrastructure/fixtures/assertions as needed based on testing scenarios. This could potentially also extends to running tests with a Docker stack or ECK, but is out of scope at the moment.

We will also have to consider how to run tests in parallel, some tests may taint the Elasticsearch stack used in a way that does not make safe reusing it for other tests, and some may not. Is not clear how to address this in our design at the moment, but for efficiency would be interesting to be able to mark clusters as tainted for further tests reuse.

Regarding which tests cases to run, we have a set already mentioned in #13898 (comment) that we should include.
Additionally we should include testing upgrade path from versions before 8.13.0 to 8.15 and 8.16.

endorama · 2024-10-31T18:28:44Z

As per our latest discussion, I created a stub of a first test on the new framework we discussed. You can see it here: https://github.com/endorama/apm-server/blob/3b4ec398e8715b9b61ede38cb84aa5928d241492/testing/functional/test1/main_test.go

The first test I'm aiming for is testing the upgrade path from 8.14.0 to 8.15.1 as defined by:

Upgrade 8.14.x to 8.15.1+ with defaults: ILM should continue to be used for old indices, DLM should be used for new indices

I also added a README to clarify the overall idea.

simitt · 2024-11-11T07:45:00Z

Moving this to the next iteration, @endorama please update this issue with where you are at with the functional tests and the goals for this iteration.

endorama · 2024-11-18T13:16:45Z

Adding this as a possible future path to evaluate: leveraging https://pkg.go.dev/github.com/rogpeppe/go-internal/testscript (example in use: https://github.com/kortschak/dex/blob/master/testdata/worklog_load_postgres.txt https://github.com/kortschak/dex/blob/master/main_test.go#L127). This would allow us to "script" our test leveraging CLI tools and go code to augment functionalities, with a defined test runner and test structure. We don't have to take a decision now, as our plan is to write Go code we can integrate it later on. The advantage I see are:

clearly separates test logic from helper logic
test case can be easier to follow through
I'm not sure how easy it would be to reproduce them manually, but would not be much different than tests purely written in Go (as there would be no way to trigger specific functions outside of Go code).

This solution has been suggested by our colleague Dan Kortschak (not citing him to avoid subscribing to the issue).

/cc @axw

axw · 2024-11-20T03:02:11Z

I would be happy to see a spike on testscript-based functional tests, I have also advocated for that in the past, but I don't think anyone ever gave it a shot in apm-server.

endorama · 2024-12-12T17:42:31Z

I prepared the first test in this draft PR #14935

This implementation leverages:

terraform for infrastructure, run by Go code and using the elastic provider to bootstrap and upgrade EC cluster
apm-perf/telemetrygen, a new package extracted in apm-perf to allow easy ingestion of canned data from Go (PR adding it add telemetrygen public package apm-perf#197)
plain Go assertion to perform the checks

Iteration speed is slow on this test as a full run could easily take 20 to 30 minutes. Overall I did not find major roadblocks. I have to investigate a timeout in applying the upgrade to 8.16.0 through Terraform, as there is not why to change such timeout at the moment. It happened only once so far, so I've yet to drawn a conclusion on how problematic this may be.

axw added this to the 8.16 milestone Sep 17, 2024

simitt assigned endorama Oct 18, 2024

This was referenced Dec 12, 2024

add telemetrygen public package elastic/apm-perf#197

Merged

functionaltests: add first test case #14935

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add functional tests for ILM & Data Stream Lifecycle #14100

Add functional tests for ILM & Data Stream Lifecycle #14100

axw commented Sep 17, 2024

endorama commented Oct 29, 2024 •

edited

Loading

inge4pres commented Oct 30, 2024

endorama commented Oct 30, 2024

endorama commented Oct 30, 2024

endorama commented Oct 31, 2024

simitt commented Nov 11, 2024

endorama commented Nov 18, 2024

axw commented Nov 20, 2024

endorama commented Dec 12, 2024

Add functional tests for ILM & Data Stream Lifecycle #14100

Add functional tests for ILM & Data Stream Lifecycle #14100

Comments

axw commented Sep 17, 2024

endorama commented Oct 29, 2024 • edited Loading

inge4pres commented Oct 30, 2024

endorama commented Oct 30, 2024

endorama commented Oct 30, 2024

endorama commented Oct 31, 2024

simitt commented Nov 11, 2024

endorama commented Nov 18, 2024

axw commented Nov 20, 2024

endorama commented Dec 12, 2024

endorama commented Oct 29, 2024 •

edited

Loading