-
Notifications
You must be signed in to change notification settings - Fork 527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
functionaltests: add first test case #14935
base: main
Are you sure you want to change the base?
Conversation
This pull request does not have a backport label. Could you fix it @endorama? 🙏
|
|
secrets: |- | ||
EC_API_KEY:elastic-observability/elastic-cloud-observability-team-${{ matrix.environment }}-api-key | ||
|
||
- run: cd functionaltests && go test -v ./ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It failed with
go: downloading github.com/apparentlymart/go-textseg/v15 v15.0.0
# github.com/elastic/apm-server/functionaltests
FAIL github.com/elastic/apm-server/functionaltests [setup failed]
Error: main_test.go:26:2: github.com/elastic/[email protected][33](https://github.com/elastic/apm-server/actions/runs/12315688177/job/34374403286#step:6:34)854-c5983a7ff908: replacement directory ../../apm-perf does not exist
FAIL
See https://github.com/elastic/apm-server/actions/runs/12315688177/job/34374403286
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v1v thanks for jumping on this this quickly! The test requires elastic/apm-perf#197, I updated it in 320535a to include the dependency from the PR. I'll trigger the workflow again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@v1v how can I run the pipeline? I see it has a workflow_dispatch
trigger but I don't seem to be able to run it from the UI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I created a test branch:
and the workflow is now available:
workflow_dispatch
will be honoured as soon as the workflows exists in main
, meanwhile, you can use the test branch for testing your things, or if you prefer, you can modify this PR and add the push event for the branch: funcitonaltests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just synced the test branch:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
functionaltests/main_test.go
Outdated
// TODO: how to get these from Elastic Cloud? Is it possible? | ||
|
||
// cleanup | ||
t.Log("cleanup") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this run always regardless of the failures? Just wondering if the cleanup
step could be also set as another command we can use in the GH actions.
That's handy if something went wrong, so no leftovers are kept
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not, I've fixed it in 4532fe5
(#14935)
I added a flag to disable cleanup but it require manual action. In this way any automated test will automatically clean up but it can be overridden manually (for example for testing/troubleshooting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To follow up on running the cleanup manually, each test is expected to have a tfstate in it's own folder. So it's possible to "manually" (in CI) run terraform destroy
from that folder.
In any case cleanup now always run in CI, so leftovers can only happen due to failures in deleting resources (I'm not sure how frequent this may be)
8111836
to
20c7e57
Compare
Add utility packages to implement functional tests and Terraform code to bootstrap a Elastic Cloud deployment. Implement first test verifying upgrade from 8.15.4 to 8.16.0 for a fresh Elastic Cloud deployment.
20c7e57
to
5ba3747
Compare
The latest run was successful in both qa and production: https://github.com/elastic/apm-server/actions/runs/12561390912, so moving this PR ready for review! |
Co-authored-by: Victor Martinez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good step forward, thanks for creating the go-based functional test framework.
functionaltests/utils_test.go
Outdated
// Functional tests are expected to run Terraform code to operate | ||
// on infrastructure required for each tests and to query Elastic | ||
// Cloud APIs. In both cases a valid API key is required. | ||
func ecAPICheck(t *testing.T) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be best to use require.NotEmpty()
inside this function, since *testing.T
is being passed, and that we're unlikely to do anything else but error if EC_API_KEY
is unset.
func ecAPICheck(t *testing.T) {
t.Helper()
require.NotEmpty(t, os.Getenv("EC_API_KEY"), "EC_API_KEY env var not set")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in e182ae3
(#14935)
case "pro": | ||
return testRegionProduction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be production
, given that target
defaults to production
, although I see that the execution script has a pro
abbreviation. I've never seen that one, but prod
is widely used as a shorter version of production
.
I think it'd be good to settle on one and avoid invalid values as defaults or valid choices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Settling on pro
as that is what must be used in CI. Adding a comment to clarify this though as I understand the confusion.
Addressed in a031acb
functionaltests/main_test.go
Outdated
if expected.PreferIlm { | ||
assert.True(t, v.PreferIlm) | ||
} else { | ||
assert.False(t, v.PreferIlm) | ||
} | ||
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name) | ||
assert.Len(t, v.Indices, expected.IndicesPerDs) | ||
|
||
for i, index := range v.Indices { | ||
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PreferIlm
check can be simplified and I think it'd be good to add some extra debugging information, given that we're checking individual fields:
if expected.PreferIlm { | |
assert.True(t, v.PreferIlm) | |
} else { | |
assert.False(t, v.PreferIlm) | |
} | |
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name) | |
assert.Len(t, v.Indices, expected.IndicesPerDs) | |
for i, index := range v.Indices { | |
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name) | |
} | |
assert.Equal(t, expected.PreferIlm, v.PreferIlm, | |
"datastream %s should prefer ILM", v.Name, | |
) | |
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name, | |
`datastream %s should be managed by "%s"`, v.Name, expected.DSManagedBy, | |
) | |
assert.Len(t, v.Indices, expected.IndicesPerDs, | |
"datastream %s should have %d indices", v.Name, expected.IndicesPerDs, | |
) | |
for i, index := range v.Indices { | |
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name, | |
`index %s should be managed by "%s"`, index.IndexName, | |
expected.IndicesManagedBy[i], | |
) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more debug information in e00ef0a
(#14935)
But I did not replace assert.True and False with Equal to provide tailored debugging info for each case.
functionaltests/main_test.go
Outdated
g.Logger = zaptest.NewLogger(t, zaptest.Level(zap.InfoLevel)) | ||
require.NoError(t, err) | ||
|
||
err = g.RunBlocking(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error isn't returned, shouldn't this be returned to the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed this as part of the refactoring to wait for documents to be stored in ES, in fd73107
(#14935)
// NewConfig returns a Config intialised from environment variables. | ||
// func NewConfig() (Config, error) { | ||
// cfg := Config{} | ||
// return cfg, err | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover?
// NewConfig returns a Config intialised from environment variables. | |
// func NewConfig() (Config, error) { | |
// cfg := Config{} | |
// return cfg, err | |
// } |
functionaltests/8_15_test.go
Outdated
// Manual tests had failures due to only 4 data streams being reported | ||
// when no delay was used. Manual inspection always revealed the correct | ||
// number of data streams. | ||
time.Sleep(1 * time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I love the 1m wait approach. If we know ahead of time how many documents we are going to end up with, wouldn't it be best to ensure that the aggregate result of ApmDocCount()
is at least that number? That'll avoid arbitrary waiting times here and there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upside of this approach was that it had a very simple implementation. I refactored it to verify docs count in some data streams in fd73107
(#14935). Only data streams not related to aggregations are checked, as aggregations one have different numbers of documents across runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also had to add a check on aggregations data streams, as test where sometimes failing. Added in c48e063
(#14935). As the number of document isn't know in advanced I used the docs count not changing as a proxy. In my test this prevented the failure in 1m
aggregation data streams to happen again.
functionaltests/8_15_test.go
Outdated
t.Log("check number of documents") | ||
newCount, err := ac.ApmDocCount(ctx) | ||
require.NoError(t, err) | ||
assertDocCountEqual(t, oldCount, newCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we asserting that the previously indexed data isn't lost somehow? If that's the case or if I've missunderstood, can you add a comment to explain why this is necssary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's the reason. It's not necessary but helps ensuring that after upgrade the state didn't change and we can safely proceed with further ingestion and assertions. I added a clarifying comment in fd73107
(#14935)
We never expect this to fail unless something broke during the upgrade that affected APM data.
functionaltests/8_15_test.go
Outdated
t.Log("check number of documents") | ||
newCount2, err := ac.ApmDocCount(ctx) | ||
require.NoError(t, err) | ||
assertDocCountGreaterThan(t, oldCount, newCount2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greater than is fine but wouldn't it be best to assert the exact number of documents? Or is this something that we expect to change over time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As part of the refactoring in fd73107
(#14935) I also change this to assert the delta number of documents after the second ingestion run.
Refactor ingest helper into a separate package that wraps apm-perf telemetrygen. The new package provides a similar interface as telemetrygen and adds the RunBlockingWait. This function runs telemetry generation and poll the ES cluster until expected doc count is reached or timeout, whichever comes first. NOTE that RunBlockingWait only waits for docs count in data streams that have a fixed amount of documents after ingestion, ignoring aggregation related data stream which have different document counts on different runs. It updates (esclient.Client).ApmDocCount() to return an easier to assert on value that is then used in RunBlockingWait. To avoid a circular dependency telemetrygen does not use the type alias but the underlying data type. It replaces assertDocCountGreaterThan and assertDocCountEqual with a single assertDocCount helper that uses the new data structure and supports considering previous results in the comparison.
ubuntu-latest is now Ubuntu 24.04 and terraform CLI is not installed anymore. See https://github.com/actions/runner-images/blob/ubuntu24/20250105.1/images/ubuntu/Ubuntu2404-Readme.md The manual step requires manual updates to Terraform version but avoids similar scenarios in the future
I run the relevant GitHub Actions and discovered an issue with terraform. Fixed with latest commit: https://github.com/elastic/apm-server/actions/runs/12688157468/job/35364084828 |
Aggregation data streams may receive data slightly later than data streams we are checking in RunBlockingWait. This leads to proceeding before data streams have stabilized resulting in failures for docs count on 1m aggregation data streams down the line. To avoid it RunBlockingWait now also checks that docs count does not change for 3 iterations consecutively before considering the wait completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments, and still a bit unsure where this pro
convention came from.
matrix: | ||
environment: | ||
- 'qa' | ||
- 'pro' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer prd
(like the MKI cluster names) or prod
rather than using pro
, which I've never seen used anywhere. Is it just me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share the preference for another acronym, I supposed this could not be changed but let's check with @v1v . Is this something that can be changed? My understanding is that pro
is part of the secret name we use in the pipeline, but I'm not sure if it's used elsewhere or it's a secret specific for this pipeline.
If it's specific, could we update it to prd
or production
?
|
||
t.Logf("created deployment %s", escfg.KibanaURL) | ||
|
||
ac, err := esclient.New(escfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this called ac
? Shouldn't it be called ec
, esc
or esClient
?
t.Logf("time elapsed: %s", time.Now().Sub(start)) | ||
|
||
// check ES logs, there should be no errors | ||
// TODO: how to get these from Elastic Cloud? Is it possible? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could set up monitoring to self and the ES logs should end up in an index in this same cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 great idea, will look into it.
|
||
// docsPerDatastream is a utility type to map esclient.ApmDocCount to a format | ||
// easier to perform assertions on. | ||
type docsPerDatastream map[string]int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type isn't used anywhere
const testRegionQA = "aws-eu-west-1" | ||
const testRegionProduction = "eu-west-1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have a switch case below, I think it may be easier to follow the code trail if these were defined in the return statement, rather than standalone constants.
Motivation/summary
Implement first test verifying upgrade from 8.15.4 to 8.16.0 for a fresh Elastic Cloud deployment.
Provides a new folder structure and helpers to quickly implement functional tests for APM Server on Elastic Cloud.
Checklist
None
How to test these changes
Run
cd functionaltests && go test -v ./
Related issues
Part of #14100