Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

functionaltests: add first test case #14935

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

functionaltests: add first test case #14935

wants to merge 22 commits into from

Conversation

endorama
Copy link
Member

Motivation/summary

Implement first test verifying upgrade from 8.15.4 to 8.16.0 for a fresh Elastic Cloud deployment.

Provides a new folder structure and helpers to quickly implement functional tests for APM Server on Elastic Cloud.

Checklist

None

How to test these changes

Run cd functionaltests && go test -v ./

Related issues

Part of #14100

Copy link
Contributor

mergify bot commented Dec 12, 2024

This pull request does not have a backport label. Could you fix it @endorama? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Copy link
Contributor

mergify bot commented Dec 12, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 12, 2024
secrets: |-
EC_API_KEY:elastic-observability/elastic-cloud-observability-team-${{ matrix.environment }}-api-key

- run: cd functionaltests && go test -v ./
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It failed with

go: downloading github.com/apparentlymart/go-textseg/v15 v15.0.0
# github.com/elastic/apm-server/functionaltests
FAIL	github.com/elastic/apm-server/functionaltests [setup failed]
Error: main_test.go:26:2: github.com/elastic/[email protected][33](https://github.com/elastic/apm-server/actions/runs/12315688177/job/34374403286#step:6:34)854-c5983a7ff908: replacement directory ../../apm-perf does not exist
FAIL

See https://github.com/elastic/apm-server/actions/runs/12315688177/job/34374403286

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v1v thanks for jumping on this this quickly! The test requires elastic/apm-perf#197, I updated it in 320535a to include the dependency from the PR. I'll trigger the workflow again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v1v how can I run the pipeline? I see it has a workflow_dispatch trigger but I don't seem to be able to run it from the UI?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I created a test branch:

and the workflow is now available:

workflow_dispatch will be honoured as soon as the workflows exists in main, meanwhile, you can use the test branch for testing your things, or if you prefer, you can modify this PR and add the push event for the branch: funcitonaltests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

// TODO: how to get these from Elastic Cloud? Is it possible?

// cleanup
t.Log("cleanup")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this run always regardless of the failures? Just wondering if the cleanup step could be also set as another command we can use in the GH actions.

That's handy if something went wrong, so no leftovers are kept

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not, I've fixed it in 4532fe5 (#14935)

I added a flag to disable cleanup but it require manual action. In this way any automated test will automatically clean up but it can be overridden manually (for example for testing/troubleshooting).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow up on running the cleanup manually, each test is expected to have a tfstate in it's own folder. So it's possible to "manually" (in CI) run terraform destroy from that folder.
In any case cleanup now always run in CI, so leftovers can only happen due to failures in deleting resources (I'm not sure how frequent this may be)

v1v and others added 2 commits December 27, 2024 17:22
Add utility packages to implement functional tests and Terraform
code to bootstrap a Elastic Cloud deployment.

Implement first test verifying upgrade from 8.15.4 to 8.16.0 for a
fresh Elastic Cloud deployment.
@endorama
Copy link
Member Author

The latest run was successful in both qa and production: https://github.com/elastic/apm-server/actions/runs/12561390912, so moving this PR ready for review!

@endorama endorama marked this pull request as ready for review December 31, 2024 16:40
@endorama endorama requested a review from a team as a code owner December 31, 2024 16:40
Copy link
Contributor

@marclop marclop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good step forward, thanks for creating the go-based functional test framework.

// Functional tests are expected to run Terraform code to operate
// on infrastructure required for each tests and to query Elastic
// Cloud APIs. In both cases a valid API key is required.
func ecAPICheck(t *testing.T) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to use require.NotEmpty() inside this function, since *testing.T is being passed, and that we're unlikely to do anything else but error if EC_API_KEY is unset.

func ecAPICheck(t *testing.T) {
	t.Helper()
	require.NotEmpty(t, os.Getenv("EC_API_KEY"), "EC_API_KEY env var not set")
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e182ae3 (#14935)

Comment on lines +119 to +120
case "pro":
return testRegionProduction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be production, given that target defaults to production, although I see that the execution script has a pro abbreviation. I've never seen that one, but prod is widely used as a shorter version of production.

I think it'd be good to settle on one and avoid invalid values as defaults or valid choices.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Settling on pro as that is what must be used in CI. Adding a comment to clarify this though as I understand the confusion.

Addressed in a031acb

Comment on lines 98 to 108
if expected.PreferIlm {
assert.True(t, v.PreferIlm)
} else {
assert.False(t, v.PreferIlm)
}
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name)
assert.Len(t, v.Indices, expected.IndicesPerDs)

for i, index := range v.Indices {
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PreferIlm check can be simplified and I think it'd be good to add some extra debugging information, given that we're checking individual fields:

Suggested change
if expected.PreferIlm {
assert.True(t, v.PreferIlm)
} else {
assert.False(t, v.PreferIlm)
}
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name)
assert.Len(t, v.Indices, expected.IndicesPerDs)
for i, index := range v.Indices {
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name)
}
assert.Equal(t, expected.PreferIlm, v.PreferIlm,
"datastream %s should prefer ILM", v.Name,
)
assert.Equal(t, expected.DSManagedBy, v.NextGenerationManagedBy.Name,
`datastream %s should be managed by "%s"`, v.Name, expected.DSManagedBy,
)
assert.Len(t, v.Indices, expected.IndicesPerDs,
"datastream %s should have %d indices", v.Name, expected.IndicesPerDs,
)
for i, index := range v.Indices {
assert.Equal(t, expected.IndicesManagedBy[i], index.ManagedBy.Name,
`index %s should be managed by "%s"`, index.IndexName,
expected.IndicesManagedBy[i],
)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more debug information in e00ef0a (#14935)

But I did not replace assert.True and False with Equal to provide tailored debugging info for each case.

g.Logger = zaptest.NewLogger(t, zaptest.Level(zap.InfoLevel))
require.NoError(t, err)

err = g.RunBlocking(context.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error isn't returned, shouldn't this be returned to the caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed this as part of the refactoring to wait for documents to be stored in ES, in fd73107 (#14935)

Comment on lines 56 to 60
// NewConfig returns a Config intialised from environment variables.
// func NewConfig() (Config, error) {
// cfg := Config{}
// return cfg, err
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

Suggested change
// NewConfig returns a Config intialised from environment variables.
// func NewConfig() (Config, error) {
// cfg := Config{}
// return cfg, err
// }

// Manual tests had failures due to only 4 data streams being reported
// when no delay was used. Manual inspection always revealed the correct
// number of data streams.
time.Sleep(1 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I love the 1m wait approach. If we know ahead of time how many documents we are going to end up with, wouldn't it be best to ensure that the aggregate result of ApmDocCount() is at least that number? That'll avoid arbitrary waiting times here and there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upside of this approach was that it had a very simple implementation. I refactored it to verify docs count in some data streams in fd73107 (#14935). Only data streams not related to aggregations are checked, as aggregations one have different numbers of documents across runs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had to add a check on aggregations data streams, as test where sometimes failing. Added in c48e063 (#14935). As the number of document isn't know in advanced I used the docs count not changing as a proxy. In my test this prevented the failure in 1m aggregation data streams to happen again.

t.Log("check number of documents")
newCount, err := ac.ApmDocCount(ctx)
require.NoError(t, err)
assertDocCountEqual(t, oldCount, newCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we asserting that the previously indexed data isn't lost somehow? If that's the case or if I've missunderstood, can you add a comment to explain why this is necssary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's the reason. It's not necessary but helps ensuring that after upgrade the state didn't change and we can safely proceed with further ingestion and assertions. I added a clarifying comment in fd73107 (#14935)

We never expect this to fail unless something broke during the upgrade that affected APM data.

t.Log("check number of documents")
newCount2, err := ac.ApmDocCount(ctx)
require.NoError(t, err)
assertDocCountGreaterThan(t, oldCount, newCount2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greater than is fine but wouldn't it be best to assert the exact number of documents? Or is this something that we expect to change over time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As part of the refactoring in fd73107 (#14935) I also change this to assert the delta number of documents after the second ingestion run.

Refactor ingest helper into a separate package that wraps apm-perf
telemetrygen. The new package provides a similar interface as
telemetrygen and adds the RunBlockingWait. This function runs telemetry
generation and poll the ES cluster until expected doc count is reached
or timeout, whichever comes first.

NOTE that RunBlockingWait only waits for docs count in data streams that
have a fixed amount of documents after ingestion, ignoring aggregation
related data stream which have different document counts on different
runs.

It updates (esclient.Client).ApmDocCount() to return an easier to assert
on value that is then used in RunBlockingWait. To avoid a circular
dependency telemetrygen does not use the type alias but the underlying
data type.

It replaces assertDocCountGreaterThan and assertDocCountEqual with a
single assertDocCount helper that uses the new data structure and
supports considering previous results in the comparison.
ubuntu-latest is now Ubuntu 24.04 and terraform CLI is not
installed anymore.
See https://github.com/actions/runner-images/blob/ubuntu24/20250105.1/images/ubuntu/Ubuntu2404-Readme.md

The manual step requires manual updates to Terraform version
but avoids similar scenarios in the future
@endorama
Copy link
Member Author

endorama commented Jan 9, 2025

I run the relevant GitHub Actions and discovered an issue with terraform.

Fixed with latest commit: https://github.com/elastic/apm-server/actions/runs/12688157468/job/35364084828

Aggregation data streams may receive data slightly later than
data streams we are checking in RunBlockingWait.
This leads to proceeding before data streams have stabilized
resulting in failures for docs count on 1m aggregation
data streams down the line.

To avoid it RunBlockingWait now also checks that docs count
does not change for 3 iterations consecutively before considering
the wait completed.
@endorama endorama requested a review from marclop January 9, 2025 16:05
Copy link
Contributor

@marclop marclop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, and still a bit unsure where this pro convention came from.

matrix:
environment:
- 'qa'
- 'pro'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer prd (like the MKI cluster names) or prod rather than using pro, which I've never seen used anywhere. Is it just me?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I share the preference for another acronym, I supposed this could not be changed but let's check with @v1v . Is this something that can be changed? My understanding is that pro is part of the secret name we use in the pipeline, but I'm not sure if it's used elsewhere or it's a secret specific for this pipeline.

If it's specific, could we update it to prd or production?


t.Logf("created deployment %s", escfg.KibanaURL)

ac, err := esclient.New(escfg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this called ac? Shouldn't it be called ec, esc or esClient?

t.Logf("time elapsed: %s", time.Now().Sub(start))

// check ES logs, there should be no errors
// TODO: how to get these from Elastic Cloud? Is it possible?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could set up monitoring to self and the ES logs should end up in an index in this same cluster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 great idea, will look into it.


// docsPerDatastream is a utility type to map esclient.ApmDocCount to a format
// easier to perform assertions on.
type docsPerDatastream map[string]int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type isn't used anywhere

Comment on lines +40 to +41
const testRegionQA = "aws-eu-west-1"
const testRegionProduction = "eu-west-1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a switch case below, I think it may be easier to follow the code trail if these were defined in the return statement, rather than standalone constants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants