
Decrease linter's memory usage #1194

Merged: 11 commits from linter-mem merged into crossplane-contrib:main on Mar 12, 2024

Conversation

ulucinar
Collaborator

@ulucinar ulucinar commented Mar 6, 2024

Description of your changes

Depends on: upbound/official-providers-ci#187

Historical Context & Problem Statement

The linter jobs in the upbound/provider-aws repository have been facing recurring failures. Initially, these failures were mitigated by switching to larger self-hosted runners (runners with the e2-standard-8 labels in the upbound organization), but the issues resurfaced due to a performance regression in the musttag linter. We then upgraded golangci-lint to v1.54.0, which resolved the regression in that specific linter, but further failures prompted a switch to even larger runners, labeled Ubuntu-Jumbo-Runner, which we currently use only for the linter job. Despite these adjustments, the linter jobs have started failing again, primarily due to high peak memory consumption during linting, with cold analysis-cache runs consuming over 50 GB of peak memory, depending on the number of available CPU cores.

Investigation & Findings

The substantial memory usage was traced back to the linter runner's analysis phase. We considered (and investigated some of) the following potential remediations for the linter issues we've been experiencing:

  • Configuring skip-files to exclude, for instance, the generated files (as we already do for the official provider repos) does not help, because it only skips reporting the issues. All the generated files are still analyzed, and the processing needed to analyze the source code is the memory-consuming part.
  • Bumping the linter runner to the latest version, hoping to benefit from upstream improvements: memory consumption remained high.
  • Trying different sets of enabled linters and examining the resulting memory profiles: this has some potential, but the processing needed for the analysis is common to all linters and probably defines a lower bound on the memory consumed. I've seen ~20 GB of memory consumption even with a single linter configured, and configuring the linter runner with only the default set of linters (errcheck, gosimple, govet, ineffassign, staticcheck, unused) still consumes over 30 GB of memory. One idea could be to partition the current set of enabled linters across multiple jobs and hope that no job hits the memory limit.
  • Employing build tags, which, unlike the skip-files option, actually make the analysis less costly.
  • Switching to even larger runners: ideally we would like to avoid this, especially since we will also be moving these repositories out of the upbound GitHub organization.
  • Decreasing the concurrency of the linter runner: by default, the linter runner uses all the logical cores available to it. Limiting concurrency, as we did for building & pushing the provider families to limit the load on our package registry, may help here at the expense of increased runtime. The common processing phase that runs prior to the analysis phase could be a limiting factor (please see above).
  • Increasing the pressure on the garbage collector: a common technique, at the expense of increased CPU consumption and runtime.
  • Disabling certain linters: we would also like to avoid this, but if we identify a single linter that breaks the memory limit, it might be worth doing.
  • Breaking up the family provider repos that need it: this is problematic because the resource providers of the same family are coupled (common provider configuration, cross-resource references, etc.), although we had to decrease this coupling (specifically the coupling induced by cross-resource references) for multi-version CRD support. There is also the maintenance overhead of the extra repositories. Furthermore, we can achieve a similar effect via build tags (please see above).

Implemented Strategy in this PR

  • Run the linter in two phases:
    1. Employ build constraints (tags) for the initial construction of a linter runner analysis cache and limit the concurrency to 1 when constructing this cache. We avoid the two hotspots in the provider-aws repo during cache construction: API registration in apis/zz_register.go & API group configuration in config/provider.go.
    2. After the cache construction, run the linters on the whole provider codebase with the maximum available concurrency.

The analysis cache expires after 7 days or when the module's go.sum changes. If the analysis cache can be restored successfully, the initial cache construction phase is skipped and only the full linting pass, with the maximum available concurrency, is performed.
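To make the mechanism concrete, here is a generic illustration of how a Go build constraint gates whether a file is compiled, and therefore analyzed, in a given linter invocation. The tag name linter_run is taken from the build tags passed in this PR's Makefile change (quoted further down in the review); the exact headers used in the repository are not reproduced here, so treat this as a sketch rather than a file from the PR.

//go:build linter_run
// +build linter_run

// Package apis: illustrative header only, not a file from this PR. A file
// guarded like this is compiled, and therefore analyzed by golangci-lint,
// only when the linter_run tag is passed (e.g. via --build-tags linter_run),
// while a file carrying the negated constraint //go:build !linter_run is
// excluded from such a run.
package apis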

For generating the build constraints (tags), we use the buildtagger tool. We currently don’t utilize the build tags for building the resource providers in isolation because:

  • We currently don't need it, i.e., the actual build process is not as resource-hungry as linting.
  • It impacts the developer experience: it has implications for IDE settings, requires changes to our build pipelines, etc.
  • We would either need to implement dependency resolution for cross-resource references, which we have so far avoided for the family providers (increased complexity), or require manual maintenance of the dependencies, which is error-prone.

Observed Improvements

In an example run with the two-phase strategy, the cache construction phase consumed a peak memory of ~13 GB and the full linting phase consumed a peak of 24.3 GB, which corresponds to a ~57% reduction in peak memory consumption compared to a single-phase run of the linters on the same machine (an M2 Mac with 12 cores). The total execution time of both phases is ~14m, which is about the same time it takes the linters to run in a single phase (when we ran the linters in a single phase on a cold analysis cache, the peak memory consumption was ~57 GB and the execution time was ~14 min):

Here are results from example runs on an M2 Mac with 12 logical cores and 32 GB of physical memory:

  1. Linter run in single phase (without the proposed initial cache construction phase) on cold analysis cache on the main branch’s Head (commit: fb0fb486e6225cdab27a447c48cb36f98464884e). Linter runner version: v1.55.2:
    Average memory consumption is 40560.9MB, max is 58586.2MB. Execution took 13m49.188064208s.

  2. Linter run in single phase with the analysis cache with the same parameters as above:
    Average memory consumption is 104.5MB, max is 191.1MB. Execution took 7.325796125s.

  3. Two-phase linter run example on cold analysis cache on the main branch’s Head (commit: fb0fb486e6225cdab27a447c48cb36f98464884e). Linter runner version is v1.55.2:
    Average memory consumption in the cache construction phase is 9272.4MB and the peak consumption is 13301.6MB.
    Execution of the first phase took 11m13.023597833s. For the second phase (full linting with all the available CPU cores), the average memory consumption is 9331.7MB and the peak is 24904.9MB. The execution of the second phase took 3m10.447865375s.

The linter job now fits into a (standard) GitHub-hosted runner with the label ubuntu-22.04 (with 16 GB of physical memory & 4 cores). So, in preparation for moving provider-aws out of the upbound GitHub organization, this PR also changes the lint job's runner from Ubuntu-Jumbo-Runner to ubuntu-22.04.

Developer Experience

  • API group configuration is implemented with slight differences:
    • The actual configuration still resides under config/<API group prefix, e.g., acm>/config.go. There are no upjet resource configuration API changes, i.e., the contents of config.go stay the same, and because the build tags for these files are automatically generated, there's no need to manually tag them.
    • Registration of the API group “Configure” functions is still done in config/provider.go, but in a slightly different way. The new registration method is as follows (a hypothetical sketch of the registry type itself appears after this list):
func init() {
	ProviderConfiguration.AddConfig(acm.Configure)
	ProviderConfiguration.AddConfig(acmpca.Configure)
	ProviderConfiguration.AddConfig(apigateway.Configure)
	// …
}
  • make lint can be run with its old semantics. While the linter is running, the generated build tags can be observed in the source tree; if the run is not interrupted, these tags are removed by the make target.
  • make build can be run with its old semantics & behavior.
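For readers who want a mental model of the ProviderConfiguration registry referenced above, the sketch below is an assumption about its shape: the real implementation lives in config/registry.go and may well differ, and the import path for upjet's config package is also an assumption. It only illustrates that AddConfig collects the per-group Configure functions so that they can later be applied to the provider configuration.

package config

import ujconfig "github.com/crossplane/upjet/pkg/config"

// ConfigurationRegistry collects the per-API-group Configure functions that
// are registered from init() in config/provider.go. Illustrative sketch only;
// the actual registry in config/registry.go may differ.
type ConfigurationRegistry struct {
	configs []func(*ujconfig.Provider)
}

// AddConfig registers an API group's Configure function.
func (r *ConfigurationRegistry) AddConfig(c func(*ujconfig.Provider)) {
	r.configs = append(r.configs, c)
}

// Apply invokes every registered Configure function on the provider configuration.
func (r *ConfigurationRegistry) Apply(p *ujconfig.Provider) {
	for _, c := range r.configs {
		c(p)
	}
}

// ProviderConfiguration is the package-level registry targeted by the init()
// functions shown above.
var ProviderConfiguration ConfigurationRegistry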

I have:

  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

[Screenshots attached in the original PR.]

ulucinar added 2 commits March 6, 2024 01:41
- Register API group configuration functions via config.ProviderConfiguration
- Exclude config/provider.go from analysis when building the initial linter cache

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar marked this pull request as draft March 6, 2024 09:52
Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 3 times, most recently from 9299464 to c2a4468 on March 6, 2024 12:35
ulucinar added 2 commits March 7, 2024 17:59
- Remove unused build constraints from scripts/tag.sh

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 6 times, most recently from 6331fda to 2b11833 on March 7, 2024 23:51
- Only initialize the linter cache in CI pipelines

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar marked this pull request as ready for review March 8, 2024 19:53
@ulucinar
Collaborator Author

ulucinar commented Mar 8, 2024

/test-examples="examples/eks/v1beta1/cluster.yaml"

@ulucinar
Collaborator Author

ulucinar commented Mar 8, 2024

/test-examples="examples/ec2/v1beta1/vpc.yaml"

Collaborator

@sergenyalcin sergenyalcin left a comment


Thanks @ulucinar for your great effort. Left a few comments.

Makefile (outdated diff):
@$(INFO) Running golangci-lint with the analysis cache building phase.
@(BUILDTAGGER_DOWNLOAD_URL=$(BUILDTAGGER_DOWNLOAD_URL) ./scripts/tag.sh && \
(([[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] && $(OK) "Skipping analysis cache build phase because it's already been populated") && \
[[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] || $(GOLANGCILINT) run -v --build-tags account,configregistry,configprovider,linter_run -v --concurrency 1 --disable-all --exclude '.*')) || $(FAIL)
Collaborator

nit: Can we consider increasing the Concurrency?

Collaborator Author
@ulucinar ulucinar Mar 11, 2024

I've not done extensive tests here. We already know that the default concurrency of 4 on GitHub-hosted runners does not work for us even when using the build constraints; it still consumes too much memory. Decreasing the concurrency to 1 without employing the build constraints also did not allow us to fit the linter into a hosted runner with 16 GB of physical memory. But using the build constraints and limiting the concurrency to 1 made it possible to fit onto a 16 GB machine. So, it looks like we can try increasing the concurrency to 2 or 3.

The full linting is still done with the maximum available concurrency (in our case, with a concurrency of 4) with the constructed analysis cache. This concurrency limitation is only for the first cache construction phase.

Let me try with a concurrency of 2 and 3. We were initially doing our observations with a relatively large resource provider (ec2); switching to account may have helped.

Collaborator Author
@ulucinar ulucinar Mar 12, 2024

A concurrency of 4 for the initial analysis cache build phase on cold cache has failed here:
https://github.com/upbound/provider-aws/actions/runs/8246986803/job/22554209938?pr=1194

Collaborator Author

Here's a run with a concurrency of 1. It took ~21 min and ~14 GB of peak memory to complete the initial phase. [screenshot attached in the original PR]

And here's another run with a concurrency of 3. It took ~14 min and ~19 GB of peak memory to complete the initial phase. [screenshot attached in the original PR]

Let's switch to a concurrency of 3 for the initial phase (concurrency of 4 has failed). The second phase already uses the max available concurrency.

If we observe any memory issues in the initial phase, we can decrease it again later.

Collaborator

If I remember correctly, this file was brought up in the context of the fragmentation of the registration phase. It will be needed when we want to add all the resources to the scheme. I can't find any use of this function in the repo. Am I missing something, or do we really need this file?

Collaborator Author

As explained here, the code that the linter runner is processing (analyzing) must compile. The API registration code (apis/zz_register.go) is one of the hot spots: it imports all the API packages and is thus costly to analyze. apis/linter_run.go keeps the linter happy by providing definitions of apis.AddToScheme & apis.AddToSchemes, while the original definitions from apis/zz_register.go are excluded via the build constraints supplied to the linter. So, the definitions in this file are only used by the linter, and the actual build uses the definitions from apis/zz_register.go.

This also means that the generated file apis/zz_register.go is not analyzed. We could in theory distribute API registration across the individual resource provider (API group) modules, but the cross-resource references necessitate a form of dependency resolution, as discussed under the "Implemented Strategy in this PR" section of the PR description. So avoiding these hot spots at the expense of not linting them is a trade-off we make here.
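To illustrate the trick for other contributors, a linter-only stub along the following lines is enough to keep the analyzed code compiling. This is a sketch that assumes the stub mirrors the usual runtime.SchemeBuilder pattern of the generated zz_register.go; the actual apis/linter_run.go and the exact build constraint it carries may differ.

//go:build linter_run

// Package apis: linter-only stand-ins for the registration symbols normally
// defined in the generated apis/zz_register.go, which is excluded from the
// analysis via build constraints. Sketch only; the real file may differ.
package apis

import "k8s.io/apimachinery/pkg/runtime"

// AddToSchemes mirrors the variable defined in zz_register.go without
// importing every API group package.
var AddToSchemes runtime.SchemeBuilder

// AddToScheme adds all registered types to the scheme, like the generated
// counterpart, but over the (empty) linter-only builder above.
func AddToScheme(s *runtime.Scheme) error {
	return AddToSchemes.AddToScheme(s)
}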

Collaborator

Gotcha! Thanks for the clarification. This part can also be a bit tricky for other contributors, so I am leaving this comment unresolved for documentation purposes.

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 2 times, most recently from 555cb40 to 890f5fb on March 12, 2024 08:52
Collaborator
@sergenyalcin sergenyalcin left a comment

Thanks @ulucinar LGTM!

@ulucinar ulucinar merged commit 46839bd into crossplane-contrib:main Mar 12, 2024
11 checks passed
@ulucinar ulucinar deleted the linter-mem branch March 12, 2024 11:49
@ulucinar ulucinar mentioned this pull request Mar 15, 2024