
Decrease linter's memory usage #1194

Merged: 11 commits from linter-mem merged into crossplane-contrib:main on Mar 12, 2024

Conversation

ulucinar
Collaborator

@ulucinar ulucinar commented Mar 6, 2024

Description of your changes

Depends on: upbound/official-providers-ci#187

Historical Context & Problem Statement

The linter jobs in the upbound/provider-aws repository have been facing recurring failures. Initially, these failures were mitigated by switching to larger self-hosted runners (runners with the e2-standard-8 labels in the upbound organization), but the issues resurfaced due to a performance regression in the musttag linter. We then upgraded golangci-lint to v1.54.0, which resolved the regression in that specific linter, but further failures prompted a switch to even larger runners, labeled Ubuntu-Jumbo-Runner, which we currently use only for the linter job. Despite these adjustments, the linter jobs have started failing again, primarily due to high peak memory consumption during linting, with cold analysis-cache runs consuming over 50 GB of peak memory, depending on the number of available CPU cores.

Investigation & Findings

The substantial memory usage was traced back to the linter runner's analysis phase. We considered (and investigated some of) the following potential remediations for the linter issues we've been experiencing:

  • Configuring skip-files to exclude, for instance, the generated files (as we already do for the official provider repos) does not help, because it only skips reporting the issues. All the generated files are still analyzed, and the processing needed to analyze the source code is the memory-consuming part.
  • Bumping the linter runner to the latest version, hoping to benefit from upstream improvements: memory consumption remained high.
  • Trying different sets of enabled linters and examining the resulting memory profiles: this has some potential, but the processing needed for the analysis is common to all linters and probably defines a lower bound on the memory consumed. I've seen ~20 GB of memory consumption even with a single linter configured, and configuring the linter runner with only the default set of linters (errcheck, gosimple, govet, ineffassign, staticcheck, unused) still consumes over 30 GB of memory. One idea could be to partition the current set of enabled linters across multiple jobs and hope that no job hits the memory limit.
  • Employing build tags, which, unlike the skip-files option, actually make the analysis less costly.
  • Switching to even larger runners: ideally we would like to avoid this, especially since we will also be moving these repositories out of the upbound GitHub organization.
  • Decreasing the concurrency of the linter runner: by default, the linter runner uses all the logical cores available to it. Limiting concurrency, as we did for building & pushing the provider families to limit the load on our package registry, may help here at the expense of increased runtime. The common processing phase that runs prior to the analysis phase could be a limiting factor (please see above).
  • Increasing the pressure on the garbage collector: a common technique, at the expense of increased CPU consumption and runtime.
  • Disabling certain linters: we would also like to avoid this, but if we identify a single linter that breaks the memory limit, it might be worth doing.
  • Breaking up the family provider repos that need it: this is problematic because the resource providers of the same family are coupled (common provider configuration, cross-resource references, etc.), although we had to decrease this coupling (specifically the coupling induced by cross-resource references) for multi-version CRD support. There is also the maintenance overhead of the extra repositories. Furthermore, we can achieve a similar effect via build tags (please see above).

Implemented Strategy in this PR

  • Run the linter in two phases:
    1. Employ build constraints (tags) for the initial construction of a linter runner analysis cache and limit the concurrency to 1 when constructing this cache. We avoid the two hotspots in the provider-aws repo during cache construction: API registration in apis/zz_register.go & API group configuration in config/provider.go.
    2. After the cache construction, run the linters on the whole provider codebase with the maximum available concurrency.

The analysis cache expires after 7 days or when the module's go.sum changes. If the analysis cache can be restored successfully, the initial cache construction phase is skipped and only the full linting pass, with the maximum available concurrency, is performed.
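To make the mechanism concrete, here is a generic illustration of how a Go build constraint gates whether a file is compiled, and therefore analyzed, in a given linter invocation. The tag name linter_run is taken from the build tags passed in this PR's Makefile change (quoted further down in the review); the exact headers used in the repository are not reproduced here, so treat this as a sketch rather than a file from the PR.

//go:build linter_run
// +build linter_run

// Package apis: illustrative header only, not a file from this PR. A file
// guarded like this is compiled, and therefore analyzed by golangci-lint,
// only when the linter_run tag is passed (e.g. via --build-tags linter_run),
// while a file carrying the negated constraint //go:build !linter_run is
// excluded from such a run.
package apis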

For generating the build constraints (tags), we use the buildtagger tool. We currently don’t utilize the build tags for building the resource providers in isolation because:

  • We currently don't need it, i.e., the actual build process is not as resource-hungry as linting.
  • It impacts the developer experience: it has implications for IDE settings, requires changes to our build pipelines, etc.
  • We would either need to implement dependency resolution for cross-resource references, which we have so far avoided for the family providers (increased complexity), or require manual maintenance of the dependencies, which is error-prone.

Observed Improvements

In an example run with the two-phase strategy, the cache construction phase consumed a peak memory of ~13 GB and the full linting phase consumed a peak of 24.3 GB, which corresponds to a ~57% reduction in peak memory consumption compared to a single-phase run of the linters on the same machine (an M2 Mac with 12 cores). The total execution time of both phases is ~14m, which is about the same time it takes the linters to run in a single phase (when we ran the linters in a single phase on a cold analysis cache, the peak memory consumption was ~57 GB and the execution time was ~14 min):

Here are results from example runs on an M2 Mac with 12 logical cores and 32 GB of physical memory:

  1. Linter run in single phase (without the proposed initial cache construction phase) on cold analysis cache on the main branch’s Head (commit: fb0fb486e6225cdab27a447c48cb36f98464884e). Linter runner version: v1.55.2:
    Average memory consumption is 40560.9MB, max is 58586.2MB. Execution took 13m49.188064208s.

  2. Linter run in single phase with the analysis cache with the same parameters as above:
    Average memory consumption is 104.5MB, max is 191.1MB. Execution took 7.325796125s.

  3. Two-phase linter run example on cold analysis cache on the main branch’s Head (commit: fb0fb486e6225cdab27a447c48cb36f98464884e). Linter runner version is v1.55.2:
    Average memory consumption in the cache construction phase is 9272.4MB and the peak consumption is 13301.6MB.
    Execution of the first phase took 11m13.023597833s. For the second phase (full linting with all the available CPU cores), the average memory consumption is 9331.7MB and the peak is 24904.9MB. The execution of the second phase took 3m10.447865375s.

The linter job now fits into a (standard) GitHub-hosted runner with the label ubuntu-22.04 (with 16 GB of physical memory & 4 cores). So, in preparation for moving provider-aws out of the upbound GitHub organization, this PR also changes the lint job's runner from Ubuntu-Jumbo-Runner to ubuntu-22.04.

Developer Experience

  • API group configuration is implemented with slight differences:
    • The actual configuration still resides under config/<API group prefix, e.g., acm>/config.go. There are no upjet resource configuration API changes, i.e., the contents of config.go stay the same, and because the build tags for these files are automatically generated, there's no need to manually tag them.
    • Registration of the API group “Configure” functions is still done in config/provider.go, but in a slightly different way. The new registration method is as follows (a hypothetical sketch of the registry type itself appears after this list):
func init() {
	ProviderConfiguration.AddConfig(acm.Configure)
	ProviderConfiguration.AddConfig(acmpca.Configure)
	ProviderConfiguration.AddConfig(apigateway.Configure)
	// …
}
  • make lint can be run with its old semantics. While the linter is running, the generated build tags can be observed in the source tree; if the run is not interrupted, these tags are removed by the make target.
  • make build can be run with its old semantics & behavior.
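For readers who want a mental model of the ProviderConfiguration registry referenced above, the sketch below is an assumption about its shape: the real implementation lives in config/registry.go and may well differ, and the import path for upjet's config package is also an assumption. It only illustrates that AddConfig collects the per-group Configure functions so that they can later be applied to the provider configuration.

package config

import ujconfig "github.com/crossplane/upjet/pkg/config"

// ConfigurationRegistry collects the per-API-group Configure functions that
// are registered from init() in config/provider.go. Illustrative sketch only;
// the actual registry in config/registry.go may differ.
type ConfigurationRegistry struct {
	configs []func(*ujconfig.Provider)
}

// AddConfig registers an API group's Configure function.
func (r *ConfigurationRegistry) AddConfig(c func(*ujconfig.Provider)) {
	r.configs = append(r.configs, c)
}

// Apply invokes every registered Configure function on the provider configuration.
func (r *ConfigurationRegistry) Apply(p *ujconfig.Provider) {
	for _, c := range r.configs {
		c(p)
	}
}

// ProviderConfiguration is the package-level registry targeted by the init()
// functions shown above.
var ProviderConfiguration ConfigurationRegistry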

I have:

  • Run make reviewable test to ensure this PR is ready for review.

How has this code been tested

[Screenshots attached in the original PR.]

ulucinar added 2 commits March 6, 2024 01:41
- Register API group configuration functions via config.ProviderConfiguration
- Exclude config/provider.go from analysis when building the initial linter cache

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar marked this pull request as draft March 6, 2024 09:52
Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 3 times, most recently from 9299464 to c2a4468 on March 6, 2024 12:35
ulucinar added 2 commits March 7, 2024 17:59
- Remove unused build constraints from scripts/tag.sh

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 6 times, most recently from 6331fda to 2b11833 on March 7, 2024 23:51
- Only initialize the linter cache in CI pipelines

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar marked this pull request as ready for review March 8, 2024 19:53
@ulucinar
Collaborator Author

ulucinar commented Mar 8, 2024

/test-examples="examples/eks/v1beta1/cluster.yaml"

@ulucinar
Collaborator Author

ulucinar commented Mar 8, 2024

/test-examples="examples/ec2/v1beta1/vpc.yaml"

Collaborator

@sergenyalcin sergenyalcin left a comment


Thanks @ulucinar for your great effort. Left a few comments.

Makefile (outdated diff):
@$(INFO) Running golangci-lint with the analysis cache building phase.
@(BUILDTAGGER_DOWNLOAD_URL=$(BUILDTAGGER_DOWNLOAD_URL) ./scripts/tag.sh && \
(([[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] && $(OK) "Skipping analysis cache build phase because it's already been populated") && \
[[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] || $(GOLANGCILINT) run -v --build-tags account,configregistry,configprovider,linter_run -v --concurrency 1 --disable-all --exclude '.*')) || $(FAIL)
Collaborator

nit: Can we consider increasing the Concurrency?

Collaborator Author
@ulucinar ulucinar Mar 11, 2024

I've not done extensive tests here. We already know that the default concurrency of 4 on GitHub-hosted runners does not work for us even when using the build constraints; it still consumes too much memory. Decreasing the concurrency to 1 without employing the build constraints also did not allow us to fit the linter into a hosted runner with 16 GB of physical memory. But using the build constraints and limiting the concurrency to 1 made it possible to fit onto a 16 GB machine. So, it looks like we can try increasing the concurrency to 2 or 3.

The full linting is still done with the maximum available concurrency (in our case, with a concurrency of 4) with the constructed analysis cache. This concurrency limitation is only for the first cache construction phase.

Let me try with a concurrency of 2 and 3. We were initially doing our observations with a relatively large resource provider (ec2); switching to account may have helped.

Collaborator Author
@ulucinar ulucinar Mar 12, 2024

A concurrency of 4 for the initial analysis cache build phase on cold cache has failed here:
https://github.com/upbound/provider-aws/actions/runs/8246986803/job/22554209938?pr=1194

Collaborator Author

Here's a run with a concurrency of 1. It took ~21 min and ~14 GB of peak memory to complete the initial phase. [screenshot attached in the original PR]

And here's another run with a concurrency of 3. It took ~14 min and ~19 GB of peak memory to complete the initial phase. [screenshot attached in the original PR]

Let's switch to a concurrency of 3 for the initial phase (concurrency of 4 has failed). The second phase already uses the max available concurrency.

If we observe any memory issues in the initial phase, we can decrease it again later.

Collaborator

If I remember correctly, this file was brought up in the context of the fragmentation of the registration phase. It will be needed when we want to add all the resources to the scheme. I can't find any use of this function in the repo. Am I missing something, or do we really need this file?

Collaborator Author

As explained here, the code that the linter runner is processing (analyzing) must compile. The API registration code (apis/zz_register.go) is one of the hot spots: it imports all the API packages and is thus costly to analyze. apis/linter_run.go keeps the linter happy by providing definitions of apis.AddToScheme & apis.AddToSchemes, while the original definitions from apis/zz_register.go are excluded via the build constraints supplied to the linter. So, the definitions in this file are only used by the linter, and the actual build uses the definitions from apis/zz_register.go.

This also means that the generated file apis/zz_register.go is not analyzed. We could in theory distribute API registration across the individual resource provider (API group) modules, but the cross-resource references necessitate a form of dependency resolution, as discussed under the "Implemented Strategy in this PR" section of the PR description. So avoiding these hot spots at the expense of not linting them is a trade-off we make here.
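To illustrate the trick for other contributors, a linter-only stub along the following lines is enough to keep the analyzed code compiling. This is a sketch that assumes the stub mirrors the usual runtime.SchemeBuilder pattern of the generated zz_register.go; the actual apis/linter_run.go and the exact build constraint it carries may differ.

//go:build linter_run

// Package apis: linter-only stand-ins for the registration symbols normally
// defined in the generated apis/zz_register.go, which is excluded from the
// analysis via build constraints. Sketch only; the real file may differ.
package apis

import "k8s.io/apimachinery/pkg/runtime"

// AddToSchemes mirrors the variable defined in zz_register.go without
// importing every API group package.
var AddToSchemes runtime.SchemeBuilder

// AddToScheme adds all registered types to the scheme, like the generated
// counterpart, but over the (empty) linter-only builder above.
func AddToScheme(s *runtime.Scheme) error {
	return AddToSchemes.AddToScheme(s)
}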

Collaborator

Gotcha! Thanks for the clarification. This part can also be a bit tricky for other contributors, so I am leaving this comment unresolved for documentation purposes.

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
@ulucinar ulucinar force-pushed the linter-mem branch 2 times, most recently from 555cb40 to 890f5fb on March 12, 2024 08:52
Collaborator
@sergenyalcin sergenyalcin left a comment

Thanks @ulucinar LGTM!

@ulucinar ulucinar merged commit 46839bd into crossplane-contrib:main Mar 12, 2024
11 checks passed
@ulucinar ulucinar deleted the linter-mem branch March 12, 2024 11:49
@ulucinar ulucinar mentioned this pull request Mar 15, 2024