Decrease linter's memory usage #1194
Conversation
Commits:
- Register API group configuration functions via config.ProviderConfiguration
- Exclude config/provider.go from analysis when building the initial linter cache
- Remove unused build constraints from scripts/tag.sh
- Only initialize the linter cache in CI pipelines
(Signed-off-by: Alper Rifat Ulucinar <[email protected]>)

/test-examples="examples/eks/v1beta1/cluster.yaml"
/test-examples="examples/ec2/v1beta1/vpc.yaml"
Thanks @ulucinar for your great effort. Left a few comments.
Makefile (outdated)
```make
@$(INFO) Running golangci-lint with the analysis cache building phase.
@(BUILDTAGGER_DOWNLOAD_URL=$(BUILDTAGGER_DOWNLOAD_URL) ./scripts/tag.sh && \
	(([[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] && $(OK) "Skipping analysis cache build phase because it's already been populated") && \
	[[ "${SKIP_LINTER_ANALYSIS}" == "true" ]] || $(GOLANGCILINT) run -v --build-tags account,configregistry,configprovider,linter_run -v --concurrency 1 --disable-all --exclude '.*')) || $(FAIL)
```
nit: Can we consider increasing the concurrency?
I've not done extensive tests here. We already know that the default concurrency of 4 on GitHub-hosted runners is not working for us: even when using the build constraints, it still consumes too much memory. Decreasing the concurrency to 1 without employing the build constraints also did not allow us to fit the linter into a hosted runner with 16 GB of physical memory. But using the build constraints and limiting the concurrency to 1 made it possible to fit onto a 16 GB machine. So it looks like we can try increasing the concurrency to 2 or 3.
The full linting is still done with the maximum available concurrency (in our case, a concurrency of 4) using the constructed analysis cache. The concurrency limitation applies only to the initial cache construction phase.
Let me try with a concurrency of 2 and 3. We were initially making our observations with a relatively large resource provider (ec2). Switching to account may have helped.
A concurrency of 4 for the initial analysis cache build phase on a cold cache failed here:
https://github.com/upbound/provider-aws/actions/runs/8246986803/job/22554209938?pr=1194
Here's a run with a concurrency of 1. It took ~21 min and ~14 GB of peak memory to complete the initial phase:
And here's another run with a concurrency of 3. It took ~14 min and ~19 GB of peak memory to complete the initial phase:
Let's switch to a concurrency of 3 for the initial phase (a concurrency of 4 failed). The second phase already uses the maximum available concurrency.
If we observe any memory issues in the initial phase, we can decrease it again later.
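For context, here is a sketch of one way to reproduce this kind of peak-memory measurement locally (not necessarily how the numbers above were collected), using the time utility's maximum resident set size reporting; the golangci-lint flags simply mirror the Makefile snippet quoted earlier:

```sh
# Peak memory (max RSS) of the cache-build phase, Linux (GNU time):
/usr/bin/time -v golangci-lint run --concurrency 3 \
  --build-tags account,configregistry,configprovider,linter_run \
  --disable-all --exclude '.*' 2>&1 | grep "Maximum resident set size"

# macOS (BSD time, value reported in bytes):
/usr/bin/time -l golangci-lint run --concurrency 3 \
  --build-tags account,configregistry,configprovider,linter_run \
  --disable-all --exclude '.*' 2>&1 | grep "maximum resident set size"
```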
apis/linter_run.go (outdated)
If I remember correctly, this file was brought up in the context of the fragmentation of the registration phase. It will be needed when we want to add all the resources to the schema. I can't find any use of this function in the repo. Am I missing something, or do we really need this file?
As explained here, the code that the linter runner is processing (analyzing) must compile. The API registration code (apis/zz_register.go) is one of the hot spots: it imports all the API packages and is thus costly to analyze. apis/linter_run.go keeps the linter happy by providing definitions of apis.AddToScheme & apis.AddToSchemes, while the original definitions from apis/zz_register.go are excluded by the build constraints supplied to the linter. So the definitions in this file are only used by the linter; the actual build uses the definitions from apis/zz_register.go.

This also means that the generated file apis/zz_register.go is not analyzed. We could in theory distribute API registration across the individual resource provider (API group) modules, but the cross-resource references necessitate a form of dependency resolution, as discussed under the "Implemented Strategy in this PR" section of the PR description. So avoiding these hot spots at the expense of not linting them is a trade-off we make here.
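For illustration, here is a minimal sketch of what such a build-tag-gated stub could look like, assuming the linter_run tag from the Makefile snippet above gates the stub and the generated apis/zz_register.go carries the negated constraint; the actual file in the PR may differ:

```go
//go:build linter_run

// A stand-in for the generated registrations in apis/zz_register.go: it gives the
// linter compilable definitions of AddToScheme and AddToSchemes without importing
// every generated API package. Normal builds (without the linter_run tag) use the
// real definitions from apis/zz_register.go instead.
package apis

import "k8s.io/apimachinery/pkg/runtime"

// AddToSchemes collects the scheme registration functions. It is deliberately
// left empty so that no API group packages are pulled into the analysis.
var AddToSchemes runtime.SchemeBuilder

// AddToScheme adds all types registered in AddToSchemes to s.
func AddToScheme(s *runtime.Scheme) error {
	return AddToSchemes.AddToScheme(s)
}
```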
Gotcha! Thanks for the clarification. This part can also be a bit tricky for other contributors. So, I am leaving this comment unresolved for documenting purposes.
Thanks @ulucinar LGTM!
Description of your changes
Depends on: upbound/official-providers-ci#187
Historical Context & Problem Statement
The linter jobs in the upbound/provider-aws repository have been facing recurring failures. Initially, these failures were mitigated by switching to larger self-hosted runners (runners with the e2-standard-8 labels in the upbound organization), but the issues resurfaced due to a performance regression in the musttag linter. Subsequently, we upgraded the golangci-lint version to v1.54.0, which resolved the performance regression with that specific linter. Later, however, we encountered further challenges, prompting a switch to even larger runners, labeled Ubuntu-Jumbo-Runner, which we currently use only for the linter job. Despite these adjustments, the linter jobs have started failing again, primarily due to the high peak memory consumption during linting, with cold analysis cache runs consuming over 50 GB of peak memory, depending on the number of available CPU cores.

Investigation & Findings
The substantial memory usage was traced back to the linter runner's analysis phase. We considered (and investigated some of) the following potential remediations for the linter issues we've been experiencing:
- Using skip-files to exclude, for instance, the generated files (as we do right now for the official provider repos) does not help, because it just skips reporting the issues. All the generated files are still analyzed, and the processing needed to analyze the source code is the memory-consuming part. So it does not help us.
- … only the default set of linters (errcheck, gosimple, govet, ineffassign, staticcheck, unused) still consumes over 30 GB of memory. An idea could be to partition the current set of enabled linters across multiple jobs and hope no job will hit the memory limits.
- … skip-files argument.
- … upbound GitHub organization.

Implemented Strategy in this PR
We exclude the two hot spots of the provider-aws repo during cache construction: API registration in apis/zz_register.go & API group configuration in config/provider.go.

The analysis cache expires in 7 days or when the module's go.sum changes. If the analysis cache can successfully be restored, the initial cache construction phase is skipped and only full linting with the maximum available concurrency is performed.

For generating the build constraints (tags), we use the buildtagger tool. We currently don't utilize the build tags for building the resource providers in isolation because: …
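As a rough illustration of the resulting two-phase lint flow, here is a sketch assembled from the Makefile snippet quoted in the review thread above; the actual make target wires this up via SKIP_LINTER_ANALYSIS and the buildtagger script, and the exact flags may differ:

```sh
# Phase 1: populate the golangci-lint analysis cache only. All linters are
# disabled and every issue is excluded, so this run exists purely to build the
# cache. The build tags keep the excluded hot spots (apis/zz_register.go,
# config/provider.go) out of the analysis, and the concurrency is capped to
# bound peak memory on 16 GB runners.
golangci-lint run -v \
  --build-tags account,configregistry,configprovider,linter_run \
  --concurrency 3 \
  --disable-all --exclude '.*'

# Phase 2: the regular full lint run. With the analysis cache warm, it can use
# the maximum available concurrency (golangci-lint defaults to the CPU count).
golangci-lint run -v
```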
Observed Improvements
In an example run with the two-phase strategy, the cache construction phase consumed a peak memory of ~13 GB and the full linting phase consumed a peak of ~24.3 GB, which corresponds to a ~57% reduction in peak memory consumption compared to a single-phase run of the linters on the same machine (an M2 Mac with 12 cores). The total execution time of both phases is ~14 min, which is about the same time it takes the linters to run in a single phase (when we run the linters in a single phase on a cold analysis cache, the peak memory consumption was ~57 GB and the execution time was ~14 min).

Here are results from example runs on an M2 Mac with 12 logical cores and 32 GB of physical memory:
- Linter run in a single phase (without the proposed initial cache construction phase) on a cold analysis cache, on the main branch's HEAD (commit fb0fb486e6225cdab27a447c48cb36f98464884e), linter runner version v1.55.2: average memory consumption is 40560.9 MB, max is 58586.2 MB. Execution took 13m49.188064208s.
- Linter run in a single phase with the analysis cache, with the same parameters as above: average memory consumption is 104.5 MB, max is 191.1 MB. Execution took 7.325796125s.
- Two-phase linter run on a cold analysis cache, on the main branch's HEAD (commit fb0fb486e6225cdab27a447c48cb36f98464884e), linter runner version v1.55.2: average memory consumption in the cache construction phase is 9272.4 MB and the peak is 13301.6 MB; execution of the first phase took 11m13.023597833s. For the second phase (full linting with all the available CPU cores), the average memory consumption is 9331.7 MB and the peak is 24904.9 MB; execution of the second phase took 3m10.447865375s.
The linter job now fits into a standard GitHub-hosted runner with the label ubuntu-22.04 (16 GB of physical memory & 4 cores). So, in preparation for moving provider-aws out of the upbound GitHub organization, this PR also changes the lint job's runner from Ubuntu-Jumbo-Runner to ubuntu-22.04.
Developer Experience
- Resource configuration still lives in config/<API group prefix, e.g., acm>/config.go. There are no upjet resource configuration API changes, i.e., the contents of config.go stay the same, and because the build tags for these files are automatically generated, there's no need to manually tag these files.
- API group configuration functions are still registered via config/provider.go, but in a slightly different way, through config.ProviderConfiguration; see the hypothetical sketch after this list for the general shape of the new registration method.
- make lint can be run with its old semantics. While the linter is running, the generated build tags will be observed in the source tree. If uninterrupted, these tags will be removed by the make target.
- make build can be run with its old semantics & behavior.
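The registration example from the original description is not reproduced in this excerpt. Purely as a hypothetical, self-contained sketch of slice-based registration through something like config.ProviderConfiguration (the type, field, and function names below are placeholders, not the PR's actual code):

```go
// Package config: a hypothetical sketch of registering API group configuration
// functions through a package-level slice. The real repository exposes
// config.ProviderConfiguration; the exact types and names here are placeholders.
package config

// Provider stands in for the provider configuration being built
// (in the real repository this is the upjet config.Provider).
type Provider struct {
	// resource configurations, cross-resource reference resolvers, etc.
}

// GroupConfigureFn is a per-API-group configuration function.
type GroupConfigureFn func(p *Provider)

// ProviderConfiguration collects the configuration functions contributed by the
// individual API group packages (each guarded by its generated build tag).
var ProviderConfiguration []GroupConfigureFn

// applyGroupConfigurations runs every registered API group configuration.
func applyGroupConfigurations(p *Provider) {
	for _, configure := range ProviderConfiguration {
		configure(p)
	}
}
```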
I have:
- Run make reviewable test to ensure this PR is ready for review.
to ensure this PR is ready for review.How has this code been tested
Cluster.eks: https://github.com/upbound/provider-aws/actions/runs/8209193860