v3.5.0: Server OOM killed when fetching archived workflows #12872

Closed
3 of 4 tasks
vanny96 opened this issue Apr 2, 2024 · 4 comments
Labels
  • area/api: Argo Server API
  • P1: High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority
  • solution/duplicate: This issue or PR is a duplicate of an existing one
  • solution/outdated: This is not up-to-date with the current version
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@vanny96
Contributor

vanny96 commented Apr 2, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

When the Workflow Archive feature is enabled, Argo Workflows queries the DB and tries to fetch all of the archived workflows at once.
When a high number of workflows is persisted in the DB (I experienced this with >100000 workflows in argo_archived_workflows), the application struggles, and this can lead to a considerable increase in the RAM needed to run.

Sometimes this is preceded by the attached log, which points to a pretty clear issue: there is no limit set on the query.

While the workflowArchiveServer interface seems to have some logic to handle limits, that logic seems to have been disabled by the following snippet at server/workflow/workflow_server.go#228:

	// Search whole with Limit 0.
	// Reset the Continue "0" to prevent archive workflow pagination.
	options.Continue = "0"
	options.Limit = 0
	archivedWfList, err := s.wfArchiveServer.ListArchivedWorkflows(ctx, &workflowarchivepkg.ListArchivedWorkflowsRequest{
		ListOptions: options,
		NamePrefix:  "",
		Namespace:   req.Namespace,
	})

Here the limit is hardcoded to 0, which means no limit is set at all.

Requests

  • Set the limit based on the "Results per page" set on the UI.
  • Add an option to disable the rendering of archived workflows from the web server, making the call to the DB optional
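
A rough sketch of what these two requests could look like, in Go. Everything here is an assumption for illustration: `effectiveLimit`, `listConfig`, `IncludeArchived`, `listWorkflows`, and the default and maximum page sizes are hypothetical names and values, not existing Argo options. It also ignores ordering and pagination across the live and archived sources.

```go
package main

import "fmt"

// Hypothetical bounds; the issue does not specify any values.
const (
	defaultPageSize = 20
	maxPageSize     = 500
)

// listConfig is a hypothetical server option set, not an existing Argo config.
type listConfig struct {
	IncludeArchived bool // when false, the archive DB is never queried
}

// effectiveLimit maps the "Results per page" value sent by the UI
// to a bounded limit, so that 0 never means "fetch everything".
func effectiveLimit(requested int64) int64 {
	switch {
	case requested <= 0:
		return defaultPageSize
	case requested > maxPageSize:
		return maxPageSize
	default:
		return requested
	}
}

// listWorkflows returns live workflows and, only if enabled, archived ones.
// The fetch functions stand in for the Kubernetes and DB calls.
func listWorkflows(cfg listConfig, requested int64,
	fetchLive, fetchArchived func(limit int64) []string) []string {
	limit := effectiveLimit(requested)
	out := fetchLive(limit)
	if cfg.IncludeArchived {
		out = append(out, fetchArchived(limit)...)
	}
	return out
}

func main() {
	live := func(limit int64) []string { return []string{"live-wf"} }
	archived := func(limit int64) []string {
		return []string{fmt.Sprintf("archived-wf (limit %d)", limit)}
	}
	fmt.Println(listWorkflows(listConfig{IncludeArchived: true}, 0, live, archived))
	fmt.Println(listWorkflows(listConfig{IncludeArchived: false}, 50, live, archived))
}
```

Clamping the limit keeps a single request bounded even if a client sends 0, and the toggle makes the DB call optional for installations that do not need archived workflows in the list view.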

Version

3.5.0

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Not relevant

Logs from the workflow controller

Query:          SELECT "workflow" FROM "argo_archived_workflows" WHERE (("clustername" = $1 AND "instanceid" = $2) AND "namespace" = $3 AND not exists (select 1 from argo_archived_workflows_labels where clustername = argo_archived_workflows.clustername and uid = argo_archived_workflows.uid and name = 'workflows.argoproj.io/controller-instanceid')) ORDER BY "startedat" DESC
	Arguments:      []interface {}{"default", "", "namespace"}
	Stack:          
		fmt.(*pp).handleMethods@/usr/local/go/src/fmt/print.go:673
		fmt.(*pp).printArg@/usr/local/go/src/fmt/print.go:756
		fmt.(*pp).doPrint@/usr/local/go/src/fmt/print.go:1211
		fmt.Append@/usr/local/go/src/fmt/print.go:289
		log.(*Logger).Print.func1@/usr/local/go/src/log/log.go:261
		log.(*Logger).output@/usr/local/go/src/log/log.go:238
		log.(*Logger).Print@/usr/local/go/src/log/log.go:260
		github.com/argoproj/argo-workflows/v3/persist/sqldb.(*workflowArchive).ListWorkflows@/go/src/github.com/argoproj/argo-workflows/persist/sqldb/workflow_archive.go:172
		github.com/argoproj/argo-workflows/v3/server/workflowarchive.(*archivedWorkflowServer).ListArchivedWorkflows@/go/src/github.com/argoproj/argo-workflows/server/workflowarchive/archived_workflow_server.go:136
		github.com/argoproj/argo-workflows/v3/server/workflow.(*workflowServer).ListWorkflows@/go/src/github.com/argoproj/argo-workflows/server/workflow/workflow_server.go:232
		github.com/argoproj/argo-workflows/v3/pkg/apiclient/workflow._WorkflowService_ListWorkflows_Handler.func1@/go/src/github.com/argoproj/argo-workflows/pkg/apiclient/workflow/workflow.pb.go:1826
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.RatelimitUnaryServerInterceptor.func5@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:65
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25
		github.com/argoproj/argo-workflows/v3/server/auth.(*gatekeeper).UnaryServerInterceptor.func1@/go/src/github.com/argoproj/argo-workflows/server/auth/gatekeeper.go:98
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25
		github.com/argoproj/argo-workflows/v3/util/grpc.glob..func1@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:45
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.PanicLoggerUnaryServerInterceptor.func4@/go/src/github.com/argoproj/argo-workflows/util/grpc/interceptor.go:26
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25
		github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31
		github.com/argoproj/argo-workflows/v3/server/apiserver.(*argoServer).newGRPCServer.ChainUnaryServer.func6.1.1@/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25
	Error:          upper: slow query
	Time taken:     0.21139s
	Context:        context.Background

Logs from your workflow's wait container

Not relevant
@eduardodbr
Member

probably duplicate of #12025

@caelan-io caelan-io added the solution/duplicate This issue or PR is a duplicate of an existing one label Apr 2, 2024
@caelan-io
Member

caelan-io commented Apr 2, 2024

Correct, please see #12025 and proposed fix in #12736

@agilgur5 agilgur5 changed the title Argo Workflow Server gets OOM killed when fetching archived workflows Server gets OOM killed when fetching archived workflows Apr 2, 2024
@agilgur5 agilgur5 closed this as completed Apr 2, 2024
@agilgur5
Contributor

agilgur5 commented Apr 2, 2024

  • I can confirm the issue exists when I tested with :latest

Version

3.5.0

  • Set the limit based on the "Results per page" set on the UI.

You're on an old version; this was fixed/reverted in 3.5.1 by #12068.

Please fill out the issue template properly; that is why it asks if you've tested with :latest with a checkbox.

Correct, please see #12025 and proposed fix in #12736

There are still other performance & correctness issues with 3.5.x (e.g. #11715, which is caused by only having a "Results per page" limit), which #12736 aims to solve.

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority area/server area/api Argo Server API and removed area/server labels Apr 2, 2024
@caelan-io
Member

Thanks, @agilgur5 !

@agilgur5 agilgur5 added the solution/outdated This is not up-to-date with the current version label Apr 26, 2024
@agilgur5 agilgur5 changed the title Server gets OOM killed when fetching archived workflows v3.5.0: Server OOM killed when fetching archived workflows Oct 8, 2024
@argoproj argoproj locked as resolved and limited conversation to collaborators Oct 8, 2024