fix(server): mutex calls to sqlitex #13166

Merged: 1 commit, Jun 13, 2024

Joibel
Member

@Joibel Joibel commented Jun 11, 2024

Fixes #13154
Fixes #13140

Motivation

zombiezen/go-sqlite is not thread-safe when used through a single connection. The current code is provably racing: run a server built with `-race` while a few workflows are running, and the race detector reports it once you `argo list` via the server a few times, unless the server crashes first.

Modifications

This change doesn't attempt to move to a multiple-connection model. It's a minimal change to stop the server crashing all the time: every use of the SQL connection is now guarded by a mutex.

Verification

Running v3.5.7

Inject 200 copies of example/dag-diamond.yaml into your cluster, then run `argo list -s <server>` to list the workflows. The server usually crashes within a few attempts.

With this change it didn't crash in 200 attempts at argo list.

-race build of the server doesn't complain any more either.

[zombiezen/go-sqlite](https://github.com/zombiezen/go-sqlite/blob/main/doc.go#L32) is not
thread safe when used through a single connection. The current code is
provably racing (run the server with `-race` and a few workflows being
run) and it will tell you this if you `argo list` via the server a few
times.

This change doesn't attempt to move to a multiple connection model,
it's a minimal change to stop the server crashing all the time, by
mutexing the use of the sql connection.

Fixes argoproj#13154 and argoproj#13140

Signed-off-by: Alan Clucas <[email protected]>
@Joibel Joibel marked this pull request as ready for review June 11, 2024 14:29
@agilgur5 agilgur5 changed the title fix: mutex calls to sqlitex fix(server): mutex calls to sqlitex Jun 12, 2024
@agilgur5
Contributor

cc @jiachengxu for review

Member

@jiachengxu jiachengxu left a comment

The change looks good to me, thanks for the work @Joibel!

@Joibel Joibel added this to the v3.5.x patches milestone Jun 13, 2024
@Joibel
Member Author

Joibel commented Jun 13, 2024

@agilgur5, @tczhao, @isubasinghe, @terrytangyuan - could someone take a look at this one so we can prep a 3.5.8 with it in?

@terrytangyuan terrytangyuan merged commit 0ca0c0f into argoproj:main Jun 13, 2024
32 checks passed
@sarabala1979
Member

good catch @Joibel

@agilgur5
Contributor

Nice work figuring out the root cause @Joibel!
Since we hadn't found any nil pointers in our own code, I was suspecting it might be within the SQLite library, although I definitely did not expect that it wasn't thread-safe. Those GoDoc comments really belong in the README of the library IMO 😕

Will you or @jiachengxu be taking up the full fix with thread pooling?

> so we can prep a 3.5.8 with it in?

Would like to get a fix for #13177 in as well and then release it ASAP

@agilgur5
Contributor

> Will you or @jiachengxu be taking up the full fix with thread pooling?

Wrote up a follow-up issue to track this in #13638

Development

Successfully merging this pull request may close these issues.

3.5.7 Server keeps restarting, panicking
3.5.7 Server pods are crashing after upgrade
5 participants