Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor leader.c to fix stack growth in handle_exec_sql #697

Merged
merged 13 commits into from
Oct 15, 2024

Conversation

cole-miller
Copy link
Contributor

@cole-miller cole-miller commented Sep 2, 2024

This PR is another attempt to fix the stack growth issue with handle_exec_sql (related to #679), this time in a more principled way that involves a significant refactor of leader.c. Details follow.

Recall that the issue arises with a call stack like this (callees at the top)

handle_exec_sql_next
handle_exec_sql_cb
leaderExecV2
leader__barrier
leader__exec
handle_exec_sql_next
...

that is, indirect recursion that can generate a number of stack frames proportional to the number of ;-separated statements in an EXEC_SQL request. This happens because leader__barrier and leader__exec can invoke their callbacks synchronously when suspending is not required (respectively, because the FSM is up to date with the raft log and and because sqlite3_step generated no changed pages).

This PR rewrites those two functions, establishing a new contract: the callback is only invoked if we yielded to the event loop at least once while processing the request; if the request was processed synchronously, a magic value LEADER_NOT_ASYNC is returned to indicate that the caller should invoke the callback. Existing callsites are updated to reflect this new contract.

With these new implementations available, we can fix the original problem by rewriting handle_exec_sql_next to iteratively process as many statements as possible until one of them suspends or returns an error. The PR includes a regression test to validate this fix.

The new implementations of the leader.c functions are intentionally different in style from the old ones. Effectively, each barrier request and exec request is a coroutine driven by a state machine. With this approach the LEADER_NOT_ASYNC feature falls out rather naturally, and we get the added benefits of more compact code and built-in observability.

@cole-miller cole-miller changed the title Leader pseudo coroutines Refactor leader.c to fix stack growth in handle_exec_sql Sep 2, 2024
@cole-miller cole-miller marked this pull request as draft September 2, 2024 17:05
Copy link

codecov bot commented Sep 2, 2024

Codecov Report

Attention: Patch coverage is 83.26848% with 43 lines in your changes missing coverage. Please review.

Project coverage is 80.95%. Comparing base (85e494d) to head (06992a6).
Report is 25 commits behind head on master.

Files with missing lines Patch % Lines
src/leader.c 81.45% 9 Missing and 19 partials ⚠️
src/gateway.c 78.87% 7 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #697      +/-   ##
==========================================
- Coverage   81.07%   80.95%   -0.12%     
==========================================
  Files         197      196       -1     
  Lines       29164    29186      +22     
  Branches     4066     4089      +23     
==========================================
- Hits        23644    23627      -17     
- Misses       3875     3895      +20     
- Partials     1645     1664      +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cole-miller
Copy link
Contributor Author

cole-miller commented Sep 2, 2024

@cole-miller cole-miller linked an issue Sep 3, 2024 that may be closed by this pull request
@cole-miller
Copy link
Contributor Author

Status: trying to identify and fix the regression seen here canonical/lxd#14034, namely sqlite3_close returning nonzero in leader__close (indicating that a prepared statement was not finalized).

Signed-off-by: Cole Miller <[email protected]>
@cole-miller
Copy link
Contributor Author

I believe the LXD issue has been fixed, the problem was returning 0 from exec_async in a case where we didn't suspend. This caused handle_exec_sql_cb to not be invoked, leaking a prepared statement and tripping assert(sqlite3_close(conn) == 0).

@cole-miller cole-miller marked this pull request as ready for review October 11, 2024 00:29
Copy link
Contributor

@letFunny letFunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks Cole. I just have a few questions to improve my understanding and some very small suggestions.

test/unit/test_gateway.c Outdated Show resolved Hide resolved
src/leader.c Outdated Show resolved Hide resolved
src/leader.c Outdated Show resolved Hide resolved
src/gateway.c Outdated Show resolved Hide resolved
src/gateway.c Show resolved Hide resolved
src/gateway.c Show resolved Hide resolved
test/integration/test_client.c Show resolved Hide resolved
src/gateway.c Show resolved Hide resolved
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Signed-off-by: Cole Miller <[email protected]>
Testing again locally revealed that (contrary to what I remembered) 1000
isn't enough to cause a stack overflow with the old implementation on my
machine.

Signed-off-by: Cole Miller <[email protected]>
I don't remember why I thought there was a problem here originally, but
investigation with printf shows that the second flush is happening as
intended.

Signed-off-by: Cole Miller <[email protected]>
@cole-miller cole-miller merged commit e80eada into canonical:master Oct 15, 2024
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Recursion in EXEC_SQL request handling
2 participants