Reapply #8644 #9242
base: master
Conversation
Force-pushed from 6065746 to 27144ba.
Waiting to push the fix commit until CI completes for the reapplication.
Force-pushed from 7a40c4a to 1e7b192.
Looks like there are still a couple of itests failing. Will keep working on this next week.
The error message …
This looks relevant, regarding some of the errors I see in the latest CI run: https://stackoverflow.com/a/42303225
Perhaps part of the issue is with the … Based on the SO link above, we might also be lacking some needed indexes.
With closing the channel and a couple of other tests, I'm seeing logs similar to:

…

when I reproduce locally, as well as in the CI logs. I'm going to pull on that thread first. On the test config side, I'm also seeing these:

…

I think the first issue above is with the code, the second is a config issue, and together with the other config issue in my comment above, those are the three major failures still happening. I think the …
This looks like a case where we …
Yep, looking into why that isn't caught by the panic/recover mechanism.
It was actually a lack of error checking in …
Force-pushed from 1e67a84 to 899ae59.
Looks better as far as the errors on closing channels go. Will keep working tomorrow to eliminate the other errors.
Hmm, so we don't have great visibility into how much memory these CI machines have. Perhaps we need to modify the connection settings to reduce the number of active connections, and also tune params like … @djkazic has been working on a postgres+lnd tuning/perf guide that I think we can eventually check directly into lnd.
This is also very funky: `lnd/kvdb/sqlbase/readwrite_bucket.go`, lines 336 to 363 (at e3cc4d7).
We do two queries just to delete: a select to see if the row exists, then the delete itself, instead of just attempting the delete directly. Stepping back a minute: perhaps the issue is with this flawed KV abstraction we have. Perhaps we should just re-create a better hierarchical KV table from scratch. We use …
Here's another instance of duplicated work in `lnd/kvdb/sqlbase/readwrite_bucket.go`, lines 149 to 187 (at e3cc4d7).
We select to see if it exists, then potentially do the insert again. Instead, we can just do an upsert (`INSERT ... ON CONFLICT DO UPDATE`).
I think the way the sequence is implemented may also be problematic: we have the sequence field directly in the table, which means table locks may need to be held. The sequence gets incremented a lot for things like payments and invoices. We may be able to split that out into another table that can be updated independently of the main table: `lnd/kvdb/sqlbase/readwrite_bucket.go`, lines 412 to 437 (at e3cc4d7).
I've been able to reduce (but not fully eliminate) the serialization errors. I've also tried treating these errors and the `current transaction is aborted` errors as retryable, in case we hit a serialization error, ignore it, and then get that error on a subsequent call to postgres. In addition, I've found one more place where we get the `current transaction is aborted` error. I pushed these changes above for discussion. My next step is to try to reduce the number of conflicts based on @Roasbeef's suggestions above. I'm going on vacation for the rest of the week until next Tuesday, so will keep working on this then.
I think treating the OOM errors as serialization errors ended up being a mistake. Going to take that out and push when this run is done. In addition, I'm trying doubling the …
Got some errors like:

…

Looking into those; they could also be from my change of treating the OOM errors as retryable.
Set …
Only one test failure last run; looks like no instances of …
I'm running with …
```diff
@@ -194,10 +194,10 @@ ifeq ($(dbbackend),postgres)
 	# Remove a previous postgres instance if it exists.
 	docker rm lnd-postgres --force || echo "Starting new postgres container"

-	# Start a fresh postgres instance. Allow a maximum of 200 connections so
+	# Start a fresh postgres instance. Allow a maximum of 500 connections so
```
Alternatively, we can also limit the `sql` package params for lnd as well.
Ah, this was just me setting it back to the original config. In the end, I've been running tests locally with the existing config plus the parameters suggested by @djkazic above. I did set the default max connections to 20 in lnd. Will push cleaned-up/latest code tomorrow.
Looking at the last failure, I think it's a flake that's unrelated to the DB (we encountered it often in our tests before I added a hacky fix locally; it's due to an issue in the txnotifier).
Force-pushed from ea5fdde to c37acf5.
I'm still getting excessive serialization errors causing failures. Looks like that's hopefully the last thing to fix. I'm trying out some schema/query changes, and the result is likely to be pretty different from the existing schema. I'm going to try to do it all in the …
I have a somewhat incomplete attempt at rebuilding the KV SQL schema, but haven't fixed the failures yet. I have a couple of avenues to explore next.
This reverts commit 67419a7.
Force-pushed from 62c18ca to f3c8290.
Pushed the last update for the week:

…

I've had better results with those changes, but I'm bottlenecked by how long the itests take to run when testing which changes make the most difference. Tests are still not fully passing. In particular, this failure is also fairly consistent for me locally, so I'll look into that next week.
Looking into the failure above, I can reproduce it every few runs of the test by itself locally with the postgres backend. It looks like we get 6 settle and 4 forward events rather than 5 forward and 5 settle. From a first glance, it looks like we're reusing/overwriting some parts of a struct incorrectly, but I'm still narrowing down what/where.
Change Description
Fix #9229 by reapplying #8644 and:

- … in the `batch` package
- … in the `channeldb` package
- treating `current transaction is aborted` errors as serialization errors, in case we hit a serialization error, ignore it, and then get this error in a subsequent call to postgres
- updating the `db-instance` postgres flags in the `Makefile` per @djkazic's recommendations
- setting the `maxconnections` parameter for postgres DBs to 20 instead of 50 by default

Steps to Test
See the failing itests prior to the fix, and the passing itests after the fix.