
Fix the data inconsistency by moving the SetConsistentIndex into the transaction lock #13844

Closed
ahrtr wants to merge 1 commit into main from fix_data_inconsistency

Conversation

@ahrtr (Member) commented Mar 28, 2022

Try to fix 13766.

@codecov-commenter commented Mar 28, 2022

Codecov Report

Merging #13844 (acc74bf) into main (be29295) will decrease coverage by 0.20%.
The diff coverage is 100.00%.

❗ Current head acc74bf differs from pull request most recent head 3a7264e. Consider uploading reports for the commit 3a7264e to get more accurate results

@@            Coverage Diff             @@
##             main   #13844      +/-   ##
==========================================
- Coverage   72.57%   72.37%   -0.21%     
==========================================
  Files         468      468              
  Lines       38222    38238      +16     
==========================================
- Hits        27741    27673      -68     
- Misses       8699     8771      +72     
- Partials     1782     1794      +12     
| Flag | Coverage Δ |
|------|------------|
| all | 72.37% <100.00%> (-0.21%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| server/etcdserver/server.go | 84.18% <100.00%> (-0.67%) ⬇️ |
| server/storage/backend/backend.go | 84.17% <100.00%> (+0.51%) ⬆️ |
| server/storage/backend/batch_tx.go | 63.15% <100.00%> (+0.21%) ⬆️ |
| server/etcdserver/api/v3rpc/lease.go | 74.68% <0.00%> (-7.60%) ⬇️ |
| client/pkg/v3/fileutil/purge.go | 66.03% <0.00%> (-7.55%) ⬇️ |
| server/storage/mvcc/watchable_store.go | 84.42% <0.00%> (-5.44%) ⬇️ |
| raft/rafttest/node.go | 95.00% <0.00%> (-5.00%) ⬇️ |
| server/etcdserver/api/v3rpc/member.go | 93.54% <0.00%> (-3.23%) ⬇️ |
| server/etcdserver/cluster_util.go | 70.35% <0.00%> (-3.17%) ⬇️ |
| server/etcdserver/api/v3election/election.go | 66.66% <0.00%> (-2.78%) ⬇️ |
| ... and 13 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update be29295...3a7264e.

@ahrtr ahrtr force-pushed the fix_data_inconsistency branch 2 times, most recently from cba5fd2 to 65a691c Compare March 28, 2022 05:35
@ahrtr ahrtr changed the title Fix the data inconsistency by moving the SetConsistentIndex into the transaction log Fix the data inconsistency by moving the SetConsistentIndex into the transaction lock Mar 28, 2022
@ahrtr ahrtr force-pushed the fix_data_inconsistency branch 2 times, most recently from c14ab30 to e37d481 Compare March 28, 2022 06:52
@ahrtr (Member, Author) commented Mar 28, 2022

SetConsistentIndex and applyV3.Apply are supposed to be an atomic operation, but obviously they aren't. If a periodic commit happens in between, and the etcd server crashes right after that commit, then the etcd server will be in a bad state when it starts again. So this is a real issue.
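
A minimal sketch of that window (all names below are illustrative stand-ins, not the actual etcd code): the consistent index is bumped before the entry's writes reach the batch transaction, so a commit that fires in between persists an index that is ahead of the data.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeBackend is a deliberately simplified stand-in for the batch-transaction
// backend; none of these names come from the actual etcd code.
type fakeBackend struct {
	mu              sync.Mutex
	consistentIndex uint64 // in-memory index, set by the apply loop
	pendingWrites   int    // writes buffered in the open batch transaction

	// what would survive a crash
	persistedIndex  uint64
	persistedWrites int
}

// setConsistentIndex models updating the index outside the transaction lock.
func (b *fakeBackend) setConsistentIndex(i uint64) { b.consistentIndex = i }

// applyWrites models the apply loop adding an entry's writes under the lock.
func (b *fakeBackend) applyWrites(n int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pendingWrites += n
}

// commit models a periodic (or forced) commit: it persists the pending writes
// together with whatever consistent index is currently set.
func (b *fakeBackend) commit() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.persistedIndex = b.consistentIndex
	b.persistedWrites += b.pendingWrites
	b.pendingWrites = 0
}

func main() {
	b := &fakeBackend{}

	b.setConsistentIndex(5) // apply of entry 5 starts: the index is bumped first
	b.commit()              // a periodic commit sneaks in before the data is written
	b.applyWrites(1)        // entry 5's writes only land now, in the next transaction

	// If the server crashed right after the periodic commit, the persisted
	// state would claim entry 5 was applied even though it carried no data.
	fmt.Printf("persisted index=%d, persisted writes=%d\n",
		b.persistedIndex, b.persistedWrites) // persisted index=5, persisted writes=0
}
```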

But I can still reproduce the data inconsistency issue with this PR. So this PR isn't a complete fix.


   logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:3379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:2379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	sent request	{"operation": "INITIAL_START_ETCD", "to": "127.0.0.1:1379"}
    logger.go:130: 2022-03-28T00:01:28.982-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:1379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    etcd_tester_test.go:43: Bootstrap failed {error 26 0  "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD", "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD", "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD"}
--- FAIL: TestFunctional (0.01s)
FAIL
FAIL: (code:1):
  % ./bin/etcd-tester --config ./tests/functional/functional.yaml -test.v
ETCD_TESTER_EXIT_CODE: 1

@ahrtr ahrtr marked this pull request as draft March 28, 2022 08:04
@ahrtr ahrtr force-pushed the fix_data_inconsistency branch from e37d481 to b994234 Compare March 28, 2022 08:35
@ahrtr ahrtr marked this pull request as ready for review March 28, 2022 08:43
@ahrtr ahrtr force-pushed the fix_data_inconsistency branch from b994234 to acc74bf Compare March 28, 2022 08:52
@ahrtr ahrtr force-pushed the fix_data_inconsistency branch from acc74bf to 3a7264e Compare March 28, 2022 09:17
@ahrtr ahrtr marked this pull request as draft March 28, 2022 09:34
@serathius
Copy link
Member

serathius commented Mar 28, 2022

@ahrtr Thanks for looking into this. The logs you linked show a bug in the functional framework, not proof of a reproduction.

   logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:3379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:2379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    logger.go:130: 2022-03-28T00:01:28.981-0700	INFO	tester	sent request	{"operation": "INITIAL_START_ETCD", "to": "127.0.0.1:1379"}
    logger.go:130: 2022-03-28T00:01:28.982-0700	INFO	tester	received response	{"operation": "INITIAL_START_ETCD", "from": "127.0.0.1:1379", "success": false, "status": "\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
    etcd_tester_test.go:43: Bootstrap failed {error 26 0  "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD", "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD", "INITIAL_START_ETCD" is not valid; last server operation was "INITIAL_START_ETCD"}

This means that state from the previous run was not properly cleaned up. I recommend running:

sudo killall etcd && sudo killall etcd-agent

Please rerun the tests and look for the hash inconsistency log in one of the members. I will also try to verify this PR.

@ahrtr (Member, Author) commented Mar 28, 2022

Thanks for the info. I haven't had time to read through the functional test's source code so far. Which log proves that the issue has been reproduced?

I will continue to work on this tomorrow.

@serathius (Member)

Please read #13838 (comment).

@serathius (Member)

I also recommend qualifying on the commit that first introduced the bug, 5005167, as it will reduce the chance of any other issues arising.

@ptabor (Contributor) commented Mar 28, 2022

Benjamin: Thank you for the code. It's not clear to me that it solves the root problem. I commented in #13766 (comment) that my current thinking is about looking for sources of mid-apply commits, i.e. Unlocks. They can be either:

  • applies that commit multiple transactions
  • RW transactions triggered from outside of the 'apply' context.

It seems that compaction is actually executed by a goroutine concurrent with the apply itself (storage/mvcc/kvstore_compaction.go:57), and it triggers 'Unlock' calls.

My current thinking:

  1. There should be the ability to wrap Backend with a custom implementation that provides hooks into the transaction lifecycle.
  2. Out-of-apply() transactions should use a Backend wrapper that has no registered Hooks.
  3. The transactions in Apply() should use a Backend that has Hooks() as currently, with additional verification that only one Unlock() was called per transaction, i.e. Apply calls hook->VerifyApplyEnd() and hook->VerifyApplyBegin(), or checks counters... (see the sketch below)
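
A rough sketch of that shape (hypothetical names, not the actual etcd backend API): the apply path gets transactions that carry hooks and count their Unlock calls, while out-of-apply callers get a hook-free wrapper.

```go
package sketch

import "sync/atomic"

// Hooks is fired by the backend right before a batch transaction is committed.
type Hooks interface {
	OnPreCommitUnsafe(tx *BatchTx)
}

// BatchTx is a deliberately simplified stand-in for the backend's batch
// transaction; it counts Unlock calls so the apply path can verify that a
// single apply produced exactly one commit.
type BatchTx struct {
	hooks   Hooks
	unlocks int32
}

func (t *BatchTx) Lock() {}

func (t *BatchTx) Unlock() {
	atomic.AddInt32(&t.unlocks, 1)
	if t.hooks != nil {
		// e.g. persist the consistent index alongside the pending writes.
		t.hooks.OnPreCommitUnsafe(t)
	}
}

// Backend hands out transactions: the apply path gets the registered hooks,
// everything else (compaction goroutine, periodic commit, ...) gets none.
type Backend struct{ hooks Hooks }

func (b *Backend) ApplyTx() *BatchTx      { return &BatchTx{hooks: b.hooks} }
func (b *Backend) OutOfApplyTx() *BatchTx { return &BatchTx{} }

// VerifySingleUnlock is the counter check from point 3: exactly one Unlock
// (and therefore one hook invocation) per apply.
func VerifySingleUnlock(t *BatchTx) bool {
	return atomic.LoadInt32(&t.unlocks) == 1
}
```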

@ahrtr (Member, Author) commented Mar 28, 2022

Thanks @serathius & @ptabor for the feedback, which makes sense.

> It's not clear to me that it solves the root problem.

This PR is just a draft; I was planning to quickly verify my idea.

> My current thinking:
>
> 1. There should be the ability to wrap Backend with a custom implementation that provides hooks into the transaction lifecycle.
>
> 2. Out-of-apply() transactions should use a Backend wrapper that has no registered Hooks.
>
> 3. The transactions in `Apply()` should use a Backend that has Hooks() as currently, with additional verification that only one Unlock() was called per transaction, i.e. Apply calls `hook->VerifyApplyEnd()` and `hook->VerifyApplyBegin()`, or checks counters...

Exactly. I will try to get all of this sorted out, work out a more robust solution, and send it out for review.

@ahrtr (Member, Author) commented Mar 28, 2022

> Please rerun the tests and look for the hash inconsistency log in one of the members. I will also try to verify this PR.

I just tried this PR on top of your PR 13838, and could not reproduce the data inconsistency issue. I could not reproduce the issue without this PR either. That's probably because my local VM isn't powerful enough: it has 4 CPUs (Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz) and 8GB of memory.

I will try to get a more powerful machine, change the write directory, and try again.

@serathius Have you tried this PR on your workstation?

@ahrtr (Member, Author) commented Mar 29, 2022

> I also recommend qualifying on the commit that first introduced the bug, 5005167, as it will reduce the chance of any other issues arising.

Actually, the root cause of this issue doesn't come from commit 5005167; it just increases the likelihood of reproducing the issue. The key point is to make sure we either support an atomic operation (just one transaction, i.e. one Lock & Unlock pair) or reentrant transactions.
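
For illustration, a minimal sketch of the first option, with hypothetical types rather than the real server code: the consistent index is updated inside the same Lock/Unlock pair that carries the entry's writes, so whatever commits the transaction persists both or neither.

```go
package sketch

import "sync"

// batchTx and consistentIndex are simplified stand-ins for illustration only.
type batchTx struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newBatchTx() *batchTx { return &batchTx{data: map[string][]byte{}} }

func (t *batchTx) Lock()   { t.mu.Lock() }
func (t *batchTx) Unlock() { t.mu.Unlock() }

func (t *batchTx) UnsafePut(key string, value []byte) { t.data[key] = value }

type consistentIndex struct{ index uint64 }

func (ci *consistentIndex) SetConsistentIndex(i uint64) { ci.index = i }

// applyEntry holds the batch transaction for exactly one Lock/Unlock pair and
// updates the consistent index under that same lock, so a commit triggered
// from elsewhere can no longer see the new index without the matching data.
func applyEntry(tx *batchTx, ci *consistentIndex, index uint64, key string, value []byte) {
	tx.Lock()
	defer tx.Unlock()
	ci.SetConsistentIndex(index)
	tx.UnsafePut(key, value)
}
```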

@serathius (Member)

> @serathius Have you tried this PR on your workstation?

Yes, I was not able to reproduce the issue with this fix. I have run tests for over an hour, so I think we are ok.

> I also recommend qualifying on the commit that first introduced the bug, 5005167, as it will reduce the chance of any other issues arising.

> Actually, the root cause of this issue doesn't come from commit 5005167; it just increases the likelihood of reproducing the issue. The key point is to make sure we either support an atomic operation (just one transaction, i.e. one Lock & Unlock pair) or reentrant transactions.

This would be very worrying, because I was not able to reproduce the issue on previous commits at all. If the issue is not caused by 5005167, then we don't have a good way to qualify our results. If the root cause is too rare to be detected, then we need to improve our testing.

@ptabor @ahrtr if 5005167 is not the root cause, do you have any indication of when the real issue was introduced? Any idea how to improve our testing approach to detect the issue?

@ahrtr (Member, Author) commented Mar 29, 2022

> Yes, I was not able to reproduce the issue with this fix. I have run tests for over an hour, so I think we are ok.

Thanks @serathius for the feedback; that's really good news. I will work on a formal solution/fix as the next step.

> This would be very worrying, because I was not able to reproduce the issue on previous commits at all. If the issue is not caused by 5005167, then we don't have a good way to qualify our results. If the root cause is too rare to be detected, then we need to improve our testing.
>
> @ptabor @ahrtr if 5005167 is not the root cause, do you have any indication of when the real issue was introduced? Any idea how to improve our testing approach to detect the issue?

No need to worry :). I haven't read through the functional test's source code so far, but I believe the reason we did not see this issue before commit 5005167 is that the functional test did not perform operations such as periodic compaction, or other operations that may call Commit or ForceCommit, or at least the frequency was really low.

@ahrtr (Member, Author) commented Mar 29, 2022

> do you have any indication of when the real issue was introduced?

The root cause comes from server.go#L1774 and server.go#L1804.

And of course the addition of OnPreCommitUnsafe also contributes to the root cause.

@ptabor (Contributor) commented Mar 29, 2022

> The root cause comes from server.go#L1774 and server.go#L1804.

Yes. And more precisely, SetConsistentIndex() potentially became harmful once any 'Unlock' could prematurely submit the transaction. I'm not sure whether it's worth testing the issue on a commit in the middle of that PR vs. at the whole PR boundary, i.e. f1123d6.

@ahrtr (Member, Author) commented Mar 30, 2022

I just submitted another PR 13854, so closing this one.

@ahrtr ahrtr closed this Mar 30, 2022
Successfully merging this pull request may close these issues: Inconsistent revision and data occurs
4 participants