-
Notifications
You must be signed in to change notification settings - Fork 663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(resharding): fix Chain::get_shards_to_state_sync() #12617
fix(resharding): fix Chain::get_shards_to_state_sync() #12617
Conversation
This function was left unimplemented for the resharding v3 case, so here we implement it. For each shard ID, if there hasn't been a resharding, we say that we want to state sync it if we'll be tracking it next epoch and don't currently track it. If there has been a resharding, we state sync it if the previous condition is true and we aren't currently generating the shard itself as part of resharding. Since this shards to state sync calculation is made when we preprocess the first block of an epoch, there are 3 relevant epochs here: prev_epoch, current_epoch, next_epoch. When we process the last block of prev_epoch, start_resharding() will be called and will initiate the child shard splitting if we were tracking the parent of the split shards during prev_epoch (this is true even if we don't track either child in current_epoch). So if we aren't tracking a child shard during current_epoch, we still actually don't want to state sync it if we were tracking the parent previously
few things:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #12617 +/- ##
==========================================
- Coverage 70.53% 70.38% -0.15%
==========================================
Files 847 847
Lines 172839 172855 +16
Branches 172839 172855 +16
==========================================
- Hits 121904 121656 -248
- Misses 45833 46116 +283
+ Partials 5102 5083 -19
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes because I think it can be simplified, let me know if I'm missing something.
@@ -2424,56 +2426,69 @@ impl Chain { | |||
shard_tracker: &ShardTracker, | |||
me: &Option<AccountId>, | |||
parent_hash: &CryptoHash, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<unrelated rant>
Why do we have two names - prev and parent - for the same thing? 😢 </unrelated rant>
shard_tracker.will_care_about_shard(me.as_ref(), parent_hash, shard_id, true); | ||
let does_care_about_shard = | ||
shard_tracker.care_about_shard(me.as_ref(), parent_hash, shard_id, true); | ||
) -> Result<bool, Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the old condition is no longer relevant but the logic here seems too complicated.
In V2 we needed to get a second copy of the parent, even the node tracks it, to perform the preprocessing.
will_care_about_shard && (will_shard_layout_change || !does_care_about_shard)
In V3 this should be needed so it feels like the following should be sufficient:
will_care_about_shard && !does_care_about_shard
Please keep in mind that will_care_about_shard
is resharding aware. We should aim to keep the resharding specific part of this logic in one place if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In V3 this should be needed so it feels like the following should be sufficient:
will_care_about_shard && !does_care_about_shard
So, relative to the current PR, something like:
diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index 931f711f8..ac6b3d650 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -2465,32 +2465,7 @@ impl Chain {
if shard_tracker.care_about_shard(me.as_ref(), parent_hash, shard_id, true) {
return Ok(false);
}
- // Now we need to state sync it unless this is a post-resharding child shard whose parent we were tracking in the
- // previous epoch, in which case we don't need to state sync because we'll generate the child when we do the resharding
-
- let shard_layout = epoch_manager.get_shard_layout(epoch_id)?;
- let prev_epoch_id = epoch_manager.get_prev_epoch_id_from_prev_block(parent_hash)?;
- let prev_shard_layout = epoch_manager.get_shard_layout(&prev_epoch_id)?;
-
- let resharded = shard_layout != prev_shard_layout;
- if !resharded {
- return Ok(true);
- }
- let Some(parent_shard_id) = shard_layout.try_get_parent_shard_id(shard_id)? else {
- return Ok(true);
- };
- let was_split = parent_shard_id != shard_id;
- if !was_split {
- return Ok(true);
- }
-
- // Note that here passing `prev_prev_hash` to care_about_shard() will have us check whether we were tracking it in
- // the previous epoch, because the current block is the first block of an epoch, so prev_prev_hash is the "parent_hash"
- // of the last block of the previous epoch. TODO: consider refactoring these ShardTracker functions to accept an epoch_id
- // to make this less tricky.
- let splitting_child =
- shard_tracker.care_about_shard(me.as_ref(), prev_prev_hash, parent_shard_id, true);
- Ok(!splitting_child)
+ Ok(true)
}
/// Check if any block with missing chunk is ready to be processed and start processing these blocks
Now that the other PR is merged, we can try the modification to test_resharding_v3_shard_shuffling_slower_post_processing_tasks
that I mentioned caused a crash:
diff --git a/integration-tests/src/test_loop/tests/resharding_v3.rs b/integration-tests/src/test_loop/tests/resharding_v3.rs
index c04d89ba6..457db7723 100644
--- a/integration-tests/src/test_loop/tests/resharding_v3.rs
+++ b/integration-tests/src/test_loop/tests/resharding_v3.rs
@@ -985,6 +985,7 @@ fn test_resharding_v3_shard_shuffling_slower_post_processing_tasks() {
.track_all_shards(false)
.all_chunks_expected(false)
.delay_flat_state_resharding(2)
+ .epoch_length(10)
.build();
test_resharding_v3_base(params);
}
Now unfortunately this passes because something about the order of the test loop task processing for state sync and resharding catchup has changed between when I wrote that and the current head of master. Without trying to figure it out, we can just hack it to run in the same order as before:
diff --git a/chain/chain/src/flat_storage_resharder.rs b/chain/chain/src/flat_storage_resharder.rs
index 83bd481f3..a401c51ce 100644
--- a/chain/chain/src/flat_storage_resharder.rs
+++ b/chain/chain/src/flat_storage_resharder.rs
@@ -609,6 +609,16 @@ impl FlatStorageResharder {
if self.controller.is_cancelled() {
return FlatStorageReshardingTaskResult::Cancelled;
}
+ let head = chain_store.head().unwrap();
+ let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&head.epoch_id).unwrap() else {
+ return FlatStorageReshardingTaskResult::Postponed;
+ };
+ let key = borsh::to_vec(&near_primitives::state_sync::StatePartKey(sync_hash, shard_uid.shard_id(), 0)).unwrap();
+ let part = chain_store.store().get(near_store::DBCol::StateParts, &key).unwrap();
+ if !part.is_some() {
+ return FlatStorageReshardingTaskResult::Postponed;
+ }
+
info!(target: "resharding", ?shard_uid, "flat storage shard catchup task started");
let metrics = FlatStorageReshardingShardCatchUpMetrics::new(&shard_uid);
// Apply deltas and then create the flat storage.
Then (with the above epoch length change), run it and you get:
thread 'test_loop::tests::resharding_v3::test_resharding_v3_shard_shuffling_slower_post_processing_tasks' panicked at chain/chain/src/resharding/resharding_actor.rs:93:17:
impossible to recover from a flat storage shard catchup failure!
with this error message at the end of the logs:
5.661s ERROR resharding: flat storage shard catchup delta application failed! shard_uid=s7.v3 err=Other("unexpected resharding catchup flat storage status for s7.v3: Ready(FlatStorageReadyStatus { flat_head: BlockInfo { hash: Gposh9D271WMUy3wt4Re1pXtFViNEdiMEgWyjm8eoYHH, height: 24, prev_hash: 7d7uMfUA8dXHjGcixM88gVVnt7jaak8N5y7gfv7C58af } })")
What's happening here is the problem with account6
I mentioned in the other comment. It's not tracking the child in the current epoch, will track it in the next, but is performing the split because it tracked the parent in the previous epoch.
Please keep in mind that
will_care_about_shard
is resharding aware. We should aim to keep the resharding specific part of this logic in one place if possible.
Yeah it's a bit messy... maybe we should split somethiing out of this?
Just for clarity I think you're talking about the following scenario: That is an interesting case to think about and same even without resharding actually. The two options we have are:
What is your intuition about how those compare? Approach 2 is what you've implemented here. What would it take to make 1) work? Are there any drawbacks to this approach that I overlooked? Which approach would have simpler and cleaner code? |
Wait that is a good point, actually didn't even realize that. What do you think about just changing this to not care about the resharding case specifically, but just in general checking whether we were tracking the parent (usually the same shard ID) in the prev epoch. It gets a lot simpler that way and I think it just makes more sense (see below). something like this relative to the current PR: diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index f712c6081..b2c0c5abb 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -2472,28 +2472,17 @@ impl Chain {
// previous epoch, in which case we don't need to state sync because we'll generate the child when we do the resharding
let shard_layout = epoch_manager.get_shard_layout(epoch_id)?;
- let prev_epoch_id = epoch_manager.get_prev_epoch_id_from_prev_block(parent_hash)?;
- let prev_shard_layout = epoch_manager.get_shard_layout(&prev_epoch_id)?;
-
- let resharded = shard_layout != prev_shard_layout;
- if !resharded {
- return Ok(true);
- }
let Some(parent_shard_id) = shard_layout.try_get_parent_shard_id(shard_id)? else {
return Ok(true);
};
- let was_split = parent_shard_id != shard_id;
- if !was_split {
- return Ok(true);
- }
// Note that here passing `prev_prev_hash` to care_about_shard() will have us check whether we were tracking it in
// the previous epoch, because the current block is the first block of an epoch, so prev_prev_hash is the "parent_hash"
// of the last block of the previous epoch. TODO: consider refactoring these ShardTracker functions to accept an epoch_id
// to make this less tricky.
- let splitting_child =
+ let cared_before =
shard_tracker.care_about_shard(me.as_ref(), prev_prev_hash, parent_shard_id, true);
- Ok(!splitting_child)
+ Ok(!cared_before)
}
/// Check if any block with missing chunk is ready to be processed and start processing these blocks
I could be missing something, but I think really all we would have to do would be to fix the inconsistency between the state sync and the resharding code, since we don't want both of them to believe they should be the one setting the child shard's state. The fact that 1) is basically already implemented is sort of why this crash happens at all
Yeah, in my opinion it's a bit strange to discard state we know we're going to want, then ask others for that state, then proceed as we would have if we had never discarded it (i.e. apply chunks for that shard again post-state sync). So suppose we're in this situation where we're going to track a shard in So there are three points of interest here:
Suppose we're on a localnet with small state and, say, an epoch length of 100, and that the starting block heights are On mainnet the picture is similar except that instead of not applying chunks for 2 blocks in favor of asking our peers to give us the state, we stop applying chunks for an hour or so. Feels to me like we don't even gain a whole lot by intentionally ceasing to apply state updates just to start doing it again an hour later. Like, we have the state and the means to keep it updated, and we know we'll need it to be updated in 12 hours, but we decide to stop updating it ourselves for one small part of the beginning of the epoch? It feels like sort of bad behavior to place a burden on your peers and ask them to provide state parts (and in the case of resharding, perform it) for you when you could have just been applying chunks yourself for that hour or so. Not to mention that it also introduces this dependency on state sync in a case where it shouldn't have been necessary for the node to keep up.
Actually I think maybe 2! Especially with the diff I posted above since we actually dont rly care about whether there's a reasharding or not. If we go with 1), we still have to figure out how to not perform the resharding if state sync is going to take care of downloading the child shard's state
Yea this is clear here in the usage of the |
I totally agree that 2) is the less wasteful implementation in terms of actual resources. What I'm asking is whether it's premature optimization, and if there is any tradeoff in terms of code simplicity and robustness that we have to consider. If 2) is as easy and nice (or easier and nicer) than 1) then by all means we should do it.
Awesome, let's do it then, please update the PR and re-request review and I'll have a look. I still think it's less future proof but there isn't any reason not to do it now. We can always fix it later if we actually find any issues. |
Updated to get rid of the logic related to resharding, as discussed above and updated the PR description. Something to pay attention to is that in the code that says: In addition to the integration tests, I ran this: diff --git a/pytest/tests/sanity/transactions.py b/pytest/tests/sanity/transactions.py
index bdc72f68f..05de6f799 100755
--- a/pytest/tests/sanity/transactions.py
+++ b/pytest/tests/sanity/transactions.py
@@ -24,8 +24,8 @@ nodes = start_cluster(
config=None,
extra_state_dumper=True,
genesis_config_changes=[["min_gas_price",
- 0], ["max_inflation_rate", [0, 1]],
- ["epoch_length", 10],
+ 0], ["max_inflation_rate", [0, 1]], ['shuffle_shard_assignment_for_chunk_producers', True],
+ ["epoch_length", 20],
["block_producer_kickout_threshold", 70]],
client_config_changes={
0: {
@@ -146,16 +146,16 @@ for height, hash in utils.poll_blocks(nodes[4], timeout=TIMEOUT):
last_balances = [x for x in ctx.expected_balances]
ctx.send_moar_txs(hash, 10, use_routing=True)
sent_height = height
- else:
- assert height <= sent_height + 10, ('Balances before: {before}\n'
- 'Expected balances: {expected}\n'
- 'Current balances: {current}\n'
- 'Sent at height: {sent_at}\n'
- 'Current height: {height}').format(
- before=last_balances,
- expected=ctx.expected_balances,
- current=ctx.get_balances(),
- sent_at=sent_height,
- height=height)
- if height >= 100:
+ # else:
+ # assert height <= sent_height + 10, ('Balances before: {before}\n'
+ # 'Expected balances: {expected}\n'
+ # 'Current balances: {current}\n'
+ # 'Sent at height: {sent_at}\n'
+ # 'Current height: {height}').format(
+ # before=last_balances,
+ # expected=ctx.expected_balances,
+ # current=ctx.get_balances(),
+ # sent_at=sent_height,
+ # height=height)
+ if height >= 150:
break
which ends up being a decent sanity check because |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Can you add an integration test for the interesting case? I think @staffik added some tools to configure with what shards it should track in what epoch, in the resharding testloop.
chain/chain/src/chain.rs
Outdated
if self.epoch_manager.is_next_block_epoch_start(prev_hash)? { | ||
debug!(target: "chain", %block_hash, "block is the first block of an epoch"); | ||
if !self.prev_block_is_caught_up(prev_prev_hash, prev_hash)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mini nit: Maybe consider inverting the logic and early returns instead of else
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
chain/chain/src/chain.rs
Outdated
debug!(target: "chain", block_hash=?header.hash(), "block is the first block of an epoch"); | ||
if !self.prev_block_is_caught_up(&prev_prev_hash, &prev_hash)? { | ||
if self.epoch_manager.is_next_block_epoch_start(prev_hash)? { | ||
debug!(target: "chain", %block_hash, "block is the first block of an epoch"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mini nit: The log is not wrong but the message is a bit out of place. It's very generic but it's logged from within get_catchup_and_state_sync_infos
. Maybe log something like get_catchup_and_state_sync_infos
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved it to the bottom to log what shards were gonna state sync
let mut shards_to_sync = Vec::new(); | ||
for shard_id in epoch_manager.shard_ids(&epoch_id)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to for loop
chain/chain/src/chain.rs
Outdated
/// beginning after the block `epoch_last_block`. If that epoch is epoch T, the logic is: (will | ||
/// track the shard in epoch T+1) && (not tracking it in T) && (didn't track it in T-1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can you multiline the condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
chain/chain/src/chain.rs
Outdated
let shard_layout = epoch_manager.get_shard_layout(epoch_id)?; | ||
let prev_epoch_id = epoch_manager.get_epoch_id(epoch_last_block)?; | ||
let prev_shard_layout = epoch_manager.get_shard_layout(&prev_epoch_id)?; | ||
let parent_shard_id = if shard_layout == prev_shard_layout { | ||
shard_id | ||
} else { | ||
shard_layout.get_parent_shard_id(shard_id)? | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a method called something like get_prev_shard_id
that's used when going back chain and collecting receipts. It should do this for you if my memory serves me well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good call, I didn't notice that. Looks like that API is a little more general than what we want here, since we're just interested in one shard ID, and it's more direct to only deal with that one than looking thru the Vec
return value. Feels a little error-prone and like it adds some burden on the caller to check that the Vec only has one element. I'll follow up with another PR adding/modifying the API for that simpler case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need just one shard get_prev_shard_id_from_prev_hash
should do the trick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohhh true. embarrassing I missed that function right below the other one :D
Yes, you can find it in |
added diff --git a/chain/chain/src/flat_storage_resharder.rs b/chain/chain/src/flat_storage_resharder.rs
index 83a943411..14cfc013b 100644
--- a/chain/chain/src/flat_storage_resharder.rs
+++ b/chain/chain/src/flat_storage_resharder.rs
@@ -597,6 +597,23 @@ impl FlatStorageResharder {
Ok(iter)
}
+ fn state_sync_first(&self, shard_uid: ShardUId, chain_store: &ChainStore) -> bool {
+ let head = chain_store.head().unwrap();
+ let mut syncing = false;
+ for _ in chain_store.store().iter(near_store::DBCol::StateDlInfos) {
+ syncing = true;
+ }
+ if !syncing {
+ return false;
+ }
+ let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&head.epoch_id).unwrap() else {
+ return true;
+ };
+ let key = borsh::to_vec(&near_primitives::state_sync::StatePartKey(sync_hash, shard_uid.shard_id(), 0)).unwrap();
+ let part = chain_store.store().get(near_store::DBCol::StateParts, &key).unwrap();
+ !part.is_some()
+ }
+
/// Task to perform catchup and creation of a flat storage shard spawned from a previous
/// resharding operation. May be a long operation time-wise. This task can't be cancelled
/// nor postponed.
@@ -609,6 +626,12 @@ impl FlatStorageResharder {
if self.controller.is_cancelled() {
return FlatStorageReshardingTaskResult::Cancelled;
}
+
+ if self.state_sync_first(shard_uid, chain_store) {
+ tracing::info!("catchup wait for state sync");
+ return FlatStorageReshardingTaskResult::Postponed;
+ }
+
info!(target: "resharding", ?shard_uid, "flat storage shard catchup task started");
let metrics = FlatStorageReshardingShardCatchUpMetrics::new(&shard_uid);
// Apply deltas and then create the flat storage.
On top of this PR, the test still fails, but with a different error: |
Yeah it's not a real bug, that diff postpones it if theres any state sync going on, and account0 state sync shard 3 during the post-resharding epoch, so it postpones it unnecessarily and then isnt finished till too late |
Yeah I think it works here because we only unload it if we're not tracking it both in the current and the next epoch: nearcore/chain/chain/src/chain.rs Line 2047 in 445ee6f
|
This function was left unimplemented for the resharding v3 case, so here we implement it. For each shard ID, we say that we want to state sync it if we'll be tracking it next epoch and don't currently track it in this or the previous epoch. If we don't track it in the current epoch but did track it in the previous, then there's no need to state sync, since we can just keep applying chunks to stay caught up.
This logic doesn't make reference to resharding since it should be the same in either case, but it fixes the bug where both state sync and the resharding code might believe they're responsible for generating the child shard state after a resharding if we're in this rare case where we track the parent in epoch
T-1
, don't track the child inT
, and will track the child inT+1
. In that case, start_resharding() will be called at the end of epochT-1
and will initiate the child shard splitting (this is true even if we don't track either child in the current epoch). And then resharding will proceed as usual and we'll start applying chunks for the child when it's ready, and no state sync has to be involved.