-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NONEVM-984][solana] - Reorg Detection + lighter rpc call #951
base: develop
Are you sure you want to change the base?
Conversation
Quality Gate passedIssues Measures |
@@ -112,11 +130,8 @@ func (c *pendingTxContext) New(tx pendingTx, sig solana.Signature, cancel contex | |||
return err | |||
} | |||
|
|||
// upgrade to write lock if sig or id do not exist | |||
// upgrade to write lock if id do not exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// upgrade to write lock if id do not exist | |
// upgrade to write lock if id does not exist |
return err | ||
} | ||
|
||
var pTx pendingTx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: think we can move this into the write lock block since we don't really need it outside
return | ||
} | ||
|
||
txm.lggr.Warnw("re-org detected for transaction", "txID", txInfo.id, "signature", sig, "previousStatus", txInfo.state, "currentStatus", currentTxState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we can consider re-orgs as part of the normal TXM processes since it's an expected occurrence eventually. Warn
might be too alarming here. I think Info
or even Debug
might be a better log level here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually let's do Debug
to keep this the same as our expiration log
} | ||
|
||
// For regressions from "Confirmed, we'll need to rebroadcast the tx. | ||
if regressionType == FromConfirmed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC I thought I remember you mentioning that if we detect a re-org, we'd have to replace the blockhash in all cases since the re-org implies that the block hash we were using is invalid. So rebroadcasting with the same blockhash would fail anyways. If that's the case, I think we can ditch this regressionType
and just assign a new blockhash always. We could also skip the specific confirmed re-orged tx
log below too then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if that's not an issue, we can maybe still consider keeping the same logic for both cases if it helps simplify things. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Great catch with this one. This approach simplifies the design a lot. Let me test it first to ensure we don’t unintentionally introduce races on Processed
ones, though I have a good feeling about this.
To back up our thoughts, here are some relevant quotes from the Solana Docs:
What is a blockhash?
A blockhash refers to the last Proof of History (PoH) hash for a slot. Since Solana uses PoH as a trusted clock, a transaction's recent blockhash can be thought of as a timestamp.
Proof of History refresher
Solana's Proof of History mechanism uses a very long chain of recursive SHA-256 hashes to build a trusted clock. The “history” part of the name comes from the fact that block producers hash transaction id's into the stream to record which transactions were processed in their block.
PoH can be used as a trusted clock because each hash must be produced sequentially. Each produced block contains a blockhash and a list of hash checkpoints called “ticks” so that validators can verify the full chain of hashes in parallel and prove that some amount of time has actually passed.
How does transaction expiration work?
Each transaction includes a “recent blockhash” which is used as a PoH clock timestamp and expires when that blockhash is no longer “recent enough”.
As each block is finalized, the final hash of the block is added to the BlockhashQueue which stores a maximum of the 300 most recent blockhashes. During transaction processing, Solana Validators will check if each transaction's recent blockhash is recorded within the most recent 151 stored hashes (aka "max processing age"). If the transaction's recent blockhash is older than this max processing age, the transaction is not processed.
In conclusion, I have a good feeling this approach will be viable. If a block gets reorged, my understanding is that its blockhash will no longer be present in the mentioned BlockhashQueue
. Therefore, any retry attempt using the previous blockhash would not be accepted, mitigating the risk of duplicate transactions. The only transaction that should go through would be the new one created with a fresh (and hopefully valid) blockhash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, we’re considering the premise that a signature regression implies an invalid blockhash. If this holds true, it seems we wouldn’t need to check and aggregate state across all signatures since they all rely on the same blockhash, which would likely be invalid as well.
Are we in the same page with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya I was thinking the same thing. Not 100% sure if this can happen but I also want to be careful with potentially mistaking a status regression with a lagging RPC. In case the multinode switches to a new node that does not have a processed block with 1 of our tx propagated to it, we could maybe falsely consider the signature regressed. Maybe this risk is low enough, if it's real, and not that impactful that we can ignore it though. Aggregating statuses feels like could reduce the risk but we'd probably still have it if bumping is disabled and only have 1 valid sig at a time anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
In general, what's the harm in using a new BlockHash whenever we suspect a reorg?
// - Confirmed -> Processed || Broadcasted || Not Found | ||
// - Processed -> Broadcasted || Not Found | ||
currentTxState := convertStatus(status) | ||
if regressionType, isRegressed := isStatusRegression(txInfo.state, currentTxState); isRegressed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't need regressionType
anymore based on my other comment, I think we can just use TxHasReorg
for this and eliminate GetSignatureInfo
and UpdateSignatureStatus
to minimize the number of storage methods we use.
- We could rework
TxHasReorg
to take in a sig and a current state and return the id and bool. That way we can do theisStatusRegression
on the storage side and we can dropGetSignatureInfo
- We could also update
TxHasReorg
to fetch the id from the sig map to fetch the tx. Since we'd have access to the current sig and its new state, I don't think we'd have to callUpdateSignatureStatus
. We'd be dropping all old sigs anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this
status, errGetStatus = txmInstance.GetTransactionStatus(ctx, txID) | ||
require.NoError(t, errGetStatus) | ||
require.Equal(t, types.Finalized, status, "tx should be finalized after reorg") | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also include a test case where we detect a re-org from Processed
state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn’t achieve this because the local live validator transitions to Confirmed
instantly. I can try spamming a large number of transactions to overwhelm it, which might delay the transition and provide a window to stop it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha. In that case I think this is ok. Your mock testing for that scenario might be enough
) | ||
|
||
// isStatusRegression checks if the current status is a regression compared to the previous status: | ||
// - Confirmed -> Processed, Broadcasted, Not Found: should not regress |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Comment part ": should not regress" doesn't make sense.
Did you mean that any state change from confirmed/processed to broadcasted/NotFound means that it did regress?
Atleast that's that the code says, so the comment isn't reflecting the code.
} | ||
|
||
// For regressions from "Confirmed, we'll need to rebroadcast the tx. | ||
if regressionType == FromConfirmed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
In general, what's the harm in using a new BlockHash whenever we suspect a reorg?
// Removes all signatures associated with the prior tx, cancels prior ctx, updates compute unit price and sets given blockhash for rebroadcasting. | ||
// Calls sendWithRetry directly to avoid enqueuing the transaction. It logs the error when rebroadcast fails and returns the new signature when successful. | ||
func (txm *Txm) rebroadcastWithGivenBlockhash(ctx context.Context, pTx pendingTx, blockhash solana.Hash, lastValidBlockHeight uint64) (solana.Signature, error) { | ||
// Removes all signatures associated to prior tx and cancels context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Just say "Remove the previous tx from state".
Because it does that, in addition to signatures.
// TxHasReorg determines whether a reorg has occurred for a given tx. | ||
// It achieves this by comparing the highest aggregated state across all associated signatures with the current state of the transaction. | ||
// If the highest aggregated state is less than the current state, a reorg has occurred and we need to handle it. | ||
TxHasReorg(id string) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rename to IsTxReorged
The previous name, to me, looked like was telling this interface that a reorg has occurred.
if _, exists := c.sigToID[sig]; exists { | ||
return ErrSigAlreadyExists | ||
// Check if ID already exists in any of the maps | ||
if _, exists := c.broadcastedProcessedTxs[tx.id]; exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you're using tx.id for already existing entries, won't that cause unnecessary errors when calling this function from rebroadcastExpiredTxs()?
I see that rebroadcastExpiredTxs() will collect all expired txs, and then call txm.sendWithRetry() on them, using the same tx.id. That looks to me will fail because this function will return the error ErrIDAlreadyExists.
Description
Tickets
Soak Testing