Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NONEVM-984][solana] - Reorg Detection + lighter rpc call #951

Open
wants to merge 94 commits into
base: develop
Choose a base branch
from

Conversation

Farber98
Copy link
Contributor

@Farber98 Farber98 commented Nov 28, 2024

Description

  • Track transaction statuses on per signature basis to identify which signature for a transaction was included to detect re-orgs specifically for it
  • Update the confirmation logic to identify regression in a signature’s transaction status when no or processed status is received to detect re-orgs
    • A transaction can revert back to no status or processed from confirmed status
    • A transaction can revert back to no status from processed status
  • If a re-org is detected
    • from confirmed, restart the retry/bump loop
    • from processed, we don't do nothing and it's handled by expiration rebroadcast if needed later

Tickets

Soak Testing

@Farber98 Farber98 marked this pull request as ready for review January 2, 2025 19:40
@Farber98 Farber98 requested a review from a team as a code owner January 2, 2025 19:40
@@ -112,11 +130,8 @@ func (c *pendingTxContext) New(tx pendingTx, sig solana.Signature, cancel contex
return err
}

// upgrade to write lock if sig or id do not exist
// upgrade to write lock if id do not exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// upgrade to write lock if id do not exist
// upgrade to write lock if id does not exist

return err
}

var pTx pendingTx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: think we can move this into the write lock block since we don't really need it outside

return
}

txm.lggr.Warnw("re-org detected for transaction", "txID", txInfo.id, "signature", sig, "previousStatus", txInfo.state, "currentStatus", currentTxState)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we can consider re-orgs as part of the normal TXM processes since it's an expected occurrence eventually. Warn might be too alarming here. I think Info or even Debug might be a better log level here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually let's do Debug to keep this the same as our expiration log

}

// For regressions from "Confirmed, we'll need to rebroadcast the tx.
if regressionType == FromConfirmed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC I thought I remember you mentioning that if we detect a re-org, we'd have to replace the blockhash in all cases since the re-org implies that the block hash we were using is invalid. So rebroadcasting with the same blockhash would fail anyways. If that's the case, I think we can ditch this regressionType and just assign a new blockhash always. We could also skip the specific confirmed re-orged tx log below too then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if that's not an issue, we can maybe still consider keeping the same logic for both cases if it helps simplify things. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Great catch with this one. This approach simplifies the design a lot. Let me test it first to ensure we don’t unintentionally introduce races on Processed ones, though I have a good feeling about this.

To back up our thoughts, here are some relevant quotes from the Solana Docs:

What is a blockhash?

A blockhash refers to the last Proof of History (PoH) hash for a slot. Since Solana uses PoH as a trusted clock, a transaction's recent blockhash can be thought of as a timestamp.

Proof of History refresher

Solana's Proof of History mechanism uses a very long chain of recursive SHA-256 hashes to build a trusted clock. The “history” part of the name comes from the fact that block producers hash transaction id's into the stream to record which transactions were processed in their block.

PoH can be used as a trusted clock because each hash must be produced sequentially. Each produced block contains a blockhash and a list of hash checkpoints called “ticks” so that validators can verify the full chain of hashes in parallel and prove that some amount of time has actually passed.

How does transaction expiration work?

Each transaction includes a “recent blockhash” which is used as a PoH clock timestamp and expires when that blockhash is no longer “recent enough”.

As each block is finalized, the final hash of the block is added to the BlockhashQueue which stores a maximum of the 300 most recent blockhashes. During transaction processing, Solana Validators will check if each transaction's recent blockhash is recorded within the most recent 151 stored hashes (aka "max processing age"). If the transaction's recent blockhash is older than this max processing age, the transaction is not processed.

In conclusion, I have a good feeling this approach will be viable. If a block gets reorged, my understanding is that its blockhash will no longer be present in the mentioned BlockhashQueue. Therefore, any retry attempt using the previous blockhash would not be accepted, mitigating the risk of duplicate transactions. The only transaction that should go through would be the new one created with a fresh (and hopefully valid) blockhash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, we’re considering the premise that a signature regression implies an invalid blockhash. If this holds true, it seems we wouldn’t need to check and aggregate state across all signatures since they all rely on the same blockhash, which would likely be invalid as well.

Are we in the same page with this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I was thinking the same thing. Not 100% sure if this can happen but I also want to be careful with potentially mistaking a status regression with a lagging RPC. In case the multinode switches to a new node that does not have a processed block with 1 of our tx propagated to it, we could maybe falsely consider the signature regressed. Maybe this risk is low enough, if it's real, and not that impactful that we can ignore it though. Aggregating statuses feels like could reduce the risk but we'd probably still have it if bumping is disabled and only have 1 valid sig at a time anyways.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
In general, what's the harm in using a new BlockHash whenever we suspect a reorg?

// - Confirmed -> Processed || Broadcasted || Not Found
// - Processed -> Broadcasted || Not Found
currentTxState := convertStatus(status)
if regressionType, isRegressed := isStatusRegression(txInfo.state, currentTxState); isRegressed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't need regressionType anymore based on my other comment, I think we can just use TxHasReorg for this and eliminate GetSignatureInfo and UpdateSignatureStatus to minimize the number of storage methods we use.

  1. We could rework TxHasReorg to take in a sig and a current state and return the id and bool. That way we can do the isStatusRegression on the storage side and we can drop GetSignatureInfo
  2. We could also update TxHasReorg to fetch the id from the sig map to fetch the tx. Since we'd have access to the current sig and its new state, I don't think we'd have to call UpdateSignatureStatus. We'd be dropping all old sigs anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this

status, errGetStatus = txmInstance.GetTransactionStatus(ctx, txID)
require.NoError(t, errGetStatus)
require.Equal(t, types.Finalized, status, "tx should be finalized after reorg")
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also include a test case where we detect a re-org from Processed state?

Copy link
Contributor Author

@Farber98 Farber98 Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn’t achieve this because the local live validator transitions to Confirmed instantly. I can try spamming a large number of transactions to overwhelm it, which might delay the transition and provide a window to stop it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. In that case I think this is ok. Your mock testing for that scenario might be enough

)

// isStatusRegression checks if the current status is a regression compared to the previous status:
// - Confirmed -> Processed, Broadcasted, Not Found: should not regress

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Comment part ": should not regress" doesn't make sense.
Did you mean that any state change from confirmed/processed to broadcasted/NotFound means that it did regress?
Atleast that's that the code says, so the comment isn't reflecting the code.

}

// For regressions from "Confirmed, we'll need to rebroadcast the tx.
if regressionType == FromConfirmed {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
In general, what's the harm in using a new BlockHash whenever we suspect a reorg?

// Removes all signatures associated with the prior tx, cancels prior ctx, updates compute unit price and sets given blockhash for rebroadcasting.
// Calls sendWithRetry directly to avoid enqueuing the transaction. It logs the error when rebroadcast fails and returns the new signature when successful.
func (txm *Txm) rebroadcastWithGivenBlockhash(ctx context.Context, pTx pendingTx, blockhash solana.Hash, lastValidBlockHeight uint64) (solana.Signature, error) {
// Removes all signatures associated to prior tx and cancels context.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Just say "Remove the previous tx from state".
Because it does that, in addition to signatures.

// TxHasReorg determines whether a reorg has occurred for a given tx.
// It achieves this by comparing the highest aggregated state across all associated signatures with the current state of the transaction.
// If the highest aggregated state is less than the current state, a reorg has occurred and we need to handle it.
TxHasReorg(id string) bool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename to IsTxReorged

The previous name, to me, looked like was telling this interface that a reorg has occurred.

if _, exists := c.sigToID[sig]; exists {
return ErrSigAlreadyExists
// Check if ID already exists in any of the maps
if _, exists := c.broadcastedProcessedTxs[tx.id]; exists {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you're using tx.id for already existing entries, won't that cause unnecessary errors when calling this function from rebroadcastExpiredTxs()?

I see that rebroadcastExpiredTxs() will collect all expired txs, and then call txm.sendWithRetry() on them, using the same tx.id. That looks to me will fail because this function will return the error ErrIDAlreadyExists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants