---
NEP: 522
Title: Stateless Transactions
Authors: David Millar-Durrant <[email protected]>
Status: Draft
DiscussionsTo: https://github.com/nearprotocol/neps/pull/522
Type: Protocol
Version: 1.0.0
Created: 2023-12-05
LastUpdated: 2023-12-05
---

## Summary

This NEP allows transactions to be sent to the NEAR blockchain without depending on any blockchain state. It also allows individual keys to send transactions without needing to know in what order they will arrive on the network.

We add an optional `random_nonce` field to the `Transaction` and `DelegateAction` messages, which lets clients pick an alternate way of disambiguating identical transactions and preventing replay attacks. The `random_nonce` is an arbitrary value selected by the client, typically a random number.

We also add an optional `expires_at` field to the `Transaction` and `DelegateAction` messages. The protocol guarantees that a message will not be run in any block with a block timestamp later than the one contained in `expires_at`.

`expires_at` and `random_nonce` exist as alternatives to `max_block_height` and `nonce` respectively. When present, they cause the corresponding legacy field to be ignored and the mechanisms connected to it to be disabled. We put limits on the validity period of messages using `random_nonce` to avoid a large cost of checking their uniqueness.

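Concretely, a message carrying the new fields might look like the following sketch. The struct is simplified and hypothetical (the real `Transaction` has many more fields); only the four fields discussed in this NEP are shown.

```rust
// Simplified, hypothetical sketch of a message carrying the new optional
// fields alongside the legacy ones they replace.
struct Transaction {
    nonce: u64,                // legacy; ignored (and must be 0) when random_nonce is set
    max_block_height: u64,     // legacy; ignored (and must be 0) when expires_at is set
    random_nonce: Option<u64>, // arbitrary client-chosen value
    expires_at: Option<u64>,   // block-timestamp deadline
}

// Which replay-protection mechanism applies to this message?
fn replay_mechanism(tx: &Transaction) -> &'static str {
    if tx.random_nonce.is_some() {
        "random_nonce"
    } else {
        "sequential_nonce"
    }
}
```
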
## Motivation

### Problem

In theory, the current nonce implementation is sound and efficient. In practice, it is responsible for much of the unreliability observed in applications written on NEAR, for insecure client libraries, and for many hours of wasted engineering time.

This disconnect stems from two faulty assumptions:

1. A client can control in which block their transactions land.
2. A client private key will always exist in only one place at a time.

Let's deal with the effects of the first assumption. If you send many transactions to a single RPC endpoint, even with the protocol attempting to order transactions in a maximally favorable way within a block, many will arrive out of order. This is simply the nature of both networks and distributed systems.

```
sent_nonces = [1, 2, 3, ... 98, 99, 100]
received_nonces = block_1 = [1, 99], block_2 = [2, 3, ... 98, 100]
valid_nonces = block_1 = [1, 99], block_2 = [100]
```

The more transactions you try to execute per block, the more are rejected. This creates an issue called "nonce contention". Naturally, client libraries try to hide this all-too-common failure from the user, so they silently increment the nonce and try again, increasing traffic and failures. This behavior slows down transactions, adds load to RPC nodes, and causes spurious failures.

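The rejection pattern above can be reproduced with a toy model of sequential nonce checking (this is an illustration, not nearcore's implementation): a key's nonce must strictly increase, so transactions that arrive out of order are rejected even though the client sent every one of them.

```rust
// Toy model: a key's nonce must strictly increase, so transactions that
// arrive out of order are rejected even though all of them were sent.
fn apply_in_arrival_order(arrived: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let mut last_nonce = 0u64;
    let (mut accepted, mut rejected) = (Vec::new(), Vec::new());
    for &nonce in arrived {
        if nonce > last_nonce {
            last_nonce = nonce;
            accepted.push(nonce);
        } else {
            // InvalidNonce: nonce is not greater than the last used nonce
            rejected.push(nonce);
        }
    }
    (accepted, rejected)
}
```

Feeding this the arrival order `[1, 99, 2, 3, 100]` accepts `[1, 99, 100]` and rejects `[2, 3]`, matching the example above.
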
Far more perniciously, since no client I'm aware of verifies the RPC response using a light client before retrying, RPC nodes can prompt the client to provide them with transactions with different nonces. This allows rogue RPC nodes to perform replay attacks.

We have found that users are often managing a single key using multiple clients, or are running multiple servers controlling the same key. As such, clients fetch a nonce from an indexer (or RPC node) every time they need to make a transaction. Indexers tend to fall behind or fail under load, and this causes our wallets, relayers and other services to fail.

[Metatransactions](https://github.com/nearprotocol/neps/pull/366) cause yet more problems. Since Metatransactions have two nonces, one in the `DelegateAction` and another in the `Transaction`, clients have twice the chance of getting things wrong. First, relayers generally sign messages for many users and can reach hundreds of transactions per second. It's generally sensible to run a number of relayer instances around the world behind a load balancer, so you need to globally round-robin a large number of keys in order to minimize nonce contention, which mandates some kind of communication or persistence between relayer instances that otherwise would not need to exist. With random nonces, relayers could remain stateless with regard to the access keys they use to sign transactions.

Furthermore, when a transaction submitted by a relayer fails due to nonce contention, the cause may be either contention on the key the relayer uses to sign transactions, or contention on the key held by the client using the relayer, which signed the `DelegateAction` itself. In the latter case, the client using the relayer needs to identify that this is what happened (either by querying transaction results or by the relayer telling it the reason for failure), and it is then responsible for signing a new `DelegateAction` with a new, valid nonce and submitting it to the relayer. Finally, on receipt of the new transaction, the relayer must also identify a new, valid nonce for its own key and sign a new transaction using that nonce. This is a slow and brittle process and makes failure modes for using relayers more complex than they need to be.

These issues have been a pain point in the development of wallets, faucets, FastAuth and relayers, to name a few. They are the major cause of the FastAuth relayer's instability and have, directly and indirectly, led to downtime multiple times in the last month.

All of these problems are solvable with sufficiently smart clients, enough keys per account (hundreds to thousands) and rock-solid infrastructure, but we haven't managed that so far, and it's probably a lot easier to simplify how we call the network.

### Solution

Our new solution is simple from the client side: generate a random nonce, set an `expires_at` 100 seconds in the future and send the transaction to the network. If it fails spuriously, retry with the same nonce; if it fails consistently, prompt the user to decide whether to resend the transaction. The nonce does not need to be cryptographically secure, since we don't require unpredictability; a client could reasonably increment a single value per transaction and it would remain secure.

The client doesn't need to query the key's nonce, and it doesn't need to know the current block height, removing a source of instability. It can send as many messages as it likes from a single key without worrying about any form of contention.

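As a sketch, the client-side flow might look like this. The names (`TxParams`, `new_tx_params`) are illustrative and not part of any SDK; the point is that no nonce query and no block-height query are needed.

```rust
// Illustrative client-side flow under this NEP: pick a mostly-unique value
// as the random nonce and a wall-clock expiry ~100 seconds in the future.
use std::time::{SystemTime, UNIX_EPOCH};

struct TxParams {
    random_nonce: u64,
    expires_at: u64, // nanoseconds since the Unix epoch
}

fn nanos_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_nanos() as u64
}

fn new_tx_params(validity_secs: u64) -> TxParams {
    TxParams {
        // Need not be cryptographically secure; a per-client counter or a
        // timestamp works, as long as otherwise-identical transactions
        // get distinct values.
        random_nonce: nanos_now(),
        expires_at: nanos_now() + validity_secs * 1_000_000_000,
    }
}
```
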
This mode of operation isn't suited to cold wallets, because of the short transaction validity period and the need to order many transactions[^2]. Such clients should use the old mechanism.

## Specification

We describe `Transaction`s and `DelegateAction`s collectively as messages in this specification.

We propose to add the following optional fields to the messages:

```rust
expires_at: Option<u64>,
random_nonce: Option<u64>,
```

You first construct a trie keyed by the following type, where `|` represents a tagged sum type:

```
Expiry In Seconds | Expiry Block Height => Hash Transaction | Hash DelegateAction => ()
```

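One way to read that key type is as the following Rust sum type (a sketch of this NEP's description, not nearcore code):

```rust
// Sketch of the dedup-trie key described above: the first component says
// when an entry expires, the second is the hash of the message. The trie
// maps (Expiry, MessageHash) => ().
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Expiry {
    Seconds(u64),     // from expires_at, rounded up to a whole second
    BlockHeight(u64), // from max_block_height
}

type MessageHash = [u8; 32];
type DedupKey = (Expiry, MessageHash);
```

Ordering keys by `Expiry` first keeps all hashes that expire at the same second adjacent, which makes the per-second range deletion in the cleanup step cheap.
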
When a `Transaction` or `DelegateAction` is received, its Borsh representation is hashed. If `expires_at` is used, the expiry is rounded up to the next whole second. When `max_block_height` is used to describe expiry, it is inserted directly.

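A minimal sketch of that rounding step, assuming timestamps in nanoseconds since the Unix epoch (as NEAR block timestamps are):

```rust
// Round a nanosecond timestamp *up* to the next whole second, matching the
// ceil_to_second used in the pseudocode below.
const NANOS_PER_SEC: u64 = 1_000_000_000;

fn ceil_to_second(nanos: u64) -> u64 {
    (nanos + NANOS_PER_SEC - 1) / NANOS_PER_SEC
}
```
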
We then ensure that the message is unique using the following pseudocode:

```rust
fn ensure_unique(trie: &mut Trie, message: Transaction | DelegateAction) {
    // Does it have an expires_at or a valid max_block_height?
    let expiry: MaxBlockHeight | ExpiresAt = message.expiry_type();
    // If it has an expiry time, ceil it to the next second
    if expiry is ExpiresAt {
        expiry = expiry.ceil_to_second();
    }
    // Insert into the trie; insertion reports whether the key already existed
    let already_exists = trie.insert(expiry, message.hash());
    assert_eq!(already_exists, false);
}
```

At the end of every block we need to clean up the trie, making sure not to remove any hashes that might still be valid.

```rust
fn cleanup_trie(trie: &mut Trie, chunk_before_last: Chunk, last_chunk: Chunk) {
    let blocktime_before_last = chunk_before_last.block_time.ceil_to_second();
    let last_blocktime = last_chunk.block_time.ceil_to_second();

    // Get all the seconds where things have expired since the last cleanup
    // This doesn't include the last blocktime (.. not ..=)
    let expired_times = blocktime_before_last..last_blocktime;

    for t in expired_times {
        // Remove all hashes that expired at this time
        trie.remove_all(ExpiresAt(t));
    }

    // Remove all hashes that expired at the previous block
    trie.remove_all(MaxBlockHeight(last_chunk.block_height));
}
```

The nonce only needs to exist on the shard containing the sender account, since that is the only place it can be sent from. On a shard split we need to ensure that all valid transaction hashes are sent to both of the new shards[^3].

If a client attempts to send an identical transaction with an identical `random_nonce`, we preserve the existing behavior for responding to an already-sent transaction: provided it still exists, they will receive the previous transaction's receipt and no action will be taken on chain.

There is no requirement that the `random_nonce` field is unique or random. A client could (inadvisably) decide to always provide a `random_nonce` with a value of 0, and it would work as expected until they tried to send two completely identical transactions.

When `random_nonce` is present, the protocol **may** reject transactions with an `expires_at` more than 120 seconds after the most recent block timestamp, or a `max_block_height` more than 100 blocks after the most recent block height, in which case the error `ExpiryTooLate` will be thrown.

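A sketch of that validity-window check. The error name and the 120-second/100-block constants come from this NEP; the function shape is illustrative, and timestamps are assumed to be nanoseconds.

```rust
// Sketch of the ExpiryTooLate check that applies when random_nonce is present.
const MAX_EXPIRES_AT_WINDOW_NANOS: u64 = 120 * 1_000_000_000; // 120 seconds
const MAX_BLOCK_HEIGHT_WINDOW: u64 = 100; // 100 blocks

#[derive(Debug, PartialEq)]
enum TxError {
    ExpiryTooLate,
}

fn check_expiry_window(
    expires_at: Option<u64>,      // nanoseconds
    max_block_height: Option<u64>,
    latest_block_timestamp: u64,  // nanoseconds
    latest_block_height: u64,
) -> Result<(), TxError> {
    if let Some(ts) = expires_at {
        if ts > latest_block_timestamp + MAX_EXPIRES_AT_WINDOW_NANOS {
            return Err(TxError::ExpiryTooLate);
        }
    }
    if let Some(height) = max_block_height {
        if height > latest_block_height + MAX_BLOCK_HEIGHT_WINDOW {
            return Err(TxError::ExpiryTooLate);
        }
    }
    Ok(())
}
```
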
When the `random_nonce` field is present, the `nonce` field is ignored and all mechanisms connected to it are disabled. The `nonce` field must be 0 when `random_nonce` is present or the error `MalformedTransaction | MalformedDelegateAction` will be thrown.

When the `expires_at` field is present, the `max_block_height` field is ignored and all mechanisms connected to it are disabled. `max_block_height` must be 0 when `expires_at` is present or the `MalformedTransaction | MalformedDelegateAction` error will be thrown.

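These two field-consistency rules can be sketched as a single check (error name from this NEP; the function shape is illustrative):

```rust
// When a new optional field is present, the legacy field it replaces must
// be zero; otherwise the message is malformed.
#[derive(Debug, PartialEq)]
enum MalformedError {
    MalformedTransaction,
}

fn check_field_consistency(
    nonce: u64,
    random_nonce: Option<u64>,
    max_block_height: u64,
    expires_at: Option<u64>,
) -> Result<(), MalformedError> {
    if random_nonce.is_some() && nonce != 0 {
        return Err(MalformedError::MalformedTransaction);
    }
    if expires_at.is_some() && max_block_height != 0 {
        return Err(MalformedError::MalformedTransaction);
    }
    Ok(())
}
```
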
## Reference Implementation

A draft PR of the protocol change is a work in progress and should land on Friday 8th December 2023.

## Security Implications

Great care must be taken to ensure that the `CryptoHash` of any message that has been executed is stored everywhere that message may still be valid and could be executed. If this is not the case, it's possible to launch replay attacks on transactions.

Since these transactions store the `CryptoHash` in the working trie for the duration of their validity, the validity period of these transactions must always be small enough to prevent slow lookups or excessive space use. The trie's size is bounded by the number of transactions a shard can execute per expiry window.

Node providers may decide to tamper with the block time. The protocol ensures that block time is always monotonically increasing, so invalid messages can't become valid, but providers could make messages last much longer than one might expect. That said, that's already possible if you can slow down the network's block speed, causing similar issues.

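To make the space bound concrete, here is a back-of-envelope estimate. The 10,000 transactions-per-second figure is an assumed future throughput (observed rates have been far lower); each entry is a 32-byte hash kept for the roughly 100-second expiry window.

```rust
// Back-of-envelope bound on dedup-trie size per shard, under the stated
// assumptions (not a measurement).
fn trie_bound_bytes(tx_per_sec: u64, window_secs: u64, hash_bytes: u64) -> u64 {
    tx_per_sec * window_secs * hash_bytes
}
```

`trie_bound_bytes(10_000, 100, 32)` gives 32,000,000 bytes, i.e. about 32 MB of trie state in the worst case.
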
## Alternatives

There are plenty of ways to pack this alternate nonce into the existing message format.

We could say that if the top bit of the nonce is 1 then it's a random nonce. This saves some space, but I think it will manifest in less clear errors when a block consumer reads it.

We could have a boolean flag `random_nonce` in the message which changes the behavior of the nonce. This would likely save space in the message but is less flexible moving forward.

I'm broadly OK with any of these options.

We could have the expiry time measured in seconds since the Unix epoch and have it be a `u32`. This saves some space, but it's [inconsistent with our contract standards](https://github.com/near/NEPs/blob/random-nonces/neps/nep-0393.md) and [2038](https://en.wikipedia.org/wiki/Year_2038_problem) isn't that far away in the great span of things.

Why not do this without the expiry time? We could, but it would mean there'd still be a dependency on an indexer. The current block height is somewhat predictable, but it's not as if you can set your watch to it, especially under load.

## Future possibilities

I imagine that as this feature becomes more widely used, there will be pressure to increase the period these transactions are valid for, at the expense of more storage. The current NEP covers many use cases and uses very few resources, so I'm going to leave the validity period low but allow it to expand at a later date.

## Consequences

### Positive

- Simpler relayer
- Simpler MPC service
- Simpler faucets
- Simpler, more secure clients
- Better application reliability
- No more reliance on an indexer to send transactions
- Read RPC [will work better](https://pagodaplatform.atlassian.net/browse/ND-536)

### Neutral

### Negative

- Additional fields mean larger messages
- Additional complexity leads to more potential attacks
- More things will be stored in the trie (but not for long, and they're not too big)
- Sometimes people shouldn't be sending transactions when an indexer is down

### Backwards Compatibility

This is going to be backwards compatible for clients, but probably not for things like block explorers, indexers or RPC nodes. We'll need a period of this running on testnet for them to update their code.

## Unresolved Issues (Optional)

- I'm not quite sure how much more to charge for this; the data is small and ephemeral, so maybe not much or nothing.

## Changelog

No changes so far.

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

[^1]: Actually we compare the most recently used nonce to the one on the indexer and increment from the greater of the two. This helps in some, but not all, situations when indexers are down or behind.

[^2]: I'm yet to come across a use case where someone wants a number of transactions to be executed in order but doesn't care whether some of the transactions fail spuriously.

[^3]: This is inefficient on a state split, but those are rare, and it saves us storing everyone's account IDs, which can be pretty large.