---
NEP: 522
Title: Stateless Transactions
Authors: David Millar-Durrant <[email protected]>
Status: Draft
DiscussionsTo: https://github.com/nearprotocol/neps/pull/522
Type: Protocol
Version: 1.0.0
Created: 2023-12-05
LastUpdated: 2023-12-05
---

## Summary

This NEP allows transactions to be sent to the NEAR blockchain without depending on any blockchain state. It also allows individual keys to send transactions without needing to know in what order they will arrive on the network.

We add an optional `random_nonce` field to the `Transaction` and `DelegateAction` messages, which lets clients pick an alternate way of disambiguating identical transactions and preventing replay attacks. The `random_nonce` is an arbitrary value selected by the client, typically a random number.

We also add an optional `expires_at` field to the `Transaction` and `DelegateAction` messages. The protocol guarantees that a message will not be run in any block with a block timestamp later than the one contained in `expires_at`.

`expires_at` and `random_nonce` exist as alternatives to `max_block_height` and `nonce` respectively. When present, they cause the corresponding legacy field to be ignored and the mechanisms connected to it to be disabled. We put limits on the validity period of messages using `random_nonce` to avoid a large cost of checking their uniqueness.

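Concretely, a message carrying the new fields might look like the following sketch. The struct is simplified and hypothetical (the real `Transaction` has many more fields); only the four fields discussed in this NEP are shown.

```rust
// Simplified, hypothetical sketch of a message carrying the new optional
// fields alongside the legacy ones they replace.
struct Transaction {
    nonce: u64,                // legacy; ignored (and must be 0) when random_nonce is set
    max_block_height: u64,     // legacy; ignored (and must be 0) when expires_at is set
    random_nonce: Option<u64>, // arbitrary client-chosen value
    expires_at: Option<u64>,   // block-timestamp deadline
}

// Which replay-protection mechanism applies to this message?
fn replay_mechanism(tx: &Transaction) -> &'static str {
    if tx.random_nonce.is_some() {
        "random_nonce"
    } else {
        "sequential_nonce"
    }
}
```
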
## Motivation

### Problem

In theory, the current nonce implementation is sound and efficient. In practice, it is responsible for much of the unreliability observed in applications written on NEAR, for insecure client libraries, and for many hours of wasted engineering time.

This disconnect stems from two faulty assumptions:

1. A client can control in which block their transactions land.
2. A client private key will always exist in only one place at a time.

Let's deal with the effects of the first assumption. If you send many transactions to a single RPC endpoint, even with the protocol attempting to order transactions in a maximally favorable way within a block, many will arrive out of order. This is simply the nature of both networks and distributed systems.

```
sent_nonces = [1, 2, 3, ... 98, 99, 100]
received_nonces = block_1 = [1, 99], block_2 = [2, 3, ... 98, 100]
valid_nonces = block_1 = [1, 99], block_2 = [100]
```

The more transactions you try to execute per block, the more are rejected. This creates an issue called "nonce contention". Naturally, client libraries try to hide this all-too-common failure from the user, so they silently increment the nonce and try again, increasing traffic and failures. This behavior slows down transactions, adds load to RPC nodes, and causes spurious failures.

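The rejection pattern above can be reproduced with a toy model of sequential nonce checking (this is an illustration, not nearcore's implementation): a key's nonce must strictly increase, so transactions that arrive out of order are rejected even though the client sent every one of them.

```rust
// Toy model: a key's nonce must strictly increase, so transactions that
// arrive out of order are rejected even though all of them were sent.
fn apply_in_arrival_order(arrived: &[u64]) -> (Vec<u64>, Vec<u64>) {
    let mut last_nonce = 0u64;
    let (mut accepted, mut rejected) = (Vec::new(), Vec::new());
    for &nonce in arrived {
        if nonce > last_nonce {
            last_nonce = nonce;
            accepted.push(nonce);
        } else {
            // InvalidNonce: nonce is not greater than the last used nonce
            rejected.push(nonce);
        }
    }
    (accepted, rejected)
}
```

Feeding this the arrival order `[1, 99, 2, 3, 100]` accepts `[1, 99, 100]` and rejects `[2, 3]`, matching the example above.
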
Far more perniciously, since no client I'm aware of verifies the RPC response using a light client before retrying, RPC nodes can prompt the client to provide them with transactions with different nonces. This allows rogue RPC nodes to perform replay attacks.

We have found that users are often managing a single key using multiple clients, or are running multiple servers controlling the same key. As such, clients fetch a nonce from an indexer (or RPC node) every time they need to make a transaction. Indexers tend to fall behind or fail under load, and this causes our wallets, relayers and other services to fail.

[Metatransactions](https://github.com/nearprotocol/neps/pull/366) cause yet more problems. Since Metatransactions have two nonces, one in the `DelegateAction` and another in the `Transaction`, clients have twice the chance of getting things wrong. First, relayers generally sign messages for many users and can reach hundreds of transactions per second. It's generally sensible to run a number of relayer instances around the world behind a load balancer, so you need to globally round-robin a large number of keys in order to minimize nonce contention, which mandates some kind of communication or persistence between relayer instances that otherwise would not need to exist. With random nonces, relayers could remain stateless with regard to the access keys they use to sign transactions.

Furthermore, when a transaction submitted by a relayer fails due to nonce contention, the cause may be either contention on the key the relayer uses to sign transactions, or contention on the key held by the client using the relayer, which signed the `DelegateAction` itself. In the latter case, the client using the relayer needs to identify that this is what happened (either by querying transaction results or by the relayer telling it the reason for failure), and it is then responsible for signing a new `DelegateAction` with a new, valid nonce and submitting it to the relayer. Finally, on receipt of the new transaction, the relayer must also identify a new, valid nonce for its own key and sign a new transaction using that nonce. This is a slow and brittle process and makes failure modes for using relayers more complex than they need to be.

These issues have been a pain point in the development of wallets, faucets, FastAuth and relayers, to name a few. They are the major cause of the FastAuth relayer's instability and have, directly and indirectly, led to downtime multiple times in the last month.

All of these problems are solvable with sufficiently smart clients, enough keys per account (hundreds to thousands) and rock-solid infrastructure, but we haven't managed that so far, and it's probably a lot easier to simplify how we call the network.

### Solution

Our new solution is simple from the client side: generate a random nonce, set an `expires_at` 100 seconds in the future and send the transaction to the network. If it fails spuriously, retry with the same nonce; if it fails consistently, prompt the user to decide whether to resend the transaction. The nonce does not need to be cryptographically secure, since we don't require unpredictability; a client could reasonably increment a single value per transaction and it would remain secure.

The client doesn't need to query the key's nonce, and it doesn't need to know the current block height, removing a source of instability. It can send as many messages as it likes from a single key without worrying about any form of contention.

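As a sketch, the client-side flow might look like this. The names (`TxParams`, `new_tx_params`) are illustrative and not part of any SDK; the point is that no nonce query and no block-height query are needed.

```rust
// Illustrative client-side flow under this NEP: pick a mostly-unique value
// as the random nonce and a wall-clock expiry ~100 seconds in the future.
use std::time::{SystemTime, UNIX_EPOCH};

struct TxParams {
    random_nonce: u64,
    expires_at: u64, // nanoseconds since the Unix epoch
}

fn nanos_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_nanos() as u64
}

fn new_tx_params(validity_secs: u64) -> TxParams {
    TxParams {
        // Need not be cryptographically secure; a per-client counter or a
        // timestamp works, as long as otherwise-identical transactions
        // get distinct values.
        random_nonce: nanos_now(),
        expires_at: nanos_now() + validity_secs * 1_000_000_000,
    }
}
```
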
This mode of operation isn't suited to cold wallets, because of the short transaction validity period and the need to order many transactions[^2]. Such clients should use the old mechanism.

## Specification

We describe `Transaction`s and `DelegateAction`s collectively as messages in this specification.

We propose to add the following optional fields to the messages:

```rust
expires_at: Option<u64>,
random_nonce: Option<u64>,
```

You first construct a trie keyed by the following type, where `|` represents a tagged sum type:

```
Expiry In Seconds | Expiry Block Height => Hash Transaction | Hash DelegateAction => ()
```

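One way to read that key type is as the following Rust sum type (a sketch of this NEP's description, not nearcore code):

```rust
// Sketch of the dedup-trie key described above: the first component says
// when an entry expires, the second is the hash of the message. The trie
// maps (Expiry, MessageHash) => ().
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Expiry {
    Seconds(u64),     // from expires_at, rounded up to a whole second
    BlockHeight(u64), // from max_block_height
}

type MessageHash = [u8; 32];
type DedupKey = (Expiry, MessageHash);
```

Ordering keys by `Expiry` first keeps all hashes that expire at the same second adjacent, which makes the per-second range deletion in the cleanup step cheap.
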
When a `Transaction` or `DelegateAction` is received, its Borsh representation is hashed. If `expires_at` is used, the expiry is rounded up to the next whole second. When `max_block_height` is used to describe expiry, it is inserted directly.

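A minimal sketch of that rounding step, assuming timestamps in nanoseconds since the Unix epoch (as NEAR block timestamps are):

```rust
// Round a nanosecond timestamp *up* to the next whole second, matching the
// ceil_to_second used in the pseudocode below.
const NANOS_PER_SEC: u64 = 1_000_000_000;

fn ceil_to_second(nanos: u64) -> u64 {
    (nanos + NANOS_PER_SEC - 1) / NANOS_PER_SEC
}
```
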
We then ensure that the message is unique using the following pseudocode:

```rust
fn ensure_unique(trie: &mut Trie, message: Transaction | DelegateAction) {
    // Does it have an expires_at or a valid max_block_height?
    let expiry: MaxBlockHeight | ExpiresAt = message.expiry_type();
    // If it has an expiry time, ceil it to the next second
    if expiry is ExpiresAt {
        expiry = expiry.ceil_to_second();
    }
    // Insert into the trie; insertion reports whether the key already existed
    let already_exists = trie.insert(expiry, message.hash());
    assert_eq!(already_exists, false);
}
```

At the end of every block we need to clean up the trie, making sure not to remove any hashes that might still be valid.

```rust
fn cleanup_trie(trie: &mut Trie, chunk_before_last: Chunk, last_chunk: Chunk) {
    let blocktime_before_last = chunk_before_last.block_time.ceil_to_second();
    let last_blocktime = last_chunk.block_time.ceil_to_second();

    // Get all the seconds where things have expired since the last cleanup
    // This doesn't include the last blocktime (.. not ..=)
    let expired_times = blocktime_before_last..last_blocktime;

    for t in expired_times {
        // Remove all hashes that expired at this time
        trie.remove_all(ExpiresAt(t));
    }

    // Remove all hashes that expired at the previous block
    trie.remove_all(MaxBlockHeight(last_chunk.block_height));
}
```

The nonce only needs to exist on the shard containing the sender account, since that is the only place it can be sent from. On a shard split we need to ensure that all valid transaction hashes are sent to both of the new shards[^3].

If a client attempts to send an identical transaction with an identical `random_nonce`, we preserve the existing behavior for responding to an already-sent transaction: provided it still exists, they will receive the previous transaction's receipt and no action will be taken on chain.

There is no requirement that the `random_nonce` field is unique or random. A client could (inadvisably) decide to always provide a `random_nonce` with a value of 0, and it would work as expected until they tried to send two completely identical transactions.

When `random_nonce` is present, the protocol **may** reject transactions with an `expires_at` more than 120 seconds after the most recent block timestamp, or a `max_block_height` more than 100 blocks after the most recent block height, in which case the error `ExpiryTooLate` will be thrown.

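A sketch of that validity-window check. The error name and the 120-second/100-block constants come from this NEP; the function shape is illustrative, and timestamps are assumed to be nanoseconds.

```rust
// Sketch of the ExpiryTooLate check that applies when random_nonce is present.
const MAX_EXPIRES_AT_WINDOW_NANOS: u64 = 120 * 1_000_000_000; // 120 seconds
const MAX_BLOCK_HEIGHT_WINDOW: u64 = 100; // 100 blocks

#[derive(Debug, PartialEq)]
enum TxError {
    ExpiryTooLate,
}

fn check_expiry_window(
    expires_at: Option<u64>,      // nanoseconds
    max_block_height: Option<u64>,
    latest_block_timestamp: u64,  // nanoseconds
    latest_block_height: u64,
) -> Result<(), TxError> {
    if let Some(ts) = expires_at {
        if ts > latest_block_timestamp + MAX_EXPIRES_AT_WINDOW_NANOS {
            return Err(TxError::ExpiryTooLate);
        }
    }
    if let Some(height) = max_block_height {
        if height > latest_block_height + MAX_BLOCK_HEIGHT_WINDOW {
            return Err(TxError::ExpiryTooLate);
        }
    }
    Ok(())
}
```
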
When the `random_nonce` field is present, the `nonce` field is ignored and all mechanisms connected to it are disabled. The `nonce` field must be 0 when `random_nonce` is present or the error `MalformedTransaction | MalformedDelegateAction` will be thrown.

When the `expires_at` field is present, the `max_block_height` field is ignored and all mechanisms connected to it are disabled. `max_block_height` must be 0 when `expires_at` is present or the `MalformedTransaction | MalformedDelegateAction` error will be thrown.

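These two field-consistency rules can be sketched as a single check (error name from this NEP; the function shape is illustrative):

```rust
// When a new optional field is present, the legacy field it replaces must
// be zero; otherwise the message is malformed.
#[derive(Debug, PartialEq)]
enum MalformedError {
    MalformedTransaction,
}

fn check_field_consistency(
    nonce: u64,
    random_nonce: Option<u64>,
    max_block_height: u64,
    expires_at: Option<u64>,
) -> Result<(), MalformedError> {
    if random_nonce.is_some() && nonce != 0 {
        return Err(MalformedError::MalformedTransaction);
    }
    if expires_at.is_some() && max_block_height != 0 {
        return Err(MalformedError::MalformedTransaction);
    }
    Ok(())
}
```
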
## Reference Implementation

A draft PR of the protocol change is a work in progress and should land on Friday 8th December 2023.

## Security Implications

Great care must be taken to ensure that the `CryptoHash` of any message that has been executed is stored everywhere that message may still be valid and could be executed. If this is not the case, it's possible to launch replay attacks on transactions.

Since these transactions store the `CryptoHash` in the working trie for the duration of their validity, the validity period of these transactions must always be small enough to prevent slow lookups or excessive space use. The trie's size is bounded by the number of transactions a shard can execute per expiry window.

Node providers may decide to tamper with the block time. The protocol ensures that block time is always monotonically increasing, so invalid messages can't become valid, but providers could make messages last much longer than one might expect. That said, that's already possible if you can slow down the network's block speed, causing similar issues.

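To make the space bound concrete, here is a back-of-envelope estimate. The 10,000 transactions-per-second figure is an assumed future throughput (observed rates have been far lower); each entry is a 32-byte hash kept for the roughly 100-second expiry window.

```rust
// Back-of-envelope bound on dedup-trie size per shard, under the stated
// assumptions (not a measurement).
fn trie_bound_bytes(tx_per_sec: u64, window_secs: u64, hash_bytes: u64) -> u64 {
    tx_per_sec * window_secs * hash_bytes
}
```

`trie_bound_bytes(10_000, 100, 32)` gives 32,000,000 bytes, i.e. about 32 MB of trie state in the worst case.
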
## Alternatives

There are plenty of ways to pack this alternate nonce into the existing message format.

We could say that if the top bit of the nonce is 1 then it's a random nonce. This saves some space, but I think it will manifest in less clear errors when a block consumer reads it.

We could have a boolean flag `random_nonce` in the message which changes the behavior of the nonce. This would likely save space in the message but is less flexible moving forward.

I'm broadly OK with any of these options.

We could have the expiry time measured in seconds since the Unix epoch and have it be a `u32`. This saves some space, but it's [inconsistent with our contract standards](https://github.com/near/NEPs/blob/random-nonces/neps/nep-0393.md) and [2038](https://en.wikipedia.org/wiki/Year_2038_problem) isn't that far away in the great span of things.

Why not do this without the expiry time? We could, but it would mean there'd still be a dependency on an indexer. The current block height is somewhat predictable, but it's not as if you can set your watch to it, especially under load.

## Future possibilities

I imagine that as this feature becomes more widely used, there will be pressure to increase the period these transactions are valid for, at the expense of more storage. The current NEP covers many use cases and uses very few resources, so I'm going to leave the validity period low but allow it to expand at a later date.

## Consequences

### Positive

- Simpler relayer
- Simpler MPC service
- Simpler faucets
- Simpler, more secure clients
- Better application reliability
- No more reliance on an indexer to send transactions
- Read RPC [will work better](https://pagodaplatform.atlassian.net/browse/ND-536)

### Neutral

### Negative

- Additional fields mean larger messages
- Additional complexity leads to more potential attacks
- More things will be stored in the trie (but not for long, and they're not too big)
- Sometimes people shouldn't be sending transactions when an indexer is down

### Backwards Compatibility

This is going to be backwards compatible for clients, but probably not for things like block explorers, indexers or RPC nodes. We'll need a period of this running on testnet for them to update their code.

## Unresolved Issues (Optional)

- I'm not quite sure how much more to charge for this; the data is small and ephemeral, so maybe not much or nothing.

## Changelog

No changes so far.

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

[^1]: Actually we compare the most recently used nonce to the one on the indexer and increment from the greater of the two. This helps in some, but not all, situations when indexers are down or behind.

[^2]: I'm yet to come across a use case where someone wants a number of transactions to be executed in order but doesn't care whether some of the transactions fail spuriously.

[^3]: This is inefficient on a state split, but those are rare, and it saves us storing everyone's account IDs, which can be pretty large.