Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auctioneer will create duplicate QuoteNotifiers until the first price is established #10660

Open
warner opened this issue Dec 9, 2024 · 3 comments
Labels
auction bug Something isn't working Inter-protocol Overarching Inter Protocol

Comments

@warner
Copy link
Member

warner commented Dec 9, 2024

While doing upgrade18 testing on the Emerynet testnet, we discovered a bug in the way the auctioneer establishes the "Quote Notifier", which resulted in a ton of duplicate notifiers being created. This caused a massive amount of work (nine hours of full blocks) to be triggered by the PushPrice oracle message.

This was fixed by PR #10615 , to be incorporated into agoric-upgrade-18-rc3.

Background

The AuctionBook is part of the Inter Protocol code which periodically holds auctions to sell off assets (like ATOM or stATOM). The primary current use is for the vaultManager to be able to liquidate assets when their value has dropped below a threshold, so that all vaults maintain a positive value. VaultManagers capture the price they will use as the threshold for liquidation shortly before each auction round starts. The auction sells each asset at a price captured at the beginning of the auction.

The price is provided by a set of "oracles" which periodically submit PushPrice transactions. These prices are submitted to a per-denomination "price feed vat", which runs an internal Notifier whose updates contain prices. That vat can also create a QuoteNotifier, whose updates contain "QuotePayments", which are ERTP Payment objects whose Amount describes the price and a timestamp reflecting the effective time of issuance of the quote. A second per-denom vat named the "scaled price authority" creates "scaled" QuotePayments, which then have their own QuoteNotifiers. Finally, there is a single vat named vat-priceAuthority which performs the denom-to-scaledPriceAuthority dispatch (and can be updated when we replace the sPA/price-feed vats), through which all requests for a QuoteNotifier are routed.

The auction book is woken up periodically (once per hour on mainnet, once every 10 minutes on emerynet) to see if any collateral needs to sold off. For various performance reasons, the auction book wants to maintain a copy of a recent QuotePayment rather than asking the price feeds every time. To support that, the auction book uses observeNotifier to subscribe to the scaled-price-authority's QuoteNotifier, and updates a local variable (actually a portion of the durable state of the AuctionBook exo) every time it receives an update. This way, the auction book always (or usually always) has a copy of a relatively-recent price, which can be accessed synchronously.

When the auction book is first created (e.g. when the denom is added), the exo finish() function calls a local helper method named observeQuoteNotifier() to initialize the process. This asks priceAuthority to makeQuoteNotifier(), which gets routed to the scaled-price-authority vat for that denom, which then sends its own makeQuoteNotifier() to the corresponding price-feed vat, which makes a Notifier and sends it back. The scaled-price-authority vat builds a new Notifier around that one (applying the scaling transform on each update) and returns it to the auction book. The auction code then uses observeNotifier() to attach a callback which updates state.updatingOracleQuote each time a new quote is delivered.

At the end of this process, we have one new Notifier in vat-price-feed, one new Notifier in vat-scaled-priceAuthority, and one new Observer in vat-auctioneer. This notifier/notifier/observer chain will then be triggered each time the price is updated, forever (or at least until the vats are killed).

In addition, each time the auction timer is triggered, captureOraclePriceForRound() is invoked, which can also use observeQuoteNotifier() to get one set up. This handles the case where the price-feed vat is terminated, and we need to establish a Notifier in its replacement. Also, observers are ephemeral, and the auctioneer vat might get upgraded, so this is the pathway by which the new auctioneer incarnation will establish a new set of notifiers and observers (although note that we do not yet claim the auctioneer vat is upgradable, and there is more work to be done before we can make that claim).

The Bug

The core problem was the flag/lock used by the AuctionBook to ensure that we don't have duplicate observers/notifiers was based on the wrong criteria. captureOraclePriceForRound() used the truthiness of state.updatingOracleQuote to decide whether to call observeQuoteNotifier or not.

https://github.com/Agoric/agoric-sdk/blob/agoric-upgrade-18-rc2/packages/inter-protocol/src/auction/auctionBook.js#L643-L651

if (!state.updatingOracleQuote) {
    // if the price has feed has died, try restarting it.
    facets.helper.observeQuoteNotifier();
    return;
  }

But state.updatingOracleQuote does not get set at all until the first price has been delivered:

      updateState: quote => {
              trace(
                `BOOK notifier ${priceFrom(quote).numerator.value}/${
                  priceFrom(quote).denominator.value
                }`,
              );
              state.updatingOracleQuote = priceFrom(quote);
            },

This won't happen until enough oracles have submitted a price for a round to meet the quorum requirements (two oracles, at least for Emerynet). And that can't happen until the oracles have accepted their invitations and submitted the PushPrice txns.

The Stall

On Emerynet (unlike mainnet), we don't have automatic oracles, and only do a manual PushPrice when we're doing qualification testing. So most of the denoms don't have established prices, especially after the initial upgrade18 was deployed, which replaced the auctioneers (creating new ones, without prices in their state.updatingOracleQuote).

As a result, every 10 minutes, captureOraclePriceForRound() was called, found no state.updatingOracleQuote, called observeQuoteNotifier(), which caused both the scaled-price-authority and the price-feed to create new QuoteNotifiers and wire them up to the internal price notifier that had never fired yet. We deployed ugprade18 to emerynet early in November, so this growth continued for several weeks without being noticed or causing visible problems, in both the current vat-auctioneer and the previous vat-auctioneer (we actually have five copies of vat-auctioneer on Emerynet, because our chain upgrades merely add new ones, and do not kill off the old ones, and they are all still subscribed to the wakeup timer).

On saturday night (01-dec-2024, about 18:30 PST), we submitted a PushPrice (I think for ATOM, maybe stATOM). By this time, we had tens of thousands of these notifiers in place, all waiting for the first price to arrive. This first PushPrice didn't meet the quorum threshold, so didn't cause much activity. A few minutes later, when the second price was submitted at 18:33, that triggered the price-feed vat to fire the internal Notifier, which then triggered about 18k callbacks to all of the QuoteNotifiers in that vat. Each one needed to get an updated timestamp to create the QuotePayment amount, so that one delivery sent 18k getCurrentTimestamp messages to vat-timer before finishing. This one delivery took about 30 minutes to execute, resulting in one extremely long block on emerynet, with a gigantic transcript full of syscall.send messages.

After that, those 18k messages resulted in 18k responses, followed by 18k followups, etc. Each of the 18k observer/notifier sets required about a dozen messages to get from the initial PushPrice to the final assignment of state.updatingOracleQuote, so there were roughly 200k deliveries performed before the chain finally became idle again. Every individual delivery was small and completed in milliseconds, so every block was well-constrained by our 65M computron limit. But the sheer number of them meant that every block was full for the next nine hours, during which time no other actions could be performed. No new txns were taken off the actionQueue, no timer updates were performed, we couldn't even pull new PushPrices actions from the high-priority action queue, until we finished the "catch up" run that we do at the start of each block. By the time it finished, state.updatingOracleQuote had been updated 18k times, each with a different QuotePayment object (all with the same price and timestamp).

So from the point of view of an external observer, the chain was stuck for 30 minutes (no new blocks), followed by nine hours of txns being added to blocks but not actually getting executed, after which a whole bunch of stale work was executed. And then finally the chain could behave normally again. (However any new PushPrices will trigger another nine hours of work).

The Fix

The fix, landed in #10615, was to change the lock/flag to be more accurate. We introduced a new flag, which is tested (and then set) upon entry to observeQuoteNotifier, and cleared if/when the makeQuoteNotifier fails, or if/when the resulting notifier exists (finish, meaning the observer has been told to shut down and emits its final update, or fail, which means the observer has failed, e.g. when the price-feed vat is terminated). This should ensure that we don't initiate the creation of a new notifier while we might already be trying to create one. By clearing the flag on failure, it should also ensure that we don't inhibit notifier/observer creation when there is no chance that a creation attempt is already in place.

The state variable will still be null until the first update, and non-null after the update, but we no longer use it to make decisions about creating new notifiers.

This flag is managed as an ephemeral Set, indexed by denom, closed over by all AuctionBooks (one per denom). This turned out to be the best way to ensure that upgrading vat-auctioneer would clear the flag (the Observers are ephemeral, so we must make sure that new auctioneer incarnations make new ones). A durable flag wouldn't be cleared, and putting the flag in per-instance state would require us to iterate over all existing auctions at incarnation startup to update their flags.

We rely upon the call to captureOraclePriceForRound() to trigger notifier/observer creation in new vat-auctioneer incarnations. We still have the call in finish(), so deploying a new denomination will start the process early (before the first auction timer fires), but that doesn't help with vat-auctioneer upgrades.

@warner warner added bug Something isn't working Inter-protocol Overarching Inter Protocol auction labels Dec 9, 2024
@warner
Copy link
Member Author

warner commented Dec 9, 2024

This was fixed by PR #10615 , so it can be closed immediately.

@warner warner closed this as completed Dec 9, 2024
@toliaqat toliaqat reopened this Dec 10, 2024
@toliaqat
Copy link
Contributor

Reopenned because @warner noticed that we have a reoccurrence of the same issue with vaultManager.

so it looks like vaultManager.js has the same observeNotifier pattern (with the same buggy/incomplete interlock flag) as the auctioneer did, and when up18rc3 hits mainnet, it will be creating a string of duplicate QuoteNotifiers until the first price quotes appear, just like the auctioneer did in up18rc2.

mergify bot added a commit that referenced this issue Dec 11, 2024
refs: #10660

## Description

in testing upgrade 18 candidates, we realized that vaultManager was subject to the same error as auctions fixed in #10615.

### Security Considerations

No Security implications.

### Scaling Considerations

Not fixing this would add a new notifier to the vaultManager's quote watcher for every hour that elapsed between upgrade and the oracles starting to provide price updates.

### Documentation Considerations

Unnecessary.

### Testing Considerations

This was detected by trawling slogs. It's not clear that we can do better than that to verify the fix.

### Upgrade Considerations

It would probably be a mistake to upgrade vaults without this fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auction bug Something isn't working Inter-protocol Overarching Inter Protocol
Projects
None yet
Development

No branches or pull requests

2 participants