don't reschedule auction's price notifier if we already have one #10615

Merged
merged 9 commits into from
Dec 4, 2024

Chris-Hibbert
Contributor

Description

It was observed, after applying upgrade 18 to EmeryNet, that a whole slew of promises resolved when a price was provided. The current belief is that observeQuoteNotifier is called every auction round until a price is available, and that each call creates a new observer that waits until a price is published. After that, all of those observers keep waiting for each successive update.

This change adds an interlock, so if there's already a notifier waiting, we don't add a new one.
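A minimal sketch of the suspected pile-up and of the interlock, using a hypothetical in-memory stand-in for the notifier. `makeFakeNotifier`, `runRounds`, and the `observing` flag are illustrative names only and are not taken from the auction code.

```javascript
// Hypothetical stand-in for a price notifier: observers subscribe once and
// are then called for every published price.
const makeFakeNotifier = () => {
  const observers = [];
  return {
    observe: observer => observers.push(observer),
    publish: price => observers.forEach(observer => observer(price)),
    observerCount: () => observers.length,
  };
};

// Run `rounds` auction rounds during which no price has been published yet.
// `useInterlock` selects between the old behavior and the guarded one.
const runRounds = (rounds, useInterlock) => {
  const notifier = makeFakeNotifier();
  let updatingQuote = null;
  let observing = false;
  for (let round = 0; round < rounds; round += 1) {
    if (updatingQuote === null) {
      // With the interlock, an existing observer suppresses a new one.
      if (useInterlock && observing) continue;
      observing = true;
      notifier.observe(price => {
        updatingQuote = price;
      });
    }
  }
  return notifier.observerCount();
};

const withoutInterlock = runRounds(5, false);
const withInterlock = runRounds(5, true);
console.log({ withoutInterlock, withInterlock });
```

In this model, five rounds without a price leave five observers registered when the interlock is off and one when it is on. Every registered observer is called again on each later publish, which is the fan-out described above.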

Security Considerations

No security implication.

Scaling Considerations

Processing about 19000 actions waiting on a notifier in EmeryNet took several hours. If we're correct that the notifiers will continue to cycle, we expect to see a similar wait for each price update on that currency. That's unsustainable.

The only current theory about dropping all those actions waiting for notifiers is to kill the vat. We can't kill the priceAuthority vats that hold the notifiers, but we might be able to cleanly kill the abandoned auctioneers.

Documentation Considerations

Not needed.

Testing Considerations

Tough to test in unit tests. It's conceivable that we could recreate the situation in a3p-integration, though it would be hard to observe the results.

Upgrade Considerations

We probably shouldn't ship upgrade 18 without something to address this problem.

@Chris-Hibbert Chris-Hibbert added the bug, resource-exhaustion, contract-upgrade, and auction labels Dec 4, 2024
@Chris-Hibbert Chris-Hibbert requested a review from warner December 4, 2024 00:48
@Chris-Hibbert Chris-Hibbert self-assigned this Dec 4, 2024
@Chris-Hibbert Chris-Hibbert requested a review from a team as a code owner December 4, 2024 00:48
@@ -464,6 +465,8 @@ export const prepareAuctionBook = (baggage, zcf, makeRecorderKit) => {
AmountMath.make(collateralBrand, QUOTE_SCALE),
bidBrand,
);
baggage.set(QUOTE_NOTIFIER, quoteNotifier);
Member

I think that's storing a promise in (durable) baggage, which isn't allowed (also that const quoteNotifier = should probably be quoteNotifierP = to make that more obvious). Would it hurt anything to await and store the actual Presence instead?

Contributor Author

No.

I also notice that we don't seem to make a new baggage for each book. I need to figure out if that's true, and either make new baggages, or make unique names for each notifier

Member

Oh, no, that wouldn't be a trivial change. It would make observeQuoteNotifier async, but also it wouldn't be setting the flag at the right time: if two callers invoke observeQuoteNotifier in the same turn, we'd still wind up with multiple outstanding calls to makeQuoteNotifier, and we'd get multiple observers again.

I think observeQuoteNotifier() needs to be named/intentioned "maybeObserveQuoteNotifier" or "maybeStartObservingQuoteNotifier". And it needs to be "async idempotent" (not sure if that's a term of art but you get the idea): no matter how many times you call it in a row, you still only wind up with a single outstanding async request, but if that request fails for any reason, it resets back to a state where you can call it again and it won't short-circuit.

The flag needs to be checked on entry to the function, during the synchronous prelude, before sending any messages. Then it needs to send off the makeQuoteNotifier, and if that fails it needs to clear the flag. If it succeeds then it can build the observer, which clears the flag if/when the observer fails or finishes.
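The "async idempotent" shape described here could look roughly like the sketch below. It is not the code from this PR: `ensureObserved`, `observed`, and `startObserver` are hypothetical names, and `startObserver` is assumed to return a promise that settles when the observer finishes or fails.

```javascript
// Ephemeral (heap) record of keys with an observer running or being set up.
const observed = new Set();

// `startObserver` is a hypothetical async function: it resolves when the
// observer finishes and rejects when setup or observation fails.
const ensureObserved = (key, startObserver) => {
  // Synchronous prelude: check and set the flag before sending any message,
  // so two calls in the same turn cannot both proceed.
  if (observed.has(key)) return false;
  observed.add(key);
  const clear = () => observed.delete(key);
  // Clear the flag on completion and on failure so a later call can retry.
  Promise.resolve()
    .then(() => startObserver(key))
    .then(clear, clear);
  return true;
};

let starts = 0;
const failingStart = async () => {
  starts += 1;
  throw Error('price feed unavailable');
};

const firstCall = ensureObserved('ATOM', failingStart);
const secondCall = ensureObserved('ATOM', failingStart); // same turn
// Retry in a later turn, after the failure has cleared the flag.
const retryAfterFailure = new Promise(resolve => {
  setTimeout(() => resolve(ensureObserved('ATOM', failingStart)), 0);
});
retryAfterFailure.then(retried =>
  console.log({ firstCall, secondCall, retried }),
);
```

The first call proceeds, the second call in the same turn short-circuits, and the retry proceeds again because the failed request cleared the flag.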

Contributor Author

updated via pairing.

@Chris-Hibbert Chris-Hibbert force-pushed the cth-auction-priceNotifier branch from b68ac12 to ba64933 Compare December 4, 2024 18:38
Member

@warner warner left a comment

submitting review now so @Chris-Hibbert can consider the map-vs-set thing, will re-review when that's settled

// auctionBook changed to create a sub-baggage for each book (to distinguish
// their quoteNotifier flags) so older auctions will not be upgradable to this
// version. We believe this version is saving all the necessary state to be
// upgraded, but that hasn't been verified, so we don't mark it as `canUpgrade`.
Member

My understanding of those meta flags is that the first sentence ("older versions will not be upgradable to this version") is the reason we don't mark this as canUpgrade. The second sentence ("saving all the necessary state") is what justifies our use of canBeUpgraded. The fact that we haven't verified it is either a good note to add but doesn't change the flags, or a reason to not use canBeUpgraded.

My opinion is that canBeUpgraded is appropriate, but I don't know how much of a stickler we are for "there must be tests to prove this claim before we're allowed to make it".

So maybe add ", so we do not mark this as canUpgrade" to the first sentence, and maybe make the second sentence be like "We believe.., so we mark this as canBeUpgraded (but this hasn't been verified with an a3p test)".

Contributor Author

"The fact that we haven't verified it is either a good note to add but doesn't change the flags, or a reason to not use canBeUpgraded."

If we had verified it, it would be `canUpgrade`. The only time it should be `canBeUpgraded` is when we hope it's upgraded, but haven't proven it.

Member

I thought canUpgrade is backwards-looking, while canBeUpgraded is forward-looking, and that there was no way to express "probably upgradable but not tested".

If I understand it correctly, canUpgrade means that this code is capable of upgrading a previous version, which we know is false here because of the sub-baggage we added (actually, we just changed that: we're no longer modifying the way baggage is used, so maybe this could upgrade the previous version, but I imagine there are other reasons that might not work).

Member

I never did know the difference. Ooh! we have docs now:

   * - `canUpgrade` means this code can perform an upgrade
   * - `canBeUpgraded` means that the contract stores kinds durably such that the next version can upgrade

So I concur that canBeUpgraded is correct and that any test for canUpgrade from the previous version is certain to fail.

@@ -118,6 +122,12 @@ export const makeOfferSpecShape = (bidBrand, collateralBrand) => {
export const prepareAuctionBook = (baggage, zcf, makeRecorderKit) => {
const makeScaledBidBook = prepareScaledBidBook(baggage);
const makePriceBook = preparePriceBook(baggage);
// a map from collateralBrand to true when the quoteNotifier has an observer
// the brand is absent when there's no observer.
const quoteNotifierFlags = provideDurableMapStore(
Member

oh, hey, should this just be a DurableMapSet?

Member

@warner warner left a comment

Code looks good to me. Nits on the comments, might pull in @dckc on the meta flag question (and to do another pass in general).

@@ -645,8 +668,8 @@ export const prepareAuctionBook = (baggage, zcf, makeRecorderKit) => {

trace(`capturing oracle price `, state.updatingOracleQuote);
if (!state.updatingOracleQuote) {
// if the price has feed has died, try restarting it.
facets.helper.observeQuoteNotifier();
// if the price feed has died, restart it.
Member

maybe add "(or hasn't been started for this incarnation yet)"

@Chris-Hibbert
Contributor Author

The declaration of ContractMeta says

  /**
   * - `none` means that the contract is not upgradable.
   * - `canUpgrade` means this code can perform an upgrade
   * - `canBeUpgraded` means that the contract stores kinds durably such that the next version can upgrade
   */

I'd be happy to take it out. It's not required.
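For reference, a sketch of how the flag under discussion might be declared. It assumes the field on ContractMeta is named `upgradability` and that a contract module exposes the object as an exported `meta`; neither detail is shown in this thread, so treat both as assumptions. The `export` keyword is omitted so the snippet runs as a plain script.

```javascript
// Assumption: ContractMeta carries the three-valued flag in a field named
// `upgradability`. In a contract module this object would be exported.
/** @type {{ upgradability: 'none' | 'canUpgrade' | 'canBeUpgraded' }} */
const meta = {
  // Forward-looking claim: state is stored durably so that a later version
  // can upgrade this one. It says nothing about upgrading an older version.
  upgradability: 'canBeUpgraded',
};

console.log(meta.upgradability);
```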

@Chris-Hibbert Chris-Hibbert added the automerge:squash Automatically squash merge label Dec 4, 2024
Member

@dckc dckc left a comment

FWIW...

Comment on lines +459 to +461
// Ensure that there is an observer monitoring the quoteNotifier. We
// assume that all failure modes for quoteNotifier eventually lead to
// fail or finish.
Member

Should I read this as "We rely on SOMETHING to see that all failure modes for quoteNotifier eventually lead to fail or finish"? What is the SOMETHING?

If we don't rely on something else, we have to ensure it here, no?

Contributor Author

The assumption is baked into the design of the notifiers and observers. There's no way to interrupt it from the consumer end, and no way to verify that it's still alive. Its contract says that if it ever ceases to produce new values, it will either call fail or finish, or return an exception because the vat died.

Comment on lines +121 to +122
// Brands that have or are making active quoteNotifier Observers
const observedBrands = new Set();
Member

I'm struggling to get the relevant invariant in my head.

Something that combines observedBrands with state.updatingOracleQuote

Contributor Author

state.updatingOracleQuote has a value when we've received an update since the last auction. observedBrands contains the brand if the priceNotifier observer has been started and hasn't returned an error or called finish.

In the normal case, observedBrands doesn't change, but state.updatingOracleQuote gets reset every auction.

The interesting cases are startup and if the notifier ever breaks. The problem with startup is that it can take an arbitrary timespan for the oracles to start reporting prices after the auction starts. We want to start the observer, but we don't want to do it multiple times (as we were previously doing) because each call creates a new observer that persists.

Member

@mhofman mhofman left a comment

It's in one of these cases I wish state with shapes could grow new props.

@@ -118,6 +118,8 @@ export const makeOfferSpecShape = (bidBrand, collateralBrand) => {
export const prepareAuctionBook = (baggage, zcf, makeRecorderKit) => {
const makeScaledBidBook = prepareScaledBidBook(baggage);
const makePriceBook = preparePriceBook(baggage);
// Brands that have or are making active quoteNotifier Observers
const observedBrands = new Set();
Member

Can we explain why it's ok to have this be a heap Set? Both from a cardinality and upgradability pov.

Contributor Author

Brian and I talked through this. If we decide to upgrade auctions, the restarted vat won't have an observer running, but a durable tracking variable would not have been reset, so it's important that the tracking go away on restart.

I think the cardinality question is about the fact that we're sharing a Set across auctionBooks. We don't have ephemeral per-object state, but we can have ephemeral shared state. The number of brands handled by the auction is small and seldom grows, so the set won't get too big. We're already keeping all the auctionBooks in a single vat.
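The upgradability argument can be illustrated with a toy model in which a plain Map plays the role of durable baggage and each call to a hypothetical `startIncarnation` plays the role of a vat restart. None of these names come from the auction code.

```javascript
// Toy model: `baggage` survives a restart; anything created inside
// `startIncarnation` lives on the heap and is rebuilt from scratch.
const baggage = new Map();

const startIncarnation = () => {
  const observedBrands = new Set(); // ephemeral, per incarnation
  return {
    observe: brand => {
      observedBrands.add(brand);
      baggage.set(`observed:${brand}`, true); // durable variant of the flag
    },
    heapSaysObserved: brand => observedBrands.has(brand),
    durableSaysObserved: brand => baggage.has(`observed:${brand}`),
  };
};

const first = startIncarnation();
first.observe('ATOM');

// Simulated upgrade: the observer from the first incarnation is gone.
const second = startIncarnation();
const heapAfterRestart = second.heapSaysObserved('ATOM');
const durableAfterRestart = second.durableSaysObserved('ATOM');
console.log({ heapAfterRestart, durableAfterRestart });
```

After the simulated restart the heap Set correctly reports that nothing is observing, while the durable flag still claims an observer exists. That stale claim is what the heap Set avoids.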

Member

I figured as much for the upgradability side, but I would prefer this to be written out as a comment.

Contributor Author

Sorry, we lost the race with the merge. Please let me know if you think it's worth a new PR to add the comment. I'm happy to do it if you ask.

},
}),
e => {
trace('makeQuoteNotifier failed, resetting', e);
state.updatingOracleQuote = null;
Member

Why is it necessary to reset the state here?

Contributor Author

It's safe because we'll check for a notifier again at the next auction. It's a good idea because we can't count on the notifier continuing to produce events if there's an error at this level.

Member

I just don't quite understand in which case this isn't already null. This is an error making the notifier, so afaiu we would never have updated this state in the first place?

Member

We didn't explore it deeply, but I think if a previous version was happily updating the state, then it upgrades, then anything which samples the price at the beginning of the new incarnation (before any new updates happen) will get the old price. If an auction then starts and initiates the observer process, but fails (maybe because the price feed vat is getting upgraded too?), then we'll hit this case, clear the ephemeral flag (so the next auction can initiate a new one), and also clear the durable state.updatingOracleQuote, which at least will tell clients to not rely on the old quote.

I suspect that's a bit weird. Not sure if it's a problem though.

Contributor Author

On line 469, we add the brand to the set, and on line 472, we attempt to create the notifier. On line 477, we await the notifier. If the creation promise fails, we end up here, and want to ensure the brand is removed from the set, or we'll never try again to create a notifier.

Member

"ensure the brand is removed from the set, or we'll never try again to create a notifier."

I have no problem with the ephemeral set being reset on error. My concern is with how the durable updatingOracleQuote state is managed. Resetting it at this state doesn't seem correct to me, and might be an indication something else is amiss.

Brian mentions a case where this durable state may not be null: after an upgrade. I suppose the product question is whether or not we want to keep the old "updating" quote while we attempt to restart the observer.

But beyond that question, the problem is that the ensureQuoteNotifierObserved call is gated on !updatingOracleQuote in the first place, so really it seems like restarting the vat wouldn't trigger the observer creation anyway because that state wouldn't be reset by the upgrade, and the contract would think it still has a notifier.

IMO, we need to switch the call to ensureQuoteNotifierObserved to be unconditional, return true if the observer is in the Set, and if not immediately reset the updatingOracleQuote (if we want to avoid pretending we have an updating quote), start the notifier, and return false so that captureOraclePriceForRound can continue bailing out early. In that case updatingOracleQuote wouldn't need to be reset if we fail to make the observer since we'd be guaranteed for it to already be null (again if we don't want to pretend we have an updating quote while re-establishing the observer).

Regarding the early bail out of captureOraclePriceForRound, I'm a little surprised that we would skip capturing a price for round when the observer isn't yet setup instead of delaying the capture, but that's a different topic.

Member

Oh, yeah, if the auctionBook has a price (in durable state.updatingOracleQuote), then gets restarted/upgraded, the new incarnation won't start a new observer at startup (because nothing special happens at restart, it's not like we walk through all the pre-existing brands and do something for each one), and when the auction timer fires and it calls captureOraclePriceForRound(), that will see the old price and assume it's ok, and won't start a replacement observer then either.

I agree with @mhofman that the captureOraclePriceForRound() needs to check the ephemeral set instead of the durable state, or simply always call ensureQuoteNotifierObserved() and rely on its internal check to prevent duplicates.
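A sketch of the restructuring proposed in this thread, under stated assumptions: the state object is a plain record, `startObserver` is a hypothetical callback standing in for the real notifier setup, and `capturedPriceForRound` is used here only as an illustrative field name. This is not the merged code.

```javascript
// Ephemeral set of brands that have (or are creating) an observer.
const observedBrands = new Set();

// Always called. Returns true if an observer already exists. Otherwise it
// drops the possibly stale quote, starts an observer, and returns false.
const ensureQuoteNotifierObserved = (state, startObserver) => {
  if (observedBrands.has(state.collateralBrand)) return true;
  state.updatingOracleQuote = null; // nobody is updating this quote
  observedBrands.add(state.collateralBrand);
  startObserver(state);
  return false;
};

const captureOraclePriceForRound = (state, startObserver) => {
  const observed = ensureQuoteNotifierObserved(state, startObserver);
  // Bail out early when the observer was just started or has no price yet.
  if (!observed || !state.updatingOracleQuote) return undefined;
  state.capturedPriceForRound = state.updatingOracleQuote;
  return state.capturedPriceForRound;
};

// A new incarnation that inherited a quote from before the upgrade.
const state = { collateralBrand: 'ATOM', updatingOracleQuote: 'stale quote' };
let starts = 0;
const startObserver = () => {
  starts += 1;
};

const firstRound = captureOraclePriceForRound(state, startObserver);
const quoteAfterFirstRound = state.updatingOracleQuote;
state.updatingOracleQuote = 'fresh quote'; // simulated observer update
const secondRound = captureOraclePriceForRound(state, startObserver);
console.log({ firstRound, quoteAfterFirstRound, secondRound, starts });
```

In this model the first round after the restart discards the stale quote and starts exactly one observer; the second round captures the fresh quote without starting another.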

const { state, facets } = this;
const { collateralBrand, bidBrand, priceAuthority } = state;

if (observedBrands.has(collateralBrand)) {
Member

Why are we using the collateralBrand as key instead of the identity of this auctionBook? Are we guaranteed there is only ever a 1:1 relationship between auctionBook and collateralBrand?

Contributor Author

The rest of auctionBook and auctioneer use the collateralBrand, so we're already reliant on it being 1:1.

Member

@mhofman mhofman Dec 4, 2024

So there are never multiple bid brands for a given collateral? I'm just wary of assumptions like this that aren't super clear or documented in code comments.

Contributor Author

All of the bids are in IST, so there's no chance of that.

With Vaults, we explicitly intended that we might want to have multiple vaultManagers for a given currency that could have different minimums, or other requirements, and charge different rates.

With auctions, it's a mistake to split the bidding equity across multiple auctions, so the design was always that each auctionBook would handle all the bids and asks for a particular asset.

Member

Yeah I think having a single bidding denom is the assumption that wasn't clear to me.

@mergify mergify bot merged commit e596a01 into master Dec 4, 2024
81 checks passed
@mergify mergify bot deleted the cth-auction-priceNotifier branch December 4, 2024 23:44
mujahidkay pushed a commit that referenced this pull request Dec 9, 2024
mujahidkay added a commit that referenced this pull request Dec 9, 2024
### Description

Cherry-picks the following commits from master:
- #10551 (9e19321)
- #10635 (ad4e83e)
- #10615 (e596a01)
- #10634 (a1856f3)

Since we plan to verify this rc on devnet rather than emerynet, there is
no apparent need for a new upgrade name. Not aware of any deployments on
devnet before so can reuse the previous upgrade name. However, skipping
emerynet is dependent on comms with the validators so I have added a new
upgrade name `agoric-ugprade-18-emerynet-rc3` just in case. Only added
to bypass the need for a new rc if for some reason emerynet validation
is needed anyways - best case scenario is that it remains unused.

commits added using git cherry-pick
mergify bot added a commit that referenced this pull request Dec 11, 2024
refs: #10660

## Description

In testing upgrade 18 candidates, we realized that vaultManager was subject to the same error as auctions fixed in #10615.

### Security Considerations

No Security implications.

### Scaling Considerations

Not fixing this would add a new notifier to the vaultManager's quote watcher for every hour that elapsed between upgrade and the oracles starting to provide price updates.

### Documentation Considerations

Unnecessary.

### Testing Considerations

This was detected by trawling slogs. It's not clear that we can do better than that to verify the fix.

### Upgrade Considerations

It would probably be a mistake to upgrade vaults without this fix.

4 participants