Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[scd] properly cleanup implicit sub on oir update #1057

Closed

Conversation

Shastick
Copy link
Contributor

@Shastick Shastick commented Jul 18, 2024

This is a first stab at #1056, and covers two corner cases that currently result in dangling implicit subscriptions:

  • when an OIR previously linked to an implicit subscription is updated with a user-specified subscription
  • when an OIR previously linked to an implicit subscription is updated in such a way that a new implicit subscription is created.

The OIR upsert logic now check the following:

  • if a subscription was previously attached to the updated OIR
  • if the previous subscription was an implicit one
  • if the previous implicit subscription has no more dependent OIRs

We now delete subscriptions that meet all the conditions above.

Notes

This PR chooses the approach to maintain the re-creation of an implicit subscription when, in a request to update an OIR, parameters for an implicit subscription are provided.

Resolves #1056

Test plan: with this PR, the DSS is passing the test scenario check through which I discovered the problem, but which has not yet been added to the qualifier. (It will be available once interuss/monitoring#720 is merged)

Copy link
Contributor

@barroco barroco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Shastick for taking care of it. This looks good to me since it only cleans up orphan subscription without changing the current response to USSP.
Let's open another issue to discuss if creating a new implicit subscription is the expected behaviour.
You will find minor comments inline.

pkg/scd/operational_intents_handler.go Outdated Show resolved Hide resolved
pkg/scd/operational_intents_handler.go Outdated Show resolved Hide resolved
pkg/scd/operational_intents_handler.go Outdated Show resolved Hide resolved
return stacktrace.Propagate(err, "Could not determine if previous Subscription can be removed")
}

// We _could_ reuse the previous subscription ID if it was an implicit subscription,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reference to the ticket to keep a link with the design discussion/decision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not entirely reflect my intention, I will reword it.

in short: when an update is done and the parameters for the creation of an implicit subscription are passed, my understanding is that we actually want to create a new implicit subscription.

Clients are allowed to change the subscription attached to an OIR, and from my understanding, I think it is valid to give the option to replace an existing subscription with an implicit one.

Clients that wish to reuse an existing subscription (whether or not it is an implicit one) should specify the existing subscription ID in the call parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise: there was no design discussion about this – this is my interpretation of how the API should work.

If you think this needs discussing/clarifying we can either open an issue, discuss here or have a chat at the next weekly meeting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when an update is done and the parameters for the creation of an implicit subscription are passed, my understanding is that we actually want to create a new implicit subscription.

Yes, when new_subscription is specified, a new implicit subscription should be created.

I think it is valid to give the option to replace an existing subscription with an implicit one.

I don't know of an actual use case for this, but I agree this is what should happen if someone specifies new_subscription on an op intent with an existing subscription. And, if that existing subscription was implicit and no longer has any dependent operational intents, then I would expect the existing subscription to be automatically deleted by the DSS since it was implicit (managed by DSS).

Clients that wish to reuse an existing subscription (whether or not it is an implicit one) should specify the existing subscription ID in the call parameters.

Agreed.

If you think this needs discussing/clarifying we can either open an issue, discuss here or have a chat at the next weekly meeting.

I don't think there is any ambiguity here so I don't think there's need for further discussion. However, if someone thinks otherwise then we can certainly have that discussion.

if len(dependentOps) == 0 {
return false, stacktrace.NewError("An implicit Subscription had no dependent OperationalIntents")
} else if len(dependentOps) == 1 {
return true, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're assuming the one dependent operational intent reference is the one the caller is interested in, but that assumption is neither documented nor enforced. If we're just checking a particular subscription, then the answer shouldn't be "yes" or "no", it should be "only when removing this particular operational intent reference" (returning an optionally-nil pointer to an operational intent ID). Or, if we're checking whether a particular subscription can be removed after removing the dependency of an operational intent reference on that subscription, then the ID of the operational intent should be passed into this function and validated that the single dependent operational intent is the operational intent passed.

The operational intent ID was not needed previously because the context in which the code used to be located had just identified this subscription as the one the operational intent was using.

@@ -16,6 +16,42 @@ import (
"github.com/interuss/stacktrace"
)

// subscriptionCanBeRemoved will check if:
// - a previous subscription was attached,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attached to what? Neither this documentation nor the function signature mentions an operational intent. I think we should add an op intent ID to the function signature (see other comment) and then this will make more sense (a previous subscription was attached to the specified operational intent).

@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 2 times, most recently from ddf25c7 to 67d531c Compare August 13, 2024 07:22
@@ -490,6 +511,14 @@ func (a *Server) upsertOperationalIntentReference(ctx context.Context, authorize
}
}

removePreviousImplicitSubscription, err = subscriptionIsOnlyAttachedToOIR(ctx, r, &id, previousSubscriptionID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why not deleting the subscription in this block rather than setting a flag? That would simplify the control flow.

Copy link
Contributor Author

@Shastick Shastick Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point in the transaction the subscription cannot be deleted, because it is referenced by the operational intent.

The previous implicit subscription can only be removed once the OIR has been updated to point to another subscription.

If you delete it before the r.UpsertOperationalIntent(ctx, op) statement at line 719, you get an error.

Copy link
Contributor Author

@Shastick Shastick Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: will need to confirm the above again, as I might have been confused by another error.

My current understanding is that foreign keys are enforced immediately in cockroach DB, and not at the time when the transaction is committed, which is why the subscription cleanup must happen after the OIR was updated. The doc seems to concur, though I have not checked if we set a non-default foreign key action on the relevant field.

The relevant extract:

If there are any existing references to the key being deleted, the transaction will fail at the end of the statement

if err != nil {
return false, stacktrace.Propagate(err, "Unable to get OperationalIntent's Subscription from repo")
}
if sub == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We retrieve the subscription ID from the operational intent out of the store. IIUC it is then not realistic for the subscription to not exist because the store will guarantee that to not happen.
IMO this also highlights that invoking GetSubscription here is not appropriatea: this should be done when previousSubscriptionID is retrieved, and then pass the subscription around and not just its ID.

Copy link
Contributor Author

@Shastick Shastick Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We retrieve the subscription ID from the operational intent out of the store. IIUC it is then not realistic for the subscription to not exist because the store will guarantee that to not happen.

Operational intents are not guaranteed to have a subscription attached: OIR's in the 'ACCEPTED' state may well be created without a subscription. Sorry, had to put myself back into the context of the PR.

At this point the sub should not be nil, but I think we should still check for this nil and report a clean error if it happens versus running into a nil pointer, in case we have a bug?

This check is taken verbatim from the old deletion logic that this method replaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this also highlights that invoking GetSubscription here is not appropriatea: this should be done when previousSubscriptionID is retrieved, and then pass the subscription around and not just its ID.

I disagree: there are circumstances where we won't need to fetch the subscription. For example, when updating an OIR, if the request specifies the same subscription ID as before, there is no need to get it from the store.

return false, stacktrace.Propagate(err, "Could not find dependent OperationalIntents")
}
if len(dependentOps) == 0 {
return false, stacktrace.NewError("An implicit Subscription had no dependent OperationalIntents")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if that's OK: when this new logic will be introduced, there will exist in the store some dangling implicit subscriptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct: although, I essentially kept the logic as it existed before being factored into this method.

I'd rather fix this in a separate PR, as there might be some discussion about whether or not we should opportunistically fix such inconsistencies when we encounter them vs. noisily failing so we become aware of them.

return false, stacktrace.NewError("OperationalIntent's Subscription missing from repo")
}

if sub.ImplicitSubscription {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use an early return if the subscription is explicit.

@mickmis
Copy link
Contributor

mickmis commented Aug 23, 2024

I tried to build an overview of what should happen wrt subscription in the upsert.

Is the subscription ID provided in the request parameters the same as the subscription ID attached to the existing operational intent? (in both cases the subscription ID may be none)
If yes: no need to do anything.
If no we have the following cases:

implicit old sub explicit old sub no old sub
implicit sub in params delete old if no dependent OI - -
explicit sub in params delete old if no dependent OI - -
no sub in params delete old if no dependent OI - -
new implicit sub in params create new implicit sub
delete old if no dependent OI
create new implicit sub create new implicit sub

At the moment we have a bit of a complex control flow for handling subscriptions updates, based first on whether the subscription ID is empty, but if the above is correct, the control flow should be as simple as (and also extensible to the deletion of operational intents):

if params.sub_id != old_oi.sub_id:
    if old_sub.implicit:
        delete_if_no_dependent_ois()
    if params.new_sub:
        create_new_implicit_sub()

If I'm correct a small refactoring could be appropriate to factor away this logic IMO. WDYT?

@Shastick Shastick marked this pull request as draft August 23, 2024 13:42
@Shastick
Copy link
Contributor Author

Taking to draft as I need to handle another corner case.

@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 2 times, most recently from ee0fdf1 to a917152 Compare August 23, 2024 14:30
@Shastick
Copy link
Contributor Author

If I'm correct a small refactoring could be appropriate to factor away this logic IMO. WDYT?

Yes, it's actually possible to move the check out of the if block. The result does not yet feel entirely clean, but I'm not sure we can do better without doing a deeper refactoring

@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 3 times, most recently from dbf7c86 to f4948b8 Compare August 23, 2024 14:47
@Shastick Shastick marked this pull request as ready for review August 23, 2024 14:48
@Shastick Shastick marked this pull request as draft August 23, 2024 14:53
@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 2 times, most recently from cbc2b0f to c917f3a Compare August 23, 2024 16:09
@Shastick Shastick marked this pull request as ready for review August 23, 2024 16:09
@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 2 times, most recently from 229c4a2 to 7bf4b55 Compare August 27, 2024 12:27
@Shastick
Copy link
Contributor Author

@mickmis where do we stand on the need to refactor the UpdateOperationalIntentReference method? If needed, the refactor should certainly happen in another PR: should we do it before or after merging this one? How urgent is this with respect to the current open issues that are caused by the dangling OIRs?

Once I'm done with handling the OVN check PRs and corresponding scenarios I can take a look at the refactor, but I'm wondering if it is aligned with our current priorities?

Copy link
Contributor

@mickmis mickmis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we stand on the need to refactor the UpdateOperationalIntentReference method? If needed, the refactor should certainly happen in another PR: should we do it before or after merging this one? How urgent is this with respect to the current open issues that are caused by the dangling OIRs?

Once I'm done with handling the OVN check PRs and corresponding scenarios I can take a look at the refactor, but I'm wondering if it is aligned with our current priorities?

After looking at it one more time, I do believe the change in this PR does what it is supposed to do. But I think it can and should be simplified more.

If not bound by any other constraint, IMO this function should be refactored before making such a change. Its control flow is already more complex than it should, and this change make that situation even worse. This makes it hard to identify tricky corner-cases, both at implementation and review. Also, the logic in this function is both one of the most complex and most central in the DSS.
Now, I'm not sure where we stand in terms of constraints and priorities. Do you have an opinion @BenjaminPelletier ?

pkg/scd/operational_intents_handler.go Show resolved Hide resolved
// A subscription has been specified:
// If it is different from the previous subscription, we need to fetch it from the store
// in order to ensure it correctly covers the OIR.
if effectiveSubscription == nil || previousSubIsBeingReplaced {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

effectiveSubscription == nil will always be true here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually now that I look at it again, I'm also not completely sure why this change here is needed? Was there previously an issue that is fixed by this change? Or is it just an optimization to avoid fetching twice the same subscription?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

effectiveSubscription == nil will always be true here.

Thanks for catching. A previous revision was pre-setting the value and that ended up being even more confusing than it is now.

Actually now that I look at it again, I'm also not completely sure why this change here is needed? Was there previously an issue that is fixed by this change? Or is it just an optimization to avoid fetching twice the same subscription?

This is to avoid re-fetching the same subscription already fetched at line 496 in the if old.SubscriptionID != nil { block if that subscription is not being replaced. In cases where it is being replaced, we need to fetch the subscription that will eventually be attached to the OIR to verify that it properly covers the OIR.

It indeed ensures that we don't do unnecessary roundtrips.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching. A previous revision was pre-setting the value and that ended up being even more confusing than it is now.

(Well, as a good illustration of how convoluted the control logic got, my 'simplification' was too much and needs a rework)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to avoid re-fetching the same subscription already fetched at line 496 in the if old.SubscriptionID != nil { block if that subscription is not being replaced. In cases where it is being replaced, we need to fetch the subscription that will eventually be attached to the OIR to verify that it properly covers the OIR.

It indeed ensures that we don't do unnecessary roundtrips.

Could you add a comment mentioning that? That clarifies that this is not actually a separate case, but an optimization. Ideally that could be hidden behind a function but not sure if that's doable or makes sense here.

@Shastick
Copy link
Contributor Author

After looking at it one more time, I do believe the change in this PR does what it is supposed to do. But I think it can and should be simplified more.

Do you mean that this PR can and should be simplified, or that the method should be simplified?

If not bound by any other constraint, IMO this function should be refactored before making such a change. Its control flow is already more complex than it should, and this change make that situation even worse. This makes it hard to identify tricky corner-cases, both at implementation and review. Also, the logic in this function is both one of the most complex and most central in the DSS. Now, I'm not sure where we stand in terms of constraints and priorities. Do you have an opinion @BenjaminPelletier ?

I personally see this as two separate issues:

  • one of correctness: is the code doing what it is supposed to do? So far we have relied on tests to convince ourselves of that, and we should continue to do so here. Furthermore we know that the DSS is currently not behaving as it should, and we have some evidence that this PR would address that. If we don't trust our tests enough to tell us if the DSS behaves as expected, we should probably start by adding tests until we reach that confidence.
  • one of maintainability: how long does it take to understand the code and to modify it, and how hard is it to add a feature or fix something without unintended consequences.

As such, I would see no issue in proceeding with adding something as long as it is correct (when we apply our usual standard for correctness) even if it worsens maintainability a little bit, if focusing on maintainability first means delaying the fix by an unknown amount of time.

I see three reasons for not waiting on a refactor to proceed with the fix:

  • I don't see a good path to simplifying things yet (we could have separate paths for OIR creation and update, maybe moving things to submethods will make the logic more tractable, maybe there's a major simplification that is possible and that we don't see, maybe we will disagree on the approach and need to align), so I expect that this might take a while
  • we plan to inline or push down some logic into the queries themselves rather than doing round-trips between the database and the Go logic. This in itself will deeply impact the relevant methods, and may cause us to remove or rework the simplifications we did to improve maintainability: any effort spent building the 'perfect' method now risks being thrown away if we don't combine it with pushing down the logic to the database.
  • longer term plans may involve moving away from an SQL/relational-based store, which will have an even deeper impact

I do however agree that the logic herein is pretty central to what the DSS does, and that it would deserve to be made as maintainable as possible as that will benefit us in the long term.

@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch from 7bf4b55 to 5f0dab6 Compare August 29, 2024 19:15
@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch 3 times, most recently from 6584b7a to 1c4e1ae Compare August 29, 2024 19:32
@Shastick Shastick marked this pull request as draft August 29, 2024 19:34
@Shastick Shastick force-pushed the fix-implicit-sub-oir-update branch from 1c4e1ae to 9ca5e54 Compare August 29, 2024 19:44
@Shastick Shastick marked this pull request as ready for review August 29, 2024 19:46
@BenjaminPelletier
Copy link
Member

I do think fixing the correctness problem from #1056 (which I think this PR does) is important; it has impact in a number of areas (e.g., this table from #1074 which probably means #1074 wouldn't have been filed if this issue had been fixed earlier).

I have not reviewed the full content of this PR in particular because Committer review time is limited and theoretically we should all have fairly similar overall thoughts (when we don't, that's a good candidate for additional documentation in the TSC repo). I (and/or @barroco) can take a more holistic look to weigh in on the inherently-gray tradeoffs between correctness, maintainability, and timeliness, though I think in an ideal world (which indeed we don't live in, unfortunately) a good result could be worked out with the reviewer(s) already most familiar with the content.

I don't know if this general thought is applicable to this specific case, but often complexity can be managed by factoring more sections of a large function into their own functions with their own documentation, signatures, and encapsulated behavior (as has already been done once in this PR). This lets the core of the complex function look more like brief pseudocode, and therefore be more understandable at a glance (like @mickmis's pseudocode).

We don't need this fix tomorrow, so we can still iterate on the most maintainable version, but I agree we don't want that to take forever. If anyone thinks we could do a better job at picking tradeoffs, that would be a great topic for the Tuesday meeting, and probably worth giving a heads up on Slack 1-2 days beforehand that we'll want to discuss that so anyone who wants to can pre-read the background info (e.g., this PR) to be better prepared to discuss real time.

@Shastick
Copy link
Contributor Author

Shastick commented Aug 30, 2024

Given that

We don't need this fix tomorrow, so we can still iterate on the most maintainable version, but I agree we don't want that to take forever.

I'm happy to start with encapsulating things into submethods and see what kind of results I can come up with today.

edit: created #1088 so we can discuss the refactor-specific things there.

@mickmis
Copy link
Contributor

mickmis commented Aug 30, 2024

I personally see this as two separate issues:

  • one of correctness: is the code doing what it is supposed to do? So far we have relied on tests to convince ourselves of that, and we should continue to do so here. Furthermore we know that the DSS is currently not behaving as it should, and we have some evidence that this PR would address that. If we don't trust our tests enough to tell us if the DSS behaves as expected, we should probably start by adding tests until we reach that confidence.
  • one of maintainability: how long does it take to understand the code and to modify it, and how hard is it to add a feature or fix something without unintended consequences.

As such, I would see no issue in proceeding with adding something as long as it is correct (when we apply our usual standard for correctness) even if it worsens maintainability a little bit, if focusing on maintainability first means delaying the fix by an unknown amount of time.

You make great points here, thanks for the response, I'd just add that for correctness we need understanding in addition to testing (which is what I struggled to reach).

All in all I think that what you ended up doing @Shastick (i.e. timeboxing some enhancements) is a good compromise.

(thanks for the feedback @BenjaminPelletier)

@Shastick Shastick marked this pull request as draft August 30, 2024 14:14
@Shastick
Copy link
Contributor Author

Shastick commented Sep 6, 2024

Closing this PR to open a new clean one, since so much of original base branch has changed.

See #1104 for the continuation

@Shastick Shastick closed this Sep 6, 2024
Shastick added a commit to Orbitalize/dss that referenced this pull request Sep 9, 2024
This is a rework of interuss#1057 after most changes for interuss#1088 have landed.

The DSS built from this PR successfully passed the new qualifier's implicit subscription handling scenario, although that scenario still needs to be extended to cover all corner cases.

The PR is open for review, as it addresses interuss#1056
mickmis pushed a commit that referenced this pull request Sep 9, 2024
This is a rework of #1057 after most changes for #1088 have landed.

The DSS built from this PR successfully passed the new qualifier's implicit subscription handling scenario, although that scenario still needs to be extended to cover all corner cases.

The PR is open for review, as it addresses #1056
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[scd] dangling implicit subscription when OIR with implicit subscription is mutated
4 participants