Implement TTLs on ended SCD subscriptions to prevent unbounded growth and latency increases #1074
Comments
Thanks for the report @jctrouble.
Also, here is a discussion relevant to this issue: interuss/tsc#12. However, it seems the performance impact was not anticipated. Maybe we need to start removing subscriptions and operational intents that have been expired for a certain amount of time. |
@mickmis I'm still wrapping my head around the system, so forgive the dumb question: Is there an easy way to tell if a subscription is implicit / attached to an intent? |
At a basic level, the discovery and synchronization service is closer to a real-time system (reflects current state of the world) than it is to a long-term storage system -- discovering flight information is generally only needed a limited amount of time in advance (30 or 56 days depending on context from F3548-21), and only has value in very narrow cases past the time of flight. So, I would expect us to want effective TTLs on pretty much all data -- there is nothing in the DSS we would want to keep for a year, for instance. So, the question would only be how long the TTLs should be. There's probably some discussion there, but I think it would be hard to argue anything longer than 56*2 days, and we would probably want shorter. Since DSS deployments last longer than 112 days, TTLs (even very loose ones) seem valuable for the long-term health of the system.
I believe the implicitness of the Subscription is an explicit column in the Subscription row. ("Implicit" means the USS created or updated an operational intent and, instead of providing a pre-existing subscription to maintain airspace awareness for that operational intent, asked the DSS to maintain a subscription on the USS's behalf -- IMO this is an unnecessary complication, but it is in the standard.)
ASTM F3548-21 SCD0080 requires USSs to "maintain awareness" when they have an active operational intent. The meaning and interpretation of this gets murky when the wall time is past the end of the operational intent, and this was some of the motivation for the discussion @mickmis linked.
Whether any operational intents are using a subscription can be determined by checking the operational intent rows for their subscription_id. Again, the "correct" treatment of subscriptions attached to operational intents that ended in the past is murky; see linked discussion for more. |
As an additional note, I believe the resolution of #1056 (mostly pending in #1057) should go a long way toward reducing the buildup of subscriptions in the database. |
Ran a bunch of queries to categorize implicit vs explicit subscriptions, whether or not they're associated with an existing operation, and whether or not they expired over a week ago:
The queries were variations of:
SELECT
  COUNT(sub.id)
FROM scd_subscriptions AS sub
LEFT OUTER JOIN scd_operations AS op
  ON sub.id = op.subscription_id
WHERE
  op.id IS NOT NULL
  AND sub.implicit = TRUE
  AND sub.ends_at <= now() - INTERVAL '7 days' |
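For anyone wanting to reproduce this kind of breakdown locally, here is a minimal sketch that computes all the categories in one pass. It is not the query that was actually run: it uses Python's built-in sqlite3 as a stand-in for CockroachDB, so the cutoff is computed in Python rather than with now() - INTERVAL '7 days'. The helper names (make_db, categorize) and the reduced column set are hypothetical, and timestamps are assumed to be stored as UTC ISO-8601 strings so that they compare correctly as text.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Simplified, hypothetical schema: only the columns the categorization needs.
SCHEMA = """
CREATE TABLE scd_subscriptions (
    id TEXT PRIMARY KEY,
    implicit INTEGER NOT NULL,  -- 1 = implicit, 0 = explicit
    ends_at TEXT NOT NULL       -- UTC ISO-8601 string (assumption)
);
CREATE TABLE scd_operations (
    id TEXT PRIMARY KEY,
    subscription_id TEXT
);
"""


def make_db(subscriptions, operations):
    """Build an in-memory database.

    subscriptions: iterable of (id, implicit, ends_at datetime)
    operations:    iterable of (id, subscription_id)
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO scd_subscriptions VALUES (?, ?, ?)",
        [(sid, int(implicit), ends_at.isoformat())
         for sid, implicit, ends_at in subscriptions],
    )
    conn.executemany("INSERT INTO scd_operations VALUES (?, ?)", list(operations))
    conn.commit()
    return conn


def categorize(conn, now, cutoff=timedelta(days=7)):
    """Return {(implicit, attached, expired): count} over all subscriptions.

    attached: at least one operation references the subscription.
    expired:  ends_at is at or before now - cutoff.
    """
    threshold = (now - cutoff).isoformat()
    rows = conn.execute(
        """
        SELECT implicit, attached, expired, COUNT(*)
        FROM (
            SELECT
                sub.implicit AS implicit,
                EXISTS (
                    SELECT 1 FROM scd_operations AS op
                    WHERE op.subscription_id = sub.id
                ) AS attached,
                sub.ends_at <= ? AS expired
            FROM scd_subscriptions AS sub
        )
        GROUP BY implicit, attached, expired
        """,
        (threshold,),
    ).fetchall()
    return {(bool(i), bool(a), bool(e)): n for i, a, e, n in rows}
```

EXISTS is used here instead of the LEFT OUTER JOIN so that a subscription referenced by several operations is counted once; with the join, COUNT(sub.id) yields one count per matching operation row.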
Thanks a lot for the data @jctrouble, that's very useful insight and answers all of my previous questions.
Indeed, resolution of #1056 will address the issue that led to the accrual of the 10995 implicit subscriptions without operational intent reported above. However, it will not remove the existing ones. From your feedback @BenjaminPelletier, it seems clear that we will need some way to clean up the database, but also that defining what to clean up is not trivial.
Given the federated nature of the DSS, and the approximate definition of what a cleanup should look like, I do feel that option 1 enables the most flexibility for operations of a DSS pool. |
I don't think 1-3 are mutually exclusive, so it's probably most valuable to figure out if/when we want to do each thing separately. It seems like we need option 1 to migrate existing deployments in any case (even if we had 3 already), so that does seem like the best first step. I would try to stay away from option 2 if possible as I think that adds more moving parts than are necessary. After completing the manual version of option 1, the next question would probably be whether to make a cron job for option 1 or pursue option 3. Given the likely CRDB transition, I think we should probably defer that decision until after the manual version of option 1 to give more time to see what the CRDB plan will be (and whether row-level TTLs will be available in the new solution). @mickmis, do you think you would have time to work on this in the near term? Note that deployment_manager exists, though may or may not be helpful for this project. |
@BenjaminPelletier yes, I will start working on it. |
@BenjaminPelletier IIUC doing so through deployment_manager implies being able to perform those operations only if the DSS is deployed on a Kubernetes cluster. That would exclude any local deployment using simple docker-compose files. Is that desirable? |
FYI in PR #1116 there is an implementation (still awaiting review) of a CLI tool that enables deletion of old operational intents and subscriptions. |
#1116 is now merged and the CLI tool is available. |
After applying the changes in #1070, this query (dss/pkg/scd/store/cockroach/subscriptions.go, lines 286 to 297 in 915d45b) still needs to read a lot of rows (analysis) to do its job.
While it uses the cell_idx for the spatial query, it must still search through every subscription that ever lived for that S2 cell, just to filter them out by start and end dates. With time (and subscriptions not getting cleaned up), this query will continue to get slower and slower as it needs to filter through more and more (likely irrelevant) rows.
The obvious approach would be to set up an index for this case. Unfortunately, in a cursory investigation I was unable to get an inverted index working that contained {starts_at, ends_at, cells}. I'm not sure CRDB supports the kind of index we'd need to be able to search using all three columns.
The next best thing would be to prune expired subscriptions, such that the table doesn't experience unbounded growth. Cockroach supports row-level TTLs, such that we could purge any subscriptions whose end date is sufficiently in the past. In a local analysis on Wing's internal DSS, I found that over 60% of subscriptions had an end date of more than a week ago. By choosing an interval and pruning, we can ensure that the subscriptions table only grows in proportion to the number of active or recently ended subscriptions, and the queries don't face unbounded data sets.
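To make the pruning rule concrete, here is a minimal sketch of it, under several assumptions. It does not use CockroachDB's row-level TTL feature; it emulates the rule at the application level with Python's built-in sqlite3 and a simplified, hypothetical schema. The function name prune_expired_subscriptions is made up for illustration, timestamps are assumed to be UTC ISO-8601 strings, and skipping subscriptions that are still referenced by an operation is a conservative choice of this sketch, not something the proposal specifies.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Simplified, hypothetical schema: only the columns the pruning rule needs.
SCHEMA = """
CREATE TABLE scd_subscriptions (
    id TEXT PRIMARY KEY,
    ends_at TEXT NOT NULL  -- UTC ISO-8601 string (assumption)
);
CREATE TABLE scd_operations (
    id TEXT PRIMARY KEY,
    subscription_id TEXT
);
"""


def prune_expired_subscriptions(conn, now, retention):
    """Delete subscriptions that ended at or before now - retention.

    Subscriptions still referenced by an operation are kept (assumption).
    Returns the number of rows deleted.
    """
    threshold = (now - retention).isoformat()
    cur = conn.execute(
        """
        DELETE FROM scd_subscriptions
        WHERE ends_at <= ?
          AND NOT EXISTS (
              SELECT 1 FROM scd_operations AS op
              WHERE op.subscription_id = scd_subscriptions.id
          )
        """,
        (threshold,),
    )
    conn.commit()
    return cur.rowcount
```

Run periodically with a fixed retention interval (for example, seven days), a rule like this keeps the table size tied to the number of active and recently ended subscriptions instead of the full history.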