
Implement cache update deduplication per fetch cycle #5509

Merged

36 commits merged into spiffe:main on Oct 17, 2024

Conversation


@edwbuck edwbuck commented Sep 20, 2024

fixes #5349

This PR fixes #5349 by collecting all registration entry (or attested node) updates into a single set of items to pull (using the Go "map keys" pattern) and, only after that set is built, fetching those items to update the cache.

This item blocks other issues, such as the backoff polling algorithm and the polling query that would do batch selects.

The sections of the algorithm:

  • beforeFirstEvent (any event that arrives with a lower id than the first event)
  • polledEvents (any event that was skipped in a previous call of newEvents)
  • newEvents (any event that is detected past the previous last event)

Description of fix for #5349

The sections of the algorithm above no longer update the cache directly. Instead, as a section detects a Registration Entry / Attested Node that requires a cache update, it records the need by storing the appropriate key in a fetch map, keyed by the id of the item to be fetched.

As a map cannot contain duplicate keys, the items to fetch are deduplicated. The map is then used to fetch each item exactly once.
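
As an illustration, here is a minimal sketch of that pattern in Go (the event type and names are hypothetical stand-ins, not the PR's actual identifiers):

package main

import "fmt"

// event is a hypothetical stand-in for a registration entry or
// attested node event pulled from the events table.
type event struct{ entryID string }

func main() {
	beforeFirstEvents := []event{{"A"}, {"B"}}
	polledEvents := []event{{"B"}, {"C"}} // "B" repeats across sections
	newEvents := []event{{"C"}, {"D"}}    // so does "C"

	// Each section records the ids needing a cache refresh instead of
	// fetching immediately; a map cannot hold duplicate keys, so
	// repeated ids collapse into a single entry.
	seen := make(map[string]struct{})
	for _, e := range beforeFirstEvents {
		seen[e.entryID] = struct{}{}
	}
	for _, e := range polledEvents {
		seen[e.entryID] = struct{}{}
	}
	for _, e := range newEvents {
		seen[e.entryID] = struct{}{}
	}

	// Only after the set is built is each item fetched, exactly once.
	for id := range seen {
		fmt.Println("fetch and cache", id) // stand-in for the datastore call
	}
}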


edwbuck commented Sep 20, 2024

Converting to a draft until I can fix the unit tests, which haven't been updated to match the new structure of the old algorithm.

Maintainers are free to comment on items they'd like changed early, if they have the time to review what exists.

Signed-off-by: Edwin Buck <[email protected]>
Improved testing of authorized_entryfetcher_attested_nodes.go
Minor fixes to authorized_entryfetcher_registration_entries.go
Fixed authorizedentries/cache_test.go
Better documentation for eventTracker.go
Fixed lack of sub-minute polling on eventTracker.go
Fixed unit tests to match sub-minute polling abilities.

Signed-off-by: Edwin Buck <[email protected]>
Renamed items in authorized entryfetcher attested nodes to not conflict.

Signed-off-by: Edwin Buck <[email protected]>
…ies.

Renamed similar test in attested nodes unit test to not conflict.

Signed-off-by: Edwin Buck <[email protected]>
Rename equivalent attested nodes unit test to avoid collision.
Rename registration entries func call to match attested nodes pattern.

Signed-off-by: Edwin Buck <[email protected]>
@edwbuck edwbuck changed the title from "Backoff" to "Implement cache update deduplication per fetch cycle and backoff algorithm." on Oct 1, 2024
Signed-off-by: Edwin Buck <[email protected]>
Signed-off-by: Edwin Buck <[email protected]>
Added unit testing for first three boundaries.

Signed-off-by: Edwin Buck <[email protected]>

edwbuck commented Oct 7, 2024

Failing unit tests seem unrelated to the PR.

* are primarily set by the low order bits repeatedly by multiplication with
* a number designed to mix bits deterministically for better hash dispersion.
*/
func hash(event uint) uint {
Contributor

Confirming that since this is a uint, we expect this to be valid for how many events until uint overflow?

Contributor Author

4_294_967_295 is max uint (about 4 billion). At 800_000 a day, that would be 5_368 days, or about 14.7 years.

I imagine that in order to guard against this, we would have to dig into the database setup code that GORM creates. That's because the database is the origin of the event's id, set through autoincrement.

Since Uber is skipping one for every entry, Uber would be burning through these at twice the 800_000 rate, so you have 7 years to address the issue.

Also, the odds are very good that your mysql implementation is using Integer, and not Uint, so you might have half of that. Note that mysql (last I checked) had a very unusual way of handling the issue: it simply reused the max int value over and over again, so you would get a hard failure (a primary key can't be used for two entries).

To reset this you could shut down the servers, delete the event tables, and relaunch. Nothing about the next item to process is persisted, so the system would start again at 0.

Member

I guess this brings to mind, for me, the tangential issue #5501.

I wouldn't necessarily say it's blocking for me, but the proposed workaround sounds like a surprisingly significant manual action, one the operator needs to be aware of to even know they need to do it, let alone to do it safely.

It would be worthwhile to consider (over the 7 years, as you brought up :) ) how we can solve this more gracefully.

Member

@azdagron azdagron Oct 9, 2024

uint is arch dependent. We only ship 64-bit artifacts, so unless someone is building against 32-bit (unlikely), the uint value itself will be 64 bits. That said, GORM (maybe now, and at least with the version we are using) uses 32-bit integer database types for IDs of type uint.

Ideally, SPIRE should declare its own model type using an explicit 64-bit type and then provide a migration to move the IDs to a 64-bit database type.
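
As a sketch of that idea (hypothetical model name and tags, not the actual SPIRE schema or a real migration):

package model

// RegisteredEntryEventModel sketches an event record that pins its
// primary key to an explicit 64-bit type instead of relying on GORM's
// default mapping for uint, which (per the discussion above) can land
// on a 32-bit database column.
type RegisteredEntryEventModel struct {
	ID      int64  `gorm:"primary_key;auto_increment"` // explicit 64-bit column
	EntryID string
}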

Contributor

@stevend-uber stevend-uber left a comment

If my understanding of the added code and changed functionality is correct, these changes LGTM. Regarding the custom boundary and hash code, we'd expect this to benefit our environment by saving approximately 1.5 GB of memory in the worst-case scenarios, which is non-trivial but also not breaking. (Read: we would benefit, but would be happy with, and fully understand, the maintainers going with a different approach.)

Thanks so much @edwbuck for making these changes!


edwbuck commented Oct 9, 2024

@stevend-uber Thanks for the LGTM. There are likely further optimizations that can happen, and one optimization that @azdagron mentioned would improve the memory performance greatly.

In any case, the database performance should improve, even if the server performance takes a hit. The server performance can be further improved by the memory optimization @azdagron has suggested and, if the current algorithm is completely replaced, by whatever replacement is offered.

Thank you for the passing review.

Member

@azdagron azdagron left a comment

Without benchmark results (or better yet, a repeatable benchmark), the performance and impact of the event tracker (on CPU and memory) are hard to determine. This is of particular importance considering the scale of events we expect to be tracked in some of the larger SPIRE deployments.
While I wouldn't block this PR on that, it certainly would be prudent to make them available before we'd consider graduating this feature. We've missed this kind of benchmarking on related features in the past and have at times been surprised by the impact.

I similarly won't block for nitpick comments but would welcome fixes for improved readability.

It has been pointed out in the review and on contributor calls, but there is some concern from maintainers and other parties about the readability/maintainability of the event tracker. An alternative approach has been prototyped and is available, with benchmarks, here and here. As mentioned in the last contributor sync, the maintainers are discussing how we'd like to move forward and will have an update soon.

Thank you for your patience.

pkg/server/datastore/sqlstore/sqlstore.go (outdated review thread, resolved)
pkg/server/endpoints/eventTracker.go (outdated review thread, resolved)
Comment on lines 96 to 98
if boundaryList == nil {
boundaryList = []uint{}
}
Member

nit: why does the returned slice have to be non-nil even when empty?

Contributor Author

It doesn't, but it makes the unit testing framework happier.
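
For context, this is the sort of assertion that treats a nil slice and an empty slice as different values (a sketch using testify; the test itself is hypothetical):

package endpoints

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestNilVersusEmptySlice(t *testing.T) {
	var nilSlice []uint
	emptySlice := []uint{}

	// testify compares with reflect.DeepEqual semantics, which
	// distinguish a nil slice from an allocated empty one.
	assert.Equal(t, []uint{}, emptySlice)  // passes
	assert.NotEqual(t, []uint{}, nilSlice) // also passes: nil != empty
}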

pkg/server/endpoints/eventTracker.go (outdated review thread, resolved)
pkg/server/endpoints/eventTracker.go (outdated review thread, resolved)
pkg/server/endpoints/eventTracker_test.go (outdated review thread, resolved)
* the poll list.
*/
func (et *eventTracker) SelectEvents() []uint {
pollList := et.pool.Get().([]uint)
Member

This isn't sufficient. In order to leverage the pool, memory needs to be returned to it. If you look at the pq branch, you'll see an additional "FreeEvents" method added to the tracker, which is invoked by the caller of SelectEvents when it has finished, to "return" the events slice back to the event tracker for reuse.

Member

That method is only called in benchmarks in the pq branch. FWIW.


edwbuck commented Oct 9, 2024

Without benchmark results (or better yet, a repeatable benchmark), the performance and impact of the event tracker (on CPU and memory) are hard to determine. This is of particular importance considering the scale of events we expect to be tracked in some of the larger SPIRE deployments. While I wouldn't block this PR on that, it certainly would be prudent to make them available before we'd consider graduating this feature. We've missed this kind of benchmarking on related features in the past and have at times been surprised by the impact.

Currently we create a slice of all missed events and walk that slice unconditionally, each polling cycle. My submission does basically the same, with a bit of additional math carefully crafted to use increment, comparison by subtraction, pointer (base address + offset) memory semantics, and modulus division, all of which are relatively fast operations, many supported by single machine instructions.

All of those items add some time to the calculation, but they keep it O(n), without heavy branching or pipeline bubbles. As a result, the algorithm is relatively fast. If the modulus division operation were not O(1), hash table lookups would not be (effectively) O(1) either. My "n" comes from the fact that we do this O(1) calculation n times.
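
As a simplified illustration of that shape of computation (the constants, boundaries, and function names below are illustrative, not the PR's actual schedule):

package main

import "fmt"

// hash deterministically mixes an event id by multiplying with a large
// odd constant (Knuth's multiplicative hashing), so ids allocated in a
// burst spread across different polling offsets. No branching, O(1).
func hash(event uint) uint {
	return event * 2654435761
}

// pollNow reports whether an event should be polled on this cycle,
// using only increment, comparison, and modulus division: every cycle
// at first, then once per stride at an offset derived from the hash.
func pollNow(eventID, ticks uint) bool {
	if ticks < 6 {
		return true // young events poll every cycle
	}
	const stride = 12
	return ticks%stride == hash(eventID)%stride
}

func main() {
	// Deterministic: the same event id always polls on the same cycles,
	// which is what makes exact unit-test assertions possible.
	for ticks := uint(0); ticks < 40; ticks++ {
		if pollNow(323423, ticks) {
			fmt.Println("poll at cycle", ticks)
		}
	}
}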

I similarly won't block for nitpick comments but would welcome fixes for improved readability.

I don't mind improving readability. I'll even update the nits if I have the time to do so. Sometimes they really help readability.

It has been pointed out in the review and on contributor calls, but there is some concern from maintainers and other parties about the readability/maintainability of the event tracker. An alternative approach has been prototyped and is available, with benchmarks, here and [here](https://github.com/azdagron/spire/blob/pq/pkg/server/endpoints/benchmark_results.txt). As mentioned in the last contributor sync, the maintainers are discussing how we'd like to move forward and will have an update soon.

What wasn't pointed out in the contributor call is that the benchmark above didn't include dispersion of identically scheduled items, so all items would initially poll at the exact same offsets. Also, even without that dispersion added yet, the benchmark runs at only about 60% of the speed of the original submission.

Finally, the initial dispersion routine offered can't be unit tested easily, because it adds a random jitter of -3 seconds to +3 seconds. This means polling for the next minute will "miss" a few elements (randomly) and potentially double-poll elements (also randomly).

While the presented hash above doesn't have perfect dispersion, it's good enough to be within 10% of the input value across all the currently tested partitions, and, more importantly, it is deterministic. That means we can write tests to validate that Event ID 323423 polls at polling intervals 1, 2, 3, 4, 5, 8, 11, 18, 23, and so on, without ever having to run a population of events to determine whether their average correctly polls within a minute.

In fact, the replacement algorithm would not even pass the same unit tests, because those tests validate that an item is polled only a certain number of times, and the random jitter could increase or decrease that count as it lengthened or shortened the polling delay.

Thank you for your patience.

You're welcome, but we have a solution that's been in review for two weeks now, and the maintainers are resorting to attempting to rewrite it to meet some goal, discarding some of the desired characteristics as they progress in redoing the work offered.

Debugging a priority queue is not trivial. Debugging a priority queue with random jitter is a bit harder still. Even if we fix the drift so the next poll doesn't accumulate errors over time, just writing the unit tests to prove some of the properties we have now will not be trivial. We cannot prove, for example, that something is polled only X number of times. We cannot trivially remove the polling of an item without allowing it to poll first.

Hiding the complexity in another package isn't reduction of complexity. Both solutions are non-trivial, but one is deterministic, and the other has random elements. We should prefer to debug deterministic items, instead of relying on luck and chance with random ones. And while one uses actual times instead of poll counts, we only poll on certain time offsets (determined by the polling interval) so actual times don't matter as much as one might think.


azdagron commented Oct 9, 2024

And what isn't pointed out in the contributor call is that the benchmark above didn't include dispersion of identically scheduled items, such that all items would initially poll at the exact same offsets.

You are correct that your solution immediately distributes bursty event arrival across the poll windows. That said, these events already arrive somewhat distributed. Further, even if there are bursts of changes, SPIRE server can only ingest registration entries/agents at a certain rate. Considering that the largest "window" in the current backoff strategy is 1 minute, it seems reasonable to assume there will be enough natural distribution even without jitter to spread that load.

It's also not quite clear that even with this not being added yet, the benchmark is about 60% of the speed of the original submission.

On my machine it was 7ms (this PR) vs 12ms (priority queue). An extra 5ms of cpu spent across a 5000ms interval (5s) does not seem like a high price to pay for readability and simplicity.

Finally, the initial dispersion routine offered can't be unit tested easily

Jitter can be disabled during tests (and is, if you look at the pq branch) to assert the correct backoff strategy. The jitter distribution can be measured separately.

, because it adds a random jitter of -3 seconds to +3 seconds. This means polling for the next minute will "miss" a few elements (randomly) and potentially double-poll elements (also randomly).

The consequence of a miss or double poll of an individual event seems irrelevant? The time-to-noticing-a-skipped-event-has-arrived will still be within 57-63s vs 60s. Database load should also not be materially different. Over the aggregate you're still averaging once every 60s (assuming a good PRNG).

You're welcome, but we have a solution that's been in review for two weeks now,

I would respectfully ask you to reframe your expectations on how long it takes to get PRs reviewed and merged. The project maintainers are busy with all varieties of work. We have a slice of time we can dedicate to SPIRE, and within it we have to fit servicing myriad PRs, issues, releases, security responses, community support, community engagement, etc. Our time is largely volunteer and in many cases comes at a cost to our own personal time. A PR of this size takes time to review, often requiring a large chunk of time to keep context fresh. Finding that time is challenging and rarely immediate. Your expectation, and continued pressure, that these PRs merge according to your timetable are unhelpful and unwelcome.

and the maintainers are resorting to attempting to rewrite it to meet some goal, discarding some of the desired characteristics as they progress in redoing the work offered.

Ultimately maintainers are responsible for the long-term maintenance of the project. Every piece of "work offered" must be evaluated on correctness, impact, and maintainability. Vetting a PR against alternative designs is a critical part of that.

Both solutions are non-trivial

I'll leave that as an exercise to the reader. I personally find the PQ implementation to be much easier to reason about. I understand if you feel differently. I look forward to additional scrutiny and evaluation from other interested maintainers and contributors.

@rturner3
Collaborator

After reviewing this PR, we as a maintainer group decided that we are going to put a pause on this PR for now given the complexity of the proposed approach. It's our opinion that the complexity of this approach may be difficult for operators to understand and debug should any issues come up related to entry distribution. Additionally, we feel it will be challenging for us to maintain as there is a lot of new state now being tracked in the server that, in our opinion, takes quite a bit of time to reason about. We feel maintainability and ease of understanding are important considerations for any changes we make in the project given its current adoption and maturity level.

We are currently reflecting on whether to pursue an alternative solution targeting the specific issue laid out in #5341 (such as the PQ implementation proposed by @azdagron or some other solution), or whether we should rethink the overall approach taken to solve this event-based entry sync with the database. After seeing this feature take shape over time, it's not clear whether the current solution is addressing the overall problem of optimizing synchronization of entry data between the server and its database in the simplest way. We plan to continue to discuss the future of this feature and help propose a path forward for where to take it.

@edwbuck In the future, if you are planning to propose a non-trivial change to the project, I would highly encourage you to document the design in an issue that several maintainers can review before we get to the implementation stage. I can see that you put a lot of thought and effort into this PR, and I truly appreciate the time you devoted to trying to help improve the project. However, in this instance, I believe it could have saved both of us some time and trouble if we had aligned earlier on a more detailed design in an issue upfront. Documenting the design in an issue also gives other members of the community an opportunity to provide early feedback in a more public forum.

That being said, I think we as project maintainers could have also done better to provide clearer guidelines to contributors to facilitate the change proposal process. We plan to improve on this and hopefully prevent future situations like this. We are tracking these process improvements in #5567.

Lastly, I want to acknowledge what is probably a frustrating experience for you with this PR. I hope you can appreciate why we have decided not to accept this change at this time and are open to revisiting how we collaborate to make future change proposals more successful.

@azdagron azdagron modified the milestones: 1.11.0, 1.11.1 Oct 15, 2024
This limits the solution to full polling as before, but still preserves
the work that would close spiffe#5349.

Signed-off-by: Edwin Buck <[email protected]>
@azdagron azdagron changed the title from "Implement cache update deduplication per fetch cycle and backoff algorithm." to "Implement cache update deduplication per fetch cycle" on Oct 16, 2024
@@ -351,6 +351,8 @@ func (ds *Plugin) UpdateAttestedNode(ctx context.Context, n *common.AttestedNode
if err != nil {
return err
}
// TODO: this is at the wrong level of the software stack.
// It should be created in the caller of the datastore interface.
Member

I think we missed removal of this TODO. Can you please remove?

Contributor Author

Fixed in next push.

@@ -367,6 +369,8 @@ func (ds *Plugin) DeleteAttestedNode(ctx context.Context, spiffeID string) (atte
if err != nil {
return err
}
// TODO: this is at the wrong level of the software stack.
// It should be created in the caller of the datastore interface.
Member

I think we missed removal of this TODO. Can you please remove?

Contributor Author

Fixed in next push.

@azdagron azdagron modified the milestones: 1.11.1, 1.11.0 Oct 16, 2024
@azdagron
Member

Ah, yes, there is a nuance with sync.Pool use that is failing the linter. Pools should generally only be used with pointer values (e.g., a pointer to the slice in this situation). It won't really make a difference here, since we're only putting in one item, so we can either fix it or suppress the linter.

https://staticcheck.dev/docs/checks#SA6002
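
A sketch of the pointer-to-slice shape the linter wants (simplified; the event tracker's real code differs):

package main

import (
	"fmt"
	"sync"
)

// Storing *[]uint rather than []uint means Put does not allocate a new
// interface value around the slice header on every call (SA6002).
var pool = sync.Pool{
	New: func() any { s := make([]uint, 0, 64); return &s },
}

func selectEvents() *[]uint {
	return pool.Get().(*[]uint)
}

// freeEvents returns the slice for reuse; without a call like this,
// the pool is never refilled and every cycle allocates fresh memory.
func freeEvents(events *[]uint) {
	*events = (*events)[:0] // keep capacity, drop contents
	pool.Put(events)
}

func main() {
	events := selectEvents()
	*events = append(*events, 7, 8, 9)
	fmt.Println(*events)
	freeEvents(events)
}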

@azdagron azdagron merged commit 5186212 into spiffe:main Oct 17, 2024
34 checks passed