[SPARK-50302][SS] Ensure secondary index sizes equal primary index sizes for TransformWithState stateful variables with TTL

### What changes were proposed in this pull request?

This PR ensures that the secondary indexes used by state variables with TTL are at most the size of the corresponding state variable's primary index. This change eliminates unnecessary work done during the cleanup of stateful variables with TTL.

### Why are the changes needed?

#### Context

The `TransformWithState` operator (referred to hereafter as "TWS") allows users to write procedural logic over streams of records. To store state between micro-batches, Spark provides users with _stateful variables_, which persist between micro-batches. For example, you might want to emit an average of the past 5 records, every 5 records. You might only receive 2 records in the first micro-batch, so you have to _buffer_ these 2 records until you receive 3 more in a subsequent batch.

TWS supports 3 different types of stateful variables: single values, lists, and maps. The TWS operator also supports stateful variables with Time To Live (TTL); this allows you to say, "keep a certain record in state for `d` units of time". This TTL is per-record: every record in a list (or map) can expire at a different point in time, depending on when it was inserted. A record inserted into a stateful list (or map) at time `t1` will expire at `t1 + d`, and a second record inserted at time `t2` will expire at `t2 + d`. (For value state, there's only one value, so "everything" expires at the same time.)

A very natural question to ask is: how do we efficiently determine which elements in state have expired, without doing a full scan of every record? The idea is to keep a secondary index from expiration timestamp to the specific record that needs to be evicted. Not so hard, right?

#### The state cleanup strategy today

Today's cleanup strategy is about as simple as indicated earlier: for every insert to a value/map/list, you:

1. Write to the primary index.
2. Using the current timestamp, write into the secondary index.

The issue with this approach is that we do two _unconditional_ writes. This means that if the same state variable is written to with different timestamps, there will be one element in the primary index but two elements in the secondary index. Consider the following example for a state variable `foo` with value `v1` and a TTL delay of 500.

Batch 0, `batchTimestampMs = 100`, `foo` updates to `v1`:

- Primary index: `[foo -> (v1, 600)]`
- Secondary index: `[(600, foo) -> EMPTY]`

Note that the state variable is included in the secondary index key because we might have several elements with the same expiration timestamp; we want `(600, foo)` to not overwrite a `(600, bar)` just because they both expire at 600.

Batch 1, `batchTimestampMs = 200`, `foo` updates to `v2`:

- Primary index: `[foo -> (v2, 700)]`
- Secondary index: `[(600, foo) -> EMPTY, (700, foo) -> EMPTY]`

Now we have two entries in our secondary index. If the current timestamp advanced to something like 800, we'd take the following steps:

1. We'd take the first element from the secondary index, `(600, foo)`, and look up `foo` in the primary index. That would yield `(v2, 700)`. The value of 700 in the primary index is still less than 800, so we would remove `foo` from the primary index.
2. Then, we would look at `(700, foo)`. We'd look up `foo` in the primary index and see nothing, so we'd do nothing.

You'll notice that step 2 is _entirely_ redundant. We read `(700, foo)` and did a get against the primary index for something that was doomed: it would never have returned anything.

While this isn't great, the story is unfortunately significantly worse for lists. The way that we store lists is by having a single key in RocksDB whose value is the concatenated bytes of all the values in that list.
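The redundant step 2 in the value-state walkthrough above can be simulated with a minimal sketch in plain Python (the dicts and the `upsert`/`cleanup` helpers are illustrative stand-ins for the RocksDB indexes, not Spark code):

```python
# Illustrative sketch of today's "two unconditional writes" strategy.
ttl_ms = 500
primary = {}    # key -> (value, expiration_ts)
secondary = {}  # (expiration_ts, key) -> None, scanned in sorted order

def upsert(key, value, now_ms):
    """Naive strategy: write both indexes unconditionally."""
    primary[key] = (value, now_ms + ttl_ms)
    secondary[(now_ms + ttl_ms, key)] = None

upsert("foo", "v1", 100)   # secondary: {(600, 'foo')}
upsert("foo", "v2", 200)   # secondary: {(600, 'foo'), (700, 'foo')} -- stale entry!

def cleanup(now_ms):
    lookups = 0
    for ts, key in sorted(secondary):
        if ts > now_ms:
            break
        lookups += 1
        entry = primary.get(key)   # the second pass is a doomed lookup
        if entry is not None and entry[1] <= now_ms:
            del primary[key]
        del secondary[(ts, key)]
    return lookups

# Two secondary entries are scanned for a single real eviction.
assert cleanup(800) == 2
```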
When we do cleanup for a list, we go through _all_ of its records and rewrite the surviving ones. Thus, it's possible for us to have a list that looks something like:

- Primary index: `[foo -> [(v1, 600), (v2, 700), (v3, 900)]]`
- Secondary index: `[(600, foo) -> EMPTY, (700, foo) -> EMPTY, (900, foo) -> EMPTY]`

Now, suppose that the current timestamp is 800. We need to expire the records in the list, so we do the following:

1. We take the first element from the secondary index, `(600, foo)`. This tells us that the list `foo` needs cleaning up. We clean up everything in `foo` less than 800. Since we store lists as a single key, we issue a RocksDB `clear` operation, iterate through all of the existing values, eliminate `(v1, 600)` and `(v2, 700)`, and write back `(v3, 900)`.
2. But we still have things left in our secondary index! We now get `(700, foo)`, and we unknowingly do cleanup on `foo` _again_. This consists of clearing `foo`, iterating through its elements, and writing back `(v3, 900)`. But since cleanup already happened, this step is _entirely_ redundant.
3. We encounter `(900, foo)` from the secondary index, and since 900 > 800, we can bail out of cleanup.

Step 2 here is extremely wasteful. If we have `n` elements in our secondary index for the same key, then, in the worst case, we will do the extra cleanup `n-1` times; and each time is a _linear_-time operation! Thus, for a list that has `n` elements, `d` of which need to be cleaned up, the worst-case time complexity is in `O(d*(n-d))` instead of `O(n)`. And it's _completely_ unnecessary work.

#### How does this PR fix the issue?

It's pretty simple to fix this for value state and map state, because every key in value or map state maps to exactly one element in the secondary index, so we can maintain a one-to-one correspondence. Any time we modify value/map state, we make sure that we delete the previous entry in the secondary index. This logic is implemented by `OneToOneTTLState`.
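The one-to-one bookkeeping described above can be sketched in plain Python (this is an illustration of the idea, not the actual `OneToOneTTLState` code; all names are made up for the example):

```python
# Illustrative sketch of the one-to-one fix: on every write, first delete
# the previous secondary-index entry for the key, so the two indexes stay
# in one-to-one correspondence.
ttl_ms = 500
primary = {}    # key -> (value, expiration_ts)
secondary = {}  # (expiration_ts, key) -> None

def upsert(key, value, now_ms):
    old = primary.get(key)
    if old is not None:
        # Drop the stale secondary entry before writing the new one.
        secondary.pop((old[1], key), None)
    primary[key] = (value, now_ms + ttl_ms)
    secondary[(now_ms + ttl_ms, key)] = None

upsert("foo", "v1", 100)
upsert("foo", "v2", 200)

# Exactly one secondary entry per primary key: cleanup can never hit a
# doomed lookup.
assert len(secondary) == len(primary) == 1
assert (700, "foo") in secondary
```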
The trickier aspect is handling this for `ListState`, where the secondary index goes from expiration timestamp to the list that needs to be cleaned up. There's a one-to-many mapping here: one grouping key maps to multiple records, all of which could expire at a different time. The trick to making sure that secondary indexes don't explode is to have the secondary index store only the minimum expiration timestamp in a list. The rough intuition is that you don't need to store anything larger than that: when you clean up due to the minimum expiration timestamp, you'll go through the list anyway, so you can find the next minimum timestamp and put _that_ into your secondary index. This logic is implemented by `OneToManyTTLState`.

### How should reviewers review this PR?

- Start by reading this long description. If you have questions, please ping me in the comments. I would be more than happy to explain.
- Then, understand the class doc comments for `OneToOneTTLState` and `OneToManyTTLState` in `TTLState.scala`.
- Then, I'd recommend going through the unit tests and making sure that the _behavior_ makes sense to you. If it doesn't, please leave a question.
- Finally, you can look at the actual stateful variable implementations.

### Does this PR introduce _any_ user-facing change?

No, but it does change the format in which TWS represents its internal state. However, since TWS is currently `private[sql]` and not publicly available, this is not an issue.

### How was this patch tested?

- Existing UTs have been modified to conform with this new behavior.
- New UTs were added to verify that the new indexes behave as expected.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot

Closes #48853 from neilramaswamy/spark-50302.

Authored-by: Neil Ramaswamy <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
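The minimum-expiration-timestamp scheme for lists described above can be sketched in plain Python (an illustration of the idea, not the actual `OneToManyTTLState` code; names are made up for the example):

```python
# Illustrative sketch of the one-to-many fix: the secondary index holds
# only the *minimum* expiration timestamp per list. When that minimum
# fires, cleanup walks the list once, evicts the expired rows, and
# registers the new minimum.
lists = {"foo": [("v1", 600), ("v2", 700), ("v3", 900)]}
secondary = {(600, "foo"): None}   # one entry per list: its min expiration

def cleanup(now_ms):
    for ts, key in sorted(secondary):
        if ts > now_ms:
            break
        del secondary[(ts, key)]
        survivors = [(v, exp) for (v, exp) in lists[key] if exp > now_ms]
        if survivors:
            lists[key] = survivors
            # Register the next minimum, so cleanup fires exactly once
            # per eviction pass instead of once per stale entry.
            secondary[(min(exp for _, exp in survivors), key)] = None
        else:
            del lists[key]

cleanup(800)
assert lists == {"foo": [("v3", 900)]}
assert secondary == {(900, "foo"): None}
```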