forked from apache/spark
[SPARK-42931][SS] Introduce dropDuplicatesWithinWatermark
### What changes were proposed in this pull request?

This PR proposes to introduce a new dropDuplicates API, dropDuplicatesWithinWatermark, which differs from the existing dropDuplicates in the following ways:

* Weaker constraints on the subset (key)
  * Does not require an event time column in the subset.
* Looser semantics on deduplication
  * Only guarantees to deduplicate events within the watermark delay.

Since the new API leverages event time, it has the following new requirements:

* The watermark must be defined in the streaming DataFrame.
* The event time column must be defined in the streaming DataFrame.

More specifically on the semantics: once the operator processes the first arrived event, events arriving within the watermark for that first event will be deduplicated. (Technically, the expiration time should be the "event time of the first arrived event + watermark delay threshold", to match up with future events.)

Users are encouraged to set the watermark delay threshold longer than the maximum timestamp difference among duplicated events. (If they are unsure, they can alternatively set the delay threshold large enough, e.g. 48 hours.)

For a batch DataFrame, this is equivalent to dropDuplicates.

This PR also updates the SS guide doc to introduce the new feature; screenshots below:

<img width="747" alt="Screenshot 2023-04-06 11:09:12 AM" src="https://user-images.githubusercontent.com/1317309/230254868-7fe76175-5883-4700-b018-d85d851799cb.png">
<img width="749" alt="Screenshot 2023-04-06 11:09:18 AM" src="https://user-images.githubusercontent.com/1317309/230254874-a754cdfd-2832-41dd-85b6-291f05eccb3d.png">
<img width="752" alt="Screenshot 2023-04-06 11:09:23 AM" src="https://user-images.githubusercontent.com/1317309/230254876-7fd7b3b1-f59d-481f-8249-5a4ae556c7cf.png">
<img width="751" alt="Screenshot 2023-04-06 11:09:29 AM" src="https://user-images.githubusercontent.com/1317309/230254880-79b158ca-3403-46a6-be4a-46618ec749db.png">

### Why are the changes needed?

The existing dropDuplicates API does not address a valid use case in streaming queries. There are many cases where the event times are not exactly the same although the events themselves are the same. One example is duplicated events produced by a non-idempotent writer, where the event time is issued on the producer/broker side. Another example is that the value of the event time is unstable and users want to use an alternative timestamp, e.g. ingestion time.

In these cases, users have to exclude the event time column from the deduplication subset, but then the operator is unable to evict state, leading to indefinitely growing state.

To allow state eviction without requiring the event time column to be part of the deduplication subset, we need to loosen the semantics of the API, which warrants a new API.

### Does this PR introduce _any_ user-facing change?

Yes, this introduces a new public API, dropDuplicatesWithinWatermark.

### How was this patch tested?

New test suite.

Closes apache#40561 from HeartSaVioR/SPARK-42931.

Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
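
The expiration rule described in the commit message ("event time of the first arrived event + watermark delay threshold") can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's implementation: the class `WithinWatermarkDeduplicator` and its members are hypothetical names, the watermark is assumed to be "max event time seen so far minus the delay", it is assumed to advance after every single event rather than per micro-batch, and state is assumed to be evicted once its expiration time is at or below the watermark.

```python
class WithinWatermarkDeduplicator:
    """Toy single-process model of the semantics described in the commit message.

    Hypothetical helper for illustration only; it is not part of Spark.
    """

    def __init__(self, delay):
        self.delay = delay            # watermark delay threshold (seconds)
        self.max_event_time = None    # largest event time seen so far
        self.expires_at = {}          # key -> expiration time of its state

    def process(self, key, event_time):
        """Return True if the event is emitted, False if dropped as a duplicate."""
        emitted = key not in self.expires_at
        if emitted:
            # First arrival for this key: expiration is
            # "event time of the first arrived event + delay threshold".
            self.expires_at[key] = event_time + self.delay

        # Assumption: the watermark trails the max event time by the delay.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - self.delay

        # Assumption: state whose expiration is at or below the watermark
        # is evicted, which keeps the state from growing indefinitely.
        self.expires_at = {
            k: t for k, t in self.expires_at.items() if t > watermark
        }
        return emitted


dedup = WithinWatermarkDeduplicator(delay=10)
events = [("a", 100), ("a", 105), ("b", 103), ("c", 130), ("a", 131)]
results = [dedup.process(key, ts) for key, ts in events]
print(results)
```

In this run the second `"a"` event is dropped even though its event time differs from the first, because the key alone identifies the duplicate. The last `"a"` event is emitted again: the `"c"` event moved the watermark past the expiration of the first `"a"`, so its state had already been evicted. That is the "only within the watermark delay" guarantee in practice.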
1 parent d8b720a, commit 0e9e34c
Showing 13 changed files with 614 additions and 31 deletions.