fix(ingest): stateful redundant run skip handler #8467
@mayurinehate looks like we're using
Oops, fixed. IDE autocomplete got me.
…into redundant_run_fix
@asikowitz I have addressed the review comments. Can you please take another pass at the review? Changes:
@@ -213,19 +213,6 @@ def ensure_top_n_queries_is_not_too_big(cls, v: int) -> int:
         )
         return v

-    @pydantic.validator("start_time")
This was removed to allow users to specify an exact, absolute start_time if necessary. For scheduled ingestions (default/relative start time), aligning start_time with the bucket start time is already handled elsewhere.
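For readers unfamiliar with the alignment being discussed, here is a minimal sketch of snapping a timestamp to the start of its bucket. The `BucketDuration` enum and `align_to_bucket_start` function are hypothetical names chosen for illustration; they are not taken from this PR and may differ from what the project actually uses.

```python
from datetime import datetime, timezone
from enum import Enum


class BucketDuration(str, Enum):
    """Hypothetical bucket sizes, for illustration only."""

    DAY = "DAY"
    HOUR = "HOUR"


def align_to_bucket_start(ts: datetime, bucket: BucketDuration) -> datetime:
    """Round a timestamp down to the start of its bucket."""
    if bucket is BucketDuration.HOUR:
        # Keep the date and hour, zero out everything smaller.
        return ts.replace(minute=0, second=0, microsecond=0)
    # DAY bucket: keep only the date.
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    ts = datetime(2023, 7, 20, 13, 45, 10, tzinfo=timezone.utc)
    print(align_to_bucket_start(ts, BucketDuration.DAY))
    print(align_to_bucket_start(ts, BucketDuration.HOUR))
```

With a validator like the removed one, an absolute start_time would always be rounded this way; without it, the value the user supplies is kept exactly.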
I really like how you've refactored around `TimeWindow`. I have a couple of nits here and there, but I think this is good to go. How do you think we should go about testing this?
metadata-ingestion/src/datahub/ingestion/source/state/redundant_run_skip_handler.py
self.usage_start_time,
self.usage_end_time,
Where are you doing this?
For testing:
I've tested this manually against real Snowflake and BigQuery instances for multiple cases to make sure the start and end times are picked up correctly. As for the actual workunits, there was an issue when emitting zero-usage aspects for the configured time window, and I've fixed it. As a follow-up, we can add integration tests for Snowflake, Redshift, and BigQuery covering the scheduled ingestion cases.
@@ -36,7 +36,7 @@ def right_intersects(self, other: "TimeWindow") -> bool:

     def starts_after(self, other: "TimeWindow") -> bool:
         """Whether current window starts after other window ends"""
-        return other.start_time < other.end_time <= self.start_time
+        return other.start_time < other.end_time < self.start_time
Meant to say that this should be `other.start_time <= other.end_time`, like the change to `contains`.
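To make the suggestion concrete, here is a minimal sketch of one reading of it: the first comparison is relaxed to `<=` so that a zero-length window is still accepted, while the strict `<` against `self.start_time` from the diff is kept. The dataclass below is a simplified stand-in written for this example, not the project's actual `TimeWindow` class.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimeWindow:
    """Simplified stand-in for illustration; assumes timezone-aware datetimes."""

    start_time: datetime
    end_time: datetime

    def starts_after(self, other: "TimeWindow") -> bool:
        """Whether the current window starts after the other window ends."""
        # `<=` lets a zero-length `other` window (start == end) qualify;
        # the strict `<` means a window that begins exactly where `other`
        # ends is not treated as starting after it.
        return other.start_time <= other.end_time < self.start_time


if __name__ == "__main__":
    t0 = datetime(2023, 7, 20, tzinfo=timezone.utc)
    hour = timedelta(hours=1)
    later = TimeWindow(t0 + hour, t0 + 2 * hour)
    print(later.starts_after(TimeWindow(t0, t0)))         # zero-length window
    print(later.starts_after(TimeWindow(t0, t0 + hour)))  # adjacent window
```

Whether adjacent windows should count as "after" depends on whether `end_time` is treated as inclusive, which this sketch does not settle.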
Summary of changes -
store_last_profiling_timestamps → enable_stateful_profiling
store_last_usage_extraction_timestamp → enable_stateful_usage_ingestion
store_last_lineage_extraction_timestamp → enable_stateful_lineage_ingestion
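For anyone updating existing recipes, the renames above can be applied mechanically. The helper below is a hypothetical sketch that operates on a plain dictionary of config values; it is not part of this PR and does not reflect how the project itself handles the old names.

```python
from typing import Any, Dict

# Old flag name -> new flag name, as listed in the summary above.
RENAMED_FLAGS: Dict[str, str] = {
    "store_last_profiling_timestamps": "enable_stateful_profiling",
    "store_last_usage_extraction_timestamp": "enable_stateful_usage_ingestion",
    "store_last_lineage_extraction_timestamp": "enable_stateful_lineage_ingestion",
}


def migrate_flags(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with old flag names replaced by new ones.

    If both the old and the new name are present, the new name's value wins.
    """
    migrated = dict(config)
    for old, new in RENAMED_FLAGS.items():
        if old in migrated:
            value = migrated.pop(old)
            migrated.setdefault(new, value)
    return migrated


if __name__ == "__main__":
    print(migrate_flags({"store_last_profiling_timestamps": True}))
```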