feat(ingest): allow lower freq profiling based on date of month/day of week #8489
Conversation
Review thread on `metadata-ingestion/src/datahub/ingestion/source_config/operation_config.py`:

```python
    return profile_date_of_month

def is_profiling_enabled(operation_config: OperationConfig) -> bool:
```
This PR just got my attention because this feature makes total sense: I want to decouple the scheduling of profiling from the basic ingestion without having to duplicate the recipes. So glad to see this!

About this `is_profiling_enabled` implementation, let me test my understanding of the code. If I run ingestion on an hourly basis and I want profiling to be executed, e.g., once a month with `profile_date_of_month=1`, this will result in profiling being executed on every hourly run during day 1. Am I wrong? Unless my analysis is off, I would say this is not the desired behavior.

Solving this will likely require saving some state to determine whether profiling has already been executed.

Alternatively, the issue may also be solved by giving the profiling schedule more precision than weekday XOR monthday: accept a full cron expression and check whether it is time to run profiling.

There is a library supporting this case, https://github.com/kiorky/croniter. It provides functionality to test whether a date matches a cron expression; rounding the date to, e.g., the hour may be enough and mitigate the need to save state.
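To illustrate the idea (croniter itself provides `croniter.match` for this), here is a dependency-free sketch of a tiny cron matcher that supports only `*`, single numbers, and comma lists, with the run time rounded down to the hour as suggested. Real cron syntax (ranges, steps, names) is what the library would handle; this is just the concept.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    # Supports "*", single numbers, and comma-separated lists only;
    # full cron syntax would come from a library like croniter.
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a 5-field cron expression (minute hour dom month dow)
    against a datetime rounded down to the hour."""
    minute, hour, dom, month, dow = expr.split()
    when = when.replace(minute=0, second=0, microsecond=0)
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(dom, when.day)
        and _field_matches(month, when.month)
        and _field_matches(dow, when.isoweekday() % 7)  # cron convention: 0 = Sunday
    )
```

Because the time is rounded to the hour before matching, an hourly ingestion evaluates each cron slot at most once, which mitigates the need to save state.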
We're thinking about this PR as a stop-gap solution to fix the pain point, but know that it's not going to handle all use cases well and is a bit bespoke. Eventually, we want to move towards something that looks more like this:
```yaml
profiling.enabled:
  - operator: day_of_week_matches
    values: ["SUN"]
  - operator: day_of_month_matches
    values: ["5"]
```
We'd have a broader set of operators + ability to use custom user-provided operators, which should afford us the flexibility that we need.
But definitely looking for thoughts and feedback on this, as well as other use cases that you're thinking about.
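Evaluation under that operator-based design might look roughly like the sketch below. The operator names come from the YAML above; the OR semantics across entries are an assumption, since the thread does not specify how multiple entries combine.

```python
from datetime import datetime
from typing import Callable, Dict, List

# Hypothetical operator registry; names mirror the YAML sketch above.
OPERATORS: Dict[str, Callable[[datetime, List[str]], bool]] = {
    "day_of_week_matches": lambda now, values: now.strftime("%a").upper() in values,
    "day_of_month_matches": lambda now, values: str(now.day) in values,
}


def profiling_enabled(rules: List[dict], now: datetime) -> bool:
    # Assumption: profiling runs if any rule matches (OR semantics).
    return any(OPERATORS[r["operator"]](now, r["values"]) for r in rules)
```

Custom user-provided operators would then just be extra entries registered into the same mapping.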
@sgomezvillamor This is a stop-gap measure.

The specific case this tries to solve for now is very large deployments where getting the schema is very fast but profiling is very costly because tables are at the 100s-of-GB or TB scale. In those cases we see people either running the recipe once a week or maintaining 100s of duplicated recipes. A recent PR, #8317, was done to support those duplication cases, where someone wanted to maintain recipes in their code with and without profiling and set them on different schedules to work around this limitation.

At the end of the day we need improvements either in the scheduler that runs server-side or in the set of recipe operators that help with more cases on the ingestion side. But that is a larger lift. We could also improve connector-specific profiling to store the relevant state per ingestion recipe, but that involves a lot more effort.

We will definitely iterate on this based on the feedback and pain points we see from orgs.
Now that this is abstracted away into a class, we can easily add an `enabled_v2` with a new set of flags (or feed existing flags into new logic), and the method inside the class will take care of adding the functionality to all connectors across the board. So it becomes very easy to iterate and change the functionality without worrying about breaking anything.

Users can change `enabled` to `enabled_v2` in a recipe and the new logic will work seamlessly without major changes to recipes. So we can do what Harshal is saying if that solves use cases for orgs across the board.
But you may want to look at #8317, which I mentioned, as at least one org has found success using it to maintain multiple recipes in their code with just a small script to manage the recipes. Recipes-as-code, they called it. While not as elegant as something like Terraform, it does help with maintaining a large number of recipes with some scripting on top.
> The specific case this tries to solve for now is very large deployments where getting the schema is very fast but profiling is very costly because of tables being at the 100s-of-GB or TB scale. In those cases we are seeing people either running the recipe once a week or maintaining 100s of recipes as duplicates.

Yes, my team is one of those 😅

My current scenario: my org usually runs connectors on an hourly basis, while we run profiling on a weekly/monthly basis. With the current implementation in this PR we may set, e.g., `profile_date_of_month=1`, and profiling will be skipped on all days of the month except day 1. However, on day 1 profiling is executed 24 times. Not ideal at all.

Since the granularity of the profiling schedule is days (week day or month day), the current implementation is limited to schedules based on days, weeks, months... not hours. I was just pointing this out in case you missed it. Glad to hear that this is just the starting point of a promising feature and that scenarios such as mine will hopefully also be addressed in the future 💪 Thanks for the detailed responses, really appreciated.
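The 24-profiles-on-day-1 behavior described above is easy to see with a quick simulation. This is a hypothetical reimplementation of the day-granularity check for illustration, not the PR's exact code.

```python
from datetime import datetime, timedelta


def date_of_month_check(run_time: datetime, profile_date_of_month: int) -> bool:
    # Mirrors the day-granularity gating discussed in this thread:
    # the check only looks at the day number, not the hour.
    return run_time.day == profile_date_of_month


# An hourly ingestion schedule over July with profile_date_of_month=1:
runs = [datetime(2023, 7, 1) + timedelta(hours=h) for h in range(31 * 24)]
profiled = sum(date_of_month_check(t, 1) for t in runs)
print(profiled)  # every hourly run on day 1 passes the check: 24 profiling runs
```

All 24 hourly runs on day 1 profile, while the remaining 720 skip, which is exactly the over-execution the comment points out.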
This allows us to have a common interface that we can use across connectors. Currently it is only used to allow better control of profiling during ingestion, but we can use the same interface to control various aspects of ingestion.

This is needed for profiling right now because it is a costly operation and organisations want to run it at a lower cadence. Day of week or date of month is simple enough to implement in Python that we do not have to touch any source-specific config, and we can easily enable this on all sources, compared to making source-specific changes.

Doing it initially only for Elasticsearch for review. Once the approach is reviewed, I will add this to all sources where profiling is present.
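The gating described in this thread can be sketched roughly as follows. Field names are inferred from the discussion (weekday XOR monthday, `profile_date_of_month`) and the `today` parameter is added here only for testability; the actual merged code may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class OperationConfig:
    # Field names follow the PR discussion; the merged code may differ.
    lower_freq_profile_enabled: bool = False
    profile_day_of_week: Optional[int] = None    # 0 = Monday ... 6 = Sunday
    profile_date_of_month: Optional[int] = None  # 1-31


def is_profiling_enabled(
    operation_config: OperationConfig, today: Optional[datetime] = None
) -> bool:
    # `today` is injectable for testing; a real implementation would
    # presumably read the current date directly.
    if not operation_config.lower_freq_profile_enabled:
        return True  # no gating configured: profile on every run
    today = today or datetime.now(timezone.utc)
    if operation_config.profile_day_of_week is not None:
        return today.weekday() == operation_config.profile_day_of_week
    if operation_config.profile_date_of_month is not None:
        return today.day == operation_config.profile_date_of_month
    return False
```

Because the check lives behind one function on a shared config class, every connector that calls it picks up schedule changes for free, which is the common-interface point made above.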