
feat(ingest): allow lower freq profiling based on date of month/day of week #8489

Merged: 5 commits into master, Aug 4, 2023

Conversation

anshbansal (Collaborator)

This gives us a common interface that we can use across connectors. Currently it is used only to allow better control of ingestion profiling, but we can use the same interface to control other aspects of ingestion.

This is needed for profiling because it is a costly operation and organisations want to run it at a lower cadence. A day-of-week or date-of-month check is simple enough to implement in Python that no source-specific config changes are needed, so we can easily enable this for all sources rather than making per-source changes.

Doing this initially only for elasticsearch, for review. Once the approach is reviewed, it will be added to all sources where profiling is present.
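For context, a recipe using this option might look roughly like the following. The field names (`operation_config`, `lower_freq_profile_enabled`, `profile_day_of_week`, `profile_date_of_month`) follow the discussion and the diff snippet below; the exact recipe schema is an assumption here, not taken from the PR:

```yaml
source:
  type: elasticsearch
  config:
    profiling:
      enabled: true
      operation_config:
        lower_freq_profile_enabled: true
        profile_day_of_week: 6       # assumed 0 = Monday ... 6 = Sunday
        # profile_date_of_month: 1   # alternatively, gate on a day of the month
```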

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 24, 2023
@anshbansal anshbansal force-pushed the aseem-add-lower-freq-profiling branch from 97e24db to 8e8f25e Compare August 3, 2023 11:13
The review thread below refers to this snippet from the diff, introducing the new `is_profiling_enabled` helper:

```python
    return profile_date_of_month


def is_profiling_enabled(operation_config: OperationConfig) -> bool:
```
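For reviewers following along, here is a self-contained sketch of how a helper like `is_profiling_enabled` could behave. The `OperationConfig` field names and the fallthrough semantics are assumptions based on the discussion, not the exact code in the diff:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class OperationConfig:
    # Field names follow the PR discussion; the exact schema is an assumption.
    lower_freq_profile_enabled: bool = False
    profile_day_of_week: Optional[int] = None    # 0 = Monday ... 6 = Sunday
    profile_date_of_month: Optional[int] = None  # 1..31


def is_profiling_enabled(
    operation_config: OperationConfig, today: Optional[datetime] = None
) -> bool:
    if not operation_config.lower_freq_profile_enabled:
        return True  # no gating configured: profile on every run
    today = today or datetime.now(timezone.utc)
    if operation_config.profile_day_of_week is not None:
        return today.weekday() == operation_config.profile_day_of_week
    if operation_config.profile_date_of_month is not None:
        return today.day == operation_config.profile_date_of_month
    return False
```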
sgomezvillamor (Contributor)

This PR just got my attention because this feature makes total sense: I want to decouple the scheduling of profiling from the basic ingestion without having to duplicate recipes. So glad to see this!

About this is_profiling_enabled implementation, let me test my understanding of the code. If I run an ingestion hourly and I want profiling to be executed, e.g., once a month with profile_date_of_month=1, this will result in profiling being executed in every hourly run on day 1. Am I wrong? Unless my analysis is off, I would say this is not the desired behavior.

Solving this will likely require saving some state to determine whether profiling has already been executed.
Alternatively, the issue could be solved by giving the profiling schedule more precision than weekday XOR monthday: a full cron expression could be set and then checked to decide whether it is time to run profiling.

https://github.com/kiorky/croniter is a library supporting this case. It provides functionality to test whether a date matches a cron expression; rounding the date to, e.g., the hour may be enough and would avoid the need to save state.
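To illustrate the idea without pulling in the croniter dependency, here is a minimal, hypothetical matcher supporting only `*` and plain integers in cron fields; croniter itself handles full cron syntax and is what a real implementation should use. With a full expression such as `0 0 1 * *`, an hourly pipeline would profile only in the midnight run of day 1:

```python
from datetime import datetime


def matches_simple_cron(expr: str, when: datetime) -> bool:
    """Toy matcher for 'minute hour day-of-month month day-of-week'.

    Supports only '*' and single integers per field; a real deployment
    would use a library such as croniter instead of this sketch.
    """
    minute, hour, dom, month, dow = expr.split()
    checks = [
        (minute, when.minute),
        (hour, when.hour),
        (dom, when.day),
        (month, when.month),
        (dow, (when.weekday() + 1) % 7),  # cron convention: 0 = Sunday
    ]
    return all(f == "*" or int(f) == actual for f, actual in checks)
```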

Collaborator

We're thinking about this PR as a stop-gap solution to fix the pain point, but know that it's not going to handle all use cases well and is a bit bespoke. Eventually, we want to move towards something that looks more like this:

```yaml
profiling.enabled:
  - operator: day_of_week_matches
    values: ["SUN"]
  - operator: day_of_month_matches
    values: ["5"]
```

We'd have a broader set of operators + ability to use custom user-provided operators, which should afford us the flexibility that we need.

But definitely looking for thoughts and feedback on this, as well as other use cases that you're thinking about.
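As a sketch of how such an operator list might be evaluated (the registry shape and the OR-semantics across conditions are assumptions here, not a committed design):

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

# Hypothetical registry keyed by the operator names from the YAML sketch.
OPERATORS: Dict[str, Callable[[datetime, List[str]], bool]] = {
    "day_of_week_matches": lambda now, vals: now.strftime("%a").upper() in vals,
    "day_of_month_matches": lambda now, vals: str(now.day) in vals,
}


def profiling_enabled(conditions: List[Dict[str, Any]], now: datetime) -> bool:
    # Profiling runs if any configured condition matches this run's timestamp.
    return any(OPERATORS[c["operator"]](now, c["values"]) for c in conditions)
```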

anshbansal (Collaborator, Author)

@sgomezvillamor This is a stop gap measure.

The specific case this tries to solve for now is the very large deployments where fetching the schema is very fast but profiling is very costly, because tables are in the 100s of GBs or at TB scale. In those cases we see people either running the recipe once a week or maintaining 100s of near-duplicate recipes. A recent PR, #8317, supported those duplication cases, where someone wanted to maintain recipes in their code with and without profiling, on different schedules, to work around this limitation.

At the end of the day we need improvements either in the scheduler that runs server-side, or in recipe-side operators that handle more cases on the ingestion side. But that is a larger lift. We could also improve connector-specific profiling to store state per ingestion recipe, but that involves a lot more effort.

We will definitely iterate on this based on the feedback and pain points we see from orgs.

anshbansal (Collaborator, Author), Aug 4, 2023

Now that this is abstracted into a class, we can easily add an enabled_v2 flag (or route the existing flags into new logic), and the method inside the class will apply the new behaviour to all connectors across the board. So it becomes very easy to iterate on the functionality without worrying about breaking anything.

Users could change enabled to enabled_v2 in a recipe and the new logic would work seamlessly, without major recipe changes. So we can do what Harshal is suggesting if that solves use cases for orgs across the board.
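A hypothetical sketch of what such a versioned flag could look like; the field names and the v2 condition semantics are illustrative only, not code from this PR:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class ProfilingConfig:
    # "enabled" preserves today's boolean behaviour; "enabled_v2" is the
    # hypothetical successor carrying richer operator conditions.
    enabled: bool = True
    enabled_v2: Optional[List[Dict[str, Any]]] = None

    def should_profile(self, now: datetime) -> bool:
        if self.enabled_v2 is None:
            return self.enabled
        # v2 logic: any matching condition enables profiling for this run.
        for cond in self.enabled_v2:
            if cond["operator"] == "day_of_week_matches":
                if now.strftime("%a").upper() in cond["values"]:
                    return True
            elif cond["operator"] == "day_of_month_matches":
                if str(now.day) in cond["values"]:
                    return True
        return False
```

Because the dispatch lives in one method, every connector that consults this config picks up the new semantics without per-source changes.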

anshbansal (Collaborator, Author), Aug 4, 2023

But you may want to look at #8317, which I mentioned above: at least one org has found success using it to maintain multiple recipes in their code, with a small script to keep the recipes up to date.

Recipes-as-code, they called it. While not as elegant as something like Terraform, it does help with maintaining a large number of recipes with some scripting on top.

sgomezvillamor (Contributor)

> The specific case this tries to solve for now is the very large deployments where getting the schema is very fast but doing profiling is very costly because of the size of tables being in 100s of GBs or TB scale. In those cases we are seeing people either running the recipe once a week or maintaining 100s of recipes as duplicates.

Yes, my team is one of those 😅

My current scenario: my org usually runs connectors on an hourly basis, while we run profiling on a weekly/monthly basis. With the current implementation in this PR we might set, e.g., profile_date_of_month=1, and profiling will be skipped on every day of the month but day 1. However, on day 1 profiling executes 24 times. Not ideal at all.

Since the granularity of the profiling schedule is days (week day or month day), the current implementation is limited to schedules based on days, weeks, months... not hours. I was just pointing this out in case you missed it. Glad to hear that this is just the starting point of a promising feature; hopefully scenarios such as mine are also addressed in the future 💪 Thanks for the detailed responses, really appreciated.

@anshbansal anshbansal merged commit dac89fb into master Aug 4, 2023
43 checks passed
@anshbansal anshbansal deleted the aseem-add-lower-freq-profiling branch August 4, 2023 04:43
3 participants