feat(ingest): allow lower freq profiling based on date of month/day of week #8489
Conversation
Review thread on `metadata-ingestion/src/datahub/ingestion/source_config/operation_config.py`:

```python
    return profile_date_of_month

def is_profiling_enabled(operation_config: OperationConfig) -> bool:
```
This PR just got my attention because this feature makes total sense: I want to decouple the scheduling of profiling from the basic ingestion without having to duplicate the recipes. So glad to see this!

About this `is_profiling_enabled` implementation, let me test my understanding of the code. If I run ingestion on an hourly basis and I want profiling to be executed, e.g., once a month with `profile_date_of_month=1`, this will result in profiling being executed on every hourly run during day 1. Am I wrong? Unless my analysis is off, I would say this is not the desired behavior.

Solving this will likely require saving some state to determine whether profiling has already been executed.

Alternatively, the issue may also be solved by giving the profiling schedule more precision than weekday XOR monthday: accept a full cron expression and check whether it is time to run profiling.

There is a library supporting this case, https://github.com/kiorky/croniter. It provides functionality to test whether a date matches a cron expression; rounding the date to, e.g., the hour may be enough and mitigate the need to save state.
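To illustrate the idea (croniter itself provides `croniter.match` for this), here is a dependency-free sketch of a tiny cron matcher that supports only `*`, single numbers, and comma lists, with the run time rounded down to the hour as suggested. Real cron syntax (ranges, steps, names) is what the library would handle; this is just the concept.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    # Supports "*", single numbers, and comma-separated lists only;
    # full cron syntax would come from a library like croniter.
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a 5-field cron expression (minute hour dom month dow)
    against a datetime rounded down to the hour."""
    minute, hour, dom, month, dow = expr.split()
    when = when.replace(minute=0, second=0, microsecond=0)
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(dom, when.day)
        and _field_matches(month, when.month)
        and _field_matches(dow, when.isoweekday() % 7)  # cron convention: 0 = Sunday
    )
```

Because the time is rounded to the hour before matching, an hourly ingestion evaluates each cron slot at most once, which mitigates the need to save state.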
We're thinking about this PR as a stop-gap solution to fix the pain point, but know that it's not going to handle all use cases well and is a bit bespoke. Eventually, we want to move towards something that looks more like this:
```yaml
profiling.enabled:
  - operator: day_of_week_matches
    values: ["SUN"]
  - operator: day_of_month_matches
    values: ["5"]
```
We'd have a broader set of operators + ability to use custom user-provided operators, which should afford us the flexibility that we need.
But definitely looking for thoughts and feedback on this, as well as other use cases that you're thinking about.
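Evaluation under that operator-based design might look roughly like the sketch below. The operator names come from the YAML above; the OR semantics across entries are an assumption, since the thread does not specify how multiple entries combine.

```python
from datetime import datetime
from typing import Callable, Dict, List

# Hypothetical operator registry; names mirror the YAML sketch above.
OPERATORS: Dict[str, Callable[[datetime, List[str]], bool]] = {
    "day_of_week_matches": lambda now, values: now.strftime("%a").upper() in values,
    "day_of_month_matches": lambda now, values: str(now.day) in values,
}


def profiling_enabled(rules: List[dict], now: datetime) -> bool:
    # Assumption: profiling runs if any rule matches (OR semantics).
    return any(OPERATORS[r["operator"]](now, r["values"]) for r in rules)
```

Custom user-provided operators would then just be extra entries registered into the same mapping.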
@sgomezvillamor This is a stop-gap measure.

The specific case this tries to solve for now is very large deployments where getting the schema is very fast but profiling is very costly because tables are at the 100s-of-GB or TB scale. In those cases we see people either running the recipe once a week or maintaining 100s of duplicated recipes. A recent PR, #8317, was done to support those duplication cases, where someone wanted to maintain recipes in their code with and without profiling and set them on different schedules to work around this limitation.

At the end of the day we need improvements either in the scheduler that runs server-side or in the set of recipe operators that help with more cases on the ingestion side. But that is a larger lift. We could also improve connector-specific profiling to store the relevant state per ingestion recipe, but that involves a lot more effort.

We will definitely iterate on this based on the feedback and pain points we see from orgs.
Now that this is abstracted away into a class, we can easily add an `enabled_v2` with a new set of flags (or feed existing flags into new logic), and the method inside the class will take care of adding the functionality to all connectors across the board. So it becomes very easy to iterate and change the functionality without worrying about breaking anything.

Users can change `enabled` to `enabled_v2` in a recipe and the new logic will work seamlessly without major changes to recipes. So we can do what Harshal is saying if that solves use cases for orgs across the board.
But you may want to look at #8317, which I mentioned, as at least one org has found success using it to maintain multiple recipes in their code with just a small script to manage the recipes. Recipes-as-code, they called it. While not as elegant as something like Terraform, it does help with maintaining a large number of recipes with some scripting on top.
> The specific case this tries to solve for now is very large deployments where getting the schema is very fast but profiling is very costly because of tables being at the 100s-of-GB or TB scale. In those cases we are seeing people either running the recipe once a week or maintaining 100s of recipes as duplicates.

Yes, my team is one of those 😅

My current scenario: my org usually runs connectors on an hourly basis, while we run profiling on a weekly/monthly basis. With the current implementation in this PR we may set, e.g., `profile_date_of_month=1`, and profiling will be skipped on all days of the month except day 1. However, on day 1 profiling is executed 24 times. Not ideal at all.

Since the granularity of the profiling schedule is days (week day or month day), the current implementation is limited to schedules based on days, weeks, months... not hours. I was just pointing this out in case you missed it. Glad to hear that this is just the starting point of a promising feature and that scenarios such as mine will hopefully also be addressed in the future 💪 Thanks for the detailed responses, really appreciated.
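The 24-profiles-on-day-1 behavior described above is easy to see with a quick simulation. This is a hypothetical reimplementation of the day-granularity check for illustration, not the PR's exact code.

```python
from datetime import datetime, timedelta


def date_of_month_check(run_time: datetime, profile_date_of_month: int) -> bool:
    # Mirrors the day-granularity gating discussed in this thread:
    # the check only looks at the day number, not the hour.
    return run_time.day == profile_date_of_month


# An hourly ingestion schedule over July with profile_date_of_month=1:
runs = [datetime(2023, 7, 1) + timedelta(hours=h) for h in range(31 * 24)]
profiled = sum(date_of_month_check(t, 1) for t in runs)
print(profiled)  # every hourly run on day 1 passes the check: 24 profiling runs
```

All 24 hourly runs on day 1 profile, while the remaining 720 skip, which is exactly the over-execution the comment points out.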
This allows us to have a common interface that we can use across connectors. Currently it is only used to allow better control of profiling during ingestion, but we can use the same interface to control various aspects of ingestion.

This is needed for profiling right now because it is a costly operation and organisations want to run it at a lower cadence. Day of week or date of month is simple enough to implement in Python that we do not have to touch any source-specific config, and we can easily enable this on all sources, compared to making source-specific changes.

Doing it initially only for Elasticsearch for review. Once the approach is reviewed, I will add this to all sources where profiling is present.
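The gating described in this thread can be sketched roughly as follows. Field names are inferred from the discussion (weekday XOR monthday, `profile_date_of_month`) and the `today` parameter is added here only for testability; the actual merged code may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class OperationConfig:
    # Field names follow the PR discussion; the merged code may differ.
    lower_freq_profile_enabled: bool = False
    profile_day_of_week: Optional[int] = None    # 0 = Monday ... 6 = Sunday
    profile_date_of_month: Optional[int] = None  # 1-31


def is_profiling_enabled(
    operation_config: OperationConfig, today: Optional[datetime] = None
) -> bool:
    # `today` is injectable for testing; a real implementation would
    # presumably read the current date directly.
    if not operation_config.lower_freq_profile_enabled:
        return True  # no gating configured: profile on every run
    today = today or datetime.now(timezone.utc)
    if operation_config.profile_day_of_week is not None:
        return today.weekday() == operation_config.profile_day_of_week
    if operation_config.profile_date_of_month is not None:
        return today.day == operation_config.profile_date_of_month
    return False
```

Because the check lives behind one function on a shared config class, every connector that calls it picks up schedule changes for free, which is the common-interface point made above.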