-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support to cache profiles created via ProfileMapping
#1046
Conversation
✅ Deploy Preview for sunny-pastelito-5ecb04 ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
b9bb095
to
488395f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pankajastro , thanks a lot for advancing on this!
Some questions/comments on the overall feature:
- So we are consistent with other enabling cache mechanism (e.g. Speed up
LoadMode.DBT_LS
by caching dbt ls output in Airflow Variable #1014), please, rename the config toenable_cache_profile
- Why are we forcing users to defined the cache path, differently from Improve performance by 22-35% or more by caching partial parse artefact #904, where we dynamically define a path - and we give the possibility for users to override it with the
cache_dir
config, if they want to. We could perhaps allow users to force a specific profile path if they want to, by havingcache_dir_profile
- but it would be great if this was optional. - What does the
metadata.json
relate to? A dbt profile, an Airflow connection, aDbtDag
, or to aDbtTaskGroup
, to the Airflow deployment..? The name is very generic and the scope is unclear. How are we handling scenarios where concurrent tasks/DAGs would be writing at the same time to this file? - How sensitive is the identifier of the cache (sha256) to changes in the Airflow connection being used?
hashing connection is still a TODO item and we need to pass the actual value of a connection in the hashing input |
A JSON file that stores key-value pairs, where each key is a hash for a Profile Mapping object and the value is the corresponding path to the DBT profile YAML file. I'm using this file to check if we already have profile YAML for this profile object hash
I'm. ok to change the name.
I haven't considered the concurrent case, do we need locking here? |
We can make it optional, I wanted to keep the profile caching-related files in one place. Shall we keep the profile related files in |
3fad8a7
to
e9b1681
Compare
3fbc34b
to
0d78e01
Compare
0d78e01
to
bda2d18
Compare
Improved the below items after our first discussion and review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HI @pankajastro thanks for addressing all the feedback!
Once you are able to uncomment that assert in the test and rename the config, I believe we're good to go! 🎉
All the comments/feedback were addressed. Following up on:
I haven't considered the concurrent case, do we need locking here?
I believe once we removed the metadata.json
file, we're in a much safer position, since we reduce significantly concurrent writes to the same file. I suggest we go with the current solution and we review if users report any issues.
ProfileMapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for addressing all the feedback, @pankajastro !
New Features * Speed up ``LoadMode.DBT_LS`` by caching dbt ls output in Airflow Variable by @tatiana in #1014 * Support to cache profiles created via ``ProfileMapping`` by @pankajastro in #1046 * Support for running dbt tasks in AWS EKS in #944 by @VolkerSchiewe * Add Clickhouse profile mapping by @roadan and @pankajastro in #353 and #1016 * Add node config to TaskInstance Context by @linchun3 in #1044 Bug fixes * Support partial parsing when cache is disabled by @tatiana in #1070 * Fix disk permission error in restricted env by @pankajastro in #1051 * Add CSP header to iframe contents by @dwreeves in #1055 * Stop attaching log adaptors to root logger to reduce logging costs by @glebkrapivin in #1047 Enhancements * Support ``static_index.html`` docs by @dwreeves in #999 * Support deep linking dbt docs via Airflow UI by @dwreeves in #1038 * Add ability to specify host/port for Snowflake connection by @whummer in #1063 Docs * Fix rendering for env ``enable_cache_dbt_ls`` by @pankajastro in #1069 Others * Update documentation for DbtDocs generator by @arjunanan6 in #1043 * Use uv in CI by @dwreeves in #1013 * Cache hatch folder in the CI by @tatiana in #1056 * Change example DAGs to use ``example_conn`` as opposed to ``airflow_db`` by @tatiana in #1054 * Mark plugin integration tests as integration by @tatiana in #1057 * Ensure compliance with linting rule D300 by using triple quotes for docstrings by @pankajastro in #1049 * Pre-commit hook updates in #1039, #1050, #1064 * Remove duplicates in changelog by @jedcunningham in #1068
Add dbt profile caching mechanism. 1. Introduced env `enable_cache_profile` to enable or disable profile caching. This will be enabled only if global `enable_cache` is enabled. 2. Users can set the env `profile_cache_dir_name`. This will be the name of a sub-dir inside `cache_dir` where cached profiles will be stored. This is optional, and the default name is `profile` 3. Example Path for versioned profile: `{cache_dir}/{profile_cache_dir}/592906f650558ce1dadb75fcce84a2ec09e444441e6af6069f19204d59fe428b/profiles.yml` 4. Implemented profile mapping hashing: first, the profile is serialized using pickle. Then, the profile_name and target_name are appended before hashing the data using the SHA-256 algorithm **Perf test result:** In local dev env with command ``` AIRFLOW_HOME=`pwd` AIRFLOW_CONN_EXAMPLE_CONN="postgres://postgres:[email protected]:5432/postgres" AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.10-2.8:test-performance ``` NUM_MODELS=100 - TIME=167.45248413085938 (with profile cache enabled) - TIME=173.94845390319824 (with profile cache disabled) NUM_MODELS=200 - TIME=376.2585120201111 (with profile cache enabled) - TIME=418.14210200309753 (with profile cache disabled) Closes: astronomer#925 Closes: astronomer#647
New Features * Speed up ``LoadMode.DBT_LS`` by caching dbt ls output in Airflow Variable by @tatiana in astronomer#1014 * Support to cache profiles created via ``ProfileMapping`` by @pankajastro in astronomer#1046 * Support for running dbt tasks in AWS EKS in astronomer#944 by @VolkerSchiewe * Add Clickhouse profile mapping by @roadan and @pankajastro in astronomer#353 and astronomer#1016 * Add node config to TaskInstance Context by @linchun3 in astronomer#1044 Bug fixes * Support partial parsing when cache is disabled by @tatiana in astronomer#1070 * Fix disk permission error in restricted env by @pankajastro in astronomer#1051 * Add CSP header to iframe contents by @dwreeves in astronomer#1055 * Stop attaching log adaptors to root logger to reduce logging costs by @glebkrapivin in astronomer#1047 Enhancements * Support ``static_index.html`` docs by @dwreeves in astronomer#999 * Support deep linking dbt docs via Airflow UI by @dwreeves in astronomer#1038 * Add ability to specify host/port for Snowflake connection by @whummer in astronomer#1063 Docs * Fix rendering for env ``enable_cache_dbt_ls`` by @pankajastro in astronomer#1069 Others * Update documentation for DbtDocs generator by @arjunanan6 in astronomer#1043 * Use uv in CI by @dwreeves in astronomer#1013 * Cache hatch folder in the CI by @tatiana in astronomer#1056 * Change example DAGs to use ``example_conn`` as opposed to ``airflow_db`` by @tatiana in astronomer#1054 * Mark plugin integration tests as integration by @tatiana in astronomer#1057 * Ensure compliance with linting rule D300 by using triple quotes for docstrings by @pankajastro in astronomer#1049 * Pre-commit hook updates in astronomer#1039, astronomer#1050, astronomer#1064 * Remove duplicates in changelog by @jedcunningham in astronomer#1068
New Features * Speed up ``LoadMode.DBT_LS`` by caching dbt ls output in Airflow Variable by @tatiana in #1014 * Support to cache profiles created via ``ProfileMapping`` by @pankajastro in #1046 * Support for running dbt tasks in AWS EKS in #944 by @VolkerSchiewe * Add Clickhouse profile mapping by @roadan and @pankajastro in #353 and #1016 * Add node config to TaskInstance Context by @linchun3 in #1044 Bug fixes * Support partial parsing when cache is disabled by @tatiana in #1070 * Fix disk permission error in restricted env by @pankajastro in #1051 * Add CSP header to iframe contents by @dwreeves in #1055 * Stop attaching log adaptors to root logger to reduce logging costs by @glebkrapivin in #1047 Enhancements * Support ``static_index.html`` docs by @dwreeves in #999 * Support deep linking dbt docs via Airflow UI by @dwreeves in #1038 * Add ability to specify host/port for Snowflake connection by @whummer in #1063 Docs * Fix rendering for env ``enable_cache_dbt_ls`` by @pankajastro in #1069 Others * Update documentation for DbtDocs generator by @arjunanan6 in #1043 * Use uv in CI by @dwreeves in #1013 * Cache hatch folder in the CI by @tatiana in #1056 * Change example DAGs to use ``example_conn`` as opposed to ``airflow_db`` by @tatiana in #1054 * Mark plugin integration tests as integration by @tatiana in #1057 * Ensure compliance with linting rule D300 by using triple quotes for docstrings by @pankajastro in #1049 * Pre-commit hook updates in #1039, #1050, #1064 * Remove duplicates in changelog by @jedcunningham in #1068 (cherry picked from commit 18d2c90)
Description
Add DBT profile caching mechanism.
enable_cache_profile
to enable or disable profile caching. This will be enabled only if globalenable_cache
is enabled.profile_cache_dir_name
. This will be the name of a sub-dir insidecache_dir
where cached profiles will be stored. This is optional, and the default name isprofile
{cache_dir}/{profile_cache_dir}/592906f650558ce1dadb75fcce84a2ec09e444441e6af6069f19204d59fe428b/profiles.yml
Perf test result:
In local dev env with command
NUM_MODELS=100
NUM_MODELS=200
Related Issue(s)
Closes: #925
Closes: #647
Breaking Change?
No
Checklist