Support to cache profiles created via ProfileMapping #1046

Merged: 15 commits, Jun 27, 2024

Contributor

@pankajastro pankajastro commented Jun 17, 2024

Description

Add DBT profile caching mechanism.

  1. Introduced the env enable_cache_profile to enable or disable profile caching. It takes effect only if the global enable_cache is enabled.
  2. Users can set the env profile_cache_dir_name, the name of a subdirectory inside cache_dir where cached profiles are stored. This is optional, and the default name is profile.
  3. Example path for a versioned profile: {cache_dir}/{profile_cache_dir}/592906f650558ce1dadb75fcce84a2ec09e444441e6af6069f19204d59fe428b/profiles.yml
  4. Implemented profile mapping hashing: first, the profile is serialized using pickle. Then the profile_name and target_name are appended before the data is hashed with the SHA-256 algorithm.
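
The hashing step in item 4 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name profile_version and the plain dict standing in for a ProfileMapping are hypothetical, and the exact byte layout used by Cosmos may differ.

```python
import hashlib
import pickle


def profile_version(profile: dict, profile_name: str, target_name: str) -> str:
    """Return a SHA-256 hex digest identifying one profile configuration.

    `profile` is a hypothetical stand-in for the ProfileMapping contents.
    """
    # Step 1: serialize the profile with pickle.
    payload = pickle.dumps(profile)
    # Step 2: append the profile name and target name to the serialized bytes.
    payload += profile_name.encode("utf-8") + target_name.encode("utf-8")
    # Step 3: hash the combined bytes with SHA-256.
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    example = {"type": "postgres", "host": "localhost", "port": 5432}
    # The 64-character digest is what would name the versioned directory.
    print(profile_version(example, "default", "dev"))
```

Because the digest changes whenever the serialized profile, profile_name, or target_name changes, each distinct configuration would get its own directory under the profile cache.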

Perf test result:
Measured in a local dev environment with the command:

AIRFLOW_HOME=`pwd` AIRFLOW_CONN_EXAMPLE_CONN="postgres://postgres:[email protected]:5432/postgres"  AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.10-2.8:test-performance

NUM_MODELS=100

  • TIME=167.45248413085938 (with profile cache enabled)
  • TIME=173.94845390319824 (with profile cache disabled)

NUM_MODELS=200

  • TIME=376.2585120201111 (with profile cache enabled)
  • TIME=418.14210200309753 (with profile cache disabled)
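
To put the timings above in relative terms, the reduction in run time can be computed directly from the reported numbers. The helper name percent_less_time is only illustrative, and these are single runs, so the percentages are indicative rather than a benchmark.

```python
def percent_less_time(enabled: float, disabled: float) -> float:
    """Percentage reduction in run time when the profile cache is enabled."""
    return (disabled - enabled) / disabled * 100


# Timings in seconds, copied from the results reported above:
# NUM_MODELS -> (cache enabled, cache disabled)
results = {
    100: (167.45248413085938, 173.94845390319824),
    200: (376.2585120201111, 418.14210200309753),
}

for num_models, (enabled, disabled) in results.items():
    saved = percent_less_time(enabled, disabled)
    print(f"NUM_MODELS={num_models}: {saved:.1f}% less time with profile cache enabled")
```

This works out to roughly 3.7% for 100 models and 10.0% for 200 models.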

Related Issue(s)

Closes: #925
Closes: #647

Breaking Change?

No

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works


netlify bot commented Jun 17, 2024

Deploy Preview for sunny-pastelito-5ecb04 ready!

Name Link
🔨 Latest commit 865b760
🔍 Latest deploy log https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/667d69a74ecd530008ced7ec
😎 Deploy Preview https://deploy-preview-1046--sunny-pastelito-5ecb04.netlify.app

@pankajastro pankajastro marked this pull request as ready for review June 17, 2024 14:47
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 17, 2024
@dosubot dosubot bot added area:config Related to configuration, like YAML files, environment variables, or executer configuration area:profile Related to ProfileConfig, like Athena, BigQuery, Clickhouse, Spark, Trino, etc labels Jun 17, 2024
Collaborator

@tatiana tatiana left a comment

Hi @pankajastro, thanks a lot for advancing on this!

Some questions/comments on the overall feature:

  1. To stay consistent with the other cache-enabling settings (e.g. Speed up LoadMode.DBT_LS by caching dbt ls output in Airflow Variable #1014), please rename the config to enable_cache_profile.
  2. Why are we forcing users to define the cache path? This differs from Improve performance by 22-35% or more by caching partial parse artefact #904, where we define a path dynamically and let users override it with the cache_dir config if they want to. We could perhaps allow users to force a specific profile path via cache_dir_profile, but it would be great if this were optional.
  3. What does metadata.json relate to: a dbt profile, an Airflow connection, a DbtDag, a DbtTaskGroup, or the Airflow deployment? The name is very generic and the scope is unclear. How are we handling scenarios where concurrent tasks/DAGs write to this file at the same time?
  4. How sensitive is the identifier of the cache (sha256) to changes in the Airflow connection being used?

@pankajastro
Contributor Author

How sensitive is the identifier of the cache (sha256) to changes in the Airflow connection being used?

Hashing the connection is still a TODO item; we need to pass the actual values of the connection into the hashing input.

@pankajastro
Contributor Author

What does the metadata.json relate to? A dbt profile, an Airflow connection, a DbtDag, or to a DbtTaskGroup, to the Airflow deployment..?

It is a JSON file that stores key-value pairs: each key is the hash of a ProfileMapping object, and the value is the path to the corresponding dbt profile YAML file. I'm using this file to check whether we already have a profile YAML for a given profile object hash.

The name is very generic and the scope is unclear.

I'm OK with changing the name.

How are we handling scenarios where concurrent tasks/DAGs would be writing at the same time to this file?

I haven't considered the concurrent case. Do we need locking here?

@pankajastro
Contributor Author

Why are we forcing users to define the cache path, unlike #904, where we dynamically define a path and give users the possibility to override it with the cache_dir config if they want to? We could perhaps allow users to force a specific profile path, if they want to, by having cache_dir_profile, but it would be great if this were optional.

We can make it optional. I wanted to keep the profile caching-related files in one place. Shall we keep the profile-related files in cache_dir / profile?

@pankajastro
Contributor Author

pankajastro commented Jun 25, 2024

Improved the items below after our first discussion and review:

  • Added environment variable enable_profile_cache to toggle profile caching on or off.
  • Introduced environment variable profile_cache_dir, a directory created within cache_dir. The default name is 'profile', and it stores the versioned dbt profile.
  • Example path for a versioned profile: {cache_dir}/{profile_cache_dir}/version/profiles.yml
  • Implemented profile mapping hashing: first, the profile is serialized using pickle. Then the profile_name and target_name are appended before the data is hashed with the SHA-256 algorithm.
  • Removed metadata.json because I'm no longer dealing with the desired_profile_path variable.
  • Added a check so that enable_profile_cache takes effect only if global caching is enabled.
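
A rough sketch of how the versioned path and the global-cache check from the list above could fit together. All function names here are hypothetical, and render_yaml stands in for whatever produces the profile contents; the actual Cosmos implementation may be organized differently.

```python
from pathlib import Path
from typing import Callable


def profile_cache_active(enable_cache: bool, enable_profile_cache: bool) -> bool:
    """Profile caching is only active when global caching is also enabled."""
    return enable_cache and enable_profile_cache


def cached_profile_path(cache_dir: Path, version: str, profile_cache_dir: str = "profile") -> Path:
    """Build {cache_dir}/{profile_cache_dir}/{version}/profiles.yml."""
    return cache_dir / profile_cache_dir / version / "profiles.yml"


def get_or_write_profile(cache_dir: Path, version: str, render_yaml: Callable[[], str]) -> Path:
    """Reuse the cached profiles.yml for this version, writing it only if missing."""
    path = cached_profile_path(cache_dir, version)
    if not path.exists():
        # First time this version is seen: create its directory and write the file.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(render_yaml())
    return path
```

Since each version hash maps to its own directory, two different profile configurations never write to the same file, and no shared index such as metadata.json is needed to look a profile up.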

Collaborator

@tatiana tatiana left a comment

Hi @pankajastro, thanks for addressing all the feedback!

Once you are able to uncomment that assert in the test and rename the config, I believe we're good to go! 🎉

All the comments/feedback were addressed. Following up on:

I haven't considered the concurrent case, do we need locking here?

I believe that, now that we have removed the metadata.json file, we're in a much safer position, since we significantly reduce concurrent writes to the same file. I suggest we go with the current solution and review it if users report any issues.

@pankajastro pankajastro changed the title Add dbt profile caching mechanism Support to profile caching Jun 26, 2024
@tatiana tatiana changed the title Support to profile caching Support to cache profiles created via ProfileMapping Jun 27, 2024
Collaborator

@tatiana tatiana left a comment

Looks great, thanks for addressing all the feedback, @pankajastro !

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jun 27, 2024
@tatiana tatiana added this to the Cosmos 1.5.0 milestone Jun 27, 2024
@tatiana tatiana merged commit 2d2e7af into main Jun 27, 2024
61 checks passed
@tatiana tatiana deleted the cache-profile branch June 27, 2024 13:39
@pankajkoti pankajkoti mentioned this pull request Jun 27, 2024
tatiana pushed a commit that referenced this pull request Jun 27, 2024
New Features

* Speed up ``LoadMode.DBT_LS`` by caching dbt ls output in Airflow
Variable by @tatiana in #1014
* Support to cache profiles created via ``ProfileMapping`` by
@pankajastro in #1046
* Support for running dbt tasks in AWS EKS in #944 by @VolkerSchiewe
* Add Clickhouse profile mapping by @roadan and @pankajastro in #353 and
#1016
* Add node config to TaskInstance Context by @linchun3 in #1044

Bug fixes

* Support partial parsing when cache is disabled by @tatiana in #1070
* Fix disk permission error in restricted env by @pankajastro in #1051
* Add CSP header to iframe contents by @dwreeves in #1055
* Stop attaching log adaptors to root logger to reduce logging costs by
@glebkrapivin in #1047

Enhancements

* Support ``static_index.html`` docs by @dwreeves in #999
* Support deep linking dbt docs via Airflow UI by @dwreeves in #1038
* Add ability to specify host/port for Snowflake connection by @whummer
in #1063

Docs

* Fix rendering for env ``enable_cache_dbt_ls`` by @pankajastro in #1069

Others

* Update documentation for DbtDocs generator by @arjunanan6 in #1043
* Use uv in CI by @dwreeves in #1013
* Cache hatch folder in the CI by @tatiana in #1056
* Change example DAGs to use ``example_conn`` as opposed to
``airflow_db`` by @tatiana in #1054
* Mark plugin integration tests as integration by @tatiana in #1057
* Ensure compliance with linting rule D300 by using triple quotes for
docstrings by @pankajastro in #1049
* Pre-commit hook updates in #1039, #1050, #1064
* Remove duplicates in changelog by @jedcunningham in #1068
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
arojasb3 pushed a commit to arojasb3/astronomer-cosmos that referenced this pull request Jul 14, 2024
tatiana pushed a commit that referenced this pull request Jul 17, 2024

(cherry picked from commit 18d2c90)
Labels
area:config Related to configuration, like YAML files, environment variables, or executer configuration
area:profile Related to ProfileConfig, like Athena, BigQuery, Clickhouse, Spark, Trino, etc
lgtm This PR has been approved by a maintainer
size:L This PR changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Cache profiles.yml file when using Cosmos ProfileMapping
Reuse connections accross entire Dag
3 participants