
Add support for loading dbt project from cloud store using Airflow Object Store #1148

Closed
wants to merge 21 commits

Conversation


@CorsettiS CorsettiS commented Aug 9, 2024

Description

This PR is based on #1109 and allows users to load entire dbt projects stored with a cloud provider. That way, it becomes possible to fully decouple dbt and Airflow when using Cosmos.

As of now, the company I currently work at does not keep dbt and Airflow in the same repo, but by allowing Cosmos to fetch entire dbt projects from a cloud provider we can work around that easily. dbt projects are usually not very heavy, so performance should not be drastically impacted (it was not in the local tests I ran).
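Below is a minimal sketch of the intended usage, assuming (as this PR proposes) that ProjectConfig accepts object-store URLs resolved through Airflow's Object Storage API; the bucket, profile, and file paths are hypothetical:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

# Hypothetical object-store locations; assumes (per this PR) that Cosmos can
# resolve these URLs via Airflow's Object Storage API (airflow.io).
project_config = ProjectConfig(
    dbt_project_path="s3://my-dbt-bucket/my_dbt_project",
    manifest_path="s3://my-dbt-bucket/my_dbt_project/target/manifest.json",
)

profile_config = ProfileConfig(
    profile_name="my_profile",  # hypothetical profile
    target_name="dev",
    profiles_yml_filepath="/usr/local/airflow/profiles.yml",  # hypothetical path
)

dag = DbtDag(
    dag_id="dbt_from_object_store",
    project_config=project_config,
    profile_config=profile_config,
    start_date=datetime(2024, 8, 1),
)
```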

It would be good if this could be included in release 1.6.0 (#1080).

Breaking Change?

No

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works


netlify bot commented Aug 9, 2024

Deploy Preview for sunny-pastelito-5ecb04 canceled.

🔨 Latest commit: 8cde3a8
🔍 Latest deploy log: https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66b66745ac04080008962bb5


netlify bot commented Aug 9, 2024

Deploy Preview for sunny-pastelito-5ecb04 ready!

🔨 Latest commit: 897090e
🔍 Latest deploy log: https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66bc95a4fa686b0008ad33fe
😎 Deploy Preview: https://deploy-preview-1148--sunny-pastelito-5ecb04.netlify.app

Comment on lines -227 to -231
"dbt_project.yml": Path(project_yml_path) if project_yml_path else None,
"models directory ": Path(self.models_path) if self.models_path else None,
@CorsettiS CorsettiS (Contributor, Author) commented Aug 9, 2024

Honestly speaking, I could not understand this even after reading the comment. Both self.dbt_project_path and self.models_path are already Path objects, as specified in the init, so I just reverted this to its previous state.

@CorsettiS CorsettiS marked this pull request as ready for review August 10, 2024 12:17
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. area:config Related to configuration, like YAML files, environment variables, or executer configuration labels Aug 10, 2024
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Aug 10, 2024
@@ -36,12 +36,8 @@ def test_init_with_manifest_path_and_project_path_succeeds():
project_name in this case should be based on dbt_project_path
"""
project_config = ProjectConfig(dbt_project_path="/tmp/some-path", manifest_path="target/manifest.json")
if AIRFLOW_IO_AVAILABLE:
@CorsettiS CorsettiS (Contributor, Author) commented:

This test was actually wrong, since AIRFLOW_IO_AVAILABLE just means we are using Airflow >= 2.8.0, not that we are using object storage.


@CorsettiS CorsettiS (Contributor, Author) commented:

@tatiana Indeed; using Airflow 2.8 does not necessarily mean we are using object storage, which was the assumption of this test, so I just fixed it.
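For context, AIRFLOW_IO_AVAILABLE is just a version gate, roughly along these lines (a sketch; the exact definition lives in Cosmos' constants module):

```python
# AIRFLOW_IO_AVAILABLE only signals that the airflow.io Object Storage API
# exists (Airflow >= 2.8.0); it says nothing about whether the dbt project
# is actually loaded from object storage.
from airflow import __version__ as airflow_version
from packaging.version import Version

AIRFLOW_IO_AVAILABLE = Version(airflow_version) >= Version("2.8.0")
```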


codecov bot commented Aug 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.53%. Comparing base (4886823) to head (eb3ec85).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1148   +/-   ##
=======================================
  Coverage   96.53%   96.53%           
=======================================
  Files          64       64           
  Lines        3374     3376    +2     
=======================================
+ Hits         3257     3259    +2     
  Misses        117      117           

☔ View full report in Codecov by Sentry.

@tatiana tatiana (Collaborator) left a comment

@CorsettiS This is a very promising change, thank you!

Some questions:

  1. How big are the dbt projects you've tested this feature with?
  2. I'm slightly concerned about performance, since we'd be running this during DAG processing/parsing time, which happens very regularly, usually in the scheduler, depending on the customer deployment. This may not scale for bigger dbt projects or for situations where users have many Cosmos DbtDags. Did you run any performance tests? Could you share some numbers and how you validated this feature?
  3. Following (2), by syncing the whole dbt project from the object store, we would potentially not benefit from the caching introduced in Cosmos 1.5 in PR "Speed up LoadMode.DBT_LS by caching dbt ls output in Airflow Variable" (#1014). Do you have any proposals for how we could avoid performing this action every time?
  4. Would using LoadMethod.MANIFEST + ExecutionMode.KUBERNETES be an option for your company? This should allow the decoupling you are aiming for. Or is there a particular reason why you'd need LoadMethod.DBT_LS and LoadMethod.LOCAL?

It seems unlikely we'll be able to release this in 1.6, given the number of things we're still trying to address before the release - but let's see!

@tatiana tatiana added this to the Cosmos 1.7.0 milestone Aug 14, 2024
@CorsettiS CorsettiS (Contributor, Author) commented:

1. I have been dealing with a project of roughly 200 models. Our repo does contain heavy files (CI/CD scripts, JPEG files, etc.) that would hurt performance by increasing the time Airflow needs to fetch them, so we upload only the "main" dbt files to S3 to avoid under-performing.

2. About performance, I ran some tests using my private dbt repo in S3 (size: 7 MB); for reference, my internet speed is 18 MB/s. I measured the runtime of the command airflow dags reserialize rather than the DAG build time reported by Cosmos, since I was unsure whether the fetching happens inside or outside the DAG construction process and decided to be on the safe side. I ran the command 10 times and computed the average and standard deviation (see the timing sketch at the end of this comment).

In the first scenario, I loaded both the project and the manifest from a local path and rendered the DbtDag with LoadMode.DBT_MANIFEST:

avg time: 2605 milliseconds
std dev: 98 milliseconds

In the second scenario, both the project and the manifest are fetched from S3 and the project is parsed with LoadMode.DBT_MANIFEST:

avg time: 3975 milliseconds
std dev: 209 milliseconds

So it took roughly 52% longer to parse a dbt DAG that is fully stored with a cloud provider. Of course, performance would degrade for bigger repos, or improve with a faster internet connection. Personally, I do not find this a huge increase considering we can benefit from all the features developed for a local dbt project.

3. I confess I have not tried this feature in that setup, but when the dbt project is loaded with Object Storage, Airflow treats it as a local file path, so I would assume nothing would break, since the cloud path essentially behaves like a "local path" (see the sketch at the end of this comment).

4. It is certainly possible, but personally I find it more appealing to have the option of fetching the project from a cloud provider. Of course, the user should be aware that the scheduler will have to work extra hard for this to happen, but I believe it is beneficial to offer as many alternatives as possible.
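For reference, the timing methodology described in (2) can be reproduced with a loop along these lines (an illustrative sketch; only the command and the run count come from the description above):

```python
# Run `airflow dags reserialize` 10 times and report the mean and standard
# deviation of the wall-clock runtime, in milliseconds.
import statistics
import subprocess
import time

samples_ms = []
for _ in range(10):
    start = time.perf_counter()
    subprocess.run(["airflow", "dags", "reserialize"], check=True)
    samples_ms.append((time.perf_counter() - start) * 1000)

print(f"avg time: {statistics.mean(samples_ms):.0f} ms")
print(f"std dev: {statistics.stdev(samples_ms):.0f} ms")
```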
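On (3), the assumption that a cloud path behaves like a local path comes from Airflow's ObjectStoragePath exposing a pathlib-like interface; a minimal sketch (the bucket and connection id are hypothetical):

```python
# ObjectStoragePath mimics pathlib.Path, so code written against local Path
# objects can often read from an object store unchanged.
from airflow.io.path import ObjectStoragePath

project_dir = ObjectStoragePath("s3://my-dbt-bucket/my_dbt_project", conn_id="aws_default")
print((project_dir / "dbt_project.yml").read_text())
```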

@CorsettiS CorsettiS (Contributor, Author) commented:

I have decided to close this PR for now because I found a severe flaw while extending its functionality to task execution: each individual task would need to list and fetch all artifacts from the repo in S3, which would balloon costs in an unexpected way. I will work on it on the side, and once I have a good alternative I will re-create the PR.

@CorsettiS CorsettiS closed this Aug 14, 2024
@tatiana tatiana (Collaborator) commented Aug 15, 2024

Thank you very much, @CorsettiS, for the additional information and context! We're looking forward to seeing your follow-up PRs.
