Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update airflow resources and split out dbt tests and alerts #433

Merged
merged 8 commits into from
Jul 18, 2024

Conversation

chowbao
Copy link
Contributor

@chowbao chowbao commented Jul 16, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the README with the added features, breaking changes, new instructions on how to use the repository.

What

Reducing airflow resources to reduce costs and splitting out tests from the elementary alert DAG

Why

  • Airflow resources were originally set for using captive core. Now that we have moved to CDP and observed the actual resources needed. We can greatly reduce the resources requested
  • The elementary alert DAG has a dependency on a dbt test. If this test fails the elementary alerts won't actually fire. Separating them out should avoid this

Known limitations

N/A

@chowbao chowbao requested a review from a team as a code owner July 16, 2024 15:16
},
"state": {
"stellar-etl": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new stellar-etland dbt resources. This will make it easy to adjust resources without adjusting resources for every dag/task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this could have been added to the existing dbt eho/state DAG. I opted to separate it out because it is purely for running anomaly tests/other tests which felt different enough to warrant a standalone DAG


# DAG task graph
start_tests >> singular_tests >> singular_tests_elementary_alerts
Copy link
Contributor

@amishas157 amishas157 Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. But curious why were we using empty operator?

Also, different DAG is okay, but we can create independent tasks in same DAG as well, correct?
Explained in #433 (comment)

elementary_alerts
singular_tests_elementary_alerts

Comment on lines +293 to +295
"cpu": "0.5",
"ephemeral-storage": "1Gi",
"memory": "5Gi"
"memory": "1Gi"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own knowledge, are the historical resource usage metrics tracked somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it is. In theory you could find it in GCP logs

Copy link
Contributor

@amishas157 amishas157 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Left some q inline for my understanding

@@ -18,7 +18,7 @@
default_args=get_default_dag_args(),
start_date=datetime(2024, 6, 25, 0, 0),
description="This DAG runs dbt tests and Elementary alerts at a half-hourly cadence",
schedule="*/15,*/45 * * * *", # Runs every 15th minute and every 45th minute
schedule="15,45 * * * *", # Runs every 15th minute and every 45th minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're running at the 15th and 45th minutes instead of */30 ****?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So the other dbt DAGs run on */30 with task run times around 10 mins. If we were to run the quality alerts at the same time we would have a slightly larger gap in time before we are notified. Running at a 15 min offset to those runs hopefully alerts faster

Copy link
Contributor Author

@chowbao chowbao Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I don't see the harm in running every 15 mins like it has been lol

Edit: Updated to just run every 15 mins

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're no longer separating resources out between dbt, default and stellaretl, is it worth us storing configs beyond the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth it. I was going to delete all the resources and just make everything run with the default settings but I have a theory that dbt and stellar-etl both have different resource requirements. So in the case we do need to bump up/down resources for one service we wouldn't necessarily OOM or over-allocate the other service

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 sounds good to me. Your theory sounds plausible

@chowbao chowbao merged commit ceec401 into master Jul 18, 2024
4 checks passed
@amishas157 amishas157 deleted the update-airflow-resources branch July 26, 2024 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants