| title-prefix | pagetitle | author | author-meta | date | date-meta | keywords |
|---|---|---|---|---|---|---|
| Six Feet Up | Too Big for DAG Factories | Calvin Hendryx-Parker, CTO, Six Feet Up | | EuroPython 2023 | 2023 | |
Too Big for DAG Factories? {.semi-filtered data-background-image="images/karsten-wurth-lsJ9jHKIqHg-unsplash.jpg"}
::: notes Brief intro to how this all started
Pointer to talk from last year :::
Data Pipeline Modernization at Scale
https://github.com/calvinhp/2023_PythonWebConf_TooBigforDAGFactories
Why should I care about this talk? {.r-fit-text .semi-filtered data-background-image="images/matt-artz-2dCdOoYDjOQ-unsplash.jpg"}
- Do you think Airflow might be the wrong tool?
- Are you worried that Airflow can’t scale to your needs?
$ python -m this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Brief introduction to Airflow {.semi-filtered data-background-image="images/matt-artz-J2R6iK8A6mQ-unsplash.jpg"}
- DAGs
- Operators
- Sensors
- Connections
- Hooks
- Scheduler
::: notes A DAG is a series of tasks made of Operators, Sensors, or TaskFlows
Connections and Hooks provide easy access to external systems and APIs :::
Demo of the problem {.semi-filtered data-background-image="images/vivek-kumar-C5HZDAVQwuQ-unsplash.jpg"}
- Create a Dynamic DAG that takes too long to generate
- Airflow can slow down with lots of DAGs
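A rough way to see the problem without a running Airflow install is to model the factory loop with plain Python. This is only a sketch: `fetch_config`, `build_all`, and `parse_seconds` are hypothetical names, and the dictionaries stand in for real DAG objects. The point is that any per-DAG cost paid at the top level of a DAG file is multiplied by the number of DAGs, and the total has to fit inside Airflow's import time limit (the `dagbag_import_timeout` setting, 30 seconds by default).

```python
import time


def fetch_config(dag_id: str, delay: float = 0.001) -> dict:
    """Hypothetical stand-in for a lookup performed while the DAG file is parsed."""
    time.sleep(delay)  # simulates network latency or a slow parser
    return {"dag_id": dag_id, "schedule": "@daily"}


def build_all(n: int, delay: float = 0.001) -> dict:
    """Model of a dynamic DAG factory: one top-level loop producing n DAGs."""
    dags = {}
    for i in range(n):
        cfg = fetch_config(f"pipeline_{i}", delay)
        # A real factory would construct an Airflow DAG here; a dict is enough
        # to show how the cost accumulates.
        dags[cfg["dag_id"]] = cfg
    return dags


def parse_seconds(n: int, delay: float = 0.001) -> float:
    """Wall-clock time for one simulated parse of the factory file."""
    start = time.perf_counter()
    build_all(n, delay)
    return time.perf_counter() - start
```

With a 10 ms lookup per DAG, 3,000 DAGs would already need about 30 seconds per parse under this model, and the scheduler repeats that parse on every scan.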
Root cause of the problem {.semi-filtered data-background-image="images/venti-views-8RBASNzrrXA-unsplash.jpg"}
::: notes
1. The dangers of runtime dependencies.
2. Dynamic DAG generation can start to take longer than the DAG scanning cycle
    1. By sheer numbers that will be generated (just a lot, 1000+)
        1. 1 file that was generating 1000s of dags dynamically and failing the time limit
    2. By making outside API calls during generation (calling App Config for config data)
        1. remove any dag import time externalities
    3. By inefficient code (json5 1000x slowdown)
        1. 1000s of files each generating 1 dag dynamically and failing the time limit
:::
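One way to remove import-time externalities is to resolve the configuration once in a build step and leave only a local file read for parse time. The sketch below assumes that approach; `slow_remote_lookup`, `precompute`, and `load_at_parse_time` are hypothetical names, and the lookup returns made-up values in place of a real config service call. Writing plain JSON means the parse-time side can use the standard library `json` module.

```python
import json
from pathlib import Path


def slow_remote_lookup(name: str) -> dict:
    """Hypothetical stand-in for an external config service call."""
    return {"dag_id": name, "schedule": "@hourly", "retries": 2}


def precompute(names, out_path: Path) -> None:
    """Build step: resolve every config once and freeze the result to disk."""
    configs = {name: slow_remote_lookup(name) for name in names}
    out_path.write_text(json.dumps(configs, indent=2, sort_keys=True))


def load_at_parse_time(path: Path) -> dict:
    """DAG-file side: a local read only, with no network call on import."""
    return json.loads(path.read_text())
```

The build step can run in CI or at deploy time, so a slow or unavailable config service delays a deployment instead of stalling DAG parsing.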
- Start with dynamic DAG factories
- As you scale up, you will need to generate DAGs statically
- Precompute the configurations and data needed during the build
- Optimize your tasks to run in parallel
- Avoid being limited by the number of slots by using deferrable operators and async triggers
- Another Talk? "Too Big for the Scheduler?"
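Static generation can be sketched as a small build script that renders one literal DAG file per configuration, so the scheduler only ever parses plain Python with nothing to compute. This is a minimal illustration, not the project's actual generator: `DAG_TEMPLATE` and `render_dag_files` are hypothetical names, and the generated code assumes an Airflow 2.x release (2.4 or later) where `schedule` and `EmptyOperator` are available. The generator itself needs only the standard library.

```python
from pathlib import Path
from string import Template

# Source of each generated DAG file. Values are substituted as Python
# literals, so the result contains no loops and no lookups at import time.
DAG_TEMPLATE = Template('''\
# Generated file: do not edit by hand.
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id=$dag_id,
    schedule=$schedule,
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    EmptyOperator(task_id="start")
''')


def render_dag_files(configs: dict, out_dir: Path) -> list:
    """Write one static DAG file per config entry and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for dag_id, cfg in sorted(configs.items()):
        source = DAG_TEMPLATE.substitute(
            dag_id=repr(dag_id),
            schedule=repr(cfg["schedule"]),
        )
        # Syntax check only; this does not import Airflow.
        compile(source, f"{dag_id}.py", "exec")
        path = out_dir / f"{dag_id}.py"
        path.write_text(source)
        written.append(path)
    return written
```

The `configs` argument can be the precomputed configuration from the build step, which keeps both the lookup cost and the generation cost out of the scheduler's parse loop.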
Wrap up and Questions {.semi-filtered data-background-image="images/justin-casey-7B0D1zO3PoQ-unsplash.jpg"}
- Airflow Docs
- Six Feet Up Blog Post -- Too Big for DAG Factories?