**Support using `DatasetAlias` and fix orphaning unreferenced dataset (#1217)**

**Context**

Cosmos versions between 1.1 and 1.6 supported automatically emitting Airflow Datasets when using `ExecutionMode.LOCAL`. Although the datasets generated by these versions of Cosmos could be used for dataset-aware scheduling, the implementation had long-standing issues, as described in #522. The main problems were:

* Orphaning unreferenced datasets
* Not displaying dataset inlets/outlets in the Airflow UI

These issues were caused by Cosmos defining task outlets and inlets during Airflow task execution, a feature only partially supported before Airflow 2.10: apache/airflow#34206.

**Solution**

Airflow 2.10 introduced the concept of `DatasetAlias`, as described in the [official docs](https://airflow.apache.org/blog/airflow-2.10.0/#dynamic-dataset-scheduling-through-datasetalias), so operators can dynamically define inlets and outlets during task execution. This PR uses Airflow's `DatasetAlias` when possible (Airflow 2.10 or above) and does three things:

1. Adds a `DatasetAlias` to every `LocalOperator`/`VirtualenvOperator` subclass
2. Dynamically adds `Dataset` outlets during execution of the `LocalOperator` subclasses, associating each alias with the desired `Dataset` instance
3. Exposes a function for users to programmatically retrieve Cosmos' `DatasetAlias` names

**Caveats**

* Only works for Airflow 2.10 and above

  This feature relies on `DatasetAlias`, which is only available in Airflow 2.10 and above. If users run earlier versions of Airflow, Cosmos behaves as it did before, and the issues described in this task are not solved.

* Unable to leverage `DatasetAlias` in `airflow dags test`

  Although the feature described in this PR works well when scheduling DAGs, triggering them via the UI, or using `airflow dags trigger`, it does not work when users attempt to use `airflow dags test` or `dag.test()`. These commands fail with an `sqlalchemy.orm.exc.FlushError`. This is a known issue in Airflow 2.10.0 and 2.10.1 when declaring dataset aliases, as described in apache/airflow#42495.

  To mitigate this second problem, we've introduced a new Airflow configuration, `AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS`, that allows users to disable dataset aliases when running Cosmos. We recommend that users who face the `sqlalchemy.orm.exc.FlushError` in their tests set this configuration to `False` only while running tests, until the issue is solved in Airflow. When this configuration is set to `False`, Cosmos behaves as it did before the `DatasetAlias` feature was introduced.

**How this feature was validated**

TODO

**Related tickets**

Closes: #522
Closes: #1119

**Pending**

* Add docs
* Update PR description
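As a sketch of the workaround described above: the environment variable follows Airflow's standard `AIRFLOW__<SECTION>__<KEY>` configuration convention, so it could be set either in the environment or (assuming the conventional mapping to a `[cosmos]` section) in `airflow.cfg`:

```shell
# Disable Cosmos' DatasetAlias usage, e.g. while running `airflow dags test`,
# to avoid the sqlalchemy.orm.exc.FlushError mentioned above.
export AIRFLOW__COSMOS__ENABLE_DATASET_ALIAS=False

# Equivalently, assuming the standard env-var-to-config mapping, in airflow.cfg:
# [cosmos]
# enable_dataset_alias = False
```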
Showing 8 changed files with 255 additions and 20 deletions.
```python
from __future__ import annotations

from airflow import DAG
from airflow.utils.task_group import TaskGroup


def get_dataset_alias_name(dag: DAG | None, task_group: TaskGroup | None, task_id: str) -> str:
    """
    Given the Airflow DAG, Airflow TaskGroup and the Airflow Task ID, return the name of the
    Airflow DatasetAlias associated to that task.
    """
    dag_id = None
    task_group_id = None

    if task_group:
        if task_group.dag_id is not None:
            dag_id = task_group.dag_id
        if task_group.group_id is not None:
            task_group_id = task_group.group_id
            task_group_id = task_group_id.replace(".", "__")
    elif dag:
        dag_id = dag.dag_id

    identifiers_list = []

    if dag_id:
        identifiers_list.append(dag_id)
    if task_group_id:
        identifiers_list.append(task_group_id)

    identifiers_list.append(task_id)

    return "__".join(identifiers_list)
```
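The alias naming scheme above can be illustrated without a running Airflow installation by duplicating the same logic against minimal stand-in objects (`FakeDag` and `FakeTaskGroup` below are hypothetical stubs exposing only the attributes the helper reads, not the real Airflow classes):

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical minimal stand-ins for the Airflow objects; only the
# attributes read by get_dataset_alias_name are modelled.
@dataclass
class FakeDag:
    dag_id: str


@dataclass
class FakeTaskGroup:
    dag_id: Optional[str]
    group_id: Optional[str]


def get_dataset_alias_name(dag, task_group, task_id: str) -> str:
    # Same composition logic as the helper above, repeated here so this
    # sketch is self-contained and runnable without Airflow installed.
    dag_id = None
    task_group_id = None

    if task_group:
        if task_group.dag_id is not None:
            dag_id = task_group.dag_id
        if task_group.group_id is not None:
            # Dots in nested task group ids are normalised to "__"
            task_group_id = task_group.group_id.replace(".", "__")
    elif dag:
        dag_id = dag.dag_id

    identifiers_list = []
    if dag_id:
        identifiers_list.append(dag_id)
    if task_group_id:
        identifiers_list.append(task_group_id)
    identifiers_list.append(task_id)
    return "__".join(identifiers_list)


# A task directly in a DAG composes as <dag_id>__<task_id>
print(get_dataset_alias_name(FakeDag("my_dag"), None, "run_model"))
# my_dag__run_model

# A task inside a nested task group: the group id's dots become "__"
tg = FakeTaskGroup(dag_id="my_dag", group_id="models.staging")
print(get_dataset_alias_name(None, tg, "run_model"))
# my_dag__models__staging__run_model
```

The double-underscore separator keeps the alias name a single flat string while still letting the DAG, task group, and task identifiers be distinguished.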