
do not push stale update to related DagRun on TI update after task execution #45348

Merged

Conversation

Contributor

@mobuchowski commented Jan 2, 2025

This is a re-targeted PR #44547, for the Airflow 2.x line only, as the fix loses relevance for Airflow 3.

When a TI is marked as failed via the UI and later fails on its own, the scheduler changes the related DagRun's state to failed twice.

(some extra log lines added for clarity)

[2024-11-28T14:48:11.276+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 75
[2024-11-28T14:48:11.284+0000] {scheduler_job_runner.py:1666} ERROR - Scheduling dag run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> state running
[2024-11-28T14:48:11.287+0000] {dagrun.py:906} ERROR - Marking run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> failed - from state running at 2024-11-28 14:48:11.286513+00:00
[2024-11-28T14:48:11.288+0000] {dagrun.py:1101} ERROR - NOTIFYING STATE CHANGED TO failed
[2024-11-28T14:48:11.289+0000] {dagrun.py:989} INFO - DagRun Finished: dag_id=wait_to_fail, logical_date=2024-11-28 14:48:04.731025+00:00, run_id=manual__2024-11-28T14:48:04.731025+00:00, run_start_date=2024-11-28 14:48:05.186397+00:00, run_end_date=2024-11-28 14:48:11.288084+00:00, run_duration=6.101687, state=failed, external_trigger=True, run_type=manual, data_interval_start=2024-11-28 14:48:04.731025+00:00, data_interval_end=2024-11-28 14:48:04.731025+00:00, dag_version_name=wait_to_fail-1
[2024-11-28T14:48:11.299+0000] {client.py:110} INFO - OpenLineageClient will use `http` transport
[2024-11-28T14:48:12.307+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 76
[2024-11-28T14:48:13.344+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 77
[2024-11-28T14:48:14.382+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 78
[2024-11-28T14:48:15.274+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 79
[2024-11-28T14:48:16.309+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 80
[2024-11-28T14:48:17.345+0000] {scheduler_job_runner.py:1092} INFO - another scheduler loop 81
[2024-11-28T14:48:17.361+0000] {scheduler_job_runner.py:1666} ERROR - Scheduling dag run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> state running
[2024-11-28T14:48:17.366+0000] {dagrun.py:906} ERROR - Marking run <DagRun wait_to_fail @ 2024-11-28 14:48:04.731025+00:00: manual__2024-11-28T14:48:04.731025+00:00, state:running, queued_at: 2024-11-28 14:48:04.739184+00:00. externally triggered: True> failed - from state running at 2024-11-28 14:48:17.364879+00:00
[2024-11-28T14:48:17.366+0000] {dagrun.py:1101} ERROR - NOTIFYING STATE CHANGED TO failed
[2024-11-28T14:48:17.367+0000] {dagrun.py:989} INFO - DagRun Finished: dag_id=wait_to_fail, logical_date=2024-11-28 14:48:04.731025+00:00, run_id=manual__2024-11-28T14:48:04.731025+00:00, run_start_date=2024-11-28 14:48:05.186397+00:00, run_end_date=2024-11-28 14:48:17.366330+00:00, run_duration=12.179933, state=failed, external_trigger=True, run_type=manual, data_interval_start=2024-11-28 14:48:04.731025+00:00, data_interval_end=2024-11-28 14:48:04.731025+00:00, dag_version_name=wait_to_fail-1
[2024-11-28T14:48:17.370+0000] {client.py:110} INFO - OpenLineageClient will use `http` transport

This happens because, in handle_failure, TaskInstance.save_to_db not only updates the state of that task instance, it also pushes a stale DagRun state: the one it read when the TI started. So the actual DagRun state goes running -> failed -> running -> failed.

This causes other unintended behavior, such as on_dag_run_failed listeners being called twice.

The fix here simply reloads the DagRun state from the DB before pushing the TI state; a sketch of the mechanism is below. There is probably a better solution that someone with more SQLAlchemy knowledge could help with.
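For illustration, here is a minimal, self-contained SQLAlchemy sketch of the stale-write mechanism and of the reload-before-commit idea. The DagRun/TaskInstance models, the sqlite engine, and the session handling are simplified stand-ins (assuming SQLAlchemy 2.x), not the actual Airflow models or the exact diff in this PR:

```python
# Standalone sketch -- all names are illustrative stand-ins, not Airflow's
# real models; assumes SQLAlchemy 2.x.
from sqlalchemy import ForeignKey, String, create_engine
from sqlalchemy.orm import (
    DeclarativeBase,
    Mapped,
    Session,
    mapped_column,
    relationship,
)


class Base(DeclarativeBase):
    pass


class DagRun(Base):
    __tablename__ = "dag_run"
    id: Mapped[int] = mapped_column(primary_key=True)
    state: Mapped[str] = mapped_column(String(20))


class TaskInstance(Base):
    __tablename__ = "task_instance"
    id: Mapped[int] = mapped_column(primary_key=True)
    state: Mapped[str] = mapped_column(String(20))
    dag_run_id: Mapped[int] = mapped_column(ForeignKey("dag_run.id"))
    # The default relationship cascade includes "merge", which is what
    # drags the stale DagRun state along when the TI is merged back.
    dag_run: Mapped[DagRun] = relationship()


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as s:
    run = DagRun(id=1, state="running")
    s.add_all([run, TaskInstance(id=1, state="running", dag_run=run)])
    s.commit()

# The worker loads the TI (and, through the relationship, a "running"
# DagRun), then holds both objects detached while the task executes.
with Session(engine) as s:
    ti = s.get(TaskInstance, 1)
    _ = ti.dag_run.state  # the DagRun is now loaded as "running"
    s.expunge_all()       # detach, like objects living across the task run

# Meanwhile the run is marked failed via the UI (a different session).
with Session(engine) as s:
    s.get(DagRun, 1).state = "failed"
    s.commit()

# The task itself now fails and is saved back, roughly what save_to_db
# does via session.merge(). The merge cascades to ti.dag_run and queues
# the stale "running" state for flush -- without the refresh below, the
# commit would flip the run back to running until the scheduler fails it
# again on the next loop.
ti.state = "failed"
with Session(engine) as s:
    merged = s.merge(ti)
    # The fix's idea: re-read the DagRun from the DB before committing.
    # refresh() discards the un-flushed stale "running" value, so only
    # the TI row is actually updated.
    s.refresh(merged.dag_run)
    s.commit()
    assert s.get(DagRun, 1).state == "failed"
```

The key point the sketch shows is that session.merge() cascades to loaded relationships by default, so saving the TI silently queues an UPDATE for the DagRun carrying whatever state the worker last saw.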

Link to discussion on Airflow slack: https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1732805503889679

@boring-cyborg bot added the area:core-operators (Operators, Sensors and hooks within Core Airflow) label Jan 2, 2025
@mobuchowski force-pushed the do-not-update-dr-after-ti-fails-2.10 branch 4 times, most recently from 2d536fc to 4e4906b on January 3, 2025
@mobuchowski force-pushed the do-not-update-dr-after-ti-fails-2.10 branch from 4e4906b to 295a85d on January 7, 2025
@potiuk merged commit 586f1ea into apache:v2-10-test on Jan 8, 2025
48 checks passed
@potiuk added this to the Airflow 2.10.5 milestone on Jan 8, 2025