Airflow SLA Miss Report

About

Airflow allows users to define SLAs at DAG & task levels to track instances where processes are running longer than usual. However, making sense of the data is a challenge.

The airflow-sla-miss-report DAG consolidates the data from the metadata tables and provides meaningful insights to ensure SLAs are met when set.

The DAG utilizes three (3) timeframes (default: short: 1d, medium: 3d, long: 7d) to calculate the following KPIs:

Daily SLA Misses (timeframe: long)

The following details are broken down on a daily basis for the provided long timeframe (e.g. 7 days):

  SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
  Top Violator (%): task that violated its SLA the most as a percentage of its total runs
  Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the day
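The daily KPIs above can be illustrated with a small pandas sketch. This is not the DAG's actual code: the `runs` DataFrame, its column names, and the `daily_summary` helper are hypothetical, and a "miss" is assumed here to mean that a task's run duration exceeded its SLA.

```python
import pandas as pd

# Hypothetical, hand-made task runs; the real DAG reads these from the metadata tables.
runs = pd.DataFrame(
    {
        "run_date": ["2024-01-01"] * 4 + ["2024-01-02"] * 3,
        "task": ["a", "a", "b", "b", "a", "b", "b"],
        "duration_s": [50, 130, 20, 25, 40, 90, 95],
        "sla_s": [100, 100, 60, 60, 100, 60, 60],
    }
)
# Assumption: a run misses its SLA when its duration exceeds the SLA.
runs["missed"] = runs["duration_s"] > runs["sla_s"]


def daily_summary(df):
    """Return one row per day with SLA miss % and the two top violators."""
    rows = []
    for day, day_df in df.groupby("run_date"):
        # Per-task miss count ("sum") and miss ratio ("mean") for this day.
        per_task = day_df.groupby("task")["missed"].agg(["sum", "mean"])
        rows.append(
            {
                "run_date": day,
                "sla_miss_pct": round(100 * day_df["missed"].mean(), 2),
                "top_violator_pct": per_task["mean"].idxmax(),
                "top_violator_abs": per_task["sum"].idxmax(),
            }
        )
    return pd.DataFrame(rows)


summary = daily_summary(runs)
print(summary)
```

The two "top violator" columns can disagree on real data: a task that runs twice and misses once has a 50% miss ratio, while a task that runs 100 times and misses 10 times has more misses in absolute terms.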

Hourly SLA Misses (timeframe: short)

The following details are broken down on an hourly basis for the provided short timeframe (e.g. 1 day):

  SLA Miss %: percentage of tasks that missed their SLAs out of total task runs
  Top Violator (%): task that violated its SLA the most as a percentage of its total runs
  Top Violator (absolute): task that violated its SLA the most on an absolute count basis during the hour
  Longest Running Task: task that took the longest time to execute within the hour window
  Average Task Queue Time (s): avg time taken for tasks in `queued` state; can be used to detect scheduling bottlenecks
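As a rough illustration of the last two hourly KPIs, the sketch below derives queue time and run duration from task timestamps. The sample data is made up, and the column names `queued_dttm`, `start_date`, and `end_date` are assumed to mirror the task instance timestamps; the report's actual implementation may differ.

```python
import pandas as pd

# Hypothetical task instances with queue, start, and end timestamps.
ti = pd.DataFrame(
    {
        "task": ["a", "b", "c"],
        "queued_dttm": pd.to_datetime(
            ["2024-01-01 10:00:00", "2024-01-01 10:30:00", "2024-01-01 11:05:00"]
        ),
        "start_date": pd.to_datetime(
            ["2024-01-01 10:00:10", "2024-01-01 10:30:30", "2024-01-01 11:05:05"]
        ),
        "end_date": pd.to_datetime(
            ["2024-01-01 10:05:10", "2024-01-01 10:31:30", "2024-01-01 11:06:05"]
        ),
    }
)

# Time spent waiting in the queue and time spent executing, in seconds.
ti["queue_s"] = (ti["start_date"] - ti["queued_dttm"]).dt.total_seconds()
ti["duration_s"] = (ti["end_date"] - ti["start_date"]).dt.total_seconds()
# Bucket each task instance by the hour in which it started.
ti["hour"] = ti["start_date"].dt.strftime("%Y-%m-%d %H:00")

# Average queue time per hour.
avg_queue = ti.groupby("hour")["queue_s"].mean()
# Task with the largest duration within each hour.
longest_idx = ti.groupby("hour")["duration_s"].idxmax()
longest = ti.loc[longest_idx].set_index("hour")["task"]

print(avg_queue)
print(longest)
```

A rising average queue time with stable run durations would point at scheduler or worker capacity rather than at the tasks themselves.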

DAG SLA Misses (timeframe: short, medium, long)

The following details are broken down on a task level for all timeframes:

  Current SLA (s): current defined SLA for the task
  Short, Medium, Long Timeframe SLA miss % (avg execution time): % of tasks that missed their SLAs & their avg execution times over the respective timeframes
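The per-task view across the three timeframes might look like the following sketch. The `TIMEFRAMES_IN_DAYS` mapping, the `timeframe_summary` helper, the fixed reference time, and the sample data are all hypothetical; a miss is again assumed to mean duration greater than SLA.

```python
import pandas as pd

# Hypothetical mapping that mirrors the default timeframes (1d, 3d, 7d).
TIMEFRAMES_IN_DAYS = {"short": 1, "medium": 3, "long": 7}

# Fixed reference time so the example is reproducible.
now = pd.Timestamp("2024-01-08 00:00:00")

runs = pd.DataFrame(
    {
        "task": ["a", "a", "a", "a"],
        "start_date": pd.to_datetime(
            ["2024-01-07 12:00", "2024-01-06 12:00", "2024-01-03 12:00", "2024-01-02 12:00"]
        ),
        "duration_s": [120, 80, 150, 90],
        "sla_s": [100, 100, 100, 100],
    }
)


def timeframe_summary(df, now):
    """Per-task SLA miss % and average duration for each timeframe."""
    out = {}
    for name, days in TIMEFRAMES_IN_DAYS.items():
        # Keep only runs that started within the last `days` days.
        window = df[df["start_date"] >= now - pd.Timedelta(days=days)]
        grouped = window.assign(
            missed=window["duration_s"] > window["sla_s"]
        ).groupby("task")
        out[f"{name}_miss_pct"] = 100 * grouped["missed"].mean()
        out[f"{name}_avg_s"] = grouped["duration_s"].mean()
    return pd.DataFrame(out)


summary = timeframe_summary(runs, now)
print(summary)
```

Comparing the short and long columns for the same task shows whether a miss rate is a recent regression or a long-standing pattern.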

Sample Email

Airflow SLA miss Email Report Output1

Sample Airflow Task Logs

Airflow SLA miss Email Report Output2

Architecture

The process reads data from the Airflow metadata database and calculates SLA misses based on the SLAs defined at the DAG/task level. The following metadata tables are utilized:

  • SerializedDag: retrieve defined DAG & task SLAs
  • DagRuns: details about each DAG run
  • TaskInstances: details about each task instance in a DAG run

Airflow SLA Process Flow Architecture

Requirements

  • Python: 3.7 and above
  • Pip packages: pandas
  • Airflow: v2.3 and above
  • Airflow metadata tables: DagRuns, TaskInstances, SerializedDag
  • SMTP details in airflow.cfg for sending emails

Deployment

  1. Log in to the machine running Airflow
  2. Navigate to the dags directory
  3. Copy the airflow-sla-miss-report.py file to the dags directory. Here's a fast way:
       wget https://raw.githubusercontent.com/teamclairvoyant/airflow-maintenance-dags/master/sla-miss-report/airflow-sla-miss-report.py
  4. Update the global variables in the DAG with the desired values:
       EMAIL_ADDRESSES (optional): list of recipient emails to send the SLA report
       SHORT_TIMEFRAME_IN_DAYS: duration in days of the short timeframe to calculate SLA metrics (default: 1)
       MEDIUM_TIMEFRAME_IN_DAYS: duration in days of the medium timeframe to calculate SLA metrics (default: 3)
       LONG_TIMEFRAME_IN_DAYS: duration in days of the long timeframe to calculate SLA metrics (default: 7)
  5. Enable the DAG in the Airflow Webserver
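For reference, the edited global variables could look roughly like this. The variable names and defaults come from the list above; the email address is a placeholder, and the ordering check is only an illustrative sanity test, not something the DAG is known to perform.

```python
# Recipients of the SLA report (optional); the address below is a placeholder.
EMAIL_ADDRESSES = ["data-team@example.com"]

# Timeframes in days, shown with their documented defaults.
SHORT_TIMEFRAME_IN_DAYS = 1
MEDIUM_TIMEFRAME_IN_DAYS = 3
LONG_TIMEFRAME_IN_DAYS = 7

# Illustrative sanity check: the timeframes should be strictly increasing.
timeframes_are_ordered = (
    SHORT_TIMEFRAME_IN_DAYS < MEDIUM_TIMEFRAME_IN_DAYS < LONG_TIMEFRAME_IN_DAYS
)
print(timeframes_are_ordered)
```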