Feature/add slas #359
airflow_variables_dev.json
  },
  "task_sla": {
    "get_ledger_range_from_times": 240,
    "export_task": 240,
I think we need to adjust some of these SLA times. For instance, the export_task SLA of 240 will never be hit because the task_timeout for export_task is 180.
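The reviewer's point can be checked mechanically. Below is a minimal sketch, assuming the SLA and timeout values live in two dictionaries keyed by task name and use the same unit; the helper name find_unreachable_slas is hypothetical, and only the values quoted in this thread are used. It follows the reviewer's reasoning that an SLA at or above the timeout cannot fire before the task is killed.

```python
def find_unreachable_slas(task_sla, task_timeout):
    """Return task names whose SLA is not strictly below their timeout.

    Tasks without a configured timeout are skipped, since nothing can be
    said about them from the two dictionaries alone.
    """
    return sorted(
        name
        for name, sla in task_sla.items()
        if name in task_timeout and sla >= task_timeout[name]
    )


# Values quoted in the review thread; both are assumed to be seconds.
task_sla = {"get_ledger_range_from_times": 240, "export_task": 240}
task_timeout = {"export_task": 180}

print(find_unreachable_slas(task_sla, task_timeout))  # ['export_task']
```

A check like this could run in CI against both variable files so that mismatched pairs are caught before they reach an environment.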
Also, judging by all the alerts we are getting in the Slack channel #alerts-hubble-testnet, the current SLA times are too aggressive.
> I think we need to adjust some of these SLA times. For instance, the export_task SLA of 240 will never be hit because the task_timeout for export_task is 180.
There's an interesting behavior going on with export_task. Even though the task_timeout is set to 180 seconds, the tasks usually run for around 220 seconds without triggering the timeout. This differs from time_task, where the timeout triggers in the middle of the job.
I'll push the commits where I increase the timeout for export_task too, which better matches the actual behavior of the pipeline.
> Also the current SLA times are too aggressive from all the alerts we are getting in the Slack channel #alerts-hubble-testnet
Those have been reviewed using the suggested buffer and will be updated.
I believe you are missing the I'd do a grep/search for all the operators/build functions in |
airflow_variables_prod.json
@@ -238,9 +238,33 @@
    "build_copy_table": 180,
    "build_dbt_task": 6000,
Can we lower the build_dbt_task timeout value? I don't think there is anything currently that should run over 1.5 hours unless it is a backfill.
airflow_variables_prod.json
    "snapshot_state": 840,
    "elementary_dbt_sdf_marts": 120,
    "build_bq_insert_job": 120,
    "sandbox_create": 120
I think the sandbox update task_sla is missing.
I think they are also missing for tasks like build_delete_data_task.
I think there should be a task_sla for every task in Airflow.
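The "a task_sla for every task" rule lends itself to a simple coverage check. The sketch below is only an illustration: tasks_missing_sla is a hypothetical helper, the task_sla entries are the ones visible in the prod diff above, and the key name sandbox_update is an assumption (the thread only says "sandbox update").

```python
def tasks_missing_sla(task_names, task_sla):
    """Return the task names that have no entry in the task_sla mapping."""
    return sorted(set(task_names) - set(task_sla))


# Entries visible in the airflow_variables_prod.json diff above.
task_sla = {
    "snapshot_state": 840,
    "elementary_dbt_sdf_marts": 120,
    "build_bq_insert_job": 120,
    "sandbox_create": 120,
}

# Names to verify; "sandbox_update" is an assumed key name.
task_names = [
    "snapshot_state",
    "sandbox_create",
    "sandbox_update",
    "build_delete_data_task",
]

print(tasks_missing_sla(task_names, task_sla))
# ['build_delete_data_task', 'sandbox_update']
```

In practice the list of names would come from the operators/build functions in the repository rather than being typed by hand.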
# args.extend(
#     ["--cloud-storage-bucket", Variable.get("gcs_exported_data_bucket_name")]
# )
# args.extend(["--cloud-provider", "gcp"])
I think we should remove the code that currently comes from the dev branch PR #371; it is still in development and has a bunch of commented-out code in it for testing.
62f859a to 48d2e69
* HUBBLE-409 Update/delete airflow variable names (#377)
* Updates for testnet-reset (#380)
* Feature/add slas (#359)
Co-authored-by: Laysa Bitencourt
Co-authored-by: chowbao
Co-authored-by: Eduardo Alves

* HUBBLE-409 Update/delete airflow variable names (#377)
  * test time task without affinity
  * delete affinity from repo
  * dbt_
  * testing config related to stellar_marts
  * documented
  * test without var dbt_dataset
  * with dbt_dataset again
  * dbt_dataset_for_test
  * comment excluded
* Updates for testnet-reset (#380)
* Feature/add slas (#359)
  * Created alert_sla_miss callback function
  * Defined test value for export task SLA
  * Defined sla_miss_callback for DAG level parameter
  * Increased time task timeout value
  * Added dev Airflow variable for tasks SLA values
  * Added slack notification logic for the SLA miss callback
  * Added SLA parameter to the relevant task building functions
  * Removed SLA miss callback function and related logic
  * Added refactored logic for sla miss callback function
  * Added sla miss callback to desired dags
  * Adjusted callback logic for sentry integration
  * Added task SLA values for prod Airflow variables
  * Changed alert message to better distinguish SLA miss from task fail alerts
  * Update dev tasks sla and timeout values
  * Update prod tasks sla and timeout values
  * Added default sla value to dag default args
  * Added missing SLA param to build task functions
  * Standardizing logic for fetching SLA param value
  * Added sla miss callback reference
  * Remove default SLA from default args
  * Set missing sla params to tasks
  * Updated values for sla variables
  * Updated build_dbt_task timeout value
  * Changed variable names to match new naming reference
* Update stellar-etl image (#381)
* Update stellar-etl image (#383)
Co-authored-by: Laysa Bitencourt
Co-authored-by: chowbao
Co-authored-by: Eduardo Alves
* [PRODUCTION] Update production Airflow environment (#370)
  * requirements updated
  * sqlfluff
  * Use the correct gcs bucket (#372)
* [PRODUCTION] Update production Airflow environment (#375)
  * requirements updated
  * Update READMEs
* [PRODUCTION] Update production Airflow environment (#378)
  * HUBBLE-409 Update/delete airflow variable names (#377)
  * Updates for testnet-reset (#380)
  * Feature/add slas (#359)
* [PRODUCTION] Update production Airflow environment (#382)
  * HUBBLE-409 Update/delete airflow variable names (#377)
  * Updates for testnet-reset (#380)
  * Feature/add slas (#359)
  * Update stellar-etl image (#381)
  * Update stellar-etl image (#383)
Co-authored-by: github-actions[bot]
Co-authored-by: Laysa de Sousa Bitencourt
Co-authored-by: chowbao
Co-authored-by: Eduardo Alves
* HUBBLE-409 Update/delete airflow variable names (#377)
* Updates for testnet-reset (#380)
* Feature/add slas (#359)
* Update stellar-etl image (#381)
* Update stellar-etl image (#383)
* Updated SLA values for dbt eho related tasks (#384)
* Patch/merge conflict resolution (#386)
* Help (#388)
* Patch/fix release conflicts (#390)
Co-authored-by: Laysa Bitencourt
Co-authored-by: chowbao
Co-authored-by: Eduardo Alves
Co-authored-by: sydneynotthecity
Co-authored-by: github-actions[bot]
Overview
This PR adds notification callbacks and enables SLA checks for tasks that could potentially take longer to finish.
Key Changes
* Created the alert_sla_miss function to send notifications to designated Slack channels, leveraging the Sentry integration
* Defined sla_miss_callback with the newly defined callback function for the required DAGs
* Increased the task_timeout value for get_ledger_range_from_times in order to avoid task retries while processing XCOM
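
For readers unfamiliar with the mechanism, here is a rough sketch of what an SLA-miss callback can look like. It is not the code from this PR: the logger name, the message format, the helper build_sla_miss_message and the example DAG id are all assumptions. The only parts taken from Airflow 2 are the five positional arguments passed to sla_miss_callback and the fact that task_list arrives as a newline-separated string. The sketch logs at error level on the assumption that a Sentry logging integration forwards such records to Slack.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("sla_alerts")  # hypothetical logger name


def build_sla_miss_message(dag_id, task_list):
    """Format an alert that is easy to tell apart from a task-failure alert."""
    # Airflow 2 passes task_list as a newline-separated string.
    tasks = [line.strip() for line in str(task_list).splitlines() if line.strip()]
    return f"SLA missed in DAG {dag_id}: " + ", ".join(tasks)


def alert_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Callback with the five arguments Airflow 2 passes to sla_miss_callback."""
    message = build_sla_miss_message(dag.dag_id, task_list)
    # Assumption: an error-level log record is picked up by Sentry and
    # routed to the alert channel; no Slack API is called directly here.
    logger.error(message)
    return message


if __name__ == "__main__":
    # Stand-in for a real DAG object; the DAG id is made up for the demo.
    fake_dag = SimpleNamespace(dag_id="example_export_dag")
    print(alert_sla_miss(fake_dag, "export_task on 2024-01-01T00:00:00+00:00", "", [], []))
```

The function would be attached at DAG level, e.g. by passing sla_miss_callback=alert_sla_miss when constructing the DAG, while each task receives its own sla value from the task_sla variable.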