[Bug][plugins/dora] `incident_deploy_connector` is not matching incident with deployments properly #8150

Adibov · 2024-10-21T10:50:51Z

Search before asking

I had searched in the issues and found no similar issues.

What happened

DORA changed its definition of change failure rate (CFR) in its 2023 report:

> For the primary application or service you work on, how long does it generally take to restore service after a change to production or release to users results in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, or patch)

The definition now counts only downtimes caused by a deployment as incidents. For example, if two consecutive deployments are 12 hours apart and an incident occurs 6 hours after the first deployment, we shouldn't count that incident as a change that led to a failure.

However, the DORA plugin in DevLake matches every incident with the latest deployment before it, ignoring how much time has passed between them. This leads to imprecise metrics.
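To illustrate the behaviour described above, here is a minimal sketch of "latest deployment before the incident" matching. The `Deployment` and `Incident` types, the `at` helper, and `latestDeploymentBefore` are hypothetical simplifications for this issue, not the plugin's actual models or code.

```go
package main

import (
	"fmt"
	"time"
)

// Deployment and Incident are simplified, hypothetical stand-ins for the
// DevLake domain models; only the fields needed for matching are kept.
type Deployment struct {
	ID         string
	FinishedAt time.Time
}

type Incident struct {
	ID        string
	CreatedAt time.Time
}

// at builds a UTC timestamp in October 2024 to keep the examples short.
func at(day, hour int) time.Time {
	return time.Date(2024, time.October, day, hour, 0, 0, 0, time.UTC)
}

// latestDeploymentBefore returns the most recent deployment that finished
// at or before the incident was created. The elapsed time between the two
// is never checked, which is the behaviour this issue is about.
func latestDeploymentBefore(deps []Deployment, inc Incident) (Deployment, bool) {
	var best Deployment
	found := false
	for _, d := range deps {
		if d.FinishedAt.After(inc.CreatedAt) {
			continue // deployed after the incident, cannot be the cause
		}
		if !found || d.FinishedAt.After(best.FinishedAt) {
			best, found = d, true
		}
	}
	return best, found
}

func main() {
	deps := []Deployment{{ID: "deploy-1", FinishedAt: at(1, 9)}}
	inc := Incident{ID: "incident-1", CreatedAt: at(21, 9)} // 20 days later
	if d, ok := latestDeploymentBefore(deps, inc); ok {
		fmt.Printf("%s -> %s (gap %s)\n", inc.ID, d.ID, inc.CreatedAt.Sub(d.FinishedAt))
	}
}
```

In this sketch the incident is still attributed to `deploy-1` even though it was raised 20 days after the deployment finished.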

What do you expect to happen

I think the preferred solution is to make the maximum time between a deployment and an incident configurable, so that only incidents falling within that window are treated as software faults caused by the deployment. That is, the code would expose a setting that users can tune to their needs.
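A rough sketch of what such a window could look like, under the assumption that a non-positive window keeps today's unbounded behaviour. The names `matchWithinWindow`, `hours`, and `at`, as well as the simplified types, are hypothetical and only meant to illustrate the idea; they are not existing plugin options or APIs.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified, hypothetical stand-ins for the DevLake domain models.
type Deployment struct {
	ID         string
	FinishedAt time.Time
}

type Incident struct {
	ID        string
	CreatedAt time.Time
}

// at builds a UTC timestamp in October 2024 to keep the examples short.
func at(day, hour int) time.Time {
	return time.Date(2024, time.October, day, hour, 0, 0, 0, time.UTC)
}

// hours converts a whole number of hours into a time.Duration.
func hours(n int) time.Duration {
	return time.Duration(n) * time.Hour
}

// matchWithinWindow picks the latest deployment finished at or before the
// incident, then rejects the match if the gap exceeds the window.
// Assumption: window <= 0 means "no limit", i.e. the current behaviour.
func matchWithinWindow(deps []Deployment, inc Incident, window time.Duration) (Deployment, bool) {
	var best Deployment
	found := false
	for _, d := range deps {
		if d.FinishedAt.After(inc.CreatedAt) {
			continue
		}
		if !found || d.FinishedAt.After(best.FinishedAt) {
			best, found = d, true
		}
	}
	if !found {
		return Deployment{}, false
	}
	if window > 0 && inc.CreatedAt.Sub(best.FinishedAt) > window {
		return Deployment{}, false // too long after the deployment
	}
	return best, true
}

func main() {
	// Two deployments 12 hours apart, incident 6 hours after the first.
	deps := []Deployment{
		{ID: "deploy-1", FinishedAt: at(1, 6)},
		{ID: "deploy-2", FinishedAt: at(1, 18)},
	}
	inc := Incident{ID: "incident-1", CreatedAt: at(1, 12)}

	for _, w := range []time.Duration{hours(2), hours(12), 0} {
		d, ok := matchWithinWindow(deps, inc, w)
		fmt.Printf("window=%s matched=%v deployment=%q\n", w, ok, d.ID)
	}
}
```

With a 2-hour window the incident is left unmatched, while a 12-hour window or no window attributes it to `deploy-1`. Whether the default should be unbounded for backward compatibility is a design choice left open here.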

How to reproduce

Create one deployment and an incident long after it; the incident will still be matched to that deployment.

Anything else

No response

Version

v1.0.1@e061ef2

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

dosubot · 2024-10-21T10:52:53Z

Hey @Adibov! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to solve this issue.

I found a similar closed issue that might be relevant to your concern: [[Feature][DORA] Update the logic for metric lead time for changes](#3516). This issue was closed by a pull request that updated the logic for change lead time ^[1].

Regarding your specific issue, the current logic in the DORA plugin for matching incidents with deployments is based on timestamps. Incidents are related to the last deployment on the project with a timestamp that is before the incident's timestamp. This logic does not currently consider the time length between deployments and incidents ^[2].

There is no existing configuration option in the DORA plugin for setting time constraints between deployments and incidents ^[2]. However, you mentioned that you are willing to submit a PR, which could be a great opportunity to introduce such a feature. You might consider adding a configurable time constraint to the `ConnectIncidentToDeployment` function in `incident_deploy_connector.go`, which is responsible for connecting incidents to deployments ^[3].

To continue talking to Dosu, mention @dosu.

Shaktty · 2024-10-25T13:14:20Z

I don't agree here. An incident can take time to appear or manifest itself but still be due to the deployment of a buggy artifact with defect leakage. Also, not only full downtime should be attributed to a faulty deployment, but also any incident that reports a degraded user experience.

Adibov · 2024-10-26T10:50:03Z

@Shaktty
I'm on the same page with you: software defects must also include defect leakage, not only full downtime. That is why I proposed a configurable window, so that only incidents falling within it are considered. For example, suppose we deploy a manifest at 6 PM and its availability drops to 90% at 6 AM the next morning. In that case, we shouldn't count it as a deployment-caused incident, since it's probably due to an infrastructure defect and is not software-related.

Another example: when the whole Kubernetes cluster goes down, availability drops for every service, and if we submit an incident via webhook, it will be counted as a software incident even though it isn't.

Adibov added the `type/bug` label · Oct 21, 2024

dosubot added the `component/plugins` and `devops` labels · Oct 21, 2024
