[Bug][plugins/dora] incident_deploy_connector is not matching incident with deployments properly #8150

Open
3 tasks done
Adibov opened this issue Oct 21, 2024 · 3 comments
Labels
component/plugins, devops, type/bug

Comments

Adibov commented Oct 21, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

DORA revised the definition of change failure rate (CFR) in its 2023 report:

For the primary application or service you work on, how long does it generally take to restore service after a change to production or release to users results in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, or patch)

The definition now counts only downtime caused by a deployment as an incident. For example, if two consecutive deployments are 12 hours apart and an incident occurs 6 hours after the first one, we shouldn't count that incident as a change that led to a failure.

However, the DORA plugin in DevLake matches every incident with the latest deployment before it, regardless of how much time separates the two. This leads to imprecise metrics.

What do you expect to happen

I think the preferred solution is to make the maximum time between a deployment and an incident configurable, so that only incidents inside that window are treated as software faults. That is, the code would expose a setting that users can adjust to their needs.

How to reproduce

Create one deployment and then an incident long after it; the incident will still be matched to that deployment.

Anything else

No response

Version

v1.0.1@e061ef2

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Adibov added the type/bug label Oct 21, 2024

dosubot bot commented Oct 21, 2024

Hey @Adibov! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to solve this issue.

I found a similar closed issue that might be relevant to your concern: [Feature][DORA] Update the logic for metric lead time for changes (#3516). This issue was closed by a pull request that updated the logic for change lead time [1].

Regarding your specific issue, the current logic in the DORA plugin for matching incidents with deployments is based on timestamps. Incidents are related to the last deployment on the project with a timestamp that is before the incident's timestamp. This logic does not currently consider the time length between deployments and incidents [2].

There is no existing configuration option in the DORA plugin for setting time constraints between deployments and incidents [2]. However, you mentioned that you are willing to submit a PR, which could be a great opportunity to introduce such a feature. You might consider adding a configurable time constraint to the ConnectIncidentToDeployment function in incident_deploy_connector.go, which is responsible for connecting incidents to deployments [3].

dosubot bot added the component/plugins and devops labels Oct 21, 2024

Shaktty commented Oct 25, 2024

I don't agree here. An incident can take time to appear or manifest itself and still be caused by the deployment of a buggy artifact with defect leakage. Also, not only full downtime should count as a faulty deployment, but also any incident that reports a degraded user experience.


Adibov commented Oct 26, 2024

@Shaktty
I'm on the same page with you: software defects must also include defect leakage, not only full downtime. That is why I proposed a configurable window, so that only incidents falling inside it are counted. For example, suppose we deploy a manifest at 6 PM and its availability drops to 90% at 6 AM the next morning. In that case, we shouldn't attribute the incident to the deployment, since it is probably caused by an infrastructure defect and is not software-related.

Another example: when the whole Kubernetes cluster goes down, availability drops for every service, and an incident submitted via webhook will be counted as a software incident even though it isn't one.
