Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refs 4739: metric check for failing snap tasks on RH repos #833

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dominikvagner
Copy link
Contributor

Summary

This PR adds a new metric for checking if the snapshot tasks are failing on RH repos, specifically if there isn't a successful one in the last 36 hours and returns number of repos on which is this happening.

Testing steps

  1. Import RH repos.
  2. Make the snapshot task fail for some repo(s) (i.e. status = failed) and see if the metric increases.

@jlsherrill
Copy link
Member

@dominikvagner dominikvagner changed the title Refs 4739: new metric check for failing snapshot tasks on RH repos Refs 4739: metric check for failing snap tasks on RH repos Sep 26, 2024
@xbhouse xbhouse self-assigned this Oct 1, 2024
Copy link
Contributor

@xbhouse xbhouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metrics look good to me! @rverdile can you take a look when you get a chance in case i missed anything?

@rverdile rverdile self-requested a review October 1, 2024 19:56
Copy link
Contributor

@rverdile rverdile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment, looks really good overall


subQuery := d.db.WithContext(ctx).
Model(&models.RepositoryConfiguration{}).
Select("repository_configurations.uuid, bool_or(tasks.status ILIKE ? AND tasks.finished_at > ?) AS has_successful_tasks", fmt.Sprintf("%%%s%%", config.TaskStatusCompleted), date).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Select("repository_configurations.uuid, bool_or(tasks.status ILIKE ? AND tasks.finished_at > ?) AS has_successful_tasks", fmt.Sprintf("%%%s%%", config.TaskStatusCompleted), date).
Select("repository_configurations.uuid, bool_or(tasks.status = ? AND tasks.finished_at > ?) AS has_successful_tasks", config.TaskStatusCompleted), date).

this can be simplified slightly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query looks good. I believe it would miss the (very unlikely) case where there's no snapshot tasks for a repository. I don't think this would happen naturally, but maybe could happen if there's a bug with task cleanup.

Copy link
Contributor

@rverdile rverdile Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about handling that case? I could see just adding a second query and adding the two counts together as a simple solution.

@@ -134,6 +134,7 @@ type MetricsDao interface {
PendingTasksAverageLatency(ctx context.Context) float64
PendingTasksCount(ctx context.Context) int64
PendingTasksOldestTask(ctx context.Context) float64
RHReposNoSuccessfulSnapshotTaskIn36Hours(ctx context.Context) int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RHReposNoSuccessfulSnapshotTaskIn36Hours(ctx context.Context) int64
RHReposSnapshotNotCompletedInLast36HoursCount(ctx context.Context) int64

This is a nitpick, but what do you think about renaming this metric?

"No" can also be an abbreviation for "number", so "RHReposNoSuccessfulSnapshotTaskIn36Hours" could be interpreted as "Number of sucessful snapshot tasks"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants