
Revamp metrics to account the status code #320

Merged: 6 commits into Layr-Labs:master on Mar 19, 2024

Conversation

@jianoaix (Contributor, author) commented Mar 6, 2024

Why are these changes needed?

Use the standard API status codes to label the metrics, so we can correctly account for availability.
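For context, a minimal sketch of what a status-code-labeled request counter looks like with the Prometheus Go client; the namespace, metric name, and label names here are assumptions for illustration, not necessarily the repository's actual ones.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// numRequests counts API requests, labeled by status code and method, so
// availability can be derived per status class.
var numRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "eigenda",
		Name:      "api_requests_total",
		Help:      "Number of API requests, labeled by status code and method.",
	},
	[]string{"status", "method"},
)

func init() {
	prometheus.MustRegister(numRequests)
}

// recordRequest increments the counter once for a finished request.
func recordRequest(status, method string) {
	numRequests.With(prometheus.Labels{"status": status, "method": method}).Inc()
}

Availability for a method is then a ratio of per-status samples of this one counter, which is what the discussion below is about.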

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

		}
		return nil, api.NewInvalidArgError(err.Error())
	}

	if s.ratelimiter != nil {
		err := s.checkRateLimitsAndAddRates(ctx, blob, origin, authenticatedAddress)
		if err != nil {
			for _, param := range securityParams {
@jianoaix (author) commented on this diff:

This was wrong accounting: checkRateLimitsAndAddRates errors out and returns upon the first error (e.g. at the first quorum), but this code incremented the metric once for each quorum.
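A sketch of the corrected shape, assuming the IncrementFailedBlobRequestNum(statusCode, quorum, method) signature introduced later in this diff; the "429" code and the empty quorum label are placeholder values, not the PR's actual choices.

	if s.ratelimiter != nil {
		if err := s.checkRateLimitsAndAddRates(ctx, blob, origin, authenticatedAddress); err != nil {
			// checkRateLimitsAndAddRates returns on the first failing quorum,
			// so count exactly one rate-limited failure here rather than
			// incrementing once per entry of securityParams.
			s.metrics.IncrementFailedBlobRequestNum("429", "", "DisperseBlob")
			return nil, err
		}
	}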

@@ -169,6 +170,7 @@ func (s *DispersalServer) DisperseBlobAuthenticated(stream pb.Disperser_Disperse
}

func (s *DispersalServer) DisperseBlob(ctx context.Context, req *pb.DisperseBlobRequest) (*pb.DisperseBlobReply, error) {
	s.metrics.IncrementBlobRequestNum("DisperseBlob")
A reviewer (Contributor) commented:

I'm wondering if it would be more useful to emit this at the end instead of the beginning, because right now IncrementBlobRequestNum has status code = "total", and total should equal success + failures anyway.
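For illustration, "emit at the end" would look roughly like the deferred pattern below. This is a self-contained, hypothetical sketch (recordRequest stands in for the metrics call and the status values are illustrative); as the reply below explains, the PR keeps the up-front increment as a "total" sample instead.

package main

import (
	"errors"
	"fmt"
)

// recordRequest stands in for IncrementBlobRequestNum; it is hypothetical.
func recordRequest(method, status string) {
	fmt.Printf("method=%s status=%s\n", method, status)
}

// disperseBlob shows the idea: the counter is bumped once, in a deferred
// call, with whatever status code the handler finished with.
func disperseBlob(fail bool) error {
	status := "200"
	defer func() { recordRequest("DisperseBlob", status) }()

	if fail {
		status = "400"
		return errors.New("invalid argument")
	}
	return nil
}

func main() {
	_ = disperseBlob(false) // method=DisperseBlob status=200
	_ = disperseBlob(true)  // method=DisperseBlob status=400
}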

@jianoaix (author) replied:

Having a time series for "total" makes the monitoring query easy (just 1 - 500/total) and also more stable: there is no need to update the query when a new failure case is added here.

We currently have 4 failure codes; adding more will make the query itself more complex if "total" has to be expanded.
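Concretely, the availability queries under the two approaches could look like the PromQL below, written out here as Go string constants; the metric and label names follow the sketch near the top of this description and are assumptions, not the repository's actual series.

package monitoring

// With an up-front "total" sample, availability is a single ratio that does
// not need to change when new failure codes are introduced.
const availabilityWithTotal = `
1 - sum(rate(eigenda_api_requests_total{status="500"}[5m]))
  / sum(rate(eigenda_api_requests_total{status="total"}[5m]))`

// Without a "total" sample, the denominator is reconstructed by summing the
// per-status series (success plus each failure code).
const availabilityWithoutTotal = `
1 - sum(rate(eigenda_api_requests_total{status="500"}[5m]))
  / sum(rate(eigenda_api_requests_total[5m]))`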

// IncrementRequestNum increments the number of successful requests
func (g *Metrics) IncrementRequestNum(method string) {
	g.NumRequests.With(prometheus.Labels{
		"status": "total",
A reviewer (Contributor) commented:

Same comment here: it feels like we shouldn't track "total" because we can compute it from failed + success.

@ian-shim (Contributor) left a comment:

Why don't we use the gRPC error codes consistently in metrics as well, instead of tracking the mapped HTTP codes?

@@ -99,17 +100,18 @@ func (g *Metrics) HandleSuccessfulRequest(quorum string, blobBytes int, method s
}

// IncrementFailedBlobRequestNum increments the number of failed blob requests
-func (g *Metrics) IncrementFailedBlobRequestNum(quorum string, method string) {
+func (g *Metrics) IncrementFailedBlobRequestNum(statusCode string, quorum string, method string) {
A reviewer (Contributor) commented:

We should have these HTTP status codes typed somewhere and use that type instead of taking a raw string. Something like

type HttpStatusCode string

const (
	SuccessHttpStatusCode HttpStatusCode = "200"
	...
)

@jianoaix (author) replied:

The HTTP codes are already quite standard (arguably they are the standard).

The reviewer (Contributor) replied:

It is. My point isn't whether it's the standard or not; it's that we shouldn't take a raw string in this method. Why don't we use the existing constants defined in the standard library? https://go.dev/src/net/http/status.go
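For reference, leaning on the standard library constants rather than literal strings could look like the sketch below; this is only an illustration of the suggestion, and as noted further down the PR ends up switching to gRPC codes instead.

package metrics

import (
	"net/http"
	"strconv"
)

// statusLabel converts a net/http status constant into the string label
// attached to the counter, e.g. statusLabel(http.StatusTooManyRequests) == "429".
func statusLabel(code int) string {
	return strconv.Itoa(code)
}

var (
	okLabel          = statusLabel(http.StatusOK)              // "200"
	rateLimitedLabel = statusLabel(http.StatusTooManyRequests) // "429"
)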

@jianoaix (author) replied:

For the same reason that quorum isn't a uint8, which you don't even need a library for.

@jianoaix (author) commented Mar 8, 2024

> Why don't we use the gRPC error codes consistently in metrics as well, instead of tracking the mapped HTTP codes?

Either way can work; HTTP is better understood in general.

@jianoaix (author) commented:
Thinking about it again, it makes more sense to me to use the gRPC codes: these codes are not just used by us but also by users, who may communicate back with these errors, so monitoring with the same codes removes the extra mental translation. The HTTP codes can just be used as documentation/annotation for understanding. Going to change it accordingly.
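A minimal sketch of that direction, assuming the metric label simply mirrors the gRPC status of the returned error; this illustrates the idea, not the PR's exact code.

package metrics

import "google.golang.org/grpc/status"

// errorLabel derives the metric label from the gRPC status of an error, so
// the label matches the code the client actually observes, e.g.
// "InvalidArgument", "ResourceExhausted", or "OK".
func errorLabel(err error) string {
	// status.Code returns codes.OK for a nil error and codes.Unknown for
	// errors that carry no gRPC status.
	return status.Code(err).String()
}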

@jianoaix requested a review from siddimore on March 15, 2024 at 17:57
@jianoaix (author) commented:

Ping, this PR needs to make progress.

@jianoaix merged commit eead2c3 into Layr-Labs:master on Mar 19, 2024
5 checks passed