VA: Add a method for performing MPIC compliant challenge validation #7794
Conversation
Perspective: perspective,
Rir: rir,
I have not handled these fields being present in pbToValidationResult() below. Ultimately I believe this is fine as we don't depend on them, but it could be a little surprising.
This change does two things:
- Add logging and threshold checking required by MPIC
- Reimplement VA.PerformValidation as VA.ValidateChallenge, with the important distinction that VA.ValidateChallenge does not check CAA.
I remember we discussed during standup last week some challenges around multi-perspective CAA rechecking that led to (2), but I think we forgot to write down the details. I did my best to write down what I remember of it: #7808.
Looking at (2) in this PR, I'm concerned about how much near-duplication of code it results in. I'd rather do some refactoring to implement (1) in the existing code, and treat (2) as separate followup change - or possibly take a different approach entirely, like the alternatives mentioned in #7808.
If we do pursue (2) as a followup change, my goal would be to reuse as much code as possible between the two RPC methods, so we have less code, and simpler code, to reason about.
pass = "pass" | ||
fail = "fail" |
In the existing VA metrics we use "valid"/"invalid" (taken from challenge statuses). Any reason not to continue that trend?
We're using the same histogram for ValidateChallenge and CheckCAA, so I picked "pass"/"fail" because it's more general and can apply to both.
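For reference, a minimal sketch of how such a shared histogram could be declared, with the generic "result" label carrying "pass"/"fail" for both operations. The metric name and buckets below are placeholders, not necessarily what the PR registers:

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: one latency histogram shared by ValidateChallenge and CheckCAA,
// distinguished by the "operation" label and summarized by a generic "result" label.
var validationLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "validation_latency_seconds", // placeholder name
		Help:    "Latency of validation operations, labeled by operation, perspective, and outcome.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"operation", "perspective", "challenge_type", "problem_type", "result"},
)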
func (va *ValidationAuthorityImpl) observeLatency(op, perspective, challType, probType, result string, latency time.Duration) {
	labels := prometheus.Labels{
		"operation":      op,
		"perspective":    perspective,
		"challenge_type": challType,
		"problem_type":   probType,
		"result":         result,
	}
	va.metrics.validationLatency.With(labels).Observe(latency.Seconds())
}
This function doesn't add much vs calling Observe() directly, other than moving from named fields (across multiple lines) to positional fields in the function call. While this allows the call sites to use a single line, it makes it less obvious at the call site that the parameters are correct (also, the same result could be achieved by using .WithLabelValues(op, perspective, challType, ...)).
The two call sites look like this:
va.observeLatency(challenge, va.perspective, string(chall.Type), probType, outcome, localLatency)

if va.isPrimaryVA() {
	// Observe total validation latency (primary+remote).
	va.observeLatency(challenge, all, string(chall.Type), probType, outcome, va.clk.Since(start))
}
Since several of those are the same, let's use the cool .CurryWith() method to reduce duplication while still keeping the clarity of naming the labels inline:
// MustCurryWith is the panicking form of CurryWith; it pre-binds the shared labels.
hist := va.metrics.validationLatency.MustCurryWith(prometheus.Labels{
	"operation":      challenge,
	"challenge_type": string(chall.Type),
	"problem_type":   probType,
	"result":         outcome,
})
hist.With(prometheus.Labels{"perspective": va.perspective}).Observe(localLatency.Seconds())
if va.isPrimaryVA() {
	hist.With(prometheus.Labels{"perspective": all}).Observe(va.clk.Since(start).Seconds())
}
#7799 adds two more call sites. This helper keeps the label boilerplate out of the way. It is less likely to result in mislabeling or missing a label altogether than .WithLabelValues(). The .CurryWith() suggestion is a nice trick though! I had forgotten that existed.
We talked about ways to extract out some of the non-core parts of this change into their own changes so the important stuff is more readily visible. Some possibilities:

In the current code we have:
In the new code we have:
That is, renaming
The current code counts
The old code has:
The new code has:
Again, a fine change but one we can make in the existing
In va_test.go there is a rename of
This PR introduces the new
if remoteVACount < 3 {
	return mpicSummary{}, probs.ServerInternal("Insufficient remote perspectives: need at least 3")
}
Since the number of remoteVAs is determined at construction time, it seems better to put this check in NewValidationAuthorityImpl.
Agreed, it should eventually move there. Today that change would make remote VAs a requirement to run Boulder starting in the next release.
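For illustration, a sketch of the kind of guard that could live in NewValidationAuthorityImpl once remote VAs become a hard requirement; the helper name is hypothetical and only shows the shape of the check:

// Sketch only: fail construction, rather than every RPC, when a primary VA is
// configured with too few remote perspectives for MPIC.
func checkRemotePerspectiveCount(remoteVAs []RemoteVA) error {
	if len(remoteVAs) > 0 && len(remoteVAs) < 3 {
		return fmt.Errorf("insufficient remote perspectives: got %d, need at least 3", len(remoteVAs))
	}
	return nil
}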
Bring this code more in line with `VA.remoteDoDCV` in #7794. This should make these two easier to diff in review.
Note: not a complete review, I just wanted to get some notes down before I go get the kid.

--- /tmp/old.txt 2024-11-15 14:57:20.746779025 -0800
+++ /tmp/new.txt 2024-11-15 14:58:03.728594914 -0800
@@ -1,12 +1,11 @@
-func (va *ValidationAuthorityImpl) performRemoteValidation(
- ctx context.Context,
- req *vapb.PerformValidationRequest,
-) *probs.ProblemDetails {
+func (va *ValidationAuthorityImpl) remoteDoDCV(ctx context.Context, req *vapb.DCVRequest) (mpicSummary, *probs.ProblemDetails) {
+ // Mar 15, 2026: MUST implement using at least 3 perspectives
+ // Jun 15, 2026: MUST implement using at least 4 perspectives
+ // Dec 15, 2026: MUST implement using at least 5 perspectives
remoteVACount := len(va.remoteVAs)
- if remoteVACount == 0 {
- return nil
+ if remoteVACount < 3 {
+ return mpicSummary{}, probs.ServerInternal("Insufficient remote perspectives: need at least 3")
}
-
type response struct {
addr string
result *vapb.ValidationResult
@@ -15,34 +14,31 @@
responses := make(chan *response, remoteVACount)
for _, i := range rand.Perm(remoteVACount) {
- go func(rva RemoteVA, out chan<- *response) {
- res, err := rva.PerformValidation(ctx, req)
- out <- &response{
- addr: rva.Address,
- result: res,
- err: err,
- }
- }(va.remoteVAs[i], responses)
+ go func(rva RemoteVA) {
+ res, err := rva.DoDCV(ctx, req)
+ responses <- &response{rva.Address, res, err}
+ }(va.remoteVAs[i])
}
- required := remoteVACount - va.maxRemoteFailures
var passed []string
var failed []string
+ passedRIRs := make(map[string]struct{})
+
var firstProb *probs.ProblemDetails
+ for i := 0; i < remoteVACount; i++ {
+ resp := <-responses
- for resp := range responses {
var currProb *probs.ProblemDetails
-
if resp.err != nil {
// Failed to communicate with the remote VA.
failed = append(failed, resp.addr)
-
- if canceled.Is(resp.err) {
- currProb = probs.ServerInternal("Remote PerformValidation RPC canceled")
+ if errors.Is(resp.err, context.Canceled) {
+ currProb = probs.ServerInternal("Secondary domain validation RPC canceled")
} else {
- va.log.Errf("Remote VA %q.PerformValidation failed: %s", resp.addr, resp.err)
- currProb = probs.ServerInternal("Remote PerformValidation RPC failed")
+ va.log.Errf("Remote VA %q.ValidateChallenge failed: %s", resp.addr, resp.err)
+ currProb = probs.ServerInternal("Secondary domain validation RPC failed")
}
+
} else if resp.result.Problems != nil {
// The remote VA returned a problem.
failed = append(failed, resp.result.Perspective)
@@ -50,37 +46,53 @@
var err error
currProb, err = bgrpc.PBToProblemDetails(resp.result.Problems)
if err != nil {
- va.log.Errf("Remote VA %q.PerformValidation returned malformed problem: %s", resp.addr, err)
- currProb = probs.ServerInternal("Remote PerformValidation RPC returned malformed result")
+ va.log.Errf("Remote VA %q.ValidateChallenge returned a malformed problem: %s", resp.addr, err)
+ currProb = probs.ServerInternal("Secondary domain validation RPC returned malformed result")
}
+
} else {
// The remote VA returned a successful result.
passed = append(passed, resp.result.Perspective)
+ passedRIRs[resp.result.Rir] = struct{}{}
}
if firstProb == nil && currProb != nil {
// A problem was encountered for the first time.
firstProb = currProb
}
+ }
+
+ // Prepare the summary, this MUST be returned even if the validation failed.
+ summary := prepareSummary(passed, failed, passedRIRs, remoteVACount)
- if len(passed) >= required {
- // Enough successful responses to reach quorum.
- return nil
+ maxRemoteFailures := maxValidationFailures(remoteVACount)
+ if len(failed) > maxRemoteFailures {
+ // Too many failures to reach quorum.
+ if firstProb != nil {
+ firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+ return summary, firstProb
}
- if len(failed) > va.maxRemoteFailures {
- // Too many failed responses to reach quorum.
- firstProb.Detail = fmt.Sprintf("During secondary validation: %s", firstProb.Detail)
- return firstProb
+ return summary, probs.ServerInternal("Secondary domain validation failed due to too many failures")
+ }
+
+ if len(passed) < (remoteVACount - maxRemoteFailures) {
+ // Too few successful responses to reach quorum.
+ if firstProb != nil {
+ firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+ return summary, firstProb
}
+ return summary, probs.ServerInternal("Secondary domain validation failed due to insufficient successful responses")
+ }
- // If we somehow haven't returned early, we need to break the loop once all
- // of the VAs have returned a result.
- if len(passed)+len(failed) >= remoteVACount {
- break
+ if len(passedRIRs) < 2 {
+ // Too few successful responses from distinct RIRs to reach quorum.
+ if firstProb != nil {
+ firstProb.Detail = fmt.Sprintf("During secondary domain validation: %s", firstProb.Detail)
+ return summary, firstProb
}
+ return summary, probs.Unauthorized("Secondary domain validation failed to receive enough corroborations from distinct RIRs")
}
- // This condition should not occur - it indicates the passed/failed counts
- // neither met the required threshold nor the maxRemoteFailures threshold.
- return probs.ServerInternal("Too few remote PerformValidation RPC results")
+ // Enough successful responses from distinct RIRs to reach quorum.
+ return summary, nil
}

Now, with some of the formatting and naming changes merged:

out <- &response{
	addr:   rva.Address,
	result: res,
	err:    err,
}

Became:

responses <- &response{rva.Address, res, err}

Poking around, it looks like the latter is the more common style we use for this pattern, so let's merge it upstream as well. Also, the "Remote PerformValidation RPC" to "Secondary domain validation" error message changes didn't wind up making it into the backports.

Similarly:

for resp := range responses {

Became:

for i := 0; i < remoteVACount; i++ {

I think the former is more appropriate in this case.

if canceled.Is(resp.err) {

Became:

if errors.Is(resp.err, context.Canceled) {

The documentation for

Other than that, the diff here seems to do what I expect. Could you update the PR description to describe how
That last one is probably implicit in the ballot summary text but I think it's useful to list it in the executive summary up top.
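On the canceled.Is vs errors.Is point above: assuming the old helper also treated a gRPC Canceled status as cancellation, the behavioral difference looks roughly like the sketch below (this helper is an illustration, not Boulder's actual code):

import (
	"context"
	"errors"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isCanceledSketch matches both a plain context cancellation and a cancellation
// reported as a gRPC status, which is broader than errors.Is(err, context.Canceled).
func isCanceledSketch(err error) bool {
	return errors.Is(err, context.Canceled) || status.Code(err) == codes.Canceled
}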
Since this is the PR where we're adding enforcement of MPIC-related constraints, there are some useful checks to add:
- There are no duplicate perspectives.
- If a VA considers itself remote (i.e. no backends), its perspective and RIR are non-empty. I'm sure these can also be expressed in the config validation language, but IMO it doesn't hurt to check them in the constructor as well.
- If a VA considers itself remote, its Perspective is not PrimaryPerspective. Note: this will run into some trouble in prod because we are using boulder va, not yet boulder-remoteva.
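To make that concrete, here is a rough sketch of such constructor checks, assuming each RemoteVA carries a Perspective field and that a VA with no remote backends considers itself remote (the helper name and field access are illustrative):

// Sketch only: construction-time sanity checks for MPIC configuration.
func checkMPICConfig(remoteVAs []RemoteVA, perspective, rir string) error {
	// No duplicate perspectives among the configured remotes.
	seen := make(map[string]bool)
	for _, rva := range remoteVAs {
		if seen[rva.Perspective] {
			return fmt.Errorf("duplicate remote perspective %q", rva.Perspective)
		}
		seen[rva.Perspective] = true
	}
	if len(remoteVAs) == 0 {
		// This VA considers itself remote: it must identify itself...
		if perspective == "" || rir == "" {
			return errors.New("remote VA requires a non-empty perspective and RIR")
		}
		// ...and must not claim to be the primary.
		if perspective == PrimaryPerspective {
			return errors.New("remote VA must not use the primary perspective")
		}
	}
	return nil
}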
// setup returns an in-memory VA and a mock logger. The default resolver client
// is MockClient{}, but can be overridden.
func setupVA(srv *httptest.Server, ua string, rvas []RemoteVA, mockDNSClient bdns.Client) (*ValidationAuthorityImpl, *blog.Mock) {
This function looks like a copy of setup(), which is still retained in this file. I copied the setup() from the main branch into old.txt and diffed it against setupVA() on this branch.
--- /tmp/old.txt 2024-11-15 16:58:07.334618542 -0800
+++ /tmp/new.txt 2024-11-15 16:57:49.670734507 -0800
@@ -1,29 +1,28 @@
-func setup(srv *httptest.Server, maxRemoteFailures int, userAgent string, remoteVAs []RemoteVA, mockDNSClientOverride bdns.Client) (*ValidationAuthorityImpl, *blog.Mock) {
+func setupVA(srv *httptest.Server, ua string, rvas []RemoteVA, mockDNSClient bdns.Client) (*ValidationAuthorityImpl, *blog.Mock) {
features.Reset()
fc := clock.NewFake()
- logger := blog.NewMock()
-
- if userAgent == "" {
- userAgent = "user agent 1.0"
+ mockLog := blog.NewMock()
+ if ua == "" {
+ ua = "user agent 1.0"
}
va, err := NewValidationAuthorityImpl(
- &bdns.MockClient{Log: logger},
+ &bdns.MockClient{Log: mockLog},
nil,
- maxRemoteFailures,
- userAgent,
+ 0,
+ ua,
"letsencrypt.org",
metrics.NoopRegisterer,
fc,
- logger,
+ mockLog,
accountURIPrefixes,
PrimaryPerspective,
- "",
+ "ARIN",
)
- if mockDNSClientOverride != nil {
- va.dnsClient = mockDNSClientOverride
+ if mockDNSClient != nil {
+ va.dnsClient = mockDNSClient
}
// Adjusting industry regulated ACME challenge port settings is fine during
@@ -37,8 +36,8 @@
if err != nil {
panic(fmt.Sprintf("Failed to create validation authority: %v", err))
}
- if remoteVAs != nil {
- va.remoteVAs = remoteVAs
+ if rvas != nil {
+ va.remoteVAs = rvas
}
- return va, logger
-}
+ return va, mockLog
+}
It looks like most of the changes here are renames. There's one substantive change: setupVA doesn't have a maxRemoteFailures parameter, but setup does. Turns out I actually made the same change to setup in #7810! Looks like we were thinking along the same lines. Given that, I think this diff can be reverted, and the new tests can just use setup().
Oh, and the other substantive change: making RIR unconditionally be ARIN. In https://github.com/letsencrypt/boulder/pull/7815/files#diff-38dce5cfdf5ee8afeb3994e69b1d3663bbd6ebf4067342bb175312854dfaf0cdR119-R124 I did something similar with perspective, with two differences: I set it to PrimaryPerspective if there were remoteVAs, and otherwise set it to a random value. I think we should do the same thing with RIR. It should be empty when simulating the primary, and a random value when simulating a remoteVA. That should work okay for general-purpose tests. I see below that for MPIC-specific tests, setupRVAs manually sets the field based on a test case, so we'll still get the behavior we want in terms of being able to set two instances of the same RIR.
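A small sketch of that suggested setup behavior for RIR, mirroring the perspective logic above (the helper name and the particular RIR strings are just for illustration):

// Sketch only: a VA with remote backends is the primary and gets no RIR; a VA
// without backends is simulating a remote and gets a random RIR.
func pickTestRIR(remoteVAs []RemoteVA) string {
	if len(remoteVAs) > 0 {
		return ""
	}
	rirs := []string{"ARIN", "RIPE", "APNIC", "LACNIC", "AFRINIC"}
	return rirs[rand.Intn(len(rirs))]
}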
For the renames: They don't seem critical to this PR, so my general advice is to not include them since they're distracting. If they're important, let's do them in another PR like we did for the other renames.
I was under the impression that we were porting all of this back into PerformValidation. Let's chat about this PR when you get in today.
// ua if set to pass, the remote VA will always pass validation. If set to
// fail, the remote VA will always fail validation with probs.Unauthorized.
// This is set to pass by default.
ua string
I think this comment is not quite correct. ua is set to "" by default, by the normal zero value rules. Looking in setupRVAs, if it receives an empty ua it will default to "user agent 1.0".
Suggested change:

-// ua if set to pass, the remote VA will always pass validation. If set to
-// fail, the remote VA will always fail validation with probs.Unauthorized.
-// This is set to pass by default.
-ua string
+// ua the user agent to be used by this remote VA. Different test cases use the
+// user agent string to accept or reject VAs selectively to create conditions for
+// testing. For instance, `TestDoDCVMPIC` accepts all requests with a user agent
+// of "pass".
+ua string
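As a toy illustration of what that comment describes (not Boulder's actual test harness), a challenge server can key its response off the requesting VA's user agent:

import (
	"net/http"
	"net/http/httptest"
	"strings"
)

// Sketch only: a test server that rejects requests from VAs configured with
// ua == "fail" and accepts everyone else, e.g. ua == "pass".
func newUAKeyedChallengeServer(keyAuthz string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(r.UserAgent(), "fail") {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.Write([]byte(keyAuthz))
	}))
}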
Previously this was a configuration field. Ports `maxAllowedFailures()` from `determineMaxAllowedFailures()` in #7794.

Test updates:

Remove the `maxRemoteFailures` param from `setup` in all VA tests. Some tests were depending on setting this param directly to provoke failures. For example, `TestMultiVAEarlyReturn` previously relied on "zero allowed failures". Since the number of allowed failures is now 1 for the number of remote VAs we were testing (2), the VA wasn't returning early with an error; it was succeeding! To fix that, make sure there are two failures. Since two failures from two RVAs wouldn't exercise the right situation, add a third RVA, so we get two failures from three RVAs.

Similarly, TestMultiCAARechecking had several test cases that omitted this field, effectively setting it to zero allowed failures. I updated the "1 RVA failure" test case to expect overall success and added a "2 RVA failures" test case to expect overall failure (we previously expected overall failure from a single RVA failing).

In TestMultiVA I had to change a test for `len(lines) != 1` to `len(lines) == 0`, because with more backends we were now logging more errors, and finding e.g. `len(lines)` to be 2.
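For context, a sketch of what a helper like `maxAllowedFailures()` plausibly computes, based on the ballot's quorum table (2-5 remote perspectives tolerate 1 non-corroboration, 6 or more tolerate 2); the PR's actual implementation may differ:

// Sketch only: allowed non-corroborations per the MPIC quorum table.
func maxAllowedFailures(remoteVACount int) int {
	switch {
	case remoteVACount < 2:
		return 0
	case remoteVACount < 6:
		return 1
	default:
		return 2
	}
}

With two remote VAs this returns 1, which is why `TestMultiVAEarlyReturn` needed a third RVA and two failures to trip the threshold.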
Add VA.ValidateChallenge, a new MPIC compliant gRPC method that will replace VA.PerformValidation for the validation of ACME challenges. A follow-up will add VA.CheckCAA and an RA feature flag that enables their use.
Remove Perspective and RIR from core.ValidationRecord. Storing these in each record didn't end up making sense in the final implementation.
Remove Perspective and RIR from the verificationRequestEvent used by VA.PerformValidation. We're not planning to make this method MPIC aware.
Part of #7615
Part of #7614
Part of #7616
Ballot Summary for Reviewers
You can read the full ballot contents here. I have pulled together a summary below:
3.2.2.9 Multi-Perspective Issuance Corroboration
This PR does not attempt to satisfy the aforementioned distance requirement. This will need to be satisfied as part of the datacenter selection process for perspectives.
These requirements are satisfied by this PR.
These requirements are not satisfied by this PR. The following code will need to be updated to reject validation requests when fewer than 4 (later 5) remote VAs are required.
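One hedged way that rejection could be phased in, using the deadlines quoted in the code comment earlier in this PR (the function and its reliance on wall-clock time are illustrative; a config value or feature flag may be a better fit):

// Sketch only: minimum remote perspectives required, phased in per the ballot.
func minRequiredPerspectives(now time.Time) int {
	switch {
	case now.After(time.Date(2026, time.December, 15, 0, 0, 0, 0, time.UTC)):
		return 5
	case now.After(time.Date(2026, time.June, 15, 0, 0, 0, 0, time.UTC)):
		return 4
	default:
		return 3
	}
}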
5.4.1 Types of events recorded
These requirements are satisfied by this PR.
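For 5.4.1, the recorded data presumably comes from the summary built by prepareSummary(passed, failed, passedRIRs, remoteVACount) in the diff above; a rough sketch of what such a logged summary might contain (field names here are guesses, not the PR's actual struct):

// Sketch only: a per-validation summary suitable for logging which perspectives
// were used, which corroborated, and whether quorum was met.
type mpicSummary struct {
	Passed       []string `json:"passedPerspectives"`
	Failed       []string `json:"failedPerspectives"`
	PassedRIRs   []string `json:"passedRIRs"`
	QuorumResult string   `json:"quorumResult"` // e.g. "4/5"
}

func prepareSummary(passed, failed []string, passedRIRs map[string]struct{}, total int) mpicSummary {
	rirs := make([]string, 0, len(passedRIRs))
	for rir := range passedRIRs {
		rirs = append(rirs, rir)
	}
	sort.Strings(rirs)
	return mpicSummary{
		Passed:       passed,
		Failed:       failed,
		PassedRIRs:   rirs,
		QuorumResult: fmt.Sprintf("%d/%d", len(passed), total),
	}
}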