
MacOS/Android AVD CI failures #68

Closed
cpu opened this issue Feb 28, 2024 · 10 comments

@cpu
Member

cpu commented Feb 28, 2024

It looks like the MacOS CI jobs running the Android virtual device tests started failing on main ~yesterday (one, two). Kicking the jobs doesn't resolve the failure, so it's probably not a transient flake.

> Task :rustls-platform-verifier:connectedDebugAndroidTest
Starting 3 tests on test(AVD) - 9

org.rustls.platformverifier.CertificateVerifierTests > runRealWorldTestSuite[test(AVD) - 9] FAILED 
	org.junit.ComparisonFailure: A test failed. Check the logs above for Rust panics. expected:<[success]> but was:<[test failed!]>
	at org.junit.Assert.assertEquals(Assert.java:115)
Tests on test(AVD) - 9 failed: There was 1 failure(s).

Unfortunately the mentioned Rust panic logs are not being passed through (similar to #27).

@complexspaces
Collaborator

This seems like it would be a good time to try and move the Android jobs to Linux runners instead. We would need to enable KVM as well to keep decent performance.

@cpu
Member Author

cpu commented Feb 29, 2024

Sounds reasonable 👍

@cpu
Member Author

cpu commented Mar 11, 2024

I was able to unbreak my local dev environment by pinning a specific android-ndk version (23.1.7779620), and when the tests are run locally the missing real world test suite panic logs do appear. It's the Let's Encrypt real world verification suite that seems to be failing in the Android VM (assuming my local env matches what happens in CI). The relevant part:

03-11 16:21:38.616  6704  6723 I rustls_platform_verif..: -----------------------------------------------------------------------------
03-11 16:21:38.616  6704  6731 I rustls_platform_verif..: verifying ref ID "letsencrypt.org" expected Ok(())
03-11 16:21:38.624  6704  6731 W rustls_platform_verif..: certificate was not trusted: java.security.cert.CertificateException: Chain validation failed
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..: failed to verify TLS certificate: invalid peer certificate: UnknownIssuer
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..: test panic: assertion `left == right` failed
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..:   left: Err(InvalidCertificate(UnknownIssuer))
03-11 16:21:38.624  6704  6731 E rustls_platform_verif..:  right: Ok(())
03-11 16:21:38.625  6704  6723 E rustls_platform_verif..: real world: test failed
03-11 16:21:38.625  6704  6723 I rustls_platform_verif..: -----------------------------------------------------------------------------

The unknown issuer error is curious since (AIUI) the Android trust store shouldn't have changed out from under us and the chain is baked into the test.

On a lark I tried running the test data update script to fetch an updated chain. Only the leaf EE cert changed, and after updating the mock verification time to be after the new EE cert's NotBefore, the test suite started passing for me locally. It also fixed the test in CI. Diffing the two EE certs I didn't spot any obvious changes between them beyond the things you would expect (validity period, SCTs, etc).

I'll put up a PR with the updated test data, but something feels fishy here. For one thing, the verification time and chain are both intended to be fixed, so there shouldn't be an expiry issue. Even if the time somehow isn't fixed in practice, why is it reporting an unknown issuer error instead of an expiry error? Lastly, why are logs not showing up in CI when they work locally? I think there's something more worth understanding here... Edit: it's also curious that the same tests w/ the same chain are passing on the other platforms.

@complexspaces
Collaborator

With the linked PRs merged, the only items remaining in this issue are to determine:

  • Why the mock verification time needed to change for CI to pass
  • Why it failed with Unknown Issuer

@cpu
Member Author

cpu commented Mar 12, 2024

Why the mock verification time needed to change for CI to pass

I think I understand that one: the new end-entity certificate has a NotBefore date that was after the previous fixed mock verification time:

[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ git rev-parse HEAD
d68c2edd4c5fb2aa4cde57f472de059e6966f64c

[daniel@blanc:~/Code/Rust/rustls-platform-verifier]$ openssl x509 -inform der -in rustls-platform-verifier/src/tests/verification_real_world/letsencrypt_org_valid_1.crt -noout -text | grep "Not Before"
            Not Before: Jan 27 21:30:22 2024 GMT

Our mock verification time was set to Wednesday, January 3, 2024 6:03:08 PM prior to the update, so the new EE leaf was treated as having been issued in the future and not yet valid.
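
To make this concrete, here's a minimal standalone Kotlin sketch of the check that trips (not our actual harness code; the timestamp assumes UTC):

import java.io.FileInputStream
import java.security.cert.CertificateFactory
import java.security.cert.CertificateNotYetValidException
import java.security.cert.X509Certificate
import java.util.Date

fun main() {
    // The vendored DER-encoded end-entity certificate from the test suite.
    val path = "rustls-platform-verifier/src/tests/verification_real_world/letsencrypt_org_valid_1.crt"
    val cert = FileInputStream(path).use { input ->
        CertificateFactory.getInstance("X.509").generateCertificate(input) as X509Certificate
    }

    // The old fixed mock verification time: Jan 3, 2024 18:03:08 (assuming UTC).
    val mockVerificationTime = Date(1_704_304_988_000L)

    // checkValidity(Date) compares the supplied instant against the cert's
    // NotBefore/NotAfter window; with the updated leaf (NotBefore Jan 27, 2024)
    // it throws CertificateNotYetValidException.
    try {
        cert.checkValidity(mockVerificationTime)
        println("valid at mock time")
    } catch (e: CertificateNotYetValidException) {
        println("not yet valid at mock time: $e")
    }
}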

Why it failed with Unknown Issuer

This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.

@cpu
Member Author

cpu commented Mar 12, 2024

Why it failed with Unknown Issuer

This is indeed still a mystery. I'm going to try and investigate deeper this afternoon.

It turns out there's a simple explanation for what we've seen.

The fixed verification time is used up front in a call to X509Certificate.checkValidity, but not referenced in the later call to X509TrustManager.checkServerTrusted. The catch is that checkServerTrusted also enforces certificate validity.

Since checkServerTrusted uses real wall-clock time, once the vendored Let's Encrypt certificate expired (~Feb 26th) it started throwing a java.security.cert.CertPathValidatorException for the expired Let's Encrypt end-entity certificate. The first failing cron-invoked CI job was on Feb 27th, so that tracks.

We get an unknown issuer error instead of a more sensible error about the expiry because while the earlier checkValidity call maps a caught exception into a StatusCode.Expired result, the catch for the checkServerTrusted call unconditionally converts it to a StatusCode.UnknownCert result:
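
Roughly, the flow looks like this (a paraphrased Kotlin sketch, not a verbatim excerpt from the verifier; the verify signature and the authType value are stand-ins):

import java.security.cert.CertificateException
import java.security.cert.X509Certificate
import java.util.Date
import javax.net.ssl.X509TrustManager

// Stand-in for the verifier's result codes mentioned above.
enum class StatusCode { Ok, Expired, UnknownCert }

fun verify(
    trustManager: X509TrustManager,
    chain: Array<X509Certificate>,
    verificationTime: Date?,
): StatusCode {
    // Step 1: the fixed verification time is honored here, so a leaf that is
    // expired (or not yet valid) at the mock time maps to StatusCode.Expired...
    try {
        if (verificationTime != null) {
            chain[0].checkValidity(verificationTime)
        }
    } catch (e: CertificateException) {
        return StatusCode.Expired
    }

    // Step 2: ...but checkServerTrusted re-validates the chain against real
    // wall-clock time, and any CertificateException is flattened to UnknownCert.
    return try {
        trustManager.checkServerTrusted(chain, "RSA") // authType is a placeholder
        StatusCode.Ok
    } catch (e: CertificateException) {
        // e.toString() here only says "Chain validation failed".
        StatusCode.UnknownCert
    }
}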

In this instance e.toString() is the message from the outermost generic CertificateException and just says "Chain validation failed". You'd need to dig deeper into the nested exceptions to see the underlying certificate expiry message.

The fixed verification time avoids the checkValidity catch that would return StatusCode.Expired and falls into the checkServerTrusted catch that always returns StatusCode.UnknownCert.

That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted :-/

@cpu
Member Author

cpu commented Mar 12, 2024

That resolves the remaining mysteries but leaves a question about the best fix. It doesn't look like we can set the trust manager's reference time for checkServerTrusted :-/

At the very least I think we could return a better error so that, in combination with the fixed CI logs, it's very obvious what the problem is (an expired test data certificate) and what the fix is (run the tooling to update the test data). In practice this part of the code won't surface the error outside of tests, since the earlier call to checkValidity will do the right thing before invoking the trust manager.
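
For example, something along these lines (a hedged sketch of one option, not a final implementation; mapTrustManagerFailure is a hypothetical helper reusing the StatusCode stand-in from the sketch above):

import java.security.cert.CertPathValidatorException
import java.security.cert.CertPathValidatorException.BasicReason
import java.security.cert.CertificateException
import java.security.cert.CertificateExpiredException

// Hypothetical helper: walk the nested causes of the CertificateException
// thrown by checkServerTrusted and surface expiry explicitly instead of
// always reporting an unknown certificate.
fun mapTrustManagerFailure(e: CertificateException): StatusCode {
    var cause: Throwable? = e
    while (cause != null) {
        val expired = cause is CertificateExpiredException ||
            (cause is CertPathValidatorException && cause.reason == BasicReason.EXPIRED)
        if (expired) {
            return StatusCode.Expired
        }
        cause = cause.cause
    }
    return StatusCode.UnknownCert
}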

@cpu
Member Author

cpu commented Mar 12, 2024

At the very least I think we could return a better error so that, in combination with the fixed CI logs, it's very obvious what the problem is

Here's one attempt at this: #75

I personally think it's worth landing this while we figure out a better solution to use a static verification time throughout, but I also acknowledge it's kind of gross 😿

Edit: probably #59 or similar to fix the verification time issue is a better long term solution.

@cpu
Member Author

cpu commented Mar 12, 2024

I'm going to close this now. I think the open PRs and #59 are probably enough to track any additional work here.

cpu closed this as completed Mar 12, 2024
@complexspaces
Collaborator

Sounds good, and thank you for diving into this. If nothing else, we've made some great Android testing improvements as a result, even if we haven't solved everything yet.
