Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding X.509 SPIFFE auth #14084

Closed
wants to merge 1 commit into from
Closed

Conversation

jgnagy
Copy link

@jgnagy jgnagy commented Sep 22, 2023

Description

Attempts to implement #14083.

This change adds a new gRPC server option that enables X.509 SVIDs for authentication.

Support for this auth plugin requires the use of --grpc_auth_mode=spiffe and --grpc_auth_spiffe_allowed_trust_domains (plus the existing --grpc_key, --grpc_cert, --grpc_ca, --vtctld_grpc_key, --vtctld_grpc_cert, --vtctld_grpc_ca parameters, used the same as the mtls auth mode).

When enabled, gRPC servers will expect that clients will provide valid X.509 Leaf SVIDs as client auth certificates. These are described in the SPIFFE documentation as:

A leaf certificate is an SVID that serves to identify a caller or resource and is suitable for use in authentication processes. A leaf certificate [...] is the only type that may serve to identify a resource or caller.

Leaf certificate SPIFFE IDs MUST have a non-root path component. The Subject field is not required, however, the URI SAN extension MUST be marked as critical if Subject is omitted, per section 4.1.2.6 of RFC 5280.

Its validation is further described as:

When validating an X.509 SVID for authentication purposes, the validator MUST ensure that the CA field in the basic constraints extension is set to false, and that keyCertSign and cRLSign are not set in the key usage extension. The validator must also ensure that the scheme of the SPIFFE ID is set to spiffe://. SVIDs containing more than one URI SAN MUST be rejected.

Thus an X.509 SVID is an mTLS client cert with specific features. Crafting them is fairly straightforward and there are several tools to help do this automatically.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

None that I'm aware of, other than a TLS infrastructure that can issue valid SVIDs.

@vitess-bot
Copy link
Contributor

vitess-bot bot commented Sep 22, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Sep 22, 2023
@github-actions github-actions bot added this to the v18.0.0 milestone Sep 22, 2023
@jgnagy
Copy link
Author

jgnagy commented Sep 22, 2023

I'll admit that I'm very new to developing Vitess (this is my first PR, after all), and I would not consider myself a Golang expert by any stretch.

I may have missed some things, so thanks in advance for your PR reviews!

@jgnagy jgnagy force-pushed the feat/grpc-spiffe-auth branch from ca5fdb9 to 6710cdb Compare September 22, 2023 23:34
go/vt/servenv/grpc_server_auth_spiffe.go Outdated Show resolved Hide resolved
}

for _, trustDomain := range trustedDomains {
if cert.URIs[0].Hostname() == trustDomain {
Copy link
Contributor

@mattrobenolt mattrobenolt Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, this is explicitly only validating the hostnames obviously, but realistically, would you not want to also validate against the path as well?

Typically, this also can be paired with a (excuse me since I don't fully know the terms here), some SPIRE server to phone home to to ask "is this URI allowed to do this", similar to OAuth.

Without that full implementation, is there any merit into only validating the hostname rather than the full path?

I can see a case of:

spiffe://example.com/client1, and spiffe://example.com/client2 and wanting to allow client1, but not allow client2.

I'm not entirely sure of all the use cases, but I believe in a lot of contexts, this translates to a per-user issuing a certificate for auth, or per-app, and you'd want to restrict to some apps/users, and not the others.

I think explicitly declaring we only validate the hostname, but I would assume there'd be a desire to do more and the implementation here would make that a bit harder to shoehorn in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm not an expert on this subject at all, but I tend to refer to Envoy for a golden standard with this configuration, and they seem to cover a lot more options as well as a SPIRE agent. https://spiffe.io/docs/latest/microservices/envoy-x509/readme/

If we're choosing to implement a subset of the feature set, I think that's personally fine, I just worry if we implement a subset, now others might want more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattrobenolt these are great questions/points. For SPIFFE, the entire URI forms the SPIFFE ID, which uniquely identifies a particular user/workload, so things like ACLs could be based on that ID. However, a collection of SPIFFE IDs can belong to a single trust domain (which usually corresponds to a CA, but doesn't necessarily have to). Given this, I think, for accepting connections at all, the thing we should validate is the trust domain, much like we're validating the CA that signed a client cert. This is mostly because we'd accept connections from many individual workloads and users all issued for the same trust domain.

The SPIFFE ID makes more sense to use in ACLs, where we might decide that certain SPIFFE IDs (perhaps some applications or users) should be considered writers, while other IDs are readers.

SPIRE is one implementation of the SPIFFE Workload API which makes it an issuer of SVIDs. It can attest to the identity of workloads and then provide them the tools to access protected resources and validate the IDs of other workloads. You could think of it as something like cert-manager + let's encrypt. SPIRE agents are used to create and automatically rotate workload certs and provide the CA bundle for verifying both sides of an mTLS connection. SPIRE is not used as an API to query information about other workloads. Further, SPIRE is not required to use SPIFFE as long as you have some issuer for SVIDs.

@deepthi
Copy link
Member

deepthi commented Sep 25, 2023

@jgnagy can you provide a concrete motivation for implementing this at this time? Are you adopting Vitess for possible production use?
I did read the linked issue, but that doesn't address my question.

@jgnagy
Copy link
Author

jgnagy commented Sep 26, 2023

[C]an you provide a concrete motivation for implementing this at this time?

Yes, though it'll take a bit of explaining. The current mTLS approach in Vitess is limited. In particular, it uses SANs in an unexpected/non-standard way. Typically, subjectAltNames (SANs) are alternative ways of referring to the same certificate subject. Vitess is using them as a way of providing group information. Beyond that, we're using a substring search that seems like it would easy to work around or collide (since CN=foo would match both CN=foo and CN=foobar). These are symptoms of using mTLS, which in general lacks well-defined standards beyond standard TLS verification.

SPIFFE allows us to defer to a real, well-defined standard for identifying services distributed across providers/clusters/etc., which is precisely what we need for authentication. This is a better form of mTLS that we could rely on working the same wherever it is used.

For my use-case, I would expect a SVID (SPIFFE Verifiable Identity Document) issuer to be created for a Vitess Cluster. The specifics of how this issuer works isn't important, other than it could attest to workloads that need to access various parts of the cluster.

Let's call these components "application", "vtgate", and "vttablet". Let's describe SVIDs issued for these components:

  • The application could be issued spiffe://some.vitess.cluster/applications/some.app
  • The vtgate could be issued spiffe://some.vitess.cluster/vtgates/some.vtgate
  • The vttablet could be issued spiffe://some.vitess.cluster/vttablets/some.vttablet

When issued SVIDs, the components also received the SPIFFE bundle (CA bundle for the issuer). Given this, the application can connect to the vtgate, where the vtgate will recognize the issuer and will be able to see that it is for the allowed trust domain. We could make the vtgate extract the ID of the application caller and pass it via the context to the vttablet.

The vtgate would connect to the vttablet, authenticating via its SVID. The same thing takes place, where the vttablet recognizes the issuer and sees that the SVID is for an allowed trust domain. The vtgate could pass through the effective caller ID, allowing the vttablet to use ACLs based on the application's SPIFFE ID.

This would mean authentication end-to-end, support ACLs, and no passwords required, much like the mTLS authentication tooling, but using a widely-used implementation.

Note: I could definitely add more to this PR. I plan on adding another commit to add the caller effective ID pass-through I described above. We could also add support for requiring certain path components of SPIFFE IDs, but this isn't required for the use-case I'd like to demonstrate.

Are you adopting Vitess for possible production use?

Yes, definitely. I work at Shopify with Vitess regularly as a part of our DB team. That said, I wanted to implement this as a personal PR since I haven't yet proposed the SPIFFE-based approach at work.

This change adds a new gRPC server option that enables X.509 SVIDs for authentication between gRPC clients and servers.

Signed-off-by: Jonathan Gnagy <[email protected]>
@jgnagy jgnagy force-pushed the feat/grpc-spiffe-auth branch from 6710cdb to f01abb2 Compare September 26, 2023 04:31
@jgnagy jgnagy changed the title Adding X.509 SPIFFE auth to gRPC server Adding X.509 SPIFFE auth Sep 26, 2023
@deepthi
Copy link
Member

deepthi commented Sep 26, 2023

I wanted to implement this as a personal PR since I haven't yet proposed the SPIFFE-based approach at work.

This raises some concerns. We've had issues in the past (like any large open-source project) where we accept contributions which are ultimately not used by anyone in production and start bit-rotting. At the same time the original PR author is no longer active, and it falls to the maintainer team to keep things up to date with security issues, golang changes etc.
When proposing anything new, it is much better if there is an actual user that is planning to use that feature soon-ish.
The order I would suggest for this feature is not "Hey Vitess already supports this, let's use it". The way to do it would be to first convince Shopify that this is worth doing and then implement it in Vitess.

Having said that, I do think it will be a good addition to the project. I can say right now that if it is done with Shopify's backing we will almost certainly accept it.

@jgnagy
Copy link
Author

jgnagy commented Sep 26, 2023

We've had issues in the past (like any large open-source project) where we accept contributions which are ultimately not used by anyone in production and [...] it falls to the maintainer team to keep things up to date

@deepthi I completely understand this sentiment. Based on your comment, I had a preliminary conversation with other members of my team and I'll push this to our internal fork so I can put together a proof-of-concept and demonstrate it end-to-end. Happily, as it stands, there is at least some interest in using the feature. I'll explore it more there for now.

While I do have confidence that we'll use this (including in production), it might take some time before we get all the way there. In the meantime, I'd like to keep this PR open if possible, in case there is any other interest in the feature. Does that work?

@frouioui frouioui modified the milestones: v18.0.0, v19.0.0 Sep 29, 2023
@github-actions
Copy link
Contributor

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period. label Oct 30, 2023
Copy link
Contributor

github-actions bot commented Nov 6, 2023

This PR was closed because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says Stale Marks PRs as stale after a period of inactivity, which are then closed after a grace period.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants