Adds OIDC authentication support for pubsub Apache Pulsar #3023
Conversation
This PR should not have been merged this quickly…
			return
		case <-ctx.Done():
			return
		case <-c.clock.After(renewDuration):
This is an anti-pattern for using OAuth and we should not refresh tokens in the background. Instead, tokens should be fetched lazily and refreshed only if necessary. This is especially true when the consumer of the tokens is something like Pulsar, where connections are long-lived, so tokens don't need to be refreshed constantly.
Why is this an anti-pattern? I have never heard of this. Fetching tokens lazily is strictly worse from the Dapr perspective, increasing latency for the pubsub app, which is particularly acute for publishing. I don't think we should be concerned with reducing load on the OIDC provider; signing JWTs is not resource intensive and is basically free for all intents and purposes. Regardless of long-lived connections etc., we can assume that lazily requesting tokens will produce roughly the same number of requests as renewing in the background for a publishing app (bar the renewal time we agree on), however fetching tokens lazily will increase the latency on the unlucky publish(es).
AFAICT Dapr maintains persistent connections to Pulsar. I see that for publishing connections too: https://github.dev/JoshVanL/components-contrib/blob/19e17528a2c2a4a3098e53fad019548402a62e46/pubsub/pulsar/pulsar.go#L254-L255
Since connections are persisted, it's possible that we may not even need to refresh tokens ever for the lifetime of the sidecar, unless the connection is dropped. So refreshing automatically "just in case" seems really wasteful.
Yes, signing a JWT is "free" for the IdP, but doing it when we know we don't need it is not polite. It may contribute to hitting rate limits. It can also create additional noise in the audit logs that are sent to SIEMs, etc.
> So refreshing automatically "just in case" seems really wasteful.
Fair enough, I still disagree with this statement though. I maintain that the latency improvements for daprd far outweigh the small amount of resources used to fetch more tokens than perhaps needed.
> It may contribute to hitting rate limits. It can also create additional noise in the audit logs that are sent to SIEMs, etc.
An IdP which rate limits clients for requesting tokens at half expiry is misconfigured IMO, and audit logs are built around the concept of clients renewing their identity documents. They will be indexed and filtered around this exact scenario.
	c.lock.RLock()
	defer c.lock.RUnlock()

	if c.closed.Load() {
Here is where you check whether the token that's in the cache is valid, and if not, you lazily request a new one.
Since multiple goroutines could be calling Token() at the same time, I recommend implementing some synchronization method (or use the "promise" library!).
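For illustration, a minimal sketch of what a lazy, synchronized fetch could look like. This is not code from the PR: the fetchToken helper and currentToken field are assumptions, and the method is assumed to live alongside the PR's ClientCredentials type with the standard context, fmt, and time packages plus golang.org/x/oauth2 available.

	// Sketch only: return the cached token if it is still valid, otherwise
	// fetch a new one. The mutex ensures only one goroutine hits the issuer
	// at a time.
	func (c *ClientCredentials) Token(ctx context.Context) (string, error) {
		c.lock.Lock()
		defer c.lock.Unlock()

		// Reuse the cached token while it still has a small safety margin left.
		if c.currentToken != nil && c.clock.Now().Add(time.Minute).Before(c.currentToken.Expiry) {
			return c.currentToken.AccessToken, nil
		}

		// Cache miss or token about to expire: lazily request a new token.
		tok, err := c.fetchToken(ctx) // assumed helper that calls the OIDC issuer
		if err != nil {
			return "", fmt.Errorf("could not fetch oidc token: %w", err)
		}
		c.currentToken = tok
		return tok.AccessToken, nil
	}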
func (c *ClientCredentials) tokenRenewDuration() time.Duration {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.currentToken.Expiry.Sub(c.clock.Now()) / 2
No need to halve the time, that's wasteful. Maybe subtract 1 min, but since tokens are used right away there's no need to halve this.
How about 2/3rds? I think 1 minute is far too short; node migrations in kube can take longer than a minute, for example.
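For illustration, the two-thirds variant would only change the returned duration (a sketch, not the PR's code):

	// Renew after two thirds of the remaining lifetime has elapsed, keeping
	// the final third as a buffer in case the issuer is temporarily unreachable.
	func (c *ClientCredentials) tokenRenewDuration() time.Duration {
		c.lock.RLock()
		defer c.lock.RUnlock()
		return c.currentToken.Expiry.Sub(c.clock.Now()) * 2 / 3
	}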
Most IdPs I'm familiar with issue tokens with a lifetime of 1h by default. Cutting 30 or 20 mins out of that seems unnecessary.
5 mins?
Why are Kubernetes node migrations a concern here?
It's very common to be running an internal provider in the same (or another) cluster. A node migration is an example scenario where the issuer server becomes unavailable for a non-trivial amount of time; however, the system should be able to cope with that as best it can.
I like keeping the renewal time a function of the token lifetime, as the operator would have chosen the expiry based on their deployment constraints.
> It's very common to be running an internal provider in the same (or another) cluster. A node migration is an example scenario where the issuer server becomes unavailable for a non-trivial amount of time; however, the system should be able to cope with that as best it can.
That is an operational thing that should not be a concern of Dapr's. If an ops person runs their own IdP, they are responsible for ensuring its availability.
> If an ops person runs their own IdP, they are responsible for ensuring its availability.
I agree, but it is standard practice as a client to have enough buffer time in the validity of your identity documents to maximize uptime if the provider goes down. Using the document validity duration is a great metric to choose that buffer time, rather than using an arbitrary static value. The ops person will have chosen that duration for a reason, and it is a great signal to act on.
	RedeliveryDelay      time.Duration             `mapstructure:"redeliveryDelay"`
	internalTopicSchemas map[string]schemaMetadata `mapstructure:"-"`
	PublicKey            string                    `mapstructure:"publicKey"`
	PrivateKey           string                    `mapstructure:"privateKey"`
	Keys                 string                    `mapstructure:"keys"`

	AuthType string `mapstructure:"authType"`
Generally when a component has multiple auth mechanisms, we do not force users to specify one unless that's strictly necessary.
In this case, it is not necessary.
When the component is initialized, you can check if token is present and assume the auth type is a shared token. Otherwise, check if the OIDC credentials are present.
If neither is present, return an error.
This is more aligned with what we do for almost all other components, and saves users from having another metadata option.
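A rough sketch of the suggested detection at init time; the metadata field names used here (Token, OAuth2TokenURL, OAuth2ClientID) are illustrative assumptions rather than the PR's actual struct:

	// Sketch: infer the auth mechanism from which metadata fields are set,
	// instead of requiring an explicit authType key.
	func inferAuthType(m pulsarMetadata) (string, error) {
		switch {
		case m.Token != "":
			return "token", nil
		case m.OAuth2TokenURL != "" || m.OAuth2ClientID != "":
			return "oidc", nil
		default:
			return "", errors.New("no supported authentication credentials were provided")
		}
	}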
What makes this component different from how we handle auth on Kafka, for example? If we don't set the auth type explicitly, additional token auth mechanisms added in the future may conflict.
I am not sure about Kafka, but for example all Azure components automatically select shared token auth vs AzureAD depending on the metadata that's passed.
In the rare case of conflicts we have added one-off metadata keys, such as useAzureAD for Azure SQL.
My fear with not having an explicit auth field is that we end up with "magic config" schemas, which are both confusing and surprising to users. It wouldn't be clear to me what the behavior would be if I defined multiple auth types for example.
This is also particularly important for security folks who are responsible for auditing or writing policy over config. A clear auth type field gives them confidence that things are set up correctly and that the software is going to do what they expect.
		Audiences: m.OIDCAudiences,
	})
	if err != nil {
		return fmt.Errorf("could not instantiate oidc token provider: %v", err)
Total nit:
-		return fmt.Errorf("could not instantiate oidc token provider: %v", err)
+		return fmt.Errorf("could not instantiate oidc token provider: %w", err)
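The difference matters to callers: with %w the underlying error stays inspectable via the standard errors helpers. A hypothetical caller (p, ctx, and metadata are assumed names, not from the PR):

	// Hypothetical caller: errors.Is only matches the wrapped cause when the
	// provider error was created with %w rather than %v.
	if err := p.Init(ctx, metadata); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// e.g. treat issuer timeouts differently from other init failures
		}
	}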
)" This reverts commit af5bace.
@JoshVanL please re-open this PR after addressing the feedback above, thanks
)" This reverts commit af5bace.
Closes #2591
This PR adds support for OIDC authentication to the Pulsar pubsub component. This is configured via the following metadata fields:
Another field, authType, has been added which accepts none, token, and oidc. It defaults to none; however, if authType is undefined and the token metadata field has been defined, then it defaults to token to maintain backwards compatibility.

Adds a certification test which runs the current pulsar suite, but with OIDC authentication enabled. It uses pulsar v3 since that is the minimum version with OIDC authentication support, and runs a mock OIDC issuer server during the test. We template the CA PEM bundle, based on the self-signed certificate generated by the mock server, at test runtime.
This PR doesn't move the kafka OIDC implementation to use the shared internal package since kafka OIDC auth isn't tested. Once a certification test is written, we can move kafka to use the internal package.
Docs PR: dapr/docs#3655
Question: Because the pulsar certification tests now take some time running the suite in both non-OIDC and OIDC auth modes, do we want to split them into two separate workflows?
/cc @yaron2