-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: defork cloud-provider-azure #6947
chore: defork cloud-provider-azure #6947
Conversation
Welcome @comtalyst! |
Hi @comtalyst. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
) | ||
|
||
// DeploymentsClient defines needed functions for azure network.DeploymentsClient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated it back to the one provided by cloud-provider-azure
. Don't see any breaking things so far, but haven't test it yet.
interfacesClient interfaceclient.Interface | ||
disksClient diskclient.Interface | ||
storageAccountsClient storageaccountclient.Interface | ||
skuClient compute.ResourceSkusClient | ||
} | ||
|
||
// newServicePrincipalTokenFromCredentials creates a new ServicePrincipalToken using values of the | ||
// passed credentials map. | ||
func newServicePrincipalTokenFromCredentials(config *Config, env *azure.Environment) (*adal.ServicePrincipalToken, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated back to cloud-provider-azure
.
|
||
return nil, fmt.Errorf("no credentials provided for AAD application %s", config.AADClientID) | ||
} | ||
|
||
func newAuthorizer(config *Config, env *azure.Environment) (autorest.Authorizer, error) { | ||
switch config.AuthMethod { | ||
case authMethodCLI: | ||
return auth.NewAuthorizerFromCLI() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This CLI authentication is the only real blocker for the near-perfect de-forking of cloud-provider-azure
uses here. Its deprecation (or alternatively, supporting it in cloud-provider-azure
) could be something up for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's propose supporting it in cloud-provider-azure?
"sigs.k8s.io/cloud-provider-azure/pkg/retry" | ||
) | ||
|
||
const ( | ||
vmTypeVMSS = "vmss" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated back to same constants provided by cloud-provider-azure
.
SubscriptionID string `json:"subscriptionId" yaml:"subscriptionId"` | ||
ResourceGroup string `json:"resourceGroup" yaml:"resourceGroup"` | ||
VMType string `json:"vmType" yaml:"vmType"` | ||
providerazure.Config `json:",inline" yaml:",inline"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated most of the config back to cloud-provider-azure
.
However, this, along with other user-facing interfaces that got delegated back to cloud-provider-azure
will be more hideous from the user. For example, if VM type string "vmss"
changes to "vmscalesets"
on the newer version of cloud-provider-azure
, it won't be visible to the user that might not be aware of the usage of cloud-provider-azure
.
Well-covered unit testing would definitely help still. But it might even be difficult for the maintainers to keep track of the true size of the exposed interface by the library (e.g., through this Config
and cloud-config-file
which will be unmarshalled into this struct).
Or, maybe it is by design that the user is supposed to know the existence of this module and can configure them? If so, is it intended for the maintainer to not having to worry about those interfaces, and treat them as black boxes?
Anyway, in this case (and likely in most cases--in practice), although not ideal, I think it is worth it to delegate it back given that changes are really rare, the size of the interface is not too large to have a proper coverage/know what to care about, and there should be a proper deprecation procedures if something changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"more hideous" -> "more hidden / less visible / less discoverable"? (And I don't think this is a major issue; can be addressed by documentation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may also be worth capturing in some way whatever guarantees are intended regarding compatibility of this config with cloud-provider-azure config serialization. Though answer is probably just that this one is a superset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"more hideous" -> "more hidden / less visible / less discoverable"?
Yes. And also less aware of changes.
can be addressed by documentation
How would you imagine the documentation to be like?
Something like A) "we are using cloud-provider-azure, go see their repo for how to use it" or B) "here are the configs" then 'fork' the docs from cloud-provider-azure?
A) would be easier to be maintained, but additional burden to the user. B) would be like a fork to us.
If you ask me I would prefer A). In practice it wouldn't hurt the experience that much given that it seems to always be one-layer.
(also discussed with cloud-provider-azure maintainers--we are allowed to import the config struct)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A) makes sense to me, since cloud-provider-azure
should be keeping their docs up to date (doesn't make sense for us to duplicate the work here).
@@ -42,27 +43,9 @@ const ( | |||
|
|||
imdsServerURL = "http://169.254.169.254" | |||
|
|||
// backoff | |||
backoffRetriesDefault = 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegated back to cloud-provider-azure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have we confirmed that the cloud-provider-azure defaults are the same as those listed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I did.
Also considering about having unit tests to guarantee them in a long run, but may not be necessary if we rely on user to check cloud-provider-azure directly.
if _, err = assignFromEnvIfExists(&cfg.UserAssignedIdentityID, "ARM_USER_ASSIGNED_IDENTITY_ID"); err != nil { | ||
return nil, err | ||
} | ||
if _, err = assignIntFromEnvIfExists(&cfg.VmssCacheTTLInSeconds, "AZURE_VMSS_CACHE_TTL"); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the fields that have a difference in naming (cloud-provider-azure
version has "InSeconds"). Although we keep the environment variables the same to not introduce breaking changes. Not that environment variable name needs to be changed either, given the minimal change. Changed my mind--I introduced both the new one along with it.
case authMethodCLI: | ||
// Nothing to check at the moment. | ||
default: | ||
return fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod) | ||
} | ||
|
||
if cfg.CloudProviderBackoff && cfg.CloudProviderBackoffRetries == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be unintentionally(?) neglected when UseManagedIdentityExtension
is true
, which should be completely unrelated. Current changes should fix this bug.
d07dbf2
to
24d4667
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like where this is going, don't see any blockers (though would need another round of close review). Would be good to find a way to prove there are no regressions - to what extent do you think the tests cover this?
For config changes, the existing unit tests could be a prove that the delegations are not breaking interface-wise. More unit tests capturing more corner cases will be added as well. I have confidence in those so far. For service principal token, I think the E2Es are currently the best bet. Some manual long-running tests might help as well. |
db0fb68
to
1b54748
Compare
1b54748
to
ecd0123
Compare
ecd0123
to
2b40ce5
Compare
2b40ce5
to
7e9172d
Compare
/test pull-cluster-autoscaler-e2e-azure |
@comtalyst: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
7e9172d
to
4830b17
Compare
/ok-to-test |
4830b17
to
811f93c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, mostly minor quips/questions.
May also need to check if updates/renames to config/environment variables warrant any updates to readme and/or charts (could be a follow-up).
cluster-autoscaler/cloudprovider/azure/azure_cloud_provider_test.go
Outdated
Show resolved
Hide resolved
if cfg.VMType != providerazureconsts.VMTypeStandard && cfg.VMType != providerazureconsts.VMTypeVMSS { | ||
return fmt.Errorf("unsupported VM type: %s", cfg.VMType) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on VMType still says vmssflex value is supported. Also need to think through migration, if that's indeed how it was intended to be set, and is changing now (if anybody is using it to begin with)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (m *AzureManager) buildNodeGroupFromSpec(spec string) (cloudprovider.NodeGroup, error) {
scaleToZeroSupported := scaleToZeroSupportedStandard
if strings.EqualFold(m.config.VMType, providerazureconsts.VMTypeVMSS) {
scaleToZeroSupported = scaleToZeroSupportedVMSS
}
s, err := dynamic.SpecFromString(spec, scaleToZeroSupported)
if err != nil {
return nil, fmt.Errorf("failed to parse node group spec: %v", err)
}
vmsPoolSet := m.azureCache.getVMsPoolSet()
if _, ok := vmsPoolSet[s.Name]; ok {
return NewVMsPool(s, m), nil
}
switch m.config.VMType {
case providerazureconsts.VMTypeStandard:
return NewAgentPool(s, m)
case providerazureconsts.VMTypeVMSS:
return NewScaleSet(s, m, -1, false)
default:
return nil, fmt.Errorf("vmtype %s not supported", m.config.VMType)
}
}
From this (that has always been here), VM types other than these two will effectively disable CAS already, given this method is critical. I don't think anyone can use that right now.
I think it is fair to "notify users" through fail initialization like this, given the config is cluster-level.
For our support of flex, I think we do have it (that EnableVMSSFlexNodes
?), but not through this. This sounds like something to revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any more tests we could/should add here (e.g. to test the changes in the overrid logic, such as legacy overrides?) Or are these covered in the azure_manager_tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are already covered in azure_manager_test.go, I think the coverage is quite decent now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no suitable mock in the cloud provider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't look at that yet. But, given low traffic for this area (deprecated VMAS type pool), I don't think we should put effort on it now.
As discussed, I believe we should delegate all those to cloud-provider-azure docs. We could sweep through the readmes if there is any left to begin with. Also, I give it backward compatibility for those who wants to use old names for the interface config fields/envs. Included in the release note as well. |
811f93c
to
2d55b65
Compare
2849664
to
dfac3bb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but for error checking bug
@@ -258,9 +267,9 @@ func (as *AgentPool) deleteOutdatedDeployments() (err error) { | |||
errList := make([]error, 0) | |||
for _, deployment := range toBeDeleted { | |||
klog.V(4).Infof("deleteOutdatedDeployments: starts deleting outdated deployment (%s)", *deployment.Name) | |||
_, err := as.manager.azClient.deploymentsClient.Delete(ctx, as.manager.config.ResourceGroup, *deployment.Name) | |||
rerr := as.manager.azClient.deploymentClient.Delete(ctx, as.manager.config.ResourceGroup, *deployment.Name) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be checking rerr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and looks like this one of the last of this kind of issue
52fe2da
to
4e12429
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/approve |
/assign @towca |
CA go.mod changes LGTM /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: comtalyst, tallaxes, towca The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherry-pick cluster-autoscaler-release-1.30 |
@comtalyst: #6947 failed to apply on top of branch "cluster-autoscaler-release-1.30":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
In the past, Azure provider has borrowed some code from cloud-provider-azure, although some did not even exist in cloud-provider-azure at that time.
This results in several pieces of code being duplicated or branched away. This PR is delegating those functionalities back to cloud-provider-azure as much as possible in attempt to eliminate the duplicated logic.
Examples for instances where we miss cloud-provider-azure updates for the duplicated code:
VmssCacheTTL
in autoscaler's duplicated code is namedVmssCacheTTLInSeconds
in cloud-provider-azure. This causes confusion and kills the purpose of reusable Azure configuration file that cloud-provider-azure is trying to achieve.On the other way around:
Furthermore, we see no need to have a customized version of cloud-provider-azure library for Cluster Autoscaler. Thus, delegating it back is justified.
This attempt, however, will not be perfect, given we have yet to finish the deprecation for existing behavior differences (e.g., config file format, CLI creds support).
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
azure_config.go
andazure_client.go
are the core changes of this PR. Changes in other files are more of consequences.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: