Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Service Brokers to indicate the state of a Service Instance after a failed update or deprovisioning #637

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

fmui
Copy link
Member

@fmui fmui commented Feb 18, 2019

When an instance update or delete fails, the platform doesn't know the state of the instance and doesn't know if the operation can be repeated.
This PR allows brokers to provide this additional information to the platform in case of a failure.

This PR replaces PR #570.

@cfdreddbot
Copy link

✅ Hey fmui! The commit authors and yourself have already signed the CLA.

spec.md Outdated
@@ -328,6 +328,9 @@ For error responses, the following fields are defined:
| --- | --- | --- |
| error | string | A single word in camel case that uniquely identifies the error condition. If present, MUST be a non-empty string. |
| description | string | A user-facing error message explaining why the request failed. If present, MUST be a non-empty string. |
| instance_usable | boolean | If an update or deprovisioning operation failed, this flag indicates whether or not the Service Instance is still usable. If `true`, the Service Instance can still be used, `false` otherwise. This field MUST NOT be present for errors of other operations. Defaults to true. |
| update_repeatable | boolean | If an update operation failed, this flag indicates whether this update can be repeated or not. If `true`, the same update operation MAY be repeated and MAY succeed; if `false`, repeating the same update operation will fail again. This field MUST NOT be present for errors of other operations. Defaults to true. |
| retry_delay | integer | This field suggests how long (in seconds) the Platform SHOULD wait until it repeats the operation. If this a negative number, the Platform SHOULD NOT automatically repeat the operation. Defaults to 0 seconds. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a retry after concept in the last operation endpoint. Do you think it makes sense to use the same name for this as well, or do you think it needs to be distinct?

@@ -772,15 +775,21 @@ For success responses, the following fields are defined:
| Response Field | Type | Description |
| --- | --- | --- |
| state* | string | Valid values are `in progress`, `succeeded`, and `failed`. While `"state": "in progress"`, the Platform SHOULD continue polling. A response with `"state": "succeeded"` or `"state": "failed"` MUST cause the Platform to cease polling. |
| description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description field appears twice in this section. Conflict resolution artifact?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Thank you!

Copy link
Contributor

@waterlink waterlink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me except for the duplication of description field question.

About retry delay vs retry after I don’t have a strong opinion.

spec.md Outdated
| description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
| instance_usable | boolean | If an update or deprovisioning operation failed, this flag indicates whether or not the Service Instance is still usable. If `true`, the Service Instance can still be used, `false` otherwise. This field MUST NOT be present for errors of other operations. Defaults to true. |
| update_repeatable | boolean | If an update operation failed, this flag indicates whether this update can be repeated or not. If `true`, the same update operation MAY be repeated and MAY succeed; if `false`, repeating the same update operation will fail again. This field MUST NOT be present for errors of other operations. Defaults to true. |
| retry_delay | integer | If an operation failed, this field suggests how long (in seconds) the Platform SHOULD wait until it repeats the operation. If this a negative number, the Platform SHOULD NOT automatically repeat the operation. Defaults to 0 seconds. |
| description | string | A user-facing message that can be used to tell the user details about the status of the operation. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate field here.

Copy link
Contributor

@waterlink waterlink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@@ -328,6 +328,9 @@ For error responses, the following fields are defined:
| --- | --- | --- |
| error | string | A single word in camel case that uniquely identifies the error condition. If present, MUST be a non-empty string. |
| description | string | A user-facing error message explaining why the request failed. If present, MUST be a non-empty string. |
| instance_usable | boolean | If an update or deprovisioning operation failed, this flag indicates whether or not the Service Instance is still usable. If `true`, the Service Instance can still be used, `false` otherwise. This field MUST NOT be present for errors of other operations. Defaults to true. |
| update_repeatable | boolean | If an update operation failed, this flag indicates whether this update can be repeated or not. If `true`, the same update operation MAY be repeated and MAY succeed; if `false`, repeating the same update operation will fail again. This field MUST NOT be present for errors of other operations. Defaults to true. |
| retry_after | integer | This field suggests how long (in seconds) the Platform SHOULD wait until it repeats the operation. If this a negative number, the Platform SHOULD NOT automatically repeat the operation. Defaults to 0 seconds. |
Copy link
Member

@tinygrasshopper tinygrasshopper Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not getting the negative number bit for retry_after, is that indicating the same thing as update_repeatable is false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a difference. update_repeatable indicates wether a repeated update has a chance to succeed or not. A negative retry_after value indicates that even if a subsequent update could succeed, the platform should not automatically retry it.

Copy link
Contributor

@georgi-lozev georgi-lozev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I find the concept about embedding some payload in an error, apart from the code and a message, a bit odd.
For me there is either an error or not. Making some kind of interpretation about whether a thing is more or less an error is error prone itself.

| description | string | A user-facing message that can be used to tell the user details about the status of the operation. If present, MUST be a non-empty string. |
| instance_usable | boolean | If an update or deprovisioning operation failed, this flag indicates whether or not the Service Instance is still usable. If `true`, the Service Instance can still be used, `false` otherwise. This field MUST NOT be present for errors of other operations. Defaults to true. |
| update_repeatable | boolean | If an update operation failed, this flag indicates whether this update can be repeated or not. If `true`, the same update operation MAY be repeated and MAY succeed; if `false`, repeating the same update operation will fail again. This field MUST NOT be present for errors of other operations. Defaults to true. |
| retry_after | integer | If an operation failed, this field suggests how long (in seconds) the Platform SHOULD wait until it repeats the operation. If this a negative number, the Platform SHOULD NOT automatically repeat the operation. Defaults to 0 seconds. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we have retry_after as a new field in the body and not using the Retry-After header as per #621?

Maybe the second sentence should be If this is a negative number....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgi-lozev the retry_after field if for when the update operation itself has to be retried whereas the Retry-After is to retry the last operation polling.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I get the idea.

My question is why we won't re-use the same mechanism with the header to retry the update operation, but we introduce a new one, a field that is part of the body.
To me it looks like we can return a Retry-After header as part of the update operation to instruct the platform retrying after a given interval, same way as in last operation polling.

I don't think it's a showstopper, it's more kind of a nitpicking and aiming for some kind of consistency between implementing similar behaviors in the spec.

@fmui
Copy link
Member Author

fmui commented Mar 26, 2019

@georgi-lozev I agree. The right approach would be to return these states in the response of the fetch endpoint. But there are brokers that cannot provide a fetch endpoint. The only way I can think of that works for all brokers to transport this information is the error response. Do you have a better idea?

@jberkhahn
Copy link
Contributor

This seems reasonable from a k8s perspective. We'd have to plumb this through to our controller, but most of these values already have corresponding values in our reconciliation logic that these could be plugged into

@tinygrasshopper
Copy link
Member

Thinking a bit more about this, I think this PR does 3 things:

  • (Feature 1) Adds a mechanism to denote health/usability of a service instance.
  • (Feature 2) Adds a mechanism to denote if subsequent updates are possible on a service instance
  • (Feature 3) Adds a mechanism to inform the platform when the subsequent update can be trigged

A comment on Feature 3:
Are we are exposing the complexity of the retrying mechanism at two places in the spec, which I think is unnecessary. I think broker authors can leverage the existing Retry-After header to signify that the operation is still in progress today.

Today retry of an update can be encapsulated in the Last Operation invocation:
now

Where as after this PR, in addition to the above mechanism broker authors will also have the ability to retry the update itself:
after

What are the benefits of introducing an alternate mechanism for retry? Can we achieve the same result by adding the health/usability indicator to last operation?

@mattmcneeney
Copy link
Contributor

@fmui Do we still need this PR?

@mattmcneeney mattmcneeney modified the milestones: 2.16, 2.17 May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Inbox
Development

Successfully merging this pull request may close these issues.

7 participants