[JOBS-11439] Add taskValues support in remoteDbUtils #406

Merged
4 commits merged into databricks:main on Nov 1, 2023

Conversation

vsamoilov (Contributor)

Changes

Support dbutils.jobs.taskValues methods in RemoteDbUtils and add an additional checkTaskValue method to allow local testing.
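
For illustration, this is the kind of call the change unblocks from a local environment (a minimal sketch; only the taskValues calls come from this PR, the WorkspaceClient wiring is the SDK's usual entry point):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# With this change, the taskValues methods are available on the remote
# dbutils as well, so this no longer fails outside of a cluster.
w.dbutils.jobs.taskValues.set('my_key', 'my_value')
```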

Tests

Added three test cases in test_dbutils.py. Integration tests are not needed since the changes only affect remote mode.

  • make test run locally
  • make fmt applied
  • relevant integration tests applied

@vsamoilov vsamoilov requested a review from mgyucht October 13, 2023 13:31

@mgyucht mgyucht left a comment


I'm having trouble really understanding this change. Seems very simple, but maybe I'm confused about the API we are trying to support. Is it not possible to make this work for testing without exposing a new method? I would have thought that you could set a task value and then get it; I think you can do that in DBR but you can't outside, but we can't mock that behavior? Do people ever specify different values for default and debugValue?

Comment on lines 185 to 189
def checkTaskValue(self, key: str, value: any) -> bool:
    """
    Check that the task value with given key is equal to the given value
    """
    return self.values.get(key) == value

We should make sure _TaskValuesUtil implements the interface defined at https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/runtime/dbutils_stub.py#L197. This only defines get and set. Otherwise, people may use this method and be surprised that it doesn't exist in DBR.
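
For reference, the TaskValues surface in that stub is approximately the following (paraphrased from memory; the linked file is authoritative):

```python
# Approximate shape of dbutils_stub.py's TaskValues interface — only
# get and set are defined, which is why checkTaskValue would diverge.
class TaskValues:
    def get(self, taskKey: str, key: str, default: any = None, debugValue: any = None):
        """Returns the latest task value that belongs to the current job run."""
        ...

    def set(self, key: str, value: any) -> None:
        """Sets a task value on the current task run."""
        ...
```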

@vsamoilov vsamoilov (Contributor Author) Oct 23, 2023

I added a documentation update about this specifically in this PR: https://github.com/databricks/docs/pull/13668

"""
Returns the latest task value that belongs to the current job run
"""
return debugValue

Wow, this API is a bit confusing. What is the difference between default and debugValue?

@vsamoilov vsamoilov (Contributor Author) Oct 23, 2023

https://docs.databricks.com/en/dev-tools/databricks-utils.html#get-command-dbutilsjobstaskvaluesget — default is returned when we are inside a job run and the task value can't be found, while debugValue is what is always returned outside of a job run: for example, so you can run a notebook outside of a job and check that it works without errors.
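
To make the two parameters concrete, a hedged example based on that doc (the task and key names are invented):

```python
# Inside a job run: look up the value that task 'extract' set for
# 'row_count'; fall back to `default` if that task never set it.
# Outside a job run: the lookup cannot happen, so `debugValue` (42)
# is returned unconditionally.
count = dbutils.jobs.taskValues.get(taskKey='extract', key='row_count',
                                    default=0, debugValue=42)
```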

@vsamoilov (Contributor Author)

> I'm having trouble really understanding this change. Seems very simple, but maybe I'm confused about the API we are trying to support. Is it not possible to make this work for testing without exposing a new method? I would have thought that you could set a task value and then get it; I think you can do that in DBR but you can't outside, but we can't mock that behavior? Do people ever specify different values for default and debugValue?

I was implementing the change according to this doc: https://docs.google.com/document/d/1yrXXA27GJOB968apgnxnMsWfj0SJJn2cpArE4zzIORU/edit#heading=h.a2at0pdkk01r. As I understood it, the idea for the additional method is to check whether set(key, value) was called with the specified parameters, because the get interface is for the case when we are inside a job run (it requires a task key at least). cc @gaborratky-db @mgyucht

@vsamoilov vsamoilov requested a review from mgyucht October 23, 2023 13:52

gaborratky-db commented Oct 23, 2023

> I'm having trouble really understanding this change. Seems very simple, but maybe I'm confused about the API we are trying to support. Is it not possible to make this work for testing without exposing a new method? I would have thought that you could set a task value and then get it; I think you can do that in DBR but you can't outside, but we can't mock that behavior? Do people ever specify different values for default and debugValue?

Hey @mgyucht, welcome to the world of task values 😄 In DBR, when a notebook is run outside the context of a job run, set is a no-op and get returns the value of debugValue.

The reason we can't simply get what we set is because get has a well-defined behavior when run outside the context of a job run: return debugValue or throw an error. get also requires an explicit taskKey whereas set is implicit.
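
For concreteness, a minimal sketch of the remote-mode behavior described here (the class and field names follow the diff hunks in this review; the exact error type outside a job run is an assumption):

```python
class _TaskValuesUtil:
    def __init__(self):
        self.values = {}

    def set(self, key: str, value: any) -> None:
        # In remote mode this only records the value locally so unit
        # tests can inspect it (in DBR outside a job run, set is a no-op).
        self.values[key] = value

    def get(self, taskKey: str, key: str, default: any = None, debugValue: any = None):
        # Outside a job run there is no run context to query, so the
        # well-defined behavior is: return debugValue, or fail if the
        # caller didn't provide one (error type assumed).
        if debugValue is None:
            raise TypeError('Must pass debugValue when calling get() outside of a job run')
        return debugValue
```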

I originally suggested that we implement an assert(key, value) method on taskValues that does an assert internally, so that it's clear it is only meant for testing as part of unit tests. If you think the documentation and the extra, SDK-only method would cause more confusion than they're worth, then let's simply get rid of it.

@gaborratky-db gaborratky-db left a comment


Here are my suggested changes based on my original thinking. If assert is confusing/not useful, then let's remove it.

databricks/sdk/dbutils.py (outdated review thread, resolved)
Comment on lines 185 to 189
def checkTaskValue(self, key: str, value: any) -> bool:
    """
    Check that the task value with given key is equal to the given value
    """
    return self.values.get(key) == value


Suggested change
def checkTaskValue(self, key: str, value: any) -> bool:
    """
    Check that the task value with given key is equal to the given value
    """
    return self.values.get(key) == value
def assert(self, key: str, value: any) -> None:
    """
    Assert that the task value with the given key was set to the given value. This method is only available when run outside of a job run (i.e., locally or as part of a unit test).
    """
    assert self.values.get(key) == value

vsamoilov (Contributor Author)

I can't call it assert, since it's a reserved keyword in Python; does assertTaskValue work?


Let's discuss with @mgyucht first if we're even going to keep it around or not.

def test_jobs_task_values_set(dbutils):
    dbutils.jobs.taskValues.set('key', 'value')

    assert dbutils.jobs.taskValues.checkTaskValue('key', 'value')


Suggested change
    assert dbutils.jobs.taskValues.checkTaskValue('key', 'value')
    dbutils.jobs.taskValues.assert('key', 'value')



def test_jobs_task_values_check_task_value(dbutils):
    assert (dbutils.jobs.taskValues.checkTaskValue('key', 'value') == False)


Suggested change
    assert (dbutils.jobs.taskValues.checkTaskValue('key', 'value') == False)
    try:
        dbutils.jobs.taskValues.assert('key', 'value')
        # do something idiomatic here that makes the test fail
    except AssertionError:
        # Swallow the error, as it's what we expected to happen
        pass


mgyucht commented Oct 25, 2023

Thanks for clarifying, that makes a lot of sense. My main concern is really that there is a divergence in the API between DBUtils in the SDK and DBUtils in DBR. Where possible, I think preventing this divergence ensures that customers don't run into AttributeErrors at runtime in DBR. I do understand why you need to add an additional method now because of the specific semantics of getting task values outside of a job run. If we add this method here, I suggest adding a corresponding method in DBR as well. Do you think that's feasible?

@vsamoilov (Contributor Author)

> Thanks for clarifying, that makes a lot of sense. My main concern is really that there is a divergence in the API between DBUtils in the SDK and DBUtils in DBR. Where possible, I think preventing this divergence ensures that customers don't run into AttributeErrors at runtime in DBR. I do understand why you need to add an additional method now because of the specific semantics of getting task values outside of a job run. If we add this method here, I suggest adding a corresponding method in DBR as well. Do you think that's feasible?

I think it is not very feasible, because if we want a similar implementation, we would need to check all job tasks for a taskValue with the specified key. Also, as I understand it, the use case during local development is writing unit tests that run the task function and then check that the correct value was set for the specified key; I don't know of such use cases in DBR. I think if we document that it is only for unit-test purposes during local development and implement it via an assert call, that should be enough to prevent its use in DBR. What do you think? cc @mgyucht @gaborratky-db

@gaborratky-db

> Thanks for clarifying, that makes a lot of sense. My main concern is really that there is a divergence in the API between DBUtils in the SDK and DBUtils in DBR. Where possible, I think preventing this divergence ensures that customers don't run into AttributeErrors at runtime in DBR. I do understand why you need to add an additional method now because of the specific semantics of getting task values outside of a job run. If we add this method here, I suggest adding a corresponding method in DBR as well. Do you think that's feasible?

> I think it is not very feasible, because if we want a similar implementation, we would need to check all job tasks for a taskValue with the specified key. Also, as I understand it, the use case during local development is writing unit tests that run the task function and then check that the correct value was set for the specified key; I don't know of such use cases in DBR. I think if we document that it is only for unit-test purposes during local development and implement it via an assert call, that should be enough to prevent its use in DBR. What do you think? cc @mgyucht @gaborratky-db

This isn't necessarily true because we know in DBR which taskKey the dbutils.jobs.taskValues call runs for and we can call dbutils.jobs.taskValues.get("<my own task key>", ... ) to retrieve the task value.

If we want to keep parity between DBR and the SDK implementation, one option is to extend the semantics of dbutils.jobs.taskValues.get and make it possible to query the task's own task values when called as part of a job run. For example:

dbutils.jobs.taskValues.get(None, "myKey") would query whether something was set with dbutils.jobs.taskValues.set("myKey", "myValue"). I don't like using None as a special taskKey value here but I don't have a better idea off the top of my head.
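
A sketch of what that extended lookup could look like (purely illustrative; None-as-own-task-key is the proposal above, not shipped behavior):

```python
def get(self, taskKey, key, default=None, debugValue=None):
    # Proposed extension: taskKey=None means "this task's own values",
    # which would let a caller read back what set() stored.
    if taskKey is None:
        return self.values.get(key, default)
    ...  # existing cross-task lookup (or debugValue outside a job run)
```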

@gaborratky-db

I quickly looked at the implementation and we don't actually know the current taskKey, only the current run id, which we can't use to query the values.

I suggest that we simply remove assertTaskValue from the SDK version, keep parity, ship the fix, and come back to this problem when we have additional capacity to improve dbutils.jobs.taskValues.
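
With the extra method removed, a local unit test can still exercise the surviving get/set surface; a minimal sketch (the dbutils fixture and names are assumed):

```python
def test_task_values_parity(dbutils):
    # set() is safe to call in remote mode; in DBR outside a job run it is a no-op.
    dbutils.jobs.taskValues.set('key', 'value')

    # Outside a job run, get() always returns debugValue, so that is
    # what a local test can assert against.
    assert dbutils.jobs.taskValues.get(taskKey='some_task', key='key',
                                       default=None, debugValue='debug') == 'debug'
```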


codecov-commenter commented Oct 27, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

| Files | Coverage Δ |
| --- | --- |
| databricks/sdk/dbutils.py | 79.41% <100.00%> (+1.06%) ⬆️ |


@gaborratky-db gaborratky-db left a comment


Two requests, LGTM otherwise. I will defer the final review to @mgyucht to see if he has any further questions.

databricks/sdk/dbutils.py (outdated review thread, resolved)
tests/test_dbutils.py (review thread resolved)
@gaborratky-db

@mgyucht we addressed the changes and the implementation is simple and straightforward: the code will run in "remote" mode (aka locally) and will be more portable as a result. This should be enough for parity, PTAL.


@mgyucht mgyucht left a comment


I think this is good for now. If you are interested in updating the in-DBR dbutils.jobs.task_values to have other helpers in the future, we can make the corresponding changes here too.

@mgyucht mgyucht enabled auto-merge November 1, 2023 08:09
@mgyucht mgyucht disabled auto-merge November 1, 2023 08:09
@mgyucht mgyucht added this pull request to the merge queue Nov 1, 2023
Merged via the queue into databricks:main with commit 9955c08 Nov 1, 2023
7 checks passed
mgyucht added a commit that referenced this pull request Nov 14, 2023
* Introduce more specific exceptions, like `NotFound`, `AlreadyExists`, `BadRequest`, `PermissionDenied`, `InternalError`, and others ([#376](#376)). This makes it easier to handle errors thrown by the Databricks API. Instead of catching `DatabricksError` and checking the error_code field, you can catch one of these subtypes of `DatabricksError`, which is more ergonomic and removes the need to rethrow exceptions that you don't want to catch. For example:
```python
try:
  return (self._ws
    .permissions
    .get(object_type, object_id))
except DatabricksError as e:
  if e.error_code in [
    "RESOURCE_DOES_NOT_EXIST",
    "RESOURCE_NOT_FOUND",
    "PERMISSION_DENIED",
    "FEATURE_DISABLED",
    "BAD_REQUEST"]:
    logger.warning(...)
    return None
  raise RetryableError(...) from e
```
can be replaced with
```python
try:
  return (self._ws
    .permissions
    .get(object_type, object_id))
except (PermissionDenied, FeatureDisabled):
  logger.warning(...)
  return None
except NotFound:
  raise RetryableError(...)
```
* Paginate all SCIM list requests in the SDK ([#440](#440)). This change ensures that SCIM list() APIs use a default limit of 100 resources, leveraging SCIM's offset + limit pagination to batch requests to the Databricks API (see the sketch after this list).
* Added taskValues support in remoteDbUtils ([#406](#406)).
* Added more detailed error message on default credentials not found error ([#419](#419)).
* Request management token via Azure CLI only for Service Principals and not human users ([#408](#408)).
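
As a hedged illustration of the SCIM pagination item above (group listing chosen arbitrarily):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# SCIM list() calls now page transparently: the SDK fetches batches of
# 100 (the new default limit) and advances the offset until exhausted.
for group in w.groups.list():
    print(group.display_name)
```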

API Changes:

 * Fixed `create()` method for [w.functions](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/functions.html) workspace-level service and corresponding `databricks.sdk.service.catalog.CreateFunction` and `databricks.sdk.service.catalog.FunctionInfo` dataclasses.
 * Changed `create()` method for [w.metastores](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/metastores.html) workspace-level service with new required argument order.
 * Changed `storage_root` field for `databricks.sdk.service.catalog.CreateMetastore` to be optional.
 * Added `skip_validation` field for `databricks.sdk.service.catalog.UpdateExternalLocation`.
 * Added `libraries` field for `databricks.sdk.service.compute.CreatePolicy`, `databricks.sdk.service.compute.EditPolicy` and `databricks.sdk.service.compute.Policy`.
 * Added `init_scripts` field for `databricks.sdk.service.compute.EventDetails`.
 * Added `file` field for `databricks.sdk.service.compute.InitScriptInfo`.
 * Added `zone_id` field for `databricks.sdk.service.compute.InstancePoolGcpAttributes`.
 * Added several dataclasses related to init scripts.
 * Added `databricks.sdk.service.compute.LocalFileInfo` dataclass.
 * Replaced `ui_state` field with `edit_mode` for `databricks.sdk.service.jobs.CreateJob` and `databricks.sdk.service.jobs.JobSettings`.
 * Replaced `databricks.sdk.service.jobs.CreateJobUiState` dataclass with `databricks.sdk.service.jobs.CreateJobEditMode`.
 * Added `include_resolved_values` field for `databricks.sdk.service.jobs.GetRunRequest`.
 * Replaced `databricks.sdk.service.jobs.JobSettingsUiState` dataclass with `databricks.sdk.service.jobs.JobSettingsEditMode`.
 * Removed [a.o_auth_enrollment](https://databricks-sdk-py.readthedocs.io/en/latest/account/o_auth_enrollment.html) account-level service. This was only used to aid in OAuth enablement during the public preview of OAuth. OAuth is now enabled for all AWS E2 accounts, so usage of this API is no longer needed.
 * Added `network_connectivity_config_id` field for `databricks.sdk.service.provisioning.UpdateWorkspaceRequest`.
 * Added [a.network_connectivity](https://databricks-sdk-py.readthedocs.io/en/latest/account/network_connectivity.html) account-level service.
 * Added `string_shared_as` field for `databricks.sdk.service.sharing.SharedDataObject`.

Internal changes:

* Added regression question to issue template ([#414](#414)).
* Made test_auth no longer fail if you have a default profile setup ([#426](#426)).

OpenAPI SHA: d136ad0541f036372601bad9a4382db06c3c912d, Date: 2023-11-14
@mgyucht mgyucht mentioned this pull request Nov 14, 2023
github-merge-queue bot pushed a commit that referenced this pull request Nov 14, 2023 (same release notes as the commit above).