[JOBS-11439] Add taskValues support in remoteDbUtils #406
Conversation
I'm having trouble really understanding this change. It seems very simple, but maybe I'm confused about the API we are trying to support. Is it not possible to make this work for testing without exposing a new method? I would have thought that you could set a task value and then get it; I think you can do that in DBR but not outside, but can't we mock that behavior? Do people ever specify different values for `default` and `debugValue`?
databricks/sdk/dbutils.py (Outdated)
```python
def checkTaskValue(self, key: str, value: any) -> bool:
    """
    Check that the task value with given key is equal to the given value
    """
    return self.values.get(key) == value
```
We should make sure _TaskValuesUtil implements the interface defined at https://github.com/databricks/databricks-sdk-py/blob/main/databricks/sdk/runtime/dbutils_stub.py#L197. This only defines get and set. Otherwise, people may use this method and be surprised that it doesn't exist in DBR.
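For reference, a paraphrased sketch of that stub's `taskValues` surface; only `get` and `set` are defined there (signatures approximated from the public docs, not copied verbatim):

```python
class _TaskValuesStub:
    # Hypothetical rendering of the dbutils_stub.py interface: only these
    # two methods exist, so anything beyond get/set diverges from DBR.

    def get(self, taskKey: str, key: str, default: any = None, debugValue: any = None):
        """Returns the latest task value that belongs to the current job run."""
        ...

    def set(self, key: str, value: any) -> None:
        """Sets a task value on the current task run."""
        ...
```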
I added a documentation update about this specifically in that PR: https://github.com/databricks/docs/pull/13668
""" | ||
Returns the latest task value that belongs to the current job run | ||
""" | ||
return debugValue |
Wow, this API is a bit confusing. What is the difference between `default` and `debugValue`?
See https://docs.databricks.com/en/dev-tools/databricks-utils.html#get-command-dbutilsjobstaskvaluesget: `default` is returned when we are inside a job run and the task value can't be found, and `debugValue` is what we always return when outside of a job run, for example to be able to run a notebook outside of a job and check that it works without errors.
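To illustrate the semantics described above (the parameter names follow the linked docs; the task and key names are made up, and `dbutils` is assumed to be in scope as in a notebook):

```python
# Inside a job run: returns the value that task "etl" set for "row_count",
# falling back to `default` (0 here) only if that task never set it.
count = dbutils.jobs.taskValues.get(taskKey="etl", key="row_count", default=0)

# Outside of a job run (e.g. running the notebook interactively):
# `debugValue` is always returned, so the notebook can be smoke-tested
# without a real job context.
count = dbutils.jobs.taskValues.get(taskKey="etl", key="row_count",
                                    default=0, debugValue=42)
```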
I was implementing the change according to that doc: https://docs.google.com/document/d/1yrXXA27GJOB968apgnxnMsWfj0SJJn2cpArE4zzIORU/edit#heading=h.a2at0pdkk01r. As I understood it, the idea for the additional method is to check if …
Hey @mgyucht, welcome to the world of task values 😄 In DBR, when a notebook is run outside the context of a job run, … The reason we can't simply … I originally suggested that we implement an …
Here are my suggested changes based on my original thinking. If `assert` is confusing/not useful, then let's remove it.
databricks/sdk/dbutils.py (Outdated)
```python
def checkTaskValue(self, key: str, value: any) -> bool:
    """
    Check that the task value with given key is equal to the given value
    """
    return self.values.get(key) == value
```
Suggested change:

```diff
 def checkTaskValue(self, key: str, value: any) -> bool:
     """
     Check that the task value with given key is equal to the given value
     """
     return self.values.get(key) == value
+
+def assert(self, key: str, value: any) -> None:
+    """
+    Assert that the task value with the given key was set to the given value. This method is only available when run outside of a job run (i.e., locally or as part of a unit test).
+    """
+    assert self.values.get(key) == value
```
I can't call it `assert` (it's a reserved keyword in Python), does `assertTaskValue` work?
Let's discuss with @mgyucht first if we're even going to keep it around or not.
tests/test_dbutils.py (Outdated)
```python
def test_jobs_task_values_set(dbutils):
    dbutils.jobs.taskValues.set('key', 'value')

    assert dbutils.jobs.taskValues.checkTaskValue('key', 'value')
```
Suggested change:

```diff
-assert dbutils.jobs.taskValues.checkTaskValue('key', 'value')
+dbutils.jobs.taskValues.assert('key', 'value')
```
tests/test_dbutils.py (Outdated)
```python
def test_jobs_task_values_check_task_value(dbutils):
    assert (dbutils.jobs.taskValues.checkTaskValue('key', 'value') == False)
```
Suggested change:

```diff
-assert (dbutils.jobs.taskValues.checkTaskValue('key', 'value') == False)
+try:
+    dbutils.jobs.taskValues.assert('key', 'value')
+    # do something idiomatic here that makes the test fail
+except AssertionError:
+    # Swallow the error, as it's what we expected to happen
+    pass
```
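If the method survives the naming discussion, the placeholder above could be filled in with `pytest.raises`; a sketch assuming the `assertTaskValue` name floated earlier in this thread:

```python
import pytest

def test_jobs_task_values_assert_unset_key(dbutils):
    # The hypothetical assertTaskValue should raise AssertionError when
    # the key was never set to the expected value.
    with pytest.raises(AssertionError):
        dbutils.jobs.taskValues.assertTaskValue('key', 'value')
```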
Thanks for clarifying, that makes a lot of sense. My main concern is really that there is a divergence in the API between DBUtils in the SDK and DBUtils in DBR. Where possible, I think preventing this divergence ensures that customers don't run into AttributeErrors at runtime in DBR. I do understand why you need to add an additional method now because of the specific semantics of getting task values outside of a job run. If we add this method here, I suggest adding a corresponding method in DBR as well. Do you think that's feasible?
I think it is not very feasible, because if we want a similar implementation, then we need to check all job tasks for …
This isn't necessarily true, because we know in DBR which … If we want to keep parity between DBR and the SDK implementation, one option is to extend the semantics of …
I quickly looked at the implementation, and we don't actually know the current … I suggest that we simply remove …
Codecov Report: All modified and coverable lines are covered by tests ✅
Two requests, LGTM otherwise. I will defer the final review to @mgyucht to see if he has any further questions.
@mgyucht we addressed the changes and the implementation is simple and straightforward: the code will run in "remote" mode (aka locally) and will be more portable as a result. This should be enough for parity, PTAL.
I think this is good for now. If you are interested in updating the in-DBR `dbutils.jobs.task_values` to have other helpers in the future, we can make the corresponding changes here too.
* Introduce more specific exceptions, like `NotFound`, `AlreadyExists`, `BadRequest`, `PermissionDenied`, `InternalError`, and others ([#376](#376)). This makes it easier to handle errors thrown by the Databricks API. Instead of catching `DatabricksError` and checking the `error_code` field, you can catch one of these subtypes of `DatabricksError`, which is more ergonomic and removes the need to rethrow exceptions that you don't want to catch. For example:

  ```python
  try:
      return (self._ws
              .permissions
              .get(object_type, object_id))
  except DatabricksError as e:
      if e.error_code in [
              "RESOURCE_DOES_NOT_EXIST",
              "RESOURCE_NOT_FOUND",
              "PERMISSION_DENIED",
              "FEATURE_DISABLED",
              "BAD_REQUEST"]:
          logger.warning(...)
          return None
      raise RetryableError(...) from e
  ```

  can be replaced with

  ```python
  try:
      return (self._ws
              .permissions
              .get(object_type, object_id))
  except (PermissionDenied, FeatureDisabled):
      logger.warning(...)
      return None
  except NotFound:
      raise RetryableError(...)
  ```

* Paginate all SCIM list requests in the SDK ([#440](#440)). This change ensures that SCIM `list()` APIs use a default limit of 100 resources, leveraging SCIM's offset + limit pagination to batch requests to the Databricks API (a sketch of this pagination pattern follows these notes).
* Added taskValues support in remoteDbUtils ([#406](#406)).
* Added more detailed error message on default credentials not found error ([#419](#419)).
* Request management token via Azure CLI only for Service Principals and not human users ([#408](#408)).

API Changes:

* Fixed `create()` method for [w.functions](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/functions.html) workspace-level service and corresponding `databricks.sdk.service.catalog.CreateFunction` and `databricks.sdk.service.catalog.FunctionInfo` dataclasses.
* Changed `create()` method for [w.metastores](https://databricks-sdk-py.readthedocs.io/en/latest/workspace/metastores.html) workspace-level service with new required argument order.
* Changed `storage_root` field for `databricks.sdk.service.catalog.CreateMetastore` to be optional.
* Added `skip_validation` field for `databricks.sdk.service.catalog.UpdateExternalLocation`.
* Added `libraries` field for `databricks.sdk.service.compute.CreatePolicy`, `databricks.sdk.service.compute.EditPolicy` and `databricks.sdk.service.compute.Policy`.
* Added `init_scripts` field for `databricks.sdk.service.compute.EventDetails`.
* Added `file` field for `databricks.sdk.service.compute.InitScriptInfo`.
* Added `zone_id` field for `databricks.sdk.service.compute.InstancePoolGcpAttributes`.
* Added several dataclasses related to init scripts.
* Added `databricks.sdk.service.compute.LocalFileInfo` dataclass.
* Replaced `ui_state` field with `edit_mode` for `databricks.sdk.service.jobs.CreateJob` and `databricks.sdk.service.jobs.JobSettings`.
* Replaced `databricks.sdk.service.jobs.CreateJobUiState` dataclass with `databricks.sdk.service.jobs.CreateJobEditMode`.
* Added `include_resolved_values` field for `databricks.sdk.service.jobs.GetRunRequest`.
* Replaced `databricks.sdk.service.jobs.JobSettingsUiState` dataclass with `databricks.sdk.service.jobs.JobSettingsEditMode`.
* Removed [a.o_auth_enrollment](https://databricks-sdk-py.readthedocs.io/en/latest/account/o_auth_enrollment.html) account-level service. This was only used to aid in OAuth enablement during the public preview of OAuth. OAuth is now enabled for all AWS E2 accounts, so usage of this API is no longer needed.
* Added `network_connectivity_config_id` field for `databricks.sdk.service.provisioning.UpdateWorkspaceRequest`.
* Added [a.network_connectivity](https://databricks-sdk-py.readthedocs.io/en/latest/account/network_connectivity.html) account-level service.
* Added `string_shared_as` field for `databricks.sdk.service.sharing.SharedDataObject`.

Internal changes:

* Added regression question to issue template ([#414](#414)).
* Made test_auth no longer fail if you have a default profile setup ([#426](#426)).

OpenAPI SHA: d136ad0541f036372601bad9a4382db06c3c912d, Date: 2023-11-14
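As referenced in the SCIM pagination note above, a minimal sketch of offset + limit pagination with the stated 100-resource default. The `fetch_page` callable here is a hypothetical stand-in for a single SCIM list request; the SDK's generated `list()` methods do the equivalent internally.

```python
from typing import Callable, Iterator, List

def list_all(fetch_page: Callable[[int, int], List[dict]],
             page_size: int = 100) -> Iterator[dict]:
    # SCIM indices are 1-based: ask for `page_size` resources per request
    # and stop as soon as a page comes back empty.
    start_index = 1
    while True:
        resources = fetch_page(start_index, page_size)
        if not resources:
            return
        yield from resources
        start_index += len(resources)
```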
Changes

Support `dbutils.jobs.taskValues` methods in `RemoteDbUtils` and add an additional `checkTaskValue` to allow local testing.

Tests

Added three test cases in `test_dbutils.py`. Integration tests are not needed since the changes are only for remote mode.

* `make test` run locally
* `make fmt` applied
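A minimal sketch of the kind of local test this enables, assuming a configured workspace profile; `w.dbutils` is the SDK's `RemoteDbUtils` entry point, and the task, key, and values are illustrative:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
dbutils = w.dbutils

dbutils.jobs.taskValues.set('rows', 42)

# Outside of a job run, get() falls back to `debugValue`, so notebooks and
# unit tests can run end-to-end without a real job context.
assert dbutils.jobs.taskValues.get(taskKey='ingest', key='rows',
                                   default=0, debugValue=42) == 42
```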