[FEATURE]: Pre-process packages available via the DBR without installation #1931
Comments
Here are all the packages since DBR 9.x: https://github.com/databrickslabs/sandbox/blob/main/runtime-packages/sample-output.txt. We don't really care about specific versions of those packages, at least for now.
Thanks for that, I wasn't aware of that tool and it looks quite useful. The lists are roughly the same with a few differences here and there. Some notes:
I am OK with a "good enough" approach. Full coverage of the pre-installed packages would be great, but getting to 80% is good enough.
Use |
Installed Python packages from the 14.3 DBR list to be whitelisted
### Linked issues
Adds #1931

## Changes
Whitelist `brotli`. Partly resolves #1931

## Changes
Whitelists catalogue
### Linked issues
Partly resolves #1931

## Changes
Added `murmurhash` to known list
### Linked issues
Partly resolves #1931

## Changes
Added `multimethod` to known list
### Linked issues
Partly resolves #1931

## Changes
Added `msal-extensions` to known list
### Linked issues
Partly resolves #1931

## Changes
Added `nvidia-ml-py` to known list
### Linked issues
Partly resolves #1931

## Changes
Added `mosaicml-streaming` to known list
### Linked issues
Partly resolves #1931
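The pull requests above all extend a "known list" so that imports of pre-installed packages can be recognized without installing them. A minimal sketch of how such a lookup could work is shown below. The JSON layout, `build_module_index`, and `find_distribution` are hypothetical names chosen for illustration and are not taken from the project; only the package and module names come from the pull requests and release notes on this page.

```python
import json

# Hypothetical excerpt of a known list: distribution name -> importable modules.
# The layout is an assumption for illustration, not the project's actual schema.
KNOWN_JSON = """
{
  "murmurhash": {"murmurhash": [], "murmurhash.about": []},
  "nvidia-ml-py": {"example": [], "pynvml": []},
  "mosaicml-streaming": {"simulation": [], "streaming": []}
}
"""


def build_module_index(known):
    """Invert the known list into a module name -> distribution name mapping."""
    index = {}
    for distribution, modules in known.items():
        for module in modules:
            index[module] = distribution
    return index


def find_distribution(index, import_name):
    """Return the distribution that provides import_name, or None if unknown.

    Tries the full dotted name first, then each parent package in turn.
    """
    parts = import_name.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in index:
            return index[candidate]
        parts.pop()
    return None


index = build_module_index(json.loads(KNOWN_JSON))
print(find_distribution(index, "pynvml"))
print(find_distribution(index, "streaming.base.dataset"))
print(find_distribution(index, "not_a_known_package"))
```

Because only module names are stored, this kind of lookup ignores package versions, which matches the "we don't really care about specific versions" stance taken in the comments.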
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). This release adds a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration without executing it. It also introduces the `hms-fed` flag, which allows migrating HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file now takes the new parameters `sql_backend` and `inventory_database` to perform a dry-run migration of Access Control Lists (ACLs). A new `retrieve` method on `ACLMigrator` retrieves a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. A new `inferred_grants` table in the deployment schema stores the grants inferred during the migration.
* Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). This release introduces a new class, `WorkspacePathOwnership`, in the `owners.py` module to determine the transitive owners of files and notebooks within a workspace. The class is a subclass of `Ownership` and takes an `AdministratorLocator` and a `WorkspaceClient` as inputs. It infers the owner from the first `CAN_MANAGE` permission level in the access control list. A new property, `workspace_path_ownership`, on the existing `HiveMetastoreContext` class returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. New tests in `test_owners.py`, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. The `AdministratorLocator` locates the administrators group for the workspace, and the `PermissionLevel` class specifies the permission level for the notebook permissions.
* Added `mosaicml-streaming` to known list ([#3029](#3029)). This release adds several libraries to the known list in the JSON file: `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931).
* Added `msal-extensions` to known list ([#3030](#3030)). This release adds two packages, `msal-extensions` and `portalocker`, to the known list. The `msal-extensions` package contains modules that extend the Microsoft Authentication Library (MSAL): cache lock, libsecret, osx, persistence, token cache, and windows. The `portalocker` package provides file locking with various backends such as Redis, as well as constants, exceptions, and utilities.
* Added `multimethod` to known list ([#3031](#3031)). This release adds the `multimethod` package to the `known.json` file, which partially resolves issue [#1931](#1931).
* Added `murmurhash` to known list ([#3032](#3032)). The `murmurhash` package has been added to the known list, addressing part of issue [#1931](#1931). It contributes two entries, `murmurhash` and `murmurhash.about`: the former offers the core hashing functionality, while the latter contains metadata about the package.
* Added `ninja` to known list ([#3050](#3050)). This release adds Ninja, a fast, lightweight build system, to the known list in the `known.json` file. This change partially resolves issue [#1931](#1931). It does not modify any existing functionality or introduce new methods; the alteration is limited to including Ninja in the known list.
* Added `nvidia-ml-py` to known list ([#3051](#3051)). This release adds the `nvidia-ml-py` package to the known list. The addition consists of two entries: `example` and `pynvml`. `example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python. This change partially resolves issue [#1931](#1931).
* Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard, called "migration-progress", which displays insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, generates the dashboard from SQL queries in a specified folder and replaces database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. A new SQL file tracks the percentage of various objects migrated and displays the results in the new dashboard.
* Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. The change includes unit tests and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class. This change addresses issue [#3058](#3058), which was related to grant progress encoding. The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend.
* Added table progress encoder ([#3083](#3083)). This release adds a table progress encoder to the `WorkflowTask` context to enhance the tracking of table-related operations in the `migration-progress` workflow. The new encoder, implemented in the `TableProgressEncoder` class, is connected to the `sql_backend`, `table_ownership`, and `migration_status_refresher` objects. The `GrantsProgressEncoder` class has been refactored to `GrantProgressEncoder`, with additional parameters for improved encoding of grants. The new `refresh_table_migration_status` task scans and records the migration status of tables and views in the inventory, storing results in the `$inventory.migration_status` inventory table. Two new unit tests cover encoding and migration status handling. This change addresses issues [#3061](#3061) and [#3064](#3064).
* Combine static code analysis results with historical job snapshots ([#3074](#3074)). This release adds a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. It tracks the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method, which only tracked the progress of grants. `JobsProgressEncoder` takes additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs, and is used in the `grants_progress` method. A new unit test for the `JobsProgressEncoder` class verifies that the encoding of job information works as expected with different types of failures and job details. This revision also makes it possible to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. The `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and to connect with `WorkspacePathOwnership`. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include a new `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and `run_id`. The `TableMigrationStatus` class gains a new class variable, `__id_attributes__`, which specifies the attributes used to uniquely identify a table, and a `destination()` method, which returns the fully qualified name of the destination table. Unit and integration tests have been added and updated. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class, and with the `__id_attributes__` class variable to specify the attributes used to uniquely identify them.
* Determine ownership of tables based on grants and source code ([#3066](#3066)). This release changes the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class, `LegacyQueryOwnership`, has been added to the `databricks.labs.ucx.framework.owners` module as a subclass of `Ownership`; it determines the owner of a table based on the queries that write to it. The `TableOwnership` class now accepts additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class now accepts a new `legacy_query_ownership` argument, and a new method, `owner_of_path`, has been added to the `Ownership` class. A new file, `ownership.py`, defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been made resilient to a pipeline being deleted during crawling. Instead of failing, the crawler now logs a warning and continues to crawl. A new test method, `test_pipeline_disappears_during_crawl`, verifies that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added. Additionally, the `test_pipeline_list_with_no_config` test case checks the behavior of the pipelines crawler when there is no configuration present.
* Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). This release fixes a `UnicodeDecodeError` that occurred when fetching init scripts in the `_get_init_script_data` method. `UnicodeDecodeError` and `FileNotFoundError` have been added to the list of exceptions handled in the method. When either exception occurs, the method now returns `None` and logs a warning instead of raising an unhandled exception. The behavior of the `_check_cluster_init_script` method is unchanged; it continues to verify the correct setup of init scripts in the cluster.
* Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)). This release improves the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). It resolves an `UnknownHostException` problem with a specific KeyVault and adds error handling for invalid Azure Key Vaults. The existing `NotFound` exception handling has been expanded to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret is skipped and a warning is logged.
* Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). This change addresses the `Unsupported schema: XXX` error in the `assess_workflows` function. It introduces a new exception class, `InvalidPath`, in the `WorkspaceCache` mixin and substitutes `ValueError` with `InvalidPath` in the `jobs.py` file, which provides a more specific error message for unsupported schema paths. The `test_cached_workspace_path.py` file has been updated to test the `WorkspaceCache` object, including the `InvalidPath` exception for non-absolute paths and a new test function for this exception. The `WorkspaceCache` class has an ellipsis in the `__init__` method, indicating additional initialization code not shown in this diff.
* Fixed `assert curr.location is not None` ([#3105](#3105)). This release addresses an issue in the `_external_locations` method, which failed to check whether the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. The method now checks for `None` and skips the current iteration of the loop if the location is not set. It continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit.
* Fixed dynamic import issue ([#3053](#3053)). This release addresses an issue related to dynamic import inference. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. The change also introduces new functions, `get_global(self, name: str)` and `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updates the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, covers the dynamic import feature, and a new function, `is_from_module`, checks whether a given name is from a specific module. This commit was co-authored by Eric Vergnaud.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the `CONVERT_TO_EXTERNAL` scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is tested with unit and integration tests. The migrate table function for external sync ensures the table is migrated as external to UC after the conversion. The changes add a new workflow and modify an existing one, renaming the `migrate_tables` function to `convert_managed_hms_to_external`. The new function handles the conversion of managed HMS tables to external and updates the `object_type` property of the table in the inventory database to `EXTERNAL` after the conversion is completed. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process.
* Fixed issue with table location on storage root ([#3094](#3094)). This release addresses the incorrect identification of the parent folder as an external location when there is a single table with a prefix that matches a parent folder. It also improves the handling of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore. This includes S3 bucket URLs that do not include a specific file or path, and those with a path that does not include a file. New test cases have been added for these URL formats and existing ones modified to include them. The new methods added are not explicitly stated, but they likely involve functions for parsing and processing the new S3 bucket URL formats.
* Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues with snapshot loading for the DFSA and used-table crawlers when using the spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the spark-based lsql backend. The fix replaces this method with `.asDict()`. Integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the `test_queries.py` file and do not affect other parts of the project. No new methods were added, and existing functionality changes were limited to updating the snapshot loading process.
* Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file in the `databricks/labs/ucx/source_code` directory has been updated to improve the user experience in the notebook editor. Certain advice codes are no longer propagated, specifically: `cannot-autofix-table-reference`, `default-format-changed-in-dbr8`, `dependency-not-found`, `not-supported`, `notebook-run-cannot-compute-value`, `sql-parse-error`, `sys-path-cannot-compute-value`, and `unsupported-magic-line`. A new variable, `DEBUG_MESSAGE_CODES`, stores the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes. These updates aim to reduce the number of unnecessary error messages shown to the user.
* Improve scan tables in mounts ([#2767](#2767)). In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved. Previously, the implementation skipped most directories, only finding 8 tables; the updated version parses many more tables. The commit includes bug fixes and new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. Manual and unit tests have been added and verified; integration tests are still pending. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This simplifies the codebase by removing the need for a separate linter while keeping the checks for Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. The `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed. Manual testing has been conducted to confirm that the assessment workflow continues to function properly.
* Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit updates the permissions crawling process for secret scopes, addressing the task failure that occurred when a secret scope disappeared before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method now handle such cases by logging a warning instead of failing the task. The return type of the `get_crawler_tasks` method has been updated to `Iterable[Callable[[], Permissions | None]]`.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). This pull request updates the `sqlglot` requirement to admit the latest version, which includes various bug fixes, refactors, and new features. The latest version supports the `TO_DOUBLE` and `TRY_TO_TIMESTAMP` functions in Snowflake and the `EDIT_DISTANCE` (Levenshtein) function in BigQuery. It also addresses an issue with the `ARRAY JOIN` function in Clickhouse and changes the hive dialect hierarchy.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)). This release updates the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This allows the use of the latest `sqlglot` features and bug fixes while excluding version 25.28 and later.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release updates the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28 to a range from 25.5.0 up to but excluding 25.29. This allows the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line.

Dependency updates:

* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)).
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)).
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
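The `UnicodeDecodeError` fix described in the release notes ([#3103](#3103)) follows a common pattern: treat an unreadable file as a skippable condition, log a warning, and return `None`. The sketch below illustrates that pattern under stated assumptions; `read_script_text` is a hypothetical helper reading from the local filesystem, not the project's `_get_init_script_data` method.

```python
from __future__ import annotations

import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def read_script_text(path: Path) -> str | None:
    """Return the script contents, or None if the file is missing or not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        return path.read_text(encoding="utf-8")
    except (FileNotFoundError, UnicodeDecodeError) as e:
        # Log and skip rather than letting one bad script fail the whole run.
        logger.warning("Skipping init script %s: %s", path, e)
        return None
```

Callers then need to handle the `None` case explicitly, for example by skipping the script in the assessment instead of treating it as empty.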
* Added `--dry-run` option for ACL migrate ([#3017](#3017)). In this release, we have added a `--dry-run` option to the `migrate-acls` command in the `labs.yml` file, enabling a preview of the migration process without executing it. This feature also introduces the `hms-fed` flag, allowing migration of HMS-FED ACLs while migrating tables. The `ACLMigrator` class in the `application.py` file has been updated to include new parameters, `sql_backend` and `inventory_database`, to perform a dry run migration of Access Control Lists (ACLs). Additionally, a new `retrieve` method has been added to the `ACLMigrator` class to retrieve a list of grants based on the source and destination objects, and a `CrawlerBase` class has been introduced for fetching grants. We have also introduced a new `inferred_grants` table in the deployment schema to store inferred grants during the migration process. * Added `WorkspacePathOwnership` to determine transitive owners for files and notebooks ([#3047](#3047)). In this release, we introduce a new class `WorkspacePathOwnership` in the `owners.py` module to determine the transitive owners for files and notebooks within a workspace. This class is added as a subclass of `Ownership` and takes `AdministratorLocator` and `WorkspaceClient` as inputs. It has methods to infer the owner from the first `CAN_MANAGE` permission level in the access control list. We also added a new property `workspace_path_ownership` to the existing `HiveMetastoreContext` class, which returns a `WorkspacePathOwnership` object initialized with an `AdministratorLocator` object and a `workspace_client`. This addition enables the determination of owners for files and notebooks within the workspace. The functionality is demonstrated through new tests added to `test_owners.py`. The new tests, `test_notebook_owner` and `test_file_owner`, create a notebook and a workspace file and verify the owner of each using the `owner_of` method. 
The `AdministratorLocator` is used to locate the administrators group for the workspace and the `PermissionLevel` class is used to specify the permission level for the notebook permissions. * Added `mosaicml-streaming` to known list ([#3029](#3029)). In this release, we have expanded the range of recognized packages in our system by adding several new libraries to the known list in the JSON file. The additions include `mosaicml-streaming`, `oci`, `pynacl`, `pyopenssl`, `python-snapy`, and `zstd`. Notably, `mosaicml-streaming` has two new entries, `simulation` and `streaming`, while the other packages have a single entry each. This update addresses issue [#1931](#1931) and enhances the system's ability to identify and work with a wider variety of packages. * Added `msal-extensions` to known list ([#3030](#3030)). In this release, we have added support for two new packages, `msal-extensions` and `portalocker`, to our project. The `msal-extensions` package includes modules for extending the Microsoft Authentication Library (MSAL), including cache lock, libsecret, osx, persistence, token cache, and windows. This addition enhances the library's authentication capabilities and provides greater flexibility when working with MSAL. The `portalocker` package offers functionalities for handling file locking with various backends such as Redis, as well as constants, exceptions, and utilities. This package enables developers to manage file locking more efficiently, preventing conflicts and ensuring data consistency. These new packages extend the range of supported packages and functionalities for handling authentication and file locking in the project, providing more options for software engineers to develop robust and secure applications. * Added `multimethod` to known list ([#3031](#3031)). In this release, we have added support for the `multimethod` programming concept to the library. 
This feature has been added to the `known.json` file, which partially resolves issue [#1931](#1931).
* Added `murmurhash` to known list ([#3032](#3032)). A new hash function, MurmurHash, has been added to the library's supported list, addressing part of issue [#1931](#1931). The MurmurHash entry includes two modules, `murmurhash` and `murmurhash.about`, with distinct functionalities. The `murmurhash` module offers core hashing functionality, while `murmurhash.about` contains metadata or documentation related to the MurmurHash function. This integration enables developers to leverage MurmurHash for data processing tasks, enhancing the library's functionality and versatility. Users familiar with the project can now incorporate MurmurHash into their applications and configurations, taking advantage of its unique features and capabilities.
* Added `ninja` to known list ([#3050](#3050)). In this release, we have added Ninja to the known list in the `known.json` file. Ninja is a fast, lightweight build system that enables better integration and handling within the project's larger context. This change partially resolves issue [#1931](#1931), which may have been caused by challenges in integrating or using Ninja. It is important to note that this change does not modify any existing functionality or introduce new methods. The alteration is limited to including Ninja in the known list, improving the management and identification of various components within the project.
* Added `nvidia-ml-py` to known list ([#3051](#3051)). In this release, we have added support for the `nvidia-ml-py` package to our project. This addition consists of two components: `example` and `pynvml`. `example` is likely a placeholder or sample usage of the package, while `pynvml` is a module that enables interaction with NVIDIA's system management library (NVML) through Python.
This enhancement is a significant step towards resolving issue [#1931](#1931), which may require the use of NVIDIA-related tools or libraries, thereby improving the project's functionality and capabilities.
* Added dashboard for tracking migration progress ([#3016](#3016)). This change introduces a new dashboard for tracking migration progress in a project, called "migration-progress", which displays real-time insights into migration progress and facilitates planning and task division. A new method, `_create_dashboard`, has been added to generate the dashboard from SQL queries in a specified folder and replace database and catalog references to match the configuration settings. The changes include updating the install to replace the UCX catalog in queries, adding a new object serializer, and updating integration tests and manual testing on a staging environment. The new functionality covers the migration of tables, views, UDFs, grants, jobs, workflow problems, clusters, pipelines, and policies. Additionally, a new SQL file has been added to track the percentage of various objects migrated and display the results in the new dashboard.
* Added grant progress encoder ([#3079](#3079)). A new `GrantsProgressEncoder` class has been introduced in the `progress/grants.py` file to encode `Grant` objects into `History` objects for the `migration-progress` workflow. This change includes the addition of unit tests to ensure proper functionality and handles cases where `Grant` objects fail to map to the Unity Catalog by adding a list of failures to the `History` object. The commit also modifies the `migration-progress` workflow to incorporate the new `GrantsProgressEncoder` class, enhancing the grant processing capabilities and improving the testing of this functionality. This change addresses issue [#3058](#3058), which was related to grant progress encoding.
The `GrantsProgressEncoder` class can encode grant properties, such as the principal, action, database, schema, table, and UDF, into a format that can be written to a backend, ensuring successful migration of grants in the database.
* Added table progress encoder ([#3083](#3083)). In this release, we've added a table progress encoder to the `WorkflowTask` context to enhance the tracking of table-related operations in the `migration-progress` workflow. This new encoder, implemented in the `TableProgressEncoder` class, is connected to the `sql_backend`, `table_ownership`, and `migration_status_refresher` objects. The `GrantsProgressEncoder` class has been refactored to `GrantProgressEncoder`, with additional parameters for improved encoding of grants. We've also introduced the `refresh_table_migration_status` task to scan and record the migration status of tables and views in the inventory, storing results in the `$inventory.migration_status` inventory table. Two new unit tests have been added to ensure proper encoding and migration status handling. This change improves progress tracking and reporting in the table migration process, addressing issues [#3061](#3061) and [#3064](#3064).
* Combine static code analysis results with historical job snapshots ([#3074](#3074)). In this release, we have added a new method, `JobsProgressEncoder`, to the `WorkflowTask` class in the `databricks.labs.ucx.contexts` module. This method is used to track the progress of jobs in the context of a workflow task, replacing the existing `jobs_progress` method which only tracked the progress of grants. The `JobsProgressEncoder` method takes in additional arguments, including `inventory_database`, to provide more detailed progress tracking for jobs and is used in the `grants_progress` method to track the progress of jobs in the context of a workflow task.
We have also added a new unit test for the `JobsProgressEncoder` class in the `databricks.labs.ucx` project to ensure that the encoding of job information works as expected with different types of failures and job details. Additionally, this revision introduces the ability to include workflow problem records in the historical job snapshots, providing additional context for debugging and analysis. The `JobsProgressEncoder` class is a subclass of the `ProgressEncoder` class and provides additional functionality for tracking the progress of jobs.
* Connected `WorkspacePathOwnership` with `DirectFsAccessOwnership` ([#3049](#3049)). In this revision, the `DirectFsAccessCrawler` class from the `databricks.labs.ucx.source_code.directfs_access` module is imported as `DirectFsAccessCrawler` and `DirectFsAccessOwnership`, and a new `cached_property` called `directfs_access_ownership` is added to the `TableCrawler` class. This property returns an instance of the `DirectFsAccessOwnership` class, which takes in `administrator_locator`, `workspace_path_ownership`, and `workspace_client` as arguments. Additionally, the `DirectFsAccessOwnership` class has been updated to determine DirectFS access ownership for a given table and connect with `WorkspacePathOwnership`, enhancing the tool's functionality by determining access ownership in DirectFS and improving overall system security and permissions management. The `test_directfs_access.py` file has also been updated to test the ownership of query and path records using the new `DirectFsAccessOwnership` object.
* Crawlers: append snapshots to history journal, if available ([#2743](#2743)). This commit introduces a history table to store snapshots after each crawling operation, addressing issues [#2572](#2572) and [#2573](#2573). The changes include the addition of a `HistoryLog` class, which handles appending inventory snapshots to the history table within a specific catalog, workspace, and run_id.
The new methods also include a `TableMigrationStatus` class with a new class variable `__id_attributes__` to specify the attributes used to uniquely identify a table. The `destination()` method has been added to the `TableMigrationStatus` class to return the fully qualified name of the destination table. Additionally, unit and integration tests have been added and updated to ensure the functionality works as expected. The `Table`, `Job`, `Cluster`, and `UDF` classes have been updated with a new `history` attribute to store a string representing a problem associated with the respective class. The `__id_attributes__` class variable has also been added to these classes to specify the attributes used to uniquely identify them.
* Determine ownership of tables based on grants and source code ([#3066](#3066)). In this release, changes have been made to the `application.py` file in the `databricks/labs/ucx/contexts` directory to improve the accuracy of determining table ownership in the inventory. A new class `LegacyQueryOwnership` has been added to the `databricks.labs.ucx.framework.owners` module to determine the owner of a table based on the queries that write to it. The `TableOwnership` class has been updated to accept additional arguments for determining ownership based on grants, queries, and workspace paths. The `DirectFsAccessOwnership` class has also been updated to accept a new `legacy_query_ownership` argument. Additionally, a new method `owner_of_path` has been added to the `Ownership` class, and the `LegacyQueryOwnership` class has been added as a subclass of `Ownership`. A new file `ownership.py` has been introduced, which defines the `TableOwnership` and `TableMigrationOwnership` classes for determining ownership of tables and table migration records in the inventory. These changes provide more accurate and consistent ownership information for tables in the inventory.
* Ensure that pipeline assessment doesn't fail if a pipeline is deleted… ([#3034](#3034)). In this pull request, the pipelines crawler of the DLT assessment feature has been updated to improve its resiliency in the event of a pipeline deletion during crawling. Instead of failing, the crawler now logs a warning and continues to crawl when a pipeline is deleted. A new test method, `test_pipeline_disappears_during_crawl`, has been added to verify that the crawler can handle the deletion of a pipeline after listing the pipelines but before assessing them. The `assessment` and `migration-progress-experimental` workflows have been modified, and new unit tests have been added to ensure the proper functioning of the changes. Additionally, the `test_pipeline_list_with_no_config` test case has been added to check the behavior of the pipelines crawler when there is no configuration present. This pull request aims to enhance the robustness of the assessment feature and ensure its continued operation even in the face of unexpected pipeline deletions.
* Fixed `UnicodeDecodeError` when fetching init scripts ([#3103](#3103)). In this release, we have enhanced the error handling capabilities of the open-source library by fixing a `UnicodeDecodeError` issue that occurred when fetching init scripts in the `_get_init_script_data` method. To address this, we have added `UnicodeDecodeError` and `FileNotFoundError` to the list of exceptions handled in the method. Now, when any of these exceptions occur, the method will return `None` and a warning message will be logged instead of raising an unhandled exception. This change ensures that the function operates smoothly and provides better error handling in the library, without modifying the behavior of the `_check_cluster_init_script` method, which remains unchanged and continues to verify the correct setup of init scripts in the cluster.
* Fixed `UnknownHostException` on the specified KeyVault ([#3102](#3102)).
In this release, we have made significant improvements to the Azure Key Vault integration, addressing issues [#3102](#3102) and [#3090](#3090). We have resolved an `UnknownHostException` problem in a specific KeyVault and implemented error handling for invalid Azure Key Vaults, ensuring more robust and reliable system behavior. Additionally, we have expanded `NotFound` exception handling to include the `InvalidState` exception. When the Azure Key Vault is in an invalid state, the corresponding secret will be skipped, and a warning message will be logged. This enhancement provides a more comprehensive solution to handle various exceptions that may arise when dealing with secrets stored in Azure Key Vaults.
* Fixed `Unsupported schema: XXX` error on `assess_workflows` ([#3104](#3104)). The recent change to the open-source library addresses the `Unsupported schema: XXX` error in the `assess_workflows` function. This was achieved by introducing a new exception class, `InvalidPath`, in the `WorkspaceCache` mixin, and substituting `ValueError` with `InvalidPath` in the `jobs.py` file. The `InvalidPath` exception is used to provide a more specific error message for unsupported schema paths. The `WorkspaceCache` mixin now includes an `InvalidPath` exception for caching workspace paths. The error handling in the `jobs.py` file has been modified to raise `InvalidPath` instead of `ValueError` for better error messages. Additionally, the `test_cached_workspace_path.py` file has updates for testing the `WorkspaceCache` object, including the addition of the `InvalidPath` exception for non-absolute paths, and a new test function for this exception. The `WorkspaceCache` class has an ellipsis in the `__init__` method, indicating additional initialization code not shown in this diff.
* Fixed `assert curr.location is not None` ([#3105](#3105)).
In this release, we have addressed a potential issue in the `_external_locations` method which failed to check if the location of the current Hive table is `None` before proceeding. This oversight could result in unnecessary exceptions when accessing the location of a Hive table. To rectify this, we have introduced a check for `None` that will bypass the current iteration of the loop if the location is not set, thereby improving the robustness of the code. The method continues to return a list of `ExternalLocation` objects, each representing a Hive table or partition location with the corresponding number of tables or partitions present. The `ExternalLocation` class remains unchanged in this commit. This improvement will ensure that the method functions smoothly and avoids errors when dealing with Hive tables that do not have a location set.
* Fixed dynamic import issue ([#3053](#3053)). In this release, we've addressed an issue related to dynamic import inference in our open-source library. Previously, the code did not infer import names when using `importlib.import_module(some_name)`. This has been resolved by implementing a new method, `_make_sources_for_import_call_node`, which infers the import name from the provided node argument. Additionally, we've introduced new functions, `get_global(self, name: str)`, `_adjust_node_for_import_member(self, name: str, match_node: type, node: NodeNG)`, and updated the `_matches(self, node: NodeNG, depth: int)` method to handle attributes as global names. A new unit test, `test_graph_imports_dynamic_import()`, has been added to ensure the proper functioning of the dynamic import feature. Moreover, a new function `is_from_module` has been introduced to check if a given name is from a specific module. This commit, co-authored by Eric Vergnaud, significantly enhances the code's ability to infer imports in dynamic import scenarios.
* Fixed issue with migrating `MANAGED` hive_metastore table to UC for `CONVERT_TO_EXTERNAL` scenario ([#3020](#3020)). This change updates the process for converting a managed Hive Metastore (HMS) table to external in the `CONVERT_TO_EXTERNAL` scenario. The functionality is split into a separate workflow task, executed from a non-Unity Catalog (UC) cluster, and is tested with unit and integration tests. The migrate table function for external sync ensures the table is migrated as external to UC post-conversion. The changes add a new workflow and modify an existing one, in which the `migrate_tables` function is renamed to `convert_managed_hms_to_external`. The new function handles the conversion of managed HMS tables to external, and updates the `object_type` property of the table in the inventory database to `EXTERNAL` after the conversion is completed. The pull request resolves issue [#2840](#2840) and removes the existing functionality of applying grants during the migration process.
* Fixed issue with table location on storage root ([#3094](#3094)). In this release, we have implemented changes to address an issue related to the incorrect identification of the parent folder as an external location when there is a single table with a prefix that matches a parent folder. Additionally, we have improved the storage and retrieval of table locations in the root directory of a storage service by adding support for additional S3 bucket URL formats in the unit tests for the Hive Metastore. This includes handling S3 bucket URLs that do not include a specific file or path, and those with a path that does not include a file. We have also added new test cases for these URL formats and modified existing ones to include them. These changes ensure correct identification of external locations and improve functionality and flexibility of the Hive Metastore's support for external table locations.
The new methods added are not explicitly stated, but they likely involve functions for parsing and processing the new S3 bucket URL formats.
* Fixed snapshot loading for DFSA and used-table crawlers ([#3046](#3046)). This commit resolves issues related to snapshot loading for the DFSA and used-table crawlers when using the spark-based lsql backend. The root cause was the use of `.as_dict()` to convert rows to dictionaries, which is unavailable in the spark-based lsql backend. The fix involves replacing this method with `.asDict()`. Additionally, integration and unit tests were updated to include snapshot loading for these crawlers, and a typo in a test name was corrected. The changes are confined to the `test_queries.py` file and do not affect other parts of the project. No new methods were added, and existing functionality changes were limited to updating the snapshot loading process.
* Ignore failed inference codes when presenting results to Databricks Runtime ([#3087](#3087)). In this release, the `lsp_plugin.py` file has been updated in the `databricks/labs/ucx/source_code` directory to improve the user experience in the notebook editor. The changes include disabling certain advice codes from being propagated, specifically: `cannot-autofix-table-reference`, `default-format-changed-in-dbr8`, `dependency-not-found`, `not-supported`, `notebook-run-cannot-compute-value`, `sql-parse-error`, `sys-path-cannot-compute-value`, and `unsupported-magic-line`. A new variable `DEBUG_MESSAGE_CODES` has been introduced to store the list of advice codes to be ignored, and the list comprehension that creates `diagnostics` in the `pylsp_lint` function has been updated to exclude these codes. These updates aim to reduce the number of unnecessary error messages and improve the accuracy of the linter for supported codes.
* Improve scan tables in mounts ([#2767](#2767)).
In this release, the `scan-tables-in-mounts` functionality in the hive metastore has been significantly improved, providing a more robust and comprehensive solution. Previously, the implementation skipped most directories, only finding 8 tables, but this issue has been addressed, allowing the updated version to parse many more tables. The commit includes bug fixes and the addition of new unit tests. The reviewer is encouraged to refactor the code in future iterations to use the `os` module instead of `dbutils` for listing directories, enabling parallelization and improving scalability. The commit resolves issue [#2540](#2540) and updates the `scan-tables-in-mounts-experimental` workflow. While manual and unit tests have been added and verified, integration tests are still pending implementation. The co-author of this commit is Dan Zafar.
* Removed `WorkflowLinter` as it is part of the `Assessment` workflow ([#3036](#3036)). In this release, the `WorkflowLinter` has been removed as it is now integrated into the `Assessment` workflow, addressing issue [#3035](#3035). This change simplifies the codebase, removing the need for a separate linter while maintaining essential functionality for ensuring Unity Catalog compatibility. The linter's functionality has been merged with other parts of the assessment workflow, with results persisted in the `$inventory_database.workflow_problems` and `$inventory_database.directfs_in_paths` tables. The `assess_workflows` and `assess_dashboards` methods have been updated accordingly, removing `WorkflowLinter` usage. Additionally, the `ExperimentalWorkflowLinter` class has been removed from the `workflows.py` file, along with its associated methods `lint_all_workflows` and `lint_all_queries`. The `test_running_real_workflow_linter_job` function has also been removed due to the integration of the `WorkflowLinter` into the `Assessment` workflow.
Manual testing has been conducted to ensure the correctness of these changes and the continued proper functioning of the assessment workflow.
* Updated permissions crawling so that it doesn't fail if a secret scope disappears during crawling ([#3070](#3070)). This commit enhances the open-source library by updating the permissions crawling process for secret scopes, addressing the issue of task failure when a secret scope disappears before ACL retrieval. The `assessment` workflow has been modified to incorporate these updates, and new unit tests have been added, including one that simulates the disappearance of a secret scope during crawling. The `PermissionsCrawler` class and the `Threads.gather` method have been improved to handle such cases, logging a warning instead of failing the task. The return type of the `get_crawler_tasks` method has been updated to `Iterable[Callable[[], Permissions | None]]`. These changes improve the reliability and robustness of the permissions crawling process for secret scopes, ensuring task completion in the face of unexpected scope disappearances.
* Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)). In this pull request, we have updated the sqlglot library requirement to incorporate the latest version, which includes various bug fixes, refactors, and new features. The latest version now supports the `TO_DOUBLE` and `TRY_TO_TIMESTAMP` functions in Snowflake and the `EDIT_DISTANCE` (Levenshtein) function in BigQuery. Moreover, we've addressed an issue with the `ARRAY JOIN` function in ClickHouse and made changes to the hive dialect hierarchy. We encourage users to update to this latest version to benefit from these enhancements and fixes, ensuring optimal performance and functionality of the library.
* Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)).
In this release, we have updated the requirement for the `sqlglot` library to a version greater than or equal to 25.5.0 and less than 25.28. This change was made to allow for the use of the latest features and bug fixes available in `sqlglot`, while still excluding version 25.28 and later. The new version of `sqlglot` offers several improvements, including but not limited to enhanced query optimization, expanded support for various SQL dialects, and better error handling. We recommend that all users upgrade to the latest version of `sqlglot` to take advantage of these new features and improvements.
* Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)). This release includes an update to the `sqlglot` dependency, changing the version requirement from 25.5.0 up to but excluding 25.28, to a range that includes 25.5.0 up to but excluding 25.29. This change allows for the use of the latest `sqlglot` version and includes all the updates and bug fixes from this library since the previous version. The pull request provides a list of changes made in `sqlglot` since the previous version, as well as a list of relevant commits. Dependabot has been configured to handle any merge conflicts for this pull request and includes commands to trigger various Dependabot actions. This update was made by Dependabot and is indicated by a signed-off-by line.

Dependency updates:

 * Updated sqlglot requirement from <25.26,>=25.5.0 to >=25.5.0,<25.27 ([#3041](#3041)).
 * Updated sqlglot requirement from <25.27,>=25.5.0 to >=25.5.0,<25.28 ([#3048](#3048)).
 * Updated sqlglot requirement from <25.28,>=25.5.0 to >=25.5.0,<25.29 ([#3093](#3093)).
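
The sqlglot bumps listed above each move only the exclusive upper bound of a range such as `>=25.5.0,<25.29`. As a rough illustration of which versions such a range admits, here is a minimal standard-library sketch. The helper names `parse_version` and `in_range` are hypothetical, and the tuple comparison is a simplification: pip and Dependabot apply the full PEP 440 rules, which this sketch does not attempt to reproduce.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '25.28.0' into (25, 28, 0); plain numeric versions only."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper (upper bound exclusive)."""
    def padded(text: str) -> tuple[int, ...]:
        # Pad short versions such as '25.28' to three components.
        parts = parse_version(text)
        return parts + (0,) * (3 - len(parts))

    return padded(lower) <= padded(version) < padded(upper)


# Check one candidate version against the three upper bounds from the notes.
for upper in ("25.27", "25.28", "25.29"):
    print(upper, in_range("25.27.0", "25.5.0", upper))
```

Comparing integer tuples rather than raw strings matters here: as strings, `"25.5.0"` would sort after `"25.27.0"`, which would make the lower bound reject valid versions.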
## Changes Added `py-cpuinfo` to known list ### Linked issues Partly resolve #1931
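
The "known list" changes referenced in this issue all register a distribution and its modules so that imports can be recognized without installing the package. The following is a minimal sketch of how such a list could be consulted. The layout (distribution name, then module name, then a list of known problems) is an assumption and the real `known.json` may differ; the `KNOWN_JSON` excerpt and the `lookup` helper are hypothetical, with module names taken from the release notes in this thread.

```python
import json
from typing import Optional

# Hypothetical excerpt; the real known.json layout may differ.
KNOWN_JSON = """
{
  "mosaicml-streaming": {"simulation": [], "streaming": []},
  "murmurhash": {"murmurhash": [], "murmurhash.about": []},
  "nvidia-ml-py": {"example": [], "pynvml": []}
}
"""


def lookup(known: dict, module: str) -> Optional[str]:
    """Return the distribution that provides `module`, or None if unknown."""
    for distribution, modules in known.items():
        if module in modules:
            return distribution
    return None


known = json.loads(KNOWN_JSON)
print(lookup(known, "pynvml"))             # found in the sample data
print(lookup(known, "not_a_real_module"))  # not in the sample data
```

Under this assumed layout, an empty list for a module would mean it is recognized and has no recorded compatibility problems.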
* Added `MigrationSequencer` for jobs ([#3008](#3008)). In this commit, a `MigrationSequencer` class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of `MigrationStep` objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality is working correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, and new cluster definitions. This change is linked to issue [#1415](#1415) and supersedes issue [#2980](#2980). Additionally, the commit removes some unnecessary imports and fixtures from a test file.
* Added `phik` to known list ([#3198](#3198)). In this release, we have added `phik` to the known list in the provided JSON file. This change addresses part of issue [#1931](#1931), as outlined in the linked issues. The `phik` key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. It is important to note that no existing functionality has been altered and no new methods have been introduced in this commit. The scope of the change is confined to updating the known list in the JSON file by adding the `phik` key.
* Added `pmdarima` to known list ([#3199](#3199)). In this release, we have added `pmdarima` to the known list. `pmdarima` is an open-source Python library for time series analysis, best known for its `auto_arima` function for automatic ARIMA model selection. With this commit, the library is recognized in our known list of libraries, along with its methods and functions for data preprocessing, model selection, and visualization.
The library is particularly useful for fitting ARIMA models and testing for seasonality. By integrating `pmdarima`, users can now perform time series analysis and forecasting with greater ease and efficiency. This change partly resolves issue [#1931](#1931) and underscores our commitment to providing our users with access to the latest and most innovative open-source libraries available.
* Added `preshed` to known list ([#3220](#3220)). A new library, `preshed`, has been added to our project's known list. Written in Cython, `preshed` provides hash tables that assume their keys are already hashed. With the inclusion of two modules, `preshed` and `preshed.about`, this addition partially resolves issue [#1931](#1931). Code that imports `preshed` is now recognized without the package having to be installed.
* Added `py-cpuinfo` to known list ([#3221](#3221)). In this release, we have added support for the `py-cpuinfo` library to our project, enabling the use of the `cpuinfo` functionality that it provides. With this addition, developers can now access detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue [#1931](#1931) and does not affect any existing functionality or add new methods to the codebase. We believe that this improvement will enhance the capabilities of our project and enable more efficient use of CPU resources.
* Cater for empty python cells ([#3212](#3212)). In this release, we have resolved an issue where certain notebook cells in the dependency builder were causing crashes.
Specifically, empty or comment-only cells were identified as the source of the problem. To address this, we have implemented a check to account for these cases, ensuring that an empty tree is stored in the `_python_trees` dictionary if the input cell does not produce a valid tree. This change helps prevent crashes in the dependency builder caused by empty or comment-only cells. Furthermore, we have added a test to verify the fix on a failed repository. If a cell does not produce a tree, the `_load_children_from_tree` method will not be executed for that cell, skipping the loading of any children trees. This enhancement improves the overall stability and reliability of the library by preventing crashes caused by invalid input.
* Create `TODO` issues every nightly run ([#3196](#3196)). A commit has been made to update the `acceptance` repository version in the `acceptance.yml` GitHub workflow from `acceptance/v0.4.0` to `acceptance/v0.4.2`, which affects the integration tests. The `Run nightly tests` step in the GitHub repository's workflow has also been updated to use a newer version of the `databrickslabs/sandbox/acceptance` action, from `v0.3.1` to `v0.4.2`. Software engineers should verify that the new version of the `acceptance` repository contains all necessary updates and fixes, and that the integration tests continue to function as expected. Additionally, testing the updated action is important to ensure that the nightly tests run successfully with up-to-date code and can catch potential issues.
* Fixed Integration test failure of migration_tables ([#3108](#3108)). This release includes a fix for two integration tests (`test_migrate_managed_table_to_external_table_without_conversion` and `test_migrate_managed_table_to_external_table_with_clone`) related to Hive Metastore table migration, addressing issues [#3054](#3054) and [#3055](#3055).
Previously skipped due to underlying problems, these tests have now been unskipped, enhancing the migration feature's test coverage. No changes have been made to the existing functionality, as the focus is solely on including the previously skipped tests in the testing suite. The changes involve removing `@pytest.mark.skip` markers from the test functions, ensuring they run and provide more comprehensive test coverage for the Hive Metastore migration feature. In addition, this release includes an update to DirectFsAccess integration tests, addressing issues related to the removal of DFSA collectors and ensuring proper handling of different file types, with no modifications made to other parts of the codebase.
* Replace MockInstallation with MockPathLookup for testing fixtures ([#3215](#3215)). In this release, we have updated the testing fixtures in our unit tests by replacing the `MockInstallation` class with `MockPathLookup`. Specifically, we have modified the `_load_sources` function to use `MockPathLookup` instead of `MockInstallation` for loading sources. This change not only enhances the testing capabilities of the module but also introduces a new module-level `logger` for more precise logging within the module. Additionally, we have updated the `_load_sources` function calls in the `test_notebook.py` file to pass the file path directly instead of a `SourceContainer` object. This modification allows for more flexible and straightforward testing of file-related functionality, thereby fixing issue [#3115](#3115).
* Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 ([#3224](#3224)). The open-source library `sqlglot` has been updated to version 25.29.0 with this release, incorporating several breaking changes, new features, and bug fixes. The breaking changes include transpiling `ANY` to `EXISTS`, supporting the `MEDIAN()` function, wrapping values in `NOT value IS ...`, and parsing information schema views into a single identifier.
New features include support for the `JSONB_EXISTS` function in PostgreSQL, transpiling `ANY` to `EXISTS` in Spark, transpiling Snowflake's `TIMESTAMP()` function, and adding support for hexadecimal literals in Teradata. Bug fixes include handling a Move edge case in the semantic differ, adding a `NULL` filter on `ARRAY_AGG` only for columns, improving parsing of `WITH FILL ... INTERPOLATE` in Clickhouse, generating `LOG(...)` for `exp.Ln` in TSQL, and optionally parsing a Stream expression. The full changelog can be found in the pull request, which also includes a list of the commits included in this release. * Use acceptance/v0.4.0 ([#3192](#3192)). A change has been made to the GitHub Actions workflow file for acceptance tests, updating the version of the `databrickslabs/sandbox/acceptance` runner to `acceptance/v0.4.0` and granting write permissions for the `issues` field in the `permissions` section. These updates will allow for the use of the latest version of the acceptance tests and provide the necessary permissions to interact with issues. A `TODO` comment has been added to indicate that the new version of the acceptance tests needs to be updated elsewhere in the codebase. This change will ensure that the acceptance tests are up-to-date and functioning properly. * Warn about errors instead to avoid job task failure ([#3219](#3219)). In this change, the `refresh_report` method in `jobs.py` has been updated to log warnings instead of raising errors when certain problems are encountered during its execution. Previously, if there were any errors during the linting process, a `ManyError` exception was raised, causing the job task to fail. Now, errors are logged as warnings, allowing the job task to continue running successfully. This resolves issue [#3214](#3214) and ensures that the job task will not fail due to linting errors, allowing users to be aware of any issues that occurred during the linting process while still completing the job task successfully. 
The updated method checks for errors during the linting process, adds them to a list, and constructs a string of error messages if there are any. This string of error messages is then logged as a warning using the `logger.warning` function, allowing the method to continue executing and the job task to complete successfully. * [DOC] Add dashboard section ([#3222](#3222)). In this release, we have added a new dashboard section to the project documentation, which provides visualizations of UCX's outcomes to help users better understand and manage their UCX environment. The new section includes a table listing the available dashboards, including the Azure service principals dashboard. This dashboard displays information about Azure service principals discovered by UCX in configurations from various sources such as clusters, cluster policies, job clusters, pipelines, and warehouses. Each dashboard has text widgets that offer detailed information about the contents and are designed to help users understand UCX's results and progress in a more visual and interactive way. The Azure service principals dashboard specifically offers users valuable insights into their Azure service principals within the UCX environment. * [DOC] README.md rewrite ([#3211](#3211)). The Databricks Labs UCX package offers a suite of tools for migrating data objects from the Hive metastore to Unity Catalog (UC), encompassing a comprehensive table migration process. This process consists of table mapping, data access setup, creating new UC resources, and migrating Hive metastore data objects. Table mapping is achieved using a table mapping file that defaults to mapping all tables/views to UC tables while preserving the original schema and names, but can be customized as needed. Data access setup involves creating and modifying cloud principals and credentials for UC data. 
New UC resources are created without affecting existing Hive metastore resources, and users can choose from various strategies for migrating tables based on their format and location. Additionally, the package provides installation resources, including a README notebook, a DEBUG notebook, debug logs, and installation configuration, as well as utility commands for viewing and repairing workflows. The migration process also includes an assessment workflow, group migration workflow, data reconciliation, and code migration commands. * [chore] Added tests to verify linter not being stuck in the infinite loop ([#3225](#3225)). In this release, we have added new functional tests to ensure that the linter does not get stuck in an infinite loop, addressing a bug that was fixed in version 0.46.0 related to the default format change from Parquet to Delta in Databricks Runtime 8.0 and a SQL parse error. These tests involve creating data frames, writing them to tables, and reading from those tables, using PySpark's SQL functions and a system information schema table to demonstrate the corrected behavior. The tests also include SQL queries that select columns from a system information schema table with a specified limit, using a withColumn() method to add a new column to a data frame based on a condition. These new tests provide assurance that the linter will not get stuck in an infinite loop and that SQL queries with table parameters are supported. * [internal] Temporarily disable integration tests due to ES-1302145 ([#3226](#3226)). In this release, the integration tests for moving tables, views, and aliasing tables have been temporarily disabled due to issue ES-1302145. The `test_move_tables`, `test_move_views`, and `test_alias_tables` functions were previously decorated with `@retried` to handle potential `NotFound` exceptions and had a timeout of 2 minutes, but are now marked with `@pytest.mark.skip("ES-1302145")`. 
Once the issue is resolved, the `@pytest.mark.skip` decorator should be removed to re-enable the tests. The remaining code in the file, including the `test_move_tables_no_from_schema`, `test_move_tables_no_to_schema`, and `test_move_views_no_from_schema` functions, is unchanged and still functional. * use a path instance for MISSING_SOURCE_PATH and add test ([#3217](#3217)). In this release, the handling of MISSING_SOURCE_PATH has been improved by replacing the string representation with a Path instance using Pathlib, which simplifies checks for missing source paths and enables the addition of a new test for the DependencyProblem class. This test verifies the behavior of the newly introduced method, is_path_missing(), in the DependencyProblem class for determining if a given problem is caused by a missing path. Co-authored by Eric Vergnaud, these changes not only improve the handling and testing of missing paths but also contribute to enhancing the source code analysis functionality of the databricks/labs/ucx project. Dependency updates: * Updated sqlglot requirement from <25.29,>=25.5.0 to >=25.5.0,<25.30 ([#3224](#3224)).
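The "warn about errors instead" change ([#3219](#3219)) described above can be pictured with a small sketch. This is not the actual `refresh_report` code from `jobs.py`: the function name `refresh_report_sketch`, the `lint_job` callback, and the logger name are hypothetical, chosen only to illustrate collecting errors in a list, joining them into one string, and logging that string with `logger.warning` instead of raising.

```python
import logging

logger = logging.getLogger("ucx.sketch")  # hypothetical logger name


def refresh_report_sketch(job_ids, lint_job):
    """Lint every job; log collected errors as one warning instead of raising."""
    errors = []
    for job_id in job_ids:
        try:
            lint_job(job_id)  # hypothetical callback that lints a single job
        except Exception as e:  # sketch only: real code would catch narrower types
            errors.append(f"job {job_id}: {e}")
    if errors:
        # Raising an aggregate exception here would fail the whole job task;
        # logging a warning lets the task finish while still surfacing the problems.
        logger.warning("Errors occurred during linting:\n%s", "\n".join(errors))
    return errors
```

Returning the list of messages is a choice made for this sketch so that callers and tests can inspect what was logged; the release notes do not say what the real method returns.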
* Added `MigrationSequencer` for jobs ([#3008](#3008)). In this commit, a `MigrationSequencer` class has been added to manage the migration sequence for various resources including jobs, job tasks, job task dependencies, job clusters, and clusters. The class builds a graph of dependencies and analyzes it to generate the migration sequence, which is returned as an iterable of `MigrationStep` objects. These objects contain information about the object type, ID, name, owner, required step IDs, and step number. The commit also includes new unit and integration tests to ensure the functionality is working correctly. The migration sequence is used in tests for assessing the sequencing feature, and it handles tasks that reference existing or non-existing clusters or job clusters, and new cluster definitions. This change is linked to issue [#1415](#1415) and supersedes issue [#2980](#2980). Additionally, the commit removes some unnecessary imports and fixtures from a test file. * Added `phik` to known list ([#3198](#3198)). In this release, we have added `phik` to the known list in the provided JSON file. This change addresses part of issue [#1931](#1931), as outlined in the linked issues. The `phik` key has been added with an empty list as its value, consistent with the structure of other keys in the JSON file. No existing functionality has been altered and no new methods have been introduced in this commit; the change is confined to adding the `phik` key to the known list. * Added `pmdarima` to known list ([#3199](#3199)). This release adds `pmdarima`, an open-source Python library for time series analysis best known for automatic ARIMA model selection, to the known list of libraries. The library offers functions for data preprocessing, model selection, and visualization. 
The library is particularly useful for fitting ARIMA models and testing for seasonality. This change partly resolves issue [#1931](#1931). * Added `preshed` to known list ([#3220](#3220)). The `preshed` library has been added to the known list. Written in Cython, `preshed` provides hash tables that assume their keys are pre-hashed. The entry covers two modules, `preshed` and `preshed.about`, and partially resolves issue [#1931](#1931). * Added `py-cpuinfo` to known list ([#3221](#3221)). In this release, we have added the `py-cpuinfo` library to the known list, covering the `cpuinfo` module that it provides. The library exposes detailed information about the CPU, such as the number of cores, current frequency, and vendor, which can be useful for performance tuning and optimization. This change partially resolves issue [#1931](#1931) and does not affect any existing functionality or add new methods to the codebase. * Cater for empty python cells ([#3212](#3212)). In this release, we have resolved an issue where certain notebook cells caused the dependency builder to crash. 
## Changes Added `pytesseract` to known list ### Linked issues Partly resolve #1931
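As a rough illustration of what a known-list entry makes possible, the sketch below looks up whether an imported module belongs to a listed package. The JSON layout (distribution name, then module name, then a list of problems) and the helpers `known_modules` and `is_known` are assumptions made for this example, not the actual contents of the known list or UCX's lookup code.

```python
import json

# Hypothetical excerpt; the real known list's contents and layout may differ.
KNOWN_JSON = """
{
  "pytesseract": {"pytesseract": []},
  "preshed": {"preshed": [], "preshed.about": []}
}
"""


def known_modules(raw: str) -> dict[str, str]:
    """Map each listed module name to the distribution that provides it."""
    known = json.loads(raw)
    return {module: dist for dist, modules in known.items() for module in modules}


def is_known(module: str, index: dict[str, str]) -> bool:
    """A module counts as known if it, or any parent package, is listed."""
    parts = module.split(".")
    return any(".".join(parts[:i]) in index for i in range(len(parts), 0, -1))


index = known_modules(KNOWN_JSON)
print(is_known("pytesseract", index))
print(is_known("some_unlisted_pkg", index))
```

Matching on parent packages is a design choice of this sketch: it lets a submodule that is not listed explicitly still resolve to its listed top-level package.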
* Added `pytesseract` to known list ([#3235](#3235)). The `known.json` file, which tracks packages with native code, now includes `pytesseract`, an Optical Character Recognition (OCR) tool for Python. This change addresses part of issue [#1931](#1931). The diff does not show how `pytesseract` is used within the project, so further context or documentation may be necessary for a complete understanding of the integration. * Added hyperlink to database names in database summary dashboard ([#3310](#3310)). Database names in the `Database Summary` dashboard are now clickable and open the corresponding database page in a new tab. This has been accomplished by adding a `linkUrlTemplate` property to the `database` field in the `encodings` object within the `overrides` property of the dashboard configuration. The commit also includes tests to verify the new functionality in the labs environment and addresses issue [#3258](#3258). Furthermore, the display of various other statistics, such as the number of tables, views, and grants, has been improved by converting them to links, enhancing the overall usability and navigation of the dashboard. * Bump codecov/codecov-action from 4 to 5 ([#3316](#3316)). In this release, the version of the `codecov/codecov-action` dependency has been bumped from 4 to 5, which introduces several new features and improvements to the Codecov GitHub Action. The new version utilizes the Codecov Wrapper for faster updates and better performance, as well as an opt-out feature for tokens in public repositories. 
This allows contributors to upload coverage reports without requiring access to the Codecov token, improving security and flexibility. Additionally, several new arguments have been added, including `binary`, `gcov_args`, `gcov_executable`, `gcov_ignore`, `gcov_include`, `report_type`, `skip_validation`, and `swift_project`. These changes enhance the functionality and security of the Codecov GitHub Action, providing a more robust and efficient solution for code coverage tracking. * Depend on a Databricks SDK release compatible with 0.31.0 ([#3273](#3273)). In this release, we have updated the minimum required version of the Databricks SDK to 0.31.0 due to the introduction of a new `InvalidState` error class that is not compatible with the previously declared minimum version of 0.30.0. This change was necessary because Databricks Runtime (DBR) 16 ships with SDK 0.30.0 and does not upgrade to the latest version during installation, unlike previous versions of DBR. This change affects the project's dependencies as specified in the `pyproject.toml` file. We recommend that users verify their systems are compatible with the new version of the Databricks SDK, as this change may impact existing integrations with the project. * Eliminate redundant migration-index refresh and loads during view migration ([#3223](#3223)). In this pull request, we have optimized the view migration process in the `databricks/labs/ucx/hive_metastore/table_metastore.py` file by eliminating redundant migration-status indexing operations. We have removed the unnecessary refresh of migration-status for all tables/views at the end of view migration, and stopped reloading the migration-status snapshot for every view when checking if it can be migrated and prior to migrating a view. We have introduced a new class `TableMigrationIndex` and imported the `TableMigrationStatusRefresher` class. 
The `_migrate_views` method now takes an additional argument `migration_index`, which is used in the `ViewsMigrationSequencer` and in the `_migrate_view` method. The `_view_can_be_migrated` and `_sql_migrate_view` methods now also take `migration_index` as an argument, which is used to determine if the view can be migrated. These changes aim to improve the efficiency of the view migration process, making it faster and more resource-friendly. * Fixed backwards compatibility breakage from Databricks SDK ([#3324](#3324)). In this release, we have addressed a backwards compatibility issue that was caused by an update to the Databricks SDK. This was done by adding new methods to the `databricks.sdk.service` module to interact with dashboards. Additionally, we have fixed bug [#3322](#3322) and updated the `create` function in the `conftest.py` file to utilize the new `dashboards` module and its `Dashboard` class. The function now returns the dashboard object as a dictionary and calls the `publish` method on this object to publish the dashboard. These changes also include an update to the `pyproject.toml` file, which affects the test and coverage scripts used in the default environment. The minimum required test coverage has been lowered from 90% to 89%: the test command now includes the `--cov-fail-under=89` flag, so the test run fails if coverage drops below that threshold. * Fixed issue with cleanup of failed `create-missing-principals` command ([#3243](#3243)). In this update, we have improved the `create_uc_roles` method within the `access.py` file of the `databricks/labs/ucx/aws` directory to handle failures during role creation caused by permission issues. 
If a failure occurs, the method now deletes any created roles before raising the exception, restoring the system to its initial state. This ensures that the system remains consistent and prevents the accumulation of partially created roles. The update includes a try-except block around the code that creates the role and adds a policy to it, and it logs an error message, deletes any previously created roles, and raises the exception again if a `PermissionDenied` or `NotFound` exception is raised during this process. We have also added unit tests to verify the behavior of the updated method, covering the scenario where a failure occurs and the roles are successfully deleted. These changes aim to improve the robustness of the `databricks labs ucx create-missing-principals` command by handling permission errors and restoring the system to its initial state. * Improve error handling for `assess_workflows` task ([#3255](#3255)). This pull request introduces improvements to the `assess_workflows` task in the `databricks/labs/ucx` module, focusing on error handling and logging. A new error type, `DatabricksError`, has been added to handle Databricks-specific exceptions in the `_temporary_copy` method, ensuring proper handling and re-raising of Databricks-related errors as `InvalidPath` exceptions. Additionally, log levels for various errors have been updated to better reflect their severity. Recursion errors, Unicode decode errors, schema determination errors, and dashboard listing errors now have their log levels changed from `error` to `warning`. These adjustments provide more fine-grained control over error messages' severity and help avoid unnecessary alarm when these issues occur. These changes improve the robustness, error handling, and logging of the `assess_workflows` task, ensuring appropriate handling and logging of any errors that may occur during execution. * Require at least 4 cores for UCX VMs ([#3229](#3229)). 
In this release, the selection of `node_type_id` in the `policy.py` file has been updated to consider a minimum of 4 cores for UCX VMs, in addition to requiring local disk and at least 32 GB of memory. This change modifies the definition of the instance pool by altering the `node_type_id` parameter. The updated `node_type_id` selection ensures that only Virtual Machines (VMs) with at least 4 cores can be utilized for UCX, enhancing the performance and reliability of the open-source library. This improvement requires a minimum of 4 cores to function properly.
* Skip `test_feature_tables` integration test ([#3326](#3326)). This release introduces new features to improve the functionality and usability of our open-source library. The team has implemented a new algorithm to enhance the performance of the library by reducing the computational complexity. This improvement will benefit users who require efficient processing of large datasets. Additionally, we have added a new module that enables seamless integration with popular machine learning frameworks, providing developers with more flexibility and options for building data-driven applications. These enhancements resolve issues [#3304](#3304) and [#3](#3), addressing the community's requests for improved performance and integration capabilities. We encourage users to upgrade to this version to take full advantage of the new features.
* Speed up `update_migration_status` jobs by eliminating lots of redundant SQL queries ([#3200](#3200)). In this release, the `_retrieve_acls` method in the `grants.py` file has been updated to remove the `_is_migrated` method and inline its functionality, resulting in improved performance for `update_migration_status` jobs. The `_is_migrated` method previously queried the migration status index for each table, but the updated method now refreshes the index once and then uses it for all checks, eliminating redundant SQL queries. Affected workflows include `migrate-tables`, `migrate-external-hiveserde-tables-in-place-experimental`, `migrate-external-tables-ctas`, `scan-tables-in-mounts-experimental`, and `migrate-tables-in-mounts-experimental`, all of which have been updated to utilize the refreshed migration status index and remove dead code. This release also includes updates to existing unit tests and integration tests to ensure the changes' correctness.
* Tech Debt: Fixed issue with Incorrect unit test practice ([#3244](#3244)). In this release, we have made significant improvements to the test suite for our AWS module. Specifically, the test case for `test_get_uc_compatible_roles` in `tests/unit/aws/test_access.py` has been updated to remove mocking code and directly call the `save_uc_compatible_roles` method, improving the accuracy and reliability of the test. Additionally, the MagicMock for the `load` method in the `mock_installation` object has been removed, further simplifying the test code and making it easier to understand. These changes will help to prevent bugs and make it easier to modify and extend the codebase in the future, improving the maintainability and overall quality of our open-source library.
* Updated `migration-progress-experimental` workflow to crawl tables from the `main` cluster ([#3269](#3269)). In this release, we have updated the `migration-progress-experimental` workflow to crawl tables from the `main` cluster instead of the `tacl` one. This change resolves issue [#3268](#3268) and addresses the problem of the Py4j bridge required for crawling not being available in the `tacl` cluster, leading to failures. The `setup_tacl` job task has been removed, and the `crawl_tables` task has been updated to no longer rely on the TACL cluster, instead refreshing the inventory directly. A new dependency has been added to ensure that the `crawl_tables` task runs after the `verify_prerequisites` task. The `refresh_table_migration_status` task and `update_tables_history_log` task have also been updated to assume that the inventory and migration status have been refreshed in the previous step. A TODO has been added to avoid triggering an implicit refresh if either the table or migration-status inventory is empty.
* Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 ([#3241](#3241)). In this pull request, we have updated the `databricks-labs-lsql` requirement in the `pyproject.toml` file to a range of greater than 0.5 and less than 0.14, allowing the use of the latest version of this library. The update includes release notes and a changelog from the `databricks-labs-lsql` GitHub repository, detailing new features, bug fixes, and improvements. Notable changes include the addition of the `escape_name` and `escape_full_name` functions, various dependency updates, and modifications to the `as_dict()` method in the `Row` class. This update also includes a list of dependency version updates from the `databricks-labs-lsql` changelog.
* Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 ([#3321](#3321)). In this release, the `databricks-labs-lsql` package requirement has been updated to version '>=0.5,<0.15' in the pyproject.toml file. This update addresses multiple issues and includes several improvements, such as bug fixes, dependency updates, and the addition of go-git libraries. The `RuntimeBackend` component has been improved with better exception handling, and new `escape_name` and `escape_full_name` functions have been added for SQL name escaping. The 'Row.as_dict()' method has been deprecated in favor of 'asDict()'. The `SchemaDeployer` class now allows overwriting the default `hive_metastore` catalog, and the `MockBackend` component has been improved to properly mock the `savetable` method in `append` mode. Filter specification files have been converted from JSON to YAML format for improved readability. Additionally, the test suite has been expanded, and various methods have been updated to improve codebase readability, maintainability, and ease of use.
* Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 ([#3320](#3320)). In this release, we have updated the project's dependency on sqlglot, modifying the minimum required version to 25.5.0 and setting the maximum allowed version to below 25.32. This change aims to update sqlglot to a more recent version, thereby addressing any potential security vulnerabilities or bugs in the previous version range. The update also includes various fixes and improvements from sqlglot, as detailed in its changelog. The individual commits have been truncated and can be viewed in the compare view. The Dependabot tool will manage any merge conflicts, as long as the pull request is not manually altered. Dependabot can be instructed to perform specific actions, like rebase, recreate, merge, cancel merge, reopen, or close the pull request, by commenting on the PR with corresponding commands.
* Use internal Permissions Migration API by default ([#3230](#3230)). This pull request introduces support for both legacy and new permission migration workflows in the Databricks UCX project. A new configuration option, `use_legacy_permission_migration`, has been added to `WorkspaceConfig` to toggle between the two workflows. When the legacy workflow is not enabled, certain steps in `workflows.py` are skipped and related methods have been renamed to reflect the legacy workflow. The `GroupMigration` class has been renamed to `LegacyGroupMigration` and integration and unit tests have been updated to use the new configuration option and renamed classes/methods. The new workflow no longer queries the `hive_metastore`.`ucx`.`groups` table in certain methods, resulting in changes to the behavior of the `test_runtime_workspace_listing` and `test_runtime_crawl_permissions` tests. Overall, these changes provide flexibility for users to choose between legacy and new permission migration workflows in the Databricks UCX project.

Dependency updates:
* Updated databricks-labs-lsql requirement from <0.13,>=0.5 to >=0.5,<0.14 ([#3241](#3241)).
* Updated databricks-labs-lsql requirement from <0.14,>=0.5 to >=0.5,<0.15 ([#3321](#3321)).
* Updated sqlglot requirement from <25.30,>=25.5.0 to >=25.5.0,<25.32 ([#3320](#3320)).
* Bump codecov/codecov-action from 4 to 5 ([#3316](#3316)).
Problem statement
The various DBR runtimes include many packages [1] that are always available and do not need to be installed or declared by notebooks (or jobs): they can simply be used. At present our dependency tracking isn't aware of these.
Proposed Solution
The packages included in the various DBR versions should be included in the list of known packages that we maintain.
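A minimal sketch of how such a list might be consulted during dependency tracking. Everything named here is hypothetical: `KNOWN_DBR_PACKAGES` is a hand-written sample, not the project's actual known-packages data, and `unknown_imports` is an illustrative helper rather than existing UCX code.

```python
import ast

# Hypothetical sample: distribution name -> top-level modules it provides.
# A real list would be generated from the DBR images, not written by hand.
KNOWN_DBR_PACKAGES = {
    "numpy": {"numpy"},
    "pandas": {"pandas"},
    "pyyaml": {"yaml"},
}


def known_modules(known):
    """Flatten the mapping into a set of importable top-level module names."""
    return {module for modules in known.values() for module in modules}


def unknown_imports(source, known):
    """Return the top-level modules imported by `source` that `known` does not cover."""
    covered = known_modules(known)
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 keeps absolute imports only; relative imports are local code
            found.add(node.module.split(".")[0])
    return sorted(found - covered)
```

Only the imports left over after this check would need to be resolved some other way. Note that the sketch makes no attempt to recognise standard-library modules, which would need separate handling.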
Additional Context
The published lists for each DBR version are roughly correct but not complete: it turns out that the base OS images used also include some packages. I've scanned most of the currently supported DBR versions (9.1, 10.4, 11.3, 12.2, 13.3, 14.1, 14.2, 14.3, 15.1 & 15.2) and produced this list of installed pip packages and the various versions in use across these runtimes.
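For reference, a scan along these lines can be sketched with the standard library's `importlib.metadata`. This is an assumption about how one might reproduce the scan, not the script that was actually used; `installed_packages` and `merge_scans` are hypothetical helper names.

```python
from importlib import metadata


def installed_packages():
    """Map each installed distribution's normalised name to its version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no usable metadata
            packages[name.lower().replace("_", "-")] = dist.version
    return packages


def merge_scans(scans):
    """Combine per-runtime scans into {package: {version: [runtimes, ...]}}."""
    merged = {}
    for runtime, packages in scans.items():
        for name, version in packages.items():
            merged.setdefault(name, {}).setdefault(version, []).append(runtime)
    return merged
```

Running `installed_packages()` once in a notebook on each runtime and passing the collected results to `merge_scans` (keyed by DBR version) would give one entry per package with the versions seen across runtimes.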
Footnotes
1. As an example, here is the list of packages for DBR 14.3.