Glue DataCatalog Cache (GDC) Design Document #1484

ps48 · 2024-03-05T18:19:00Z

Overview

OpenSearch S3 Zero ETL (Flint) feature provides a unique solution to allow users run direct queries with/w/o acceleration with their data stored in S3. This possesses a unique problem where UI needs to display the catalog including datasources, databases and tables to enable the user to run any query. The fetch the catalog UI needs to run a sequence of show databases, tables and accelerations queries which take ~10-20s to respond each. This adds up a lot of UI wait time for the user, eventually losing interest in the solution. Glue DataCatalog Cache (GDC) aims to solve this issue by running the show databases, tables and accelerations in the background and storing them in cache so UI retrieval is faster and users can focus on working with the overall solution.

Motivation

With Glue DataCatalog Cache we aim to solve the below items:

Reduce user wait time on UI to see catalog data
Reduce no. of queries sent to the backend to fetch catalog information
Give users choice to hard reload the cached data

OpenSearch Dashboards elements using GDC:

The Query Workbench and Datasources dashboards plugins will be the first UI elements to start using the Glue DataCatalog cache in OS version 2.13. Later, In OS 2.14 we’ll have Data Explorer/Discover Cohesion project will utilize the same cache object to load tables and accelerations.

Workbench 2.13

DataSources 2.13

Requirements

Save a cached view of the Glue data catalog
Persist cache across browser reloads
Restrict cache access to the users who have access to the datasource
Allow users to invoke a hard reload on all cached data
Allow users to invoke a hard reload on partial cached data (only accelerations, only tables and databases)

Architecture Design

Today, with every page load of workbench plugin for a datasource we auto-run the queries to load databases. Thereafter, users need to click on loaded databases to fetch tables for the particular database and further click on the tables to loaded the acceleration indices. The would have worked well for the UI if the catalog load was instantaneous instead of ~10-20s per load.

In the Glue DataCatalog Cache, we propose to load top 4000 elements including databases, tables and accelerations (10k elements is the OSD api limit, but UI can become sluggish well before 10k). Not limiting to a certain number of elements may lead to infinite loading time.

Cache Data Request

Approach 1:

+---------------------------+
| 1. Show Databases         |
+---------------------------+
       |
       V
+---------------------------+
| 2. For each Database      |
|    Show Tables            |
|    Show Materialized Views|
+---------------------------+
       |
       V
+---------------------------+
| 3. For each Table         |
|    Show Indexes           |
|    (Skip Index & Covering |
|     Index Information)    |
+---------------------------+

Show Databases: Initially, we query the Glue Data Catalog with SHOW DATABASES query to list all available databases.
For Each Database, Show Tables and Materialized Views: For each database obtained in the previous step, we retrieve a list of tables and materialized views present within that database using SHOW TABLES and SHOW MATERIALIZED VIEWS.
For Each Table, run show Indexes: we run the queries SHOW SKIPPING INDEX and SHOW COVERING INDEX to retrieve information about its skipping and covering indexes for the particular table

Approach 2 (Preferred):

+---------------------------+
| 1. Show Databases         |
+---------------------------+
       |
       V
+---------------------------+
| 2. For each Database      |
|    Show Tables            |
+---------------------------+
       |
       V
+---------------------------+
| 3. Show Accelerations     |
|    for All Tables         |
+---------------------------+

Show Databases: Again, we start by querying the Glue Data Catalog to list all available databases.
For Each Database: Show Tables: For each database obtained in the previous step, we retrieve a list of tables present within that database.
Show Accelerations for All Tables: After obtaining the list of tables from all databases, we directly run SHOW ACCELERATIONS query to retrieve information about all accelerations associated with all tables and databases. This step consolidates all tables from all databases to fetch acceleration details in a single operation, rather than querying each table individually.

Cache Data Structure

{
  "version": "1.0",
  "dataSources": [
    {
      "name": "DataSource1",
      "lastUpdated": "2024-02-20T12:00:00Z", // Saved in UTC
      "status": "Updated", // Failed, Empty, Loading
      "databases": [
        {
          "name": "Database1",
          "materializedViews": [
            {
              "name": "MaterializedView1"
            },
            {
              "name": "MaterializedView2"
            }
          ],
          "tables": [
            {
              "name": "Table1",
              "columns": [
                {"name": "column1", "dataType": "datatype1"},
                {"name": "column2", "dataType": "datatype2"},
                {"name": "column3", "dataType": "datatype3"}
              ],
              "skippingIndex": {
                "indexName": "SkippingIndex1"
              },
              "coveringIndices": [
                {
                  "indexName": "CoveringIndex1"
                },
                {
                  "indexName": "CoveringIndex2"
                }
              ]
            },
            {
              "name": "Table2",
              "columns": [
                {"name": "column4", "dataType": "datatype4"},
                {"name": "column5", "dataType": "datatype5"}
              ],
              "skippingIndex": {
                "indexName": "SkippingIndex2"
              },
              "coveringIndices": []
            }
          ]
        }
      ]
    }
  ]
}

This JSON structure provides a cached representation of the data catalog, including databases, tables, materialized views, and associated index information.

LastUpdated: Indicates the timestamp when the cache of the data catalog was last updated. In this case, it was last updated on February 20, 2024, at 12:00:00 (UTC).
Version: Specifies the version of the cache. This helps in tracking changes and updates to the cache format. The current version is 1.0.
Status: Indicates the status of the cache. In this case, it's marked as "Updated", suggesting that the cache is currently in use and up-to-date.
DataSources: Represents different data sources available in the data catalog.
- Name: The name of the data source.
- Databases: Represents databases within each data source.
  - Name: The name of the database.
  - MaterializedViews: Represents materialized views within the database. These are precomputed views stored on disk.
  - Name: The name of the materialized view.
  - Tables: Represents tables within the database.
  - Name: The name of the table.
    - SkippingIndex: Represents the skipping index associated with the table. A skipping index is an index that may not cover all the columns but can still be useful for direct queries.
      - IndexName: The name of the skipping index.
    - CoveringIndices: Represents covering indices associated with the table. A covering index is an index that includes all the columns required for a query, allowing the query to be answered directly from the index without accessing the table.
      - IndexName: The name of the covering index.

Cache Storage

Approach 1: Kibana Index

In this approach, we opt to store the cache in the .kibana index, leveraging Kibana's saved objects feature. Since the Glue DataCatalog exclusively contains metadata without user-specific data, it is suitable for storage within the Kibana index of the respective user. This approach ensures that the cache is stored per user, aligning with the lifecycle of the user's interactions. To facilitate storing and retrieving the cached information, plugins utilize the OSD SavedObjects Client.

Approach 2: Index Managed by the SQL plugin

Option 2.1: Re-use the .ql-datasources index. This index is currently managed for storing Glue datasources and associated metadata, such as datasource type, connector type, and authentication type. However, adopting this option would necessitate significant refactoring since adding any information to this index is presently restricted to super admins of OpenSearch. Additionally, utilizing this index for storing cache information related to the catalog may not align well with the Separation of Responsibility principle, as it could introduce complexities and potential conflicts with the primary purpose of the index.

Option 2.2: Create a new flint index .cache-datasources managed by the sql plugin. This requires the SQL plugin to manage one index per datasource. This approach provides more flexibility and autonomy in managing cache-related data independently from other indices and metadata.

Approach 3: Browser storage

Option 3.1 Local Storage: (preferred) In this approach, the cache is stored directly within the browser's local storage. This option offers simplicity and speed since data retrieval occurs locally without the need for server communication. However, browser storage may have limitations in terms of storage capacity and data persistence, particularly across different devices or sessions.

Option 3.2 Session storage: Similar to browser storage, session storage involves storing the cache within the browser's session storage. This option provides a temporary storage solution tied to the current browser session, ensuring data persistence only for the duration of the session. Session storage offers enhanced security and privacy since data is cleared upon session termination but may not be suitable for long-term storage requirements.

Other Design Considerations

SQL plugin handles the management of cache by running the Show databases, tables and accelerations queries internally and providing APIs to the OSD plugin for consuming the cached information. This approach may pose an issue in the longer term when we head towards creating a Managed Flint Service.

Workflows

Initialization and Cache Refresh:
- The User loads the UI for the first time or requests a cache refresh.
- The OpenSearchDashboards component activates and checks or updates the cache status.
- OpenSearchDashboards sends a query to OpenSearchSQL to show databases.
- OpenSearchSQL forwards the query request to Spark, which accesses the metadata store (AWSGlue).
- OpenSearchSQL receives the response from Spark and sends it back to OpenSearchDashboards.
- OpenSearchDashboards then proceeds to show tables in multiple databases, similar to the initial steps of showing databases.
- Lastly, OpenSearchDashboards queries for accelerations using OpenSearchSQL, and the process repeats with Spark accessing the metadata store.
Subsequent UI Loads:
- After the initial load or cache refresh, subsequent UI loads involve a simpler process.
- The User loads the UI again, and OpenSearchDashboards checks the cache status.
- OpenSearchDashboards then proceeds to show results to the User.

Security

Today, the access controls of datasources if by default restricted to the admin of the OpenSearch domain. To add more users we need to edit access permissions of the datasource and add security roles to it. Users with the security role will get access to the datasource. While, we keep GDC in one of the storage options we should adhere to the access controls maintained over the datasources.

Appendix

SQL Async queries Interface: https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst
OpenSearch Spark Accelerations: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md
OpenSearch Spark Design: [RFC] OpenSearch and Apache Spark Integration opensearch-spark#4
AWS OpenSearch Direct Query Setup: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/direct-query-s3.html

The text was updated successfully, but these errors were encountered:

…ensearch-project#1484) Signed-off-by: Derek Ho <[email protected]> (cherry picked from commit 8dab6a3ef00c7127f6713701187abac1addfc68a) Co-authored-by: Derek Ho <[email protected]>

ps48 added enhancement New feature or request untriaged labels Mar 5, 2024

ps48 self-assigned this Mar 5, 2024

ps48 removed the untriaged label Mar 5, 2024

This was referenced Mar 8, 2024

Catalog cache and Session update for async queries #1500

Merged

Add datasource field in accelerations cache #1525

Merged

ps48 mentioned this issue Mar 26, 2024

Clear cache on any 401 response errors #1634

Merged

6 tasks

ps48 mentioned this issue May 1, 2024

Adding test for clear cache on logout #1794

Merged

6 tasks

ps48 closed this as completed Nov 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Glue DataCatalog Cache (GDC) Design Document #1484

Glue DataCatalog Cache (GDC) Design Document #1484

ps48 commented Mar 5, 2024

Glue DataCatalog Cache (GDC) Design Document #1484

Glue DataCatalog Cache (GDC) Design Document #1484

Comments

ps48 commented Mar 5, 2024

Overview

Motivation

OpenSearch Dashboards elements using GDC:

Workbench 2.13

DataSources 2.13

Requirements

Architecture Design

Cache Data Request

Approach 1:

Approach 2 (Preferred):

Cache Data Structure

Cache Storage

Approach 1: Kibana Index

Approach 2: Index Managed by the SQL plugin

Approach 3: Browser storage

Other Design Considerations

Workflows

Security

Appendix