[FEATURE] Support loading source data to OpenSearch index in batches #699

dai-chen opened this issue Sep 25, 2024 · 1 comment

dai-chen commented Sep 25, 2024

Is your feature request related to a problem?

Currently, there is no way to load data from a source table into an OpenSearch index while controlling the number of rows in each batch. Users must instead rely on covering indexes or materialized views for data loading, but these only allow limiting the number of files, or the total byte size (with Spark 4.0), per refresh.

What solution would you like?

To achieve row-based batching, I propose letting users run a low-level INSERT statement directly against the OpenSearch table. Users could then control the number of rows loaded in each batch by specifying row ranges, similar to:

-- Generate batches using a SQL window function:
WITH numbered_records AS (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY id) AS row_num
    FROM table
    WHERE date = 'xxx'
)
INSERT INTO [os_index_table]
SELECT *
FROM numbered_records
WHERE row_num > [start] AND row_num <= [end]

-- Alternatively, Spark 3.5.2 supports the OFFSET clause:
INSERT INTO [os_index_table]
SELECT *
FROM table
WHERE date = 'xxx'
ORDER BY id
LIMIT [page_size]
OFFSET [page_start]
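
For illustration, a driver could wrap the window-function variant above in a loop that issues one INSERT per batch. The sketch below is Scala run against an existing SparkSession named spark (e.g., in spark-shell); the table names (source_table, os_index_table), the id ordering column, the date filter, and the batch size are placeholders, and INSERT into an OpenSearch table is exactly the capability this issue proposes, so none of this runs today:

// Illustrative sketch only: driver-side batching loop for the window-function approach.
// Names (source_table, os_index_table, id, date) and the batch size are placeholders,
// and INSERT into an OpenSearch table is the proposed capability, not an existing one.
val batchSize = 10000L
val totalRows = spark.sql(
  "SELECT COUNT(*) FROM source_table WHERE date = 'xxx'"
).head().getLong(0)

var start = 0L
while (start < totalRows) {
  val end = start + batchSize
  spark.sql(
    s"""
       |WITH numbered_records AS (
       |    SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS row_num
       |    FROM source_table
       |    WHERE date = 'xxx'
       |)
       |INSERT INTO os_index_table
       |SELECT * FROM numbered_records  -- in practice, list columns to exclude row_num
       |WHERE row_num > $start AND row_num <= $end
       |""".stripMargin)
  start = end
}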

What alternatives have you considered?

Under certain conditions, such as when no filtering is applied or filtering is limited to partitions, it may be possible to implement row-based control inside covering indexes or materialized views. However, we would need to evaluate whether this aligns with the intended behavior and design of Flint index refresh.

Do you have any additional context?

  • OpenSearch table meta issue: [EPIC] Zero-ETL - OpenSearch Table #185
  • Performance degradation: Without context shared across requests (e.g., a cursor or saved sort state), this solution suffers from the deep pagination problem: each batch must re-scan and re-sort all the rows skipped so far, so performance degrades as paging goes deeper on large datasets.

dai-chen commented Oct 1, 2024

To implement the approach outlined above, we need to integrate the OpenSearch table concept into Spark SQL by:

  1. Enabling the OpenSearch catalog and ensuring that multiple catalogs function seamlessly in Spark.
  2. Supporting the CREATE TABLE statement for the OpenSearch catalog.
  3. Supporting the INSERT statement for OpenSearch tables.

In addition to these primary tasks, several supplementary efforts are required, such as handling data type mapping between Spark SQL and OpenSearch, as discussed in issue #699. Given the complexity of these tasks, I am currently exploring alternative approaches that avoid exposing the OpenSearch table concept for now, such as the COPY command previously proposed in #129 (comment).
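
For reference, here is a rough sketch of what the user-facing flow for steps 1-3 could look like once an OpenSearch catalog exists. The catalog plugin class name, catalog name (dev), connection option, and table schema below are illustrative assumptions, not an implemented API; only the spark.sql.catalog.<name> plugin mechanism itself is standard Spark:

// Hypothetical flow only; the catalog class, catalog name, options, and schema
// below are illustrative assumptions, not an implemented API.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("opensearch-batch-load")
  // 1. Register an OpenSearch catalog so it works alongside the default catalog.
  .config("spark.sql.catalog.dev", "org.apache.spark.opensearch.catalog.OpenSearchCatalog")
  .config("spark.sql.catalog.dev.opensearch.host", "localhost")  // assumed option name
  .getOrCreate()

// 2. CREATE TABLE on the OpenSearch catalog (backed by an OpenSearch index);
//    provider/options omitted for brevity.
spark.sql("CREATE TABLE dev.default.my_index (id BIGINT, name STRING, date DATE)")

// 3. INSERT one batch into the OpenSearch table.
spark.sql(
  """INSERT INTO dev.default.my_index
    |SELECT id, name, date FROM source_table
    |WHERE date = 'xxx'
    |ORDER BY id
    |LIMIT 10000 OFFSET 0
    |""".stripMargin)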
