Implement Post-Processor Hooks for Refreshing Index after Ingestion #4885

Open
SavvasSriAnushaVeeramachineni opened this issue Aug 28, 2024 · 2 comments
Labels
enhancement, question

Comments

SavvasSriAnushaVeeramachineni commented Aug 28, 2024

Is your feature request related to a problem? Please describe.
Currently, our ETL job runs every 30 minutes and writes a file to S3, which triggers an OpenSearch ingestion pipeline. Because the ETL completion time varies, it is difficult to choose an index-level refresh_interval that works consistently for all scenarios.

As a result of this behavior, there is a delay in the data being available even though the ingestion to OpenSearch is complete.

Describe the solution you'd like
We propose adding a new configuration option for HTTP post-processor hooks to the Data Prepper pipeline definition. It would let us specify an HTTP POST endpoint and call the refresh API (/index-name/_refresh) once pipeline ingestion has completed.

Currently, the processors available in the pipeline definition run only before data is ingested into OpenSearch.
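For illustration, the refresh call that such a hook would make can be sketched in Python. This is not Data Prepper code and not a proposed implementation: the function names, base URL, and index name are hypothetical, and the sketch assumes an endpoint that accepts unauthenticated requests (a real domain will typically need authentication, which is omitted here).

```python
import urllib.request
from urllib.parse import quote


def build_refresh_request(base_url: str, index: str) -> urllib.request.Request:
    """Build a POST request for the refresh API of a single index."""
    # The index name is URL-encoded so it is safe to place in the path.
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_refresh"
    return urllib.request.Request(url, data=b"", method="POST")


def refresh_index(base_url: str, index: str, timeout: float = 10.0) -> int:
    """Send the refresh request and return the HTTP status code."""
    request = build_refresh_request(base_url, index)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling refresh_index("https://localhost:9200", "etl-data") would issue POST /etl-data/_refresh against that (hypothetical) host. The request in this issue is for Data Prepper to make the equivalent call itself once the sink has finished writing.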

Describe alternatives you've considered (Optional)
Provide a refresh option in the pipeline's index settings that internally refreshes the index after the pipeline finishes executing.

Additional context
N/A

@dlvenable
Member

@SavvasSriAnushaVeeramachineni, thank you for opening this issue. I understand that you'd like Data Prepper to automatically call the _refresh API for every updated index.

Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?

> As a result of this behavior, there is a delay in the data being available even though the ingestion to OpenSearch is complete.

What is your delay?

Also, have you tried using the default refresh_interval to let OpenSearch handle it?

dlvenable added the enhancement and question labels and removed the untriaged label on Sep 3, 2024
Author

SavvasSriAnushaVeeramachineni commented Sep 3, 2024

@dlvenable Thanks for replying!

> Can you clarify what will try making that call? Are you using S3-scan? Do you want the completion of the scan to trigger the refresh?

  • We are using S3-SQS processing.
  • I want the _refresh API to be called after all the records in the CSV file have been ingested at the sink (OpenSearch).
  • We want the data to appear in search results immediately after the pipeline ingests it into OpenSearch.

> What is your delay?

  • The delay is around 30 minutes.

> Also, have you tried using the default refresh_interval to let OpenSearch handle it?

  • Yes, I have tried setting refresh_interval to 30 minutes, but the ingestion completion time does not always fall within the 30-minute window. If a refresh cycle has just completed and the pipeline inserts data 1 minute later, we still have to wait about another 30 minutes for the data to appear in search results.
  • We don't want a lower refresh_interval either, as it would increase the load and computational cost on the index, and the scheduler that inserts data into S3 only runs every 30 minutes.
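To make the timing problem concrete, here is a small sketch of the wait described above. It is a simplified model, not a description of how OpenSearch schedules refreshes internally: it assumes refreshes fire at exact multiples of the interval from an arbitrary time zero, and the function name is made up for this example.

```python
def wait_until_searchable(ingest_done_min: float,
                          refresh_interval_min: float = 30.0) -> float:
    """Minutes between ingestion finishing and the next scheduled refresh.

    Assumption: refreshes happen at t = 0, interval, 2 * interval, ...
    """
    elapsed_in_cycle = ingest_done_min % refresh_interval_min
    if elapsed_in_cycle == 0:
        # Ingestion finished exactly on a refresh boundary.
        return 0.0
    return refresh_interval_min - elapsed_in_cycle
```

Under this model, ingestion that finishes 1 minute after a refresh (for example at minute 31) waits 29 minutes, while ingestion that finishes at minute 59 waits only 1 minute. Since the ETL completion time varies, the wait can land anywhere in that range, which is why an explicit refresh after ingestion would help.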
