
[Reporting] Improve deep pagination method for CSV export #143144

Closed
tsullivan opened this issue Oct 11, 2022 · 3 comments · Fixed by #144201
Labels
bug Fixes for quality problems that affect the customer experience (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead impact:critical This issue should be addressed immediately due to a critical level of impact on the product. loe:large Large Level of Effort

Comments

@tsullivan
Member

tsullivan commented Oct 11, 2022

CSV export uses the scroll API to paginate through all the data for a user's search. This is internally expensive for Elasticsearch, especially when the search spans a large number of shards.

Alternatives:

  1. Async search can be used to search a large amount of data across a large number of shards.
  2. Point-in-time (PIT) can be used to page through search hits when there are more than 10,000, with no limit on the number of shards backing the data. Elasticsearch adds an automatic tiebreaker to the sort results when PIT is used, so search_after will not skip documents.
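The search_after mechanics behind option 2 can be sketched in plain Python, with no cluster needed. Every name here is illustrative, not the actual Kibana or Elasticsearch client code; the point is how a sort tuple with a tiebreaker lets pagination resume without skipping documents:

```python
# Cluster-free sketch of search_after-style pagination.
# Each "document" sorts on (timestamp, tiebreaker); the tiebreaker keeps
# pages from skipping or repeating docs that share a timestamp.

def fetch_page(docs, search_after=None, size=2):
    """Return the next page of docs whose sort tuple is > search_after."""
    ordered = sorted(docs, key=lambda d: (d["ts"], d["id"]))
    if search_after is not None:
        ordered = [d for d in ordered if (d["ts"], d["id"]) > search_after]
    return ordered[:size]

def export_all(docs, size=2):
    """Page through every hit, the way a CSV exporter would."""
    results, after = [], None
    while True:
        page = fetch_page(docs, after, size)
        if not page:
            return results
        results.extend(page)
        last = page[-1]
        after = (last["ts"], last["id"])  # resume point for the next page

docs = [{"ts": 1, "id": "a"}, {"ts": 1, "id": "b"},  # duplicate timestamps
        {"ts": 2, "id": "c"}, {"ts": 3, "id": "d"}]
all_ids = [d["id"] for d in export_all(docs)]
# every document is visited exactly once, in sort order
```

Without the `id` component in the sort tuple, the two documents sharing `ts == 1` would be indistinguishable at a page boundary and one of them could be skipped or duplicated.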

Requirements:

@tsullivan tsullivan added (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead impact:critical This issue should be addressed immediately due to a critical level of impact on the product. Team:AppServicesUx labels Oct 11, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-app-services (Team:AppServicesUx)

@tsullivan tsullivan changed the title [Reporting] Use Async Search for CSV export [Reporting] Improve deep pagination method for CSV export Oct 17, 2022
@tsullivan
Member Author

The goal is to use the Point-in-time API.

  1. When the data is time-based, the user's timestamp field will be used as the sort key for search_after. In case there are duplicate timestamps in the data, we must add another sort field as a tiebreaker. I'll try to use the document _id as the tiebreaker.
  2. When the data is not time-based, the data will be sorted by _id.
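The two cases above could translate into a sort clause along these lines. This is a hypothetical helper for illustration, not the actual Kibana implementation:

```python
# Hypothetical helper sketching the sort clause for the two cases above;
# not the real Kibana code.

def build_sort(timestamp_field=None):
    """Build an Elasticsearch-style sort array for search_after paging."""
    if timestamp_field:
        # Time-based data: sort on the timestamp field, with _id as the
        # tiebreaker so duplicate timestamps cannot cause skipped documents.
        return [{timestamp_field: "asc"}, {"_id": "asc"}]
    # Non-time-based data: _id alone gives a total order.
    return [{"_id": "asc"}]

print(build_sort("@timestamp"))
# [{'@timestamp': 'asc'}, {'_id': 'asc'}]
```

The returned array would be passed as the `sort` option of the search request that pages with `search_after` against the open point-in-time.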

@tsullivan tsullivan added the bug Fixes for quality problems that affect the customer experience label Oct 24, 2022
@exalate-issue-sync exalate-issue-sync bot added the loe:large Large Level of Effort label Oct 26, 2022
@tsullivan
Member Author

related #88303
