Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Reporting/CSV Export] issues with high search latency in ES #129524

Closed
llermaly opened this issue Apr 5, 2022 · 21 comments
Closed

[Reporting/CSV Export] issues with high search latency in ES #129524

llermaly opened this issue Apr 5, 2022 · 21 comments
Labels
bug Fixes for quality problems that affect the customer experience (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead feedback_needed impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:medium Medium Level of Effort Team:SharedUX Team label for AppEx-SharedUX (formerly Global Experience)

Comments

@llermaly
Copy link

llermaly commented Apr 5, 2022

Kibana version:

8.1.2

Elasticsearch version:

8.1.2

Server OS version:

Elastic Cloud

Browser version:
Latest Chrome

Browser OS version:

Original install method (e.g. download page, yum, from source, etc.):

Cloud

Describe the bug:

Exporting large saved searches will show a toast saying your download will be ready shortly. And never finishes, no way to see the export progress. I understand this export options are for summarized tables so few seconds after exporting the download should start.

This is not the case with large saved searches where the export will take minutes and may fail. This kind of large searches should be saved as reports in the report section.

Steps to reproduce:

  1. Save a huge search into a Dashboard
  2. Export that search as CSV
  3. Instead of creating a report in background will tell you "your download will be ready soon"

Expected behavior:

Large saved search CSV exports running as a Report for progress tracking

@llermaly llermaly added the bug Fixes for quality problems that affect the customer experience label Apr 5, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 5, 2022
@tsullivan
Copy link
Member

The solution we went with was to stream the output data to the client as the search results are compiled into CSV

Fixed in #123067

@tsullivan
Copy link
Member

Sorry - I just realized that this issue was filed against 8.1, which includes that change I linked to.

@tsullivan tsullivan reopened this Apr 18, 2022
@tsullivan
Copy link
Member

@llermaly Can you provide a HAR file of the dashboard app, so we can see the network requests and what is happening with them? If not, maybe just a screenshot of the network panel of the browser, showing the request to download the CSV?

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-app-services (Team:AppServicesUx)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Apr 18, 2022
@tsullivan tsullivan changed the title Large dashboard CSV exports should convert into reports [Reporting] Download CSV option in the dashboard isn't working Apr 18, 2022
@llermaly
Copy link
Author

Hello @tsullivan , the download times out before finishing:

image

If you wait until the end the red toast mentioning time out appears.

image

Moment of failure

image

@exalate-issue-sync exalate-issue-sync bot added impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:small Small Level of Effort labels Apr 19, 2022
@tsullivan
Copy link
Member

@llermaly a few more questions to understand the line where things go wrong:

  1. The download of a large CSV export times out before finishing. Was partial data downloaded?
  2. How many seconds until the timeout?
  3. If you generate a report from Discover, how long does that take to complete?
  4. How many bytes are these reports?

Thanks very much for the screenshots, these were super helpful.

@llermaly
Copy link
Author

  1. No, error and no download
  2. 40s the time when I measured it. I will try again and report here if it is different
  3. 10 minutes
  4. 750MB is our csv max export so report generated is partially generated sometimes (we have 3GB of data)

Thanks! , will be posting here after some tests.

@tsullivan
Copy link
Member

Takes 40s to download 400MB so it is a fair statement to assume would take only 80s approx.

@llermaly I'm unsure about this. I will have to experiment with test data to try to reproduce the issue.

@llermaly
Copy link
Author

I just removed that message because seems like I was confused.

This is more realistic:

image

4.6min for 400MB

When 750MB reports are taking 8-10mins makes more sense

@llermaly
Copy link
Author

Failed after 5min

image

@llermaly
Copy link
Author

Curl request output:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

@tsullivan
Copy link
Member

Hi @llermaly I started looking into this issue in an Elastic Cloud deployment (v8.1.3). I loaded 450K documents into an index and put a saved search to that index in a dashboard. When I downloaded CSV from there, I received a file that was 10MB by default. I increased the xpack.reporting.csv.maxSizeBytes value in my deployment to 52428800 (52MB), which is the highest allowed in Cloud. I was able to export a file with that size, without an unreasonable period of waiting.

It looks like you are working with an export that is 400MB, but there's a known issue in Elastic Cloud making the largest export available (currently) set to 52MB.

Questions:

  • Is there an instance of Kibana that is running outside of the Cloud deployment?
  • Can you provide the kibana.yml settings for all Kibana instances?

@tsullivan
Copy link
Member

I ran a test with on-prem after setting xpack.reporting.csv.maxSizeBytes: 1gb in my kibana.yml. The browser seemed to be able to download 130MB of the data, and then experienced a crash:
image

By right-clicking the request in the Developer tools and choosing "Copy > Copy as cURL", I could run a command that allowed me to download 1gb of CSV:
image

So there does appear to be a browser-related issue with downloading large CSV files using this feature.

@llermaly
Copy link
Author

llermaly commented May 4, 2022

@tsullivan Curl is also failing in my side:

Curl request output:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

The browser failing is different though, my browser does not show that message, only the request fails.

@exalate-issue-sync exalate-issue-sync bot added loe:medium Medium Level of Effort and removed loe:small Small Level of Effort labels Jun 3, 2022
@tsullivan tsullivan changed the title [Reporting] Download CSV option in the dashboard isn't working [Reporting] Download CSV option in the dashboard doesn't always work Jun 16, 2022
@tsullivan
Copy link
Member

Related: #135274

@tsullivan
Copy link
Member

tsullivan commented Jun 28, 2022

I was able to come up with a test case that breaks the "Download CSV" feature by adding 30 seconds of latency to the initial request for the data over scan/scroll. At that level of latency, increasing the xpack.reporting.csv.scroll.duration won't resolve the issue, because of other timeouts potentially happening:

  • The search() function we call in Node can time out
  • The browser connection to Kibana can time out.

I ran tests to download 5000 documents from a saved search, using a dev instance of Kibana that connected to Elasticsearch through a simple node proxy, which intercepted requests and responses to Elasticsearch, and slowed down the response for Download CSV data.

Dev setup diagram:
scan and scroll throttle

What I found is that Download CSV can handle the latency of Elasticsearch data only when latency is lower than 30 seconds for the first request. When it is 30 seconds or higher, the ES client connection times out and logs an Error: ReportingError(code: unknown_error).

These logs show the search timeout error happening when generating a CSV report in Discover:

[2022-06-28T15:41:39.457-07:00][INFO ][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] Generating request for space: reporting-test-fixtures
[2022-06-28T15:41:39.457-07:00][INFO ][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] Request uses Space ID: reporting-test-fixtures
[2022-06-28T15:41:39.458-07:00][INFO ][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] Creating UI Settings Client for space: reporting-test-fixtures
[2022-06-28T15:41:39.627-07:00][DEBUG][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] setting timezone on customer_birth_date
[2022-06-28T15:41:39.627-07:00][DEBUG][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] setting timezone on order_date
[2022-06-28T15:41:39.627-07:00][DEBUG][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] setting timezone on products.created_on
[2022-06-28T15:41:39.627-07:00][DEBUG][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] executing search request
[2022-06-28T15:43:39.468-07:00][DEBUG][plugins.reporting.events] an error occurred
[2022-06-28T15:43:39.484-07:00][ERROR][plugins.reporting.runTask] Error: ReportingError(code: unknown_error) "Max attempts (1) reached for job l4yr6xb4fyx40763b5bjqidk. Failed with: Timeout has occurred"
    at mapToReportingError (/home/tsullivan/elastic/kibana/x-pack/plugins/reporting/common/errors/map_to_reporting_error.ts:50:10)
    at Object.run (/home/tsullivan/elastic/kibana/x-pack/plugins/reporting/server/lib/tasks/execute_report.ts:433:31)
    at TaskManagerRunner.run (/home/tsullivan/elastic/kibana/x-pack/plugins/task_manager/server/task_running/task_runner.ts:306:22)
[2022-06-28T15:43:39.541-07:00][DEBUG][plugins.reporting.events] report l4yr6xb4fyx40763b5bjqidk has failed
[2022-06-28T15:43:39.541-07:00][DEBUG][plugins.reporting.runTask] Reports running: 0.
[2022-06-28T15:43:39.672-07:00][ERROR][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] Error: Request timed out
    at getKbnServerError (/home/tsullivan/elastic/kibana/src/plugins/kibana_utils/server/report_server_error.ts:28:10)
    at search (/home/tsullivan/elastic/kibana/src/plugins/data/server/search/strategies/es_search/es_search_strategy.ts:55:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
[2022-06-28T15:43:39.673-07:00][WARN ][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] No scrollId to clear!
[2022-06-28T15:43:39.673-07:00][DEBUG][plugins.reporting.runTask.csv_searchsource.execute-job:l4yr6xb4fyx40763b5bjqidk] Finished generating. Row count: 0.

These logs show the errors when attempting to use "Download CSV" in Dashboard:

[2022-06-28T16:04:06.058-07:00][ERROR][plugins.reporting.csv_searchsource_immediate.execute-job] Error: Request timed out
    at getKbnServerError (/home/tsullivan/elastic/kibana/src/plugins/kibana_utils/server/report_server_error.ts:28:10)
    at search (/home/tsullivan/elastic/kibana/src/plugins/data/server/search/strategies/es_search/es_search_strategy.ts:55:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
[2022-06-28T16:04:06.059-07:00][WARN ][plugins.reporting.csv_searchsource_immediate.execute-job] No scrollId to clear!
[2022-06-28T16:04:06.059-07:00][DEBUG][plugins.reporting.csv_searchsource_immediate.execute-job] Finished generating. Row count: 0.
[2022-06-28T16:04:06.060-07:00][WARN ][plugins.reporting.csv_searchsource_immediate.execute-job] Encountered an unknown error: Request timed out
[2022-06-28T16:04:06.060-07:00][INFO ][plugins.reporting.csv_searchsource_immediate] Job output size: 0 bytes.
[2022-06-28T16:04:06.060-07:00][WARN ][plugins.reporting.csv_searchsource_immediate] CSV Job Execution created empty content result

This server-side client timeout doesn't have a workaround. Theoretically, xpack.reporting.csv.scroll.duration can be increased and make ES keep search contexts open for longer, but if ES sends no data back to the server within 30 seconds for the initial search request, NodeJS times out the socket connection.

When using curl, the download request will fail with curl: (52) Empty reply from server.

In Chrome, the network panel will show the request for csv_searchsource has a stalled connection. After 8-10 minutes (the timing info is not regularly updated in Chrome), the status of the request will be (failed)net::ERR_EMPTY_RESPONSE

If the end-user accesses Kibana through a proxy or in a Cloud instance of Kibana, the error as seen in the client will be different, as it is a response crafted by the proxy. The status of the response may be 502 or 504, instead of "empty response."

The workaround is to address performance and search latency on the Elasticsearch side.

Other issues I discovered during this investigation:

@tsullivan tsullivan changed the title [Reporting] Download CSV option in the dashboard doesn't always work [Reporting] CSV Export breaks with high search latency in ES Jun 28, 2022
@exalate-issue-sync exalate-issue-sync bot added impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. and removed impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. labels Jul 5, 2022
@tsullivan
Copy link
Member

A proper long-term fix for this will be to use asynchronous search for CSV export, so we don't depend on keeping a connection open for a long time in NodeJS.

A decent band-aid will be to add louder warnings in the logs and in the report job itself, telling the user / administrator that the results are incomplete. Similar to #136445

@exalate-issue-sync exalate-issue-sync bot added impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. and removed impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. labels Aug 26, 2022
@tsullivan tsullivan changed the title [Reporting] CSV Export breaks with high search latency in ES [Reporting/CSV Export] issues with high search latency in ES Sep 14, 2022
@tsullivan
Copy link
Member

I'm going to keep this issue open, and put it back on the backlog. Long-running connections don't work well in Cloud environments, regardless of any timeout settings we can try to apply. CSV export needs an async search implementation.

@tsullivan
Copy link
Member

An issue has come up where troubleshooting could have gone much faster if we pass on the shard failure warnings from the search response, and add them to the warnings of the report job. Rather than filing a separate issue, this needs to be a consideration as part of this issue effort to improve CSV export.

@tsullivan
Copy link
Member

Will be closed by #143144

@petrklapka petrklapka added (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead Team:AppServicesSv Team:SharedUX Team label for AppEx-SharedUX (formerly Global Experience) and removed Team:AppServicesUx labels Oct 24, 2022
@tsullivan
Copy link
Member

tsullivan commented Nov 21, 2022

I want to close the loop on this issue. With the investigation that has come from this issue, and the latest changes made from #143144, there are certain things we know can be supported, and certain things that can not.

  1. It's not async search. While implementing [Reporting] use point-in-time for paging search results #144201, I believe we came to the best options available, yet they are not perfect. Our choice to not use async search means that querying a page of data could time out, if it takes longer than 30 seconds: any longer than that, and the connection could be killed by intermediate proxies. This was the best choice, since it allowed us to use point-in-time search for paging, and unrestricted us from supporting large exports.

  2. Generating a CSV report from the Discover application is the best way to export the data. Due to the potential for browsers to time out while waiting for data, using the "Download CSV" action in the saved search panel has a worse surface area for problems when search is slow. The browser un-prioritizes connections that are lagging, and puts in automatic re-tries as well [1]. By the time the user sees any indication that a problem happened, it could be way longer than I expected. On the other hand, if search is slow while generating a CSV export in Discover, we can handle the timeout in the NodeJS side, and attach an appropriate error to the report job. Therefore, I plan to work with @sixstringcode on deprecating the "Download CSV" feature in the Dashboard, and replacing it with an integration to generate a CSV report in that context.

[1] It should be noted that retries can be controlled on our side, but that wasn't part of my research.

I'm closing this issue, as I believe we've done all we can to make CSV export robust and stable. There are known problems where latency in ES can cause problems with CSV export. However, in my research on this, I found that is true of many features that run in the background in the Kibana server and (for now) the best thing to do may be to scale up ES with more memory or nodes, so that search is not slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience (Deprecated) Feature:Reporting Use Reporting:Screenshot, Reporting:CSV, or Reporting:Framework instead feedback_needed impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. loe:medium Medium Level of Effort Team:SharedUX Team label for AppEx-SharedUX (formerly Global Experience)
Projects
None yet
Development

No branches or pull requests

4 participants