
[Research] Data format improvements for charting (arrow) #175695

Closed · Tracked by #166211
thomasneirynck opened this issue Jan 26, 2024 · 13 comments
Labels: research · Team:Visualizations (Visualization editors, elastic-charts and infrastructure)

@thomasneirynck commented Jan 26, 2024

Currently, the majority of charts use the default JSON output from Elasticsearch. These responses have, by default, either a row-like layout (in the case of ES|QL or doc-search) or a nested layout (in the case of aggs).

Internally, Kibana reformats these into something more usable, e.g. a format understood by elastic/charts, nested-array tables for easier ergonomics, etc.

These client-side reformattings introduce overhead.

Is it possible to build a more efficient pipeline, either by reducing network traffic, by reducing reconversions, or both?

Goal

Investigate impact of data format on kibana data visualization (specifically, Lens & Dashboard).

Consider both the context of:

  • _search
  • _query (ES|QL)

Consider alternatives:

  • Already supported by Elasticsearch: e.g. binary encodings (CBOR, Smile, ...) or column-based SQL output
  • Not yet supported by Elasticsearch: e.g. Arrow Flight, Parquet
@elasticmachine

Pinging @elastic/kibana-visualizations (Team:Visualizations)
@stratoula

@nik9000 was working during their ON week on exposing ES|QL results in Arrow format. I think it would be great to continue investigations on this front. It could make our visualizations much more performant and dense!

@drewdaemon

> These client-side reformattings introduce overhead.

Agreed, both in terms of performance and complexity.

@markov00 commented Apr 2, 2024

linked to #178471

@teresaalvarezsoler changed the title from [Research] Data format improvements for charting to [Research] Data format improvements for charting (arrow) Apr 30, 2024
@ppisljar commented May 22, 2024

Adding (basic) Arrow support to expressions: #183909

This showcases that it is not very hard to convert from Arrow to a Kibana datatable and vice versa, which would allow us to gradually migrate our code to the new format.
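
For illustration, a minimal sketch of the Arrow-to-datatable direction, assuming the apache-arrow JS package. The Datatable shape here is a simplified stand-in for Kibana's expressions datatable, not the actual code from #183909:

```ts
import { Table, tableFromIPC } from 'apache-arrow';

// Simplified stand-in for Kibana's expressions Datatable shape.
interface Datatable {
  type: 'datatable';
  columns: Array<{ id: string; name: string }>;
  rows: Array<Record<string, unknown>>;
}

function arrowToDatatable(table: Table): Datatable {
  const columns = table.schema.fields.map((f) => ({ id: f.name, name: f.name }));
  const rows: Array<Record<string, unknown>> = [];
  for (let i = 0; i < table.numRows; i++) {
    const row: Record<string, unknown> = {};
    for (const { id } of columns) {
      row[id] = table.getChild(id)?.get(i);
    }
    rows.push(row);
  }
  return { type: 'datatable', columns, rows };
}

// e.g. const datatable = arrowToDatatable(tableFromIPC(arrowBytesFromEs));
```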

@thomasneirynck (Author)

Thanks @ppisljar for #175695 (comment). This was super useful.

Below is a follow-up from an offline conversation with @markov00 and @ppisljar. Apart from this initial look into Arrow, there are a few more open questions. I think it might also be useful to recap some of the underlying reasons for this research, for wider visibility.


1. We should build up our knowledge of Arrow because of its strategic value in contemporary tech stacks

Arrow has strategic value because it is one of the main data-interchange formats for interprocess data analytics (e.g. ML with pandas in Python), GPU-based charting (e.g. dense scatterplots), and client-side analytics in a web context (e.g. duckdb-wasm, https://duckdb.org/docs/api/wasm/overview.html).

For that reason alone, it is important to gain a better understanding of this format.

From @ppisljar's initial investigation (#183909), the short-term takeaway seems to be that a "backend swap" of JSON for Arrow may not be hard technically, but it would not be the right choice in the short term, because of:
(a) poor client support in the browser (e.g. having to use unsafe-eval)
(b) the existing data pipeline in Lens (i.e. "expressions"), which needs to marshal the data into a new table and does some intermediate data enrichment. This requires a full read of the Arrow table, re-marshaling everything to JSON anyway. This conversion is slow.

Kibana has very little investment in GPU technology today (except flamegraphs and maps), and introducing a new model of client-side analytics (e.g. one which runs in WASM with duckdb) is not directly on the horizon either. imho it is OK to postpone further investigation into these long-term topics. We can always pick those aspects up once they become more tactically relevant (e.g. when scatterplots are prioritized).

What is not answered though is whether:

  • arrow has any meaningful space savings in the amount of data sent over the wire. This size comparison would still be useful to get some numbers on (see Apache arrow support for ES|QL elasticsearch#104877 to test wrt ES|QL).
  • there are any blockers on kibana-server which would prevent streaming arrow data to the browser without having to unpack it first

2. JSON vs Binary. Is there any low-hanging fruit for saving space in size of data transmitted over the wire?

Arrow is just one example of a binary format. Other examples could be CBOR or Smile, which are supported by Elasticsearch.

Any gains we can make in the transfer format can be meaningful, especially since it would move our stack closer to delivering data in a streaming fashion: i.e. an Elasticsearch response stream should just be streamed back as-is to the browser, without further modification, especially if that modification is redundant.

It seems there is some additional processing on Kibana server (specifically for async searches?) which would prevent us from doing this.

Whether the Elasticsearch-js client supports formats other than JSON is imo less relevant. Users can always unpack the data manually by using the asStream option (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/as_stream_examples.html); see the sketch below. If we can demonstrate value, we can always push this support down into the ES client as well.
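
As an illustration, a minimal sketch of fetching a response as a raw stream with the JS client. asStream is a real client option; whether Elasticsearch honors Accept: application/cbor for a given endpoint is an assumption worth verifying:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// asStream: true makes the client skip body parsing and return the raw
// response stream, which Kibana server could pipe to the browser unmodified
// (no UTF-8 decode, no JSON parse, no re-encode on the way through).
async function searchAsStream(index: string) {
  return client.search(
    { index, query: { match_all: {} } },
    {
      asStream: true,
      headers: { accept: 'application/cbor' }, // assumption: ES encodes the body as CBOR
    }
  );
}
```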

So for this, I think we are missing answers to:

  • are there size benefits to using a binary format, like CBOR or Smile, over JSON?
  • can the footprint on Kibana server be reduced further? Ideally eliminated entirely in the case of sync search (async search can be a special case)?

3. column vs row layout ("visualization friendly" format)

This is an issue orthogonal to binary vs. JSON; it is about data layout.

This would need to be investigated in the context of expressions and elastic/charts.

I believe this is already the default for ES|QL (?). The sketch below contrasts the two layouts.
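
For illustration, the same tiny table in both layouts (plain TypeScript; the column names are made up):

```ts
// Row layout: one object per document/row. Convenient to consume, but every
// row allocates an object, and reading one field touches scattered memory.
const rowOriented = [
  { timestamp: 1706227200000, bytes: 512 },
  { timestamp: 1706227260000, bytes: 1024 },
];

// Column layout: one (typed) array per field. A chart that renders a whole
// series at once can consume a column directly, with no per-row allocation.
const columnOriented = {
  timestamp: Float64Array.of(1706227200000, 1706227260000),
  bytes: Float64Array.of(512, 1024),
};
```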

4. ES|QL versus DSL

Questions (1) and (3) above are only relevant for ES|QL:

  • arrow is only on the horizon for ES|QL
  • column layouts are only supported by ES|QL (?)

(2) applies to both, and imo is therefore important.

5. The "big picture" - thinning kibana server

The big picture for 1, 2, 3, and 4 is that we should aim to remove as much intermediate, redundant processing of data as possible, especially on Kibana-server. Processing in the browser is only felt by one particular user, while load on Kibana-server affects all users. Does Lens really need enrichment of data in Kibana-server?

e.g.: a performant data-viz architecture
[image: architecture diagram from the original issue]

6. Which team does this belong to?

@markov00 raised whether @elastic/kibana-visualizations should own this research. imho, yes, but with an asterisk.

Yes, because visualizations are the main consumer of Elasticsearch-agg responses, and we would expect changes to be motivated by reducing the time it takes to render data on screen in a chart.

The asterisk is that if there are any resource constraints, we can always see if we can distribute these investigations more broadly (e.g. @elastic/kibana-data-discovery, @davismcphee @kertal @lukasolson)


So to recap, I see the following open questions:

  1. Size comparison of arrow versus current ES|QL (using Apache arrow support for ES|QL elasticsearch#104877 may be helpful here)
  2. What are the size/performance advantages of cbor/smile? Are they supported by ES|QL?
  3. What are the blockers to adopting cbor/smile? Specifically, what is going on in Kibana-server that requires enrichment of the Elasticsearch-agg response? Is it necessary?
  4. Is there anything more that needs to be done wrt column-based layouts?

@ppisljar

  1. Arrow is a binary format, so it will generally be more efficient size-wise than JSON. In some tests I did, I saw around a 30% reduction in size. An important note, however: we are using gzip compression, and after compression the file sizes are mostly the same, or the Arrow format actually becomes bigger.

  2. Haven't tested this yet, but from resources on the internet it looks similar to Arrow: there is a significant reduction if you don't gzip, but after gzipping the reduction is less noticeable.

http://zderadicka.eu/comparison-of-json-like-serializations-json-vs-ubjson-vs-messagepack-vs-cbor/
https://gist.github.com/kajuberdut/0191ec20f14253094792cd3c00f06257
https://medium.com/@ayushguptadtu/gzip-smile-json-gives-a-better-size-reduction-over-smile-uncompressed-for-sure-6c5060a670a5
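
A quick, illustrative way to sanity-check the gzip observation locally (cbor-x is an assumed encoder choice, not one evaluated in this thread):

```ts
import { gzipSync } from 'node:zlib';
import { encode } from 'cbor-x';

// 10k synthetic metric rows, roughly the shape of a flattened agg response.
const payload = Array.from({ length: 10_000 }, (_, i) => ({
  ts: 1706227200000 + i * 60_000,
  bytes: Math.round(Math.random() * 4096),
}));

const json = Buffer.from(JSON.stringify(payload));
const cbor = Buffer.from(encode(payload));

// Compare raw vs gzipped sizes; per the observations above, expect the
// raw-size gap to shrink considerably once gzip is applied.
console.table({
  json: { raw: json.length, gzipped: gzipSync(json).length },
  cbor: { raw: cbor.length, gzipped: gzipSync(cbor).length },
});
```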

@vadimkibana commented May 31, 2024

The most performant way would be to request data from ES in CBOR and pass it through the Kibana server without any parsing (or minimal parsing) straight to the client. So this is the key question:

> Specifically, what is going on in Kibana-server that requires enrichment of the Elasticsearch-agg response? Is it necessary?

If we can make it such that the ES CBOR response is passed through directly to the client side, we will save on request/response copying, UTF-8 decoding, JSON decoding, JSON encoding, and UTF-8 encoding; plus all the memory savings if we don't need to hold those intermediate representations.
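
The complementary browser side of that pass-through might look like the sketch below (cbor-x is an assumed decoder choice, and the URL is whatever Kibana endpoint streams the bytes through):

```ts
import { decode } from 'cbor-x';

// If Kibana server streams the CBOR bytes from ES through unmodified, the
// browser performs the single remaining decode step, replacing the old
// UTF-8 decode + JSON.parse on the client and the re-encode on the server.
async function fetchSearchResults(url: string): Promise<unknown> {
  const resp = await fetch(url, { headers: { accept: 'application/cbor' } });
  const bytes = new Uint8Array(await resp.arrayBuffer());
  return decode(bytes);
}
```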

@lukasolson

> If we can make it such that the ES CBOR response is passed through directly to the client side, we will save on request/response copying, UTF-8 decoding, JSON decoding, JSON encoding, and UTF-8 encoding; plus all the memory savings if we don't need to hold those intermediate representations.

Related: #170062

@thomasneirynck (Author)

thx @ppisljar - if arrow is larger gzipped, I think that's another argument against arrow being a pathway for a tactical improvement.

@vadimkibana agreed. The key part of these investigations is whether we can slim down the data pipeline from Elasticsearch all the way to the browser. Reducing the size of the data format (faster delivery, cheaper too), cutting wasted cycles of encoding/decoding (faster), and removing redundant enrichment (wasted processing) are all pathways to get there. Any footprint on kibana-server is particularly bad because it is felt by all users, and any impact from processing doesn't scale favorably due to single-threaded execution (e.g. by delaying other requests, which compounds).

@thomasneirynck (Author)

Let's consider this done.

tl;dr:

  • cbor can show marginal improvements (due to speeding up browser-side decoding)
  • arrow will not give benefits in the current architecture
  • more impactful improvements will come from moving all unnecessary encoding/decoding of ES responses out of Kibana-server

@swallez commented Sep 30, 2024

After the ping from @thomasneirynck in elastic/elasticsearch#109576 (comment) I took a closer look at the experiments done by @ppisljar in PR #193803.

In particular I was surprised by the full copy of the Arrow dataframe into a new array, which obviously isn't ideal, so I went digging 😉

First of all, the Arrow Table type, which represents the dataframe, has a toArray() method that apparently hasn't been evaluated. It is specifically targeted at applications that process arrays of objects to avoid the refactoring needed to use dataframes directly. It builds an array of proxies to the dataframe vectors that make them look like regular objects.

Running some benchmarks showed that using Table.toArray() reduced memory usage for the data table by a factor of 3!

Still, I was surprised by the amount of heap memory used by this method, so I looked further and found that we could even eliminate this allocation (see PR apache/arrow#44247). With this change, memory usage for data tables is reduced by a factor of 4.5.

The benchmarks also show reduced performance caused by the indirection layer added by proxying the dataframe. Whether that is acceptable has to be evaluated. But here again, digging into the code showed that it could be improved significantly.

The fact that toArray() and its associated code have room for improvement shouldn't be taken as reflecting poor quality in this library, and as shown in this PR, the maintainers are open to improvements. This isn't the primary intended usage of dataframes, and iterating over table columns obtained using table.getChild() shows performance on par with plain object property access with, as shown above, huge memory savings. Arrow also eliminates the need to parse data sent by ES, as the dataframe just wraps the byte buffer received over the network.
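
For context, the two access patterns being compared look roughly like this (apache-arrow JS; the data and sizes are made up, and this is not the benchmark harness used above):

```ts
import { tableFromArrays } from 'apache-arrow';

// A million-row, two-column dataframe built in memory for illustration.
const table = tableFromArrays({
  ts: Float64Array.from({ length: 1_000_000 }, (_, i) => i),
  bytes: Float64Array.from({ length: 1_000_000 }, () => Math.random() * 4096),
});

// Pattern 1: toArray() yields row proxies over the underlying vectors.
// Convenient for row-oriented code, at the cost of proxy indirection.
let total1 = 0;
for (const row of table.toArray()) {
  total1 += row.bytes;
}

// Pattern 2: columnar access via getChild(), reportedly on par with plain
// object property access.
const bytesCol = table.getChild('bytes')!;
let total2 = 0;
for (let i = 0; i < table.numRows; i++) {
  total2 += bytesCol.get(i)!;
}

console.log(total1 === total2); // both patterns read the same values
```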

@ppisljar commented Oct 8, 2024

The reason toArray() was not used in the linked PR is that it produces a generic JS array, which does not match the Kibana datatable structure. The purpose of #183909 was to evaluate conversion between an Arrow table and a Kibana datatable specifically. toArray() produces quite a different structure, and converting that one to a Kibana datatable was not any faster in my tests.

But the main thing keeping us from starting to use the library is not the performance reduction (our actual table sizes at the moment are way smaller than what I was testing in my PR) but the fact that it uses unsafe-eval. I haven't looked into how hard it would be to address that in the library, as I was using it as a black box.
