[Research] Data format improvements for charting (arrow) #175695
Pinging @elastic/kibana-visualizations (Team:Visualizations)
@nik9000 was working during their ON week on exposing the ESQL results in arrow format. I think it is awesome to continue investigations on this front. It can make our visualizations much more performant and dense!
Agree, both in terms of performance and complexity.
Linked to #178471
Adding (basic) arrow support to expressions: #183909. This showcases that it is not very hard to convert from arrow to a datatable and vice versa, which would allow us to gradually migrate our code to the new format.
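To make the arrow-to-datatable direction concrete, here is a minimal sketch of the layout conversion. The types and function names below are illustrative only; they are not the actual Kibana `Datatable` or apache-arrow APIs.

```typescript
// Sketch: converting between a columnar (Arrow-like) layout and a
// row-oriented datatable. Types and names are illustrative, not Kibana's.
interface ColumnarTable {
  names: string[];
  columns: unknown[][]; // one value array per column
}

interface RowTable {
  columns: { id: string }[];
  rows: Record<string, unknown>[];
}

function columnarToRows(t: ColumnarTable): RowTable {
  const nRows = t.columns[0]?.length ?? 0;
  const rows: Record<string, unknown>[] = [];
  for (let i = 0; i < nRows; i++) {
    const row: Record<string, unknown> = {};
    t.names.forEach((name, c) => {
      row[name] = t.columns[c][i];
    });
    rows.push(row);
  }
  return { columns: t.names.map((id) => ({ id })), rows };
}

function rowsToColumnar(t: RowTable): ColumnarTable {
  const names = t.columns.map((c) => c.id);
  return { names, columns: names.map((n) => t.rows.map((r) => r[n])) };
}

// Hypothetical sample data:
const columnar: ColumnarTable = {
  names: ['time', 'bytes'],
  columns: [[1, 2, 3], [100, 200, 300]],
};
const rowTable = columnarToRows(columnar);
```

The round trip is lossless for plain values, which is what would let the migration happen gradually, one consumer at a time.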
Thanks @ppisljar for #175695 (comment). This was super useful. Below is a follow-up from an offline convo with @markov00 and @ppisljar. Apart from this initial look into arrow, there are a few more open questions. I think it might also be useful to recap some of the underlying reasons for this research, for wider visibility.

1. We should build up our knowledge of arrow because of its strategic value in contemporary tech stacks

Arrow has strategic value because it is among the main data-interchange formats for interprocess data analytics (e.g. ML with pandas in Python), GPU-based charting (e.g. dense scatterplots), and client-side analytics in a web context (e.g. duckdb-wasm https://duckdb.org/docs/api/wasm/overview.html). For that reason alone, it is important to gain a better understanding of this format. From @ppisljar's initial investigation (#183909), the short-term take-away seems to be that a "backend swap" of JSON for Arrow may not be hard technically, but it would not be the right choice in the short term. Kibana has very little investment in GPU technology today (except flamegraph and maps), and introducing a new model of client-side analytics (e.g. one which runs in WASM with duckdb) is not directly on the horizon either. imho it is OK to postpone further investigation into these long-term topics. We can always pick those aspects up once they become more tactically relevant (e.g. when scatterplots are prioritized). What is not answered though is whether:
2. JSON vs binary. Is there any low-hanging fruit for saving space in the size of data transmitted over the wire?

Arrow is just one example of a binary format. Other examples could be CBOR or Smile, which are supported by Elasticsearch. Any gains we can make in the transfer format can be meaningful, especially since it would get our stack closer to delivering data in a streaming fashion: i.e. an Elasticsearch response stream should just be streamed back as-is to the browser, without further modification, especially if that modification is redundant. It seems there is some additional processing in Kibana server (specifically for async searches (?)), which would prevent us from doing this. Whether the Elasticsearch-js client supports formats other than JSON is imo less relevant. Users can always unpack the data manually by using the as-stream option (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/as_stream_examples.html). If we can demonstrate value, we can always push this support down into the ES client as well. So for this, I think we are missing answers to:
3. column vs row layout ("visualization friendly" format)

This is more of an orthogonal issue to binary/JSON. This is about data layout. It would need to be investigated in the context of expressions and elastic/charts. I believe this is already the default for ES|QL (?).

4. ES|QL versus DSL

Questions (1) and (3) above only have relevance for ES|QL.
(2) applies to both, and imo is therefore important.

5. The "big picture" - thinning kibana server

The big picture for 1, 2, 3, and 4 is that we should aim to remove as much intermediate, redundant processing of data as possible, especially on Kibana server. Processing in the browser is only felt by one particular user, while load on Kibana server affects all users. Does Lens really need enrichment of data in Kibana server? e.g.: a performant data-viz architecture

6. Which team does this belong to?

@markov00 raised whether @elastic/kibana-visualizations should own this research. imho, yes, but with an asterisk. Yes, because visualizations are the main consumer of Elasticsearch agg responses, and we would expect changes to be motivated by reducing the time it takes to render data on screen in a chart. The asterisk is that if there are any resource constraints, we can always see if we can distribute these investigations more broadly (e.g. @elastic/kibana-data-discovery, @davismcphee @kertal @lukasolson). So to recap, I see the following open questions:
http://zderadicka.eu/comparison-of-json-like-serializations-json-vs-ubjson-vs-messagepack-vs-cbor/
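The size question in the comparison linked above can be sketched for the simplest case, a single numeric column: the same values as JSON text vs. packed 64-bit floats (the kind of layout Arrow and other binary formats use). The sample values below are made up.

```typescript
// Rough size sketch for one numeric column, JSON text vs raw 64-bit floats.
// The values are hypothetical sample data.
const values = Array.from({ length: 1000 }, (_, i) => i * 0.1234567890123);
const jsonBytes = new TextEncoder().encode(JSON.stringify(values)).length;
const binaryBytes = Float64Array.from(values).byteLength; // 8 bytes per value
// jsonBytes comes out roughly double binaryBytes for values like these;
// gzip changes the picture, so measure with compression before concluding.
```

This is only suggestive: real responses mix strings, keys, and nesting, and (as noted later in this thread) compression can erase or even invert the raw-size advantage.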
The most performant way would be to request data from ES in CBOR and pass it through the Kibana server without any parsing (or minimal parsing) straight to the client. So this is the key question:
If we can make it such that the ES CBOR response is passed through directly to the client side, we will save on request/response copying, UTF-8 decoding, JSON decoding, JSON encoding, and UTF-8 encoding; plus all the memory savings if we don't need to hold those intermediate representations.
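To illustrate what the pass-through would save: when Kibana server does not modify the payload, today's decode/re-encode cycle produces bytes identical to what came in, so the whole transformation is redundant. The payload below is a made-up stand-in for an Elasticsearch response body.

```typescript
// Made-up stand-in for an Elasticsearch response body.
const payload = new TextEncoder().encode(
  JSON.stringify({ took: 3, hits: { total: 2 } })
);

// Wasteful path: UTF-8 decode -> JSON parse -> JSON stringify -> UTF-8 encode.
const reEncoded = new TextEncoder().encode(
  JSON.stringify(JSON.parse(new TextDecoder().decode(payload)))
);

// Pass-through path: forward the original bytes untouched.
const passedThrough = payload;
```

When the output is byte-identical to the input, every step in the wasteful path is pure CPU and heap cost on Kibana server; the pass-through path is a pipe.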
Related: #170062
thx @ppisljar - if arrow is larger gzipped, I think it's another argument against arrow being a pathway for a tactical improvement. @vadimkibana agreed. The key part of these investigations is whether we can slim down the data pipeline from Elasticsearch all the way to the browser. Reduction in the size of the data format (faster delivery, cheaper too), fewer wasted cycles of encoding/decoding (faster), and removal of redundant enrichment (wasted processing) are all pathways to get there. Any footprint on Kibana server is particularly bad because it is felt by all users, and any impact from processing doesn't scale favorably due to single-threaded execution (e.g. by delaying other requests, and this compounds).
Let's consider this done. tl;dr:
After the ping from @thomasneirynck in elastic/elasticsearch#109576 (comment) I took a closer look at the experiments done by @ppisljar in PR #193803. In particular I was surprised by the full copy of the Arrow dataframe into a new array, which obviously isn't ideal, so I went digging 😉

First of all, the Arrow

Running some benchmarks showed that using

Still, I was surprised by the amount of heap memory used by this method, so I looked further and found that we could even eliminate this allocation (see PR apache/arrow#44247). With this change, memory usage for data tables is reduced by a factor of 4.5.

The benchmarks also show reduced performance caused by the indirection layer added by proxying the dataframe. Whether it is acceptable has to be evaluated. But here again, digging into the code showed that it could be improved significantly. The fact that
The reason why

But the main thing keeping us from starting to use the library is not the performance reduction (our actual table sizes at the moment are way smaller than what I was testing in my PR) but the fact that it's using unsafe eval. I haven't looked into how hard it would be to address that in the library, as I was using it as a black box.
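The "indirection layer added by proxying the dataframe" mentioned above can be sketched generically. This is only an illustration of the approach and its trade-off, not apache-arrow's actual implementation.

```typescript
// Generic sketch of the indirection-layer idea: a Proxy exposing row-shaped
// access over columnar storage without allocating one object per row.
// Illustration only; not apache-arrow's implementation.
function rowView(
  names: string[],
  columns: unknown[][],
  i: number
): Record<string, unknown> {
  return new Proxy({} as Record<string, unknown>, {
    get(_target, prop) {
      const c = names.indexOf(String(prop));
      return c >= 0 ? columns[c][i] : undefined;
    },
  });
}

const row = rowView(['host', 'cpu'], [['a', 'b'], [0.5, 0.9]], 1);
// Every property read now pays a trap-dispatch plus indexOf cost: that is
// the performance-vs-memory trade-off discussed above.
```

The memory win comes from never materializing the row objects; the benchmark cost comes from every field access going through the proxy trap.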
Currently, the majority of charts use the default JSON output from Elasticsearch. These responses by default have a row-like (in the case of ES|QL or doc search) or nested (in the case of aggs) layout.
Internally, Kibana will reformat these into something more usable, e.g. a format understood by elastic/charts, nested-array tables for easier ergonomics, etc.
These client-side reformattings introduce an overhead.
Is it possible to have a more efficient pipeline, either by reducing network traffic, reducing reconversions, or both?
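As an example of the client-side reformatting in question, here is a sketch of flattening a nested aggs response into a flat table. The response shape is heavily simplified and the field names are made up.

```typescript
// Simplified, made-up stand-in for a nested terms-agg response.
const aggsResponse = {
  aggregations: {
    by_host: {
      buckets: [
        { key: 'a', doc_count: 3, avg_cpu: { value: 0.5 } },
        { key: 'b', doc_count: 1, avg_cpu: { value: 0.9 } },
      ],
    },
  },
};

// Each nested bucket becomes one flat row for charting.
const rows = aggsResponse.aggregations.by_host.buckets.map((b) => ({
  host: b.key,
  count: b.doc_count,
  avg_cpu: b.avg_cpu.value,
}));
```

Each such reshaping walks and reallocates the whole response, which is the overhead this research wants to reduce.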
Goal
Investigate impact of data format on kibana data visualization (specifically, Lens & Dashboard).
Consider both the context of:
_search
_query
(ES|QL)
Consider alternatives: