2024 Q3-Q4 Roadmap? #11442

alamb · 2024-07-12T19:20:38Z

Is your feature request related to a problem or challenge?

do we have a roadmap for 2024?

Which I think is an excellent question.

In general since this project isn't really coordinated centrally the roadmap typically follows what people are working on / want to invest time in

However it is a neat idea to collect any thoughts people have / want to share about what they might work on

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2024-07-12T19:25:55Z

🤔 I have stuff I would like to do. But that doesn't really count as a roadmap for the project 😆

Here are some things I might guess

streaming stuff [DISCUSSION] Support for Streaming in DataFusion #11404 (@ozankabak and @ameyc perhaps)
ASOF / range joins ASOF join support / Specialize Range Joins #318 (InfluxData might do something here)
Performance improvements (what I hope to would personally like to spend more time on): Enable parquet filter pushdown by default #3463 Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 Improve performance for grouping by variable length columns (strings) #9403
@XiangpengHao 's work on StringView: [Epic] Implement support for StringView in DataFusion #10918 / use StringViewArray when reading String columns from Parquet #10921

Possibilities:

Logical / user dfined Types: [draft] Add LogicalType, try to support user-defined types #11160 @notfilippo
Make window functions user defined [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunction s) #8709
Split DataSource / catalogs from the core: Break datafusion-catalog code into its own crate #11182

ozankabak · 2024-07-13T09:20:20Z

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

alamb · 2024-07-14T15:29:13Z

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by rapidly increasing number of projects), but the situation could be much better wrt performance.

I agree with this sentiment.

Something about performance improvements is I think they take sustained engineering investment and significant existing engine expertise (thus it is hard to have newcomers to the project make singificant performance improvements)

I will try and find time from myself and InfluxData these next two quarters to meaninfully invest in improvements in this area

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Thank you. Your help (and everyone else's help) with documentation and reviews I think would also be tremendously beneficial

notfilippo · 2024-07-14T17:55:14Z

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage.

lewiszlw · 2024-07-15T03:45:17Z

I would like to see more progress on logical type #11160 and index support #9963.

jayzhan211 · 2024-07-15T06:27:02Z

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Do you have a list in mind the area that is worth for performance improvement? Somethings I known that are still active in my head

Aggregate Group by
Joins
Apply StringView
Planning
Parquet

Anything else?

There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provider valuable pointer for the community

alamb · 2024-07-15T10:18:29Z

Do you have a list in mind the area that is worth for performance improvement? Somethings I known that are still active in my head

In my mind, here are somre "obvious" performance projects (the ones I have the most confidence that would make a meaningful difference on ClickBench or TPCH queries) are as follows (I can maybe put this in the documentation)

Integrate StringView into Parquet / Filtering / Grouping

[Epic] Implement support for StringView in DataFusion #10918

@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. ee #10918 and apache/arrow-rs#5374 have the entire list. Some of my favorites:

What: Use newly added StringView from arrow to improve performance (by avoiding variable length/string data copies)
Why: This For queries that deal with string data in ClickBench or TPCH a large amount of time is spent in parquet decoding as well as filtering and grouping.
What is left: See #10918 and apache/arrow-rs#5374

alamb · 2024-07-15T10:19:42Z

Complete Parquet Filter Performance

Enable parquet filter pushdown by default #3463

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion
Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me)
What is left: The actual code is straight forward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas :

alamb · 2024-07-15T10:21:55Z

Improve Aggregate performance for multi-column grouping when at least one column is variable length

Improve performance for grouping by variable length columns (strings) #9403

What: Queries like GROUP BY url, code where url is a string are significantly slower than GROUP BY url. We already have the single string column case handled with #7064
Why: There are several queries like this in ClickBench where copying string data to form group keys consumes significant time
What is left: @jayzhan211 already has shown the basic idea of #9430 works in #10976. What is left is to figure out how to get the types to work out in the plans and ensure it doesn't cause performance regressions

alamb · 2024-07-15T10:24:59Z

Aggregate performance / memory use for high cardinality aggregates

Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937

What: Improve Queries when the number of groups is very high (1 million+)
Why: Queries when the number of groups is high are significantly slower than DuckDB and use substantially more memory. I think there is at least a factor of 2 of performance here
What is left: There are ideas on #6937 but someone has to try them out, prototype / see if they would work and then productionize them

alamb · 2024-07-15T10:29:52Z

Join performance with dynamic join filters / Sideways Information Passing

Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) #7955

What: Introduce filters apply join filtering during the Scan in addition to during the actual join.
Why: Joins in general (and TPCH definitely) end up being very selective (they filter many rows). However join operators are complex and often much slower than filters (this applies to DataFusion too). By pushing much of the filtering work that would be done in the join down into a scan, plans can be made to go much faster
What is left: There has been no work yet on this. The first thing to do would be to prototype some ideas and see if we can make the TPCH queries much faster, and then figure out how to structure the code

alamb · 2024-07-15T10:31:46Z

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

thinkharderdev · 2024-07-15T10:40:12Z

Complete Parquet Filter Performance

Enable parquet filter pushdown by default #3463

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me) What is left: The actual code is straight forward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas :

Adaptive Parquet Predicate Pushdown arrow-rs#5523

Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled #4028)

Yeah, we use it as well. We have some custom code to decide when to push down predicates and in general its a pretty tricky thing to get right.

alamb · 2024-07-15T10:50:02Z

Yeah, we use it as well. We have some custom code to decide when to push down predicates and in general its a pretty tricky thing to get right.

My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that could make back up any performance that was lost so that we didn't have to have code to choose.

notfilippo · 2024-07-15T11:05:15Z

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries.

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

ozankabak · 2024-07-15T11:17:58Z

I think further improvements to planning code structure will also help w.r.t. performance, we used to spend a lot of time in planning phase due to avoidable issues (cloning etc.). We are at a better state now, but still have more work to do.

Also we can finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning.

ozankabak · 2024-07-15T11:22:57Z

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@notfilippo, I think we should complete the exploratory work. Even if we don't get to full-focus on it, this is something we likely want to do at some point (gradually or otherwise).

For example, it was through your draft I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, maybe we will think of something else. But all these thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

alamb · 2024-07-15T11:23:52Z

I think further improvements to planning code structure will also help w.r.t. performance, we used to spend a lot of time in planning phase due to avoidable issues (cloning etc.). We are at a better state now, but still have more work to do.

I think we are at a much better place now for LogicalPlaning -- I don't think we have done anything similar for the various physical optimizer passes for ExecutionPlan and I suspect there is quite a bit of improvement to be had there

alamb · 2024-07-15T11:25:50Z

For example, it was through your draft I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, maybe we will think of something else. But all these thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise with the current code to help it along. @notfilippo is doing a great job but unless we find them help / support I think the project would struggle to be successful

Abdullahsab3 · 2024-09-15T11:57:34Z

If I may chime in with my 2 cents. I have been following Datafusion and its development over the last couple of months and I think Datafusion is a really unique project in the DMBS space and fits the "Future is extensible" vision greatly. I think especially in the RND world (industrial and academic), Datafusion makes research easier and more interesting, since you're starting from a already-present foundation and extending it/modifying it as you need. I think that Datafusion could probably benefit greatly from more academic collaborations? I'd imagine that a lot of the performance optimisations, but also other kinds of projects, would make a great Master thesis or research paper in the DBMS world. Not sure how you can attract more people to this kinds of projects, but it's just an idea I wanted to share.

I think though that Datafusion could benefit from more visibility and presence. The sigmod conference and all the other conferences in which Datafusion was presented were great to promote it among different audiences, ranging from industry to academics and open source developers. It would be great to see Datafusion in other conferences and journals as well (e.g. fosdem, Foundations and Trends in Databases, etc). I think especially in Europe, Datafusion is still not as well-known, and if more DBMS people were to know about it, it would be beneficial to the future of the project.

It might also be interesting if projects that are built on top of Datafusion could also present and explain how they used Datafusion to build their project and what the advantages were of using Datafusion.

alamb · 2024-09-16T17:20:46Z

I think especially in the RND world (industrial and academic), Datafusion makes research easier and more interesting, since you're starting from a already-present foundation and extending it/modifying it as you need. I think that Datafusion could probably benefit greatly from more academic collaborations? I'd imagine that a lot of the performance optimisations, but also other kinds of projects, would make a great Master thesis or research paper in the DBMS world.

I agree entirely @Abdullahsab3 -- thank you. In fact I believe it is exactly the plan of @XiangpengHao to do so. Perhaps he has some insights about how to make it more appealing to researchers

I also think Andy Pavlo's Advanced Database Course was an early adopter and tried to make projects based on DataFusion Spring 2024: https://15721.courses.cs.cmu.edu/spring2024/project.html . I didn't hear much about how this actually went or what we could do to make it easier next time.

It might also be interesting if projects that are built on top of Datafusion could also present and explain how they used Datafusion to build their project and what the advantages were of using Datafusion.

100% agree. This was the topic of many of the DataFusion San Franciso meetup talks recently, and I spoked about it in this talk:

DataCouncil 2024: Building InfluxDB 3.0 with Apache Arrow, DataFusion, Flight and Parquet. slides, recording,

I am particularly excited about the CMU database series this spring promises to be full of such explanations (the majority of those systems use DataFusion in some way) : https://db.cs.cmu.edu/seminar2024/

alamb · 2024-09-16T17:22:40Z

I think especially in Europe, Datafusion is still not as well-known, and if more DBMS people were to know about it, it would be beneficial to the future of the project.

I also 10% agree. I am actually speaking at the first DataFusion meetup in Europe next week:

DataFusion Belgrade Meetup 2024/09/27 #11431 , organized by @gruuya ❤️

Any chance anyone on this issue (or @Abdullahsab3 ) wants to help organize another European meetup? Perhaps as 2025 CIDR: https://www.cidrdb.org/cidr2025/ 🤔

findepi · 2024-09-17T14:56:59Z

I am actually speaking at the first DataFusion meetup in Europe next week:

DataFusion Belgrade Meetup 2024/09/27 #11431 , organized by @gruuya

✌️ me too

Abdullahsab3 · 2024-09-19T16:11:34Z

Any chance anyone on this issue (or @Abdullahsab3 ) wants to help organize another European meetup?

Yes! My team and I would be interested in helping organize a meetup. We don't use Datafusion directly but we use it through influxdb -- might be an interesting POV/angle to see how technical customers of a product built on top of Datafusion can (ab)use and benefit from Datafusion :D and also how using something like Datafusion, which can sound scary to use for building a product/doing RND, can actually have great benefits.
A lot of the influxdb issues that we found and tackled with Datafusion were actually guided and/or solved by people from outside influxDB that are also working with Datafusion. It reminds me of the multi-vendor participation that Andrew often mentions in his talks.

I will try to solidify something with my team this week/next week regarding a Western Europe meetup and come back with a proposal :)

alamb · 2024-09-19T19:38:23Z

One possibility might be to try and arrange something colocated with CIDR in Amsterdam https://www.cidrdb.org/cidr2025/ -- there might be many people in town already that could be interested

alamb added the enhancement New feature or request label Jul 12, 2024

alamb mentioned this issue Jul 12, 2024

Combine the Roadmap / Quarterly Roadmap sections #11426

Merged

This was referenced Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 #11474

Closed

DataFusion weekly project plan (Andrew Lamb) - July 22, 2024 #11601

Closed

alamb mentioned this issue Jul 29, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 #11710

Closed

8 tasks

jayzhan211 pinned this issue Aug 29, 2024

jayzhan211 mentioned this issue Sep 6, 2024

[DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357

Open

alamb mentioned this issue Sep 9, 2024

September 2024 ASF Board Report #10156

Closed

alamb mentioned this issue Oct 8, 2024

[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

Open

alamb unpinned this issue Oct 16, 2024

This was referenced Oct 16, 2024

Oct 16, 2024: This week in DataFusion #12973

Closed

Oct 21, 2024: This week in DataFusion #13035

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2024 Q3-Q4 Roadmap? #11442

2024 Q3-Q4 Roadmap? #11442

alamb commented Jul 12, 2024 •

edited

Loading

alamb commented Jul 12, 2024

ozankabak commented Jul 13, 2024 •

edited

Loading

alamb commented Jul 14, 2024

notfilippo commented Jul 14, 2024

lewiszlw commented Jul 15, 2024 •

edited

Loading

jayzhan211 commented Jul 15, 2024

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

thinkharderdev commented Jul 15, 2024

Complete Parquet Filter Performance

alamb commented Jul 15, 2024 •

edited

Loading

notfilippo commented Jul 15, 2024

ozankabak commented Jul 15, 2024

ozankabak commented Jul 15, 2024 •

edited

Loading

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

Abdullahsab3 commented Sep 15, 2024 •

edited

Loading

alamb commented Sep 16, 2024

alamb commented Sep 16, 2024 •

edited

Loading

findepi commented Sep 17, 2024

Abdullahsab3 commented Sep 19, 2024 •

edited

Loading

alamb commented Sep 19, 2024

2024 Q3-Q4 Roadmap? #11442

2024 Q3-Q4 Roadmap? #11442

Comments

alamb commented Jul 12, 2024 • edited Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Jul 12, 2024

ozankabak commented Jul 13, 2024 • edited Loading

alamb commented Jul 14, 2024

notfilippo commented Jul 14, 2024

lewiszlw commented Jul 15, 2024 • edited Loading

jayzhan211 commented Jul 15, 2024

alamb commented Jul 15, 2024

Integrate StringView into Parquet / Filtering / Grouping

alamb commented Jul 15, 2024

Complete Parquet Filter Performance

alamb commented Jul 15, 2024

Improve Aggregate performance for multi-column grouping when at least one column is variable length

alamb commented Jul 15, 2024

Aggregate performance / memory use for high cardinality aggregates

alamb commented Jul 15, 2024

Join performance with dynamic join filters / Sideways Information Passing

alamb commented Jul 15, 2024

thinkharderdev commented Jul 15, 2024

Complete Parquet Filter Performance

alamb commented Jul 15, 2024 • edited Loading

notfilippo commented Jul 15, 2024

ozankabak commented Jul 15, 2024

ozankabak commented Jul 15, 2024 • edited Loading

alamb commented Jul 15, 2024

alamb commented Jul 15, 2024

Abdullahsab3 commented Sep 15, 2024 • edited Loading

alamb commented Sep 16, 2024

alamb commented Sep 16, 2024 • edited Loading

findepi commented Sep 17, 2024

Abdullahsab3 commented Sep 19, 2024 • edited Loading

alamb commented Sep 19, 2024

alamb commented Jul 12, 2024 •

edited

Loading

ozankabak commented Jul 13, 2024 •

edited

Loading

lewiszlw commented Jul 15, 2024 •

edited

Loading

alamb commented Jul 15, 2024 •

edited

Loading

ozankabak commented Jul 15, 2024 •

edited

Loading

Abdullahsab3 commented Sep 15, 2024 •

edited

Loading

alamb commented Sep 16, 2024 •

edited

Loading

Abdullahsab3 commented Sep 19, 2024 •

edited

Loading