2024 Q3-Q4 Roadmap? #11442

Open
alamb opened this issue Jul 12, 2024 · 25 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jul 12, 2024

Is your feature request related to a problem or challenge?

@comphead asked #11426 (comment)

do we have a roadmap for 2024?

Which I think is an excellent question.

In general, since this project isn't really coordinated centrally, the roadmap typically follows what people are working on / want to invest time in

However it is a neat idea to collect any thoughts people have / want to share about what they might work on

Describe the solution you'd like

Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!

Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jul 12, 2024
@alamb
Contributor Author

alamb commented Jul 12, 2024

🤔 I have stuff I would like to do. But that doesn't really count as a roadmap for the project 😆

Here are some things I might guess

  1. Streaming stuff: [DISCUSSION] Support for Streaming in DataFusion #11404 (@ozankabak and @ameyc perhaps)
  2. ASOF / range joins: ASOF join support / Specialize Range Joins #318 (InfluxData might do something here)
  3. Performance improvements (what I would personally like to spend more time on): Enable parquet filter pushdown by default #3463; Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937; Improve performance for grouping by variable length columns (strings) #9403
  4. @XiangpengHao's work on StringView: [Epic] Implement support for StringView in DataFusion #10918 / use StringViewArray when reading String columns from Parquet #10921

Possibilities:

  1. Logical / user-defined types: [draft] Add LogicalType, try to support user-defined types #11160 @notfilippo
  2. Make window functions user defined: [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709
  3. Split DataSource / catalogs from the core: Break datafusion-catalog code into its own crate #11182

@ozankabak
Contributor

ozankabak commented Jul 13, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

@alamb
Contributor Author

alamb commented Jul 14, 2024

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

I agree with this sentiment.

One thing about performance improvements is that I think they take sustained engineering investment and significant existing engine expertise (thus it is hard for newcomers to the project to make significant performance improvements)

I will try to find time from myself and InfluxData over these next two quarters to meaningfully invest in improvements in this area

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Thank you. Your help (and everyone else's help) with documentation and reviews I think would also be tremendously beneficial

@notfilippo
Contributor

However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo) so I need to make some hard choices there

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage.

@lewiszlw
Member

lewiszlw commented Jul 15, 2024

I would like to see more progress on logical type #11160 and index support #9963.

@jayzhan211
Contributor

It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance.

That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects.

Do you have a list in mind of the areas that are worth improving for performance? Some things I know of that are still active in my head:

  • Aggregate Group by
  • Joins
  • Apply StringView
  • Planning
  • Parquet

Anything else?

There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provide valuable pointers for the community.

@alamb
Contributor Author

alamb commented Jul 15, 2024

Do you have a list in mind of the areas that are worth improving for performance? Some things I know of that are still active in my head:

In my mind, here are some "obvious" performance projects, the ones I have the most confidence would make a meaningful difference on ClickBench or TPCH queries (I can maybe put this in the documentation):

Integrate StringView into Parquet / Filtering / Grouping

@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. See #10918 and apache/arrow-rs#5374 for the entire list. Some of my favorites:

What: Use newly added StringView from arrow to improve performance (by avoiding variable length/string data copies)
Why: For queries that deal with string data in ClickBench or TPCH, a large amount of time is spent in parquet decoding as well as filtering and grouping.
What is left: See #10918 and apache/arrow-rs#5374
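
To illustrate why views avoid string copies, here is a minimal Python sketch of the 16-byte view layout from the Arrow format: a 4-byte length, then either up to 12 bytes stored inline or a 4-byte prefix plus a buffer index and offset. The helper names (`make_view`, `read_view`, `filter_views`) are made up for this illustration and are not arrow-rs or DataFusion APIs.

```python
import struct

INLINE_MAX = 12  # strings of up to 12 bytes are stored inside the view itself


def make_view(value: bytes, buffer_index: int, buffer: bytearray) -> bytes:
    """Encode one 16-byte view, appending long values to `buffer`."""
    length = len(value)
    if length <= INLINE_MAX:
        # 4-byte length followed by the bytes themselves, zero padded to 12.
        return struct.pack("<I12s", length, value)
    offset = len(buffer)
    buffer.extend(value)
    # 4-byte length, 4-byte prefix, 4-byte buffer index, 4-byte offset.
    return struct.pack("<I4sII", length, value[:4], buffer_index, offset)


def read_view(view: bytes, buffers: list) -> bytes:
    """Decode a view back into the bytes it refers to."""
    (length,) = struct.unpack_from("<I", view)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sII", view, 4)
    return bytes(buffers[buffer_index][offset:offset + length])


def filter_views(views: list, keep: list) -> list:
    """Filtering selects fixed-size views; the data buffers are not touched."""
    return [view for view, flag in zip(views, keep) if flag]


# Hypothetical example data.
data_buffer = bytearray()
urls = [b"short", b"https://datafusion.apache.org/", b"tiny"]
views = [make_view(u, 0, data_buffer) for u in urls]
kept = filter_views(views, [False, True, True])
print([read_view(v, [data_buffer]) for v in kept])
```

In the sketch, filtering moves only 16-byte views, and the long URL stays where it was in `data_buffer`. A classic offsets-plus-data string array would instead have to copy the surviving string bytes into a new contiguous buffer.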

@alamb
Contributor Author

alamb commented Jul 15, 2024

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion supports
Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me)
What is left: The actual code is straightforward (change a default config value). The hard part is that the last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why, and then what changes are needed to ensure performance doesn't slow down. There are some ideas:
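
As an aside, and not one of the ideas referenced above: a minimal Python sketch of what late materialization means in this context. It counts how many values get decoded when the predicate column is evaluated first and the remaining columns are decoded only for matching rows. The function names and the dict-of-lists "file" are assumptions made for illustration; this is not the parquet reader's API.

```python
def scan_eager(columns, predicate_col, predicate):
    """Decode every column for every row, then filter."""
    decoded = 0
    rows = []
    for i in range(len(columns[predicate_col])):
        row = {name: values[i] for name, values in columns.items()}
        decoded += len(columns)
        if predicate(row[predicate_col]):
            rows.append(row)
    return rows, decoded


def scan_late(columns, predicate_col, predicate):
    """Decode the predicate column first, then only matching rows of the rest."""
    filter_values = columns[predicate_col]
    decoded = len(filter_values)
    selected = [i for i, value in enumerate(filter_values) if predicate(value)]
    rows = []
    for i in selected:
        rows.append({name: values[i] for name, values in columns.items()})
        decoded += len(columns) - 1  # the predicate column is already decoded
    return rows, decoded


# Hypothetical example data.
table = {
    "status": [200, 404, 200, 500],
    "url": ["/a", "/b", "/c", "/d"],
}
eager_rows, eager_decoded = scan_eager(table, "status", lambda s: s >= 400)
late_rows, late_decoded = scan_late(table, "status", lambda s: s >= 400)
print(eager_decoded, late_decoded)  # 8 6
```

The saving grows with the selectivity of the predicate and the number of other columns. When nearly every row matches, both paths decode about the same number of values, and the sketch does not model the extra bookkeeping of the two-phase path, so it says nothing about why some benchmark queries got slower.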

@alamb
Contributor Author

alamb commented Jul 15, 2024

Improve Aggregate performance for multi-column grouping when at least one column is variable length

What: Queries like GROUP BY url, code where url is a string are significantly slower than GROUP BY url. We already have the single string column case handled with #7064
Why: There are several queries like this in ClickBench where copying string data to form group keys consumes significant time
What is left: @jayzhan211 has already shown that the basic idea of #9430 works in #10976. What is left is to figure out how to get the types to work out in the plans and ensure it doesn't cause performance regressions
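
To make the copying cost concrete, here is a rough Python sketch. It compares a grouping loop that builds a byte-string key per input row with one that replaces each distinct string with a small integer id. Both functions are hypothetical illustrations of the general problem, and the second is not necessarily the approach taken in #9430 or #10976.

```python
def group_by_row_keys(urls, codes):
    """Group on (url, code) by building one byte-string key per input row."""
    groups = {}
    string_bytes_copied = 0
    for url, code in zip(urls, codes):
        # Length-prefix the string so distinct (url, code) pairs never collide.
        key = len(url).to_bytes(4, "little") + url + code.to_bytes(4, "little")
        string_bytes_copied += len(url)  # the url is copied for every row
        groups[key] = groups.get(key, 0) + 1
    return groups, string_bytes_copied


def group_by_interned_keys(urls, codes):
    """Group on (url, code) using a small integer id in place of each url."""
    url_ids = {}
    groups = {}
    string_bytes_copied = 0
    for url, code in zip(urls, codes):
        url_id = url_ids.get(url)
        if url_id is None:
            url_id = len(url_ids)
            url_ids[url] = url_id
            string_bytes_copied += len(url)  # stored once, on first sight
        key = (url_id, code)
        groups[key] = groups.get(key, 0) + 1
    return groups, string_bytes_copied


# Hypothetical example data.
urls = [b"a.com/x", b"a.com/x", b"a.com/x", b"b.org"]
codes = [200, 200, 404, 200]
print(group_by_row_keys(urls, codes)[1])       # 26
print(group_by_interned_keys(urls, codes)[1])  # 12
```

Both functions produce the same group counts. The difference is only in how many string bytes are moved while forming keys, which is the cost the "Why" above points at.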

@alamb
Contributor Author

alamb commented Jul 15, 2024

Aggregate performance / memory use for high cardinality aggregates

What: Improve queries when the number of groups is very high (1 million+)
Why: Queries when the number of groups is high are significantly slower than DuckDB and use substantially more memory. I think there is at least a factor of 2 of performance here
What is left: There are ideas on #6937 but someone has to try them out, prototype / see if they would work and then productionize them

@alamb
Contributor Author

alamb commented Jul 15, 2024

Join performance with dynamic join filters / Sideways Information Passing

What: Introduce filters that apply join filtering during the scan, in addition to during the actual join.
Why: Joins in general (and TPCH definitely) end up being very selective (they filter many rows). However join operators are complex and often much slower than filters (this applies to DataFusion too). By pushing much of the filtering work that would be done in the join down into a scan, plans can be made to go much faster
What is left: There has been no work yet on this. The first thing to do would be to prototype some ideas and see if we can make the TPCH queries much faster, and then figure out how to structure the code
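
Since nothing exists yet, the following is only a toy Python sketch of the general idea. The build side of a hash join is read first, a cheap min/max filter on the join key is derived from it, and that filter is handed to the probe-side scan so fewer rows ever reach the join. All names (`scan`, `build_hash_table`, `probe`, `join_with_dynamic_filter`) are hypothetical and do not correspond to DataFusion operators.

```python
def scan(rows, key, key_filter=None):
    """Stand-in for a table scan that can apply an optional filter on `key`."""
    if key_filter is None:
        return list(rows)
    return [row for row in rows if key_filter(row[key])]


def build_hash_table(build_rows, key):
    """Index the build side by join key."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    return table


def probe(table, probe_rows, key):
    """Emit one merged row for every build/probe match."""
    out = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


def join_with_dynamic_filter(build_rows, probe_table, key):
    """Build first, then push a min/max filter on the key into the probe scan."""
    table = build_hash_table(build_rows, key)
    if not table:
        return [], 0
    lo, hi = min(table), max(table)
    probe_rows = scan(probe_table, key, lambda k: lo <= k <= hi)
    return probe(table, probe_rows, key), len(probe_rows)


# Hypothetical example data.
build = [{"id": 5, "name": "a"}, {"id": 7, "name": "b"}]
probe_table = [{"id": i, "qty": i * 10} for i in range(1, 11)]
joined, rows_reaching_join = join_with_dynamic_filter(build, probe_table, "id")
print(len(joined), rows_reaching_join)  # 2 3
```

Here only 3 of the 10 probe rows reach the join operator instead of all 10, and the result is unchanged. A min/max range is just one possible filter; the key set itself or a Bloom filter are other options a prototype could try.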

@alamb
Contributor Author

alamb commented Jul 15, 2024

As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views.

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

@thinkharderdev
Contributor

Complete Parquet Filter Performance

What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion
Why: Influx enables this and it helps with many of our queries. I think Coralogix uses it too (maybe @Dandandan or @thinkharderdev could correct me)
What is left: The actual code is straight forward (change a default config value). The hard part is that last time we ran benchmarks this option actually made some queries slower. So the work is to help debug / profile / figure out why and then what changes are needed to ensure performance doesn't slow down. There are some ideas :

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

@alamb
Contributor Author

alamb commented Jul 15, 2024

Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right.

My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that could make up for any performance that was lost, so that we wouldn't need code to choose.

@notfilippo
Contributor

@notfilippo -- I think LogicalTypes would make it easier to support improved performance for the reasons you mention, but I don't think it would alone improve performance.

I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries.
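
A minimal Python sketch of the separation I have in mind, using made-up classes rather than anything from the draft in #11160: three physical encodings of the same logical string column, and one comparison written against the logical content instead of one code path per encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainStrings:
    values: tuple

    def logical_values(self):
        return list(self.values)


@dataclass(frozen=True)
class DictionaryStrings:
    dictionary: tuple
    keys: tuple  # index into `dictionary` for each row

    def logical_values(self):
        return [self.dictionary[k] for k in self.keys]


@dataclass(frozen=True)
class RunEndStrings:
    run_ends: tuple  # exclusive end index of each run
    values: tuple

    def logical_values(self):
        out, start = [], 0
        for end, value in zip(self.run_ends, self.values):
            out.extend([value] * (end - start))
            start = end
        return out


def logical_eq(left, right):
    """Compare two columns by logical content, ignoring the encoding."""
    return left.logical_values() == right.logical_values()


# Hypothetical example data: the same column in three encodings.
plain = PlainStrings(("a", "a", "b"))
dictionary = DictionaryStrings(("a", "b"), (0, 0, 1))
run_end = RunEndStrings((2, 3), ("a", "b"))
print(logical_eq(plain, dictionary), logical_eq(dictionary, run_end))
```

A real implementation would not expand every encoding to plain values, since that defeats the point of the encodings. The sketch only shows where the logical/physical boundary would sit.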

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@ozankabak
Contributor

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

Also we can finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning.

@ozankabak
Contributor

ozankabak commented Jul 15, 2024

That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold?

@notfilippo, I think we should complete the exploratory work. Even if we don't get to full-focus on it, this is something we likely want to do at some point (gradually or otherwise).

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

@alamb
Contributor Author

alamb commented Jul 15, 2024

I think further improvements to planning code structure will also help w.r.t. performance; we used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do.

I think we are at a much better place now for logical planning -- I don't think we have done anything similar for the various physical optimizer passes for ExecutionPlan, and I suspect there is quite a bit of improvement to be had there

@alamb
Contributor Author

alamb commented Jul 15, 2024

For example, it was through your draft that I started to think about what would happen to ScalarValue in such a design. It is widely used in physical-level machinery, so contemplating a change to logical types for that object could be problematic. Maybe we should have a logical scalar value and a physical one, maybe we will think of something else. But all this thinking was induced by your exploratory work, so I think we should complete it. Then we can decide what to do with the full findings.

I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise with the current code to help it along. @notfilippo is doing a great job but unless we find them help / support I think the project would struggle to be successful

@Abdullahsab3
Contributor

Abdullahsab3 commented Sep 15, 2024

If I may chime in with my 2 cents. I have been following Datafusion and its development over the last couple of months and I think Datafusion is a really unique project in the DBMS space and fits the "Future is extensible" vision greatly. I think especially in the RND world (industrial and academic), Datafusion makes research easier and more interesting, since you're starting from an already-present foundation and extending it/modifying it as you need. I think that Datafusion could probably benefit greatly from more academic collaborations? I'd imagine that a lot of the performance optimisations, but also other kinds of projects, would make a great Master's thesis or research paper in the DBMS world. Not sure how you can attract more people to these kinds of projects, but it's just an idea I wanted to share.

I think though that Datafusion could benefit from more visibility and presence. The SIGMOD conference and all the other conferences in which Datafusion was presented were great to promote it among different audiences, ranging from industry to academics and open source developers. It would be great to see Datafusion in other conferences and journals as well (e.g. FOSDEM, Foundations and Trends in Databases, etc). I think especially in Europe, Datafusion is still not as well-known, and if more DBMS people were to know about it, it would be beneficial to the future of the project.

It might also be interesting if projects that are built on top of Datafusion could also present and explain how they used Datafusion to build their project and what the advantages were of using Datafusion.

@alamb
Contributor Author

alamb commented Sep 16, 2024

I think especially in the RND world (industrial and academic), Datafusion makes research easier and more interesting, since you're starting from an already-present foundation and extending it/modifying it as you need. I think that Datafusion could probably benefit greatly from more academic collaborations? I'd imagine that a lot of the performance optimisations, but also other kinds of projects, would make a great Master's thesis or research paper in the DBMS world.

I agree entirely @Abdullahsab3 -- thank you. In fact I believe it is exactly the plan of @XiangpengHao to do so. Perhaps he has some insights about how to make it more appealing to researchers

I also think Andy Pavlo's Advanced Database Course was an early adopter and tried to make projects based on DataFusion in Spring 2024: https://15721.courses.cs.cmu.edu/spring2024/project.html . I didn't hear much about how this actually went or what we could do to make it easier next time.

It might also be interesting if projects that are built on top of Datafusion could also present and explain how they used Datafusion to build their project and what the advantages were of using Datafusion.

100% agree. This was the topic of many of the DataFusion San Francisco meetup talks recently, and I spoke about it in this talk:

I am particularly excited about the CMU database series, which this spring promises to be full of such explanations (the majority of those systems use DataFusion in some way): https://db.cs.cmu.edu/seminar2024/

@alamb
Contributor Author

alamb commented Sep 16, 2024

I think especially in Europe, Datafusion is still not as well-known, and if more DBMS people were to know about it, it would be beneficial to the future of the project.

I also 100% agree. I am actually speaking at the first DataFusion meetup in Europe next week:

Any chance anyone on this issue (or @Abdullahsab3 ) wants to help organize another European meetup? Perhaps at CIDR 2025: https://www.cidrdb.org/cidr2025/ 🤔

@findepi
Member

findepi commented Sep 17, 2024

I am actually speaking at the first DataFusion meetup in Europe next week:

✌️ me too

@Abdullahsab3
Contributor

Abdullahsab3 commented Sep 19, 2024

Any chance anyone on this issue (or @Abdullahsab3 ) wants to help organize another European meetup?

Yes! My team and I would be interested in helping organize a meetup. We don't use Datafusion directly but we use it through InfluxDB -- it might be an interesting POV/angle to see how technical customers of a product built on top of Datafusion can (ab)use and benefit from Datafusion :D and also how using something like Datafusion, which can sound scary to use for building a product/doing RND, can actually have great benefits.
A lot of the InfluxDB issues that we found and tackled with Datafusion were actually guided and/or solved by people from outside InfluxDB who are also working with Datafusion. It reminds me of the multi-vendor participation that Andrew often mentions in his talks.

I will try to solidify something with my team this week/next week regarding a Western Europe meetup and come back with a proposal :)

@alamb
Contributor Author

alamb commented Sep 19, 2024

One possibility might be to try and arrange something colocated with CIDR in Amsterdam https://www.cidrdb.org/cidr2025/ -- there might be many people in town already that could be interested


8 participants