2024 Q3-Q4 Roadmap? #11442
Comments
It would be great to have one or two quarters where we focus on perf. I think we are at a pretty good place in terms of extensibility/customizability (evidenced by the rapidly increasing number of projects), but the situation could be much better wrt performance. That being said, team Synnada will keep adding baseline mechanisms to upstream DF to enable streaming use cases (when appropriate and not overfitting, obviously) by downstream projects. |
I agree with this sentiment. One thing about performance improvements is that they take sustained engineering investment and significant existing engine expertise (thus it is hard for newcomers to the project to make significant performance improvements). I will try to find time from myself and InfluxData over these next two quarters to meaningfully invest in improvements in this area. However, I can't realistically do that if I am also helping to shepherd other large projects along (I am thinking specifically of #11160 from @notfilippo), so I need to make some hard choices there.
Thank you. Your help (and everyone else's help) with documentation and reviews I think would also be tremendously beneficial |
As a side thought, I would argue that introducing proper support for logical types would benefit performance, especially in late materialization for REE arrays and string views. That said, I fully agree with focusing on performance, and I would be happy to rescope my proposal to make it easier to manage. |
Do you have a list in mind of the areas that are worth improving for performance? Some things I know of that are still active in my head
Anything else? There's always room for improvement, particularly in terms of performance. Regularly updating the active lists could provide valuable pointers for the community |
In my mind, here are some "obvious" performance projects (the ones I have the most confidence would make a meaningful difference on ClickBench or TPCH queries). I can maybe put this in the documentation.
Integrate StringView into Parquet / Filtering / Grouping
@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks. #10918 and apache/arrow-rs#5374 have the entire list. Some of my favorites:
What: Use newly added |
Complete Parquet Filter Performance
What: Enable the most advanced form of predicate pushdown / late materialization that DataFusion |
Improve Aggregate performance for multi-column grouping when at least one column is variable length
What: Queries like |
Aggregate performance / memory use for high cardinality aggregates
What: Improve queries where the number of groups is very high (1 million+) |
Join performance with dynamic join filters / Sideways Information Passing
What: Introduce filters that apply join filtering during the scan, in addition to during the actual join. |
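For readers unfamiliar with the dynamic join filter / sideways information passing idea in the list above, here is a minimal, std-only sketch of the concept. It is not DataFusion code: `DynamicFilter`, `ProbeRow`, `from_build_keys`, `may_match`, and `scan_with_filter` are hypothetical names invented for illustration, and a real engine would more likely use min/max bounds or a Bloom filter pushed into the Parquet reader than an exact key set.

```rust
use std::collections::HashSet;

/// A row on the probe (large) side of the join: (join key, payload).
/// Hypothetical type for illustration only.
pub type ProbeRow = (i64, &'static str);

/// Hypothetical filter derived from the build (small) side of a hash join
/// and handed "sideways" to the scan of the probe side.
pub struct DynamicFilter {
    min: i64,
    max: i64,
    keys: HashSet<i64>,
}

impl DynamicFilter {
    /// Build the filter from the join keys seen on the build side.
    /// Returns `None` when the build side is empty.
    pub fn from_build_keys(keys: &[i64]) -> Option<Self> {
        let min = *keys.iter().min()?;
        let max = *keys.iter().max()?;
        Some(Self {
            min,
            max,
            keys: keys.iter().copied().collect(),
        })
    }

    /// True if a probe row with this key could find a join partner.
    /// The range check is a cheap early exit before the set lookup.
    pub fn may_match(&self, key: i64) -> bool {
        key >= self.min && key <= self.max && self.keys.contains(&key)
    }
}

/// Apply the filter during the scan, so rows that cannot join are dropped
/// before they ever reach the join operator.
pub fn scan_with_filter(rows: &[ProbeRow], filter: &DynamicFilter) -> Vec<ProbeRow> {
    rows.iter()
        .copied()
        .filter(|(key, _)| filter.may_match(*key))
        .collect()
}
```

With build-side keys `[3, 7]`, scanning probe rows keyed 1, 3, 5, 7, 9 passes only the rows keyed 3 and 7 on to the join. The join still performs its own matching afterwards; the filter only reduces how much data the scan produces.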
@notfilippo -- I think |
Yeah, we use it as well. We have some custom code to decide when to push down predicates, and in general it's a pretty tricky thing to get right. |
My (perhaps unrealistic) hope is that we could find additional improvements (like #4028 or other optimizations) that could make up for any performance that was lost, so that we didn't have to have code to choose. |
I agree! 😄 I was just highlighting how the logical/physical separation could support performance improvements while simplifying things, such as handling custom code for dictionaries. That said, if the next quarter's focus is on performance, should I continue drafting a complete proposal for this change or put it on hold? |
I think further improvements to planning code structure will also help w.r.t. performance. We used to spend a lot of time in the planning phase due to avoidable issues (cloning etc.). We are in a better state now, but still have more work to do. We can also finalize our previous discussion on a better statistics infrastructure and start using this information in better ways during planning. |
@notfilippo, I think we should complete the exploratory work. Even if we don't get to full-focus on it, this is something we likely want to do at some point (gradually or otherwise). For example, it was through your draft I started to think about what would happen to |
I think we are at a much better place now for logical planning -- I don't think we have done anything similar for the various physical optimizer passes for |
I agree with this basic approach -- my feeling is that introducing logical types successfully will require some concerted effort from existing committers and people with expertise with the current code to help it along. @notfilippo is doing a great job but unless we find them help / support I think the project would struggle to be successful |
If I may chime in with my 2 cents. I have been following DataFusion and its development over the last couple of months, and I think DataFusion is a really unique project in the DBMS space that fits the "Future is extensible" vision very well. I think especially in the R&D world (industrial and academic), DataFusion makes research easier and more interesting, since you're starting from an already-present foundation and extending/modifying it as you need. I think that DataFusion could probably benefit greatly from more academic collaborations? I'd imagine that a lot of the performance optimisations, but also other kinds of projects, would make a great Master's thesis or research paper in the DBMS world. Not sure how you can attract more people to these kinds of projects, but it's just an idea I wanted to share. I do think, though, that DataFusion could benefit from more visibility and presence. The SIGMOD conference and all the other conferences at which DataFusion was presented were great for promoting it among different audiences, ranging from industry to academics and open source developers. It would be great to see DataFusion at other conferences and in journals as well (e.g. FOSDEM, Foundations and Trends in Databases, etc). I think especially in Europe, DataFusion is still not as well known, and if more DBMS people were to know about it, it would be beneficial to the future of the project. It might also be interesting if projects that are built on top of DataFusion could present and explain how they used DataFusion to build their project and what the advantages of using DataFusion were. |
I agree entirely @Abdullahsab3 -- thank you. In fact I believe it is exactly the plan of @XiangpengHao to do so. Perhaps he has some insights about how to make it more appealing to researchers. I also think Andy Pavlo's Advanced Database Course was an early adopter and tried to make projects based on DataFusion in Spring 2024: https://15721.courses.cs.cmu.edu/spring2024/project.html . I didn't hear much about how this actually went or what we could do to make it easier next time.
100% agree. This was the topic of many of the DataFusion San Francisco meetup talks recently, and I spoke about it in this talk:
I am particularly excited about the CMU database series this spring, which promises to be full of such explanations (the majority of those systems use DataFusion in some way): https://db.cs.cmu.edu/seminar2024/ |
I also 10% agree. I am actually speaking at the first DataFusion meetup in Europe next week:
Any chance anyone on this issue (or @Abdullahsab3) wants to help organize another European meetup? Perhaps at CIDR 2025: https://www.cidrdb.org/cidr2025/ 🤔 |
✌️ me too |
Yes! My team and I would be interested in helping organize a meetup. We don't use DataFusion directly, but we use it through influxdb -- it might be an interesting POV/angle to see how technical customers of a product built on top of DataFusion can (ab)use and benefit from DataFusion :D and also how using something like DataFusion, which can sound scary to use for building a product/doing R&D, can actually have great benefits. I will try to solidify something with my team this week/next week regarding a Western Europe meetup and come back with a proposal :) |
One possibility might be to try and arrange something colocated with CIDR in Amsterdam https://www.cidrdb.org/cidr2025/ -- there might be many people in town already that could be interested |
Is your feature request related to a problem or challenge?
@comphead asked #11426 (comment)
Which I think is an excellent question.
In general, since this project isn't really coordinated centrally, the roadmap typically follows what people are working on / want to invest time in.
However, it is a neat idea to collect any thoughts people have / want to share about what they might work on.
Describe the solution you'd like
Let's collect any projects that people think they are likely to spend time on or projects that the broader community would really like to see done and write them down!
Then we can add it to the roadmap on the doc site https://datafusion.apache.org/contributor-guide/roadmap.html
Describe alternatives you've considered
No response
Additional context
No response