Replies: 23 comments 6 replies
-
I could be mistaken, but I'm not sure there is much overlap between the various language implementations in general. I'm personally not aware of this being a problem. I definitely think this relates to #6441: if the focus is an Arrow-native query engine, it seems in both Arrow's and DataFusion's interests to promote cross-pollination between the projects by not introducing additional bureaucratic hurdles between the two. If the focus is instead on the non-execution components, then perhaps that level is a more logical place to draw some kind of separation, similar to Velox?
-
I want to be clear that I view this proposal as mostly a "marketing" / "branding" exercise with some governance improvements thrown in -- I don't expect there to be much impact on the day to day development activities of DataFusion and arrow-rs
-
I agree with @alamb's proposal in that the distinction does greatly benefit DataFusion from a marketing and branding perspective.
-
ASF has a mantra "community over code". Applying that here, DataFusion seems to have a separate community and therefore should be a separate project. This wouldn't be cutting ties with Arrow. The projects can still interoperate. But being separate projects, they don't have to coordinate and remain in lock-step. Their release schedule, governance, and community practices can drift the way their communities want them to go. People have asked whether "in Rust" should be part of the project description. From what I can see, Rust is a big unifying factor in the DataFusion community, so I am inclined to say yes.
-
This makes a lot of sense to me and I think it's a great thing. I see this as the ecosystem growing to the point where we can support two complete projects.
If others are interested then I think that's fine. However, my personal opinion is that this probably isn't necessary at this point. I'm just not sure what benefits there would be.
-
I think this is a good idea overall.
I agree with this and I think being a project on its own would benefit DataFusion. A PMC/committer roster with members actively contributing to the DataFusion project would improve governance and speed of progress IMO.
-
Totally agree!
-
Looking forward to it!
-
agree with this proposal
-
Agree with this idea.
-
I think this is a great idea!! On the Acero question:
Because the community is different, I don't think it makes much sense.
I agree with the above point as well that there is no need to bring Acero along.
-
FWIW an update here is that once my term as Arrow PMC chair runs out, I will put my organizational energy into figuring out the next steps and moving this proposal forward. I expect that to be in a few more months
-
It has been brought to my attention that changing the project structure (even a new project structure in Apache) may cause some disruption, especially for contributors from large companies where contributing to each Apache project must be specifically vetted. With this in mind, I will make sure to over-communicate any movement on this issue and will ensure consensus prior to making anything official.
-
For anyone following along we are working on a more specific proposal via #8491
-
Update: Here is the specific proposal for the new top level project. The expected timeline is
See mailing list discussion for more details
-
Thank you @alamb for working on this 💪
-
Hi, what's the status now?
-
Hi @liurenjie1024 - I still plan to submit for the formal vote this week (I am a bit behind -- thank you for the reminder) and then submit for formal ASF board approval at the April Board meeting
-
I have updated the proposal document https://docs.google.com/document/d/11WTNYS8KWScOt3ySTX39WVS6krPhUvHsuJRY9PZQx4g/edit with additional information (new committers Jeffrey Vo and Jay Zhan, and the new datafusion-comet repository). I plan to:
This extended timeline is designed to give some contributors time to prepare for the changed structure with their employers.
-
Here is the official vote thread on the arrow dev list: https://lists.apache.org/thread/tv8s8ootxf7nrsp3vo1mt8mtxxt5qcor (if that passes, we'll submit to the board for approval in April)
-
Added the board resolution to the draft ASF Arrow board report: https://docs.google.com/document/d/1q6uBW4MNijY8cThZf0d-XZTcCxAOhUgp1sUG6_eOBMY/edit (mailing list link: https://lists.apache.org/thread/qyg07kdvjnb0178087kpt5osqglrrw8x)
-
We are also tracking todo items after we become a top level project in #9691
-
It's official 🎉 Apache DataFusion is now its own top level project in the Apache Software Foundation: https://projects.apache.org/project.html?datafusion We are tracking various todos here: #9691
-
TLDR
Apache Arrow DataFusion --> Apache DataFusion
Introduction
Arrow has become a large and widely adopted project, with at least 12 language implementations, 96 committers, and 47 PMC members at the time of this writing. It also has several subprojects such as FlightSQL, and at least two query engine subprojects.
Apache Arrow DataFusion, a query engine written in Rust, has been part of Arrow since February 2019. In that time, it has helped drive the Rust implementations of Arrow and Parquet (arrow-rs and parquet). It has also included the Ballista engine since 2021, as well as the datafusion-python bindings.
Challenges
Being part of Arrow has been great for DataFusion. DataFusion has benefited from the existing stable and well understood governance structures, processes, and community. DataFusion contributors learned many things from the Arrow community, both technically as well as organizationally. The community has grown to the point where there is now very active development from many different contributors and a history of multiple commits to the main branch daily for several years.
However, DataFusion arguably has a different mission than Arrow. Arrow powers and enables interoperability for in-memory analytics. DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust. DataFusion does use Arrow extensively and to good effect but not the other way around.
The DataFusion community is largely different as well. It has its own core set of committers, which largely don't overlap with other Arrow implementations or Arrow repositories. There are also many PMC members whose near sole focus appears to be DataFusion. Having non-overlapping communities is sometimes challenging for communications such as on the mailing lists [email protected] because relevant and irrelevant content is intermixed. I have also heard anecdotally that writing to the main [email protected] mailing list for questions about DataFusion is intimidating given its perceived broad reach.
Finally, DataFusion is technically different than Arrow. DataFusion is developed in its own repository, at a different release cadence (every 2 weeks vs 3 months for the main repo), and has a different documentation site.
Thus, given the differences in mission, code, and community, I believe both Arrow and DataFusion could benefit from being separated so they can continue to focus and grow.
Proposal:
I propose we move DataFusion to its own top level Apache project (graduate the subproject in Arrow to its own top level project). The project could be named "Apache DataFusion".
I think we need a community discussion and consensus before taking any steps, so I wanted to start the conversation here.
Specifics
Benefits to DataFusion being its own top level project:
Potential Downsides
Open questions:
Related