pandas API #3
Comments
It would be nice to also see what a dream-like DataFrame API would look like. I guess there are multiple opinions on how a DataFrame API should look, but it would be really good to cover them. This also goes a bit in the direction of #4, as there will be choices like eager vs lazy API, in-place modifications vs full immutability, out-of-core/in-memory/distributed and so on. Everyone is vocal that some design choices made in the history of pandas are regretted nowadays, e.g. https://wesmckinney.com/blog/apache-arrow-pandas-internals/. We cannot solve them with a single API but we can definitely improve on the pandas API. While Apache Arrow is trying to provide an in-memory format for DataFrame-like data, basic algorithms and a lot of I/O, its intention is not to provide an end-user API. It is, though, a tool for building future DataFrame APIs for end users by providing the necessary, performant building blocks.
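To make the eager vs lazy and in-place vs immutable choices concrete, here is a minimal pure-Python sketch. `LazyFrame`, `eager_filter`, and the row-as-dict representation are hypothetical names invented for illustration; they are not the API of pandas, Arrow, or any existing library.

```python
from dataclasses import dataclass
from typing import Callable


def eager_filter(rows: list, pred: Callable[[dict], bool]) -> list:
    """Eager style: the work happens immediately and a new list is returned."""
    return [r for r in rows if pred(r)]


@dataclass(frozen=True)  # frozen: methods return new objects instead of mutating
class LazyFrame:
    rows: tuple
    steps: tuple = ()

    def filter(self, pred):
        # Lazy style: only record the step, do no work yet.
        return LazyFrame(self.rows, self.steps + (("filter", pred),))

    def select(self, *cols):
        return LazyFrame(self.rows, self.steps + (("select", cols),))

    def collect(self) -> list:
        # Work is deferred until the caller asks for the result.
        out = list(self.rows)
        for kind, arg in self.steps:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:  # "select"
                out = [{c: r[c] for c in arg} for r in out]
        return out


data = ({"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30})

eager_result = eager_filter(list(data), lambda r: r["a"] > 1)
plan = LazyFrame(data).filter(lambda r: r["a"] > 1).select("b")
lazy_result = plan.collect()
```

Because the lazy variant holds the whole chain of steps before running anything, a library built this way has room to reorder or fuse them, which the eager variant cannot do.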
This is something we're exploring as well in Modin, but at a multiprocessing/distributed level. We are taking an academic approach toward solving some of these issues. We are also building Modin to be pluggable and have played around with Arrow compute kernels and one of the Gandiva LLVM operators to see how they affect performance in the multiprocessing setting. (Spoiler alert: Arrow is fast 😄). Modin is modular to allow for these types of improvements (you can also run it on Dask now).
I doubt there's an easy answer here. Once something becomes a standard it is very difficult to change that standard in a significant way, and for better or worse pandas (and by extension the API) is a standard. "The best API is the one you already know". Easy to use is also relative because any issue I have with pandas is almost guaranteed to have been answered on StackOverflow before. "How do I ... in pandas?" is really easy to find answers for. As intuitive as a new API could be, it's going to be tough to beat the internet help/community of pandas.
Part of the mission of Modin is to meet users where they are and take on the challenges that presents. Some things within the pandas API will never be fast in a distributed environment. Some operations are extremely difficult to support. Truthfully, the API scope problem is different from the execution problem, because a sufficiently intelligent query planner would be able to identify poorly written code and optimize it. That includes understanding what a user is trying to do rather than what they are typing. To that end, we have a reduced internal API for the operations that we support because we want there to be one implementation for a given behavior. From a systems perspective, pandas is an easy system to hate because it breaks so much of the conventional wisdom that database work has brought us over the last 40+ years. Part of my PhD work is to bridge this gap without losing the things that make pandas what it is. It might seem like I love the pandas API, but I do not (trust me on this 😄). I wanted to lay out the challenges I see in changing or making a new API/library, and I think it's a lot harder than just making things faster or simplifying the API.
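As a rough illustration of a planner "understanding what a user is trying to do", the sketch below rewrites a full sort followed by taking the first n rows into a single top-k step. The plan format and the `optimize` and `execute` functions are hypothetical and invented for this example; this is not how Modin's internal API or planner is implemented.

```python
import heapq


def optimize(plan):
    """Rewrite ("sort", key) followed by ("head", n) into ("topk", key, n)."""
    out = []
    for op in plan:
        if op[0] == "head" and out and out[-1][0] == "sort":
            _, key = out.pop()
            out.append(("topk", key, op[1]))
        else:
            out.append(op)
    return out


def execute(rows, plan):
    """Run a plan over a list of dict rows."""
    for op in plan:
        if op[0] == "sort":
            rows = sorted(rows, key=lambda r: r[op[1]])
        elif op[0] == "head":
            rows = rows[: op[1]]
        elif op[0] == "topk":
            # Avoids sorting every row when only the smallest n are needed.
            rows = heapq.nsmallest(op[2], rows, key=lambda r: r[op[1]])
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return rows


rows = [{"x": 3}, {"x": 1}, {"x": 2}]
naive_plan = [("sort", "x"), ("head", 2)]   # what the user typed
better_plan = optimize(naive_plan)          # what the user meant
```

Both plans return the same rows; the rewritten one simply does less work, which is the sense in which the execution problem can be separated from the API scope problem.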
There are cases where the pandas API can be inconsistent or not very intuitive.
Also, the pandas namespaces are huge (`Series` has around 200 public attributes/methods).

It would be useful to discuss possible improvements to the API, places where people reimplementing it thought that they were replicating something wrong, and general ideas for making the pandas public API easier for users.
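The namespace size mentioned above can be measured with a small helper. This is a sketch using only the standard library; `public_names` is a hypothetical helper name, and it is demonstrated on the builtin `str` so that it runs without pandas installed.

```python
def public_names(obj):
    """Return the sorted names on obj that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


# Demonstrated on a builtin type; any class or module can be passed instead.
str_names = public_names(str)
print(len(str_names), "public names on str")
```

With pandas installed, `len(public_names(pandas.Series))` gives the corresponding figure for `Series`; the exact count depends on the pandas version.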