# Fair dataframe API vs API vs SQL benchmarking (#1515)
Thanks for opening this issue, and apologies for the lack of engagement on #1498 so far. Our intention is not to publish any unfair results, and we are interested in comparing engines on equal footing to the best of our ability. You are mentioning a couple of different topics, and some of them require further discussion. From what I can tell, you are raising these problems:

1. hand-written pre-filtering of the input tables,
2. hand-written `leftsemi` joins, and
3. pre-computing aggregation inputs outside of the group-by context.
I will grant you that 1.) is indeed not correct in some situations. This wasn't done intentionally and we still have to address it. I think #1498 describes this problem already, so I will not open a separate issue. The two other issues require a bit more debate.

### Leftsemi

Dask is using a `leftsemi` join in a couple of queries. Let's talk about Q4 in detail since it's the simplest example. The functional definition of the query according to the TPC-H specification is

```sql
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]'
    and o_orderdate < date '[DATE]' + interval '3' month
    and exists (
        select
            *
        from
            lineitem
        where
            l_orderkey = o_orderkey
            and l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority;
```

The `exists` clause is the interesting part here. My attempt to put into words what the above query is describing: "Take every row of `orders` in the given date range for which at least one matching row in `lineitem` with `l_commitdate < l_receiptdate` exists, then group and count." This is pretty much the definition of a `leftsemi` join: keep the rows of the left table that have a match in the right table, without duplicating rows for multiple matches.

Comparing this to how polars is implementing this query, namely joining the tables with a plain inner join and deduplicating afterwards, I'd argue the semi join is the more faithful translation. The cases for query 18 and 20 are more complex, and instead of `exists` they use `in` subqueries, but the same argument applies there. I would actually encourage polars to use a semi join as well. Thoughts?
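For illustration, here is a minimal sketch of Q4 written with an explicit semi join in polars. The parquet paths and the concrete date literals (the spec's validation values) are assumptions for illustration, not the benchmark's actual code:

```python
from datetime import date

import polars as pl

# Placeholder paths for wherever the TPC-H tables live.
orders = pl.scan_parquet("orders/*.parquet")
lineitem = pl.scan_parquet("lineitem/*.parquet")

result = (
    orders
    # the [DATE] filter from the spec; 1993-07-01 is the validation value
    .filter(
        pl.col("o_orderdate").is_between(
            date(1993, 7, 1), date(1993, 10, 1), closed="left"
        )
    )
    # semi join: keep orders that have at least one late lineitem row,
    # without duplicating rows for multiple matches
    .join(
        lineitem.filter(pl.col("l_commitdate") < pl.col("l_receiptdate")),
        left_on="o_orderkey",
        right_on="l_orderkey",
        how="semi",
    )
    .group_by("o_orderpriority")
    .agg(pl.len().alias("order_count"))
    .sort("o_orderpriority")
    .collect()
)
```

Because the semi join never multiplies rows on the left side, no deduplication step is needed before the count.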
### Groupby aggregations

As an example, you are pointing here to Q1 and especially how the following line is being translated.

```sql
select
    ...
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    ...
from
    lineitem
    ...
```

In dask and pandas we currently don't have a way to express more complex columnar expressions like this. Instead, we do:

```python
lineitem_filtered["sum_charge"] = (
    lineitem_filtered.l_extendedprice
    * (1 - lineitem_filtered.l_discount)
    * (1 + lineitem_filtered.l_tax)
)
...
gb = lineitem_filtered.groupby(["l_returnflag", "l_linestatus"])
total = gb.agg(
    {
        ...
        "sum_charge": "sum",
        ...
    }
)
```

I fully agree that the way polars can describe this operation using expressions is more elegant. We currently don't support this, but we have it on our roadmap (see dask/dask-expr#386).
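For comparison, a minimal sketch of what the expression form looks like in polars, where the arithmetic lives inside the aggregation context and is visible to the optimizer as a whole (the scan path is an assumption):

```python
import polars as pl

lineitem_filtered = pl.scan_parquet("lineitem/*.parquet")  # placeholder path

total = (
    lineitem_filtered
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        # the sum_charge expression from Q1, written inline rather than
        # materialized as a separate column beforehand
        (
            pl.col("l_extendedprice")
            * (1 - pl.col("l_discount"))
            * (1 + pl.col("l_tax"))
        )
        .sum()
        .alias("sum_charge")
    )
    .collect()
)
```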
---

Thanks for engaging on this @fjetter. I think what we should try to achieve is to come up with a way of writing the dataframe queries that fits the SQL best, for both Polars and Dask for that matter.

### left semi

Given your explanation of the `exists` clause, a semi join does seem like the right translation here.

### Optimization
Sorry, I don't think I follow. What do you think is negligible? If an optimizer doesn't come to a certain conclusion, a query can explode in runtime. For instance, Q5 doesn't even finish if we turn off optimizations. On the other hand, we also do optimizations that are fairly expensive, for instance CSE. And coming to the correct conclusion that Dask applied to the group-by pre-aggregations is not per se trivial. I understand that might be a current limitation in the API, but I do think these can have significant runtime effects if an optimizer cannot prove CSE can be applied earlier.

### Aligning queries

I think we agree on points 1. and 2. and can align the queries on those fronts, as I think it is important that we start from the same starting blocks (as much as possible). For the filtering locations: these queries are all checked with narwhals to ensure they both compile down to the same eager queries with regard to filters.
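As a rough sketch of what such a narwhals-based check can look like (the function name and the specific filter are illustrative assumptions, not the actual test code):

```python
from datetime import datetime

import narwhals as nw

def shipdate_filter(df_native):
    # One definition of the filter, applied to a pandas or Polars frame
    # alike, so the eager queries cannot silently diverge per engine.
    df = nw.from_native(df_native)
    return df.filter(nw.col("l_shipdate") <= datetime(1998, 9, 2)).to_native()
```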
---

Sorry if I haven't made myself clear. I was specifically and exclusively referring to the introduction of column expressions. Column expressions would have to be interpreted by the optimizer, which could yield suboptimal results and additional computation overhead compared to what we are doing right now, since we are already writing down the aggregation in a different, more explicit form. When I'm talking about whether things are negligible, I'm also thinking about the very large runs on scale 1k or 10k, which all run for minutes; if the optimizer is done in less than a second, I don't care much about how fast it is. FWIW, if we disabled query optimizations, dask wouldn't be able to run most of the queries at all. (We've been writing a bit about this over here.)
---

I think there should be another point added as well: **join reordering**. This has a huge impact on performance. Polars doesn't have a join re-ordering optimizer yet, but we still chose to stay true to the SQL translation. However, I see Dask queries having a manually chosen better ordering. This is not fair to Polars and the SQL solutions, as those need to figure out the join ordering themselves, and if they don't, the memory/runtime can explode. I think that the current benchmarks are completely apples to peaches as it is now.

![image](https://github.com/coiled/benchmarks/assets/3023000/3335f120-d497-49b0-880d-051c2ede6613)
---

(As a heads-up, Florian is out on vacation this coming week, and may be less responsive.)
---

Yes, join ordering is something that is done manually to some extent right now. We've been transparent about this when publishing the results, see https://docs.coiled.io/blog/tpch.html#dask

The query you are pointing to is Q9. This is actually an interesting example. The functional SQL definition is

```sql
select
    nation,
    o_year,
    round(sum(amount), 2) as sum_profit
from
    (
        select
            n_name as nation,
            year(o_orderdate) as o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount
        from
            part,
            supplier,
            lineitem,
            partsupp,
            orders,
            nation
        where
            s_suppkey = l_suppkey
            and ps_suppkey = l_suppkey
            and ps_partkey = l_partkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and s_nationkey = n_nationkey
            and p_name like '%green%'
    ) as profit
group by
    nation,
    o_year
order by
    nation,
    o_year desc;
```

Following this very strictly would mean you'd have to join `part` with `supplier` first, even though there is no join condition connecting the two tables, i.e. you'd start with a cross join. In this example, the reordering was required to make sense of the query. There are other examples where we did reorder manually to make the query faster. One example is Q18, where we started off with the same ordering as the tables listed in the from clause of the functional SQL. In this example, polars uses an ordering that is also different from how the tables are listed in SQL, and we adopted this here instead for improved performance.

So, you are very right. We are comparing apples to peaches again because we are comparing a declarative language to an imperative API. There is no universal, unambiguous set of rules (that I am aware of) that translates SQL to a DataFrame API. This example shows that even with a join order optimizer as part of the DataFrame API, this ambiguity still exists to some extent. I'm sorry if this feels unfair to polars as the only other DataFrame API library. We just used the polars code that was in your repo at the time but haven't gone through optimizing it. I'm not sure how to do this better.
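To illustrate the reordering discussed above, here is a pandas-style sketch of a key-respecting ordering for Q9, with `lineitem` as the spine so every merge has an equi-join condition. The dataframe names and the filter placement are assumptions, not the benchmark's code:

```python
import pandas as pd

# Assume the TPC-H tables have been loaded as pandas DataFrames, e.g.:
# lineitem = pd.read_parquet("lineitem"); part = pd.read_parquet("part"); ...
profit = (
    lineitem
    .merge(
        part[part.p_name.str.contains("green")],
        left_on="l_partkey", right_on="p_partkey",
    )
    .merge(supplier, left_on="l_suppkey", right_on="s_suppkey")
    .merge(
        partsupp,
        left_on=["l_suppkey", "l_partkey"],
        right_on=["ps_suppkey", "ps_partkey"],
    )
    .merge(orders, left_on="l_orderkey", right_on="o_orderkey")
    .merge(nation, left_on="s_nationkey", right_on="n_nationkey")
)
```

Joining `part` and `supplier` first, as the from clause lists them, would instead produce a cross product before any join key could be applied.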
---

A comment doesn't reach as far as a graph. ;) I just modified Q9 according to the dask query ordering and saw a 2.5x improvement on the Polars side. I agree that doing a fair benchmark between a declarative and an imperative API is hard. However, what I do think is important is that we do the same wherever possible.

We can start by applying the same join orderings for the DataFrame APIs: if you do a manual join reordering for Dask, you should do it for Polars as well. I don't think a comment is enough here, as that means the chart will be misleading, and charts are often shared without context. I can do a pass over the Polars queries upstream to make them more similar to the Dask ones. And I think any hand-written optimization should be applied to the other tools whenever possible. Upstream we will make an effort to do the same.
---

Similar to #1498. I think that, as the queries are currently written, it isn't a fair comparison between DataFrame APIs. For SQL it is fair, as the TPC-H benchmark states that all engines should parse the same SQL; for DataFrames with different APIs this is of course harder.

TPC-H is not only an operation benchmark, it is also an optimization benchmark. If we compare against SQL engines, the DataFrame APIs shouldn't enjoy more optimizations than the SQL queries. For this reason we removed hand-written optimizations from the polars tpch benchmarks for Polars, pandas, Dask and modin.
To make the benchmarks apples to apples, I think this should be done in this repo as well. The Dask queries for instance show hand-written "leftsemi" joins:

- benchmarks/tests/tpch/dask_queries.py, line 153 in 13ebb9c
- benchmarks/tests/tpch/dask_queries.py, line 1120 in 13ebb9c

and pre-filtering (a sketch of what this looks like follows the list):

- benchmarks/tests/tpch/dask_queries.py, line 10 in 13ebb9c
- benchmarks/tests/tpch/dask_queries.py, line 123 in 13ebb9c
- benchmarks/tests/tpch/dask_queries.py, line 239 in 13ebb9c
- < actually almost all queries in some form >
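A hedged sketch of the pattern being referred to (the column and filter choices here are illustrative, not copied from the repo): the dataframe code narrows a table by hand before the query proper, which is exactly the projection and filter pushdown a SQL engine must discover on its own.

```python
import pandas as pd

lineitem = pd.read_parquet("lineitem")  # placeholder path

# Hand-written pushdown: filter rows and select columns up front, so the
# rest of the query never touches the full table.
lineitem_filtered = lineitem[lineitem.l_shipdate <= pd.Timestamp("1998-09-02")][
    ["l_returnflag", "l_linestatus", "l_quantity",
     "l_extendedprice", "l_discount", "l_tax"]
]
```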
And pre-aggregation computation, where SQL defines this inside the aggregation context, in many group-bys. For this one, we are lenient towards pandas-style APIs upstream, as there isn't a good way to describe it there. But now that dask-expr is developed, I believe Dask should just inline this and leave it to the optimizer.

If we compare DataFrame APIs and SQL, I truly believe it is important that we give all implementations a fair starting position. Optimization engines could come to the same conclusions, but they have to spend compute to do so, and often will not come to the most optimal conclusions.