Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query 18 spill to disk and take a lot of space #90

Open
djouallah opened this issue Mar 23, 2023 · 1 comment
Open

Query 18 spill to disk and take a lot of space #90

djouallah opened this issue Mar 23, 2023 · 1 comment

Comments

@djouallah
Copy link

djouallah commented Mar 23, 2023

I am trying to run TPCH-SF100 on my laptop, the good news, I can build the hyper file easily, the test run very well except Query 18 which timeout as I don't have enough empty space on my laptop , I saw some people sort lineitem but I can't do it on my laptop as it time out too

defining FK and PK did not help

SELECT
    --Query18
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    SUM(l_quantity)
FROM
    customer,
    orders,
    lineitem
WHERE
    o_orderkey IN (
        SELECT
            l_orderkey
        FROM
            lineitem
        GROUP BY
            l_orderkey
        HAVING
            SUM(l_quantity) > 300
    )
    AND c_custkey = o_custkey
    AND o_orderkey = l_orderkey
GROUP BY
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice
ORDER BY
    o_totalprice DESC,
    o_orderdate
LIMIT
    100;
@djouallah djouallah changed the title Hyper struggle with Query 18 Query 18 spill to disk and take a lot of space Mar 23, 2023
@vogelsgesang
Copy link
Contributor

Thank you for this report! And thanks for benchmarking Hyper! Benchmarks results are very valuable for us to understand where Hyper should still be improved. In particular, if those benchmarks are done by people outside the Hyper team, that also helps to confirm the reproduceability of Hyper's performance 🙂

In this particular case, it's probably good that Hyper isn't able to successfully finish the query - finishing it with spooling would be horribly slow. In general, I would recommend to just disable spooling altogether... I prefer a quick error message over a very long-running query.

The problem for this query is that the subquery

SELECT
    l_orderkey
FROM
    lineitem
GROUP BY
    l_orderkey
HAVING
    SUM(l_quantity) > 300

runs out of memory. The GROUP BY l_orderkey needs to keep too much state in memory.

We will have to see when we get to this. Fixing TPC-H queries is currently not our primary business priority, but as performance enthusiasts, we are of course personally interested in unblocking TPC-H Q18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants