forked from opensearch-project/opensearch-spark
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add TPC-H PPL query suite (opensearch-project#830)
* Add TPC-H PPL query suite Signed-off-by: Lantao Jin <[email protected]> * fix failure of loading resources Signed-off-by: Lantao Jin <[email protected]> * fix data_add() Signed-off-by: Lantao Jin <[email protected]> * enable q21 and add docs Signed-off-by: Lantao Jin <[email protected]> --------- Signed-off-by: Lantao Jin <[email protected]>
- Loading branch information
Showing
27 changed files
with
1,323 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
## TPC-H Benchmark | ||
|
||
TPC-H is a decision support benchmark designed to evaluate the performance of database systems in handling complex business-oriented queries and concurrent data modifications. The benchmark utilizes a dataset that is broadly representative of various industries, making it widely applicable. TPC-H simulates a decision support environment where large volumes of data are analyzed, intricate queries are executed, and critical business questions are answered. | ||
|
||
### Test PPL Queries | ||
|
||
TPC-H 22 test query statements: [TPCH-Query-PPL](https://github.com/opensearch-project/opensearch-spark/blob/main/integ-test/src/integration/resources/tpch) | ||
|
||
### Data Preparation | ||
|
||
#### Option 1 - from PyPi | ||
|
||
``` | ||
# Create the virtual environment | ||
python3 -m venv .venv | ||
# Activate the virtual environment | ||
. .venv/bin/activate | ||
pip install tpch-datagen | ||
``` | ||
|
||
#### Option 2 - from source | ||
|
||
``` | ||
git clone https://github.com/gizmodata/tpch-datagen | ||
cd tpch-datagen | ||
# Create the virtual environment | ||
python3 -m venv .venv | ||
# Activate the virtual environment | ||
. .venv/bin/activate | ||
# Upgrade pip, setuptools, and wheel | ||
pip install --upgrade pip setuptools wheel | ||
# Install TPC-H Datagen - in editable mode with client and dev dependencies | ||
pip install --editable .[dev] | ||
``` | ||
|
||
#### Usage | ||
|
||
Here are the options for the tpch-datagen command: | ||
``` | ||
tpch-datagen --help | ||
Usage: tpch-datagen [OPTIONS] | ||
Options: | ||
--version / --no-version Prints the TPC-H Datagen package version and | ||
exits. [required] | ||
--scale-factor INTEGER The TPC-H Scale Factor to use for data | ||
generation. | ||
--data-directory TEXT The target output data directory to put the | ||
files into [default: data; required] | ||
--work-directory TEXT The work directory to use for data | ||
generation. [default: /tmp; required] | ||
--overwrite / --no-overwrite Can we overwrite the target directory if it | ||
already exists... [default: no-overwrite; | ||
required] | ||
--num-chunks INTEGER The number of chunks that will be generated | ||
- more chunks equals smaller memory | ||
requirements, but more files generated. | ||
[default: 10; required] | ||
--num-processes INTEGER The maximum number of processes for the | ||
multi-processing pool to use for data | ||
generation. [default: 10; required] | ||
--duckdb-threads INTEGER The number of DuckDB threads to use for data | ||
generation (within each job process). | ||
[default: 1; required] | ||
--per-thread-output / --no-per-thread-output | ||
Controls whether to write the output to a | ||
single file or multiple files (for each | ||
process). [default: per-thread-output; | ||
required] | ||
--compression-method [none|snappy|gzip|zstd] | ||
The compression method to use for the | ||
parquet files generated. [default: zstd; | ||
required] | ||
--file-size-bytes TEXT The target file size for the parquet files | ||
generated. [default: 100m; required] | ||
--help Show this message and exit. | ||
``` | ||
|
||
### Generate 1 GB data with zstd (by default) compression | ||
|
||
``` | ||
tpch-datagen --scale-factor 1 | ||
``` | ||
|
||
### Generate 10 GB data with snappy compression | ||
|
||
``` | ||
tpch-datagen --scale-factor 10 --compression-method snappy | ||
``` | ||
|
||
### Query Test | ||
|
||
All TPC-H PPL Queries located in `integ-test/src/integration/resources/tpch` folder. | ||
|
||
To test all queries, run `org.opensearch.flint.spark.ppl.tpch.TPCHQueryITSuite`. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
/* | ||
select | ||
l_returnflag, | ||
l_linestatus, | ||
sum(l_quantity) as sum_qty, | ||
sum(l_extendedprice) as sum_base_price, | ||
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, | ||
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, | ||
avg(l_quantity) as avg_qty, | ||
avg(l_extendedprice) as avg_price, | ||
avg(l_discount) as avg_disc, | ||
count(*) as count_order | ||
from | ||
lineitem | ||
where | ||
l_shipdate <= date '1998-12-01' - interval '90' day | ||
group by | ||
l_returnflag, | ||
l_linestatus | ||
order by | ||
l_returnflag, | ||
l_linestatus | ||
*/ | ||
|
||
source = lineitem | ||
| where l_shipdate <= subdate(date('1998-12-01'), 90) | ||
| stats sum(l_quantity) as sum_qty, | ||
sum(l_extendedprice) as sum_base_price, | ||
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, | ||
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, | ||
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, | ||
avg(l_discount) as avg_disc, | ||
count() as count_order | ||
by l_returnflag, l_linestatus | ||
| sort l_returnflag, l_linestatus |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
/* | ||
select | ||
c_custkey, | ||
c_name, | ||
sum(l_extendedprice * (1 - l_discount)) as revenue, | ||
c_acctbal, | ||
n_name, | ||
c_address, | ||
c_phone, | ||
c_comment | ||
from | ||
customer, | ||
orders, | ||
lineitem, | ||
nation | ||
where | ||
c_custkey = o_custkey | ||
and l_orderkey = o_orderkey | ||
and o_orderdate >= date '1993-10-01' | ||
and o_orderdate < date '1993-10-01' + interval '3' month | ||
and l_returnflag = 'R' | ||
and c_nationkey = n_nationkey | ||
group by | ||
c_custkey, | ||
c_name, | ||
c_acctbal, | ||
c_phone, | ||
n_name, | ||
c_address, | ||
c_comment | ||
order by | ||
revenue desc | ||
limit 20 | ||
*/ | ||
|
||
source = customer | ||
| join ON c_custkey = o_custkey orders | ||
| join ON l_orderkey = o_orderkey lineitem | ||
| join ON c_nationkey = n_nationkey nation | ||
| where o_orderdate >= date('1993-10-01') | ||
AND o_orderdate < date_add(date('1993-10-01'), interval 3 month) | ||
AND l_returnflag = 'R' | ||
| stats sum(l_extendedprice * (1 - l_discount)) as revenue by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment | ||
| sort - revenue | ||
| head 20 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
/* | ||
select | ||
ps_partkey, | ||
sum(ps_supplycost * ps_availqty) as value | ||
from | ||
partsupp, | ||
supplier, | ||
nation | ||
where | ||
ps_suppkey = s_suppkey | ||
and s_nationkey = n_nationkey | ||
and n_name = 'GERMANY' | ||
group by | ||
ps_partkey having | ||
sum(ps_supplycost * ps_availqty) > ( | ||
select | ||
sum(ps_supplycost * ps_availqty) * 0.0001000000 | ||
from | ||
partsupp, | ||
supplier, | ||
nation | ||
where | ||
ps_suppkey = s_suppkey | ||
and s_nationkey = n_nationkey | ||
and n_name = 'GERMANY' | ||
) | ||
order by | ||
value desc | ||
*/ | ||
|
||
source = partsupp | ||
| join ON ps_suppkey = s_suppkey supplier | ||
| join ON s_nationkey = n_nationkey nation | ||
| where n_name = 'GERMANY' | ||
| stats sum(ps_supplycost * ps_availqty) as value by ps_partkey | ||
| where value > [ | ||
source = partsupp | ||
| join ON ps_suppkey = s_suppkey supplier | ||
| join ON s_nationkey = n_nationkey nation | ||
| where n_name = 'GERMANY' | ||
| stats sum(ps_supplycost * ps_availqty) as check | ||
| eval threshold = check * 0.0001000000 | ||
| fields threshold | ||
] | ||
| sort - value |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
/* | ||
select | ||
l_shipmode, | ||
sum(case | ||
when o_orderpriority = '1-URGENT' | ||
or o_orderpriority = '2-HIGH' | ||
then 1 | ||
else 0 | ||
end) as high_line_count, | ||
sum(case | ||
when o_orderpriority <> '1-URGENT' | ||
and o_orderpriority <> '2-HIGH' | ||
then 1 | ||
else 0 | ||
end) as low_line_count | ||
from | ||
orders, | ||
lineitem | ||
where | ||
o_orderkey = l_orderkey | ||
and l_shipmode in ('MAIL', 'SHIP') | ||
and l_commitdate < l_receiptdate | ||
and l_shipdate < l_commitdate | ||
and l_receiptdate >= date '1994-01-01' | ||
and l_receiptdate < date '1994-01-01' + interval '1' year | ||
group by | ||
l_shipmode | ||
order by | ||
l_shipmode | ||
*/ | ||
|
||
source = orders | ||
| join ON o_orderkey = l_orderkey lineitem | ||
| where l_commitdate < l_receiptdate | ||
and l_shipdate < l_commitdate | ||
and l_shipmode in ('MAIL', 'SHIP') | ||
and l_receiptdate >= date('1994-01-01') | ||
and l_receiptdate < date_add(date('1994-01-01'), interval 1 year) | ||
| stats sum(case(o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH', 1 else 0)) as high_line_count, | ||
sum(case(o_orderpriority != '1-URGENT' and o_orderpriority != '2-HIGH', 1 else 0)) as low_line_countby | ||
by l_shipmode | ||
| sort l_shipmode |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
/* | ||
select | ||
c_count, | ||
count(*) as custdist | ||
from | ||
( | ||
select | ||
c_custkey, | ||
count(o_orderkey) as c_count | ||
from | ||
customer left outer join orders on | ||
c_custkey = o_custkey | ||
and o_comment not like '%special%requests%' | ||
group by | ||
c_custkey | ||
) as c_orders | ||
group by | ||
c_count | ||
order by | ||
custdist desc, | ||
c_count desc | ||
*/ | ||
|
||
source = [ | ||
source = customer | ||
| left outer join ON c_custkey = o_custkey AND not like(o_comment, '%special%requests%') | ||
orders | ||
| stats count(o_orderkey) as c_count by c_custkey | ||
] as c_orders | ||
| stats count() as custdist by c_count | ||
| sort - custdist, - c_count |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
/* | ||
select | ||
100.00 * sum(case | ||
when p_type like 'PROMO%' | ||
then l_extendedprice * (1 - l_discount) | ||
else 0 | ||
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue | ||
from | ||
lineitem, | ||
part | ||
where | ||
l_partkey = p_partkey | ||
and l_shipdate >= date '1995-09-01' | ||
and l_shipdate < date '1995-09-01' + interval '1' month | ||
*/ | ||
|
||
source = lineitem | ||
| join ON l_partkey = p_partkey | ||
AND l_shipdate >= date('1995-09-01') | ||
AND l_shipdate < date_add(date('1995-09-01'), interval 1 month) | ||
part | ||
| stats sum(case(like(p_type, 'PROMO%'), l_extendedprice * (1 - l_discount) else 0)) as sum1, | ||
sum(l_extendedprice * (1 - l_discount)) as sum2 | ||
| eval promo_revenue = 100.00 * sum1 / sum2 // Stats and Eval commands can combine when issues/819 resolved | ||
| fields promo_revenue |
Oops, something went wrong.