The following instructions are derived from this project.
We will build the tools and run TPC-DS benchmarks on Spark 3.0.1 using an S3 data lake for storage.
If you want to generate the data and run the benchmark using pre-built images, you can go directly to the Run the benchmark section.
You will need a pre-built Spark image with the S3 connector built in. Refer to this project.
Note: You will need the sbt tool to build the dependency and the benchmark tool.
- spark-sql-perf: the latest version in the Maven Central repository is 0.3.2, which is too old, so we need to build a newer library from source.
git clone https://github.com/databricks/spark-sql-perf
cd spark-sql-perf
Check/edit the versions of Spark and Scala you want to use in the build.sbt file.
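For example, the settings to verify might look like this (a sketch only: the exact keys live in the project's build.sbt, and the values below are assumptions matching a Spark 3.0.1 / Scala 2.12 target):

scalaVersion := "2.12.10"                        // Scala version matching your Spark image
crossScalaVersions := Seq("2.11.12", "2.12.10")  // Scala lines to build for (assumed)
sparkVersion := "3.0.1"                          // project-defined key for the Spark dependency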
sbt +package
cp target/scala-2.12/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar <your-code-path>/benchmark/libs
docker build -t spark-benchmark:s3.0.1-h3.3.0_v0.0.1 --build-arg SPARK_BASE_IMAGE=quay.io/guimou/spark-odh:s3.0.1-h3.3.0_v0.0.2 .
docker tag spark-benchmark:s3.0.1-h3.3.0_v0.0.1 quay.io/guimou/spark-benchmark:s3.0.1-h3.3.0_v0.0.1
docker push quay.io/guimou/spark-benchmark:s3.0.1-h3.3.0_v0.0.1
In the examples folder you will find different examples to create the datasets or run the benchmark. You can adapt them to your environment: data location, logs location,…
The following instructions use an S3 bucket for the data, as well as for the History Server, as described here.
We’ll create a bucket to hold the TPC-DS data using an ObjectBucketClaim (OBC).
Note: This OBC creates a bucket in the RGW from an OpenShift Data Foundation deployment. Adapt the instructions depending on your S3 provider.
From the test folder:
oc apply -f obc-tpcds-data.yaml
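For reference, the manifest is roughly of this shape (a minimal sketch: the claim name and the ODF RGW storage class name are assumptions, check the actual obc-tpcds-data.yaml for the real values):

apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: tpcds-data                               # claim name, assumed
spec:
  generateBucketName: tpcds-data                 # prefix for the generated bucket name
  storageClassName: ocs-storagecluster-ceph-rgw  # ODF RGW storage class, assumed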
To make it easier to retrieve all the data from the generation and benchmark runs, point your History Server to the same bucket.
Note: You must create a logs-dir folder in this bucket prior to launching the History Server (populate it with a hidden empty file like .s3keep).
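One way to create that folder, assuming you use the AWS CLI against your S3 endpoint (the bucket name and endpoint are placeholders):

# creates an empty logs-dir/.s3keep object, which makes the "folder" visible
aws s3api put-object --endpoint-url YOUR_S3_ENDPOINT --bucket YOUR_BUCKET --key logs-dir/.s3keep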
By default OpenShift places restrictions on the size of the pods you can launch, which may be an issue for intensive workloads. We must adjust what the Spark-Operator is allowed to do when launching the driver and the executors.
From the OpenShift UI, as a cluster-admin, go to Administration → LimitRanges and edit spark-operator-core-resource-limits according to your needs and the available resources. The parameters to change are the max for cpu and memory, for both Containers and Pods.
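For illustration, the relevant part of the LimitRange might look like this (a sketch; the values are assumptions to adapt to your cluster's resources):

apiVersion: v1
kind: LimitRange
metadata:
  name: spark-operator-core-resource-limits
spec:
  limits:
    - type: Container
      max:
        cpu: "8"        # raise to fit your largest driver/executor container
        memory: 32Gi
    - type: Pod
      max:
        cpu: "8"        # a pod must fit the sum of its containers
        memory: 32Gi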
From the examples folder, review and apply one of the files called tpcds-data-generation-SIZE.yaml. The SIZE in the filename indicates the size, in GB, of the synthetic dataset that will be created.
In the file, choose the number of executors you want to run, along with their sizing. You must also replace the placeholders with the values for your bucket name, access key and secret key.
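The sections to adapt typically look like this (a sketch only: the field names follow the Spark Operator's SparkApplication CRD, and all values and placeholders below are assumptions):

spec:
  sparkConf:
    "spark.hadoop.fs.s3a.endpoint": "YOUR_S3_ENDPOINT"    # placeholder
    "spark.hadoop.fs.s3a.access.key": "YOUR_ACCESS_KEY"   # placeholder
    "spark.hadoop.fs.s3a.secret.key": "YOUR_SECRET_KEY"   # placeholder
  executor:
    instances: 4    # number of executors
    cores: 2        # sizing per executor
    memory: 4g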
oc apply -f tpcds-data-generation-1G.yaml
From the examples folder, review and apply one of the files called tpcds-benchmark-SIZE.yaml. The SIZE in the filename indicates the size, in GB, of the dataset the benchmark will run against.
In the file, choose the number of executors you want to run, along with their sizing. You must also replace the placeholders with the values for your bucket name, access key and secret key.
oc apply -f tpcds-benchmark-1G.yaml
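Once submitted, you can follow the run; a couple of useful commands (the driver pod name is an assumption based on the usual <application-name>-driver convention):

oc get sparkapplications                  # status of the submitted application
oc logs -f tpcds-benchmark-1g-driver      # follow the driver logs (pod name assumed)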
Apart from the History Server logs, all the results are saved in YOUR_BUCKET/TPCDS-TEST-1G-RESULT.
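Assuming the AWS CLI is configured against your S3 endpoint, you can list them with (bucket name and endpoint are placeholders):

aws s3 ls --endpoint-url YOUR_S3_ENDPOINT --recursive s3://YOUR_BUCKET/TPCDS-TEST-1G-RESULT/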
Note: Adapt all those instructions for the different sizes of datasets.