update spark-docker example with iceberg tables
update documentation

Signed-off-by: YANGDB <[email protected]>
YANG-DB committed Dec 22, 2024
1 parent 20ef890 commit 781ec9e
Showing 8 changed files with 103 additions and 46 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -19,7 +19,7 @@ Please refer to the [Flint Index Reference Manual](./docs/index.md) for more inf

* For additional details on Spark PPL commands project, see [PPL Project](https://github.com/orgs/opensearch-project/projects/214/views/2)

-* Experiment ppl queries on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+* Experiment ppl queries on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

## Prerequisites

@@ -88,7 +88,7 @@ bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPS
```

### PPL Run queries on a local spark cluster
-See ppl usage sample on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+See ppl usage sample on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

### Running integration tests on a local spark cluster
See integration test documentation [Docker Integration Tests](integ-test/script/README.md)
2 changes: 1 addition & 1 deletion docker/apache-spark-sample/.env
@@ -1,4 +1,4 @@
MASTER_UI_PORT=8080
MASTER_PORT=7077
UI_PORT=4040
-PPL_JAR=../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
26 changes: 26 additions & 0 deletions docker/apache-spark-sample/Dockerfile
@@ -0,0 +1,26 @@
FROM bitnami/spark:3.5.3

# Install wget
USER root
RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

# Define the Iceberg version and Maven repository URL
ENV ICEBERG_VERSION=1.5.0
ENV MAVEN_REPO=https://repo1.maven.org/maven2

# Download the Iceberg runtime JAR
RUN wget $MAVEN_REPO/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/$ICEBERG_VERSION/iceberg-spark-runtime-3.5_2.12-$ICEBERG_VERSION.jar \
-O /opt/bitnami/spark/jars/iceberg-spark-runtime-3.5.jar

# Optional: Add configuration files
COPY spark-defaults.conf /opt/bitnami/spark/conf/

# Set up environment variables for Spark
ENV SPARK_MODE=master
ENV SPARK_RPC_AUTHENTICATION_ENABLED=no
ENV SPARK_RPC_ENCRYPTION_ENABLED=no
ENV SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
ENV SPARK_SSL_ENABLED=no

# Switch back to non-root user for security
USER 1001
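
A minimal usage sketch (assuming the commands are run from `docker/apache-spark-sample`, where this Dockerfile and the compose file live):

```shell
# Build the customized image (Iceberg runtime jar baked in) and start the cluster
docker compose build
docker compose up -d
```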
39 changes: 39 additions & 0 deletions docker/apache-spark-sample/README.md
@@ -0,0 +1,39 @@
# Sanity Test OpenSearch Spark PPL
This document shows how to locally test OpenSearch PPL commands on top of Spark using docker-compose.

See instructions for running docker-compose [here](../../docs/spark-docker.md)

Once the docker services are running, [connect to the spark-sql shell](../../docs/local-spark-ppl-test-instruction.md#running-spark-shell).

In the spark-sql shell, [run the create table statements](../../docs/local-spark-ppl-test-instruction.md#testing-ppl-commands).

PPL commands can then [run](../../docs/local-spark-ppl-test-instruction.md#test-grok--top-commands-combination) on top of the table just created.
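
For example, a minimal sketch of opening the shell (assuming the default compose project name, which yields the container `apache-spark-sample-spark-1`):

```shell
docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql
```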

### Using Iceberg Tables
The following example uses an [Apache Iceberg](https://iceberg.apache.org/) table:
```sql
CREATE TABLE iceberg_table (
id INT,
name STRING,
age INT,
city STRING
)
USING iceberg
PARTITIONED BY (city)
LOCATION 'file:/tmp/iceberg-tables/default/iceberg_table';

INSERT INTO iceberg_table VALUES
(1, 'Alice', 30, 'New York'),
(2, 'Bob', 25, 'San Francisco'),
(3, 'Charlie', 35, 'New York'),
(4, 'David', 40, 'Chicago'),
(5, 'Eve', 28, 'San Francisco');
```

### PPL queries
```sql
source=`default`.`iceberg_table`;
source=`default`.`iceberg_table` | where age > 30 | fields id, name, age, city | sort - age;
source=`default`.`iceberg_table` | where age > 30 | stats count() by city;
source=`default`.`iceberg_table` | stats avg(age) by city;
```
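
For orientation, the last PPL query above corresponds roughly to the following SQL aggregation (a sketch; PPL result formatting may differ):

```sql
-- Average age per city, equivalent to: source=... | stats avg(age) by city
SELECT city, AVG(age)
FROM `default`.`iceberg_table`
GROUP BY city;
```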
8 changes: 6 additions & 2 deletions docker/apache-spark-sample/docker-compose.yml
@@ -1,6 +1,8 @@
services:
  spark:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
    ports:
      - "${MASTER_UI_PORT:-8080}:8080"
      - "${MASTER_PORT:-7077}:7077"
@@ -21,7 +23,9 @@ services:
        target: /opt/bitnami/spark/jars/ppl-spark-integration.jar

  spark-worker:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
9 changes: 7 additions & 2 deletions docker/apache-spark-sample/spark-defaults.conf
@@ -25,5 +25,10 @@
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
-spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions
-spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
+spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog, org.apache.iceberg.spark.SparkCatalog
+
+# Enable Iceberg catalog
+spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.spark_catalog.type hadoop
+spark.sql.catalog.spark_catalog.warehouse file:/tmp/iceberg-tables
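
For a local (non-Docker) run, a sketch of passing the equivalent settings on the spark-sql command line, assuming the PPL assembly path from the build step and the Iceberg runtime coordinates used in the Dockerfile:

```shell
# Launch spark-sql with the PPL extension jar and a Hadoop-type Iceberg catalog
# (the jar path is a placeholder for wherever the assembly was built)
bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0" \
  --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.spark_catalog.type=hadoop" \
  --conf "spark.sql.catalog.spark_catalog.warehouse=file:/tmp/iceberg-tables"
```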
57 changes: 20 additions & 37 deletions docs/ppl-lang/local-spark-ppl-test-instruction.md → docs/local-spark-ppl-test-instruction.md
@@ -1,55 +1,38 @@
# Testing PPL using local Spark

-## Produce the PPL artifact
-The first step would be to produce the spark-ppl artifact: `sbt clean sparkPPLCosmetic/assembly`
-
-The resulting artifact would be located in the project's build directory:
-```sql
-[info] Built: ./opensearch-spark/sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar
+Build the Flint and PPL extensions for Spark:
```
+sbt clean
+sbt sparkSqlApplicationCosmetic/assembly sparkPPLCosmetic/assembly
+```
-## Downloading spark 3.5.3 version
-Download spark from the [official website](https://spark.apache.org/downloads.html) and install locally.
-
-## Start Spark with the plugin
-Once installed, run spark with the generated PPL artifact:
-```shell
-bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
---conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions" \
---conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
---conf "spark.hadoop.hive.cli.print.header=true"
-
-WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-Setting default log level to "WARN".
-To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
-WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
-WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
-WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore
-Spark Web UI available at http://*.*.*.*:4040
-Spark master: local[*], Application Id: local-1731523264660
-
-spark-sql (default)>
-```
+### Using docker compose to run spark local cluster
+
+Next update the [`.env`](../../docker/apache-spark-sample/.env) file to match the `PPL_JAR` location:
+```
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
+```
The result would be a spark-sql prompt: `spark-sql (default)> ...`

-### Spark UI Html
-One can also explore spark's UI portal to examine the execution jobs and how they are performing:
-
-![Spark-UX](../img/spark-ui.png)
+Next start the Docker containers that will be used for the tests, in the directory `docker/integ-test`:
+```shell
+docker compose up -d
+```

+## Running Spark Shell

-### Configuring hive partition mode
-For simpler configuration of partitioned tables, use the following non-strict mode:
-```shell
-spark-sql (default)> SET hive.exec.dynamic.partition.mode = nonstrict;
+You can run `spark-sql` on the master node:
+```
+docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql
```

----
-
-# Testing PPL Commands
-In order to test ppl commands using the spark-sql command line - create and populate the following set of tables:
+## Testing PPL Commands
+
+Within the Spark Shell, you can submit queries, including PPL queries.
+
+We will create and populate the following set of tables:

## emails table
```sql
4 changes: 2 additions & 2 deletions docs/spark-docker.md
@@ -1,7 +1,7 @@
# Running Queries with Apache Spark in Docker

-There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark). These
-can be modified to be able to include the OpenSearch Spark PPL extension. With the OpenSearch
+There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark).
+These can be modified to be able to include the OpenSearch Spark PPL extension. With the OpenSearch
Spark PPL extension, the docker image can be used to test PPL commands.

The Bitnami Apache Spark image can be used to run a Spark cluster and also to run
