update spark-docker example with iceberg tables
update documentation

Signed-off-by: YANGDB <[email protected]>
YANG-DB committed Dec 22, 2024
1 parent 20ef890 commit 781ec9e
Showing 8 changed files with 103 additions and 46 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -19,7 +19,7 @@ Please refer to the [Flint Index Reference Manual](./docs/index.md) for more inf

* For additional details on Spark PPL commands project, see [PPL Project](https://github.com/orgs/opensearch-project/projects/214/views/2)

-* Experiment ppl queries on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+* Experiment ppl queries on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

## Prerequisites

@@ -88,7 +88,7 @@ bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPS
```

### PPL Run queries on a local spark cluster
-See ppl usage sample on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+See ppl usage sample on local spark cluster [PPL on local spark ](docs/local-spark-ppl-test-instruction.md)

### Running integration tests on a local spark cluster
See integration test documentation [Docker Integration Tests](integ-test/script/README.md)
2 changes: 1 addition & 1 deletion docker/apache-spark-sample/.env
@@ -1,4 +1,4 @@
MASTER_UI_PORT=8080
MASTER_PORT=7077
UI_PORT=4040
-PPL_JAR=../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
26 changes: 26 additions & 0 deletions docker/apache-spark-sample/Dockerfile
@@ -0,0 +1,26 @@
FROM bitnami/spark:3.5.3

# Install wget
USER root
RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

# Define the Iceberg version and Maven repository URL
ENV ICEBERG_VERSION=1.5.0
ENV MAVEN_REPO=https://repo1.maven.org/maven2

# Download the Iceberg runtime JAR
RUN wget $MAVEN_REPO/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/$ICEBERG_VERSION/iceberg-spark-runtime-3.5_2.12-$ICEBERG_VERSION.jar \
-O /opt/bitnami/spark/jars/iceberg-spark-runtime-3.5.jar

# Optional: Add configuration files
COPY spark-defaults.conf /opt/bitnami/spark/conf/

# Set up environment variables for Spark
ENV SPARK_MODE=master
ENV SPARK_RPC_AUTHENTICATION_ENABLED=no
ENV SPARK_RPC_ENCRYPTION_ENABLED=no
ENV SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
ENV SPARK_SSL_ENABLED=no

# Switch back to non-root user for security
USER 1001
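
A minimal usage sketch (assuming the commands are run from `docker/apache-spark-sample`, where this Dockerfile and the compose file live):

```shell
# Build the customized image (Iceberg runtime jar baked in) and start the cluster
docker compose build
docker compose up -d
```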
39 changes: 39 additions & 0 deletions docker/apache-spark-sample/README.md
@@ -0,0 +1,39 @@
# Sanity Test OpenSearch Spark PPL
This document shows how to locally test OpenSearch PPL commands on top of Spark using docker-compose.

See instructions for running docker-compose [here](../../docs/spark-docker.md)

Once the docker services are running, [connect to the spark-sql shell](../../docs/local-spark-ppl-test-instruction.md#running-spark-shell).

In the spark-sql shell, [run the create table statements](../../docs/local-spark-ppl-test-instruction.md#testing-ppl-commands).

PPL commands can then [run](../../docs/local-spark-ppl-test-instruction.md#test-grok--top-commands-combination) on top of the table just created.
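
For example, a minimal sketch of opening the shell (assuming the default compose project name, which yields the container `apache-spark-sample-spark-1`):

```shell
docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql
```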

### Using Iceberg Tables
The following example uses an [Apache Iceberg](https://iceberg.apache.org/) table:
```sql
CREATE TABLE iceberg_table (
id INT,
name STRING,
age INT,
city STRING
)
USING iceberg
PARTITIONED BY (city)
LOCATION 'file:/tmp/iceberg-tables/default/iceberg_table';

INSERT INTO iceberg_table VALUES
(1, 'Alice', 30, 'New York'),
(2, 'Bob', 25, 'San Francisco'),
(3, 'Charlie', 35, 'New York'),
(4, 'David', 40, 'Chicago'),
(5, 'Eve', 28, 'San Francisco');
```

### PPL queries
```sql
source=`default`.`iceberg_table`;
source=`default`.`iceberg_table` | where age > 30 | fields id, name, age, city | sort - age;
source=`default`.`iceberg_table` | where age > 30 | stats count() by city;
source=`default`.`iceberg_table` | stats avg(age) by city;
```
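
For orientation, the last PPL query above corresponds roughly to the following SQL aggregation (a sketch; PPL result formatting may differ):

```sql
-- Average age per city, equivalent to: source=... | stats avg(age) by city
SELECT city, AVG(age)
FROM `default`.`iceberg_table`
GROUP BY city;
```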
8 changes: 6 additions & 2 deletions docker/apache-spark-sample/docker-compose.yml
@@ -1,6 +1,8 @@
services:
  spark:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
    ports:
      - "${MASTER_UI_PORT:-8080}:8080"
      - "${MASTER_PORT:-7077}:7077"
@@ -21,7 +23,9 @@ services:
        target: /opt/bitnami/spark/jars/ppl-spark-integration.jar

  spark-worker:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
9 changes: 7 additions & 2 deletions docker/apache-spark-sample/spark-defaults.conf
@@ -25,5 +25,10 @@
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
-spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions
-spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
+spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog, org.apache.iceberg.spark.SparkCatalog
+
+# Enable Iceberg catalog
+spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.spark_catalog.type hadoop
+spark.sql.catalog.spark_catalog.warehouse file:/tmp/iceberg-tables
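
For a local (non-Docker) run, a sketch of passing the equivalent settings on the spark-sql command line, assuming the PPL assembly path from the build step and the Iceberg runtime coordinates used in the Dockerfile:

```shell
# Launch spark-sql with the PPL extension jar and a Hadoop-type Iceberg catalog
# (the jar path is a placeholder for wherever the assembly was built)
bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0" \
  --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.spark_catalog.type=hadoop" \
  --conf "spark.sql.catalog.spark_catalog.warehouse=file:/tmp/iceberg-tables"
```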
57 changes: 20 additions & 37 deletions docs/ppl-lang/local-spark-ppl-test-instruction.md → docs/local-spark-ppl-test-instruction.md
@@ -1,55 +1,38 @@
# Testing PPL using local Spark

-## Produce the PPL artifact
-The first step would be to produce the spark-ppl artifact: `sbt clean sparkPPLCosmetic/assembly`
-
-The resulting artifact would be located in the project's build directory:
-```sql
-[info] Built: ./opensearch-spark/sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar
+Build the Flint and PPL extensions for Spark:
```
+sbt clean
+sbt sparkSqlApplicationCosmetic/assembly sparkPPLCosmetic/assembly
+```
-## Downloading spark 3.5.3 version
-Download spark from the [official website](https://spark.apache.org/downloads.html) and install locally.
-
-## Start Spark with the plugin
-Once installed, run spark with the generated PPL artifact:
-```shell
-bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
---conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions" \
---conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
---conf "spark.hadoop.hive.cli.print.header=true"
-
-WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-Setting default log level to "WARN".
-To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
-WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
-WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
-WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore
-Spark Web UI available at http://*.*.*.*:4040
-Spark master: local[*], Application Id: local-1731523264660
-
-spark-sql (default)>
-```
+### Using docker compose to run spark local cluster
+
+Next update the [`.env`](../../docker/apache-spark-sample/.env) file to match the `PPL_JAR` location:
+```
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
+```
The result would be a spark-sql prompt: `spark-sql (default)> ...`

-### Spark UI Html
-One can also explore spark's UI portal to examine the execution jobs and how they are performing:
-
-![Spark-UX](../img/spark-ui.png)
+Next start the Docker containers that will be used for the tests, in the directory `docker/integ-test`:
+```shell
+docker compose up -d
+```

+## Running Spark Shell

-### Configuring hive partition mode
-For simpler configuration of partitioned tables, use the following non-strict mode:
-```shell
-spark-sql (default)> SET hive.exec.dynamic.partition.mode = nonstrict;
+You can run `spark-sql` on the master node:
+```
+docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql
```

----
-
-# Testing PPL Commands
-In order to test ppl commands using the spark-sql command line - create and populate the following set of tables:
+## Testing PPL Commands
+
+Within the Spark Shell, you can submit queries, including PPL queries.
+
+We will create and populate the following set of tables:

## emails table
```sql
4 changes: 2 additions & 2 deletions docs/spark-docker.md
@@ -1,7 +1,7 @@
# Running Queries with Apache Spark in Docker

-There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark). These
-can be modified to be able to include the OpenSearch Spark PPL extension. With the OpenSearch
+There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark).
+These can be modified to be able to include the OpenSearch Spark PPL extension. With the OpenSearch
Spark PPL extension, the docker image can be used to test PPL commands.

The Bitnami Apache Spark image can be used to run a Spark cluster and also to run
