diff --git a/README.md b/README.md
index 6732db3af..31af35d63 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ Please refer to the [Flint Index Reference Manual](./docs/index.md) for more inf
 
 * For additional details on Spark PPL commands project, see [PPL Project](https://github.com/orgs/opensearch-project/projects/214/views/2)
 
-* Experiment ppl queries on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+* Experiment with PPL queries on a local Spark cluster: [PPL on local Spark](docs/local-spark-ppl-test-instruction.md)
 
 ## Prerequisites
 
@@ -88,7 +88,7 @@ bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPS
 ```
 
 ### PPL Run queries on a local spark cluster
-See ppl usage sample on local spark cluster [PPL on local spark ](docs/ppl-lang/local-spark-ppl-test-instruction.md)
+See the PPL usage sample on a local Spark cluster: [PPL on local Spark](docs/local-spark-ppl-test-instruction.md)
 
 ### Running integration tests on a local spark cluster
 See integration test documentation [Docker Integration Tests](integ-test/script/README.md)
diff --git a/docker/apache-spark-sample/.env b/docker/apache-spark-sample/.env
index a047df5ba..403d4a21e 100644
--- a/docker/apache-spark-sample/.env
+++ b/docker/apache-spark-sample/.env
@@ -1,4 +1,4 @@
 MASTER_UI_PORT=8080
 MASTER_PORT=7077
 UI_PORT=4040
-PPL_JAR=../../ppl-spark-integration/target/scala-2.12/ppl-spark-integration-assembly-0.7.0-SNAPSHOT.jar
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
diff --git a/docker/apache-spark-sample/Dockerfile b/docker/apache-spark-sample/Dockerfile
new file mode 100644
index 000000000..0f5f49a2e
--- /dev/null
+++ b/docker/apache-spark-sample/Dockerfile
@@ -0,0 +1,26 @@
+FROM bitnami/spark:3.5.3
+
+# Install wget
+USER root
+RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*
+
+# Define the Iceberg version and Maven repository URL
+ENV ICEBERG_VERSION=1.5.0
+ENV MAVEN_REPO=https://repo1.maven.org/maven2
+
+# Download the Iceberg runtime JAR into Spark's jars directory
+RUN wget $MAVEN_REPO/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/$ICEBERG_VERSION/iceberg-spark-runtime-3.5_2.12-$ICEBERG_VERSION.jar \
+    -O /opt/bitnami/spark/jars/iceberg-spark-runtime-3.5.jar
+
+# Optional: Add configuration files
+COPY spark-defaults.conf /opt/bitnami/spark/conf/
+
+# Set up environment variables for Spark
+ENV SPARK_MODE=master
+ENV SPARK_RPC_AUTHENTICATION_ENABLED=no
+ENV SPARK_RPC_ENCRYPTION_ENABLED=no
+ENV SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
+ENV SPARK_SSL_ENABLED=no
+
+# Switch back to non-root user for security
+USER 1001
\ No newline at end of file
diff --git a/docker/apache-spark-sample/README.md b/docker/apache-spark-sample/README.md
new file mode 100644
index 000000000..2e352ac63
--- /dev/null
+++ b/docker/apache-spark-sample/README.md
@@ -0,0 +1,39 @@
+# Sanity Test OpenSearch Spark PPL
+This document shows how to locally test OpenSearch PPL commands on top of Spark using Docker Compose.
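+
+In short, the flow looks like this (a minimal sketch; the linked instructions below cover the details):
+```shell
+# from the docker/apache-spark-sample directory
+docker compose up -d   # builds the image and starts the master and worker
+docker compose ps      # both services should report a running state
+```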
+
+See the instructions for running Docker Compose [here](../../docs/spark-docker.md)
+
+Once the Docker services are running, [connect to spark-sql](../../docs/local-spark-ppl-test-instruction.md#running-spark-shell)
+
+In the spark-sql shell, [run the following CREATE TABLE statements](../../docs/local-spark-ppl-test-instruction.md#testing-ppl-commands)
+
+PPL commands can now be [run](../../docs/local-spark-ppl-test-instruction.md#test-grok--top-commands-combination) against the newly created table
+
+### Using Iceberg Tables
+The following example uses an [Apache Iceberg](https://iceberg.apache.org/) table:
+```sql
+CREATE TABLE iceberg_table (
+    id INT,
+    name STRING,
+    age INT,
+    city STRING
+)
+USING iceberg
+PARTITIONED BY (city)
+LOCATION 'file:/tmp/iceberg-tables/default/iceberg_table';
+
+INSERT INTO iceberg_table VALUES
+    (1, 'Alice', 30, 'New York'),
+    (2, 'Bob', 25, 'San Francisco'),
+    (3, 'Charlie', 35, 'New York'),
+    (4, 'David', 40, 'Chicago'),
+    (5, 'Eve', 28, 'San Francisco');
+```
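+
+Before moving on to PPL, you can sanity-check the inserted rows with plain Spark SQL (a sketch; it assumes the default Compose project name, so the master container is `apache-spark-sample-spark-1`):
+```shell
+# expect 5 rows spread across 3 cities
+docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql \
+  -e "SELECT city, COUNT(*) FROM default.iceberg_table GROUP BY city"
+```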
+
+### PPL queries
+```sql
+    source=`default`.`iceberg_table`;
+    source=`default`.`iceberg_table` | where age > 30 | fields id, name, age, city | sort - age;
+    source=`default`.`iceberg_table` | where age > 30 | stats count() by city;
+    source=`default`.`iceberg_table` | stats avg(age) by city;
+```
\ No newline at end of file
diff --git a/docker/apache-spark-sample/docker-compose.yml b/docker/apache-spark-sample/docker-compose.yml
index df2da6d52..34979f469 100644
--- a/docker/apache-spark-sample/docker-compose.yml
+++ b/docker/apache-spark-sample/docker-compose.yml
@@ -1,6 +1,8 @@
 services:
   spark:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
     ports:
       - "${MASTER_UI_PORT:-8080}:8080"
       - "${MASTER_PORT:-7077}:7077"
@@ -21,7 +23,9 @@
         target: /opt/bitnami/spark/jars/ppl-spark-integration.jar
 
   spark-worker:
-    image: bitnami/spark:3.5.3
+    build:
+      context: .
+      dockerfile: Dockerfile
     environment:
       - SPARK_MODE=worker
       - SPARK_MASTER_URL=spark://spark:7077
diff --git a/docker/apache-spark-sample/spark-defaults.conf b/docker/apache-spark-sample/spark-defaults.conf
index 47fdaae03..e5383c000 100644
--- a/docker/apache-spark-sample/spark-defaults.conf
+++ b/docker/apache-spark-sample/spark-defaults.conf
@@ -25,5 +25,10 @@
 # spark.serializer org.apache.spark.serializer.KryoSerializer
 # spark.driver.memory 5g
 # spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
-spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions
-spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
+spark.sql.extensions org.opensearch.flint.spark.FlintPPLSparkExtensions, org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
+spark.sql.catalog.dev org.apache.spark.opensearch.catalog.OpenSearchCatalog
+
+# Enable the Iceberg catalog (Hadoop type, local filesystem warehouse)
+spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkCatalog
+spark.sql.catalog.spark_catalog.type hadoop
+spark.sql.catalog.spark_catalog.warehouse file:/tmp/iceberg-tables
diff --git a/docs/ppl-lang/local-spark-ppl-test-instruction.md b/docs/local-spark-ppl-test-instruction.md
similarity index 83%
rename from docs/ppl-lang/local-spark-ppl-test-instruction.md
rename to docs/local-spark-ppl-test-instruction.md
index 537ac043b..36ba8f3d7 100644
--- a/docs/ppl-lang/local-spark-ppl-test-instruction.md
+++ b/docs/local-spark-ppl-test-instruction.md
@@ -1,55 +1,38 @@
 # Testing PPL using local Spark
 ## Produce the PPL artifact
-The first step would be to produce the spark-ppl artifact: `sbt clean sparkPPLCosmetic/assembly`
-
-The resulting artifact would be located in the project's build directory:
-```sql
-[info] Built: ./opensearch-spark/sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar
+Build the Flint and PPL extensions for Spark:
+```shell
+sbt clean
+sbt sparkSqlApplicationCosmetic/assembly sparkPPLCosmetic/assembly
 ```
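+
+A quick way to confirm the build succeeded is to list the assembly jar (a sketch, assuming the 0.7.0-SNAPSHOT version used elsewhere in this repository):
+```shell
+ls sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
+```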
-## Downloading spark 3.5.3 version
-Download spark from the [official website](https://spark.apache.org/downloads.html) and install locally.
-## Start Spark with the plugin
-Once installed, run spark with the generated PPL artifact:
-```shell
-bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl-assembly-x.y.z-SNAPSHOT.jar" \
---conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions" \
---conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
---conf "spark.hadoop.hive.cli.print.header=true"
-
-WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-Setting default log level to "WARN".
-To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
-WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
-WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
-WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
-WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore
-Spark Web UI available at http://*.*.*.*:4040
-Spark master: local[*], Application Id: local-1731523264660
-
-spark-sql (default)>
+### Using Docker Compose to run a local Spark cluster
+
+Next, update the [`.env`](../../docker/apache-spark-sample/.env) file to match the `PPL_JAR` location:
+```
+PPL_JAR=../../sparkPPLCosmetic/target/scala-2.12/opensearch-spark-ppl-assembly-0.7.0-SNAPSHOT.jar
 ```
-The resulting would be a spark-sql prompt: `spark-sql (default)> ...`
 
-### Spark UI Html
-One can also explore spark's UI portal to examine the execution jobs and how they are performing:
+Next, start the Docker containers that will be used for the tests, from the `docker/apache-spark-sample` directory:
-![Spark-UX](../img/spark-ui.png)
+```shell
+docker compose up -d
+```
+## Running Spark Shell
-### Configuring hive partition mode
-For simpler configuration of partitioned tables, use the following non-strict mode:
+You can run `spark-sql` on the master node:
-```shell
-spark-sql (default)> SET hive.exec.dynamic.partition.mode = nonstrict;
+```shell
+docker exec -it apache-spark-sample-spark-1 /opt/bitnami/spark/bin/spark-sql
 ```
----
+## Testing PPL Commands
-# Testing PPL Commands
+Within the spark-sql shell, you can submit queries, including PPL queries.
-In order to test ppl commands using the spark-sql command line - create and populate the following set of tables:
+We will create and populate the following set of tables:
 
 ## emails table
 ```sql
diff --git a/docs/spark-docker.md b/docs/spark-docker.md
index d1200e2b3..b3244dfee 100644
--- a/docs/spark-docker.md
+++ b/docs/spark-docker.md
@@ -1,7 +1,7 @@
 # Running Queries with Apache Spark in Docker
 
-There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark). These
-can be modified to be able to include the OpenSearch Spark PPL extension. With the OpenSearch
+There are [Bitnami Apache Spark docker images](https://hub.docker.com/r/bitnami/spark).
+These can be modified to include the OpenSearch Spark PPL extension. With the OpenSearch
 Spark PPL extension, the docker image can be used to test PPL commands.
 
 The Bitnami Apache Spark image can be used to run a Spark cluster and also to run