add local spark ppl testing documentation and details
Signed-off-by: YANGDB <[email protected]>
YANG-DB committed Nov 13, 2024
1 parent dd9c0cf commit d832fdf
Showing 2 changed files with 162 additions and 0 deletions.
Binary file added docs/img/spark-ui.png
162 changes: 162 additions & 0 deletions docs/ppl-lang/local-spark-ppl-test-instruction.md
@@ -0,0 +1,162 @@
# Testing PPL using local Spark

## Produce the PPL artifact
First, build the Spark PPL artifact and publish it to the local Maven repository: `sbt clean sparkPPLCosmetic/publishM2`

The resulting artifact is placed in the local `.m2/` folder, for example:
`/Users/USER_NAME/.m2/repository/org/opensearch/opensearch-spark-ppl_XX/0.X.0-SNAPSHOT/opensearch-spark-ppl_XX-0.X.0-SNAPSHOT.jar`
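Since the Scala suffix and version in the path vary between builds, it can help to look the jar up rather than type the path by hand. The following is a minimal Python sketch, not part of the project tooling; the helper name `find_ppl_jars` is hypothetical, and it assumes only the directory layout shown in the example path above.

```python
from pathlib import Path


def find_ppl_jars(m2_root: Path) -> list[Path]:
    """Return PPL jars found under a local Maven folder, newest first."""
    base = m2_root / "repository" / "org" / "opensearch"
    # Layout assumed from the example path:
    #   opensearch-spark-ppl_XX/<version>/opensearch-spark-ppl_XX-<version>.jar
    # Note: this glob would also pick up any -sources or -javadoc jars.
    jars = base.glob("opensearch-spark-ppl_*/*/opensearch-spark-ppl_*.jar")
    return sorted(jars, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Print candidate jars from the current user's ~/.m2 folder.
    for jar in find_ppl_jars(Path.home() / ".m2"):
        print(jar)
```

The first path printed is the most recently built jar, which is usually the one to pass to `--jars`.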

## Download Spark 3.5.3
Download Spark 3.5.3 from the [official website](https://spark.apache.org/downloads.html) and install it locally.

## Start Spark with the plugin
Once Spark is installed, start `spark-sql` from the Spark installation directory with the generated PPL artifact:
```shell
bin/spark-sql --jars "/PATH_TO_ARTIFACT/opensearch-spark-ppl_XX-0.X.0-SNAPSHOT.jar" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions" \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog" \
--conf "spark.hadoop.hive.cli.print.header=true"

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore
Spark Web UI available at http://*.*.*.*:4040
Spark master: local[*], Application Id: local-1731523264660

spark-sql (default)>
```
The result is a spark-sql prompt: `spark-sql (default)> ...`

### Spark UI
Spark's web UI, served at the address printed in the startup log (port 4040 in the output above), shows the executed jobs and how they performed:

![Spark-UX](../img/spark-ui.png)


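### Configuring Hive partition mode
Hive's default strict mode requires at least one static partition value in every insert into a partitioned table. To keep the configuration of partitioned tables simple, switch to non-strict mode so that every partition column can be taken from the inserted rows: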

```shell
spark-sql (default)> SET hive.exec.dynamic.partition.mode = nonstrict;
```

---

# Testing PPL Commands

To test PPL commands from the spark-sql command line, create and populate the following table:

## emails table
```sql
CREATE TABLE emails (name STRING, age INT, email STRING, street_address STRING, year INT, month INT) PARTITIONED BY (year, month);
INSERT INTO emails (name, age, email, street_address, year, month) VALUES
  ('Alice', 30, 'alice@example.com', '123 Main St, Seattle', 2023, 4),
  ('Bob', 55, 'bob@test.org', '456 Elm St, Portland', 2023, 5),
  ('Charlie', 65, 'charlie@domain.net', '789 Pine St, San Francisco', 2023, 4),
  ('David', 19, 'david@anotherdomain.com', '101 Maple St, New York', 2023, 5),
  ('Eve', 21, 'eve@examples.com', '202 Oak St, Boston', 2023, 4),
  ('Frank', 76, 'frank@sample.org', '303 Cedar St, Austin', 2023, 5),
  ('Grace', 41, 'grace@demo.net', '404 Birch St, Chicago', 2023, 4),
  ('Hank', 32, 'hank@demonstration.com', '505 Spruce St, Miami', 2023, 5),
  ('Ivy', 9, 'ivy@examples.com', '606 Fir St, Denver', 2023, 4),
  ('Jack', 12, 'jack@sample.net', '707 Ash St, Seattle', 2023, 5);
```
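Hive-style partitioned tables keep each distinct combination of partition values in its own `column=value` subdirectory of the table location. As a rough illustration, independent of Spark, the Python sketch below derives the partition directories the ten rows would be expected to produce; the helper name `partition_dirs` is hypothetical, and the rows are reduced to their `(name, year, month)` values.

```python
# (name, year, month) taken from the INSERT statement above.
rows = [
    ("Alice", 2023, 4), ("Bob", 2023, 5), ("Charlie", 2023, 4),
    ("David", 2023, 5), ("Eve", 2023, 4), ("Frank", 2023, 5),
    ("Grace", 2023, 4), ("Hank", 2023, 5), ("Ivy", 2023, 4),
    ("Jack", 2023, 5),
]


def partition_dirs(rows):
    """Return the distinct Hive-style partition paths for the given rows."""
    return sorted({f"year={year}/month={month}" for _, year, month in rows})


print(partition_dirs(rows))  # ['year=2023/month=4', 'year=2023/month=5']
```

With dynamic partitioning in non-strict mode, both partitions are created by the single insert without naming either one explicitly.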

The following PPL commands can now be run to test functionality:

### Test `describe` command

```sql
describe emails;

col_name data_type comment
name string
age int
email string
street_address string
year int
month int
# Partition Information
# col_name data_type comment
year int
month int

# Detailed Table Information
Catalog spark_catalog
Database default
Table emails
Owner lioperry
Created Time Wed Nov 13 14:45:12 MST 2024
Last Access UNKNOWN
Created By Spark 3.5.3
Type MANAGED
Provider hive
Table Properties [transient_lastDdlTime=1731534312]
Location file:/Users/USER/tools/spark-3.5.3-bin-hadoop3/bin/spark-warehouse/emails
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties [serialization.format=1]
Partition Provider Catalog

Time taken: 0.128 seconds, Fetched 28 row(s)
```

### Test `grok` command
```sql
source=emails| grok email '.+@%{HOSTNAME:host}' | fields email, host;

email host
hank@demonstration.com demonstration.com
bob@test.org test.org
jack@sample.net sample.net
frank@sample.org sample.org
david@anotherdomain.com anotherdomain.com
grace@demo.net demo.net
alice@example.com example.com
ivy@examples.com examples.com
eve@examples.com examples.com
charlie@domain.net domain.net

Time taken: 0.626 seconds, Fetched 10 row(s)
```
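A grok pattern is essentially a regular expression in which `%{NAME:field}` tokens stand for predefined sub-patterns captured under a field name. The Python sketch below illustrates that expansion outside of Spark. It is not the PPL implementation: the `GROK_PATTERNS` table and the `grok_to_regex` helper are hypothetical, and the `HOSTNAME` definition is a simplified stand-in for the stricter one that grok libraries ship.

```python
import re

# Simplified stand-in for grok's HOSTNAME pattern (assumption, not the library definition).
GROK_PATTERNS = {
    "HOSTNAME": r"[0-9A-Za-z][0-9A-Za-z-]*(?:\.[0-9A-Za-z][0-9A-Za-z-]*)*",
}


def grok_to_regex(pattern: str) -> str:
    """Expand each %{NAME:field} token into a named capture group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{GROK_PATTERNS[name]})"
    return re.sub(r"%\{(\w+):(\w+)\}", expand, pattern)


regex = re.compile(grok_to_regex(".+@%{HOSTNAME:host}"))
host = regex.match("alice@example.com").group("host")
print(host)  # example.com
```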

### Test `parse` command
```sql
source=emails| parse email '.+@(?<host>.+)' | where age > 45 | sort - age | fields age, email, host;

age email host
76 frank@sample.org sample.org
65 charlie@domain.net domain.net
55 bob@test.org test.org

Time taken: 1.555 seconds, Fetched 3 row(s)
```
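To cross-check the result above without Spark, the same pipeline (extract the host with a regular expression, keep ages above 45, sort by age descending) can be sketched in plain Python. This is only an illustration under the assumption that the table holds the ten rows inserted earlier; note that Python writes named groups as `(?P<host>...)`, whereas the PPL query uses the Java-style `(?<host>...)`.

```python
import re

# (age, email) pairs from the emails table.
rows = [
    (30, "alice@example.com"), (55, "bob@test.org"), (65, "charlie@domain.net"),
    (19, "david@anotherdomain.com"), (21, "eve@examples.com"), (76, "frank@sample.org"),
    (41, "grace@demo.net"), (32, "hank@demonstration.com"), (9, "ivy@examples.com"),
    (12, "jack@sample.net"),
]

# Same expression as in the PPL query, in Python's named-group syntax.
HOST_RE = re.compile(r".+@(?P<host>.+)")

result = sorted(
    ((age, email, HOST_RE.match(email).group("host")) for age, email in rows if age > 45),
    key=lambda row: row[0],
    reverse=True,  # mirrors `sort - age`
)

for age, email, host in result:
    print(age, email, host)
```

The three printed rows (Frank, Charlie, Bob) match the output of the PPL query.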

### Test `fieldsummary` command

```sql
source=emails| fieldsummary includefields=age, email;

Field COUNT DISTINCT MIN MAX AVG MEAN STDDEV Nulls TYPEOF
age 10 10 9 76 36.0 36.0 22.847319317591726 0 int
email 10 10 alice@example.com jack@sample.net NULL NULL NULL 0 string

Time taken: 1.535 seconds, Fetched 2 row(s)
```
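The numeric columns for `age` can be verified by hand. The Python sketch below recomputes them with the standard library, assuming the ten ages inserted earlier; the reported `STDDEV` value is consistent with the sample standard deviation (divisor n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# Ages from the emails table.
ages = [30, 55, 65, 19, 21, 76, 41, 32, 9, 12]

summary = {
    "COUNT": len(ages),
    "DISTINCT": len(set(ages)),
    "MIN": min(ages),
    "MAX": max(ages),
    "AVG": statistics.fmean(ages),
    "STDDEV": statistics.stdev(ages),  # sample standard deviation (n - 1)
}

for key, value in summary.items():
    print(key, value)
```

Here the squared deviations from the mean of 36 sum to 4698, so the sample variance is 4698 / 9 = 522 and the standard deviation is its square root, about 22.847.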

### Test `trendline` command

```sql
source=emails | sort - age | trendline sma(2, age);

name age email street_address year month age_trendline
Frank 76 frank@sample.org 303 Cedar St, Austin 2023 5 NULL
Charlie 65 charlie@domain.net 789 Pine St, San Francisco 2023 4 70.5
Bob 55 bob@test.org 456 Elm St, Portland 2023 5 60.0
Grace 41 grace@demo.net 404 Birch St, Chicago 2023 4 48.0
Hank 32 hank@demonstration.com 505 Spruce St, Miami 2023 5 36.5
Alice 30 alice@example.com 123 Main St, Seattle 2023 4 31.0
Eve 21 eve@examples.com 202 Oak St, Boston 2023 4 25.5
David 19 david@anotherdomain.com 101 Maple St, New York 2023 5 20.0
Jack 12 jack@sample.net 707 Ash St, Seattle 2023 5 15.5
Ivy 9 ivy@examples.com 606 Fir St, Denver 2023 4 10.5

Time taken: 1.048 seconds, Fetched 10 row(s)
```
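The `age_trendline` column is a simple moving average over a window of two rows: each value is the mean of the current and the previous age in sort order, and the first row has no value because the window is not yet full. A minimal Python sketch of that calculation, assuming the ten ages inserted earlier (the `sma` helper is illustrative, not the PPL implementation):

```python
def sma(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Ages sorted descending, mirroring `sort - age`.
ages = sorted([30, 55, 65, 19, 21, 76, 41, 32, 9, 12], reverse=True)
trend = sma(ages, 2)
print(trend)
# [None, 70.5, 60.0, 48.0, 36.5, 31.0, 25.5, 20.0, 15.5, 10.5]
```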


---
