
Merge branch 'main' into geoip
kenricky-bitquill committed Nov 19, 2024
2 parents 4902534 + 7b6e485 commit 5c13d8e
Showing 129 changed files with 7,550 additions and 1,027 deletions.
11 changes: 11 additions & 0 deletions .github/workflows/test-and-build-workflow.yml
@@ -25,5 +25,16 @@ jobs:
- name: Style check
run: sbt scalafmtCheckAll

- name: Set SBT_OPTS
# Needed to extend the JVM memory size to avoid OutOfMemoryError for HTML test report
run: echo "SBT_OPTS=-Xmx2G" >> $GITHUB_ENV

- name: Integ Test
run: sbt integtest/integration

- name: Upload test report
if: always() # Ensures the artifact is saved even if tests fail
uses: actions/upload-artifact@v3
with:
name: test-reports
path: target/test-reports # Adjust this path if necessary
12 changes: 12 additions & 0 deletions DEVELOPER_GUIDE.md
@@ -11,6 +11,11 @@ To execute the unit tests, run the following command:
```
sbt test
```
To run a specific unit test in SBT, use the `testOnly` command with the fully qualified name of the test class:
```
sbt "; project pplSparkIntegration; test:testOnly org.opensearch.flint.spark.ppl.PPLLogicalPlanTrendlineCommandTranslatorTestSuite"
```
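If the fully qualified class name is not at hand, `testOnly` also accepts wildcard patterns; a shorter form that should be equivalent as long as the suite name is unambiguous:
```
sbt "; project pplSparkIntegration; test:testOnly *PPLLogicalPlanTrendlineCommandTranslatorTestSuite"
```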


## Integration Test
The integration tests are defined in the `integration` directory of the project. They automatically trigger the unit tests first and run only if all unit tests pass. To run the integration tests for the project, run the following command:
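Presumably this is the same invocation used by the CI workflow above, i.e.:
```
sbt integtest/integration
```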
@@ -23,6 +28,13 @@ If you get integration test failures with error message "Previous attempts to fi
3. Run `sudo ln -s $HOME/.docker/desktop/docker.sock /var/run/docker.sock` or `sudo ln -s $HOME/.docker/run/docker.sock /var/run/docker.sock`
4. If you use Docker Desktop, as an alternative to step 3, enable "Allow the default Docker socket to be used (requires password)" in Docker Desktop's advanced settings.

Running only a selected set of integration test suites is possible with the following command:
```
sbt "; project integtest; it:testOnly org.opensearch.flint.spark.ppl.FlintSparkPPLTrendlineITSuite"
```
This command runs only the specified test suite within the integtest submodule.
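To narrow the run further to a single test case inside that suite, ScalaTest arguments can be passed after `--`; the name substring below is only illustrative:
```
sbt '; project integtest; it:testOnly org.opensearch.flint.spark.ppl.FlintSparkPPLTrendlineITSuite -- -z "trendline"'
```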


### AWS Integration Test
The `aws-integration` folder contains tests for cloud service providers. For instance, to test against an AWS OpenSearch domain, configure the following settings. The client will use the default credential provider chain to access the AWS OpenSearch domain.
```
15 changes: 12 additions & 3 deletions README.md
@@ -19,6 +19,8 @@ Please refer to the [Flint Index Reference Manual](./docs/index.md) for more inf

* For additional details on Spark PPL commands project, see [PPL Project](https://github.com/orgs/opensearch-project/projects/214/views/2)

* Experiment with PPL queries on a local Spark cluster: [PPL on local Spark](docs/ppl-lang/local-spark-ppl-test-instruction.md)

## Prerequisites

Version compatibility:
@@ -31,6 +33,7 @@ Version compatibility:
| 0.4.0 | 11+ | 3.3.2 | 2.12.14 | 2.13+ |
| 0.5.0 | 11+ | 3.5.1 | 2.12.14 | 2.17+ |
| 0.6.0 | 11+ | 3.5.1 | 2.12.14 | 2.17+ |
| 0.7.0 | 11+ | 3.5.1 | 2.12.14 | 2.17+ |

## Flint Extension Usage

@@ -62,7 +65,7 @@ sbt clean standaloneCosmetic/publishM2
```
then add org.opensearch:opensearch-spark-standalone_2.12 when running your Spark application, for example:
```
bin/spark-shell --packages "org.opensearch:opensearch-spark-standalone_2.12:0.6.0-SNAPSHOT" \
bin/spark-shell --packages "org.opensearch:opensearch-spark-standalone_2.12:0.7.0-SNAPSHOT" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintSparkExtensions" \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog"
```
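Once the shell is up with the Flint extension loaded, Flint statements can be issued through `spark.sql`; a minimal sketch, assuming a source table named `alb_logs` with an `elb_status_code` column exists in the configured catalog:
```
scala> spark.sql("CREATE SKIPPING INDEX ON alb_logs (elb_status_code VALUE_SET)")
```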
@@ -74,14 +77,20 @@ To build and run this PPL in Spark, you can run (requires Java 11):
```
sbt clean sparkPPLCosmetic/publishM2
```
then add org.opensearch:opensearch-spark-ppl_2.12 when run spark application, for example,

Then add org.opensearch:opensearch-spark-ppl_2.12 when running your Spark application, for example:

```
bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.6.0-SNAPSHOT" \
bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPSHOT" \
--conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions" \
--conf "spark.sql.catalog.dev=org.apache.spark.opensearch.catalog.OpenSearchCatalog"
```
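Once the shell starts with the PPL extension enabled, PPL text can be submitted directly through `spark.sql`; a minimal sketch, assuming a table named `http_logs` with `status` and `clientip` fields:
```
scala> val df = spark.sql("source = http_logs | where status = 404 | stats count() by clientip")
scala> df.show()
```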

### Run PPL queries on a local Spark cluster
See a PPL usage sample on a local Spark cluster: [PPL on local Spark](docs/ppl-lang/local-spark-ppl-test-instruction.md)


## Code of Conduct

This project has adopted an [Open Source Code of Conduct](./CODE_OF_CONDUCT.md).
40 changes: 38 additions & 2 deletions build.sbt
@@ -3,6 +3,7 @@
* SPDX-License-Identifier: Apache-2.0
*/
import Dependencies._
import sbtassembly.AssemblyPlugin.autoImport.ShadeRule

lazy val scala212 = "2.12.14"
lazy val sparkVersion = "3.5.1"
@@ -21,7 +22,7 @@ val sparkMinorVersion = sparkVersion.split("\\.").take(2).mkString(".")

ThisBuild / organization := "org.opensearch"

ThisBuild / version := "0.6.0-SNAPSHOT"
ThisBuild / version := "0.7.0-SNAPSHOT"

ThisBuild / scalaVersion := scala212

@@ -43,7 +44,35 @@ lazy val compileScalastyle = taskKey[Unit]("compileScalastyle")
// Run as part of test task.
lazy val testScalastyle = taskKey[Unit]("testScalastyle")

// Explanation:
// - ThisBuild / assemblyShadeRules sets the shading rules for the entire build
// - ShadeRule.rename(...) creates a rule to rename multiple package patterns
// - "shaded.@0" means prepend "shaded." to the original package name
// - .inAll applies the rule to all dependencies, not just direct dependencies
val packagesToShade = Seq(
"com.amazonaws.cloudwatch.**",
"com.fasterxml.jackson.core.**",
"com.fasterxml.jackson.dataformat.**",
"com.fasterxml.jackson.databind.**",
"com.sun.jna.**",
"com.thoughtworks.paranamer.**",
"javax.annotation.**",
"org.apache.commons.codec.**",
"org.apache.commons.logging.**",
"org.apache.hc.**",
"org.apache.http.**",
"org.glassfish.json.**",
"org.joda.time.**",
"org.reactivestreams.**",
"org.yaml.**",
"software.amazon.**"
)

ThisBuild / assemblyShadeRules := Seq(
ShadeRule.rename(
packagesToShade.map(_ -> "shaded.flint.@0"): _*
).inAll
)
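// Illustrative effect of the rule above (class name chosen only as an example):
// a reference to com.fasterxml.jackson.databind.ObjectMapper inside a shaded dependency
// is relocated to shaded.flint.com.fasterxml.jackson.databind.ObjectMapper in the assembly jar.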

lazy val commonSettings = Seq(
javacOptions ++= Seq("-source", "11"),
@@ -53,7 +82,11 @@ lazy val commonSettings = Seq(
compileScalastyle := (Compile / scalastyle).toTask("").value,
Compile / compile := ((Compile / compile) dependsOn compileScalastyle).value,
testScalastyle := (Test / scalastyle).toTask("").value,
// Enable HTML report and output to separate folder per package
Test / testOptions += Tests.Argument(TestFrameworks.ScalaTest, "-h", s"target/test-reports/${name.value}"),
Test / test := ((Test / test) dependsOn testScalastyle).value,
// Needed for HTML report
libraryDependencies += "com.vladsch.flexmark" % "flexmark-all" % "0.64.8" % "test",
dependencyOverrides ++= Seq(
"com.fasterxml.jackson.core" % "jackson-core" % jacksonVersion,
"com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion
@@ -89,6 +122,8 @@ lazy val flintCore = (project in file("flint-core"))
"com.amazonaws" % "aws-java-sdk-cloudwatch" % "1.12.593"
exclude("com.fasterxml.jackson.core", "jackson-databind"),
"software.amazon.awssdk" % "auth-crt" % "2.28.10",
"com.fasterxml.jackson.core" % "jackson-core" % jacksonVersion,
"com.fasterxml.jackson.core" % "jackson-databind" % jacksonVersion,
"org.projectlombok" % "lombok" % "1.18.30" % "provided",
"org.scalactic" %% "scalactic" % "3.2.15" % "test",
"org.scalatest" %% "scalatest" % "3.2.15" % "test",
@@ -241,7 +276,8 @@ lazy val integtest = (project in file("integ-test"))
inConfig(IntegrationTest)(Defaults.testSettings ++ Seq(
IntegrationTest / javaSource := baseDirectory.value / "src/integration/java",
IntegrationTest / scalaSource := baseDirectory.value / "src/integration/scala",
IntegrationTest / parallelExecution := false,
IntegrationTest / resourceDirectory := baseDirectory.value / "src/integration/resources",
IntegrationTest / parallelExecution := false,
IntegrationTest / fork := true,
)),
inConfig(AwsIntegrationTest)(Defaults.testSettings ++ Seq(
Binary file added docs/img/spark-ui.png
6 changes: 3 additions & 3 deletions docs/index.md
@@ -60,7 +60,7 @@ Currently, Flint metadata is only static configuration without version control a

```json
{
"version": "0.6.0",
"version": "0.7.0",
"name": "...",
"kind": "skipping",
"source": "...",
@@ -698,7 +698,7 @@ For now, only single or conjunct conditions (conditions connected by AND) in WHE
### AWS EMR Spark Integration - Using execution role
Flint uses [DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html). When running in EMR Spark, Flint uses the execution role credentials:
```
--conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.6.0-SNAPSHOT \
--conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.7.0-SNAPSHOT \
--conf spark.jars.repositories=https://aws.oss.sonatype.org/content/repositories/snapshots \
--conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64 \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64 \
@@ -740,7 +740,7 @@ Flint use [DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJa
```
3. Set the spark.datasource.flint.customAWSCredentialsProvider property with value as com.amazonaws.emr.AssumeRoleAWSCredentialsProvider. Set the environment variable ASSUME_ROLE_CREDENTIALS_ROLE_ARN with the ARN value of CrossAccountRoleB.
```
--conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.6.0-SNAPSHOT \
--conf spark.jars.packages=org.opensearch:opensearch-spark-standalone_2.12:0.7.0-SNAPSHOT \
--conf spark.jars.repositories=https://aws.oss.sonatype.org/content/repositories/snapshots \
--conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64 \
--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto.x86_64 \
53 changes: 39 additions & 14 deletions docs/ppl-lang/PPL-Example-Commands.md
@@ -50,6 +50,10 @@ _- **Limitation: new field added by eval command with a function cannot be dropp
- `source = table | where a < 1 | fields a,b,c`
- `source = table | where b != 'test' | fields a,b,c`
- `source = table | where c = 'test' | fields a,b,c | head 3`
- `source = table | where c = 'test' AND a = 1 | fields a,b,c`
- `source = table | where c != 'test' OR a > 1 | fields a,b,c`
- `source = table | where (b > 1 OR a > 1) AND c != 'test' | fields a,b,c`
- `source = table | where c = 'test' NOT a > 1 | fields a,b,c` - Note: "AND" is optional
- `source = table | where ispresent(b)`
- `source = table | where isnull(coalesce(a, b)) | fields a,b,c | head 3`
- `source = table | where isempty(a)`
@@ -61,6 +65,7 @@ _- **Limitation: new field added by eval command with a function cannot be dropp
- `source = table | where cidrmatch(ip, '192.169.1.0/24')`
- `source = table | where cidrmatch(ipv6, '2003:db8::/32')`
- `source = table | trendline sma(2, temperature) as temp_trend`
- `source = table | trendline sort timestamp wma(2, temperature) as temp_trend`

#### **IP related queries**
[See additional command details](functions/ppl-ip.md)
@@ -177,6 +182,7 @@ source = table | where ispresent(a) |
- `source = table | stats max(c) by b`
- `source = table | stats count(c) by b | head 5`
- `source = table | stats distinct_count(c)`
- `source = table | stats distinct_count_approx(c)`
- `source = table | stats stddev_samp(c)`
- `source = table | stats stddev_pop(c)`
- `source = table | stats percentile(c, 90)`
@@ -202,6 +208,7 @@ source = table | where ispresent(a) |
- `source = table | where a < 50 | eventstats avg(c) `
- `source = table | eventstats max(c) by b`
- `source = table | eventstats count(c) by b | head 5`
- `source = table | eventstats count(c) by b | head 5`
- `source = table | eventstats stddev_samp(c)`
- `source = table | eventstats stddev_pop(c)`
- `source = table | eventstats percentile(c, 90)`
@@ -246,12 +253,15 @@ source = table | where ispresent(a) |

- `source=accounts | rare gender`
- `source=accounts | rare age by gender`
- `source=accounts | rare 5 age by gender`
- `source=accounts | rare_approx age by gender`

#### **Top**
[See additional command details](ppl-top-command.md)

- `source=accounts | top gender`
- `source=accounts | top 1 gender`
- `source=accounts | top_approx 5 gender`
- `source=accounts | top 1 age by gender`

#### **Parse**
@@ -306,7 +316,11 @@ source = table | where ispresent(a) |
- `source = table1 | left semi join left = l right = r on l.a = r.a table2`
- `source = table1 | left anti join left = l right = r on l.a = r.a table2`
- `source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]`

- `source = table1 | inner join on table1.a = table2.a table2 | fields table1.a, table2.a, table1.b, table1.c` (directly refer table name)
- `source = table1 | inner join on a = c table2 | fields a, b, c, d` (side aliases can be omitted as long as there is no ambiguity)
- `source = table1 as t1 | join left = l right = r on l.a = r.a table2 as t2 | fields l.a, r.a` (side alias overrides table alias)
- `source = table1 as t1 | join left = l right = r on l.a = r.a table2 as t2 | fields t1.a, t2.a` (error, side alias overrides table alias)
- `source = table1 | join left = l right = r on l.a = r.a [ source = table2 ] as s | fields l.a, s.a` (error, side alias overrides subquery alias)

#### **Lookup**
[See additional command details](ppl-lookup-command.md)
@@ -437,8 +451,30 @@ Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table in

_- **Limitation: another command usage of (relation) subquery is in `appendcols` commands which is unsupported**_

---
#### Experimental Commands:

#### **fillnull**
[See additional command details](ppl-fillnull-command.md)
```sql
- `source=accounts | fillnull fields status_code=101`
- `source=accounts | fillnull fields request_path='/not_found', timestamp='*'`
- `source=accounts | fillnull using field1=101`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5, field6 = 'N/A'`
```

#### **expand**
[See additional command details](ppl-expand-command.md)
```sql
- `source = table | expand field_with_array as array_list`
- `source = table | expand employee | stats max(salary) as max by state, company`
- `source = table | expand employee as worker | stats max(salary) as max by state, company`
- `source = table | expand employee as worker | eval bonus = salary * 3 | fields worker, bonus`
- `source = table | expand employee | parse description '(?<email>.+@.+)' | fields employee, email`
- `source = table | eval array=json_array(1, 2, 3) | expand array as uid | fields name, occupation, uid`
- `source = table | expand multi_valueA as multiA | expand multi_valueB as multiB`
```

#### Correlation Commands:
[See additional command details](ppl-correlation-command.md)

```sql
@@ -450,14 +486,3 @@
> ppl-correlation-command is an experimental command - it may be removed in future versions
---
### Planned Commands:

#### **fillnull**
[See additional command details](ppl-fillnull-command.md)
```sql
- `source=accounts | fillnull fields status_code=101`
- `source=accounts | fillnull fields request_path='/not_found', timestamp='*'`
- `source=accounts | fillnull using field1=101`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5, field6 = 'N/A'`
```
4 changes: 2 additions & 2 deletions docs/ppl-lang/PPL-on-Spark.md
@@ -34,7 +34,7 @@ sbt clean sparkPPLCosmetic/publishM2
```
then add org.opensearch:opensearch-spark-ppl_2.12 when running your Spark application, for example:
```
bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.6.0-SNAPSHOT"
bin/spark-shell --packages "org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPSHOT"
```

### PPL Extension Usage
@@ -46,7 +46,7 @@ spark-sql --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkE
```

### Running With both Flint & PPL Extensions
In order to make use of both flint and ppl extension, one can simply add both jars (`org.opensearch:opensearch-spark-ppl_2.12:0.6.0-SNAPSHOT`,`org.opensearch:opensearch-spark_2.12:0.6.0-SNAPSHOT`) to the cluster's
To make use of both the Flint and PPL extensions, add both jars (`org.opensearch:opensearch-spark-ppl_2.12:0.7.0-SNAPSHOT`, `org.opensearch:opensearch-spark_2.12:0.7.0-SNAPSHOT`) to the cluster's
classpath.

Next, configure both extensions:
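One way to do this is a sketch like the following, where both extension classes are passed to `spark.sql.extensions` as a comma-separated list:
```
spark-sql --conf "spark.sql.extensions=org.opensearch.flint.spark.FlintPPLSparkExtensions,org.opensearch.flint.spark.FlintSparkExtensions"
```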
13 changes: 12 additions & 1 deletion docs/ppl-lang/README.md
@@ -71,6 +71,8 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
- [`correlation commands`](ppl-correlation-command.md)

- [`trendline commands`](ppl-trendline-command.md)

- [`expand commands`](ppl-expand-command.md)

* **Functions**

@@ -92,7 +94,7 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).

- [`IP Address Functions`](functions/ppl-ip.md)

- [`Lambda Functions`](functions/ppl-lambda.md)
- [`Collection Functions`](functions/ppl-collection)

---
### PPL On Spark
@@ -104,6 +106,15 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
### Example PPL Queries
See samples of [PPL queries](PPL-Example-Commands.md)

---

### Experiment with PPL locally using a Spark cluster
See a PPL usage sample on a local Spark cluster: [PPL on local Spark](local-spark-ppl-test-instruction.md)

---
### TPC-H PPL Query Rewriting
See samples of [TPC-H PPL query rewriting](ppl-tpch.md)

---
### Planned PPL Commands
