
Merge branch 'main' into geoip
# Conflicts:
#	ppl-spark-integration/src/main/antlr4/OpenSearchPPLParser.g4
#	ppl-spark-integration/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
#	ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/CatalystQueryPlanVisitor.java
#	ppl-spark-integration/src/main/java/org/opensearch/sql/ppl/parser/AstExpressionBuilder.java
kenricky-bitquill committed Nov 5, 2024
2 parents e65e664 + aaba489 commit f4e1918
Showing 171 changed files with 13,013 additions and 2,120 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,2 +1,2 @@
# This should match the owning team set up in https://github.com/orgs/opensearch-project/teams
* @joshuali925 @dai-chen @rupal-bq @mengweieric @vamsi-amazon @penghuo @seankao-az @anirudha @kaituo @YANG-DB @LantaoJin
* @joshuali925 @dai-chen @rupal-bq @mengweieric @vamsi-amazon @penghuo @seankao-az @anirudha @kaituo @YANG-DB @noCharger @LantaoJin @ykmr1224
15 changes: 15 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,15 @@
### Description
_Describe what this change achieves._

### Related Issues
_List any issues this PR will resolve, e.g. Resolves [...]._

### Check List
- [ ] Updated documentation (docs/ppl-lang/README.md)
- [ ] Implemented unit tests
- [ ] Implemented tests for combination with other commands
- [ ] New added source code should include a copyright header
- [ ] Commits are signed per the DCO using `--signoff`

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/sql/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).
6 changes: 3 additions & 3 deletions .github/workflows/test-and-build-workflow.yml
@@ -22,8 +22,8 @@ jobs:
distribution: 'temurin'
java-version: 11

- name: Integ Test
run: sbt integtest/integration

- name: Style check
run: sbt scalafmtCheckAll

- name: Integ Test
run: sbt integtest/integration
3 changes: 2 additions & 1 deletion MAINTAINERS.md
@@ -12,9 +12,10 @@ This document contains a list of maintainers in this repo. See [opensearch-proje
| Chen Dai | [dai-chen](https://github.com/dai-chen) | Amazon |
| Vamsi Manohar | [vamsi-amazon](https://github.com/vamsi-amazon) | Amazon |
| Peng Huo | [penghuo](https://github.com/penghuo) | Amazon |
| Lior Perry | [YANG-DB](https://github.com/YANG-DB) | Amazon |
| Lior Perry | [YANG-DB](https://github.com/YANG-DB) | Amazon |
| Sean Kao | [seankao-az](https://github.com/seankao-az) | Amazon |
| Anirudha Jadhav | [anirudha](https://github.com/anirudha) | Amazon |
| Kaituo Li | [kaituo](https://github.com/kaituo) | Amazon |
| Louis Chu | [noCharger](https://github.com/noCharger) | Amazon |
| Lantao Jin | [LantaoJin](https://github.com/LantaoJin) | Amazon |
| Tomoyuki Morita | [ykmr1224](https://github.com/ykmr1224) | Amazon |
2 changes: 2 additions & 0 deletions build.sbt
@@ -89,6 +89,7 @@ lazy val flintCore = (project in file("flint-core"))
"com.amazonaws" % "aws-java-sdk-cloudwatch" % "1.12.593"
exclude("com.fasterxml.jackson.core", "jackson-databind"),
"software.amazon.awssdk" % "auth-crt" % "2.28.10",
"org.projectlombok" % "lombok" % "1.18.30" % "provided",
"org.scalactic" %% "scalactic" % "3.2.15" % "test",
"org.scalatest" %% "scalatest" % "3.2.15" % "test",
"org.scalatest" %% "scalatest-flatspec" % "3.2.15" % "test",
@@ -153,6 +154,7 @@ lazy val pplSparkIntegration = (project in file("ppl-spark-integration"))
"com.stephenn" %% "scalatest-json-jsonassert" % "0.2.5" % "test",
"com.github.sbt" % "junit-interface" % "0.13.3" % "test",
"org.projectlombok" % "lombok" % "1.18.30",
"com.github.seancfoley" % "ipaddress" % "5.5.1",
),
libraryDependencies ++= deps(sparkVersion),
// ANTLR settings
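The new `ipaddress` dependency presumably supports the `cidrmatch` function introduced on this branch. A minimal Scala sketch of the kind of check it enables follows; the helper is hypothetical, not this project's actual implementation:

```scala
import inet.ipaddr.IPAddressString

// Hypothetical helper: does `ip` fall inside the CIDR block `cidr`?
// This is the shape of check cidrmatch(ip, '192.169.1.0/24') needs.
def cidrMatch(ip: String, cidr: String): Boolean = {
  val block = new IPAddressString(cidr).getAddress.toPrefixBlock
  val addr  = new IPAddressString(ip).getAddress
  addr != null && block.contains(addr)
}

// cidrMatch("192.169.1.42", "192.169.1.0/24")  // true
// cidrMatch("10.0.0.1", "192.169.1.0/24")      // false
```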
1 change: 1 addition & 0 deletions docs/index.md
@@ -549,6 +549,7 @@ In the index mapping, the `_meta` and `properties` fields store meta and schema information
- `spark.flint.monitor.initialDelaySeconds`: Initial delay in seconds before starting the monitoring task. Default value is 15.
- `spark.flint.monitor.intervalSeconds`: Interval in seconds for scheduling the monitoring task. Default value is 60.
- `spark.flint.monitor.maxErrorCount`: Maximum number of consecutive errors allowed before stopping the monitoring task. Default value is 5.
- `spark.flint.metadataCacheWrite.enabled`: Default is false. Enables writing metadata to the index mapping's `_meta` field as a read cache for frontend users. Do not use in production; this setting will be removed in a later version.
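A minimal sketch of enabling this setting, assuming a live `SparkSession` named `spark`:

```scala
// Experimental: write metadata to the index mapping's _meta as a read
// cache. Not for production use, per the note above.
spark.conf.set("spark.flint.metadataCacheWrite.enabled", "true")
```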

#### Data Type Mapping

98 changes: 92 additions & 6 deletions docs/ppl-lang/PPL-Example-Commands.md
@@ -1,5 +1,10 @@
## Example PPL Queries

#### **Comment**
[See additional command details](ppl-comment.md)
- `source=accounts | top gender // finds most common gender of all the accounts` (line comment)
- `source=accounts | dedup 2 gender /* dedup the document with gender field keep 2 duplication */ | fields account_number, gender` (block comment)

#### **Describe**
- `describe table` This command is equivalent to the `DESCRIBE EXTENDED table` SQL command
- `describe schema.table`
@@ -28,6 +33,12 @@ _- **Limitation: new field added by eval command with a function cannot be dropped**_
- `source = table | eval b1 = b + 1 | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)
- `source = table | eval b1 = lower(b) | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)

**Field-Summary**
[See additional command details](ppl-fieldsummary-command.md)
- `source = t | fieldsummary includefields=status_code nulls=false`
- `source = t | fieldsummary includefields= id, status_code, request_path nulls=true`
- `source = t | where status_code != 200 | fieldsummary includefields= status_code nulls=true`

**Nested-Fields**
- `source = catalog.schema.table1, catalog.schema.table2 | fields A.nested1, B.nested1`
- `source = catalog.table | where struct_col2.field1.subfield > 'valueA' | sort int_col | fields int_col, struct_col.field1.subfield, struct_col2.field1.subfield`
@@ -44,6 +55,19 @@ _- **Limitation: new field added by eval command with a function cannot be dropped**_
- `source = table | where isempty(a)`
- `source = table | where isblank(a)`
- `source = table | where case(length(a) > 6, 'True' else 'False') = 'True'`
- `source = table | where a not in (1, 2, 3) | fields a,b,c`
- `source = table | where a between 1 and 4` - Note: This returns a >= 1 and a <= 4, i.e. [1, 4]
- `source = table | where b not between '2024-09-10' and '2025-09-10'` - Note: This returns b >= '2024-09-10' and b <= '2025-09-10'
- `source = table | where cidrmatch(ip, '192.169.1.0/24')`
- `source = table | where cidrmatch(ipv6, '2003:db8::/32')`
- `source = table | trendline sma(2, temperature) as temp_trend`
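As a rough illustration, `trendline sma(2, temperature)` describes a two-point simple moving average, which corresponds to a sliding window aggregate in Spark terms. A sketch, assuming a DataFrame `df` with an ordering column `timestamp` (both names are assumptions):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col}

// Two-point simple moving average: current row plus the one before it.
val w = Window.orderBy(col("timestamp")).rowsBetween(-1, 0)
val out = df.withColumn("temp_trend", avg(col("temperature")).over(w))
```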

#### **IP related queries**
[See additional command details](functions/ppl-ip.md)

- `source = table | where cidrmatch(ip, '192.169.1.0/24')`
- `source = table | where isV6 = false and isValid = true and cidrmatch(ipAddress, '192.168.1.0/24')`
- `source = table | where isV6 = true | eval inRange = case(cidrmatch(ipAddress, '2003:db8::/32'), 'in' else 'out') | fields ip, inRange`
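A hedged sketch of running one of these queries through Spark; the extension class name is an assumption based on this repo's setup docs, so treat the wiring as illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ppl-ip-demo")
  // Assumed extension class; confirm against this repo's documentation.
  .config("spark.sql.extensions", "org.opensearch.flint.spark.FlintPPLSparkExtensions")
  .getOrCreate()

spark.sql("source = table | where cidrmatch(ip, '192.169.1.0/24')").show()
```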

```sql
source = table | eval status_category =
    ...
```

@@ -92,6 +116,10 @@ Assumptions: `a`, `b`, `c` are existing fields in `table`
- `source = table | eval f = case(a = 0, 'zero', a = 1, 'one', a = 2, 'two', a = 3, 'three', a = 4, 'four', a = 5, 'five', a = 6, 'six', a = 7, 'se7en', a = 8, 'eight', a = 9, 'nine')`
- `source = table | eval f = case(a = 0, 'zero', a = 1, 'one' else 'unknown')`
- `source = table | eval f = case(a = 0, 'zero', a = 1, 'one' else concat(a, ' is an incorrect binary digit'))`
- `source = table | eval digest = md5(fieldName) | fields digest`
- `source = table | eval digest = sha1(fieldName) | fields digest`
- `source = table | eval digest = sha2(fieldName,256) | fields digest`
- `source = table | eval digest = sha2(fieldName,512) | fields digest`
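These eval expressions presumably map to Spark's built-in hash column functions. A sketch, with `df` and `fieldName` as assumed names:

```scala
import org.apache.spark.sql.functions.{col, md5, sha1, sha2}

// Compute the same digests as the eval examples above.
val digests = df.select(
  md5(col("fieldName")).as("md5_digest"),
  sha1(col("fieldName")).as("sha1_digest"),
  sha2(col("fieldName"), 256).as("sha256_digest"),
  sha2(col("fieldName"), 512).as("sha512_digest"))
```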

#### Fillnull
Assumptions: `a`, `b`, `c`, `d`, `e` are existing fields in `table`
@@ -102,6 +130,15 @@ Assumptions: `a`, `b`, `c`, `d`, `e` are existing fields in `table`
- `source = table | fillnull using a = 101, b = 102`
- `source = table | fillnull using a = concat(b, c), d = 2 * pi() * e`

#### **Flatten**
[See additional command details](ppl-flatten-command.md)
Assumptions: `bridges` and `coor` are existing fields in `table`, with types `struct<?,?>` or `array<struct<?,?>>`
- `source = table | flatten bridges`
- `source = table | flatten coor`
- `source = table | flatten bridges | flatten coor`
- `source = table | fields bridges | flatten bridges`
- `source = table | fields country, bridges | flatten bridges | fields country, length | stats avg(length) as avg by country`
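In DataFrame terms, `flatten` on an `array<struct<...>>` field amounts to exploding the array and promoting the struct's fields to top level. A sketch under that assumption (column names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Explode the array, then expand the struct's fields.
val flattened = df
  .withColumn("bridge", explode(col("bridges")))
  .select(col("country"), col("bridge.*"))
```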

```sql
source = table | eval e = eval status_category =
    case(a >= 200 AND a < 300, 'Success',
    ...
```

@@ -158,6 +195,34 @@ source = table | where ispresent(a) |
- `source = table | stats avg(age) as avg_state_age by country, state | stats avg(avg_state_age) as avg_country_age by country`
- `source = table | stats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | stats avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | stats avg(avg_state_age) as avg_adult_country_age by country`

#### **Event Aggregations**
[See additional command details](ppl-eventstats-command.md)

- `source = table | eventstats avg(a)`
- `source = table | where a < 50 | eventstats avg(c)`
- `source = table | eventstats max(c) by b`
- `source = table | eventstats count(c) by b | head 5`
- `source = table | eventstats stddev_samp(c)`
- `source = table | eventstats stddev_pop(c)`
- `source = table | eventstats percentile(c, 90)`
- `source = table | eventstats percentile_approx(c, 99)`

_**Limitation: distinct aggregation cannot be used in `eventstats` (see the sketch after this section):**_
- `source = table | eventstats distinct_count(c)` (throw exception)

**Aggregations With Span**
- `source = table | eventstats count(a) by span(a, 10) as a_span`
- `source = table | eventstats sum(age) by span(age, 5) as age_span | head 2`
- `source = table | eventstats avg(age) by span(age, 20) as age_span, country | sort - age_span | head 2`

**Aggregations With TimeWindow Span (tumble windowing function)**
- `source = table | eventstats sum(productsAmount) by span(transactionDate, 1d) as age_date | sort age_date`
- `source = table | eventstats sum(productsAmount) by span(transactionDate, 1w) as age_date, productId`

**Aggregations Group by Multiple Times**
- `source = table | eventstats avg(age) as avg_state_age by country, state | eventstats avg(avg_state_age) as avg_country_age by country`
- `source = table | eventstats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | eventstats avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | eventstats avg(avg_state_age) as avg_adult_country_age by country`
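Unlike `stats`, `eventstats` keeps every input row and attaches the aggregate to it, which corresponds to a window aggregate in Spark. A sketch of that correspondence, including a window-safe alternative to the `distinct_count` limitation noted above (assuming a DataFrame `df` with columns `b` and `c`):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{approx_count_distinct, avg, col}

val byB = Window.partitionBy(col("b"))

// eventstats avg(c) by b: every row keeps its own aggregate value.
val withAvg = df.withColumn("avg_c", avg(col("c")).over(byB))

// Spark does not allow DISTINCT aggregates over windows, which is
// presumably why eventstats distinct_count(c) throws; the approximate
// variant is window-safe.
val withDc = df.withColumn("dc_c", approx_count_distinct(col("c")).over(byB))
```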

#### **Dedup**

[See additional command details](ppl-dedup-command.md)
@@ -240,8 +305,7 @@ source = table | where ispresent(a) |
- `source = table1 | cross join left = l right = r table2`
- `source = table1 | left semi join left = l right = r on l.a = r.a table2`
- `source = table1 | left anti join left = l right = r on l.a = r.a table2`

_- **Limitation: sub-searches are unsupported on the join right side for now**_
- `source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]`


#### **Lookup**
@@ -268,6 +332,8 @@ _- **Limitation: "REPLACE" or "APPEND" clause must contain "AS"**_
- `source = outer | where a not in [ source = inner | fields b ]`
- `source = outer | where (a) not in [ source = inner | fields b ]`
- `source = outer | where (a,b,c) not in [ source = inner | fields d,e,f ]`
- `source = outer a in [ source = inner | fields b ]` (search filtering with subquery)
- `source = outer a not in [ source = inner | fields b ]` (search filtering with subquery)
- `source = outer | where a in [ source = inner1 | where b not in [ source = inner2 | fields c ] | fields b ]` (nested)
- `source = table1 | inner join left = l right = r on l.a = r.a AND r.a in [ source = inner | fields d ] | fields l.a, r.a, b, c` (as join filter)

@@ -317,6 +383,9 @@ Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner
- `source = outer | where not exists [ source = inner | where a = c ]`
- `source = outer | where exists [ source = inner | where a = c and b = d ]`
- `source = outer | where not exists [ source = inner | where a = c and b = d ]`
- `source = outer exists [ source = inner | where a = c ]` (search filtering with subquery)
- `source = outer not exists [ source = inner | where a = c ]` (search filtering with subquery)
- `source = table as t1 exists [ source = table as t2 | where t1.a = t2.a ]` (table alias is useful in exists subquery)
- `source = outer | where exists [ source = inner1 | where a = c and exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner1 | where a = c | where exists [ source = inner2 | where c = e ] ]` (nested)
- `source = outer | where exists [ source = inner | where c > 10 ]` (uncorrelated exists)
@@ -332,8 +401,13 @@ Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner
- `source = outer | eval m = [ source = inner | stats max(c) ] | fields m, a`
- `source = outer | eval m = [ source = inner | stats max(c) ] + b | fields m, a`

**Uncorrelated scalar subquery in Select and Where**
- `source = outer | where a > [ source = inner | stats min(c) ] | eval m = [ source = inner | stats max(c) ] | fields m, a`
**Uncorrelated scalar subquery in Where**
- `source = outer | where a > [ source = inner | stats min(c) ] | fields a`
- `source = outer | where [ source = inner | stats min(c) ] > 0 | fields a`

**Uncorrelated scalar subquery in Search filter**
- `source = outer a > [ source = inner | stats min(c) ] | fields a`
- `source = outer [ source = inner | stats min(c) ] > 0 | fields a`

**Correlated scalar subquery in Select**
- `source = outer | eval m = [ source = inner | where outer.b = inner.d | stats max(c) ] | fields m, a`
@@ -345,10 +419,23 @@ Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner
- `source = outer | where a = [ source = inner | where b = d | stats max(c) ]`
- `source = outer | where [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a`

**Correlated scalar subquery in Search filter**
- `source = outer a = [ source = inner | where b = d | stats max(c) ]`
- `source = outer [ source = inner | where outer.b = inner.d OR inner.d = 1 | stats count() ] > 0 | fields a`

**Nested scalar subquery**
- `source = outer | where a = [ source = inner | stats max(c) | sort c ] OR b = [ source = inner | where c = 1 | stats min(d) | sort d ]`
- `source = outer | where a = [ source = inner | where c = [ source = nested | stats max(e) by f | sort f ] | stats max(d) by c | sort c | head 1 ]`

#### **(Relation) Subquery**
[See additional command details](ppl-subquery-command.md)

`InSubquery`, `ExistsSubquery` and `ScalarSubquery` are all subquery expressions. `RelationSubquery` is not a subquery expression but a subquery plan, commonly used in a Join or Search clause.

- `source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ]` (subquery in join right side)
- `source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1`

_- **Limitation: using a (relation) subquery in the `appendcols` command is unsupported**_

---
#### Experimental Commands:
@@ -366,12 +453,11 @@ Assumptions: `a`, `b` are fields of table outer, `c`, `d` are fields of table inner
### Planned Commands:

#### **fillnull**

[See additional command details](ppl-fillnull-command.md)
- `source=accounts | fillnull fields status_code=101`
- `source=accounts | fillnull fields request_path='/not_found', timestamp='*'`
- `source=accounts | fillnull using field1=101`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5`
- `source=accounts | fillnull using field1=concat(field2, field3), field4=2*pi()*field5, field6 = 'N/A'`
[See additional command details](planning/ppl-fillnull-command.md)
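For orientation, `fillnull` with constant values corresponds roughly to `DataFrame.na.fill`; a sketch using the column names from the examples above:

```scala
// Replace nulls per column with constants.
val filled = df.na.fill(Map("status_code" -> 101, "request_path" -> "/not_found"))
```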
18 changes: 16 additions & 2 deletions docs/ppl-lang/README.md
@@ -22,13 +22,17 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).

* **Commands**

- [`comment`](ppl-comment.md)

- [`explain command `](PPL-Example-Commands.md/#explain)

- [`dedup command `](ppl-dedup-command.md)

- [`describe command`](PPL-Example-Commands.md/#describe)

- [`fillnull command`](ppl-fillnull-command.md)

- [`flatten command`](ppl-flatten-command.md)

- [`eval command`](ppl-eval-command.md)

@@ -48,6 +52,8 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).

- [`stats command`](ppl-stats-command.md)

- [`eventstats command`](ppl-eventstats-command.md)

- [`where command`](ppl-where-command.md)

- [`head command`](ppl-head-command.md)
@@ -63,7 +69,8 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).
- [`subquery commands`](ppl-subquery-command.md)

- [`correlation commands`](ppl-correlation-command.md)


- [`trendline commands`](ppl-trendline-command.md)

* **Functions**

@@ -75,10 +82,17 @@ For additional examples see the next [documentation](PPL-Example-Commands.md).

- [`String Functions`](functions/ppl-string.md)

- [`JSON Functions`](functions/ppl-json.md)

- [`Condition Functions`](functions/ppl-condition.md)

- [`Type Conversion Functions`](functions/ppl-conversion.md)

- [`Cryptographic Functions`](functions/ppl-cryptographic.md)

- [`IP Address Functions`](functions/ppl-ip.md)

- [`Lambda Functions`](functions/ppl-lambda.md)

---
### PPL On Spark
Expand All @@ -97,4 +111,4 @@ See samples of [PPL queries](PPL-Example-Commands.md)

---
### PPL Project Roadmap
[PPL Github Project Roadmap](https://github.com/orgs/opensearch-project/projects/214)