update documentation with specifications markdown pages including ppl expressions

Signed-off-by: YANGDB <[email protected]>
YANG-DB committed Oct 4, 2024
1 parent b1791ff commit d7ee664
Showing 4 changed files with 41 additions and 3 deletions.
31 changes: 31 additions & 0 deletions docs/ppl-lang/ppl-dedup-command.md
@@ -124,3 +124,34 @@ PPL query:
- `source = table | dedup 2 a,b keepempty=true | fields a,b,c`
- `source = table | dedup 1 a consecutive=true | fields a,b,c` (Consecutive deduplication is unsupported)

### Limitation:

**Spark Support** (3.4)

To translate a `dedup` command with `allowedDuplication > 1`, such as `| dedup 2 a,b`, into a Spark plan, the command is rewritten as a plan with a Window function (e.g. `row_number`) and a Filter on the generated `_row_number_` column; equivalent Spark SQL sketches follow the plan examples below.

- For `| dedup 2 a, b keepempty=false`

```
DataFrameDropColumns('_row_number_)
+- Filter ('_row_number_ <= 2) // allowed duplication = 2
   +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST]
      +- Filter (isnotnull('a) AND isnotnull('b)) // keepempty=false
         +- Project
            +- UnresolvedRelation
```
- For `| dedup 2 a, b keepempty=true`
```
Union
:- DataFrameDropColumns('_row_number_)
:  +- Filter ('_row_number_ <= 2)
:     +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST]
:        +- Filter (isnotnull('a) AND isnotnull('b))
:           +- Project
:              +- UnresolvedRelation
+- Filter (isnull('a) OR isnull('b))
   +- Project
      +- UnresolvedRelation
```
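
For intuition, here are minimal Spark SQL sketches of the two rewrites (assumptions: a source table `t` with columns `a`, `b`, `c`; `t` and the subquery alias `ranked` are illustrative names, not part of the generated plan):

```sql
-- keepempty=false: rows with null dedup keys are filtered out before ranking
SELECT a, b, c
FROM (
  SELECT a, b, c,
         ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY a, b) AS _row_number_
  FROM t           -- assumed source table
  WHERE a IS NOT NULL AND b IS NOT NULL
) ranked
WHERE _row_number_ <= 2;  -- allowed duplication = 2
```

```sql
-- keepempty=true: null-keyed rows bypass ranking and are unioned back in
-- (the plan's Union keeps duplicates, i.e. UNION ALL)
SELECT a, b, c
FROM (
  SELECT a, b, c,
         ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY a, b) AS _row_number_
  FROM t
  WHERE a IS NOT NULL AND b IS NOT NULL
) ranked
WHERE _row_number_ <= 2
UNION ALL
SELECT a, b, c
FROM t
WHERE a IS NULL OR b IS NULL;
```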

- The `dedup` command with `allowedDuplication > 1` requires Spark version >= 3.4.
4 changes: 3 additions & 1 deletion docs/ppl-lang/ppl-eval-command.md
@@ -105,7 +105,9 @@ eval status_category =
```

### Limitation:
- `eval` with comma-separated expressions requires Spark version >= 3.4

- Overriding an existing field is unsupported; the following queries throw exceptions with "Reference 'a' is ambiguous":

```sql
- `source = table | eval a = 10 | fields a,b,c`
5 changes: 4 additions & 1 deletion docs/ppl-lang/ppl-fields-command.md
@@ -56,13 +56,16 @@ PPL query:
- `source = table | eval b1 = b | fields - b1,c`

### Limitation:
- `fields - list` shows incorrect results for Spark version 3.3; see [issue](https://github.com/opensearch-project/opensearch-spark/pull/732)
- A new field added by the `eval` command with a function cannot be dropped in the current version:

```sql
`source = table | eval b1 = b + 1 | fields - b1,c` (Field `b1` cannot be dropped due to SPARK-49782)
`source = table | eval b1 = lower(b) | fields - b1,c` (Field `b1` cannot be dropped due to SPARK-49782)
```

**Nested-Fields**
- Nested fields show incorrect results for Spark version 3.3; see [issue](https://github.com/opensearch-project/opensearch-spark/issues/739)
```sql
`source = catalog.schema.table1, catalog.schema.table2 | fields A.nested1, B.nested1`
`source = catalog.table | where struct_col2.field1.subfield > 'valueA' | sort int_col | fields int_col, struct_col.field1.subfield, struct_col2.field1.subfield`
4 changes: 3 additions & 1 deletion docs/ppl-lang/ppl-rename-command.md
@@ -47,6 +47,8 @@ PPL query:
+------+---------+

### Limitation:
- The `rename` command requires Spark version >= 3.4

- Overriding an existing field is unsupported:

`source=accounts | grok address '%{NUMBER} %{GREEDYDATA:address}' | fields address`
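
A hypothetical illustration of the unsupported override (table and field names assumed): a query such as `source = table | rename c as a` is rejected when the target field `a` already exists, per the limitation above.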
