diff --git a/docs/ppl-lang/ppl-dedup-command.md b/docs/ppl-lang/ppl-dedup-command.md
index f2f6dd086..de8814c60 100644
--- a/docs/ppl-lang/ppl-dedup-command.md
+++ b/docs/ppl-lang/ppl-dedup-command.md
@@ -124,3 +124,34 @@ PPL query:
 - `source = table | dedup 2 a,b keepempty=true | fields a,b,c`
 - `source = table | dedup 1 a consecutive=true| fields a,b,c` (Consecutive deduplication is unsupported)
+### Limitation:
+
+**Spark Support** (3.4)
+
+To translate a `dedup` command with `allowedDuplication > 1`, such as `| dedup 2 a,b`, to a Spark plan, the command is rewritten into a plan with a Window function (e.g. `row_number`) and a Filter on the generated `_row_number_` column.
+
+- For `| dedup 2 a, b keepempty=false`
+
+```
+DataFrameDropColumns('_row_number_)
++- Filter ('_row_number_ <= 2) // allowed duplication = 2
+   +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST]
+      +- Filter (isnotnull('a) AND isnotnull('b)) // keepempty=false
+         +- Project
+            +- UnresolvedRelation
+```
+- For `| dedup 2 a, b keepempty=true`
+```
+Union
+:- DataFrameDropColumns('_row_number_)
+:  +- Filter ('_row_number_ <= 2)
+:     +- Window [row_number() windowspecdefinition('a, 'b, 'a ASC NULLS FIRST, 'b ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _row_number_], ['a, 'b], ['a ASC NULLS FIRST, 'b ASC NULLS FIRST]
+:        +- Filter (isnotnull('a) AND isnotnull('b))
+:           +- Project
+:              +- UnresolvedRelation
++- Filter (isnull('a) OR isnull('b))
+   +- Project
+      +- UnresolvedRelation
+```
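+
+For illustration, the plans above correspond to the following DataFrame-level rewrite. This is a minimal sketch, not the implementation; the table name and the dedup keys `a`, `b` are illustrative:
+
+```scala
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.expressions.Window
+import org.apache.spark.sql.functions.{col, row_number}
+
+val spark = SparkSession.builder().getOrCreate()
+val df = spark.table("table")
+
+// keepempty=false: drop rows whose dedup keys are null, number the rows
+// within each (a, b) group, keep at most 2 per group, drop the helper column.
+val win = Window.partitionBy("a", "b").orderBy("a", "b")
+val deduped = df
+  .filter(col("a").isNotNull && col("b").isNotNull)
+  .withColumn("_row_number_", row_number().over(win))
+  .filter(col("_row_number_") <= 2)
+  .drop("_row_number_")
+
+// keepempty=true: additionally union back the rows whose dedup keys are null.
+val withEmpty = deduped.union(df.filter(col("a").isNull || col("b").isNull))
+```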
+
+ - `dedup` with `allowedDuplication > 1` requires Spark version >= 3.4
\ No newline at end of file
diff --git a/docs/ppl-lang/ppl-eval-command.md b/docs/ppl-lang/ppl-eval-command.md
index 42cba1e2f..aa86220db 100644
--- a/docs/ppl-lang/ppl-eval-command.md
+++ b/docs/ppl-lang/ppl-eval-command.md
@@ -105,7 +105,9 @@ eval status_category =
 ```
 
 ### Limitation:
-Overriding existing field is unsupported, following queries throw exceptions with "Reference 'a' is ambiguous"
+ - `eval` with comma-separated expressions requires Spark version >= 3.4
+
+ - Overriding an existing field is unsupported; the following queries throw exceptions with "Reference 'a' is ambiguous"
 
 ```sql
 - `source = table | eval a = 10 | fields a,b,c`
diff --git a/docs/ppl-lang/ppl-fields-command.md b/docs/ppl-lang/ppl-fields-command.md
index 87c32b64d..cb67865dc 100644
--- a/docs/ppl-lang/ppl-fields-command.md
+++ b/docs/ppl-lang/ppl-fields-command.md
@@ -56,13 +56,16 @@ PPL query:
 - `source = table | eval b1 = b | fields - b1,c`
 
 ### Limitation:
-new field added by eval command with a function cannot be dropped in current version:**_
+ - `fields - list` returns incorrect results on Spark version 3.3; see [issue](https://github.com/opensearch-project/opensearch-spark/pull/732)
+ - a new field added by the `eval` command with a function cannot be dropped in the current version:
+
 ```sql
 `source = table | eval b1 = b + 1 | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)
 `source = table | eval b1 = lower(b) | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)
 ```
 
 **Nested-Fields**
+ - nested fields return incorrect results on Spark version 3.3; see [issue](https://github.com/opensearch-project/opensearch-spark/issues/739)
 ```sql
 `source = catalog.schema.table1, catalog.schema.table2 | fields A.nested1, B.nested1`
 `source = catalog.table | where struct_col2.field1.subfield > 'valueA' | sort int_col | fields int_col, struct_col.field1.subfield, struct_col2.field1.subfield`
diff --git a/docs/ppl-lang/ppl-rename-command.md b/docs/ppl-lang/ppl-rename-command.md
index d7fd6921c..8a3e4e3b5 100644
--- a/docs/ppl-lang/ppl-rename-command.md
+++ b/docs/ppl-lang/ppl-rename-command.md
@@ -47,6 +47,8 @@ PPL query:
 +------+---------+
 
 ### Limitation:
-Overriding existing field is unsupported:
+- The `rename` command requires Spark version >= 3.4
+
+- Overriding an existing field is unsupported:
 
 `source=accounts | grok address '%{NUMBER} %{GREEDYDATA:address}' | fields address`
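+
+For illustration, a minimal Scala sketch of why overriding can fail. It assumes (as an assumption for illustration, not the documented translation) that `rename a as b` is rewritten into a projection that appends `a` under the existing name `b`; the data and column names are illustrative:
+
+```scala
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.functions.col
+
+val spark = SparkSession.builder().getOrCreate()
+import spark.implicits._
+
+val df = Seq((1, "x")).toDF("a", "b")
+
+// Appending `a` under the existing name `b` yields two columns named `b`.
+val renamed = df.select(col("*"), col("a").as("b"))
+
+// Throws org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous.
+renamed.select("b")
+```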