Translate PPL join Command #630

Merged (11 commits) on Sep 19, 2024

Member

@LantaoJin LantaoJin commented Sep 9, 2024

Description

Syntax

SEARCH source=<left-table>
| <other piped command>
| [joinType] JOIN left = <leftAlias> [,] right = <rightAlias> ON joinCriteria <right-table>
| <other piped command>

joinType: [INNER] | LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER] | CROSS | [LEFT] SEMI | [LEFT] ANTI 
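
To make the optional keywords in the joinType rule concrete, here is a minimal Python sketch that maps each accepted spelling to one canonical name. The helper `normalize_join_type` is hypothetical and only illustrates the grammar above; it is not the parser implemented in this PR.

```python
# Hypothetical helper: maps joinType spellings from the grammar above to a
# canonical name. Illustration only, not the parser used in this PR.
def normalize_join_type(text: str = "") -> str:
    words = text.upper().split()
    # An omitted joinType defaults to INNER.
    if not words or words == ["INNER"]:
        return "INNER"
    if words == ["CROSS"]:
        return "CROSS"
    # The LEFT keyword is optional for SEMI and ANTI.
    if words in (["SEMI"], ["LEFT", "SEMI"]):
        return "LEFT SEMI"
    if words in (["ANTI"], ["LEFT", "ANTI"]):
        return "LEFT ANTI"
    # The OUTER keyword is optional for LEFT, RIGHT and FULL.
    if words[0] in ("LEFT", "RIGHT", "FULL") and words[1:] in ([], ["OUTER"]):
        return f"{words[0]} OUTER"
    raise ValueError(f"unsupported join type: {text!r}")


print(normalize_join_type("left"))       # LEFT OUTER
print(normalize_join_type("semi"))       # LEFT SEMI
print(normalize_join_type())             # INNER
```

Under this reading, `LEFT JOIN` and `LEFT OUTER JOIN` are the same join, and a bare `JOIN` is an inner join.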

Example of migration from SQL query (TPC-H Q13):

SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c_custkey, COUNT(o_orderkey) c_count
    FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey
        AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

Rewritten as a PPL join query:

SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN left = c right = o
    ON c.c_custkey = o.o_custkey AND o_comment NOT LIKE '%unusual%packages%'
    orders
| STATS count(o_orderkey) AS c_count BY c.c_custkey
| STATS count(1) AS custdist BY c_count
| SORT - custdist, - c_count
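
As a sanity check on what both forms of the query are meant to compute, the SQL version of Q13 can be run against a few rows with Python's built-in `sqlite3` module. The rows below are made up for illustration and are not TPC-H data; SQLite stands in for Spark here only because it needs no setup.

```python
import sqlite3

# Made-up rows for illustration only; this is not TPC-H data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER);
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_comment TEXT);
INSERT INTO customer VALUES (1), (2), (3);
INSERT INTO orders VALUES
  (10, 1, 'fast delivery'),
  (11, 1, 'unusual express packages'),
  (12, 2, 'regular');
""")

q13 = """
SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c_custkey, COUNT(o_orderkey) c_count
    FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey
        AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;
"""

# Each row is (c_count, custdist): how many customers have c_count orders.
rows = conn.execute(q13).fetchall()
print(rows)  # [(1, 2), (0, 1)]
```

Customer 3 has no orders but still appears with a count of 0, which is the behavior the LEFT OUTER JOIN in the PPL query is expected to preserve.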

- Limitation: sub-searches are not supported on the right side of a join, because PPL does not support sub-searches yet.

If sub-searches are supported in the future, the PPL query above could be rewritten as:

SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN left = c right = o ON c.c_custkey = o.o_custkey
   [
      SEARCH source=orders
      | WHERE o_comment NOT LIKE '%unusual%packages%'
      | FIELDS o_orderkey, o_custkey
   ]
| STATS count(o_orderkey) AS c_count BY c.c_custkey
| STATS count(*) AS custdist BY c_count
| SORT - custdist, - c_count
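
The sub-search form moves the o_comment filter from the ON clause into the right side of the join. For a left outer join these two placements should give the same result, while a filter applied after the join would not. A small sketch with Python's `sqlite3` module and made-up rows (not TPC-H data, and SQL rather than PPL) shows the difference:

```python
import sqlite3

# Made-up rows for illustration only; this is not TPC-H data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_custkey INTEGER);
CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER, o_comment TEXT);
INSERT INTO customer VALUES (1), (2), (3);
INSERT INTO orders VALUES
  (10, 1, 'fast delivery'),
  (11, 1, 'unusual express packages'),
  (12, 2, 'regular');
""")

# Filter inside the join condition (the form used in this PR).
on_filter = """
SELECT c_custkey, COUNT(o_orderkey) AS c_count
FROM customer LEFT OUTER JOIN orders
  ON c_custkey = o_custkey AND o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey ORDER BY c_custkey
"""

# Filter applied to the right side before joining (the sub-search form).
pre_filter = """
SELECT c_custkey, COUNT(o_orderkey) AS c_count
FROM customer LEFT OUTER JOIN
  ( SELECT o_orderkey, o_custkey FROM orders
    WHERE o_comment NOT LIKE '%unusual%packages%' ) AS o
  ON c_custkey = o_custkey
GROUP BY c_custkey ORDER BY c_custkey
"""

# Filter applied after the join: customers without orders are dropped.
where_filter = """
SELECT c_custkey, COUNT(o_orderkey) AS c_count
FROM customer LEFT OUTER JOIN orders ON c_custkey = o_custkey
WHERE o_comment NOT LIKE '%unusual%packages%'
GROUP BY c_custkey ORDER BY c_custkey
"""

on_rows = conn.execute(on_filter).fetchall()
pre_rows = conn.execute(pre_filter).fetchall()
where_rows = conn.execute(where_filter).fetchall()

print(on_rows == pre_rows)  # True
print(where_rows)           # [(1, 1), (2, 1)]
```

This is why the current PPL rewrite keeps the filter in the ON clause instead of adding a WHERE command after the join.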

Issues Resolved

Resolves #619

Check List

  • Updated documentation (ppl-spark-integration/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <[email protected]>
@YANG-DB YANG-DB added the Lang:PPL Pipe Processing Language support label Sep 9, 2024
@LantaoJin LantaoJin marked this pull request as ready for review September 13, 2024 12:05
@LantaoJin LantaoJin requested a review from YANG-DB September 14, 2024 04:38
docs/PPL-Join-command.md (outdated)
- Optional
- Description: The type of join to perform. The default is `INNER` if not specified.

**leftAlias**
Collaborator

nit: if it is a required field, why not use left and right? The hint. prefix makes it look like an optional field.

Member Author

@LantaoJin LantaoJin Sep 18, 2024

You are right: the hint. prefix makes it look like an optional item, and using left and right sounds better. But it reminds me that the alias could actually be optional too, if users prefer the $table.$field pattern in the join condition and there are no ambiguous names. For example:

SEARCH source=otel-v1-apm-span-000001
| <other PPL commands>
| LEFT JOIN
    ON name = serviceName AND traceGroupName = 'client_cancel_order'
    otel-v1-apm-service-map
| <other PPL commands>

Or ON otel-v1-apm-span-000001.name = otel-v1-apm-service-map.serviceName AND otel-v1-apm-service-map.traceGroupName = 'client_cancel_order'

Which solution do you prefer? @penghuo

  1. Keep the alias name hint.left and hint.right and set it to optional.
  2. Remove the hint. prefix and keep it required.
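
The choice between the two options mostly depends on how field names in the join condition are resolved. As a rough illustration, here is a sketch of a resolver: the function `resolve` and the column sets are hypothetical, and this is not how the Spark analyzer is implemented. An unqualified name works only while exactly one side has that field.

```python
# Hypothetical resolver, for illustration only (not Spark's analyzer).
# `sides` maps each alias or table name to its set of column names.
def resolve(field, sides):
    if "." in field:
        # Qualified name: the prefix picks the side explicitly.
        qualifier, name = field.split(".", 1)
        if qualifier in sides and name in sides[qualifier]:
            return (qualifier, name)
        raise KeyError(f"unknown field: {field}")
    # Unqualified name: valid only if exactly one side has the field.
    owners = [q for q, cols in sides.items() if field in cols]
    if len(owners) == 1:
        return (owners[0], field)
    if not owners:
        raise KeyError(f"unknown field: {field}")
    raise ValueError(f"ambiguous field: {field}")


# Column sets are assumptions made up for this example.
sides = {
    "l": {"name", "traceId"},
    "r": {"serviceName", "traceGroupName", "traceId"},
}
print(resolve("serviceName", sides))  # ('r', 'serviceName')
print(resolve("l.traceId", sides))    # ('l', 'traceId')
```

With these made-up columns, `resolve("traceId", sides)` raises `ValueError` because both sides have the field, which is the case where a side alias or a table-name prefix is needed.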

Member

IMO I would keep the existing first option

Member Author

@YANG-DB I found that the side aliases have to be required, because we add a SubqueryAlias node for each side of the join.

Member Author

@LantaoJin LantaoJin Sep 19, 2024

Doc updated. The hint. prefix is removed in the latest commit.

@LantaoJin
Member Author

@YANG-DB @penghuo can this PR be merged today? I am going to submit the lookup PR which is based on this.

@penghuo penghuo merged commit 56c4c34 into opensearch-project:main Sep 19, 2024
4 checks passed
@LantaoJin
Member Author

@YANG-DB This PR needs the 0.6 label, since the Lookup PR already has it.

Labels
Lang:PPL Pipe Processing Language support
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support Join command in PPL
3 participants