Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]support Subquery in ppl #661

Closed
YANG-DB opened this issue Sep 14, 2024 · 3 comments
Closed

[FEATURE]support Subquery in ppl #661

YANG-DB opened this issue Sep 14, 2024 · 3 comments
Assignees
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support

Comments

@YANG-DB
Copy link
Member

YANG-DB commented Sep 14, 2024

Problem we are solving for

  • An SRE or Security Analyst needs to filter data based on the result of another query or filter.
  • An SRE or Security Analyst wants to extract subsets of data conditionally, based on calculations or conditions derived from another part of the query.
  • An SRE or Security Analyst wants to correlate data from two or more sources, where the second source depends on the results of the first query.

Proposed Solution

Syntax

The subquery command should be implemented using a clean, logical syntax that integrates with existing PPL structure.

source=logs | where field in (subquery source=events | where condition | return field)

In this example, the primary search (source=logs) is filtered by results from the subquery (source=events).

The subquery command should allow nested queries to be as complex as necessary, supporting multiple levels of nesting.
Example:

source=logs | where field in (subquery source=users | where user in (subquery source=actions | where action="login"))

1.1 Scoping and Context Awareness
Subqueries must respect the context of the parent query, such as access to fields, filters, and other constraints applied at higher levels.
Field references in the subquery should be able to use aliasing if there are naming conflicts between the parent and subquery.

1.3 Operators and Piping
The subquery command should support the standard operators (|, where, stats, etc.) and functions available in PPL.
Subquery results should be able to be further piped and processed after returning data to the primary query.

1.4 Performance Considerations
Queries using subquery should be optimized to minimize performance overhead. The system should use lazy evaluation or other optimization techniques to prevent unnecessary computation.
Limiting the number of nested subqueries may be required to avoid performance degradation in large datasets.

Subquery and Joins

The subquery command must be able to integrate with joins, allowing users to filter and correlate data across multiple data sources efficiently. This section outlines how subquery should interact with joins in the Piped Processing Language (PPL).

2.1 Use Case 1: Subquery as a Filter within Joins

The subquery command should be usable in the ON or WHERE clause of a join, allowing for dynamic filtering based on the results of a subquery. For example:

source=users
| join source=orders on users.user_id = orders.user_id
| where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id)

In this case:
The join command combines the users and orders sources based on user_id.
The where clause uses a subquery to filter the results, only returning orders that were returned due to a "damaged" reason.

2.2 Use Case 2: Subquery within a Join Clause

Subquery can be used directly in the join condition, allowing a user to join based on the result of a dynamic subquery. For example:

source=users
| join source=orders on users.user_id in (subquery source=high_value_customers | return user_id)

In this case:
The join condition compares the users.user_id to the result of a subquery that retrieves high-value customers from another dataset.
This allows users to filter which records are joined dynamically based on another dataset or query.

2.3 Use Case 3: Subquery with Different Join Types

The subquery should work with all types of joins, including INNER, LEFT OUTER, and CROSS joins. The behavior should remain consistent regardless of the join type.
INNER JOIN: The subquery will filter results within both data sources. Only rows that match the join condition and the subquery filter will be returned.

Example:

source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > (subquery source=targets | where target_met="true" | return target_value)

In this case, only employees who exceeded their sales targets are returned.
LEFT OUTER JOIN: In a left join, the subquery will still apply only to the joined source, but unmatched rows from the left side should still appear.

Example:

source=customers
| left join source=orders on customers.customer_id = orders.customer_id
| where orders.product_id in (subquery source=products | where category="electronics" | return product_id)

Example:

source=projects
| right join source=employees on projects.project_id = employees.project_id
| where projects.budget in (subquery source=budgets | where approved="true" | return budget_id)

Here, all employees are returned, but only those working on approved-budget projects will have project data.

2.4 Handling Performance with Joins and Subquery

Performance considerations are critical when combining joins and subqueries, as these operations could significantly increase query complexity.

Optimization: Queries should be optimized by evaluating subqueries first and passing their results efficiently to the join operation, reducing the overall number of records processed in the join.

Limiting Subquery in Joins: Users may be restricted from creating deeply nested subqueries within join clauses to prevent performance degradation. For example, a maximum depth of 2-3 levels may be enforced for subqueries used in join conditions.

Non-Functional Requirements

3.1 Performance
The subquery feature should not significantly degrade performance compared to running two separate queries independently. Where possible, the engine should optimize the execution plan to run the subquery efficiently.

3.2 Usability
The syntax should be clear and easily understood by users who are familiar with PPL.
The documentation should include detailed examples and edge cases to assist users in building complex subqueries.

3.3 Error Handling
If a subquery query returns no results or fails to execute, the system should return an appropriate error message, allowing the user to debug the query.
Cyclic subqueries (where a subquery references itself) should be detected, and an error should be thrown to prevent infinite loops.

Test Cases

4.1 Basic Query with Subquery
Validate that basic queries with Subquery execute as expected.
Input:
source=logs | where id in (subquery source=users | where active="true" | return id)
Expected Output: Logs only with user IDs returned by the subquery.

4.2 Performance Testing
Test the performance impact of subquery compared to equivalent independent queries.

4.3 Nested Subquery
Test with deeply nested subqueries and validate results.
Input:
source=actions | where user in (subquery source=users | where region in (subquery source=regions | where country="US"))
Expected Output: Actions filtered by users in US regions.

4.4 Edge Case: Empty Subquery
Test behavior when the subquery returns no results.
Input:
source=logs | where id in (subquery source=users | where id=999999 | return id)
Expected Output: No logs returned.

4.5 Error Case: Cyclic Subquery
Test cyclic subquery detection.
Input:
source=logs | where id in (subquery source=logs | where id in (subquery source=logs | ...))
Expected Output: Error for cyclic reference.

4.6 Test Case: Subquery as a Join Filter
Input:

source=customers
| join source=orders on customers.customer_id = orders.customer_id
| where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id)

Expected Output: Customers who have returned items with a "damaged" return reason.

4.7. Test Case: Subquery in Join Clause
Input:

source=users
| join source=purchases on users.user_id in (subquery source=loyal_customers | where status="gold" | return user_id)

Expected Output: Only users who are gold-status loyal customers are joined with purchase data.

4.8 Test Case: Subquery with LEFT JOIN
Input:

source=customers
| left join source=orders on customers.customer_id = orders.customer_id
| where orders.product_id in (subquery source=products | where category="furniture" | return product_id)

Expected Output: All customers are returned, but only those who purchased furniture will have non-null order data.

4.9 Test Case: Subquery with Multiple Joins
Input:

source=employees
| join source=departments on employees.department_id = departments.department_id
| join source=projects on employees.project_id = projects.project_id
| where projects.budget in (subquery source=budgets | where budget_approved="true" | return budget_id)

Expected Output: Employees working in departments and on projects with approved budgets are returned.

4.10 Error Handling in Joins with Subquery
Unmatched Subquery Results: If a subquery returns no results, the joined query should behave consistently with the join type. For an INNER JOIN, no rows are returned. For a LEFT JOIN, unmatched rows from the left source should still appear.

Cyclic Subquery: As with non-join queries, cyclic subqueries should be detected, and an appropriate error message should be returned.

@YANG-DB YANG-DB added enhancement New feature or request untriaged Lang:PPL Pipe Processing Language support labels Sep 14, 2024
@YANG-DB YANG-DB moved this to Todo in PPL Commands Sep 14, 2024
@YANG-DB YANG-DB removed the untriaged label Sep 14, 2024
@YANG-DB YANG-DB moved this from Todo to Design in PPL Commands Sep 19, 2024
@LantaoJin
Copy link
Member

LantaoJin commented Sep 27, 2024

@YANG-DB Should we rename the title to Support In Subquery in PPL? From the description, it is not the entire subsearch or subquery functionality.
The most cases in the description is to request a InSubquery expression. As you know the where command syntax is:

| where <boolean expression>

So the subquery in description is part of boolean expression, such as

| where orders.order_id in (subquery source=returns | where return_reason="damaged" | return order_id)

The orders.order_id in (subquery source=...) is a <boolean expression>. In general, we name this kind of subquery clause the InSubquery expression, it is a <boolean expression>, one kind of subquery expressions.

PS: there are many kinds of subquery expressions, another commonly used one is ScalarSubquery expression:
The first example in section 2.3 Use Case 3: Subquery with Different Join Types in issue description is a ScalarSubquery:

source=employees
| join source=sales on employees.employee_id = sales.employee_id
| where sales.sale_amount > (subquery source=targets | where target_met="true" | return target_value)

Recall the join command doc: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/PPL-Join-command.md#more-examples, the example I wrote is a subquery/subsearch plan, rather than a expression.

SEARCH source=customer
| FIELDS c_custkey
| LEFT OUTER JOIN left = c, right = o ON c.c_custkey = o.o_custkey
   [
      SEARCH source=orders
      | WHERE o_comment NOT LIKE '%unusual%packages%'
      | FIELDS o_orderkey, o_custkey
   ]
| STATS ...

simply to

SEARCH <leftPlan>
| LEFT OUTER JOIN ON <condition>
   [
      <rightPlan>
   ]
| STATS ...

Apply the syntax here and simply to

search <leftPlan> | left join on <condition> (subquery search ...)

The (subquery search ...) is not a expression, it's plan, similar to the relation plan

@LantaoJin
Copy link
Member

LantaoJin commented Sep 29, 2024

Base on #661 (comment), I created sub-tasks: #710 #711 #712 and #713, check the details in each issue.
Will implement them step by step.

@YANG-DB YANG-DB changed the title [FEATURE]support subquery in ppl [FEATURE]support InSubquery in ppl Sep 30, 2024
@LantaoJin LantaoJin moved this from Design to Done in PPL Commands Oct 8, 2024
@LantaoJin LantaoJin moved this from Done to In Progress in PPL Commands Oct 8, 2024
@YANG-DB YANG-DB added the 0.6 label Oct 9, 2024
@LantaoJin LantaoJin changed the title [FEATURE]support InSubquery in ppl [FEATURE]support Subquery in ppl Oct 11, 2024
@LantaoJin LantaoJin self-assigned this Oct 11, 2024
@YANG-DB YANG-DB moved this from In Progress to InReview in PPL Commands Oct 15, 2024
@LantaoJin
Copy link
Member

Issue was addressed by sub-tasks: #710 #711 #712 and #713. Close this.

@github-project-automation github-project-automation bot moved this from InReview to Done in PPL Commands Oct 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0.6 enhancement New feature or request Lang:PPL Pipe Processing Language support
Projects
Status: Done
Development

No branches or pull requests

2 participants