Add partial indexing support for skipping index #124
Signed-off-by: Chen Dai <[email protected]>
  partitions
    .flatMap(_.files.map(f => f.getPath.toUri.toString))
    .toDF(FILE_PATH_COLUMN)
-   .join(indexScan, Seq(FILE_PATH_COLUMN), "left")
-   .filter(isnull(indexScan(FILE_PATH_COLUMN)) || new Column(indexFilter))
+   .join(indexScan.filter(not(new Column(indexFilter))), Seq(FILE_PATH_COLUMN), "anti")
- A left join would still work if flint-core supported array types, right?
- The reason for using a left anti join is performance, right? If so, could we add a test to guard it?
Based on my current understanding, an Anti Semi join is required. Even if we supported push-down optimization for array types, the OR condition in the previous Left Outer join could not be pushed down to the skipping index:
  SELECT left.file_path
  FROM partitions AS left
  LEFT JOIN indexScan AS right
    ON left.file_path = right.file_path
  WHERE right.file_path IS NULL
    OR [indexFilter]
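To make the two formulations in this thread concrete, here is a minimal sketch that models them with plain Scala collections instead of Spark DataFrames. Everything in it is hypothetical and for illustration only: `IndexRow`, `maxAge`, the file paths, and both function names are not taken from the PR, and the sketch assumes the index holds at most one row per source file.

```scala
object HybridScanSketch {
  // Hypothetical stand-in for one row of skipping index data.
  final case class IndexRow(filePath: String, maxAge: Int)

  // LEFT OUTER formulation: keep a file if it has no index row
  // (unknown to the index) or if its index row matches the filter.
  def selectWithLeftOuter(
      files: Seq[String],
      index: Seq[IndexRow],
      indexFilter: IndexRow => Boolean): Seq[String] =
    files.filter { f =>
      val rows = index.filter(_.filePath == f)
      rows.isEmpty || rows.exists(indexFilter)
    }

  // ANTI SEMI formulation: first find the files whose index row fails
  // the filter, then remove exactly those from the source file list.
  def selectWithAntiSemi(
      files: Seq[String],
      index: Seq[IndexRow],
      indexFilter: IndexRow => Boolean): Seq[String] = {
    val excluded = index.filterNot(indexFilter).map(_.filePath).toSet
    files.filterNot(excluded.contains)
  }

  // Hypothetical data: the index covers day=7 and day=8 but not day=9.
  val files: Seq[String] = Seq("day=7/a", "day=8/b", "day=9/c")
  val index: Seq[IndexRow] = Seq(IndexRow("day=7/a", 25), IndexRow("day=8/b", 40))
  val mayContainAge30: IndexRow => Boolean = _.maxAge >= 30
}
```

In this sketch both functions keep `day=8/b` (its index row matches) and `day=9/c` (not indexed, so it must be scanned), and skip `day=7/a`. The difference discussed above is in how the condition is shaped: the anti semi form applies a single negated filter to the index side only, rather than an OR that also inspects the outer side of the join.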
Will resolve the conflicts and reopen for review later.
Description
a. Remove old unit tests and add more logging
b. Restrict the filtering condition to partitioned columns only
a. Enforce hybrid scan per query if the skipping index is partial
b. Fix hybrid scan join bug: change from LEFT OUTER to ANTI SEMI join between the source file list and the index data
Documentation: https://github.com/dai-chen/opensearch-spark/blob/add-where-clause-for-skipping-index/docs/index.md#skipping-index
Create Partial Skipping Index Test
Create skipping index with filtering condition on non-partitioned column should fail:
Create skipping index with disjunction filtering condition (OR) should fail:
Create skipping index with filtering condition on partitioned column should succeed:
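The three cases above amount to a validation rule on the filtering condition. The following is a rough sketch of such a rule, not the PR's implementation: the tiny expression tree (`Compare`, `And`, `Or`) and the `validate` function are hypothetical, and accepting conjunctions (AND) of partitioned-column comparisons is an assumption that the test list does not confirm.

```scala
object PartialIndexFilterSketch {
  // Hypothetical, minimal expression tree for a WHERE condition.
  sealed trait Expr
  final case class Compare(column: String, op: String, value: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Returns Right(()) if the condition is acceptable, else Left(reason).
  def validate(filter: Expr, partitionColumns: Set[String]): Either[String, Unit] =
    filter match {
      // Disjunctions are rejected outright.
      case Or(_, _) =>
        Left("disjunction (OR) is not supported in the filtering condition")
      // Assumption: a conjunction is valid if both sides are valid.
      case And(l, r) =>
        validate(l, partitionColumns).flatMap(_ => validate(r, partitionColumns))
      // Comparisons must reference a partitioned column.
      case Compare(col, _, _) if !partitionColumns.contains(col) =>
        Left(s"filtering condition column '$col' is not a partitioned column")
      case Compare(_, _, _) =>
        Right(())
    }
}
```

With partition columns `year` and `month`, a condition on `year` passes, while a condition on a non-partitioned column such as `age`, or any condition containing OR, is rejected with a reason string.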
Skipping Index Query Rewrite Test
Query rewrite should enforce hybrid scan and thus include unknown files in skipping index:
Verified that a query using a partial skipping index hybrid-scans source files that are not included in the index data (day=9).
Issues Resolved
#89
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.