Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] PPL lookup Functionality #2651

Open
brijos opened this issue May 3, 2024 · 6 comments
Open

[FEATURE] PPL lookup Functionality #2651

brijos opened this issue May 3, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request PPL Piped processing language

Comments

@brijos
Copy link

brijos commented May 3, 2024

Is your feature request related to a problem?
OpenSearch users want an easy way to enrich the data they have stored in OpenSearch and external data sources using content from an OpenSearch index. This is common in security analytics scenarios where one wants to enrich their IP reputation lists, vulnerability databases, or threat feeds.

What solution would you like?
Do a lookup of a field/value, from another log group and use that to convert to user friendly name/error code.

  • The lookup feature should support the OpenSearch indexes as data sources for lookup tables
  • Users should be able to perform a static lookup using the OpenSearch index and external data sources based on the Spark integration
  • Users should be able to perform a static lookup using a user generated CSV such as an org unit mapping or GeoIP from MaxMind and OpenSearch index
  • Users should be able to perform a static lookup using a user generated CSV such as an org unit mapping or GeoIP from MaxMind and an external data source
  • Admins can use Index State Management to control how long the reference lookup index is available
  • Include helpful error messages

*** Out of Scope ***

  • Defining new lookup data sources beyond listed types
  • Automatic field mapping between events and lookups

What alternatives have you considered?
Performing joins using SQL

Do you have any additional context?
None.

@brijos brijos added enhancement New feature or request untriaged labels May 3, 2024
@dblock
Copy link
Member

dblock commented Jun 24, 2024

Catch All Triage - 1 2 3 4 5 6

@salyh
Copy link

salyh commented Jul 16, 2024

PPL Lookup Design Proposal

As implemented in PR 2698 the proposed design (and so far implemented) syntax is:

Design

The lookup command can be implemented as a simple search for documents in the lookup index. For every row in a search result, a search with the given match fields are performed. If a single document is found, the fields and values of the lookup document are copied to the current row of the search result. If no document is found a no-op is performed. If multiple documents are found, an error is thrown. The implementation is mainly done in core/src/main/java/org/opensearch/sql/analysis/Analyzer.java and opensearch/src/main/java/org/opensearch/sql/opensearch/storage/OpenSearchIndex.java. The "lookup search" is performed as a term and a match query. So both cases are catched: When a field is not analyzed or analyzed with respect to it mapping.

The Spark PPL Lookup command implementation is done in separate PR in the opensearch-spark repo: PR 407. Here we can (and need) to implement it as a join.

Syntax

lookup <lookup index> <lookup field> [AS <local lookup field>] [<lookup field> [AS <local lookup field>]]… [appendonly=true|false] [<source field> [AS <local field>]]...

lookup is the name of the lookup operation and it is supposed to be changed to something else. Normally we would use lookup but this seems to already used otherwise in the AST.

<lookup index> is the name of the lookup index (mandatory).

Then we need at least one <lookup field> which is a field in the lookup index used to match to a local field (in the current search) to get the lookup document. When there is no lookup document we just do nothing, if there is more than one we fail with an error.

If more than one <lookup field> is provided, all of them must match (we do a term and a match query for the field value as of now)

If the field has a different name in the current search result use <local lookup field> to map it.

appendonly is false by default abnd inidicates if the values we copy over to the search result from the lookup documemnt should overwrite existing values. If appendonly is true we do not overwrite existing values.

<source field> are the fields that should be copied. If no such fields are given all fields are copied. If the field should have a different name than in the lookup document use [AS <local field>

Examples:

{"query":"source=logins | lookup users uid AS id appendonly=true"}
{"query":"source=logins | lookup users uid,name phone,department AS thedepartment"}

@Gokul-Radhakrishnan
Copy link

+1 for this feature

@LantaoJin
Copy link
Member

PPL Lookup Command Design

As implemented in opensearch-project/opensearch-spark#686 the proposed design doc:

Overview

Lookup command enriches your search data by adding or replacing data from a lookup index (dimension table).
You can extend fields of an index with values from a dimension table, append or replace values when lookup condition is matched.
As an alternative of Join command, lookup command is more suitable for enriching the source data with a static dataset.

Syntax of Lookup Command

SEARCH source=<sourceIndex>
| <other piped command>
| LOOKUP <lookupIndex> (<lookupMappingField> [AS <sourceMappingField>])...
    [(REPLACE | APPEND) (<inputField> [AS <outputField>])...]
| <other piped command>

lookupIndex

  • Required
  • Description: the name of lookup index (dimension table)

lookupMappingField

  • Required
  • Description: A mapping key in <lookupIndex>, analogy to a join key from right table. You can specify multiple <lookupMappingField> with comma-delimited.

sourceMappingField

  • Optional
  • Default: <lookupMappingField>
  • Description: A mapping key from source query, analogy to a join key from left side. If you don't specify any <sourceMappingField>, its default value is <lookupMappingField>.

inputField

  • Optional
  • Default: All fields of <lookupIndex> where matched values are applied to result output if no field is specified.
  • Description: A field in <lookupIndex> where matched values are applied to result output. You can specify multiple <inputField> with comma-delimited. If you don't specify any <inputField>, all fields of <lookupIndex> where matched values are applied to result output.

outputField

  • Optional
  • Default: <inputField>
  • Description: A field of output. You can specify multiple <outputField>. If you specify <outputField> with an existing field name in source query, its values will be replaced or appended by matched values from <inputField>. If the field specified in <outputField> is a new field, an extended new field will be applied to the results.

REPLACE | APPEND

  • Optional
  • Default: REPLACE
  • Description: If you specify REPLACE, matched values in <lookupIndex> field overwrite the values in result. If you specify APPEND, matched values in <lookupIndex> field only append to the missing values in result.

Usage

  • source = table1 | lookup table2 id
  • source = table1 | lookup table2 id, name
  • source = table1 | lookup table2 id as cid, name
  • source = table1 | lookup table2 id as cid, name replace dept as department
  • source = table1 | lookup table2 id as cid, name replace dept as department, city as location
  • source = table1 | lookup table2 id as cid, name append dept as department
  • source = table1 | lookup table2 id as cid, name append dept as department, city as location

Examples

SEARCH source=<sourceIndex>
| WHERE orderType = 'Cancelled'
| LOOKUP account_list, mkt_id AS mkt_code REPLACE amount, account_name AS name
| STATS count(mkt_code), avg(amount) BY name
SEARCH source=<sourceIndex>
| DEDUP market_id
| EVAL category=replace(category, "-", ".")
| EVAL category=ltrim(category, "dvp.")
| LOOKUP bounce_category category AS category APPEND classification
SEARCH source=<sourceIndex>
| LOOKUP bounce_category category

@YANG-DB YANG-DB moved this to Design in PPL Commands Oct 7, 2024
@YANG-DB YANG-DB added the PPL Piped processing language label Oct 7, 2024
@salyh
Copy link

salyh commented Oct 8, 2024

@YANG-DB guess we can close this one because it seems done in opensearch-project/opensearch-spark#686 ?

@YANG-DB
Copy link
Member

YANG-DB commented Oct 8, 2024

@YANG-DB guess we can close this one because it seems done in opensearch-project/opensearch-spark#686 ?

We are keeping this open since this issue refers to the PPL OpenSearch engine

@YANG-DB YANG-DB moved this from Design to In Progress in PPL Commands Oct 9, 2024
@YANG-DB YANG-DB moved this from In Progress to Design in PPL Commands Oct 9, 2024
@YANG-DB YANG-DB changed the title [FEATURE] PPL LOOKUP Functionality [FEATURE] PPL lookup Functionality Nov 28, 2024
@YANG-DB YANG-DB self-assigned this Nov 29, 2024
@YANG-DB YANG-DB moved this from Design to In Progress in PPL Commands Nov 30, 2024
This was referenced Dec 2, 2024
@YANG-DB YANG-DB removed the status in PPL Commands Dec 12, 2024
@YANG-DB YANG-DB moved this to InReview in PPL Commands Dec 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request PPL Piped processing language
Projects
Status: InReview
Development

No branches or pull requests

6 participants