PPL geoip function #871

Merged (10 commits), Dec 19, 2024

Conversation

kenrickyap
Contributor

@kenrickyap kenrickyap commented Nov 5, 2024

Description
PPL geoip function

Issues Resolved
#672

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This PR is a continuation of #781, as I lack permission to push to the forked branch in that PR.

@kenrickyap
Contributor Author

kenrickyap commented Nov 5, 2024

Hi @YANG-DB, I heard from Anas that you had a method of implementing ip2geo functionality for Spark. I wanted to check with you that our current approach aligns with your method.

Current Plan:
Leveraging SerializableUdf, create a UDF that does the following:

  1. Check whether an in-memory cache object for the datasource exists.
  2. If the cache object does not exist, create a new in-memory cache object from the CSV retrieved from the datasource manifest. (The manifest-to-CsvParser logic can be lifted from geospatial ip2geo.)
  3. Search the cached object for GeoIP data.
  4. Return the GeoIP data.

This PR has a stub UDF implementation to give a better idea of how this would be implemented.

All of this would have to be implemented within the Spark library, as currently I am not aware of how to access any geospatial artifacts. If you know of a better way to implement ip2geo please let me know! Thanks!
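
The four steps in the plan above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the stub from this PR: `GeoIpLookupSketch` and the `loader` parameter are hypothetical names, the loader stands in for "fetch the manifest and parse the CSV", and the lookup is a plain exact match rather than CIDR matching.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical holder for the per-datasource lookup tables (not a class from this PR). */
final class GeoIpLookupSketch {
    // One in-memory table per datasource name, built lazily on first use.
    private static final Map<String, Map<String, String>> CACHE = new ConcurrentHashMap<>();

    private GeoIpLookupSketch() {}

    /**
     * @param datasource name of the datasource (cache key)
     * @param ip         IP address to look up (exact match only in this sketch)
     * @param loader     assumed stand-in for "download manifest, parse CSV"
     */
    static Optional<String> lookup(String datasource, String ip,
                                   Function<String, Map<String, String>> loader) {
        // Steps 1-2: reuse the cached table, or build it once if it is missing.
        Map<String, String> table = CACHE.computeIfAbsent(datasource, loader);
        // Steps 3-4: search the table and return the geo data, if any.
        return Optional.ofNullable(table.get(ip));
    }
}
```

A static map like this lives once per JVM, so each Spark executor process would build and hold its own copy.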

@YANG-DB
Member

YANG-DB commented Nov 6, 2024

@kenrickyap please update the DCO (contributor sign-off)

Member

@YANG-DB YANG-DB left a comment

@kenrickyap I'm missing a more detailed design of the actual functionality that should include:

  • flow diagram of loading the geoIp data
  • storage of that data
  • utilizing of different datasource
  • pro/cons for the different approaches

please add the former discussions made into this issue for better traceability

import org.opensearch.sql.ast.expression.When;
import org.opensearch.sql.ast.expression.WindowFunction;
import org.opensearch.sql.ast.expression.Xor;
import org.opensearch.sql.ast.expression.*;
Member

please keep the existing explicit import list ...

@kenrickyap
Contributor Author

kenrickyap commented Nov 6, 2024

Flow diagram for ip2geo UDF

[Flow diagram image: mermaid-diagram-2024-11-06-140042]

Design Details

  • Ip2GeoCache will be a Guava Cache that has the datasource string as key and a CidrTree as value
  • CidrTree will be a trie (will use the Apache Commons PatriciaTrie, as I see that Apache Commons is already included in the project)
    • I am using a trie instead of a map as it is better suited to longest-prefix matching, and IP-to-CIDR matching is a similar task
    • each CidrTreeNode will store:
      • the nth bit value of the CIDR (n is the depth in the tree)
      • geo_data, if there is a matching CIDR row in the datasource CSV
      • child CidrTreeNodes
  • Will retrieve the CsvParser from the manifest using a similar methodology to geospatial ip2geo

Pros

  • ip2geo functionality is achieved.
  • implementation is simple and does not depend on any additional libraries that do not already exist in the project.

Cons

  • calculations are done in-memory as a UDF; this means that multiple instances of Ip2GeoCache will be created in distributed Spark systems and they will not sync.
  • does not leverage job-scheduler to run the ip2geo task, and does not leverage OpenSearch to store ip2geo data as geospatial does.
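
To make the CidrTree idea concrete, here is a minimal sketch of a bit-level trie doing longest-prefix matching. It is a hand-rolled trie rather than the Apache Commons PatriciaTrie mentioned above, the class and method names are hypothetical, and it assumes a tree holds a single address family (IPv4 or IPv6, not both).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical bit-level trie; the name echoes the design notes but this is not the PR's class. */
final class CidrTreeSketch {
    private static final class Node {
        final Node[] children = new Node[2]; // index = next bit of the address
        String geoData;                      // non-null only where a CIDR ends
    }

    private final Node root = new Node();

    /** Inserts a CIDR such as "10.1.0.0/16" together with its geo data. */
    void insert(String cidr, String geoData) {
        String[] parts = cidr.split("/");
        byte[] addr = toBytes(parts[0]);
        int prefixLen = Integer.parseInt(parts[1]);
        Node node = root;
        for (int i = 0; i < prefixLen; i++) {
            int bit = bitAt(addr, i);
            if (node.children[bit] == null) {
                node.children[bit] = new Node();
            }
            node = node.children[bit];
        }
        node.geoData = geoData;
    }

    /** Returns the geo data of the longest matching CIDR, or null if none matches. */
    String lookup(String ip) {
        byte[] addr = toBytes(ip);
        Node node = root;
        String best = root.geoData;
        // Walk down bit by bit, remembering the deepest node that carries data.
        for (int i = 0; i < addr.length * 8 && node != null; i++) {
            node = node.children[bitAt(addr, i)];
            if (node != null && node.geoData != null) {
                best = node.geoData;
            }
        }
        return best;
    }

    private static byte[] toBytes(String ip) {
        try {
            // For a literal IP address this parses the text without a DNS lookup.
            return InetAddress.getByName(ip).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not an IP address: " + ip, e);
        }
    }

    private static int bitAt(byte[] addr, int i) {
        return (addr[i / 8] >> (7 - i % 8)) & 1;
    }
}
```

A lookup walks at most one node per address bit, regardless of how many CIDR rows are stored.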

@YANG-DB
Member

YANG-DB commented Nov 7, 2024

@kenrickyap a few questions here:

  • how / is the Ip2GeoCache memory shared between sessions ? is it a singleton ?
  • plz also add the use case where the geo_data_source will be a table / index (in OpenSearch) / service (API) - let's create a general purpose facade for this to hide different datasource drivers
  • I would like to see a more detailed description of the CidrTreeNode including an example and simple high level explanations (the geo tree diagram ?)
  • explain why it is worthwhile to add this trie-tree instead of using a hash-map / search ?
  • can u also add a pseudo-code here (in the issue) to clarify the composition ?

@kenrickyap
Contributor Author

@kenrickyap a few questions here:

  • how / is the Ip2GeoCache memory shared between sessions ? is it a singleton ?

Yes, we will make Ip2GeoCache a singleton. From my understanding this should allow the cache to be accessible between sessions.

  • plz also add the use case where the geo_data_source will be a table / index (in OpenSearch) / service (API) - lets create a general purpose facade for this to hide different datasource drivers

Will update the flow diagram to reflect facade for different datasource.

However, would it be possible to provide an example usage of a service (API)? I am not too sure what such a datasource would be expected to return.

Also, would there be a fixed schema for an index in OpenSearch?
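
One possible shape for such a facade is sketched below, purely as an assumption: the thread does not define a schema or an interface, so `GeoRow`, `GeoIpDataSource`, and `InMemoryGeoIpDataSource` are hypothetical names. The point is only that the cache would consume rows without knowing whether they came from a manifest, an index, or an API.

```java
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical row: one CIDR block and its geo data. The real schema is not given in this thread. */
final class GeoRow {
    final String cidr;
    final String geoData;

    GeoRow(String cidr, String geoData) {
        this.cidr = cidr;
        this.geoData = geoData;
    }
}

/** Hypothetical facade: callers see rows, not the driver that produced them. */
interface GeoIpDataSource {
    Stream<GeoRow> rows();
}

/** In-memory driver, e.g. for tests; a manifest, index, or API driver would implement the same interface. */
final class InMemoryGeoIpDataSource implements GeoIpDataSource {
    private final List<GeoRow> rows;

    InMemoryGeoIpDataSource(List<GeoRow> rows) {
        this.rows = List.copyOf(rows); // defensive, immutable copy
    }

    @Override
    public Stream<GeoRow> rows() {
        return rows.stream();
    }
}
```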

  • I would like to see a more detailed description of the CidrTreeNode including an example and simple high level explanations (the geo tree diagram ?)

In hindsight I will just use a hashmap to store the CIDR geo_data, where the CIDR bitstring will be the key and geo_data will be the value. To perform the IP-to-CIDR matching, I will implement a lookup function that converts the IP to a bitstring and reduces the bit length until a key is found in the map.

Initially I wanted to leverage the prefixMap function, as I thought it would find the best-fitting CIDR mask for a given IP in O(1), which would mean I wouldn't have to implement my own lookup function. However, as I was trying to implement this I noticed that prefixMap takes a prefix and finds all keys that have that prefix, which is the opposite of what I want.
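
A minimal sketch of the hash-map lookup described here, assuming bitstring keys and a single address family per map; `CidrMapSketch` and its methods are hypothetical names, not the code in this PR.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical map-based CIDR matcher: key = prefix bits of the CIDR, value = geo data. */
final class CidrMapSketch {
    private final Map<String, String> geoByPrefix = new HashMap<>();

    /** Stores a CIDR such as "10.1.0.0/16" under its first 16 bits. */
    void put(String cidr, String geoData) {
        String[] parts = cidr.split("/");
        String bits = toBits(parts[0]);
        geoByPrefix.put(bits.substring(0, Integer.parseInt(parts[1])), geoData);
    }

    /** Shortens the IP's bitstring one bit at a time until a stored prefix matches. */
    String lookup(String ip) {
        String bits = toBits(ip);
        for (int len = bits.length(); len >= 0; len--) {
            String hit = geoByPrefix.get(bits.substring(0, len));
            if (hit != null) {
                return hit; // first hit is the longest matching prefix
            }
        }
        return null;
    }

    private static String toBits(String ip) {
        byte[] addr;
        try {
            // For a literal IP address this parses the text without a DNS lookup.
            addr = InetAddress.getByName(ip).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not an IP address: " + ip, e);
        }
        StringBuilder sb = new StringBuilder(addr.length * 8);
        for (byte b : addr) {
            for (int i = 7; i >= 0; i--) {
                sb.append((b >> i) & 1);
            }
        }
        return sb.toString();
    }
}
```

In this form a miss costs at most 33 map probes for an IPv4 address (prefix lengths 32 down to 0), which trades a bounded number of hash lookups for not having to maintain a tree.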

  • explain why it is worthwhile to add this trie-tree instead of using a hash-map / search ?

As mentioned above, I will use a hash-map instead of a trie.

  • can u also add a pseudo-code here (in the issue) to clarify the composition ?

Will have pseudo-code and code added by EOD.


geoip function to add information about the geographical location of an IPv4 or IPv6 address

1. **Proposed syntax**
Contributor

Perhaps also list out the values that can be used for properties.

Member

we need a larger set of examples here plz

Contributor Author

added additional examples in ppl_ip.md doc

@@ -894,6 +895,10 @@ coalesceFunctionName
: COALESCE
;

geoIpFunctionCall
: GEOIP LT_PRTHS (datasource = functionArg COMMA)? ipAddress = functionArg (COMMA properties = stringLiteral)? RT_PRTHS
Contributor

You could do further parsing of properties here. Something like:

... (COMMA properties = geoipPropertiesList)? ...

geoipPropertiesList
  : geoipProperty (COMMA geoipProperty)*
  ;

geoipProperty
  : city
  | lat
  | lon
  ;

Member

I like this idea !!

import org.opensearch.sql.ast.expression.Field;
import org.opensearch.sql.ast.expression.FieldList;
import org.opensearch.sql.ast.expression.LambdaFunction;
import org.opensearch.sql.ast.expression.*;
Contributor

Avoid using * imports.

Member

+1. Please configure your IDE to disable the automatic merging of imports.

Member

@YANG-DB YANG-DB left a comment

@kenrickyap can u plz update the PR status and progress ?
thanks

@kenrickyap
Contributor Author

@kenrickyap can u plz update the PR status and progress ? thanks

@YANG-DB we are nearly done with the implementation. I am currently testing the geoip functionality on TestDatasourceDao, which provides a mock stream of geo data.

@jduo is implementing the manifest DAO in parallel.

Also, would it be OK to move the API and OpenSearch index DAO implementations to another ticket? We are not sure what the specifications for these would be as there are no examples, and I am not sure this would be done by Nov 15th.

@YANG-DB
Member

YANG-DB commented Nov 13, 2024

@kenrickyap plz fix the DCO error ...


@kenrickyap kenrickyap changed the title from "[WIP] PPL geoip function pt. 2" to "PPL geoip function pt. 2" on Dec 11, 2024
@kenrickyap
Contributor Author

@LantaoJin would you be able to provide a review?

kenrickyap and others added 2 commits December 18, 2024 00:17
Signed-off-by: Kenrick Yap <[email protected]>
@andy-k-improving
Contributor

@kenrickyap I have only reviewed half of the changes so far and will continue tomorrow. Also, checkstyle is failing on CI; would you mind fixing it?
Thanks,

@YANG-DB
Member

YANG-DB commented Dec 18, 2024

@kenrickyap can u see why the build is failing ?

Signed-off-by: Kenrick Yap <[email protected]>
@kenrickyap
Contributor Author

@kenrickyap can u see why the build is failing ?

fixed the styling issues

Member

@YANG-DB YANG-DB left a comment

@kenrickyap I've added a few comments
plz review - thanks

Signed-off-by: Kenrick Yap <[email protected]>
@kenrickyap
Contributor Author

@kenrickyap I've added a few comments plz review - thanks

@YANG-DB I have addressed the new PR comments. I am not sure I fully understand some of them, so I have added my understanding of the unaddressed comments.

Member

@YANG-DB YANG-DB left a comment

OK LGTM !

@andy-k-improving
Contributor

andy-k-improving commented Dec 19, 2024

Thx for taking the time to address comments. :)

@kenrickyap kenrickyap changed the title from "PPL geoip function pt. 2" to "PPL geoip function" on Dec 19, 2024
@YANG-DB YANG-DB merged commit 20ef890 into opensearch-project:main Dec 19, 2024
4 checks passed
Labels
0.6 Lang:PPL Pipe Processing Language support