
PPL geoip function - old reference PR #781

Closed
wants to merge 1 commit into from

Conversation

@salyh (Contributor) commented Oct 16, 2024

Description

PPL geoip function

Issues Resolved

#672

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@salyh (Contributor, Author) commented Oct 16, 2024

@YANG-DB @penghuo @lukasz-soszynski-eliatra @kt-eliatra @dr-lilienthal PPL geoip function syntax proposal

Please review and comment if needed

@YANG-DB (Member) commented Oct 16, 2024

@salyh thanks
On my end this looks good. My only question is the datasource definition: how does the user create a new datasource? Can/should the datasource be one of the following:

  • external API
  • table
  • file

Let's add these capabilities ...

@YANG-DB (Member) commented Oct 16, 2024

@LantaoJin @penghuo @dai-chen please add your comments here

@LantaoJin LantaoJin added Lang:PPL Pipe Processing Language support 0.6 labels Oct 17, 2024
@salyh (Contributor, Author) commented Oct 17, 2024

> @salyh thanks
> On my end this looks good. My only question is the datasource definition: how does the user create a new datasource? Can/should the datasource be one of the following:
>
> * external API
> * table
> * file
>
> Let's add these capabilities ...

@YANG-DB @LantaoJin @penghuo @dai-chen

datasource refers to the datasources provided by the ip2geo processor https://opensearch.org/docs/latest/ingest-pipelines/processors/ip2geo/

My idea was to leverage the capabilities already present in the ip2geo processor and extend them as necessary, since that would avoid code duplication. If I read the processor docs correctly, GeoLite2 is now supported. I propose adding the functionality requested above to the processor, and then reusing the processor's code (as a library) to solve this issue. Otherwise we would reinvent the wheel.

There is also https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/geoip/ in Data Prepper, which could be leveraged as well.

@YANG-DB YANG-DB self-requested a review October 17, 2024 21:58
@YANG-DB (Member) commented Oct 17, 2024

> @salyh thanks
> On my end this looks good. My only question is the datasource definition: how does the user create a new datasource? Can/should the datasource be one of the following:
>
> * external API
> * table
> * file
>
> Let's add these capabilities ...
>
> @YANG-DB @LantaoJin @penghuo @dai-chen
>
> datasource refers to the datasources provided by the ip2geo processor https://opensearch.org/docs/latest/ingest-pipelines/processors/ip2geo/
>
> My idea was to leverage the capabilities already present in the ip2geo processor and extend them as necessary, since that would avoid code duplication. If I read the processor docs correctly, GeoLite2 is now supported. I propose adding the functionality requested above to the processor, and then reusing the processor's code (as a library) to solve this issue. Otherwise we would reinvent the wheel.
>
> There is also https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/geoip/ in Data Prepper, which could be leveraged as well.

@salyh this sounds like a good idea!
@LantaoJin @penghuo - any comments?

- See https://opensearch.org/docs/latest/ingest-pipelines/processors/ip2geo/


### New syntax definition in ANTLR
Review comment (Member):

As this is a user doc, could we remove the ANTLR-related information? What is needed here is just a user-readable syntax definition.

Review comment (Member):

Oh, the file is located in the ppl-lang/planning folder. I think we need a user doc instead of a design doc. We can copy this design information to the PR description or issue description, and then you can easily update your design without any code changes.

Review comment (Contributor, Author):

@YANG-DB My understanding was that the ppl-lang/planning folder is for design docs. Once the design is approved, it will be converted to a user doc and moved out of the planning sub-folder. Am I wrong with that assumption?

Review comment (Member):

I thought it was for the roadmap. I prefer using the GitHub PR description and comments. It's fine to add design docs in the code if you prefer to, but these docs are not for users; we need to write a new one later.

Review comment (Member):

@salyh yes - once the design doc is approved we will move it, with the necessary updates, to the regular commands folder

The geoip function adds information about the geographical location of an IPv4 or IPv6 address.

1. **Proposed syntax**
- `... | eval geoinfo = geoip([datasource,] ipAddress [,properties])`
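To make the proposal concrete, here are a few hypothetical invocations of the proposed syntax. The datasource name `my-datasource` and the field and property names are illustrative placeholders, not part of the proposal:

```
... | eval geoinfo = geoip(clientip)
... | eval geoinfo = geoip("my-datasource", clientip)
... | eval geoinfo = geoip("my-datasource", clientip, "country_iso_code,city_name")
```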
Review comment (Member):

What does datasource mean? Could you use another meaningful name rather than abc? abc is not a straightforward use case.

Review comment (Contributor, Author):

datasource refers to the datasources provided by the ip2geo processor, see #781 (comment)

@@ -781,6 +782,10 @@ coalesceFunctionName
: COALESCE
;

geoipFunction
: GEOIP LT_PRTHS (datasource = functionArg COMMA)? ipAddress = functionArg (COMMA properties = stringLiteral)? RT_PRTHS
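Spelled out, the two optional groups in this rule mean exactly three call shapes are accepted. A sketch with hypothetical argument names (only ipAddress is mandatory):

```
geoip(ipAddress)
geoip(datasource, ipAddress)
geoip(datasource, ipAddress, "properties")
```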
Review comment (Member):

datasource = functionArg

Is there any case for supporting functionArg instead of stringLiteral? Again, what does datasource mean?

Review comment (Contributor, Author):

No particular case for functionArg. We can change it to stringLiteral for sure.

datasource refers to the datasources provided by the ip2geo processor, see #781 (comment)

@kenrickyap (Contributor) commented Nov 4, 2024

Hi @salyh,

I'm currently looking into the implementation of this feature. I saw the above conversation on planning to leverage the ip2geo functionality within the geospatial plugin. Did you have a clear idea of how this can be done?

From my understanding, to leverage ip2geo one must first create an ip2geo processor and then attach it to an OpenSearch pipeline. I am not too sure how the Spark plugin could access this functionality. Some pointers would be much appreciated :)

Thanks!

@salyh (Contributor, Author) commented Nov 4, 2024

> Hi @salyh,
>
> I'm currently looking into the implementation of this feature. I saw the above conversation on planning to leverage the ip2geo functionality within the geospatial plugin. Did you have a clear idea of how this can be done?
>
> From my understanding, to leverage ip2geo one must first create an ip2geo processor and then attach it to an OpenSearch pipeline. I am not too sure how the Spark plugin could access this functionality. Some pointers would be much appreciated :)
>
> Thanks!

The idea is not to leverage the processor as a processor but to reuse its code (either by copying it - not ideal - or by separating it into a kind of shared commons library, or by adding it as a code-level dependency).

@kenrickyap (Contributor) commented:

Sounds good. My current idea is to create an SPI for the plugin, as was done for job-scheduler (allowing the geospatial plugin to implement ExtensiblePlugin); I am still working on a design. This would allow the sql plugin to access the ip2geo functionality (I am also working on the iplocation functionality there).

Would you know if the Spark project would be able to leverage the SPI, since it is not a plugin? If we are to expose the ip2geo functionality, I would rather create something that works for both the sql-plugin and Spark.

@kenrickyap (Contributor) commented Nov 4, 2024

Hi @salyh, I spent a bit more time investigating this ticket. The current plan is to rewrite the geo2ip functionality within the Spark project. The reason is that the geo2ip processor in the geospatial plugin is currently intertwined with the job scheduler extension, and from my understanding there is no way to leverage that extension in Spark, so a lot of the functionality could not be migrated.

The rewritten geo2ip within Spark will:

  • create a Spark temp table storing data from the geoip datasource if the table does not already exist
  • retrieve information from said table based on the provided IPs.

The current design question is whether it would be better to use a Spark UDF or reflection (via java_method) to call the rewritten geo2ip functionality.

@kenrickyap (Contributor) commented:

> Hi @salyh, I spent a bit more time investigating this ticket. The current plan is to rewrite the geo2ip functionality within the Spark project. The reason is that the geo2ip processor in the geospatial plugin is currently intertwined with the job scheduler extension, and from my understanding there is no way to leverage that extension in Spark, so a lot of the functionality could not be migrated.
>
> The rewritten geo2ip within Spark will:
>
> * create a Spark temp table storing data from the geoip datasource if the table does not already exist
> * retrieve information from said table based on the provided IPs.
>
> The current design question is whether it would be better to use a Spark UDF or reflection (via java_method) to call the rewritten geo2ip functionality.

Will follow the CIDR approach (https://github.com/opensearch-project/opensearch-spark/pull/706/files) and use a UDF.

@kenrickyap kenrickyap mentioned this pull request Nov 5, 2024
@YANG-DB YANG-DB changed the title PPL geoip function PPL geoip function - old reference PR Nov 13, 2024
@YANG-DB (Member) left a review comment:

Only a reference PR - DO NOT MERGE!

@YANG-DB YANG-DB closed this Dec 19, 2024