
Add a script for importing GeoIP data #954

Merged (3 commits, Dec 3, 2024)

Conversation

normanj-bitquill (Contributor)

Description

  • Splits subnets as necessary to remove any overlap
  • Is a Scala script to run in Spark CLI
  • Can optionally import the data into a Spark table
  • Adds 3 extra columns that will be used to find the subnet for an IP address
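The integer-range idea behind those extra columns can be sketched as follows. This is a minimal illustration, not the script itself; `ipToLong` and `cidrToRange` are hypothetical helper names.

```scala
// A minimal sketch (not the script's actual code) of how a CIDR block maps to
// the integer range that the extra columns encode.

// Convert a dotted-quad IPv4 address to its unsigned 32-bit value in a Long.
def ipToLong(ip: String): Long =
  ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

// Convert a CIDR block like "10.0.0.0/8" to its inclusive [start, end] range.
def cidrToRange(cidr: String): (Long, Long) = {
  val Array(addr, prefix) = cidr.split('/')
  val size = 1L << (32 - prefix.toInt)
  val start = ipToLong(addr)
  (start, start + size - 1)
}
```

Storing the range as plain integers lets a lookup become a simple comparison instead of CIDR math at query time.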

Related Issues

N/A

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

@YANG-DB (Member) left a comment:

@normanj-bitquill
Nice. Can you please add a CSV file as an example of usage?
And please document the end-to-end process that enables queries based on this table.

to use these functions, a table needs to be created containing the geographic location
information.

## How to Create Geographic Location Index
Contributor:

Might be good to provide an example of downloading the CSV from https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json.

Member:

@normanj-bitquill @kenrickyap yes, let's please add a complete end-to-end section in the doc.

Contributor (Author):

Added an End-to-End section.

| Column Name | Description |
|-------------|--------------------------------------------------------------------|
| start | An integer value used to determine if an IP address is in a subnet |
| end | An integer value used to determine if an IP address is in a subnet |
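The lookup these columns enable can be sketched with an in-memory stand-in for the table. This is illustrative only: the real data lives in a Spark table, and the row values below are made up for the example.

```scala
// Illustrative in-memory version of the subnet lookup that the start/end
// columns enable. Row values are fabricated for the example.
case class GeoRow(cidr: String, start: Long, end: Long, country: String)

def ipToLong(ip: String): Long =
  ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

val rows = Seq(
  GeoRow("10.0.0.0/8", 167772160L, 184549375L, "Example-A"),
  GeoRow("192.168.0.0/16", 3232235520L, 3232301055L, "Example-B")
)

// Find the subnet containing an IP: the row whose [start, end] brackets it.
def lookup(ip: String): Option[GeoRow] = {
  val n = ipToLong(ip)
  rows.find(r => r.start <= n && n <= r.end)
}
```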
Contributor:

Would it be better to name the columns ip_range_start and ip_range_end, for clarity on what each column represents?

Contributor (Author):

Renamed.

@@ -0,0 +1,402 @@
import java.io.BufferedReader
kenrickyap (Contributor) commented Nov 28, 2024:

nit: probably need to add Apache header.

Suggested change, from:

    import java.io.BufferedReader

to:

    /*
     * Copyright OpenSearch Contributors
     * SPDX-License-Identifier: Apache-2.0
     */
    import java.io.BufferedReader

Contributor (Author):

Added the Apache header.

println("Done")
}

var INPUT_FILE: String = "/Users/normanj/Documents/geoip/geolite2-City.csv"
Contributor:

Might be better to replace the local paths with `<file_path_to_input_csv>` and `<file_path_to_output_csv>`.

Contributor (Author):

Changed the variable names.

kenrickyap (Contributor) left a comment:

Some comments. Also, were you able to test the IPv6 and address overlap logic?

ipv4Root.fixTree()
ipv6Root.fixTree()

println("Writing data to file")
Contributor:

It might be a better user experience to add a loading bar.

Member:

@kenrickyap love this idea!!

Contributor (Author):

When writing out the file, it will print the percentage every 10%. This is done for IPv4 and then again for IPv6.
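That kind of coarse reporting can be sketched as below. This is an illustration of the idea, not the script's actual code; it returns the milestones so they can be inspected, whereas the script prints them.

```scala
// Sketch of coarse progress reporting: report each 10% milestone crossed
// while processing `total` records.
def progressMilestones(total: Int): Seq[Int] =
  (1 to total).flatMap { i =>
    val pct = i * 100 / total
    val prev = (i - 1) * 100 / total
    // Emit a milestone only when a new 10% boundary is crossed.
    if (pct / 10 > prev / 10) Some(pct / 10 * 10) else None
  }
```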


1. Create a copy of the scala file `load_geoip_data.scala`
2. Edit the file
3. There are three variables that need to be updated.
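Based on the review thread, those variables presumably cover the input CSV path, the output path, and the optional Spark table name. Edited, the top of the copy might look like the following; the variable names here are placeholders, not necessarily the script's exact ones.

```scala
// Placeholder variable names; the actual names in load_geoip_data.scala may differ.
var FILE_PATH_TO_INPUT_CSV: String = "/path/to/geolite2-City.csv"
var FILE_PATH_TO_OUTPUT_CSV: String = "/path/to/geolite2-City-processed.csv"
var TABLE_NAME: String = "geoip" // optional Spark table to import the data into
```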
YANG-DB (Member) commented Nov 28, 2024:

@normanj-bitquill
Is this the "Edit the file" part from step 2? If so, it should be 2.1 / 2.2.
And can we also pass these inputs as parameters?

Contributor (Author):

The `:load` command in Spark only accepts one argument (the filename of the script to load).

I have fixed the formatting here.

@YANG-DB YANG-DB added Lang:PPL Pipe Processing Language support 0.7 labels Nov 30, 2024
@normanj-bitquill (Contributor, Author):

some comments. Also were you able to test the ipv6 and address overlap logic?

@kenrickyap Yes, I did test with overlaps. I added entries for both IPv4 and IPv6 to verify that it would split up the subnet properly so that there is no overlap.
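The splitting being discussed can be sketched in range terms: carving a more specific subnet's integer range out of a broader one leaves non-overlapping remainder ranges. The script itself uses a tree over the subnets; `subtractRange` below is a simplified, hypothetical illustration of the idea.

```scala
// Sketch of the overlap-removal idea: subtract a more specific subnet's
// integer range from a broader one, leaving non-overlapping remainders.
def subtractRange(outer: (Long, Long), inner: (Long, Long)): Seq[(Long, Long)] = {
  val (outerStart, outerEnd) = outer
  val (innerStart, innerEnd) = inner
  // Keep whatever lies before and after the carved-out inner range.
  val before = if (innerStart > outerStart) Seq((outerStart, innerStart - 1)) else Seq.empty
  val after  = if (innerEnd < outerEnd) Seq((innerEnd + 1, outerEnd)) else Seq.empty
  before ++ after
}
```

For example, carving 10.1.0.0/16 out of 10.0.0.0/8 leaves one range below it and one above it, so no row's range overlaps another.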

@YANG-DB (Member) commented Dec 2, 2024:

@normanj-bitquill LGTM 👍
I'll merge it later today

@YANG-DB YANG-DB merged commit d35d66b into opensearch-project:main Dec 3, 2024
4 checks passed
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* Add a script for importing GeoIP data
* Updated based on PR feedback
* Fixed formatting

Signed-off-by: Norman Jordan <[email protected]>
Labels
0.7 Lang:PPL Pipe Processing Language support

3 participants