-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a script for importing GeoIP data #954
Conversation
* Splits subnets as necessary to remove any overlap * Is a Scala script to run in Spark CLI * Can optionally import the data into a Spark table * Adds 3 extra columns that will be used to find the subnet for an IP address Signed-off-by: Norman Jordan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@normanj-bitquill
Nice - can you please add a csv file as an example usage?
And document the end-to-end
process to allow queries based on this table
to use these functions, a table needs to be created containing the geographic location | ||
information. | ||
|
||
## How to Create Geographic Location Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to provide example for downloading csv from https://geoip.maps.opensearch.org/v1/geolite2-city/manifest.json.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@normanj-bitquill @kenrickyap yes lets please add a complete End2End section in the doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an End-to-End section.
docs/opensearch-geoip.md
Outdated
| Column Name | Description | | ||
|-------------|--------------------------------------------------------------------| | ||
| start | An integer value used to determine if an IP address is in a subnet | | ||
| end | An integer value used to determine if an IP address is in a subnet | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better to name columns ip_range_start
and ip_range_end
for clarity on what column represents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed.
@@ -0,0 +1,402 @@ | |||
import java.io.BufferedReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import java.io.BufferedReader | |
/* | |
* Copyright OpenSearch Contributors | |
* SPDX-License-Identifier: Apache-2.0 | |
*/ | |
import java.io.BufferedReader |
nit: probably need to add Apache header:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the Apache header.
docs/load_geoip_data.scala
Outdated
println("Done") | ||
} | ||
|
||
var INPUT_FILE: String = "/Users/normanj/Documents/geoip/geolite2-City.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to have replace local path to <file_path_to_input_csv> and <file_path_to_output_csv>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the variable names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments. Also were you able to test the ipv6 and address overlap logic?
ipv4Root.fixTree() | ||
ipv6Root.fixTree() | ||
|
||
println("Writing data to file") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better user experience to add loading bar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenrickyap love this idea!!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When writing out the file, it will print the percentage every 10%. This is done for IPv4 and then again for IPv6.
docs/opensearch-geoip.md
Outdated
|
||
1. Create a copy of the scala file `load_geoip_data.scala` | ||
2. Edit the file | ||
3. There are three variables that need to be updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@normanj-bitquill
is this the Edit the file
part from line 2 ?
if so it should be 2.1 / 2.2
And can we also give these input as parameters ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The :load
command in Spark only accepts one argument (the filename of the script to load).
I have fixed the formatting here.
Signed-off-by: Norman Jordan <[email protected]>
Signed-off-by: Norman Jordan <[email protected]>
@kenrickyap Yes, I did test with overlaps. I added entries for both IPv4 and IPv6 to verify that it would split up the subnet properly so that there is no overlap. |
@normanj-bitquill LGTM 👍 |
* Add a script for importing GeoIP data * Splits subnets as necessary to remove any overlap * Is a Scala script to run in Spark CLI * Can optionally import the data into a Spark table * Adds 3 extra columns that will be used to find the subnet for an IP address Signed-off-by: Norman Jordan <[email protected]> * Updated based on PR feedback Signed-off-by: Norman Jordan <[email protected]> * Fixed formatting Signed-off-by: Norman Jordan <[email protected]> --------- Signed-off-by: Norman Jordan <[email protected]>
Description
Related Issues
N/A
Check List
--signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.