The purpose of this project was to demonstrate that connections on GitHub (repos and users) can be analyzed using a graph database and visualized as a network graph. A series of `json.gz` files representing GitHub events was downloaded from GitHub Archive. Each file was processed and converted to the Gremlin load data format, then loaded into Amazon Neptune for further analysis.
View the final visualization here: http://austinlasseter.com/amazon_neptune/
```
network_graph_app/
├── main.py
├── helpers/
│   ├── __init__.py
│   ├── setup_analysis.py
│   ├── process_json_files.py
│   ├── previz_prep.py
│   └── make_viz.py
├── outputs/
│   └── various html files
├── internal/
│   └── various jupyter notebooks
└── testing/
    └── various shell scripts for running main.py

gremlin_data_format/
├── main.py
├── helpers/
│   ├── __init__.py
│   ├── setup_analysis.py
│   ├── process_json_files.py
│   ├── load-csv-files.groovy
│   └── query-gremlin.groovy
├── outputs/
│   └── two csv files
└── internal/
    └── various jupyter notebooks
```
This project corresponds to the `network_graph_app` directory. Folder structure is outlined above. This python module is activated by running `main.py`.

- It receives as input a set of `json.gz` files downloaded from GitHub Archive.
- It also receives the ID of a "prime" repo that has been flagged for investigation, identified via the GitHub API (see the sketch following this list).
- It produces an html file displaying a network graph of the relationships between repos and actors, starting with the prime repo and extending to the third circle of connectivity. This html file is suitable for inclusion in a static website or other promotional material. Example html files are provided in the `outputs` folder.
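The prime repo's numeric ID can be retrieved from the GitHub REST API. A minimal sketch, assuming the standard `/repos/{owner}/{repo}` endpoint (the repository name below is a placeholder, not necessarily the one used in the analysis):

```python
import json
import urllib.request

# Hypothetical example: look up the numeric ID of a repo flagged for
# investigation. Replace owner/repo with the repository of interest.
url = "https://api.github.com/repos/octocat/Hello-World"
with urllib.request.urlopen(url) as response:
    repo = json.load(response)

prime_repo_id = repo["id"]  # numeric ID used as the starting point of the graph
print(prime_repo_id)
```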
In the `testing` folder there is a script, `covid19-calculator.sh`, that demonstrates the process for one repository.
Inputs for the analysis are not stored in the primary directory. The expectation is that they are stored in a parent directory outside of the current directory, according to a file structure laid out in `setup_analysis.py`. The necessary `json.gz` files, which are the initial inputs, should be downloaded from GitHub Archive using a shell command from the appropriate `data` folder, such as:

```
wget https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz
```
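Each of these files contains one gzip-compressed JSON event per line. Here is a hedged sketch of scanning such a file for events that touch the prime repo; the field names follow the GitHub Archive event schema, but the filtering logic is illustrative, not the exact code in `process_json_files.py`:

```python
import gzip
import json

def iter_events(path):
    """Yield one GitHub Archive event per line of a json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Illustrative filter: collect (actor, repo) pairs for events on the prime repo.
prime_repo_id = 123456789  # placeholder ID obtained from the GitHub API
edges = [
    (event["actor"]["login"], event["repo"]["name"])
    for event in iter_events("2015-01-01-0.json.gz")
    if event["repo"]["id"] == prime_repo_id
]
```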
This project corresponds to the `gremlin_data_format` directory. Folder structure is outlined above. This python module is activated by running `main.py`.

- It receives as input a set of `json.gz` files downloaded from GitHub Archive.
- It produces two CSV files, ready for uploading to AWS Neptune.
- It also produces example query results in Gremlin, designed for a use-case similar to our own.

As with the previous project, inputs for the analysis are not stored in the primary directory, but rather in the parent directory laid out in `setup_analysis.py`.
According to the AWS Neptune documentation, to load Apache TinkerPop Gremlin data using the CSV format, you must specify the vertices and the edges in separate files, with names similar to `vertex.csv` and `edge.csv`. The required and allowed system column headers are different for vertex files and edge files, as follows:

- Vertex headers: `~id`, `~label`.
- Edge headers: `~id`, `~from`, `~to`, `~label`.
Example CSV files for one analysis are available in the `outputs` folder of the project directory.
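As an illustration only (the labels, IDs, and the extra property column below are invented, not taken from the project's actual outputs), the two files could be written with Python's `csv` module like this:

```python
import csv

# Vertices: the required system columns are ~id and ~label; additional
# property columns such as name:String are optional.
with open("vertex.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label", "name:String"])
    writer.writerow(["repo-123", "repo", "example/repo"])
    writer.writerow(["user-456", "actor", "octocat"])

# Edges: the required system columns are ~id, ~from, ~to, and ~label.
with open("edge.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label"])
    writer.writerow(["e-1", "user-456", "repo-123", "pushed_to"])
```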
There are several different components to the Apache graph stack, and not all of them are necessary when you're getting set up with Amazon Neptune. Here's the full list of Apache tools:

- Apache TinkerPop: an open-source, vendor-agnostic graph computing framework. It's one of the two frameworks supported by Neptune (the other is RDF).
- TinkerGraph: a lightweight, in-memory graph engine that serves as a reference implementation of the TinkerPop3 API. People often compare JanusGraph and TinkerGraph with Neo4j, Amazon Neptune, and ArangoDB.
- Gremlin: the query language of TinkerPop. Neptune supports this too.
- The Gremlin Console: one way (but not the only way) to interact with the Gremlin query language. You can skip the console and just write Python scripts with a connector (a sketch follows this list). But doing so requires a server such as...
- The Gremlin Server: provided by Apache, though several competing options exist. It is not used with AWS, because Neptune itself fills the role that Gremlin Server would otherwise play.
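To make the "Python scripts with a connector" option concrete, here is a minimal sketch using `gremlinpython` against a locally running Gremlin Server; the endpoint URL is a placeholder:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder endpoint: a Gremlin Server on localhost (a Neptune cluster
# endpoint would take the same form, on port 8182).
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

print(g.V().count().next())  # number of vertices in the graph
conn.close()
```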
For the purposes of this project, I installed TinkerPop, Gremlin, and the Gremlin Console. I experimented with Gremlin Server, which is necessary for running GremlinPython, but ultimately did not use it.
TinkerPop and the Gremlin Console can be downloaded from the Apache website. They can also be installed using Docker:

```
docker pull tinkerpop/gremlin-console
docker run -it tinkerpop/gremlin-console
```
Note that there is no brew formula per se, but each of the client languages can be installed with its respective package manager, such as `pip install gremlinpython`. Read more about installation at the Gremlin discussion board.
Add the Gremlin Console's `bin` directory to PATH (PATH entries must be directories, not scripts):

```
export PATH=/usr/local/apache-tinkerpop-gremlin-console-3.4.10/bin:$PATH
```
For the purpose of this project, I installed the Gremlin Console in the parent directory of the primary directory. When you install the Gremlin Console, it creates a directory called `apache-tinkerpop-gremlin-console-3.4.10`, which has a subdirectory called `data`. This is where I stored all data and outputs for my analysis.
Once installed, the Gremlin console is initiated from the `apache-tinkerpop-gremlin-console-3.4.10` directory as follows:

```
bin/gremlin.sh
```
The Gremlin console is an adaptation of the Groovy console and expects inputs written in the Groovy programming language. It is possible to load a Groovy script in the console as follows:

```
:load /path/to/file/load-csv-files.groovy
```
I have provided two Groovy scripts in the `helpers` directory. These can be loaded in the Gremlin console, one after the other, using the syntax provided above.

- `load-csv-files.groovy`: initiates a graph traversal object `g` by combining the two CSV files `vertex.csv` and `edge.csv`.
- `query-gremlin.groovy`: using the graph traversal object `g`, conducts a series of queries into the GitHub events data, exploring relationships between actors and repos.
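For readers working in Python rather than Groovy, the same kind of query can be written with `gremlinpython`. A hedged example; the labels used here (`actor`, `pushed_to`) are assumptions for illustration, not necessarily the exact schema created by `load-csv-files.groovy`:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# For each actor, count the repos they are connected to. The "actor" and
# "pushed_to" labels are assumed, not confirmed from the Groovy scripts.
results = (
    g.V().hasLabel("actor")
     .project("login", "repo_count")
     .by("name")
     .by(__.out("pushed_to").count())
     .toList()
)
print(results)
conn.close()
```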
Examples of additional Gremlin queries can be found in this manual.
How does one load data into a property-graph database? There are basically three options:

- Write a Gremlin script and execute it in the Gremlin Console to load your data.
- If you have an especially large graph, consider BulkLoaderVertexProgram with Hadoop/Spark.
- Consider the bulk loading tools available for the graph database you have chosen (for Neptune, see the sketch following this list).
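For Neptune specifically, the bulk loading tool is an HTTP endpoint on the cluster itself. A hedged sketch of calling it from Python; the cluster endpoint, S3 bucket, and IAM role ARN are all placeholders, and the request must be issued from inside the cluster's VPC:

```python
import json
import urllib.request

# Placeholder values: swap in your own cluster endpoint, S3 path, and role.
payload = {
    "source": "s3://my-bucket/gremlin-csv/",  # folder holding vertex.csv and edge.csv
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
    "region": "us-east-1",
    "failOnError": "FALSE",
}
request = urllib.request.Request(
    "https://my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/loader",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response))  # returns a loadId that can be polled for status
```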
- The image was created using the python-igraph visualization library
- Several manuals and notes of internal meetings are saved in the CW shared drive on Google Docs.
- AWS Neptune: Using Gremlin to Access the Graph
- AWS Neptune: Gremlin Load Data Format
- Documentation for Tinkerpop 2.0
- StackOverflow: All Gremlin Posts
- StackOverflow: All posts by Stephen Mallette
- StackOverflow: All posts by Kelvin Lawrence
- Kelvin Lawrence's Gremlin Guide
- Apache Tinkerpop Installation Guide
- Jason Plurad's Guide to Loading Gremlin Data
- Gremlin Users Google Group, which is full of many great questions and answers
- ACloudGuru Tutorial: Go Serverless with a Graph Database - requires ACG subscription to access
- ACloudGuru Tutorial: Loading and Retrieving Data in Neptune