Graph Database Project

The purpose of this project was to demonstrate that connections on GitHub (repos and users) could be analyzed using a graph database and visualized as a network graph.

A series of json.gz files representing GitHub events was downloaded from GH Archive. Each file was processed and converted to the Apache TinkerPop Gremlin CSV format. The Gremlin files were then loaded into Amazon Neptune for further analysis.

View final visualization here: http://austinlasseter.com/amazon_neptune/

Folder structure

├── network_graph_app
│   ├── main.py
│   ├── helpers
│   │   ├── __init__.py
│   │   ├── setup_analysis.py
│   │   ├── process_json_files.py
│   │   ├── previz_prep.py
│   │   └── make_viz.py
│   ├── outputs
│   │   └── various html files
│   ├── internal
│   │   └── various jupyter notebooks
│   └── testing
│       └── various shell scripts for running `main.py`
└── gremlin_data_format
    ├── main.py
    ├── helpers
    │   ├── __init__.py
    │   ├── setup_analysis.py
    │   ├── process_json_files.py
    │   ├── load-csv-files.groovy
    │   └── query-gremlin.groovy
    ├── outputs
    │   └── two csv files
    └── internal
        └── various jupyter notebooks

Project 1. Demonstrate network relationships in event data

This project corresponds to the network_graph_app directory. Folder structure is outlined above.

This Python module is run by executing main.py.

  • It receives as input a set of json.gz files downloaded from GH Archive.
  • It also receives the ID of a "prime" repo, identified via the GitHub API, that has been flagged for investigation.
  • It produces an HTML file displaying a network graph of the relationships between repos and actors, starting with the prime repo and extending to the third circle of connectivity (see the sketch below). This HTML file is suitable for inclusion in a static website or other promotional material. Example HTML files are provided in the outputs folder.
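
The "third circle of connectivity" amounts to a breadth-first traversal to depth three around the prime repo. A minimal sketch of that idea in plain Python (the function and variable names here are illustrative, not taken from main.py):

from collections import deque

def neighborhood(adjacency, prime_repo, max_depth=3):
    """Collect every node within max_depth hops of the prime repo.

    adjacency maps each node (repo or actor) to the set of nodes it
    shares an event with; depth 3 is the "third circle".
    """
    seen = {prime_repo: 0}
    queue = deque([prime_repo])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand past the outermost circle
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen  # node -> hop distance from the prime repo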

In the testing folder there is a script covid19-calculator.sh that demonstrates the process for one repository.

Inputs for the analysis are not stored in the primary directory. The expectation is that they are stored in a parent directory outside the current directory, according to a file structure laid out in setup_analysis.py. The json.gz files that serve as the initial inputs should be downloaded from GH Archive into the appropriate data folder using a shell command such as:

wget https://data.gharchive.org/2015-01-{01..31}-{0..23}.json.gz
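
Each archive file is gzipped, newline-delimited JSON, one event per line. A minimal sketch of reading one file and extracting actor/repo pairs, assuming the post-2015 GH Archive event schema (process_json_files.py may differ in its details):

import gzip
import json

def extract_edges(path):
    """Yield (actor_login, repo_name, event_type) from one GH Archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            event = json.loads(line)
            yield (event["actor"]["login"],
                   event["repo"]["name"],
                   event["type"])

edges = list(extract_edges("2015-01-01-0.json.gz"))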

Project 2. Demonstrate that event data can be transformed to Gremlin format

This project corresponds to the gremlin_data_format directory. Folder structure is outlined above.

This Python module is run by executing main.py.

  • It receives as input a set of json.gz files downloaded from GH Archive.
  • It produces two CSV files, ready for uploading to Amazon Neptune.
  • It also produces example query results in Gremlin, designed for a use case similar to our own.

As with the previous project, inputs for the analysis are not stored in the primary directory, but rather in the parent directory laid out in setup_analysis.py.

According to the Amazon Neptune documentation, to load Apache TinkerPop Gremlin data using the CSV format, you must specify the vertices and the edges in separate files, with names such as vertex.csv and edge.csv. The required system column headers differ between vertex files and edge files, as follows:

  • Vertex headers: ~id, ~label.
  • Edge headers: ~id, ~from, ~to, ~label.

Example CSV files for one analysis are available in the outputs folder of the project directory.
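
A minimal sketch of writing the two files in that format with the standard csv module (the IDs and labels below are hypothetical; in practice main.py derives them from the event data):

import csv

# Hypothetical vertices and edges; real values come from the parsed events.
vertices = [("r1", "repo"), ("a1", "actor")]
edges = [("e1", "a1", "r1", "pushed_to")]

with open("vertex.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label"])   # Neptune system headers for vertices
    writer.writerows(vertices)

with open("edge.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label"])  # and for edges
    writer.writerows(edges)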

Apache Ecosystem

The Apache graph-database ecosystem has several components, and not all of them are necessary when you're getting set up with Amazon Neptune. Here's the full list of Apache tools:

  • Apache TinkerPop: an open-source, vendor-agnostic graph computing framework. It's one of the two frameworks supported by Neptune (the other is RDF, queried with SPARQL).
  • TinkerGraph: a lightweight, in-memory graph engine that serves as a reference implementation of the TinkerPop3 API. People often compare JanusGraph and TinkerGraph with Neo4j, Amazon Neptune, and ArangoDB.
  • Gremlin: the query language of TinkerPop. Neptune supports this too.
  • The Gremlin Console: one way (but not the only way) to interact with the Gremlin query language. You can skip the console and just write Python scripts with a connector, but doing so requires a server such as...
  • Gremlin Server: Apache's reference server for hosting a Gremlin-enabled graph, though alternative implementations exist. It is not needed with AWS, because Neptune itself fills the role that Gremlin Server plays.

For the purposes of this project, I installed TinkerPop, Gremlin, and the Gremlin Console. I experimented with Gremlin Server, which is necessary for running GremlinPython, but ultimately did not use it.

Setup for Project 2

TinkerPop and the Gremlin Console can be downloaded from the Apache website. They can also be installed using Docker:

docker pull tinkerpop/gremlin-console
docker run -it tinkerpop/gremlin-console

Note that there is no brew formula per se, but each of the client languages has its own package manager for installing the client, such as pip install gremlinpython for Python. Read more about installation at the Gremlin discussion board.
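
For reference, a minimal gremlinpython sketch of connecting to a graph. It assumes a Gremlin Server (or Neptune endpoint) is already running at the URL shown, which is exactly the dependency the console-only route below avoids:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Assumed endpoint; for Neptune this would be the cluster's wss:// endpoint.
connection = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(connection)

print(g.V().count().next())  # number of vertices in the graph
connection.close()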

Add the Gremlin Console to your PATH:

export PATH=/usr/local/apache-tinkerpop-gremlin-console-3.4.10/bin:$PATH

For the purposes of this project, I installed the Gremlin Console in the parent directory of the primary directory. When you install the Gremlin Console, it creates a directory called apache-tinkerpop-gremlin-console-3.4.10, which has a subdirectory called data. This is where I stored all data and outputs for my analysis.

Once installed, the Gremlin Console is launched from the apache-tinkerpop-gremlin-console-3.4.10 directory as follows:

bin/gremlin.sh

The Gremlin Console is an adaptation of the Groovy console and expects inputs written in the Groovy programming language. It is possible to load a Groovy script in the console as follows:

:load /path/to/file/load-csv-files.groovy

I have provided two Groovy scripts in the helpers directory. These can be loaded in the Gremlin console, one after the other, using the syntax provided above.

  • load-csv-files.groovy: initiates a graph traversal object g by combining the two CSV files vertex.csv and edge.csv.
  • query-gremlin.groovy: using the graph traversal object g, runs a series of queries against the GitHub events data, exploring relationships between actors and repos (a Python equivalent is sketched below).
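
For readers who prefer Python to the Groovy console, the same kinds of queries can be written with gremlinpython's traversal syntax. A sketch under stated assumptions: g is bound as in the earlier connection example, and the "actor"/"repo" vertex labels, "pushed_to" edge label, and property names are hypothetical stand-ins for whatever the CSV files actually define:

# Count repos and actors in the graph.
repo_count = g.V().hasLabel("repo").count().next()
actor_count = g.V().hasLabel("actor").count().next()

# Actors who pushed to a given (hypothetical) repo.
actors = (g.V().has("repo", "name", "octocat/Hello-World")
          .in_("pushed_to").values("login").toList())

# Repos within two hops of that repo, i.e. sharing a contributor.
nearby = (g.V().has("repo", "name", "octocat/Hello-World")
          .in_("pushed_to").out("pushed_to").dedup().values("name").toList())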

Examples of additional Gremlin queries can be found in this manual.

How does one load data into a property-graph database? There are basically three options:

  • Write a Gremlin Script to execute in the Gremlin Console to load your data.
  • If you have an especially large graph, consider BulkLoaderVertexProgram and Hadoop/Spark.
  • Consider the bulk loading tools available to the graph database you have chosen.

Visualization

  • The network image was created using the python-igraph visualization library (a minimal sketch follows).
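
A minimal python-igraph sketch of the general approach, with hypothetical edge data; make_viz.py handles the real styling and HTML export:

import igraph as ig

# Hypothetical actor/repo edges; in practice these come from the event data.
edges = [("octocat", "octocat/Hello-World"), ("hubot", "octocat/Hello-World")]

graph = ig.Graph.TupleList(edges, directed=False)
layout = graph.layout("fruchterman_reingold")  # force-directed placement
# Rendering to an image requires the pycairo (or cairocffi) backend.
ig.plot(graph, "network.png", layout=layout, vertex_label=graph.vs["name"])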
