Project profile

This page presents the main concepts, the data structures used, and the core methods for processing large graphs and networks in a Hadoop environment.

Representation of data sets

  1. real world objects (RWO) are expressed as Nodes with a unique NodeID
  2. measured data (the properties of a RWO) is stored in a hashed data container with time stamps as keys
  3. correlations between RWO, at a certain time or averaged over a certain time frame, are expressed as NodePairs, represented by a hashed triple of values: <timeslice,[NodeID,NodeID,(key,value)]>; the key identifies the type of the stored data value, which can be measured or calculated (see the sketch after this list)
  4. a time slice represents a discrete point in time or a time frame
  5. a group of Nodes is handled as a NodeGroup
  6. based on the NodeGroup one can calculate all possible correlations for pairs of nodes in this group
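
A minimal sketch of this data model in Java follows; the class and field names are illustrative assumptions, not the project's actual API:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the data model described above.
    public class NodePair {

        // A real world object (RWO), identified by a unique NodeID.
        static class Node {
            final long nodeId;
            // Measured properties, hashed with time stamps as keys.
            final Map<Long, Double> measurements = new HashMap<Long, Double>();
            Node(long nodeId) { this.nodeId = nodeId; }
        }

        // Correlation between two RWO for one time slice:
        // <timeslice,[NodeID,NodeID,(key,value)]>
        final long timeslice;   // discrete point in time or a time frame
        final long nodeIdA;
        final long nodeIdB;
        final String key;       // type of the stored value (measured or calculated)
        final double value;

        NodePair(long timeslice, long a, long b, String key, double value) {
            this.timeslice = timeslice;
            this.nodeIdA = a;
            this.nodeIdB = b;
            this.key = key;
            this.value = value;
        }

        // A NodeGroup of n Nodes yields n*(n-1)/2 possible NodePairs (item 6).
        static int numberOfPairs(int groupSize) {
            return groupSize * (groupSize - 1) / 2;
        }
    }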

Graphs and Networks

  • a Graph is just a set of NodeIDs with a single value, which represents the link strength
  • a Network can contain a kind of payload (different node properties, different link properties), which can be mapped to a special "view"; such a view is just a graph
  • multiple views can be created from one Network, and classification or characterization algorithms are used to calculate global properties of network views, which are stored as time series connected with the Network (see the sketch below)
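
As an illustration of the view mechanism, here is a hedged sketch (all names are assumptions): a Network keeps several link properties as payload, and projecting one property key yields a plain graph:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: a Network holds a payload of several link
    // properties per link; a "view" projects one property key out of that
    // payload, yielding a plain weighted graph.
    public class NetworkView {

        // link "a:b" -> (property key -> value)
        private final Map<String, Map<String, Double>> links =
                new HashMap<String, Map<String, Double>>();

        public void setLinkProperty(long a, long b, String key, double value) {
            String link = a + ":" + b;
            Map<String, Double> payload = links.get(link);
            if (payload == null) {
                payload = new HashMap<String, Double>();
                links.put(link, payload);
            }
            payload.put(key, value);
        }

        // Project the Network onto one property key: the result is just a
        // graph (link -> single strength value).
        public Map<String, Double> view(String key) {
            Map<String, Double> graph = new HashMap<String, Double>();
            for (Map.Entry<String, Map<String, Double>> e : links.entrySet()) {
                Double v = e.getValue().get(key);
                if (v != null) graph.put(e.getKey(), v);
            }
            return graph;
        }
    }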

IO & Storage

  • data is stored in flat files (Pajek format, plain NodeID lists, or NodePair-Lists)
  • data is distributed by mapping it to HBase or HDFS
  • depending on the amount of data, the distributed cache can be used as well
  • collection of measured data is done by Flume, Cassandra, HBase, and OpenTSDB
  • primary import/export formats are:
  1. Pajek Format
  2. NodeID lists: one column for a graph, more than one column for a network
  3. NodePair-Lists: two columns for a graph, three columns for a weighted graph, and more than three columns for a network (see the parser sketch below)
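
The column rules for NodePair-Lists can be summarized in a small parser sketch; the whitespace separator and field order are assumptions:

    // Sketch of the column rules above (separator and field order assumed):
    // 2 columns -> plain graph edge, 3 columns -> weighted graph edge,
    // more than 3 -> network link with additional payload columns.
    public class NodePairLineParser {

        public static void parse(String line) {
            String[] cols = line.trim().split("\\s+");
            long a = Long.parseLong(cols[0]);
            long b = Long.parseLong(cols[1]);
            if (cols.length == 2) {
                System.out.println("graph edge: " + a + " - " + b);
            } else if (cols.length == 3) {
                double w = Double.parseDouble(cols[2]);
                System.out.println("weighted edge: " + a + " - " + b + " (" + w + ")");
            } else {
                System.out.println("network link with " + (cols.length - 2)
                        + " payload columns");
            }
        }

        public static void main(String[] args) {
            parse("1 2");
            parse("1 2 0.75");
            parse("1 2 0.75 0.10 3");
        }
    }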

Algorithms

  • MapReduce Jobs are combined iteratively for several graph processing tasks (see the driver sketch below)
  • UpdateSynch Jobs are used for more sophisticated tasks performed on a network
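
A hedged sketch of the iterative pattern with the Hadoop MapReduce API: the output of one round becomes the input of the next; the Mapper and Reducer classes are placeholders left out here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: chain MapReduce jobs until a fixed number of rounds is done
    // (a convergence check could replace the fixed count).
    public class IterativeGraphDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            int iterations = Integer.parseInt(args[1]);

            for (int i = 0; i < iterations; i++) {
                Path output = new Path(args[0] + "_iter_" + (i + 1));

                Job job = new Job(conf, "graph-iteration-" + (i + 1));
                job.setJarByClass(IterativeGraphDriver.class);
                // job.setMapperClass(GraphMapper.class);    // placeholder
                // job.setReducerClass(GraphReducer.class);  // placeholder
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    System.exit(1); // abort the chain on failure
                }
                input = output;     // feed this round's result into the next one
            }
        }
    }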

Workflow

  • data preprocessing (secondary index creation, parsing, filtering) is done by SMILA
  • for NLP tasks we use OpenNLP
  • MR-Jobs for data management are orchestrated using Oozie (a submission sketch follows)
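
For illustration, such a workflow could be submitted from Java via the Oozie client API; the server URL, HDFS application path, and cluster addresses below are placeholders:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    // Sketch of kicking off an Oozie-coordinated MR workflow from Java.
    public class SubmitWorkflow {

        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode/user/hadoop/workflows/graph-jobs"); // placeholder
            conf.setProperty("nameNode", "hdfs://namenode:8020");        // placeholder
            conf.setProperty("jobTracker", "jobtracker:8021");           // placeholder

            String jobId = client.run(conf); // submit and start the workflow
            System.out.println("Workflow job submitted: " + jobId);
        }
    }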

Visualisation

  • Networks and graphs are plotted with Jung2, the Google Chart API, or the Google Maps API (see the Jung2 sketch below)
  • several export adapters can transfer the data to Matlab, Cytoscape, Origin, or other scientific platforms
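
A minimal Jung2 example, plotting a small triangle graph in a Swing window (vertex and edge types are plain Strings for simplicity):

    import java.awt.Dimension;
    import javax.swing.JFrame;

    import edu.uci.ics.jung.algorithms.layout.CircleLayout;
    import edu.uci.ics.jung.algorithms.layout.Layout;
    import edu.uci.ics.jung.graph.Graph;
    import edu.uci.ics.jung.graph.UndirectedSparseGraph;
    import edu.uci.ics.jung.visualization.VisualizationViewer;

    // Build a tiny graph and display it with a circular layout.
    public class PlotGraph {

        public static void main(String[] args) {
            Graph<String, String> g = new UndirectedSparseGraph<String, String>();
            g.addVertex("A");
            g.addVertex("B");
            g.addVertex("C");
            g.addEdge("A-B", "A", "B");
            g.addEdge("B-C", "B", "C");
            g.addEdge("C-A", "C", "A");

            Layout<String, String> layout = new CircleLayout<String, String>(g);
            layout.setSize(new Dimension(300, 300));

            VisualizationViewer<String, String> vv =
                    new VisualizationViewer<String, String>(layout);

            JFrame frame = new JFrame("Graph view");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.getContentPane().add(vv);
            frame.pack();
            frame.setVisible(true);
        }
    }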