Project profile

This page presents the main concepts, the data structures used, and the core methods for processing large graphs and networks in a Hadoop environment.

Representation of data sets

  1. real world objects (RWO) are expressed as Nodes with a unique NodeID
  2. measured data (the properties of a RWO) is stored in a hashed data container with time stamps as keys
  3. correlations between RWO, at a certain time or averaged over a certain time frame, are expressed as NodePairs, represented by a hashed triple of values: <timeslice,[NodeID,NodeID,(key,value)]>; the key identifies the type of the stored data value, which can be measured or calculated (see the sketch after this list)
  4. a time slice represents a discrete point in time or a time frame
  5. a group of Nodes is handled as a NodeGroup
  6. based on the NodeGroup one can calculate all possible correlations for pairs of nodes in this group
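
A minimal sketch of this data model in Java follows; the class and field names are illustrative assumptions, not the project's actual API:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of the data model described above.
    public class NodePair {

        // A real world object (RWO), identified by a unique NodeID.
        static class Node {
            final long nodeId;
            // Measured properties, hashed with time stamps as keys.
            final Map<Long, Double> measurements = new HashMap<Long, Double>();
            Node(long nodeId) { this.nodeId = nodeId; }
        }

        // Correlation between two RWO for one time slice:
        // <timeslice,[NodeID,NodeID,(key,value)]>
        final long timeslice;   // discrete point in time or a time frame
        final long nodeIdA;
        final long nodeIdB;
        final String key;       // type of the stored value (measured or calculated)
        final double value;

        NodePair(long timeslice, long a, long b, String key, double value) {
            this.timeslice = timeslice;
            this.nodeIdA = a;
            this.nodeIdB = b;
            this.key = key;
            this.value = value;
        }

        // A NodeGroup of n Nodes yields n*(n-1)/2 possible NodePairs (item 6).
        static int numberOfPairs(int groupSize) {
            return groupSize * (groupSize - 1) / 2;
        }
    }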

Graphs and Networks

  • a Graph is just a set of NodeIDs with a single value, which represents the link strength
  • a Network can contain a kind of payload (different node properties, different link properties), which can be mapped to a special "view"; such a view is just a graph
  • multiple views can be created from one Network, and classification or characterization algorithms are used to calculate global properties of network views, which are stored as time series connected with the Network (see the sketch below)
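
As an illustration of the view mechanism, here is a hedged sketch (all names are assumptions): a Network keeps several link properties as payload, and projecting one property key yields a plain graph:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: a Network holds a payload of several link
    // properties per link; a "view" projects one property key out of that
    // payload, yielding a plain weighted graph.
    public class NetworkView {

        // link "a:b" -> (property key -> value)
        private final Map<String, Map<String, Double>> links =
                new HashMap<String, Map<String, Double>>();

        public void setLinkProperty(long a, long b, String key, double value) {
            String link = a + ":" + b;
            Map<String, Double> payload = links.get(link);
            if (payload == null) {
                payload = new HashMap<String, Double>();
                links.put(link, payload);
            }
            payload.put(key, value);
        }

        // Project the Network onto one property key: the result is just a
        // graph (link -> single strength value).
        public Map<String, Double> view(String key) {
            Map<String, Double> graph = new HashMap<String, Double>();
            for (Map.Entry<String, Map<String, Double>> e : links.entrySet()) {
                Double v = e.getValue().get(key);
                if (v != null) graph.put(e.getKey(), v);
            }
            return graph;
        }
    }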

IO & Storage

  • data is stored in flat files (Pajek format, plain NodeID lists, or NodePair-Lists)
  • data is distributed by mapping it to HBase or HDFS
  • depending on the amount of data, the distributed cache can be used as well
  • collection of measured data is done by Flume, Cassandra, HBase, and OpenTSDB
  • primary import/export formats are:
  1. Pajek Format
  2. NodeID lists: one column for a graph, more than one column for a network
  3. NodePair-Lists: two columns for a graph, three columns for a weighted graph, and more than three columns for a network (see the parser sketch below)
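
The column rules for NodePair-Lists can be summarized in a small parser sketch; the whitespace separator and field order are assumptions:

    // Sketch of the column rules above (separator and field order assumed):
    // 2 columns -> plain graph edge, 3 columns -> weighted graph edge,
    // more than 3 -> network link with additional payload columns.
    public class NodePairLineParser {

        public static void parse(String line) {
            String[] cols = line.trim().split("\\s+");
            long a = Long.parseLong(cols[0]);
            long b = Long.parseLong(cols[1]);
            if (cols.length == 2) {
                System.out.println("graph edge: " + a + " - " + b);
            } else if (cols.length == 3) {
                double w = Double.parseDouble(cols[2]);
                System.out.println("weighted edge: " + a + " - " + b + " (" + w + ")");
            } else {
                System.out.println("network link with " + (cols.length - 2)
                        + " payload columns");
            }
        }

        public static void main(String[] args) {
            parse("1 2");
            parse("1 2 0.75");
            parse("1 2 0.75 0.10 3");
        }
    }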

Algorithms

  • MapReduce Jobs are combined iteratively for several graph processing tasks (see the driver sketch below)
  • UpdateSynch Jobs are used for more sophisticated tasks performed on a network
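
A hedged sketch of the iterative pattern with the Hadoop MapReduce API: the output of one round becomes the input of the next; the Mapper and Reducer classes are placeholders left out here:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Sketch: chain MapReduce jobs until a fixed number of rounds is done
    // (a convergence check could replace the fixed count).
    public class IterativeGraphDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            int iterations = Integer.parseInt(args[1]);

            for (int i = 0; i < iterations; i++) {
                Path output = new Path(args[0] + "_iter_" + (i + 1));

                Job job = new Job(conf, "graph-iteration-" + (i + 1));
                job.setJarByClass(IterativeGraphDriver.class);
                // job.setMapperClass(GraphMapper.class);    // placeholder
                // job.setReducerClass(GraphReducer.class);  // placeholder
                FileInputFormat.addInputPath(job, input);
                FileOutputFormat.setOutputPath(job, output);

                if (!job.waitForCompletion(true)) {
                    System.exit(1); // abort the chain on failure
                }
                input = output;     // feed this round's result into the next one
            }
        }
    }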

Workflow

  • data preprocessing (secondary index creation, parsing, filtering) is done by SMILA
  • for NLP tasks we use OpenNLP
  • MR-Jobs for data management are orchestrated using Oozie (a submission sketch follows)
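
For illustration, such a workflow could be submitted from Java via the Oozie client API; the server URL, HDFS application path, and cluster addresses below are placeholders:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    // Sketch of kicking off an Oozie-coordinated MR workflow from Java.
    public class SubmitWorkflow {

        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode/user/hadoop/workflows/graph-jobs"); // placeholder
            conf.setProperty("nameNode", "hdfs://namenode:8020");        // placeholder
            conf.setProperty("jobTracker", "jobtracker:8021");           // placeholder

            String jobId = client.run(conf); // submit and start the workflow
            System.out.println("Workflow job submitted: " + jobId);
        }
    }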

Visualisation

  • Networks and graphs are plotted with Jung2, the Google Chart API, or the Google Maps API (see the Jung2 sketch below)
  • several export adapters can transfer the data to Matlab, Cytoscape, Origin, or other scientific platforms
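
A minimal Jung2 example, plotting a small triangle graph in a Swing window (vertex and edge types are plain Strings for simplicity):

    import java.awt.Dimension;
    import javax.swing.JFrame;

    import edu.uci.ics.jung.algorithms.layout.CircleLayout;
    import edu.uci.ics.jung.algorithms.layout.Layout;
    import edu.uci.ics.jung.graph.Graph;
    import edu.uci.ics.jung.graph.UndirectedSparseGraph;
    import edu.uci.ics.jung.visualization.VisualizationViewer;

    // Build a tiny graph and display it with a circular layout.
    public class PlotGraph {

        public static void main(String[] args) {
            Graph<String, String> g = new UndirectedSparseGraph<String, String>();
            g.addVertex("A");
            g.addVertex("B");
            g.addVertex("C");
            g.addEdge("A-B", "A", "B");
            g.addEdge("B-C", "B", "C");
            g.addEdge("C-A", "C", "A");

            Layout<String, String> layout = new CircleLayout<String, String>(g);
            layout.setSize(new Dimension(300, 300));

            VisualizationViewer<String, String> vv =
                    new VisualizationViewer<String, String>(layout);

            JFrame frame = new JFrame("Graph view");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.getContentPane().add(vv);
            frame.pack();
            frame.setVisible(true);
        }
    }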