Project profile
This page presents the main concept, the data structures used, and the core methods for processing large graphs and networks in a Hadoop environment.
- real-world objects (RWO) are expressed as Nodes with a unique NodeID
- measured data (properties of an RWO) is stored in a hashed data container with time stamps as keys
- correlations between RWO at a certain time, or as an average over a certain time frame, are expressed as NodePairs, represented by a hashed triple of values: <timeslice,[NodeID,NodeID,(key,value)]>; the key identifies the type of the stored value, which can be measured or calculated (a minimal sketch of these structures follows below)
- a time slice represents a discrete point in time or a time frame
- a group of Nodes is handled as a NodeGroup
- based on the NodeGroup, one can calculate all possible correlations for pairs of nodes in this group
- a Graph is just a set of NodeIDs with one value, which represents a link strength
- a Network can carry a kind of payload (different node properties, different link properties) which can be mapped to a special "view"; such a view is just a graph
- multiple views can be created from one Network, and classification or characterization algorithms are used to calculate global properties of network views, which are stored as time series connected with the Network
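To make the data model concrete, here is a minimal Java sketch of Node, NodePair, and NodeGroup as described above. All class and member names are illustrative assumptions, not taken from the project source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Node: a real-world object (RWO) with a unique NodeID. (Names are hypothetical.) */
class Node {
    final String nodeId;
    // measured properties, hashed by time stamp
    final Map<Long, Double> measurements = new HashMap<Long, Double>();
    Node(String nodeId) { this.nodeId = nodeId; }
}

/** NodePair: the triple <timeslice,[NodeID,NodeID,(key,value)]>. */
class NodePair {
    final long timeSlice;      // discrete point in time or a time frame
    final String nodeA, nodeB; // the two NodeIDs
    final String key;          // type of the stored value (measured or calculated)
    final double value;        // e.g. a correlation coefficient
    NodePair(long timeSlice, String nodeA, String nodeB, String key, double value) {
        this.timeSlice = timeSlice; this.nodeA = nodeA; this.nodeB = nodeB;
        this.key = key; this.value = value;
    }
}

/** NodeGroup: a group of Nodes; every pair is a candidate for a correlation. */
class NodeGroup {
    final List<String> nodeIds = new ArrayList<String>();
    /** Enumerates all unordered pairs of NodeIDs in this group. */
    List<String[]> allPairs() {
        List<String[]> pairs = new ArrayList<String[]>();
        for (int i = 0; i < nodeIds.size(); i++)
            for (int j = i + 1; j < nodeIds.size(); j++)
                pairs.add(new String[] { nodeIds.get(i), nodeIds.get(j) });
        return pairs;
    }
}
```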
- data is stored in flat files (Pajek format, or just NodeID lists and NodePair lists)
- data is distributed by mapping it to HBase or HDFS (see the HBase sketch below)
- depending on the amount of data, the distributed cache can be used as well
- collection of measured data is done with Flume, Cassandra, HBase, or OpenTSDB
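As an illustration of the HBase mapping, the following hedged sketch stores one NodePair value, using the node pair as the row key, the value type as the column qualifier, and the time slice as the cell timestamp. The table name and column family are assumptions, not the project's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Stores one NodePair value in HBase; table and column family names are hypothetical. */
public class NodePairStore {
    public static void storePair(String nodeA, String nodeB, long timeSlice,
                                 String key, double value) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "nodepairs");   // hypothetical table name
        // Row key: the node pair; qualifier: the value type; cell timestamp: the time slice.
        Put put = new Put(Bytes.toBytes(nodeA + ":" + nodeB));
        put.add(Bytes.toBytes("d"), Bytes.toBytes(key), timeSlice, Bytes.toBytes(value));
        table.put(put);
        table.close();
    }
}
```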
- the primary import/export formats are:
  - Pajek format
  - NodeID lists: one column describes a graph, more than one column a network
  - NodePair lists: two columns describe a graph, three columns a weighted graph, and more than three columns a network (a small parsing sketch follows below)
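A hedged sketch of how a NodePair list line could be read; the whitespace separator and the column order are assumptions about the file layout.

```java
/** Minimal parser for one line of a NodePair list (whitespace-separated columns assumed). */
class NodePairListParser {

    /** Splits a line into its columns. */
    static String[] columns(String line) {
        return line.trim().split("\\s+");
    }

    /** Two columns: plain graph edge (default weight 1.0); three or more: weighted edge. */
    static double linkStrength(String[] cols) {
        return cols.length >= 3 ? Double.parseDouble(cols[2]) : 1.0;
    }
}
```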
- MapReduce jobs are combined iteratively for several graph-processing tasks (a minimal job sketch follows below)
- UpdateSynch jobs are used for more sophisticated tasks performed on a network
- data preprocessing (secondary index creation, parsing, filtering) is done by SMILA
- for NLP tasks we use OpenNLP
- MR jobs for data management are orchestrated using Oozie
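To show how one such MapReduce job could look, here is a hedged sketch that averages link strength per node pair over all time slices of a NodePair list. The column layout (timeslice, NodeID, NodeID, value) and all class names are assumptions, not the project's actual job.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Averages link strength per node pair over all time slices of a NodePair list. */
public class AverageLinkStrength {

    public static class PairMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Expected columns: timeslice nodeA nodeB value (layout is an assumption)
            String[] cols = line.toString().trim().split("\\s+");
            if (cols.length < 4) return;                   // skip malformed lines
            ctx.write(new Text(cols[1] + ":" + cols[2]),
                      new DoubleWritable(Double.parseDouble(cols[3])));
        }
    }

    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text pair, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0; long n = 0;
            for (DoubleWritable v : values) { sum += v.get(); n++; }
            ctx.write(pair, new DoubleWritable(sum / n));  // average over time slices
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average-link-strength");
        job.setJarByClass(AverageLinkStrength.class);
        job.setMapperClass(PairMapper.class);
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```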
- networks and graphs are plotted with Jung2, the Google Chart API, or the Google Maps API
- several export adapters can transfer the data to Matlab, Cytoscape, Origin, or other scientific platforms