-
Notifications
You must be signed in to change notification settings - Fork 274
Batch Implementation
<dependency>
<groupId>com.tinkerpop.blueprints</groupId>
<artifactId>blueprints-core</artifactId>
<version>??</version>
</dependency>
BatchGraph
wraps any TransactionalGraph
to enable batch loading of a large number of edges and vertices by chunking the entire load into smaller batches and maintaining a memory-efficient vertex cache so that intermediate transactional states can be flushed after each chunk is loaded to release memory.
BatchGraph
is ONLY meant for loading data and does not support any retrieval or removal operations. That is, BatchGraph only supports the following methods:
-
addVertex()
for adding vertices -
addEdge()
for adding edges -
getVertex()
to be used when adding edges - Property getter, setter and removal methods for vertices and edges as well as
getId()
An important limitation of BatchGraph
is that edge properties can only be set immediately after the edge has been added. If other vertices or edges have been created in the meantime, setting, getting or removing properties will throw exceptions. This is done to avoid caching of edges which would require a great amount of memory.
BatchGraph
wraps TransactionalGraph
. To wrap arbitrary graphs, use BatchGraph.wrap()
which will additionally wrap non-transactional graphs.
BatchGraph
can also automatically set the provided element ids as properties on the respective element. Use setVertexIdKey()
and setEdgeIdKey()
to set the keys for the vertex and edge properties respectively. This is useful when the graph implementation ignores supplied ids and allows to make the loaded graph compatible for later wrapping with IdGraph
(see Id Implementation) when setting the vertex and edge Id keys to IdGraph.ID
.
As an example, suppose we are loading a large number of edges defined by a String array with four entries called quads:
- The out vertex id
- The in vertex id
- The label of the edge
- A string annotation for the edge, i.e. an edge property
Assuming this array is very large, loading all these edges in a single transaction is likely to exhaust main memory. Furthermore, one would have to rely on the database indexes to retrieve previously created vertices for a given id. BatchGraph
addresses both of these issues.
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
for (String[] quad : quads) {
Vertex[] vertices = new Vertex[2];
for (int i=0;i<2;i++) {
vertices[i] = bgraph.getVertex(quad[i]);
if (vertices[i]==null) vertices[i]=bgraph.addVertex(quad[i]);
}
Edge edge = bgraph.addEdge(null,vertices[0],vertices[1],quad[2]);
edge.setProperty("annotation",quad[3]);
}
First, a BatchGraph
bgraph is created wrapping an existing graph and setting the id type to IdType.STRING
and the batch size to 1000. BatchGraph
maintains a mapping from the external vertex ids, in our example the first two entries in the String array describing th edge, to the internal vertex ids assigned by the wrapped grahp database. Since this mapping is maintained in memory, it is potentially much faster than the database index. By specifying the IdType
, BatchGraph
chooses the most memory-efficient mapping data structure and applies compression algorithms if possible. There are four different IdTypes
:
- OBJECT : For arbitrary object vertex ids. This is the most generic and least space efficient type.
- STRING : For string vertex ids. Attempts to apply string compression and prefixing strategies to reduce the memory footprint.
- URL : For string vertex ids that parse as URLs. Applies URL specific compression schemes that are more efficient than generic string compression.
- NUMBER : For numeric vertex ids. Uses primitive data structures that requires significantly less memory.
The last argument in the constructor is the batch size, that is, the number of vertices and edges to load before committing a transaction and starting a new one.
The for-loop then iterates over all the quad String arrays and creates an edge for each by first retrieving or creating the vertex end points and then creating the edge. Note, that we set the edge property immediately after creating the edge. This is required because edges are only kept in memory until the next edge is created for efficiency reasons.
In the previous example, there is a big speed advantage if the next edge loaded has the same out vertex as the previous edge. Loading all of the out going edges for a particular vertex at once before moving on to the next out vertex makes optimal use of the cache, whereas loading edges in a random order causes many more writes to and flushes of the cache.
To take advantage of this, the data can be presorted quickly and efficiently using the linux built-in sort command. Let’s say that edges are read from a text file edges.txt
with one edge per line:
4 created 5 weight=1.0
1 knows 4 weight=1.0
1 knows 2 weight=0.5
4 created 3 weight=0.4
6 created 3 weight=0.2
1 created 3 weight=0.4
This file can be sorted before loading with
vadasg$ sort -S4G -o edges_sorted.txt edges.txt
The -S4G
flag gives sort 4Gb of memory to work with. If the file fits into memory the sort will be very fast; otherwise sort
will use scratch space on disk to perform the operation. Although this is not as fast, the linux sort
command is highly optimized and is not limited in the size of files it can process. If the input data contain unwanted duplicate lines, using the -u
flag will cause sort
to remove these duplicate lines during processing.
The sorted file edges_sorted.txt
now has the edges ordered by out vertex:
1 created 3 weight=0.4
1 knows 2 weight=0.5
1 knows 4 weight=1.0
4 created 3 weight=0.4
4 created 5 weight=1.0
6 created 3 weight=0.2
This way, any given out vertex is kept in the cache for all of its out going edges. The time needed to sort the data is nearly always much less than the loading time saved by maximizing use of the cache, especially for large input data.
The above describes how BatchGraph
can be used to load data into a graph under the assumption that the wrapped graph is initially empty. BatchGraph
can also be used to incrementally batch load edges and vertices into a graph with existing data. In this case, vertices may already exist for given ids.
If the wrapped graph does not ignore ids, then enabling incremental batch loading is as simple as calling setLoadingFromScratch(false)
, i.e. to disable the assumption that data is loaded into an empty graph. If the wrapped graph does ignore ids, then one has to tell BatchGraph
how to find existing vertices for a given id by specifying the vertex id key using setVertexIdKey(uid)
where uid is some string for the property key. Also, uid must be key indexed for this to work.
graph.createKeyIndex("uid",Vertex.class);
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
bgraph.setVertexIdKey("uid");
bgraph.setLoadingFromScratch(false);
//Load data as shown above
Note, that incremental batch loading is more expensive than loading from scratch because BatchGraph
has to call on the wrapped graph to determine whether a vertex exists for a given id.