Fixes #4138: Add support for loading Gephi GEXF file format (#4177)
* Fixes #4138: Add support for loading Gephi GEXF file format

* removed unused imports

* improved assertMapEquals to include nulls

* added malformed gexf tests and examples, small fix typo
vga91 authored Sep 2, 2024
1 parent de44ba7 commit 01f4d95
Showing 13 changed files with 1,567 additions and 35 deletions.
2 changes: 2 additions & 0 deletions docs/asciidoc/modules/ROOT/nav.adoc
@@ -26,6 +26,8 @@ include::partial$generated-documentation/nav.adoc[]
** xref::import/load-csv.adoc[]
** xref::import/xls.adoc[]
** xref::import/html.adoc[]
** xref::import/parquet.adoc[]
** xref::import/gexf.adoc[]
* xref:export/index.adoc[]
** xref::export/xls.adoc[]
294 changes: 294 additions & 0 deletions docs/asciidoc/modules/ROOT/pages/import/gexf.adoc
@@ -0,0 +1,294 @@
[[gexf]]
= Load GEXF (Graph Exchange XML Format)
:description: This section describes procedures that can be used to import data from GEXF files.



Many existing applications and data integrations use GEXF to describe a graph with nodes and relationships.
For further information, see the https://gexf.net/[official documentation].

It is possible to load or import nodes and relationships from a GEXF file with the procedures
`apoc.load.gexf` and `apoc.import.gexf`. You need to:

* provide a path to a GEXF file
* provide configuration (optional)

The `apoc.import.gexf` procedure reads the file in the same way as `apoc.load.gexf`, but also creates nodes and relationships in Neo4j.

To read from files, you have to enable the following config option:

----
apoc.import.file.enabled=true
----

By default, file paths are global. To use paths relative to the `import` directory, set:

----
apoc.import.file.use_neo4j_config=true
----

== Examples for apoc.load.gexf

.load.gexf
----
<?xml version="1.0" encoding="UTF-8"?>
<gexf version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node foo="bar">
        <attvalues>
          <attvalue for="0" value="http://gephi.org"/>
        </attvalues>
      </node>
    </nodes>
  </graph>
</gexf>
----

[source, cypher]
----
CALL apoc.load.gexf('load.gexf')
----

.Results
[opts="header"]
|===
| value
| {_type: gexf, _children: [{_type: graph, defaultedgetype: directed, _children: [{_type: nodes, _children: [{_type: node, _children: [{_type: attvalues, _children: [{_type: attvalue, for: 0, value: http://gephi.org}]}], foo: bar}]}]}], version: 1.2}
|===
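
The nested shape of the result above (`_type`, the element's attributes, and `_children`) can be illustrated with a minimal sketch based on the JDK DOM parser. This is not the APOC implementation: the class `GexfMapSketch` and its methods are hypothetical names, and all attribute values are kept as strings here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration, not part of APOC.
public class GexfMapSketch {

    // Converts one XML element into a map holding `_type`, its attributes,
    // and, when it has child elements, a `_children` list.
    static Map<String, Object> toMap(Element element) {
        Map<String, Object> map = new LinkedHashMap<>();
        map.put("_type", element.getTagName());

        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            map.put(attribute.getNodeName(), attribute.getNodeValue());
        }

        List<Map<String, Object>> children = new ArrayList<>();
        NodeList nodes = element.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            // Skip text and comment nodes; only elements become children.
            if (node instanceof Element) {
                children.add(toMap((Element) node));
            }
        }
        if (!children.isEmpty()) {
            map.put("_children", children);
        }
        return map;
    }

    // Parses an XML string and converts its root element.
    static Map<String, Object> parse(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return toMap(document.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<gexf version=\"1.2\"><graph defaultedgetype=\"directed\">"
                + "<nodes><node foo=\"bar\"/></nodes></graph></gexf>";
        System.out.println(parse(xml));
    }
}
```

Each element becomes one map, so the depth of the returned structure mirrors the depth of the XML document.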


With a malformed GEXF file, like the following one:

----
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://gexf.net/1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://gexf.net/1.3 http://gexf.net/1.3/gexf.xsd" version="1.2">
<meta lastmodifieddate="2009-03-20">
<creator>Gephi.org</creator>
<description>A Web network</description>
</meta>
<graph defaultedgetype="directed">
<attributes class="node">
</attribute>
<nodes>
<nodea id="0" label="Gephi">
<attvalues>
</node>
</edges>
</graph>
</gexf>
----

we get the following error:
```
[Fatal Error] :9:9: The element type "attributes" must be terminated by the matching end-tag "</attributes>".
```
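
The line and column in the error come from the XML parser, which stops at the first tag that does not match. The following minimal sketch reproduces that kind of failure with the JDK parser; it is not APOC code, and `MalformedGexfSketch` and `firstFatalError` are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not part of APOC.
public class MalformedGexfSketch {

    // Returns "line:column: message" for the first fatal parse error,
    // or null when the document is well-formed.
    static String firstFatalError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // <attributes> is closed by </attribute>, as in the malformed file above.
        String malformed = "<graph><attributes class=\"node\"></attribute></graph>";
        System.out.println(firstFatalError(malformed));
    }
}
```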

== Examples for apoc.import.gexf

Besides the file you can pass in a config map:

.Config parameters
[opts=header]
|===
| name | type | default | description
| readLabels | Boolean | false | Creates node labels based on the value in the `labels` property of `node` elements
| defaultRelationshipType | String | RELATED | The default relationship type to use if none is specified in the GEXF file
| storeNodeIds | Boolean | false | Stores the `id` property of `node` elements
| batchSize | Integer | 20000 | The number of elements to process per transaction
| compression | `Enum[NONE, BYTES, GZIP, BZIP2, DEFLATE, BLOCK_LZ4, FRAMED_SNAPPY]` | `null` | Allows taking binary data, either not compressed (value: `NONE`) or compressed (other values).
See the xref::overview/apoc.load/apoc.load.csv.adoc#_binary_file[Binary file example]
| source | Map<String,String> | Empty map | See the `source / target config` section below
| target | Map<String,String> | Empty map | See the `source / target config` section below
|===
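
As a rough illustration of how the defaults in the table combine with a user-supplied config map, the sketch below merges the two so that user values win. The class `GexfConfigSketch` is hypothetical and is not how APOC parses its configuration; only the keys and default values are taken from the table, and `compression` is omitted because its default is `null`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of APOC.
public class GexfConfigSketch {

    // Default values copied from the "Config parameters" table.
    static Map<String, Object> defaults() {
        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("readLabels", false);
        defaults.put("defaultRelationshipType", "RELATED");
        defaults.put("storeNodeIds", false);
        defaults.put("batchSize", 20000);
        defaults.put("source", Map.of());
        defaults.put("target", Map.of());
        return defaults;
    }

    // Starts from the defaults and overrides them with any user-supplied entries.
    static Map<String, Object> withDefaults(Map<String, Object> userConfig) {
        Map<String, Object> merged = defaults();
        merged.putAll(userConfig);
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(withDefaults(Map.of("readLabels", true)));
    }
}
```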


Importing the following file creates:

* 2 nodes with label Gephi
* 2 nodes with label Webatlas
* 1 node with label RTGI
* 1 node with label BarabasiLab
* 6 relationships of kind KNOWS
* 1 relationship of kind HAS_TICKET
* 1 relationship of kind BAZ

.data.gexf
----
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://gexf.net/1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://gexf.net/1.3 http://gexf.net/1.3/gexf.xsd" version="1.2">
  <meta lastmodifieddate="2009-03-20">
    <creator>Gephi.org</creator>
    <description>A Web network</description>
  </meta>
  <graph defaultedgetype="directed">
    <attributes class="node">
      <attribute id="0" title="url" type="string"/>
      <attribute id="room" title="room" type="integer"/>
      <attribute id="projects" title="projects" type="long"/>
      <attribute id="price" title="price" type="double"/>
      <attribute id="1" title="indegree" type="float"/>
      <attribute id="members" title="members" type="liststring"/>
      <attribute id="pins" title="pins" type="listboolean"/>
      <attribute id="2" title="frog" type="boolean">
        <default>true</default>
      </attribute>
    </attributes>
    <attributes class="edge">
      <attribute id="score" title="score" type="float"/>
    </attributes>
    <nodes>
      <node id="0" label="Gephi">
        <attvalues>
          <attvalue for="0" value="http://gephi.org"/>
          <attvalue for="1" value="1"/>
          <attvalue for="room" value="10"/>
          <attvalue for="price" value="10.02"/>
          <attvalue for="projects" value="300"/>
          <attvalue for="members" value="[Altomare, Sterpeto, Lino]"/>
          <attvalue for="pins" value="[true, false, true, false]"/>
        </attvalues>
      </node>
      <node id="5" label="Gephi">
        <attvalues>
          <attvalue for="0" value="http://test.gephi.org"/>
          <attvalue for="1" value="2"/>
        </attvalues>
      </node>
      <node id="1" label="Webatlas">
        <attvalues>
          <attvalue for="0" value="http://webatlas.fr"/>
          <attvalue for="1" value="2"/>
        </attvalues>
      </node>
      <node id="2" label="RTGI">
        <attvalues>
          <attvalue for="0" value="http://rtgi.fr"/>
          <attvalue for="1" value="1"/>
        </attvalues>
      </node>
      <node id="3" label=":BarabasiLab:Webatlas">
        <attvalues>
          <attvalue for="0" value="http://barabasilab.com"/>
          <attvalue for="1" value="1"/>
          <attvalue for="2" value="false"/>
        </attvalues>
      </node>
    </nodes>
    <edges>
      <edge source="0" target="1" kind="KNOWS">
        <attvalues>
          <attvalue for="score" value="1.5"/>
        </attvalues>
      </edge>
      <edge source="0" target="0" kind="BAZ">
        <attvalues>
          <attvalue for="foo" value="bar"/>
          <attvalue for="score" value="2"/>
        </attvalues>
      </edge>
      <edge source="0" target="2" kind="HAS_TICKET">
        <attvalues>
          <attvalue for="ajeje" value="brazorf"/>
          <attvalue for="score" value="3"/>
        </attvalues>
      </edge>
      <edge source="0" target="2" kind="KNOWS" />
      <edge source="1" target="0" kind="KNOWS" />
      <edge source="2" target="1" kind="KNOWS" />
      <edge source="0" target="3" kind="KNOWS" />
      <edge source="5" target="3" kind="KNOWS" />
    </edges>
  </graph>
</gexf>
----

[source, cypher]
----
CALL apoc.import.gexf('data.gexf', {readLabels:true})
----

.Results
[opts="header"]
|===
| value
| {
"relationships" : 8,
"batches" : 0,
"file" : "file:/../data.gexf",
"nodes" : 5,
"format" : "gexf",
"source" : "file",
"time" : 9736,
"rows" : 0,
"batchSize" : -1,
"done" : true,
"properties" : 21
}
|===
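
With `readLabels: true`, a `label` attribute such as `:BarabasiLab:Webatlas` yields two labels on the same node. A minimal sketch of that split follows, assuming labels are colon-separated as in the sample file; `GexfLabelSketch` and `splitLabels` are hypothetical names and this is not the APOC implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of APOC.
public class GexfLabelSketch {

    // Splits a colon-separated label attribute into individual label names,
    // ignoring the empty segment produced by a leading colon.
    static List<String> splitLabels(String labelAttribute) {
        List<String> labels = new ArrayList<>();
        if (labelAttribute == null) {
            return labels;
        }
        for (String part : labelAttribute.split(":")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                labels.add(trimmed);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(splitLabels(":BarabasiLab:Webatlas"));
        System.out.println(splitLabels("Gephi"));
    }
}
```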

We can also store the node IDs by executing:
[source, cypher]
----
CALL apoc.import.gexf('data.gexf', {readLabels:true, storeNodeIds: true})
----


With a malformed GEXF file, like the following one:

----
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://gexf.net/1.3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://gexf.net/1.3 http://gexf.net/1.3/gexf.xsd" version="1.2">
<meta lastmodifieddate="2009-03-20">
<creator>Gephi.org</creator>
<description>A Web network</description>
</meta>
<graph defaultedgetype="directed">
<attributes class="node">
</attribute>
<nodes>
<nodea id="0" label="Gephi">
<attvalues>
</node>
</edges>
</graph>
</gexf>
----

the import completes without creating any nodes, relationships, or properties:

.Results
[opts="header"]
|===
| value
| {
"relationships" : 0,
"batches" : 0,
"file" : "file:/../malformed.gexf",
"nodes" : 0,
"format" : "gexf",
"source" : "file",
"time" : 9736,
"rows" : 0,
"batchSize" : -1,
"done" : true,
"properties" : 0
}
|===



=== source / target config

Allows relationships to be imported even when the source and/or target nodes are not present in the file, by searching for existing nodes via a custom label and property.
To do this, add `source: {label: '<MY_SOURCE_LABEL>', id: '<MY_SOURCE_ID>'}` and/or `target: {label: '<MY_TARGET_LABEL>', id: '<MY_TARGET_ID>'}` to the config map.
The start and end nodes are then looked up using the `source` and `target` attributes of the `edge` tag.

For example, with a config map `{source: {id: 'myId', label: 'Foo'}, target: {id: 'other', label: 'Bar'}}`
and an edge row like `<edge id="e0" source="n0" target="n1" label="KNOWS"><data key="label">KNOWS</data></edge>`,
the import searches for a source node `(:Foo {myId: 'n0'})` and an end node `(:Bar {other: 'n1'})`.
The `id` key is optional (the default is `'id'`).
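
The lookup described above can be sketched as a small helper that turns one `source` or `target` config entry and an edge attribute value into the node pattern being searched for. `GexfEndpointLookupSketch` and `lookupPattern` are hypothetical names, not APOC code, and the string concatenation is for illustration only; it is not a safe way to build real Cypher queries.

```java
import java.util.Map;

// Hypothetical illustration, not part of APOC.
public class GexfEndpointLookupSketch {

    // Builds the node pattern for one endpoint. The `id` key falls back to
    // 'id' when it is not present in the config entry.
    static String lookupPattern(Map<String, String> endpointConfig, String edgeAttributeValue) {
        String label = endpointConfig.get("label");
        String idKey = endpointConfig.getOrDefault("id", "id");
        return "(:" + label + " {" + idKey + ": '" + edgeAttributeValue + "'})";
    }

    public static void main(String[] args) {
        // source="n0" with config {id: 'myId', label: 'Foo'}
        System.out.println(lookupPattern(Map.of("id", "myId", "label", "Foo"), "n0"));
        // target="n1" with config {label: 'Bar'}, so the default key 'id' is used
        System.out.println(lookupPattern(Map.of("label", "Bar"), "n1"));
    }
}
```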




1 change: 1 addition & 0 deletions docs/asciidoc/modules/ROOT/pages/import/index.adoc
@@ -13,3 +13,4 @@ For more information on these procedures, see:
* xref::import/xls.adoc[]
* xref::import/html.adoc[]
* xref::import/parquet.adoc[]
* xref::import/gexf.adoc[]
83 changes: 83 additions & 0 deletions extended/src/main/java/apoc/load/Gexf.java
@@ -0,0 +1,83 @@
package apoc.load;

import apoc.Extended;
import apoc.Pools;
import apoc.export.util.CountingReader;
import apoc.export.util.ExportConfig;
import apoc.export.util.ProgressReporter;
import apoc.load.util.XmlReadUtil.Import;
import apoc.result.MapResult;
import apoc.result.ProgressInfo;
import apoc.util.FileUtils;
import apoc.util.Util;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.security.URLAccessChecker;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Description;
import org.neo4j.procedure.Mode;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;
import org.neo4j.procedure.TerminationGuard;

import java.util.Map;
import java.util.stream.Stream;

import static apoc.load.util.XmlReadUtil.Load.xmlXpathToMapResult;

@Extended
public class Gexf {

    @Context
    public GraphDatabaseService db;

    @Context
    public URLAccessChecker urlAccessChecker;

    @Context
    public TerminationGuard terminationGuard;

    @Context
    public Pools pools;

    @Procedure("apoc.load.gexf")
    @Description("apoc.load.gexf(urlOrBinary, $config) - load Gexf file from URL or binary source")
    public Stream<MapResult> gexf(
            @Name("urlOrBinary") Object urlOrBinary,
            @Name(value = "config", defaultValue = "{}") Map<String, Object> config
    ) throws Exception {
        return xmlXpathToMapResult(urlOrBinary, urlAccessChecker, terminationGuard, config);
    }

    @Procedure(name = "apoc.import.gexf", mode = Mode.WRITE)
    @Description("Imports a graph from the provided GEXF file.")
    public Stream<ProgressInfo> importGexf(
            @Name("urlOrBinaryFile") Object urlOrBinaryFile,
            @Name(value = "config", defaultValue = "{}") Map<String, Object> config) {
        ProgressInfo result = Util.inThread(pools, () -> {
            ExportConfig exportConfig = new ExportConfig(config);
            // A String argument is treated as a file/URL; anything else as binary data.
            String file = null;
            String source = "binary";
            if (urlOrBinaryFile instanceof String) {
                file = (String) urlOrBinaryFile;
                source = "file";
            }
            ProgressReporter reporter = new ProgressReporter(null, null, new ProgressInfo(file, source, "gexf"));
            Import graphReader = new Import(db)
                    .reporter(reporter)
                    .batchSize(exportConfig.getBatchSize())
                    .relType(exportConfig.defaultRelationshipType())
                    .source(exportConfig.getSource())
                    .target(exportConfig.getTarget())
                    .nodeLabels(exportConfig.readLabels());

            if (exportConfig.storeNodeIds()) graphReader.storeNodeIds();

            // The reader is closed automatically once parsing finishes or fails.
            try (CountingReader reader =
                    FileUtils.readerFor(urlOrBinaryFile, exportConfig.getCompressionAlgo(), urlAccessChecker)) {
                graphReader.parseXML(reader, terminationGuard);
            }

            return reporter.getTotal();
        });
        return Stream.of(result);
    }
}