
Commit 2692e0c: fix conflicts

sierra-moxon committed Aug 11, 2022
2 parents a46016a + 0424ee8
Showing 32 changed files with 555,383 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/run_tests.yml
@@ -26,7 +26,7 @@ jobs:
       - uses: actions/setup-python@v2
         name: setup python environment
         with:
-          python-version: 3.7
+          python-version: 3.9


       - name: Install dependencies
2 changes: 1 addition & 1 deletion .github/workflows/run_tox.yml
@@ -15,7 +15,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python: [3.7, 3.8, 3.9]
+        python: [3.8, 3.9]

     steps:
       - uses: actions/checkout@v2
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
-FROM python:3.7
+FROM python:3.9
 MAINTAINER Sierra Moxon "[email protected]"

 # Clone repository
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
 export PYTHONPATH=.

-tests: unit-tests integration-tests
+test: unit-tests integration-tests

 unit-tests:
 	pytest tests/unit/test_source/*.py
6 changes: 3 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
 # Knowledge Graph Exchange

-[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)]()
+[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)]()
 ![Run tests](https://github.com/biolink/kgx/workflows/Run%20tests/badge.svg)[![Documentation Status](https://readthedocs.org/projects/kgx/badge/?version=latest)](https://kgx.readthedocs.io/en/latest/?badge=latest)
 [![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=biolink_kgx&metric=alert_status)](https://sonarcloud.io/dashboard?id=biolink_kgx)
 [![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=biolink_kgx&metric=sqale_rating)](https://sonarcloud.io/dashboard?id=biolink_kgx)
@@ -53,7 +53,7 @@ transformer.transform(
             "graph_edges.tsv",
         ],
         "format": "tsv",
-    }
+    },
     output_args={
         "format": "null"
     },
@@ -113,7 +113,7 @@ It is likely that additional error conditions within KGX can be efficiently capt

 ## Installation

-The installation for KGX requires Python 3.7 or greater.
+The installation for KGX requires Python 3.9 or greater.


 ### Installation for users
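For orientation, the README snippet patched above is a `Transformer.transform` call, and the added comma is what separates its two keyword arguments. A minimal sketch of the corrected example as it would run end to end, assuming the file names and the `stream=True` constructor argument from the surrounding README text (the "null" output format simply discards the graph after reading):

```python
from kgx.transformer import Transformer

transformer = Transformer(stream=True)

# The comma after input_args (the fix in this hunk) separates the two
# keyword arguments; without it the README example was a syntax error.
transformer.transform(
    input_args={
        "filename": [
            "graph_nodes.tsv",
            "graph_edges.tsv",
        ],
        "format": "tsv",
    },
    output_args={
        "format": "null"
    },
)
```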
2 changes: 1 addition & 1 deletion docs/installation.md
@@ -1,6 +1,6 @@
 # Installation

-The installation for KGX requires Python 3.7 or greater.
+The installation for KGX requires Python 3.9 or greater.


 ## Installation for users
34 changes: 25 additions & 9 deletions docs/reference/transformer.md
@@ -83,23 +83,39 @@ t.store.graph.edges()

 ## InfoRes Identifier Rewriting

-The `provided_by` and/or `knowledge_source` _et al._ field values of KGX node and edge records generally contain a name of a knowledge source for the node or edge. In some cases, (e.g. Monarch) such values in source knowledge sources could be quite verbose. To normalize such names to a concise standard, the latest Biolink Model (2.*) is moving towards the use of **Information Resource** ("InfoRes") CURIE identifiers.
+The `provided_by` and/or `knowledge_source` _et al._ field values of KGX node and edge records generally contain a name
+of a knowledge source for the node or edge. In some cases, (e.g. Monarch) such values in source knowledge sources
+could be quite verbose. To normalize such names to a concise standard, Biolink Model uses
+**Information Resource** ("InfoRes") CURIE identifiers.

-To help generate and document such InfoRes identifiers, the provenance property values may optionally trigger a rewrite of their knowledge source names to a candidate InfoRes, as follows:
+To help generate and document such InfoRes identifiers, the provenance property values may optionally trigger a rewrite
+of their knowledge source names to a candidate InfoRes, as follows:

-1. Setting the provenance property to a boolean **True* or (case insensitive) string **"True"** triggers a simple reformatting of knowledge source names into lower case alphanumeric strings removing non-alphanumeric characters and replacing space delimiting words, with hyphens.
+1. Setting the provenance property to a boolean **True* or (case-insensitive) string **"True"** triggers a simple
+   reformatting of knowledge source names into lower case alphanumeric strings removing non-alphanumeric characters
+   and replacing space delimiting words, with hyphens.

-1. Setting the provenance property to a boolean **False* or (case insensitive) string **"False"** suppresses the given provenance annotation on the output graph.
+2. Setting the provenance property to a boolean **False* or (case-insensitive) string **"False"** suppresses the
+   given provenance annotation on the output graph.

-1. Providing a tuple with a single string argument not equal to **True**, then the string assumed to be a standard (Pythonic) regular expression to match against knowledge source names. If you do not provide any other string argument (see below), then a matching substring in the name triggers deletion of the matched patter. The simple reformatting (as in 1 above) is then applied to the resulting string.
+3. Providing a tuple with a single string argument not equal to **True**, then the string is assumed to be a standard
+   regular expression to match against knowledge source names. If you do not provide any other string
+   argument (see below), then a matching substring in the name triggers deletion of the matched pattern. The simple
+   reformatting (as in 1 above) is then applied to the resulting string.

-1. Similar to 2 above, except providing a second string in the tuple which is substituted for the regular expression matched string, followed by simple reformatting.
+4. Similar to 2 above, except providing a second string in the tuple which is substituted for the regular expression
+   matched string, followed by simple reformatting.

-1. Providing a third string in the tuple to add a prefix string to the name (as a separate word) of all the generated InfoRes identifiers. Note that if one sets the first and second elements of the tuple to empty strings, the result is the simple addition of a prefix to the provenance property value. Again, the algorithm then applies the simple reformatting rules, but no other internal changes.
+5. Providing a third string in the tuple to add a prefix string to the name (as a separate word) of all the generated
+   InfoRes identifiers. Note that if one sets the first and second elements of the tuple to empty strings, the result
+   is the simple addition of a prefix to the provenance property value. Again, the algorithm then applies the simple
+   reformatting rules, but no other internal changes.

-The unit tests provide examples of these various rewrites, in the KGX project [tests/integration/test_transform.py](https://github.com/biolink/kgx/blob/master/tests/integration/test_transform.py).
+The unit tests provide examples of these various rewrites, in the KGX project
+[tests/integration/test_transform.py](https://github.com/biolink/kgx/blob/master/tests/integration/test_transform.py).

-The catalog of inferred InfoRes mappings onto knowledge source names is available programmatically, after completion of transform call by using the `get_infores_catalog()` method of the **Transformer** class. The `transform` call of the CLI now also takes a multi-valued `--knowledge-sources` argument, which either facilitates the aforementioned infores processing. Note that quoted comma-delimited strings demarcate the tuple rewrite specifications noted above.
+The catalog of inferred InfoRes mappings onto knowledge source names is available programmatically, after completion
+of transform call by using the `get_infores_catalog()` method of the **Transformer** class.

 ## kgx.transformer
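To make the rewrite rules in the revised docs concrete, here is a minimal sketch (not part of the docs page itself) that applies rule 4, a (regex, substitution) tuple, to a tsv input and then reads back the inferred mappings; the file names and the regular expression are illustrative assumptions:

```python
from kgx.transformer import Transformer

transformer = Transformer()

# Rewrite rule 4: names matching the regex have the match replaced, then
# the simple reformatting (lower case, hyphens for spaces) yields the
# candidate InfoRes identifier.
transformer.transform(
    input_args={
        "filename": ["graph_nodes.tsv", "graph_edges.tsv"],  # illustrative paths
        "format": "tsv",
        "knowledge_source": (r"\s*\(.+\)\s*", ""),  # e.g. drop parenthesized suffixes
    },
    output_args={"format": "null"},
)

# The catalog of knowledge-source-name -> InfoRes mappings inferred above.
for name, infores in transformer.get_infores_catalog().items():
    print(f"{name}\t{infores}")
```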
4 changes: 2 additions & 2 deletions kgx/cli/cli_utils.py
@@ -853,7 +853,7 @@ def transform_source(
     Returns an instance of Sink
     """
-    log.info(f"Processing source '{key}'")
+    log.debug(f"Processing source '{key}'")
     input_args = prepare_input_args(
         key,
         source,
@@ -881,7 +881,7 @@ def transform_source(
         catalog: Dict[str, str] = transformer.get_infores_catalog()
         for source in catalog.keys():
             infores = catalog.setdefault(source, "unknown")
-            print(f"{source}\t{infores}", file=irc)
+            log.debug(f"{source}\t{infores}", file=irc)

     return transformer.store
1 change: 1 addition & 0 deletions kgx/error_detection.py
@@ -14,6 +14,7 @@ class ErrorType(Enum):
     """

     MISSING_NODE_PROPERTY = 1
+    MISSING_PROPERTY = 1.5
     MISSING_EDGE_PROPERTY = 2
     INVALID_NODE_PROPERTY = 3
     INVALID_EDGE_PROPERTY = 4
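The new member uses a float value to slot between the existing node and edge variants without renumbering them; Python's `Enum` permits any value. A quick standalone illustration of that behavior (an abridged copy of the enum, for demonstration only):

```python
from enum import Enum

class ErrorType(Enum):
    """Abridged copy of the enum in this diff, for illustration only."""
    MISSING_NODE_PROPERTY = 1
    MISSING_PROPERTY = 1.5  # float value keeps ordering between 1 and 2
    MISSING_EDGE_PROPERTY = 2

# Lookup by value resolves the new member like any other.
assert ErrorType(1.5) is ErrorType.MISSING_PROPERTY
print(ErrorType.MISSING_PROPERTY.name, ErrorType.MISSING_PROPERTY.value)
```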
6 changes: 3 additions & 3 deletions kgx/sink/neo_sink.py
@@ -5,7 +5,6 @@
 from kgx.error_detection import ErrorType
 from kgx.sink.sink import Sink
 from kgx.source.source import DEFAULT_NODE_CATEGORY
-
 log = get_logger()


@@ -84,7 +83,7 @@ def _write_node_cache(self) -> None:
         filtered_categories = [x for x in categories if x not in self._seen_categories]
         self.create_constraints(filtered_categories)
         for category in self.node_cache.keys():
-            log.debug("Generating UNWIND for category: {}".format(category))
+            log.info("Generating UNWIND for category: {}".format(category))
             cypher_category = category.replace(
                 self.CATEGORY_DELIMITER, self.CYPHER_CATEGORY_DELIMITER
             )
@@ -140,12 +139,13 @@ def _write_edge_cache(self) -> None:
         batch_size = 10000
         for predicate in self.edge_cache.keys():
             query = self.generate_unwind_edge_query(predicate)
-            log.debug(query)
+            log.info(query)
             edges = self.edge_cache[predicate]
             for x in range(0, len(edges), batch_size):
                 y = min(x + batch_size, len(edges))
                 batch = edges[x:y]
                 log.debug(f"Batch {x} - {y}")
+                log.info(edges[x:y])
                 try:
                     self.session.run(
                         query, parameters={"relationship": predicate, "edges": batch}
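The `_write_edge_cache` hunk above batches each predicate's edges into 10,000-row slices before a single parameterized write per batch. A standalone sketch of just the slicing arithmetic, with stand-in data and no Neo4j required:

```python
batch_size = 10000
edges = [{"id": i} for i in range(25000)]  # stand-in for cached edge dicts

for x in range(0, len(edges), batch_size):
    y = min(x + batch_size, len(edges))  # mirrors the source; slicing alone would also clamp
    batch = edges[x:y]
    print(f"Batch {x} - {y}: {len(batch)} edges")  # 10000, 10000, 5000
```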
15 changes: 13 additions & 2 deletions kgx/source/json_source.py
@@ -1,10 +1,20 @@
 import gzip
-from typing import Optional, Generator, Any
-
 import ijson
 from itertools import chain

+from typing import Dict, Tuple, Any, Generator, Optional, List
+from kgx.config import get_logger
+from kgx.error_detection import ErrorType, MessageLevel
+from kgx.source.source import Source
+from kgx.utils.kgx_utils import (
+    generate_uuid,
+    generate_edge_key,
+    extension_types,
+    archive_read_mode,
+    sanitize_import
+)
 from kgx.source.tsv_source import TsvSource
+log = get_logger()


 class JsonSource(TsvSource):
@@ -96,3 +106,4 @@ def read_edges(self, filename: str) -> Generator:
         FH = open(filename, "rb")
         for e in ijson.items(FH, "edges.item", use_float=True):
             yield self.read_edge(e)
+
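The `read_edges` context above streams edges with `ijson` rather than loading the whole JSON document into memory. A small self-contained sketch of that call pattern, with inline bytes standing in for a KGX graph JSON file:

```python
import io
import ijson

# Minimal stand-in for a KGX graph JSON document.
data = io.BytesIO(b'{"edges": [{"subject": "EX:1", "object": "EX:2"}]}')

# "edges.item" lazily yields each element of the top-level "edges" array;
# use_float=True parses numbers as float instead of Decimal.
for edge in ijson.items(data, "edges.item", use_float=True):
    print(edge["subject"], "->", edge["object"])
```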
7 changes: 0 additions & 7 deletions kgx/source/obograph_source.py
@@ -4,7 +4,6 @@
 import ijson
 import stringcase
 from bmt import Toolkit
-
 from kgx.error_detection import ErrorType, MessageLevel
 from kgx.prefix_manager import PrefixManager
 from kgx.config import get_logger
@@ -56,7 +55,6 @@ def parse(
         """
         self.set_provenance_map(kwargs)
-
         n = self.read_nodes(filename, compression)
         e = self.read_edges(filename, compression)
         yield from chain(n, e)
@@ -129,11 +127,6 @@ def read_node(self, node: Dict) -> Optional[Tuple[str, Dict]]:
         if "equivalent_nodes" in node_properties:
             equivalent_nodes = node_properties["equivalent_nodes"]
             fixed_node["same_as"] = equivalent_nodes
-            # for n in node_properties['equivalent_nodes']:
-            #     data = {'subject': fixed_node['id'], 'predicate': 'biolink:same_as',
-            #             'object': n, 'relation': 'owl:sameAs'}
-            #     super().load_node({'id': n, 'category': ['biolink:OntologyClass']})
-            #     self.graph.add_edge(fixed_node['id'], n, **data)
         return super().read_node(fixed_node)

     def read_edges(self, filename: str, compression: Optional[str] = None) -> Generator:
3 changes: 1 addition & 2 deletions kgx/source/source.py
@@ -25,7 +25,7 @@ def __init__(self, owner):
         self.node_properties = set()
         self.edge_properties = set()
         self.prefix_manager = PrefixManager()
-        self.infores_context: Optional[InfoResContext] = None
+        self.infores_context: Optional[InfoResContext] = InfoResContext()

     def set_prefix_map(self, m: Dict) -> None:
         """
@@ -256,7 +256,6 @@ def set_provenance_map(self, kwargs):
         """
         Set up a provenance (Knowledge Source to InfoRes) map
         """
-        self.infores_context = InfoResContext()
         self.infores_context.set_provenance_map(kwargs)

     def get_infores_catalog(self) -> Dict[str, str]:
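This change moves `InfoResContext` construction from `set_provenance_map` into `__init__`, so the context exists even when no provenance map is ever set. A self-contained sketch of the pattern; the class bodies below are stand-ins, not KGX's own implementations:

```python
from typing import Dict, Optional

class InfoResContext:
    """Stand-in for kgx's InfoResContext."""
    def __init__(self) -> None:
        self.catalog: Dict[str, str] = {}

class Source:
    def __init__(self) -> None:
        # Eager construction (the change in this diff): the attribute is
        # never None, so get_infores_catalog() is safe to call before
        # set_provenance_map() ever runs.
        self.infores_context: Optional[InfoResContext] = InfoResContext()

    def get_infores_catalog(self) -> Dict[str, str]:
        return self.infores_context.catalog

print(Source().get_infores_catalog())  # {} rather than an AttributeError on None
```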
7 changes: 1 addition & 6 deletions kgx/source/tsv_source.py
@@ -13,7 +13,6 @@
     archive_read_mode,
     sanitize_import
 )
-
 log = get_logger()

 DEFAULT_LIST_DELIMITER = "|"
@@ -226,11 +225,9 @@ def read_node(self, node: Dict) -> Optional[Tuple[str, Dict]]:
         if node:
             # if not None, assumed to have an "id" here...
             node_data = sanitize_import(node.copy(), self.list_delimiter)

             n = node_data["id"]
-
-            self.set_node_provenance(node_data)
-
+            self.set_node_provenance(node_data)  # this method adds provided_by to the node properties/node data
             self.node_properties.update(list(node_data.keys()))
             if self.check_node_filter(node_data):
                 self.node_properties.update(node_data.keys())
@@ -272,9 +269,7 @@ def read_edge(self, edge: Dict) -> Optional[Tuple]:
         edge = self.validate_edge(edge)
         if not edge:
             return None
-
         edge_data = sanitize_import(edge.copy(), self.list_delimiter)
-
         if "id" not in edge_data:
             edge_data["id"] = generate_uuid()
         s = edge_data["subject"]
21 changes: 7 additions & 14 deletions kgx/transformer.py
@@ -116,7 +116,6 @@ def __init__(
             if len(entry):
                 entry = entry.strip()
                 if entry:
-                    print("entry: " + entry, file=stderr)
                     source, infores = entry.split("\t")
                     self._infores_catalog[source] = infores
@@ -208,22 +207,13 @@ def transform(
         if output_args:
             if self.stream:
                 if output_args["format"] in {"tsv", "csv"}:
-                    if "node_properties" not in output_args:
-                        error_type = ErrorType.MISSING_NODE_PROPERTY
-                        self.log_error(
-                            entity=f"{output_args['format']} stream",
-                            error_type=error_type,
-                            message=f"'node_properties' not defined for output while streaming. " +
-                                    f"The exported format will be limited to a subset of the columns.",
-                            message_level=MessageLevel.WARNING
-                        )
-                    if "edge_properties" not in output_args:
-                        error_type = ErrorType.MISSING_EDGE_PROPERTY
+                    if "node_properties" not in output_args or "edge_properties" not in output_args:
+                        error_type = ErrorType.MISSING_PROPERTY
                         self.log_error(
                             entity=f"{output_args['format']} stream",
                             error_type=error_type,
-                            message=f"'edge_properties' not defined for output while streaming. " +
-                                    f"The exported format will be limited to a subset of the columns.",
+                            message=f"'node_properties' and 'edge_properties' must be defined for output while"
+                                    f"streaming. The exported format will be limited to a subset of the columns.",
                             message_level=MessageLevel.WARNING
                         )
                 sink = self.get_sink(**output_args)
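Since streaming tsv/csv output cannot scan the whole graph to discover columns, callers avoid the consolidated `MISSING_PROPERTY` warning by listing the columns up front. A hedged sketch; the output name and property lists are illustrative, not a required set:

```python
from kgx.transformer import Transformer

transformer = Transformer(stream=True)
transformer.transform(
    input_args={"filename": ["graph_nodes.tsv", "graph_edges.tsv"], "format": "tsv"},
    output_args={
        "filename": "output-graph",  # illustrative output name
        "format": "tsv",
        # Declaring columns up front sidesteps the streaming warning above.
        "node_properties": ["id", "category", "name"],
        "edge_properties": ["id", "subject", "predicate", "object", "relation"],
    },
)
```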
@@ -274,6 +264,7 @@ def transform(
                     if "node_properties" not in output_args:
                         output_args[
                             "node_properties"
                         ] = intermediate_source.node_properties
+                        log.debug("output_args['node_properties']: " + str(output_args["node_properties"]), file=stderr)
                     if "edge_properties" not in output_args:
                         output_args[
                             "edge_properties"
@@ -342,6 +333,7 @@ def process(self, source: Generator, sink: Sink) -> None:
         """
         for rec in source:
             if rec:
+                log.debug("length of rec", len(rec), "rec", rec)
                 if len(rec) == 4:  # infer an edge record
                     write_edge = True
                     if "subject_category" in self.edge_filters:
@@ -367,6 +359,7 @@ def process(self, source: Generator, sink: Sink) -> None:
                         self._seen_nodes.add(rec[0])
                     if self.inspector:
                         self.inspector(GraphEntityType.NODE, rec)
+                    # last element of rec is the node properties
                    sink.write_node(rec[-1])

        # TODO: review whether or not the 'save()' method need to be 'knowledge_source' aware?