
Commit 2692e0c: fix conflicts

sierra-moxon committed Aug 11, 2022
2 parents a46016a + 0424ee8
Showing 32 changed files with 555,383 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/run_tests.yml
@@ -26,7 +26,7 @@ jobs:
       - uses: actions/setup-python@v2
         name: setup python environment
         with:
-          python-version: 3.7
+          python-version: 3.9


       - name: Install dependencies
2 changes: 1 addition & 1 deletion .github/workflows/run_tox.yml
@@ -15,7 +15,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python: [3.7, 3.8, 3.9]
+        python: [3.8, 3.9]

     steps:
       - uses: actions/checkout@v2
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
-FROM python:3.7
+FROM python:3.9
 MAINTAINER Sierra Moxon "[email protected]"

 # Clone repository
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
 export PYTHONPATH=.

-tests: unit-tests integration-tests
+test: unit-tests integration-tests

 unit-tests:
 	pytest tests/unit/test_source/*.py
6 changes: 3 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
 # Knowledge Graph Exchange

-[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)]()
+[![Python](https://img.shields.io/badge/python-3.9+-blue.svg)]()
 ![Run tests](https://github.com/biolink/kgx/workflows/Run%20tests/badge.svg)[![Documentation Status](https://readthedocs.org/projects/kgx/badge/?version=latest)](https://kgx.readthedocs.io/en/latest/?badge=latest)
 [![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=biolink_kgx&metric=alert_status)](https://sonarcloud.io/dashboard?id=biolink_kgx)
 [![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=biolink_kgx&metric=sqale_rating)](https://sonarcloud.io/dashboard?id=biolink_kgx)
@@ -53,7 +53,7 @@ transformer.transform(
             "graph_edges.tsv",
         ],
         "format": "tsv",
-    }
+    },
     output_args={
         "format": "null"
     },
@@ -113,7 +113,7 @@ It is likely that additional error conditions within KGX can be efficiently capt

 ## Installation

-The installation for KGX requires Python 3.7 or greater.
+The installation for KGX requires Python 3.9 or greater.


 ### Installation for users
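For orientation, the README snippet patched above is a `Transformer.transform` call, and the added comma is what separates its two keyword arguments. A minimal sketch of the corrected example as it would run end to end, assuming the file names and the `stream=True` constructor argument from the surrounding README text (the "null" output format simply discards the graph after reading):

```python
from kgx.transformer import Transformer

transformer = Transformer(stream=True)

# The comma after input_args (the fix in this hunk) separates the two
# keyword arguments; without it the README example was a syntax error.
transformer.transform(
    input_args={
        "filename": [
            "graph_nodes.tsv",
            "graph_edges.tsv",
        ],
        "format": "tsv",
    },
    output_args={
        "format": "null"
    },
)
```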
2 changes: 1 addition & 1 deletion docs/installation.md
@@ -1,6 +1,6 @@
 # Installation

-The installation for KGX requires Python 3.7 or greater.
+The installation for KGX requires Python 3.9 or greater.


 ## Installation for users
34 changes: 25 additions & 9 deletions docs/reference/transformer.md
@@ -83,23 +83,39 @@ t.store.graph.edges()

 ## InfoRes Identifier Rewriting

-The `provided_by` and/or `knowledge_source` _et al._ field values of KGX node and edge records generally contain a name of a knowledge source for the node or edge. In some cases, (e.g. Monarch) such values in source knowledge sources could be quite verbose. To normalize such names to a concise standard, the latest Biolink Model (2.*) is moving towards the use of **Information Resource** ("InfoRes") CURIE identifiers.
+The `provided_by` and/or `knowledge_source` _et al._ field values of KGX node and edge records generally contain a name
+of a knowledge source for the node or edge. In some cases, (e.g. Monarch) such values in source knowledge sources
+could be quite verbose. To normalize such names to a concise standard, Biolink Model uses
+**Information Resource** ("InfoRes") CURIE identifiers.

-To help generate and document such InfoRes identifiers, the provenance property values may optionally trigger a rewrite of their knowledge source names to a candidate InfoRes, as follows:
+To help generate and document such InfoRes identifiers, the provenance property values may optionally trigger a rewrite
+of their knowledge source names to a candidate InfoRes, as follows:

-1. Setting the provenance property to a boolean **True* or (case insensitive) string **"True"** triggers a simple reformatting of knowledge source names into lower case alphanumeric strings removing non-alphanumeric characters and replacing space delimiting words, with hyphens.
+1. Setting the provenance property to a boolean **True* or (case-insensitive) string **"True"** triggers a simple
+   reformatting of knowledge source names into lower case alphanumeric strings removing non-alphanumeric characters
+   and replacing space delimiting words, with hyphens.

-1. Setting the provenance property to a boolean **False* or (case insensitive) string **"False"** suppresses the given provenance annotation on the output graph.
+2. Setting the provenance property to a boolean **False* or (case-insensitive) string **"False"** suppresses the
+   given provenance annotation on the output graph.

-1. Providing a tuple with a single string argument not equal to **True**, then the string assumed to be a standard (Pythonic) regular expression to match against knowledge source names. If you do not provide any other string argument (see below), then a matching substring in the name triggers deletion of the matched patter. The simple reformatting (as in 1 above) is then applied to the resulting string.
+3. Providing a tuple with a single string argument not equal to **True**, then the string is assumed to be a standard
+   regular expression to match against knowledge source names. If you do not provide any other string
+   argument (see below), then a matching substring in the name triggers deletion of the matched pattern. The simple
+   reformatting (as in 1 above) is then applied to the resulting string.

-1. Similar to 2 above, except providing a second string in the tuple which is substituted for the regular expression matched string, followed by simple reformatting.
+4. Similar to 2 above, except providing a second string in the tuple which is substituted for the regular expression
+   matched string, followed by simple reformatting.

-1. Providing a third string in the tuple to add a prefix string to the name (as a separate word) of all the generated InfoRes identifiers. Note that if one sets the first and second elements of the tuple to empty strings, the result is the simple addition of a prefix to the provenance property value. Again, the algorithm then applies the simple reformatting rules, but no other internal changes.
+5. Providing a third string in the tuple to add a prefix string to the name (as a separate word) of all the generated
+   InfoRes identifiers. Note that if one sets the first and second elements of the tuple to empty strings, the result
+   is the simple addition of a prefix to the provenance property value. Again, the algorithm then applies the simple
+   reformatting rules, but no other internal changes.

-The unit tests provide examples of these various rewrites, in the KGX project [tests/integration/test_transform.py](https://github.com/biolink/kgx/blob/master/tests/integration/test_transform.py).
+The unit tests provide examples of these various rewrites, in the KGX project
+[tests/integration/test_transform.py](https://github.com/biolink/kgx/blob/master/tests/integration/test_transform.py).

-The catalog of inferred InfoRes mappings onto knowledge source names is available programmatically, after completion of transform call by using the `get_infores_catalog()` method of the **Transformer** class. The `transform` call of the CLI now also takes a multi-valued `--knowledge-sources` argument, which either facilitates the aforementioned infores processing. Note that quoted comma-delimited strings demarcate the tuple rewrite specifications noted above.
+The catalog of inferred InfoRes mappings onto knowledge source names is available programmatically, after completion
+of transform call by using the `get_infores_catalog()` method of the **Transformer** class.

 ## kgx.transformer
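To make the rewrite rules in the revised docs concrete, here is a minimal sketch (not part of the docs page itself) that applies rule 4, a (regex, substitution) tuple, to a tsv input and then reads back the inferred mappings; the file names and the regular expression are illustrative assumptions:

```python
from kgx.transformer import Transformer

transformer = Transformer()

# Rewrite rule 4: names matching the regex have the match replaced, then
# the simple reformatting (lower case, hyphens for spaces) yields the
# candidate InfoRes identifier.
transformer.transform(
    input_args={
        "filename": ["graph_nodes.tsv", "graph_edges.tsv"],  # illustrative paths
        "format": "tsv",
        "knowledge_source": (r"\s*\(.+\)\s*", ""),  # e.g. drop parenthesized suffixes
    },
    output_args={"format": "null"},
)

# The catalog of knowledge-source-name -> InfoRes mappings inferred above.
for name, infores in transformer.get_infores_catalog().items():
    print(f"{name}\t{infores}")
```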
4 changes: 2 additions & 2 deletions kgx/cli/cli_utils.py
@@ -853,7 +853,7 @@ def transform_source(
     Returns an instance of Sink
     """
-    log.info(f"Processing source '{key}'")
+    log.debug(f"Processing source '{key}'")
     input_args = prepare_input_args(
         key,
         source,
@@ -881,7 +881,7 @@ def transform_source(
         catalog: Dict[str, str] = transformer.get_infores_catalog()
         for source in catalog.keys():
             infores = catalog.setdefault(source, "unknown")
-            print(f"{source}\t{infores}", file=irc)
+            log.debug(f"{source}\t{infores}", file=irc)

     return transformer.store
1 change: 1 addition & 0 deletions kgx/error_detection.py
@@ -14,6 +14,7 @@ class ErrorType(Enum):
     """

     MISSING_NODE_PROPERTY = 1
+    MISSING_PROPERTY = 1.5
     MISSING_EDGE_PROPERTY = 2
     INVALID_NODE_PROPERTY = 3
     INVALID_EDGE_PROPERTY = 4
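The new member uses a float value to slot between the existing node and edge variants without renumbering them; Python's `Enum` permits any value. A quick standalone illustration of that behavior (an abridged copy of the enum, for demonstration only):

```python
from enum import Enum

class ErrorType(Enum):
    """Abridged copy of the enum in this diff, for illustration only."""
    MISSING_NODE_PROPERTY = 1
    MISSING_PROPERTY = 1.5  # float value keeps ordering between 1 and 2
    MISSING_EDGE_PROPERTY = 2

# Lookup by value resolves the new member like any other.
assert ErrorType(1.5) is ErrorType.MISSING_PROPERTY
print(ErrorType.MISSING_PROPERTY.name, ErrorType.MISSING_PROPERTY.value)
```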
6 changes: 3 additions & 3 deletions kgx/sink/neo_sink.py
@@ -5,7 +5,6 @@
 from kgx.error_detection import ErrorType
 from kgx.sink.sink import Sink
 from kgx.source.source import DEFAULT_NODE_CATEGORY
-
 log = get_logger()


@@ -84,7 +83,7 @@ def _write_node_cache(self) -> None:
         filtered_categories = [x for x in categories if x not in self._seen_categories]
         self.create_constraints(filtered_categories)
         for category in self.node_cache.keys():
-            log.debug("Generating UNWIND for category: {}".format(category))
+            log.info("Generating UNWIND for category: {}".format(category))
             cypher_category = category.replace(
                 self.CATEGORY_DELIMITER, self.CYPHER_CATEGORY_DELIMITER
             )
@@ -140,12 +139,13 @@ def _write_edge_cache(self) -> None:
         batch_size = 10000
         for predicate in self.edge_cache.keys():
             query = self.generate_unwind_edge_query(predicate)
-            log.debug(query)
+            log.info(query)
             edges = self.edge_cache[predicate]
             for x in range(0, len(edges), batch_size):
                 y = min(x + batch_size, len(edges))
                 batch = edges[x:y]
                 log.debug(f"Batch {x} - {y}")
+                log.info(edges[x:y])
                 try:
                     self.session.run(
                         query, parameters={"relationship": predicate, "edges": batch}
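The `_write_edge_cache` hunk above batches each predicate's edges into 10,000-row slices before a single parameterized write per batch. A standalone sketch of just the slicing arithmetic, with stand-in data and no Neo4j required:

```python
batch_size = 10000
edges = [{"id": i} for i in range(25000)]  # stand-in for cached edge dicts

for x in range(0, len(edges), batch_size):
    y = min(x + batch_size, len(edges))  # mirrors the source; slicing alone would also clamp
    batch = edges[x:y]
    print(f"Batch {x} - {y}: {len(batch)} edges")  # 10000, 10000, 5000
```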
15 changes: 13 additions & 2 deletions kgx/source/json_source.py
@@ -1,10 +1,20 @@
 import gzip
-from typing import Optional, Generator, Any
-
 import ijson
 from itertools import chain

+from typing import Dict, Tuple, Any, Generator, Optional, List
+from kgx.config import get_logger
+from kgx.error_detection import ErrorType, MessageLevel
+from kgx.source.source import Source
+from kgx.utils.kgx_utils import (
+    generate_uuid,
+    generate_edge_key,
+    extension_types,
+    archive_read_mode,
+    sanitize_import
+)
 from kgx.source.tsv_source import TsvSource
+log = get_logger()


 class JsonSource(TsvSource):
@@ -96,3 +106,4 @@ def read_edges(self, filename: str) -> Generator:
         FH = open(filename, "rb")
         for e in ijson.items(FH, "edges.item", use_float=True):
             yield self.read_edge(e)
+
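The `read_edges` context above streams edges with `ijson` rather than loading the whole JSON document into memory. A small self-contained sketch of that call pattern, with inline bytes standing in for a KGX graph JSON file:

```python
import io
import ijson

# Minimal stand-in for a KGX graph JSON document.
data = io.BytesIO(b'{"edges": [{"subject": "EX:1", "object": "EX:2"}]}')

# "edges.item" lazily yields each element of the top-level "edges" array;
# use_float=True parses numbers as float instead of Decimal.
for edge in ijson.items(data, "edges.item", use_float=True):
    print(edge["subject"], "->", edge["object"])
```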
7 changes: 0 additions & 7 deletions kgx/source/obograph_source.py
@@ -4,7 +4,6 @@
 import ijson
 import stringcase
 from bmt import Toolkit
-
 from kgx.error_detection import ErrorType, MessageLevel
 from kgx.prefix_manager import PrefixManager
 from kgx.config import get_logger
@@ -56,7 +55,6 @@ def parse(
         """
         self.set_provenance_map(kwargs)
-
         n = self.read_nodes(filename, compression)
         e = self.read_edges(filename, compression)
         yield from chain(n, e)
@@ -129,11 +127,6 @@ def read_node(self, node: Dict) -> Optional[Tuple[str, Dict]]:
         if "equivalent_nodes" in node_properties:
             equivalent_nodes = node_properties["equivalent_nodes"]
             fixed_node["same_as"] = equivalent_nodes
-            # for n in node_properties['equivalent_nodes']:
-            #     data = {'subject': fixed_node['id'], 'predicate': 'biolink:same_as',
-            #             'object': n, 'relation': 'owl:sameAs'}
-            #     super().load_node({'id': n, 'category': ['biolink:OntologyClass']})
-            #     self.graph.add_edge(fixed_node['id'], n, **data)
         return super().read_node(fixed_node)

     def read_edges(self, filename: str, compression: Optional[str] = None) -> Generator:
3 changes: 1 addition & 2 deletions kgx/source/source.py
@@ -25,7 +25,7 @@ def __init__(self, owner):
         self.node_properties = set()
         self.edge_properties = set()
         self.prefix_manager = PrefixManager()
-        self.infores_context: Optional[InfoResContext] = None
+        self.infores_context: Optional[InfoResContext] = InfoResContext()

     def set_prefix_map(self, m: Dict) -> None:
         """
@@ -256,7 +256,6 @@ def set_provenance_map(self, kwargs):
         """
         Set up a provenance (Knowledge Source to InfoRes) map
         """
-        self.infores_context = InfoResContext()
         self.infores_context.set_provenance_map(kwargs)

     def get_infores_catalog(self) -> Dict[str, str]:
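This change moves `InfoResContext` construction from `set_provenance_map` into `__init__`, so the context exists even when no provenance map is ever set. A self-contained sketch of the pattern; the class bodies below are stand-ins, not KGX's own implementations:

```python
from typing import Dict, Optional

class InfoResContext:
    """Stand-in for kgx's InfoResContext."""
    def __init__(self) -> None:
        self.catalog: Dict[str, str] = {}

class Source:
    def __init__(self) -> None:
        # Eager construction (the change in this diff): the attribute is
        # never None, so get_infores_catalog() is safe to call before
        # set_provenance_map() ever runs.
        self.infores_context: Optional[InfoResContext] = InfoResContext()

    def get_infores_catalog(self) -> Dict[str, str]:
        return self.infores_context.catalog

print(Source().get_infores_catalog())  # {} rather than an AttributeError on None
```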
7 changes: 1 addition & 6 deletions kgx/source/tsv_source.py
@@ -13,7 +13,6 @@
     archive_read_mode,
     sanitize_import
 )
-
 log = get_logger()

 DEFAULT_LIST_DELIMITER = "|"
@@ -226,11 +225,9 @@ def read_node(self, node: Dict) -> Optional[Tuple[str, Dict]]:
         if node:
             # if not None, assumed to have an "id" here...
             node_data = sanitize_import(node.copy(), self.list_delimiter)

             n = node_data["id"]
-
-            self.set_node_provenance(node_data)
-
+            self.set_node_provenance(node_data)  # this method adds provided_by to the node properties/node data
             self.node_properties.update(list(node_data.keys()))
             if self.check_node_filter(node_data):
                 self.node_properties.update(node_data.keys())
@@ -272,9 +269,7 @@ def read_edge(self, edge: Dict) -> Optional[Tuple]:
         edge = self.validate_edge(edge)
         if not edge:
             return None
-
         edge_data = sanitize_import(edge.copy(), self.list_delimiter)
-
         if "id" not in edge_data:
             edge_data["id"] = generate_uuid()
         s = edge_data["subject"]
21 changes: 7 additions & 14 deletions kgx/transformer.py
@@ -116,7 +116,6 @@ def __init__(
             if len(entry):
                 entry = entry.strip()
                 if entry:
-                    print("entry: " + entry, file=stderr)
                     source, infores = entry.split("\t")
                     self._infores_catalog[source] = infores
@@ -208,22 +207,13 @@ def transform(
         if output_args:
             if self.stream:
                 if output_args["format"] in {"tsv", "csv"}:
-                    if "node_properties" not in output_args:
-                        error_type = ErrorType.MISSING_NODE_PROPERTY
-                        self.log_error(
-                            entity=f"{output_args['format']} stream",
-                            error_type=error_type,
-                            message=f"'node_properties' not defined for output while streaming. " +
-                                    f"The exported format will be limited to a subset of the columns.",
-                            message_level=MessageLevel.WARNING
-                        )
-                    if "edge_properties" not in output_args:
-                        error_type = ErrorType.MISSING_EDGE_PROPERTY
+                    if "node_properties" not in output_args or "edge_properties" not in output_args:
+                        error_type = ErrorType.MISSING_PROPERTY
                         self.log_error(
                             entity=f"{output_args['format']} stream",
                             error_type=error_type,
-                            message=f"'edge_properties' not defined for output while streaming. " +
-                                    f"The exported format will be limited to a subset of the columns.",
+                            message=f"'node_properties' and 'edge_properties' must be defined for output while"
+                                    f"streaming. The exported format will be limited to a subset of the columns.",
                             message_level=MessageLevel.WARNING
                         )
                 sink = self.get_sink(**output_args)
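Since streaming tsv/csv output cannot scan the whole graph to discover columns, callers avoid the consolidated `MISSING_PROPERTY` warning by listing the columns up front. A hedged sketch; the output name and property lists are illustrative, not a required set:

```python
from kgx.transformer import Transformer

transformer = Transformer(stream=True)
transformer.transform(
    input_args={"filename": ["graph_nodes.tsv", "graph_edges.tsv"], "format": "tsv"},
    output_args={
        "filename": "output-graph",  # illustrative output name
        "format": "tsv",
        # Declaring columns up front sidesteps the streaming warning above.
        "node_properties": ["id", "category", "name"],
        "edge_properties": ["id", "subject", "predicate", "object", "relation"],
    },
)
```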
@@ -274,6 +264,7 @@ def transform(
                     if "node_properties" not in output_args:
                         output_args[
                             "node_properties"
                         ] = intermediate_source.node_properties
+                        log.debug("output_args['node_properties']: " + str(output_args["node_properties"]), file=stderr)
                     if "edge_properties" not in output_args:
                         output_args[
                             "edge_properties"
@@ -342,6 +333,7 @@ def process(self, source: Generator, sink: Sink) -> None:
         """
         for rec in source:
             if rec:
+                log.debug("length of rec", len(rec), "rec", rec)
                 if len(rec) == 4:  # infer an edge record
                     write_edge = True
                     if "subject_category" in self.edge_filters:
@@ -367,6 +359,7 @@ def process(self, source: Generator, sink: Sink) -> None:
                         self._seen_nodes.add(rec[0])
                     if self.inspector:
                         self.inspector(GraphEntityType.NODE, rec)
+                    # last element of rec is the node properties
                    sink.write_node(rec[-1])

        # TODO: review whether or not the 'save()' method need to be 'knowledge_source' aware?