Dereferencer #11

Merged
merged 90 commits into from
Jan 17, 2024
3ce6908
Merge pull request #5 from lifewatch/fix/docker-names
laurianvm Sep 8, 2023
08ab97c
Merge pull request #7 from lifewatch/fix/jupyter-token
laurianvm Sep 8, 2023
af750b2
Merge branch 'main' of github.com:lifewatch/user-analysis-2023 into t…
marc-portier Sep 9, 2023
3afba5d
realized we need the data for the ingest
marc-portier Sep 9, 2023
342d63e
docker image builds python using poetry
marc-portier Sep 10, 2023
3a3c169
apply image names
marc-portier Sep 10, 2023
8d6bc1a
getting the graphdb to work together with the sparqlwrapper
marc-portier Sep 10, 2023
d39c314
minor cleanup
marc-portier Sep 10, 2023
1215308
Merge branch 'fix/docker-names' of github.com:lifewatch/user-analysis…
marc-portier Sep 11, 2023
c2d53c5
create a local graphd-db image that initializes the database
marc-portier Sep 11, 2023
97586d3
introduce the notebooks so they become available in the jupyter
marc-portier Sep 11, 2023
e0eefa2
use the new feaures of the jupyter and graphdb images
marc-portier Sep 11, 2023
30bd04f
cleanup not needed test script
marc-portier Sep 11, 2023
a9b85af
ingest of file succeeded
marc-portier Sep 11, 2023
417766c
prefer https for schema.org
marc-portier Oct 26, 2023
c71976b
rename docker/info script, introducing jq and some enhancements
marc-portier Oct 26, 2023
d9f8ce1
fix error in copy statement (2nd arg required)
marc-portier Oct 26, 2023
edeb104
introduce external shared logging volume
marc-portier Oct 26, 2023
b897e68
updated deps
marc-portier Oct 26, 2023
2f90b6b
updated deps
marc-portier Oct 26, 2023
0983c1d
ensure the log folder exists
marc-portier Oct 26, 2023
846ec54
fix path to data - as it is distinct to the location inside the grpah…
marc-portier Oct 26, 2023
c31846c
use the new external logging/ folder
marc-portier Oct 26, 2023
892abd6
extended readme
marc-portier Oct 26, 2023
907d87d
room for more dependencies in ipynb context
marc-portier Oct 26, 2023
523df6e
as is current dump of progress towards autodetection
marc-portier Nov 14, 2023
7eb3d96
normalise dos2unix for /docker/**/*.sh files
cedricdcc Nov 14, 2023
8b93eb1
added watcher to injest
cedricdcc Nov 14, 2023
4378715
deleted non essential code fr starting graphdb-database
cedricdcc Nov 15, 2023
e41d322
watcher works, iri injest error on graph modifications though
cedricdcc Nov 16, 2023
e66d915
working injest , no auto
cedricdcc Nov 16, 2023
6a83d3e
auto injest complete
cedricdcc Nov 16, 2023
4731fe7
small refactoring
cedricdcc Nov 17, 2023
866c1b8
added rdf2j and refactoring of the graph functions
cedricdcc Nov 17, 2023
36d73e4
Update graph_functions.py
cedricdcc Nov 17, 2023
8fa637b
deleted / commented out non used imports
cedricdcc Nov 17, 2023
ee74345
performed autopep8 and black on all python files
cedricdcc Nov 18, 2023
fd4879d
refactoring of watcher.py , editied templates and graphdb.py function…
cedricdcc Nov 21, 2023
9f8cfdc
beginning of tests
cedricdcc Nov 21, 2023
c916ec5
changed const variables and reverted changes on update context lastmod
cedricdcc Nov 22, 2023
5f67142
done refactoring + tests made + workflows for autopep8 and black made
cedricdcc Nov 27, 2023
f486db3
changed version for workflows
cedricdcc Nov 27, 2023
103b51c
renaming workflow file + change in python test file to check if actio…
cedricdcc Nov 27, 2023
3eed9b7
changed python workflow versions to work with arch x64
cedricdcc Nov 27, 2023
a2c6cf8
attempt 4 at working linting
cedricdcc Nov 27, 2023
c1af6c7
Automated code formatting
github-actions[bot] Nov 27, 2023
d982426
last reforctoring mods
cedricdcc Nov 29, 2023
c6246c8
Automated code formatting
github-actions[bot] Nov 29, 2023
9375895
added beginning of dereferencer
cedricdcc Nov 29, 2023
3fa7ada
added dereferencing config and memory
cedricdcc Nov 30, 2023
ccc7e87
Automated python code formatting
github-actions[bot] Nov 30, 2023
04803d5
small updates lwua-ingest and added deref entity runs for orcid and mr
cedricdcc Dec 1, 2023
b4fcec9
Automated python code formatting
github-actions[bot] Dec 1, 2023
29417c6
deleted metadata management for now in search for more favorable system
cedricdcc Dec 1, 2023
f070d16
Automated python code formatting
github-actions[bot] Dec 1, 2023
1b5963d
working dereferencer
cedricdcc Dec 1, 2023
9188dcc
fixed linting workflow
cedricdcc Dec 1, 2023
0f5714a
Update derefEntity.py
cedricdcc Dec 1, 2023
dec05f6
wf-update
cedricdcc Dec 1, 2023
cc35c2b
Automated python code formatting
github-actions[bot] Dec 1, 2023
5478e3a
adds files directly to graph instead of via ingest
cedricdcc Dec 5, 2023
ebcffc0
Automated python code formatting
github-actions[bot] Dec 5, 2023
f5b682f
added cache life
cedricdcc Dec 5, 2023
7ed1987
Automated python code formatting
github-actions[bot] Dec 5, 2023
bfe2e18
added batches to the insert for huge batches
cedricdcc Dec 5, 2023
76d12df
Automated python code formatting
github-actions[bot] Dec 5, 2023
766d0dd
new bottom up property traversal
cedricdcc Dec 6, 2023
ca54add
Automated python code formatting
github-actions[bot] Dec 6, 2023
40286bf
added testing
cedricdcc Dec 7, 2023
4ac6e93
Automated python code formatting
github-actions[bot] Dec 7, 2023
d64b645
added ability to deref fair signposting links
cedricdcc Dec 8, 2023
89fdb19
Automated python code formatting
github-actions[bot] Dec 8, 2023
c02ca4b
fixed json download issue and added dataset in the config folder
cedricdcc Dec 8, 2023
f97e4cc
Automated python code formatting
github-actions[bot] Dec 8, 2023
1521c54
Update test_derefEntity.py
cedricdcc Dec 11, 2023
442bbc4
Automated python code formatting
github-actions[bot] Dec 11, 2023
72f8110
added new tempate that will get the subject and one for the objects
cedricdcc Dec 11, 2023
4b43371
Automated python code formatting
github-actions[bot] Dec 11, 2023
efc1ca2
revert to only object searches
cedricdcc Dec 12, 2023
0bc1001
Add dereference functionality and write store to graph database
cedricdcc Dec 12, 2023
eb83248
Automated python code formatting
github-actions[bot] Dec 12, 2023
f1bdc46
changed/added testing for re and script tag harvesting
cedricdcc Dec 12, 2023
8941c10
Automated python code formatting
github-actions[bot] Dec 12, 2023
8c40cc2
added more tests
cedricdcc Dec 12, 2023
663838e
Automated python code formatting
github-actions[bot] Dec 12, 2023
8defd84
Update test_derefEntity.py
cedricdcc Dec 12, 2023
92f1f9b
Merge branch 'dereferencer' of https://github.com/cedricdcc/user-anal…
cedricdcc Dec 12, 2023
8003a97
Automated python code formatting
github-actions[bot] Dec 12, 2023
3080711
Update cache lifetime to 1 second, added context while parsing triple…
cedricdcc Jan 10, 2024
077d79b
Automated python code formatting
github-actions[bot] Jan 10, 2024
1 change: 1 addition & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
docker/**/*.sh text eol=lf
46 changes: 46 additions & 0 deletions .github/workflows/linting-python-files.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
name: Python Linting

on:
  push:
    paths:
      - 'docker/lwua-ingest/**/*.py'
      - 'docker/lwua-dereferencer/**/*.py'
  pull_request:
    paths:
      - 'docker/lwua-ingest/**/*.py'
      - 'docker/lwua-dereferencer/**/*.py'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Check out source repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.6

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install black autopep8

      - name: Run Black
        run: |
          black docker/lwua-ingest/
          black docker/lwua-dereferencer/

      - name: Run autopep8
        run: |
          autopep8 --in-place --aggressive --aggressive --max-line-length 79 --recursive docker/lwua-ingest/
          autopep8 --in-place --aggressive --aggressive --max-line-length 79 --recursive docker/lwua-dereferencer/

      - name: Commit and push changes
        run: |
          git config --global user.name 'cedricdcc'
          git config --global user.email 'github-actions[bot]@users.noreply.github.com'
          git add -A
          git commit -m "Automated python code formatting" || exit 0
          git push
33 changes: 33 additions & 0 deletions .github/workflows/lwua-ingest-testing.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
name: Python Tests

on:
  push:
    paths:
      - 'docker/lwua-ingest/lwua-py/**/*.py'
  pull_request:
    paths:
      - 'docker/lwua-ingest/lwua-py/**/*.py'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out source repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.6

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          cd docker/lwua-ingest/lwua-py
          poetry install

      - name: Run pytest
        run: |
          cd docker/lwua-ingest/lwua-py
          poetry run pytest ./tests/
5 changes: 3 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
.env
data/
__pycache__
*.log
__pycache__
.ipynb_checkpoints/
67 changes: 56 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,27 +3,57 @@
## using this project

Steps:

1. retrieve the source code from github
2. to start up the services simply run

2. to build the services simply run

```bash
.$ cp dotenv-example .env # make sure you have an .env file
.$ cd docker && docker-compose build # use docker to build the services
```

3. to start up the services simply run

```bash
.$ cd docker && docker-compose up # use docker to run the services
```

4. open the jupyter notebook

```bash
.$ xdg-open $(docker/jupyter_url.sh) # this gets the url for the service and opens a browser to it
```

5. open the graphdb browser ui

```bash
.$ touch .env # make sure you have an .env file
.$ cd docker
./docker$ docker-compose up
.$ xdg-open http://localhost:7200 # opens the web ui in a browser
```

6. run a test-ingest

This forces at least data/project.ttl into the triple store.
It should not be needed once the ingest runs automatically.

```bash
.$ docker exec -it lwua_ingest /bin/bash # interactively gets you into the ingest env
root@f226b253fbd4:/lwua-py# python -m lwua.ingest # run the ingest
```


## general plan
## general plan ahead -- details to be converted into github issues

big idea is to have a central triples store for the user analysis approach
this to decouple the ingest (retrieval and semantic mapping) from the different sources from the reporting (which should be based on the assembled knowledge graph)

### for the ingest we will need a mix if strategies
* actually getting data by using dumps our webservices
* additionally uplifting thos to triples (via pysubyt)
### for the ingest we will need a mix of strategies
* actually getting raw (non-linked) data by using dumps from webservices
* additionally uplifting those to triples (via pysubyt)
* possibly ingesting long-living reference sets through ldes client

* augmenting strategies --> starting by reading from what we already have in store, decide, then fetch more connected data, and produce more triples
* possibly add semantic reasoner
* attention to provenance triples for meta analysis ?

#### Ingest Tasks
- identify sources (dumps, webservices or sparql endpoints)
@@ -45,7 +75,6 @@ this to decouple the ingest (retrieval and semantic mapping) from the different
- build ipynb reports



### model-design
* identify the shape of the graph we will use and how all items will be linked together
* source for uplifting and querying
@@ -72,9 +101,23 @@ this to decouple the ingest (retrieval and semantic mapping) from the different
- deploy at docker-dev
- setup ci/cd for autodeploy

### meta & wrap up

#### release management
- to be setup
- to consider split between reusable platform of components for generic semantic analysis & lwua23
- to organise multiple repos
- to publish images on docker-hub? elsewhere?

#### documentation
- todo / make lists
- probably organize into separate /docs/**md linked from this readme ?


## repo layout

## documentation

### repo layout

src / py / lwua_ingest --> module for ingest, has nested ./lwua_ingest/ and ./tests/

@@ -91,3 +134,5 @@ docker / tools --> useful bash scripts to do some standard docker commands (as a
docs / **.md --> with useful planning / motivation / usage / etc etc docs (e.g. list-of-sources.md)

data / {source} / **.* out of band retrieved actual files

logging / ** placeholder folder where the dedicated logging from the different docker containers is grouped together.
9 changes: 9 additions & 0 deletions configs/dereference_fair_signposting_dataset.yml
Contributor

generally hate foldernames with suffix-s
./config/ will do nicely

motivation: there is only one folder with that name (so it is singular :) -- , and all folders are meant to possibly contain multiple files ... so what gives?

Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
subjects:
  SPARQL: >
    PREFIX schema: <https://schema.org/>
    SELECT ?s ?p ?o
    WHERE {
      ?s ?p schema:Dataset .
    }
assert-paths:
cache_lifetime: 1
Contributor

please remind me what this does?

10 changes: 10 additions & 0 deletions configs/dereference_fair_signposting_projects.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
subjects:
  SPARQL: >
    PREFIX schema: <https://schema.org/>
    SELECT ?s ?p ?o
    WHERE {
      ?s ?p schema:Project .
    }
assert-paths:
  - <https://schema.org/Dataset>
cache_lifetime: 1
15 changes: 15 additions & 0 deletions configs/dereference_mr_test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
subjects:
  literal:
    - http://marineregions.org/mrgid/63523
    - http://marineregions.org/mrgid/2540
    - http://marineregions.org/mrgid/12548
prefixes:
  ex: <https://example.org/whatever/>
  mr: <http://marineregions.org/ns/ontology#>
assert-paths:
  - "mr:hasGeometry"
  #- "mr:isPartOf / mr:hasGeometry"
  - "mr:isPartOf / <https://schema.org/geo> / <https://schema.org/latitude>"
  - "mr:isPartOf/ <https://schema.org/geo>/<https://schema.org/longitude>"
  #- "mr:isPartOf/mr:hasGeometry / <https://schema.org/latitude> /<https://schema.org/longitude>"
cache_lifetime: 180
10 changes: 10 additions & 0 deletions configs/dereference_test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
subjects:
  SPARQL: >
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX schema: <https://schema.org/>
    SELECT ?s
    WHERE {
      ?s rdf:type schema:Person .
    }
assert-paths:
cache_lifetime: 180
48 changes: 48 additions & 0 deletions configs/example_dereference.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
### Examples

#### Example 1

```yaml
subjects:
  literal:
    - http://marineregions.org/mrgid/63523
    - http://marineregions.org/mrgid/2540
    - http://marineregions.org/mrgid/12548
prefixes:
  ex: <https://example.org/whatever/>
  mr: <http://marineregions.org/ns/ontology#>
assert-paths:
  - "mr:hasGeometry"
  - "mr:isPartOf / mr:hasGeometry"
  - "mr:isPartOf / <https://schema.org/geo> / <https://schema.org/latitude>"
  - "mr:isPartOf/ <https://schema.org/geo>/<https://schema.org/longitude>"
  - "mr:isPartOf/mr:hasGeometry / <https://schema.org/latitude> /<https://schema.org/longitude>"
cache_lifetime: 18000
```

In this example, the subjects are the literal values of the URIs. The `assert-paths` are the property paths to follow in the results. The `cache_lifetime` is the lifetime of the cache in minutes.
Contributor

still does not explain what is being cached, and thus does not help me to consider what should drive my decision as a user to set, increase, decrease this thingy?


#### Example 2

```yaml
subjects:
  SPARQL: >
    select DISTINCT ?s where {
      ?s ?p ?o .
      FILTER regex(str(?s), "^http://marineregions.org/mrgid/[0-9]{1,5}$")
    }
assert-paths:
  - <https://schema.org/Dataset>
cache_lifetime: 5
```

In this example, the subjects are the results of the SPARQL query. The `assert-paths` are the property paths to follow in the results. The `cache_lifetime` is the lifetime of the cache in minutes.
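
The `assert-paths` entries in Example 1 mix prefixed names (`mr:isPartOf`) with full IRIs in angle brackets, separated by `/` with irregular spacing. The following is a minimal sketch of how such a path string could be split into steps and expanded against the `prefixes` mapping. The function name `expand_path` is hypothetical and not taken from the dereferencer code, and only plain `/` sequences are handled, not the other SPARQL property-path operators.

```python
import re

# A step is either a full IRI in angle brackets or a run of characters
# containing neither whitespace nor "/". Trying the IRI alternative first
# keeps the slashes inside <...> from being treated as separators.
_STEP = re.compile(r"<[^>]*>|[^\s/]+")


def expand_path(path, prefixes):
    """Split a '/'-separated property path and expand prefixed names.

    `prefixes` maps a prefix to its namespace, written with or without
    angle brackets (the config examples use angle brackets).
    """
    steps = []
    for token in _STEP.findall(path):
        if token.startswith("<"):
            # Already a full IRI, keep as-is.
            steps.append(token)
            continue
        prefix, _, local = token.partition(":")
        namespace = prefixes[prefix].strip("<>")
        steps.append(f"<{namespace}{local}>")
    return steps


prefixes = {"mr": "<http://marineregions.org/ns/ontology#>"}
steps = expand_path(
    "mr:isPartOf/ <https://schema.org/geo>/<https://schema.org/longitude>",
    prefixes,
)
```

Here `steps` holds three full IRIs, the first being `<http://marineregions.org/ns/ontology#isPartOf>`. An unknown prefix raises `KeyError`, which seems preferable to silently passing an unexpanded name on to a query.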

### Explanation

Key | Value | Required
--- | --- | ---
`subjects` | The subjects to dereference; this can be a list of literal values or a SPARQL query | Yes
`prefixes` | The prefixes to use in the `assert-paths` | No
`assert-paths` | The property paths to test the results against. | Yes
`cache_lifetime` | The lifetime of the cache in minutes | No
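
The table above can be read as a small set of checks on a loaded config. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example with a YAML library) and only tests what the table and the two examples state: which keys must be present and that `subjects` carries either `literal` or `SPARQL`. The function name `check_config` is hypothetical; it is not part of the code in this pull request.

```python
REQUIRED_KEYS = ("subjects", "assert-paths")


def check_config(config):
    """Return a list of problems found in a parsed dereference config."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing required key: {key}")

    # The examples show two ways to give subjects: a list under
    # `literal` or a query string under `SPARQL`.
    subjects = config.get("subjects") or {}
    if "literal" not in subjects and "SPARQL" not in subjects:
        problems.append("subjects needs a 'literal' list or a 'SPARQL' query")

    return problems


good = {
    "subjects": {"literal": ["http://marineregions.org/mrgid/2540"]},
    "prefixes": {"mr": "<http://marineregions.org/ns/ontology#>"},
    "assert-paths": ["mr:hasGeometry"],
    "cache_lifetime": 180,
}
bad = {"subjects": {}}
```

`check_config(good)` returns an empty list, while `check_config(bad)` reports both the missing `assert-paths` key and the empty `subjects` entry. Returning a list rather than raising on the first problem lets a caller report every issue in a config file at once.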
89 changes: 89 additions & 0 deletions data/mr_regions_ldes_test.ttl
Contributor

if this is a _test file should we not put it somewhere else?
at least grouped under ./data/tests for now but maybe even ./tests/data (although that might come natural if we repackage this as k-gap project without other data)

Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
@prefix tree: <https://w3id.org/tree#> .
@prefix ldes: <https://w3id.org/ldes#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix gsp: <http://www.opengis.net/ont/geosparql#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix mr: <http://marineregions.org/ns/ontology#> .
@prefix schema: <https://schema.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://marineregions.org/feed?page=2023-11-28T10%3A00%3A00Z%2F2023-11-28T11%3A00%3A00Z>
a tree:Node ;
tree:relation [ tree:node <http://marineregions.org/feed?page=2023-11-23T14%3A00%3A00Z%2F2023-11-23T15%3A00%3A00Z> ] ;
ldes:retentionPolicy [
a ldes:LatestVersionSubset ;
ldes:amount 1 ;
ldes:versionKey ( dc:isVersionOf )
] .

<http://marineregions.org/feed>
a ldes:EventStream ;
tree:shape [
a sh:NodeShape ;
sh:nodeKind sh:IRI ;
sh:property [
sh:datatype xsd:dateTime ;
sh:minCount 1 ;
sh:path dc:modified
], [
sh:minCount 1 ;
sh:nodeKind sh:IRI ;
sh:path dc:isVersionOf
], [ sh:path skos:note ], [ sh:path skos:historyNote ], [
sh:datatype gsp:wktLiteral ;
sh:maxCount 1 ;
sh:minCount 1 ;
sh:path dcat:centroid
], [
sh:datatype gsp:wktLiteral ;
sh:maxCount 1 ;
sh:minCount 0 ;
sh:path dcat:bbox
], [
sh:minCount 0 ;
sh:nodeKind sh:IRI ;
sh:path mr:hasGeometry
], [
sh:minCount 0 ;
sh:node [
a sh:NodeShape ;
sh:nodeKind sh:IRI ;
sh:property [
sh:class schema:PropertyValue ;
sh:maxCount 1 ;
sh:minCount 1 ;
sh:path schema:identifier
], [
sh:maxCount 1 ;
sh:minCount 1 ;
sh:nodeKind sh:IRI ;
sh:path schema:url
]
] ;
sh:path skos:exactMatch
], [
sh:datatype rdf:langString ;
sh:minCount 1 ;
sh:path skos:prefLabel
], [
sh:datatype rdf:langString ;
sh:minCount 0 ;
sh:path skos:altLabel
], [
sh:class mr:MRGeoObject ;
sh:minCount 0 ;
sh:nodeKind sh:IRI ;
sh:path mr:isRelatedTo
] ;
sh:targetClass mr:MRGeoObject
] ;
tree:view <http://marineregions.org/feed?page=2023-11-28T10%3A00%3A00Z%2F2023-11-28T11%3A00%3A00Z> ;
tree:member <http://marineregions.org/mrgid/63523?t=1701165742> .

<http://marineregions.org/mrgid/63523?t=1701165742>
dc:isVersionOf <http://marineregions.org/mrgid/63523> ;
dc:modified "2023-11-28T10:02:22Z"^^xsd:dateTime .
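
In this test file the member IRI carries a `t` query parameter, and the member points at its base entity through `dc:isVersionOf`. The sketch below assumes `t` is a Unix timestamp in seconds. That is an inference from the data, where it lines up with the `dc:modified` value, and it is not something the feed documents here. The function name `split_member` is hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def split_member(uri):
    """Split a versioned member IRI into (base IRI, UTC timestamp).

    Assumes the version is encoded as `?t=<unix seconds>`.
    """
    parts = urlsplit(uri)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    stamp = int(parse_qs(parts.query)["t"][0])
    return base, datetime.fromtimestamp(stamp, tz=timezone.utc)


base, modified = split_member(
    "http://marineregions.org/mrgid/63523?t=1701165742"
)
```

For the member above, `base` is `http://marineregions.org/mrgid/63523` and `modified` is 2023-11-28 10:02:22 UTC, which agrees with the `dc:isVersionOf` and `dc:modified` triples in the file.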
