inject process metadata (provenance tracking) into produced turtle files #28

Open
laurianvm opened this issue Nov 23, 2021 · 1 comment

@laurianvm
Contributor

(draft) We should be able to trace a record back to its origin and track its versions,
e.g. a data point that is altered after QC.
--> To do so, a set of metadata triples should be produced (e.g. date, time, version of pysubyt, arguments, ...).

@marc-portier
Member

gave this some (a little) thought...

call optionally

It should be optional, and kept separate from the real data-flow, so a command line switch should be added to point to the provenance report to be generated.

-p path-to-prov-report.ttl
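A minimal sketch of how such an optional switch could be wired in, assuming an argparse-based command line. Only the `-p` switch comes from the proposal; the `--prov` long name, the `build_parser` helper and the `prov_report` attribute are hypothetical and not the existing pysubyt CLI.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the -p switch is taken from the proposal.
    parser = argparse.ArgumentParser(prog="pysubyt")
    parser.add_argument(
        "-p", "--prov",
        dest="prov_report",
        metavar="PATH",
        default=None,  # provenance stays off unless the switch is given
        help="write a provenance report (turtle) to PATH",
    )
    return parser


args = build_parser().parse_args(["-p", "path-to-prov-report.ttl"])
prov_enabled = args.prov_report is not None
```

Leaving the default at `None` keeps the normal data-flow untouched when the switch is absent.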

under control of template writer

It should only add provenance statements about selected nodes, under the control of the template designer. So the template designer needs a mechanism to "add" selected URIs to the provenance set. Maybe by wrapping the URI in a pass-through function like this in the template:

<{{provit(uritexpand("https://example.org/id/{#id}",_))}}> a ex:something.

calling towards a new function that follows this general structure:

def provit(uri):
    # actual code to register the uri, associated to the current runtime record-event-and-context
    return uri  # to achieve the pass-through effect
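One way the registration part could look, as a sketch only: a small registry object keeps track of the event currently being processed, and its `provit` method appends to it before handing the URI back. The `ProvRegistry` class, the `start_event` method and the `data.csv` source name are hypothetical.

```python
class ProvRegistry:
    """Hypothetical holder of the URIs registered while records are processed."""

    def __init__(self):
        self.events = []      # one entry per processed record
        self._current = None  # event currently being filled

    def start_event(self, source, location):
        # Open a new event; later provit calls register into it.
        self._current = {"source": source, "location": location, "produced": []}
        self.events.append(self._current)

    def provit(self, uri):
        # Register the uri with the current event, if there is one.
        if self._current is not None:
            self._current["produced"].append(uri)
        return uri  # pass-through: the template output is unchanged


registry = ProvRegistry()
registry.start_event(source="data.csv", location=1)
rendered = registry.provit("https://example.org/id/1")
```

Because `provit` returns its argument unchanged, the rendered turtle is the same with or without provenance tracking.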

follow the template

It should eat our own dogfood, so the prov.ttl should be produced by a pysubyt template itself: we should have a built-in prov-template.ttl file inside the py lib package that holds the template producing the output, based on an internal python-dict holding the prov info assembled during the run.

@laurianvm - if you agree with this approach, you might want to use this issue to draft / suggest the outlines of such python-dict and an appropriate template (and thus useful vocabs) :)

first ideas:

prov = {
  'about': { 'code': '[email protected]', 'exects': '2021-11-23T21:15:52', ...} ,
  'context': {  ... stuff from the context , like flags ... } ,
  'inputs': { ... describing the files making up the sources of sets and _ ...} ,
  'events': [
    { 'source': ref to input-source,
      'location': some ref to line and/or item-number in the set,
      'produced': [ ... list of uris that were registered through provit into this "event" ... ]
    }
  ]
} 

direction of link

I don't like the idea that we would add this kind of provenance info as properties to nodes we add, i.e. let us not reuse those as ?subj nodes that get more structs added to their shapes.

Instead, I would prefer the prov-context to stand on its own feet and link to these registered nodes as ?obj members of an array listing all the outcomes of the described prov-action.

# rather not
:registeredNodeA :producedIn :someContext .
:registeredNodeB :producedIn :someContext .

# but rather
:someContext :producedItems [:registeredNodeA, :registeredNodeB, ...] .
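To make the preferred direction concrete, here is a sketch that writes the registered nodes as objects of a single statement about the context, using a turtle object list. The predicate IRI, the context IRI and the `produced_items_turtle` helper are hypothetical; in the proposal itself this output would come from the built-in template rather than from string formatting.

```python
PRODUCED_ITEMS = "https://example.org/ns/producedItems"  # hypothetical predicate


def produced_items_turtle(context_uri, produced):
    """Return one statement linking the context to every registered node as object."""
    if not produced:
        return ""  # nothing registered, nothing to state
    objects = " ,\n    ".join(f"<{uri}>" for uri in produced)
    return f"<{context_uri}> <{PRODUCED_ITEMS}>\n    {objects} ."


statement = produced_items_turtle(
    "https://example.org/run/1",
    ["https://example.org/id/1", "https://example.org/id/2"],
)
```

The registered nodes only ever appear in object position, so their own shapes stay free of provenance properties.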

implementation thoughts

  • need to have some prov object that assembles all events and registrations during execution
  • will need to pass it into the j2 context, adding events to it in the processing-loop
  • need to check if functions can access the j2 context to register uris to it as we go along
  • end up extracting the .asdict() from that object, to pass it to an extra run of pysubyt applying the built-in prov-template
  • if ever needed we could of course allow to inject a custom template for this last step too

@marc-portier marc-portier changed the title inject process metadata into produced turtle files inject process metadata (provenance tracking) into produced turtle files Feb 16, 2022
@laurianvm laurianvm added this to the 0.2. milestone Feb 17, 2022