-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write pointer files: PREMIS-rw and METS validation #27
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #28
metsrw/fsentry.py
Outdated
@@ -285,7 +309,9 @@ def serialize_structmap(self, recurse=True): | |||
""" | |||
if not self.label: | |||
return None | |||
el = etree.Element(utils.lxmlns('mets') + 'div', TYPE=self.type, LABEL=self.label) | |||
el = etree.Element(utils.lxmlns('mets') + 'div', | |||
TYPE=self.mets_div_type or self.type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems redundant, given how self.mets_div_type
is set during initialization.
metsrw/mets.py
Outdated
file_id_prefix = utils.FILE_ID_PREFIX | ||
# If the file is an AIP, then its prefix is not "file-" but the | ||
# name of the AIP. Therefore we need to get the extension-less | ||
# basename of the AIP's path and remove its UUID suffix to ge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
metsrw/mets.py
Outdated
# name of the AIP. Therefore we need to get the extension-less | ||
# basename of the AIP's path and remove its UUID suffix to ge | ||
# the prefix to remove from the FILEID attribute value. | ||
if entry_type.lower() == 'archival information package': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this string should be a constant
>>> premis_obj = PREMISObject( | ||
identifier_type='UUID', | ||
identifier_value='9bf6bcf8-4d77-4623-a9fb-b703365d0ffe', | ||
...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring should make clear that the tuple passed to data
kwarg is the source of truth for the PREMIS element, under both construction approaches.
@property | ||
def attrs_to_paths(self): | ||
"""Return a dict that maps valid getter attributes to the simplified | ||
XPaths needed to get the corresponding values from ``self.data``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve docstring to explain what this property returns/generates. It analyzes self.schema
and sets self._attrs_to_paths
to a dict that maps implicit getters like 'agent_identifier_value'
and 'identifier_value'
to the XPaths implicit in self.schema
, in this case the XPath 'agent/agent_identifier/agent_identifier_value'
. That same path also implies the getters 'agent_identifier'
and 'identifier'
, which both map to the XPath 'agent/agent_identifier'
and which should return a tuple (or list thereof) instead of a string.
metsrw/plugins/premisrw/premis.py
Outdated
return tuple(ret) | ||
|
||
|
||
def premis(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existence of this function is just confusing. _data_to_lxml_el
should probably just be used directly throughout (and made "public").
metsrw/plugins/premisrw/premis.py
Outdated
|
||
|
||
def data_find(data, path): | ||
"""Find and return the element-as-tuple in tuple ``data`` using simplified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find and return the first element-as-tuple ...
return data_find_all(data, path) | ||
|
||
|
||
def get_event_type(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ever used?
metsrw/plugins/premisrw/premis.py
Outdated
|
||
def get_attrs_to_paths(data, attrs_to_paths=None, path=None): | ||
"""Analyze PREMIS-element-as-tuple ``data`` and return a dict that maps | ||
attribute names to the simplified XPaths needed to retrieve them, e.g.,:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first positional argument here should be schema
not data
. data
is a relic from the previous implementation.
tests/test_mets.py
Outdated
@@ -379,4 +382,213 @@ def test_full_mets(self): | |||
mw.append_file(sip) | |||
mw.write('full_metsrw.xml', fully_qualified=True, pretty_print=True) | |||
|
|||
# TODO: this should be passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deal with this.
cfbb162
to
9b071a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've learned quite a lot reading your code. It's very smart!
I would have liked to see more than one PR though, e.g. validation was its own thing and I think the dependency injection could also be a PR alone easily.
tests/test_mets.py
Outdated
import os | ||
import pprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused.
tests/test_mets.py
Outdated
@@ -378,5 +381,201 @@ def test_full_mets(self): | |||
|
|||
mw.append_file(sip) | |||
mw.write('full_metsrw.xml', fully_qualified=True, pretty_print=True) | |||
self.assert_mets_valid(mw.serialize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fafaf71 this fails but you seem to have deleted it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it fails because the METS file in this test is not really valid (according to the schematron file) because it contains fake metadata elements such as <premis>agent</premis>
. The validation validates not only the METS but also the embedded PREMIS elements.
metsrw/fsentry.py
Outdated
# Dependencies that must be injected. This means that an | ||
# ``FSEntry`` instance must be able to call ``self.premis_object_class`` and | ||
# get a class with methods ``fromtree`` and ``serialize``. | ||
premis_object_class = Dependency( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
tests/test_mets.py
Outdated
@@ -381,10 +381,6 @@ def test_full_mets(self): | |||
|
|||
mw.append_file(sip) | |||
mw.write('full_metsrw.xml', fully_qualified=True, pretty_print=True) | |||
self.assert_mets_valid(mw.serialize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added in fafaf71 and deleted here. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it fails because the METS file in this test is not really valid (according to the schematron file) because it contains fake metadata elements such as <premis>agent</premis>
. The validation validates not only the METS but also the embedded PREMIS elements.
metsrw/plugins/premisrw/premis.py
Outdated
'PREMIS:OBJECT': PREMISObject, | ||
'PREMIS:EVENT': PREMISEvent, | ||
'PREMIS:AGENT': PREMISAgent | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be removed I think. It's a vestige from the initial way I was doing the plugin system.
- PREMISAgent | ||
|
||
""" | ||
__metaclass__ = type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've learned something new!
- Schematron files have been added to the distribution i.e., they are present in pip installs. - Validation reports synthesizing XMLSchema and Schematron reports can be generated. - Can perform XMLSchema (.xsd file) validation using the ``validate`` function. - Can perform Schematron validation using the XML files in metsrw/resources/. - Using schematron files created by copying and modifying https://github.com/artefactual/mets-validator/blob/master/src/main/resources/archivematica_mets_schematron.xml - Tests added, including two new METS files (AIP METS and pointer METS) as fixtures, which were created via a production AM qa/1.x deploy. - Encryption added to known premis event types in pointer file schematron. - Weakened assertions for Archivematica pointer METS files (which are different from Archivematica METS files). Notably, now only requiring that there be at least one PREMIS:AGENT for each PREMIS:EVENT, not two as previously. - Unrelated: ``metsrw::FSEntry`` can now create transformFile elements under mets:file.
The mets_div_type will override the type of the FSEntry in the structmap's <mets:div>'s TYPE attribute. This allows, for example, a METS pointer file to have a user-defined type for the AIP that is more specific than 'Archival Information Package'.
Adds a dependency injection system which allows metsrw classes (i.e., FSEntry) to explicitly declare their dependencies on classes (with a declared API) for reading and writing metadata elements, e.g., PREMIS. Also adds a PREMIS reader/writer plugin which contains classes that conform to the API required by FSEntry.
- F403 'from .premis import *' used; unable to detect undefined names - F405 'VAR' may be undefined, or defined from star imports
The ``PREMISRW_PLUGINS`` was unused.
9b071a8
to
5c2b832
Compare
Makes it easier to use mets-reader-writer to create Archivematica METS pointer files for compressed/encrypted Archival Information Packages (AIPs) by doing the following.
See this draft Architectural Decision Record for unifying METS interactions in Archivematica.
The PREMIS-RW functionality introduced here has significant overlap with the work in PR #20, which introduces a different approach to PREMIS-related functionality. These two pull requests should be reconciled. This one has the benefit of providing a plugin/dependency injection system. It should be easy to modify the PREMIS functionality of PR #20 (or that of uchicagoldr-premiswork) to work with this dependency injection system. See issue #28.
These changes were motivated by the desire to make it easier to read and write METS files, specifically Archivematica METS pointer files, in the context of the larger AIP replication project where interactions with METS files are increasingly being spread across several Archivematica-related tools/repos. Saliently, the Storage Service will now be responsible for creating pointer files. As a result, there is a greater need for establishing standard practices for interacting with METS files in Archivematica-related work. It is hoped that the changes introduced here can help to achieve that goal. See DRAFT Architectural Decision Record (ADR) 1: Unification of METS Creation in Archivematica
.