Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to enable custom structmaps to be more expressive #1404

Merged
merged 1 commit into from
Apr 17, 2019

Conversation

ross-spencer
Copy link
Contributor

@ross-spencer ross-spencer commented Apr 11, 2019

Fixes to enable custom structmaps to be more expressive

Previously a user would need to submit a custom mets:structMap in
Archivematica using a FILEID attribute for a file path which would
be replaced during METS generation. The FILEID is only capable of
storing a limited set of values and so here we replace that
capability and ask users to use the CONTENTIDS attribute of a custom
structMap which can encode a larger range of values to ensure that
the final METS document remains valid.

Unit tests have been created to support maintenance of this feature
on an ongoing basis.

Validation of structMap input is completed during transfer and ingest
and this validation was previously been done by an external command
called xmllint. Given the opportunity to make the validation of METS
documents in Archivematica more consistent these calls in
Archivematica jobs to xmllint have been replaced with more idiomatic
calls to Python lxml validation methods.

Validation of mets:structMap files is now completed against
METS v1.12.

Other small fixes in this commit include the ability for the custom
structMap identifier to be updated dynamically where previously it
would always be structMap_2 which would be incompatible if other
automatically generated structMaps were included in the METS. Users
should note that they can also label a custom structMap with
something more meaningful to them using their own label and this will
not be replaced.

From the original commit history

This PR is separated into multiple commits and it is recommended to tackle them one by one for the code reviewer:

  1. We implement new unit tests to demonstrate the existing capability.
  2. We enhance the create_mets_v2.py code to support mode dynamic testing and add those tests.
  3. Make sure FILEIDs are all handled.
  4. We fix custom structmap numbering.
  5. We add the fixes to embed_custom_structmap to enable more complex structures.
  6. Add mets validation as structmap should be valid in its own right.
  7. We do some minor refactoring to create_mets_v2.py to separate responsibilities somewhat.
  8. We replace the verification shell script calling xmllint with Python.
  9. We didn't correctly anticipate all tests required to accurately reflect a transfer layout, this commit deals with that.
  10. Finally, we missed the sanitized names use-case where user submitted names might not anticipate AM's handling of that, so we deal with that here.

Feature: Draft documentation
Sample-data for testing can be found here: artefactual/archivematica-sampledata#52

Connected to archivematica/Issues#283
Connected to archivematica/Issues#633
Connected to archivematica/Issues#245
Connected to IISH#5
Connected to IISH#4

@ross-spencer ross-spencer force-pushed the dev/issue-283-structmap-fixes branch 9 times, most recently from d3bc39a to 5476b50 Compare April 12, 2019 21:55
@ross-spencer ross-spencer force-pushed the dev/issue-283-structmap-fixes branch 2 times, most recently from 247309f to 51b3793 Compare April 13, 2019 18:01
@ross-spencer ross-spencer changed the title WIP: Custom structmap fixes Fixes to enable custom structmaps to be more expressive Apr 15, 2019
@ross-spencer ross-spencer requested a review from a team April 15, 2019 14:39
Copy link
Contributor

@cole cole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks good! A few questions inline.

I have one overall concern. If I understand ADR-0001 correctly, we'd like to (over time) remove most of the XML generation code from create_mets_v2, and replace it with metsrw usage. It seems like this PR does not advance that goal. It doesn't make that task too much harder, but it seems like if we're touching this script in a significant way, it's a good opportunity to move in that direction. I do understand though that metsrw has some serious issues, that may prevent it from being used at all here, so maybe we need to just understand that while we have that goal, we can't advance it here (I don't know the urgency of this change, etc).

return
if not os.path.isfile(mets_xsd):
raise (VerifyMETSException("METS asset is unavailable"))
xmlschema = etree.XMLSchema(etree.parse(mets_xsd))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, does metsrw not yet work for this case?

@@ -1,7 +1,9 @@
# -*- coding: utf8
from __future__ import unicode_literals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return sum([len(files) for _, dir_, files in os.walk(path)])

@staticmethod
def validate_mets(mets_xsd, mets_structmap):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, just to confirm, metsrw can't be used here?

that they don't have to anticipate the Archivematica normalization process.
"""
return os.path.join(
"", *[sanitizeName(name.encode("utf8")) for name in path.split(os.path.sep)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that these files will already have been sanitized, so we're just matching up to the current state on the filesystem/database, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry i missed this @cole, yep! That's right. We'll have to keep this in mind when you complete your work on that microservice job.

tree.getroot()
) # TDOD - not root to return, but sub element structMap
# print etree.tostring(root)
root = tree.getroot()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, does metsrw not yet work for this case?

@ross-spencer
Copy link
Contributor Author

Hi @cole thanks for the quick input. For metsrw I investigated the possibility, and came up with this as the primary issue for not using it: archivematica/Issues#636 and getting to this stage in the PR used the majority of the budgeted time as much as I would love to work my way through it.

In prepping the way for metsrw:

  • We have removed a shell script calling xmllint so brought scripts into line using lxml.
  • We use the same validation call in the microservices as we do in the unit tests.
  • Created unit tests for the new feature to help with refactoring to use metsrw in future.

I approached it as taking one step forward.

@cole
Copy link
Contributor

cole commented Apr 16, 2019

@ross-spencer fair enough! Moving create_mets_v2 to metsrw is a big job, maybe it's wishful thinking that we can tackle it incrementally. Thanks for the detailed issues you've created (archivematica/Issues#636 and related).

@ross-spencer
Copy link
Contributor Author

@cole yeah, but I'm looking forward to making it happen though 🤓 . If there's an appetite to have lxml and metsrw mixing with each other for a few releases, then that's useful to know too.

Previously a user would need to submit a custom mets:structMap in
Archivematica using a FILEID attribute for a file path which would
be replaced during METS generation. The FILEID is only capable of
storing a limited set of values and so here we replace that
capability and ask users to use the CONTENTIDS attribute of a custom
structMap which can encode a larger range of values to ensure that
the final METS document remains valid.

Unit tests have been created to support maintenance of this feature
on an ongoing basis.

Validation of structMap input is completed during transfer and ingest
and this validation was previously been done by an external command
called xmllint. Given the opportunity to make the validation of METS
documents in Archivematica more consistent these calls in
Archivematica jobs to xmllint have been replaced with more idiomatic
calls to Python lxml validation methods.

Validation of mets:structMap files is now completed against
METS v1.12.

Other small fixes in this commit include the ability for the custom
structMap identifier to be updated dynamically where previously it
would always be structMap_2 which would be incompatible if other
automatically generated structMaps were included in the METS. Users
should note that they can also label a custom structMap with
something more meaningful to them using their own label and this will
not be replaced.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants