-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Problem: Archivematica's mechanism to import custom structmaps can only describe flat structures, e.g. no nested-dirs, and only for files with limited character-sets, e.g. not even spaces are allowed #283
Comments
TODO:
Example error for 1. /var/archivematica/sharedDirectory/currentlyProcessing/samp7/metadata/mets_structmap.xml:6: element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': '2017-01-27-with-good-reason-ferdinand-short.mp3' is not a valid value of the atomic type 'xs:IDREF'.
/var/archivematica/sharedDirectory/currentlyProcessing/samp7/metadata/mets_structmap.xml:7: element area: Schemas validity error : Element '{http://www.loc.gov/METS/}area', attribute 'FILEID': '2017-01-27-with-good-reason-ferdinand-short.mp3' is not a valid value of the atomic type 'xs:IDREF'. NB. there is a similar issue described here with Pointer Files: #380 (comment) |
Analysis Archivematica uses this script to validate a structmap supplied with a transfer. It calls the following: if [ -f "${metsFile}" ]; then
xmllint --noout --schema "${schemaFile}" "${metsFile}"
else
echo "No metadata/mets_structmap.xml file to verify."
fi Using this we can rig up a test file with test patterns in it, designed to match a XSD which will do something similar to a mets schema. FILEID errors For the two sample files described below we can generate the following errors: $ xmllint --schema structmap-simulation.xsd field-examples.xml --noout
field-examples.xml:3: element FILEID: Schemas validity error : Element 'FILEID': '2019-04-02-transfer.file' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:4: element FILEID: Schemas validity error : Element 'FILEID': '-transfer file.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:5: element FILEID: Schemas validity error : Element 'FILEID': '.transfer-file.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:6: element FILEID: Schemas validity error : Element 'FILEID': '1-transfer' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:9: element FILEID: Schemas validity error : Element 'FILEID': '1transfer' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:10: element FILEID: Schemas validity error : Element 'FILEID': '2600.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:12: element FILEID: Schemas validity error : Element 'FILEID': 'a@' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:13: element FILEID: Schemas validity error : Element 'FILEID': 'has#' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:14: element FILEID: Schemas validity error : Element 'FILEID': 'handle/1234' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:15: element FILEID: Schemas validity error : Element 'FILEID': '001-digitised.image.tif' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:16: element FILEID: Schemas validity error : Element 'FILEID': 'this/is/a/nested/dir/file.txt' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml fails to validate Note how they are all for the Sample files:
<?xml version="1.0" encoding="utf-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
<mets:structMap TYPE="logical">
<mets:div LABEL="Short Listen: Ferdinand, the Misunderstood Bull by With Good Reason" TYPE="documentary">
<mets:div LABEL="Introduction" ORDER="1">
<!-- File: 2017-01-27-with-good-reason-ferdinand-short.mp3 -->
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="The story of Ferdinand and its context in history" ORDER="2">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:18" END="00:01:13" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Sharon McQueen - Socio Cultural Historian of Children's Literature" ORDER="3">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:01:14" END="00:01:33" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Isolation and Pacifism" ORDER="4">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:01:33" END="00:02:09" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Written on a rainy afternoon by Munro Leaf" ORDER="5">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:02:10" END="00:02:52" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Individualism" ORDER="6">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:02:53" END="00:03:33" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Banned in Germany, Spain, and attempted in the USA" ORDER="7">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:03:34" END="00:04:05" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Ferdinand's fame, even as he's seen as a 'coward'" ORDER="8">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:04:06" END="00:04:35" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Outro" ORDER="9">
<mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:05:06" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<!-- documentary -->
</mets:div>
</mets:structMap>
</mets:mets>
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="structmapfields">
<xs:complexType>
<xs:sequence>
<xs:element name="FILEID" type="xs:IDREF" minOccurs ="0" maxOccurs="unbounded" />
<xs:element name="CONTENTIDS" type="URIs" minOccurs ="0" maxOccurs="unbounded" />
</xs:sequence>
</xs:complexType>
</xs:element>
<!-- from mets.xsd -->
<xs:simpleType name="URIs">
<xs:list itemType="xs:anyURI"/>
</xs:simpleType>
</xs:schema>
<!-- from mets.xsd:
<xsd:attribute name="FILEID" type="xsd:IDREF" use="optional">
<xsd:annotation>
<xsd:documentation xml:lang="en">FILEID (IDREF/O): An optional attribute that provides the XML ID identifying the <file> element that links to and/or contains the digital content represented by the <fptr>. A <fptr> element should only have a FILEID attribute value if it does not have a child <area>, <par> or <seq> element. If it has a child element, then the responsibility for pointing to the relevant content falls to this child element or its descendants.
</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
<xsd:attribute name="CONTENTIDS" type="URIs" use="optional">
<xsd:annotation>
<xsd:documentation xml:lang="en">CONTENTIDS (URI/O): Content IDs for the content represented by the <fptr> (equivalent to DIDL DII or Digital Item Identifier, a unique external ID).
</xsd:documentation>
</xsd:annotation>
</xsd:attribute>
--> |
Additionally, using the example described in the METS documentation, we cannot populate all the FILEIDs as anticipated: <mets:structMap TYPE="logical" ID="structMap_2">
<mets:div TYPE="track" LABEL="Complete documentary">
<!-- File: ferdinand_short_2017_01_27.mp3
NB. The filename is designed to fit the original design of the
custom structMap functionality and so it is kept as simple as
possible and will validate as a xs:IDREF'.
-->
<mets:div LABEL="Introduction" ORDER="1">
<mets:fptr FILEID="file-78a8587a-1788-4a9f-91d9-6cf3e442ced6">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
<mets:div LABEL="Outro" ORDER="2">
<mets:fptr FILEID="file-78a8587a-1788-4a9f-91d9-6cf3e442ced6">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:18" END="00:01:13" BETYPE="TIME"/>
</mets:fptr>
</mets:div>
</mets:div>
</mets:structMap>
</mets:mets> |
I ingested a transfer with the sample "Ferdinand" structMap supplied above. The resulting AIP METS file fails validation with the error "Cvc-id.1: There Is No ID/IDREF Binding For IDREF 'ferdinand_short_2017_01_27.mp3'., Line '298', Column '13'." Presumably that's because IDREF isn't referencing a corresponding entry in the fileSec. |
Hi there @evelynPM yeah, I think that's the only reason. I created some sample data here the Ferdinand example is under SingleObjectExample if you'd like to review these. A cursory review on my machine shows those samples should pass validation okay. |
Ah, I see what you mean now. I think we can make this more robust if that's happening. I'll see if I can recreate and see what's needed to do this... |
@evelynPM I've submitted a draft PR here We can output something like this in the microservice job:
Would that be okay? And my preference would be to not let the process fail at this point, but we haven't many ways to raise that as a warning. Do you have an opinion? The reason the structmap example above doesn't work is that we changed: <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
</mets:fptr> To require: <mets:fptr FILEID="some_value" CONTENTID="ferdinand_short_2017_01_27.mp3">
<mets:area FILEID="some_value" CONTENTID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
</mets:fptr> Which you'll see in the sample data linked to. It looks like we missed this test case in between the original and new versions of this capability. |
I think using CONTENTIDS on user-supplied structMaps if the content is not in a flat file structure is ok. If the content is in a flat structure, the user should not be required to include CONTENTIDS in the structMap. It would be good to replace the CONTENTIDS with FILEID in the AIP METS file once processing is complete, unless the CONTENTIDS are in |
@ross-spencer just to clarify, do all custom structMaps now need to have the CONTENTIDS attribute in order to be parsed to the AIP METS file? |
Ross and I discussed this and agreed that it's ok to require CONTENTIDS in custom structMaps, so long as the requirement is documented. Also, in the AIP METS file the CONTENTIDS don't appear in the Archivematica default structMap, only in the custom structMaps, so they can be left in. |
@ross-spencer Thanks! I think we've low-key decided that docs shouldn't block releases so I've added a docs tag to this - I've been doing this for all the 1.10 things that need docs updates. If you or @evelynPM want to add a separate docs issue, that might be good too. So this can be moved to Done if it is done! |
Hi @evelynPM just one more thing, you asked about this here and I proposed an addition to the script that I think would be of benefit here. Should we still follow up on this while we have a chance with NB. we have a PR against this which is ready to go, if there's a thumbs-up to make this a little more robust. |
Needs to go in release notes! |
@ross-spencer and I had a conversation about this & agreed that the SIP should fail if lack of CONTENTIDS in the custom structMap result in the structMap not being parsed to the AIP METS file. |
@ross-spencer @evelynPM I'm code reviewing the PR connected to this issue. I've tried the Could we update the issue with a Steps to reproduce section? Thanks! |
Hi @replaceafill it's probably not that clear but the additional PR is to capture error conditions we were perhaps being a little too flexible with in the feature's first version which is only being introduced with I've included four new structmaps into the fixtures to test against, so if you look at the xml filter here you'll get an idea what you want to see Archivematica report, or fail on. In summary, if you want to take, say,
Crucially it means the feature will not fail silently, and its harder to put rogue data in, so we also want to make sure the sample data passes, I've tested each of the four samples I created in the sample data repo and they all work as expected but you might want to verify. |
Thank you @ross-spencer. I understand now what those sample transfers do. I know the issue has back and forth discussion about the feature and provides samples and explanations, but I still think a Steps to reproduce section in each issue is worth for people less experienced in some of these features like me 😊 |
@evelynPM I wonder if the docs need updating because of the late changes? It would be this section here: Importing structural metadata with mets_structmap.xml. The referenced scripts are here |
Hi @sallain we didn't modify the mechanism with the last changes, so any changes would be focused on nuances about working with this feature, otherwise 🤞 we should be g2g? |
@ross-spencer Just confirming, the structmap samples are still correct? |
@sallain yep, I would have submitted new ones with the changes otherwise. 👍 |
Expected behaviour
The ability to supply a custom
structmap
that describes logical structure of arbitrary complexity.Current behaviour
The current behavior is that the template for providing a custom
structmap
uses aFILEID
field in anfptr
to describe unique filenames likely to appear in a transfer. These values are replaced withfilesec
UUIDs through a lookup when the AIP METS is written.Ref: https://github.com/artefactual/archivematica-sampledata/blob/f35f041af690774653b63fcd94e6649e719fed66/SampleTransfers/structMapSample/metadata/mets_structmap.xml
Unfortunately, the
FILEID
field is restricted to a specific character set which means that the slashes normally used to describe nested directories are considered to be illegal: https://stackoverflow.com/questions/1631396/what-is-an-xsncname-type-and-when-should-it-be-used/6159003#6159003When a user attempts to supply a custom structmap during transfer. and if that structmap attempts to define some structure, e.g.
some/path/to/file.txt
then the validation of that structmap will fail and the transfer cannot continue.Proposed solution
A solution has been proposed to use the
CONTENTIDS
field of anfptr
as this can encode a URI, for example, aFILE:///
uri. Though we elected to use a far simpler path-like string here:The code-change for this is fairly minimal, nb this is just a work in progress commit: IISH/archivematica@c7cc6cf
The change reads the
CONTENTIDS
field to then does the lookup. TheFILEID
field continues to be replaced with the file UUIDs found in the SIP.Other considerations
Given this is a URI field, we might find that it won't validate given certain combinations of characters. Therefore we may consider not keeping the
CONTENTIDS
field in the XML once it is written (where the current implementation doesn't have these at all). Or we need to consider the impact of URI encoding on making this as easy as possible for users.Your environment (version of Archivematica, OS version, etc)
qa/1.x
.Additional context
Duplicate of: IISH/archivematica#4
Related to IISH/archivematica#5
Related to #245
For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:
The text was updated successfully, but these errors were encountered: