Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: Archivematica's mechanism to import custom structmaps can only describe flat structures, e.g. no nested-dirs, and only for files with limited character-sets, e.g. not even spaces are allowed #283

Closed
ross-spencer opened this issue Oct 18, 2018 · 22 comments
Assignees
Labels
IISH International Institute of Social History Status: review The issue's code has been merged and is ready for testing/review.
Milestone

Comments

@ross-spencer
Copy link
Contributor

ross-spencer commented Oct 18, 2018

Expected behaviour

The ability to supply a custom structmap that describes logical structure of arbitrary complexity.

Current behaviour

The current behavior is that the template for providing a custom structmap uses a FILEID field in an fptr to describe unique filenames likely to appear in a transfer. These values are replaced with filesec UUIDs through a lookup when the AIP METS is written.

Ref: https://github.com/artefactual/archivematica-sampledata/blob/f35f041af690774653b63fcd94e6649e719fed66/SampleTransfers/structMapSample/metadata/mets_structmap.xml

Unfortunately, the FILEID field is restricted to a specific character set which means that the slashes normally used to describe nested directories are considered to be illegal: https://stackoverflow.com/questions/1631396/what-is-an-xsncname-type-and-when-should-it-be-used/6159003#6159003

When a user attempts to supply a custom structmap during transfer. and if that structmap attempts to define some structure, e.g. some/path/to/file.txt then the validation of that structmap will fail and the transfer cannot continue.

Proposed solution

A solution has been proposed to use the CONTENTIDS field of an fptr as this can encode a URI, for example, a FILE:/// uri. Though we elected to use a far simpler path-like string here:

<mets:mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd" OBJID="ARCH00152.355">
  <mets:structMap TYPE="logical" LABEL="IISH Sequence">
    <mets:div TYPE="brochure">
      <mets:div TYPE="page" ORDER="1" LABEL="Page 1">
        <mets:fptr FILEID="one_file_two.txt" CONTENTIDS="sample_dir/one_file_two.txt" />
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

The code-change for this is fairly minimal, nb this is just a work in progress commit: IISH/archivematica@c7cc6cf

The change reads the CONTENTIDS field to then does the lookup. The FILEID field continues to be replaced with the file UUIDs found in the SIP.

Other considerations

Given this is a URI field, we might find that it won't validate given certain combinations of characters. Therefore we may consider not keeping the CONTENTIDS field in the XML once it is written (where the current implementation doesn't have these at all). Or we need to consider the impact of URI encoding on making this as easy as possible for users.

Your environment (version of Archivematica, OS version, etc)

qa/1.x.

Additional context

Duplicate of: IISH/archivematica#4
Related to IISH/archivematica#5
Related to #245


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍
@ross-spencer ross-spencer added the IISH International Institute of Social History label Dec 17, 2018
@ross-spencer ross-spencer self-assigned this Mar 11, 2019
@sevein sevein added the Status: in progress Issue that is currently being worked on. label Mar 13, 2019
@ross-spencer ross-spencer added Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result. and removed Type: bug A flaw in the code that causes the software to produce an incorrect or unexpected result. labels Mar 14, 2019
@sromkey sromkey added this to the 1.10.0 milestone Mar 18, 2019
@ross-spencer ross-spencer changed the title Problem: mets_structmap.xml can only describe flat structures Problem: Archivematica's mechanism to import custom structmaps can only describe flat structures for files with limited character-sets, e.g. not even spaces are allowed Apr 3, 2019
@ross-spencer ross-spencer changed the title Problem: Archivematica's mechanism to import custom structmaps can only describe flat structures for files with limited character-sets, e.g. not even spaces are allowed Problem: Archivematica's mechanism to import custom structmaps can only describe flat structures, e.g. no nested-dirs, and only for files with limited character-sets, e.g. not even spaces are allowed Apr 3, 2019
@ross-spencer
Copy link
Contributor Author

ross-spencer commented Apr 3, 2019

TODO:

  1. Lookup expression to understand why <!-- File: 2017-01-27-with-good-reason-ferdinand-short.mp3 --> fails to comply with an IDREF.

Example error for 1.

/var/archivematica/sharedDirectory/currentlyProcessing/samp7/metadata/mets_structmap.xml:6: element fptr: Schemas validity error : Element '{http://www.loc.gov/METS/}fptr', attribute 'FILEID': '2017-01-27-with-good-reason-ferdinand-short.mp3' is not a valid value of the atomic type 'xs:IDREF'.
/var/archivematica/sharedDirectory/currentlyProcessing/samp7/metadata/mets_structmap.xml:7: element area: Schemas validity error : Element '{http://www.loc.gov/METS/}area', attribute 'FILEID': '2017-01-27-with-good-reason-ferdinand-short.mp3' is not a valid value of the atomic type 'xs:IDREF'.

NB. there is a similar issue described here with Pointer Files: #380 (comment)

@ross-spencer
Copy link
Contributor Author

ross-spencer commented Apr 8, 2019

Analysis

Archivematica uses this script to validate a structmap supplied with a transfer.

It calls the following:

if [ -f "${metsFile}" ]; then
    xmllint --noout --schema "${schemaFile}" "${metsFile}"
else
    echo "No metadata/mets_structmap.xml file to verify."
fi

Using this we can rig up a test file with test patterns in it, designed to match a XSD which will do something similar to a mets schema.

FILEID errors

For the two sample files described below we can generate the following errors:

$ xmllint --schema structmap-simulation.xsd field-examples.xml --noout
field-examples.xml:3: element FILEID: Schemas validity error : Element 'FILEID': '2019-04-02-transfer.file' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:4: element FILEID: Schemas validity error : Element 'FILEID': '-transfer file.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:5: element FILEID: Schemas validity error : Element 'FILEID': '.transfer-file.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:6: element FILEID: Schemas validity error : Element 'FILEID': '1-transfer' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:9: element FILEID: Schemas validity error : Element 'FILEID': '1transfer' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:10: element FILEID: Schemas validity error : Element 'FILEID': '2600.zip' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:12: element FILEID: Schemas validity error : Element 'FILEID': 'a@' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:13: element FILEID: Schemas validity error : Element 'FILEID': 'has#' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:14: element FILEID: Schemas validity error : Element 'FILEID': 'handle/1234' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:15: element FILEID: Schemas validity error : Element 'FILEID': '001-digitised.image.tif' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml:16: element FILEID: Schemas validity error : Element 'FILEID': 'this/is/a/nested/dir/file.txt' is not a valid value of the atomic type 'xs:IDREF'.
field-examples.xml fails to validate

Note how they are all for the FILEID and not for the CONTENTIDS.

Sample files:

mets_structmap.xml

<?xml version="1.0" encoding="utf-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div LABEL="Short Listen: Ferdinand, the Misunderstood Bull by With Good Reason" TYPE="documentary">
      <mets:div LABEL="Introduction" ORDER="1">
        <!-- File: 2017-01-27-with-good-reason-ferdinand-short.mp3 -->
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="The story of Ferdinand and its context in history" ORDER="2">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:18" END="00:01:13" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Sharon McQueen - Socio Cultural Historian of Children's Literature" ORDER="3">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:01:14" END="00:01:33" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Isolation and Pacifism" ORDER="4">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:01:33" END="00:02:09" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Written on a rainy afternoon by Munro Leaf" ORDER="5">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:02:10" END="00:02:52" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Individualism" ORDER="6">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:02:53" END="00:03:33" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Banned in Germany, Spain, and attempted in the USA" ORDER="7">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:03:34" END="00:04:05" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Ferdinand's fame, even as he's seen as a 'coward'" ORDER="8">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:04:06" END="00:04:35" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Outro" ORDER="9">
        <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:05:06" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <!-- documentary -->
    </mets:div>
  </mets:structMap>
</mets:mets>

structmap-simulation.xsd

<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<xs:element name="structmapfields">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="FILEID" type="xs:IDREF" minOccurs ="0" maxOccurs="unbounded" />
				<xs:element name="CONTENTIDS" type="URIs" minOccurs ="0" maxOccurs="unbounded" />
			</xs:sequence>
		</xs:complexType>
	</xs:element>

	<!-- from mets.xsd -->
	<xs:simpleType name="URIs">
	    <xs:list itemType="xs:anyURI"/>
	</xs:simpleType>

</xs:schema>


<!-- from mets.xsd:

<xsd:attribute name="FILEID" type="xsd:IDREF" use="optional">
	<xsd:annotation>
		<xsd:documentation xml:lang="en">FILEID (IDREF/O): An optional attribute that provides the XML ID identifying the &lt;file&gt; element that links to and/or contains the digital content represented by the &lt;fptr&gt;. A &lt;fptr&gt; element should only have a FILEID attribute value if it does not have a child &lt;area&gt;, &lt;par&gt; or &lt;seq&gt; element. If it has a child element, then the responsibility for pointing to the relevant content falls to this child element or its descendants.
		</xsd:documentation>
	</xsd:annotation>
</xsd:attribute>
<xsd:attribute name="CONTENTIDS" type="URIs" use="optional">
    <xsd:annotation>
    	<xsd:documentation xml:lang="en">CONTENTIDS (URI/O): Content IDs for the content represented by the &lt;fptr&gt; (equivalent to DIDL DII or Digital Item Identifier, a unique external ID).
        </xsd:documentation>
    </xsd:annotation>
    </xsd:attribute>

-->

@ross-spencer
Copy link
Contributor Author

Additionally, using the example described in the METS documentation, we cannot populate all the FILEIDs as anticipated:

<mets:structMap TYPE="logical" ID="structMap_2">
    <mets:div TYPE="track" LABEL="Complete documentary">
      <!-- File: ferdinand_short_2017_01_27.mp3

          NB. The filename is designed to fit the original design of the
          custom structMap functionality and so it is kept as simple as
          possible and will validate as a xs:IDREF'.
        -->
      <mets:div LABEL="Introduction" ORDER="1">
        <mets:fptr FILEID="file-78a8587a-1788-4a9f-91d9-6cf3e442ced6">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
      <mets:div LABEL="Outro" ORDER="2">
        <mets:fptr FILEID="file-78a8587a-1788-4a9f-91d9-6cf3e442ced6">
          <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:18" END="00:01:13" BETYPE="TIME"/>
        </mets:fptr>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

@ross-spencer ross-spencer added Status: review The issue's code has been merged and is ready for testing/review. and removed Status: in progress Issue that is currently being worked on. labels Apr 19, 2019
@sallain sallain assigned evelynPM and unassigned ross-spencer Jul 8, 2019
@evelynPM
Copy link

I ingested a transfer with the sample "Ferdinand" structMap supplied above. The resulting AIP METS file fails validation with the error "Cvc-id.1: There Is No ID/IDREF Binding For IDREF 'ferdinand_short_2017_01_27.mp3'., Line '298', Column '13'." Presumably that's because IDREF isn't referencing a corresponding entry in the fileSec.

@evelynPM evelynPM added Status: in progress Issue that is currently being worked on. and removed Status: review The issue's code has been merged and is ready for testing/review. labels Jul 15, 2019
@ross-spencer
Copy link
Contributor Author

Hi there @evelynPM yeah, I think that's the only reason.

I created some sample data here the Ferdinand example is under SingleObjectExample if you'd like to review these. A cursory review on my machine shows those samples should pass validation okay.

@ross-spencer
Copy link
Contributor Author

ross-spencer commented Jul 16, 2019

I ingested a transfer with the sample "Ferdinand" structMap supplied above. The resulting AIP METS file fails validation with the error "Cvc-id.1: There Is No ID/IDREF Binding For IDREF 'ferdinand_short_2017_01_27.mp3'., Line '298', Column '13'." Presumably that's because IDREF isn't referencing a corresponding entry in the fileSec.

Ah, I see what you mean now. I think we can make this more robust if that's happening. I'll see if I can recreate and see what's needed to do this...

@ross-spencer
Copy link
Contributor Author

ross-spencer commented Jul 16, 2019

@evelynPM I've submitted a draft PR here

We can output something like this in the microservice job:

Skipping creation of normative structmap
Deleted empty directory /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/122-c90f1b96-f6ce-4805-9f8c-41cd5f744b06/logs/transfers/122-5920296a-6e93-421b-9b76-46f451afb475/logs/fileMeta
Deleted empty directory /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/122-c90f1b96-f6ce-4805-9f8c-41cd5f744b06/logs/fileMeta
Deleted empty directory /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/metadataReminder/122-c90f1b96-f6ce-4805-9f8c-41cd5f744b06/thumbnails
--->   No CONTENTIDS found in custom structmap. Structmap will not be included in AIP METS   <--- structmap warning
DmdSecs: 1
AmdSecs: 5
TechMDs: 5
RightsMDs: 0
DigiprovMDs: 35

Would that be okay? And my preference would be to not let the process fail at this point, but we haven't many ways to raise that as a warning. Do you have an opinion?

The reason the structmap example above doesn't work is that we changed:

  <mets:fptr FILEID="ferdinand_short_2017_01_27.mp3">
    <mets:area FILEID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
  </mets:fptr>

To require:

  <mets:fptr FILEID="some_value" CONTENTID="ferdinand_short_2017_01_27.mp3">
    <mets:area FILEID="some_value" CONTENTID="ferdinand_short_2017_01_27.mp3" BEGIN="00:00:00" END="00:00:17" BETYPE="TIME"/>
  </mets:fptr>

Which you'll see in the sample data linked to. It looks like we missed this test case in between the original and new versions of this capability.

@evelynPM
Copy link

I think using CONTENTIDS on user-supplied structMaps if the content is not in a flat file structure is ok. If the content is in a flat structure, the user should not be required to include CONTENTIDS in the structMap. It would be good to replace the CONTENTIDS with FILEID in the AIP METS file once processing is complete, unless the CONTENTIDS are in <area> or <mptr> elements, (if possible).

@evelynPM
Copy link

@ross-spencer just to clarify, do all custom structMaps now need to have the CONTENTIDS attribute in order to be parsed to the AIP METS file?

@evelynPM
Copy link

Ross and I discussed this and agreed that it's ok to require CONTENTIDS in custom structMaps, so long as the requirement is documented. Also, in the AIP METS file the CONTENTIDS don't appear in the Archivematica default structMap, only in the custom structMaps, so they can be left in.

@ross-spencer ross-spencer added Status: review The issue's code has been merged and is ready for testing/review. and removed Status: in progress Issue that is currently being worked on. labels Jul 26, 2019
@ross-spencer
Copy link
Contributor Author

Thanks @evelynPM. I've moved this out of In progress to Review so we're a bit closer to Verified cc. @sallain or we can put to Verified but we'll need to update the docs with a little more information. The commit for the PID declaration method was added here.

@sallain sallain added the Type: documentation The issue requires documentation. label Jul 26, 2019
@sallain
Copy link
Member

sallain commented Jul 26, 2019

@ross-spencer Thanks! I think we've low-key decided that docs shouldn't block releases so I've added a docs tag to this - I've been doing this for all the 1.10 things that need docs updates. If you or @evelynPM want to add a separate docs issue, that might be good too. So this can be moved to Done if it is done!

@ross-spencer
Copy link
Contributor Author

ross-spencer commented Jul 29, 2019

Hi @evelynPM just one more thing, you asked about this here and I proposed an addition to the script that I think would be of benefit here. Should we still follow up on this while we have a chance with 1.10? cc. @sallain as a reminder to follow-up on for today's AM weekly.

NB. we have a PR against this which is ready to go, if there's a thumbs-up to make this a little more robust.

@sallain
Copy link
Member

sallain commented Jul 29, 2019

Needs to go in release notes!

@ross-spencer ross-spencer added Status: in progress Issue that is currently being worked on. and removed Status: review The issue's code has been merged and is ready for testing/review. labels Jul 30, 2019
@evelynPM
Copy link

@ross-spencer and I had a conversation about this & agreed that the SIP should fail if lack of CONTENTIDS in the custom structMap result in the structMap not being parsed to the AIP METS file.

@replaceafill
Copy link
Member

@ross-spencer @evelynPM I'm code reviewing the PR connected to this issue. I've tried the SingleObjectExample and MinimalStructMapExample sample transfers but cannot see any failures in the process. I do see the Custom structmap validated correctly output in the Verify mets_structmap.xml compliance job.

Could we update the issue with a Steps to reproduce section? Thanks!

@ross-spencer
Copy link
Contributor Author

Hi @replaceafill it's probably not that clear but the additional PR is to capture error conditions we were perhaps being a little too flexible with in the feature's first version which is only being introduced with 1.10.

I've included four new structmaps into the fixtures to test against, so if you look at the xml filter here you'll get an idea what you want to see Archivematica report, or fail on.

In summary, if you want to take, say, MinimalStructMapExample, you need to modify it to four different states and transfer if four times:

  1. No CONTENTID attributes (each of these will be found in mets:fptr elements or mets:area elements.
  2. One missing CONTENTID attribute.
  3. A add a path that does not exist to a CONTENTID attribute.
  4. CONTENTID attributes exist but are "" empty.

Crucially it means the feature will not fail silently, and its harder to put rogue data in, so we also want to make sure the sample data passes, I've tested each of the four samples I created in the sample data repo and they all work as expected but you might want to verify.

@replaceafill
Copy link
Member

Thank you @ross-spencer. I understand now what those sample transfers do.

I know the issue has back and forth discussion about the feature and provides samples and explanations, but I still think a Steps to reproduce section in each issue is worth for people less experienced in some of these features like me 😊

@ross-spencer ross-spencer added Status: review The issue's code has been merged and is ready for testing/review. and removed Status: in progress Issue that is currently being worked on. labels Jul 31, 2019
@sallain
Copy link
Member

sallain commented Aug 20, 2019

@evelynPM I wonder if the docs need updating because of the late changes? It would be this section here: Importing structural metadata with mets_structmap.xml. The referenced scripts are here

@ross-spencer
Copy link
Contributor Author

Hi @sallain we didn't modify the mechanism with the last changes, so any changes would be focused on nuances about working with this feature, otherwise 🤞 we should be g2g?

@sallain
Copy link
Member

sallain commented Aug 21, 2019

@ross-spencer Just confirming, the structmap samples are still correct?

@ross-spencer
Copy link
Contributor Author

@sallain yep, I would have submitted new ones with the changes otherwise. 👍

@sallain sallain removed the Type: documentation The issue requires documentation. label Aug 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IISH International Institute of Social History Status: review The issue's code has been merged and is ready for testing/review.
Projects
None yet
Development

No branches or pull requests

6 participants