Problem: mets-reader-writer places its own restrictions on METS profiles it can read and then validate (metsrw) #636

Open
ross-spencer opened this issue Apr 8, 2019 · 1 comment
Labels
Ⓜ️ mets/premis METS/PREMIS issues

Comments

@ross-spencer
Contributor

ross-spencer commented Apr 8, 2019

Expected behaviour

It may be desirable (convenient?) to load any METS document into mets-reader-writer to perform validation against the METS schema.

Current behaviour

mets-reader-writer limits what it can import by checking for the existence of various properties within the METS as it is loaded. The reader-writer could potentially be more general purpose.

As an example:

When mets-reader-writer loads XML from a file, it calls the following functions in turn:

  • fromtree, which then calls:

  • _parse_tree

In _parse_tree we seek the existence of a physical structMap and raise an error if one isn't found: raise exceptions.ParseError("No physical structMap found.")

A structmap however isn't a mandatory element of a METS file. And here (1.12) looking specifically for a physical structMap is also an additional stipulation affecting our ability to load any particular METS.
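To make the difference concrete, here is a minimal sketch of a strict lookup next to a more lenient one. This is not metsrw's code: it uses only the standard library, and the names struct_maps_by_type, pick_struct_map, and the local ParseError class are hypothetical stand-ins. The strict mode mimics the error message quoted above; the lenient mode falls back to whatever structMap is present.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Trimmed-down version of a METS file that carries only a logical structMap.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="How to create a hierarchical book"/>
  </mets:structMap>
</mets:mets>"""


class ParseError(Exception):
    """Hypothetical local stand-in for metsrw.exceptions.ParseError."""


def struct_maps_by_type(root):
    """Group every structMap directly under the METS root by its TYPE."""
    grouped = {}
    for struct_map in root.findall("mets:structMap", METS_NS):
        grouped.setdefault(struct_map.get("TYPE"), []).append(struct_map)
    return grouped


def pick_struct_map(root, require_physical=True):
    """Return a structMap to work with.

    require_physical=True mimics the strict behaviour described in this
    issue; require_physical=False accepts the first structMap of any TYPE
    and returns None when the document has none at all.
    """
    grouped = struct_maps_by_type(root)
    if "physical" in grouped:
        return grouped["physical"][0]
    if require_physical:
        raise ParseError("No physical structMap found.")
    for maps in grouped.values():
        return maps[0]
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(pick_struct_map(root, require_physical=False).get("TYPE"))
```

Whether a lenient mode like this belongs in the library itself, or behind an option on load, is an open design question; the sketch only shows that the physical-only check is a choice rather than a necessity.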

Steps to reproduce

A sample METS file, containing only a logical structMap, that will fail to load is as follows:

<?xml version="1.0" encoding="utf-8"?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="How to create a hierarchical book">
      <mets:div TYPE="page" LABEL="Cover">
        <mets:fptr FILEID="cover.jpg"/>
      </mets:div>
      <mets:div TYPE="page" LABEL="Inside cover">
        <mets:fptr FILEID="inside_cover.jpg"/>
      </mets:div>
      <mets:div TYPE="chapter" LABEL="Chapter 1">
        <mets:div TYPE="page" LABEL="Page 1">
          <mets:fptr FILEID="page_01.jpg"/>
        </mets:div>
        <mets:div TYPE="subchapter" LABEL="Subchapter 1.1">
          <mets:div TYPE="page" LABEL="Page 2">
            <mets:fptr FILEID="page_02.jpg"/>
          </mets:div>
          <mets:div TYPE="page" LABEL="Page 3">
            <mets:fptr FILEID="page_03.jpg"/>
          </mets:div>
          <mets:div TYPE="page" LABEL="Page 4">
            <mets:fptr FILEID="page_04.jpg"/>
          </mets:div>
          <mets:div TYPE="subchapter" LABEL="Subchapter 1.2">
            <mets:div TYPE="page" LABEL="Page 5">
              <mets:fptr FILEID="page_05.jpg"/>
            </mets:div>
            <mets:div TYPE="page" LABEL="Page 6">
              <mets:fptr FILEID="page_06.jpg"/>
            </mets:div>
            <mets:div TYPE="page" LABEL="Page 7">
              <mets:fptr FILEID="page_07.jpg"/>
            </mets:div>
          </mets:div>
          <!-- Subchapter 1.2 -->
        </mets:div>
        <!-- Subchapter 1.1 -->
      </mets:div>
      <!-- Chapter 1 -->
      <!-- Chapters 2 and 3, each with their own subchapters as in Chapter 1, omitted from this example. -->
      <mets:div TYPE="afterword" LABEL="Afterword">
        <mets:div TYPE="page" LABEL="Page 20">
          <mets:fptr FILEID="page_20.jpg"/>
        </mets:div>
      </mets:div>
      <!-- afterword -->
      <mets:div TYPE="index" LABEL="Index">
        <mets:div TYPE="page" LABEL="Index, page 1">
          <mets:fptr FILEID="index_01.jpg"/>
        </mets:div>
        <mets:div TYPE="page" LABEL="Index, page 2">
          <mets:fptr FILEID="index_02.jpg"/>
        </mets:div>
      </mets:div>
      <!-- index -->
      <mets:div TYPE="page" LABEL="Back cover">
        <mets:fptr FILEID="back_cover.jpg"/>
      </mets:div>
      <!-- back cover -->
    </mets:div>
    <!-- book -->
  </mets:structMap>
</mets:mets>

An attempt to load this will result in the following stack trace:

Traceback (most recent call last):
  File "validate-mets.py", line 79, in <module>
    main()
  File "validate-mets.py", line 75, in main
    use_mets(args.mets)
  File "validate-mets.py", line 47, in use_mets
    mets = load_mets(filename[0])
  File "validate-mets.py", line 35, in load_mets
    mets = metsrw.METSDocument.fromfile(filename)  # Reads a file
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 563, in fromfile
    return cls.fromtree(etree.parse(path, parser=parser))
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 587, in fromtree
    mets._parse_tree(tree)
  File "/usr/local/lib/python2.7/dist-packages/metsrw/mets.py", line 506, in _parse_tree
    raise exceptions.ParseError("No physical structMap found.")
metsrw.exceptions.ParseError: No physical structMap found.

Your environment (version of Archivematica, OS version, etc)

metsrw-0.3.7.

Additional context

Validation could be done via mets-rw for custom structmaps rather than via xmllint in archivematicaVerifyMETS.sh.
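For comparison, schema validation with xmllint from Python might look roughly like the sketch below. The exact invocation inside archivematicaVerifyMETS.sh is not shown in this issue, so the command here is an assumption built from xmllint's standard --noout and --schema options; the function names and the file paths are hypothetical.

```python
import subprocess


def build_xmllint_command(mets_path, schema_path):
    """Build an xmllint call that validates mets_path against an XSD.

    --noout suppresses the echoed document, --schema selects the XSD file.
    """
    return ["xmllint", "--noout", "--schema", schema_path, mets_path]


def validate_with_xmllint(mets_path, schema_path):
    """Run xmllint and return (is_valid, stderr_text).

    Raises FileNotFoundError if xmllint is not installed.
    """
    result = subprocess.run(
        build_xmllint_command(mets_path, schema_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ok, errors = validate_with_xmllint("METS.xml", "mets.xsd")
    print("valid" if ok else errors)
```

A check like this only looks at the schema, so the sample above is not rejected for lacking a physical structMap.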


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍
@ross-spencer
Contributor Author

A similar issue from the mets-rw repo, with additional sample data to consider: artefactual-labs/mets-reader-writer#36

@ross-spencer ross-spencer added the Ⓜ️ mets/premis METS/PREMIS issues label Jul 6, 2020