New Catalog Design

A New Design

The new strategy is to create a database of authority information: MADS records with embedded RDF, plus raw RDF. I’m gleaning that information from Alison’s spreadsheets and the First Thousand Years of Greek spreadsheet. These are the real authority sources at this point. Here’s a sample:

    <?xml version="1.0" encoding="utf-8"?>
    <mads:mads xmlns:mads="http://www.loc.gov/mads/v2" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:efrbroo="http://erlangen-crm.org/efrbroo/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:local="http://www.perseus.tufts.edu/">
	<mads:authority>
	   <mads:titleInfo>
	      <mads:title>Fragmenta</mads:title>
	   </mads:titleInfo>
	</mads:authority>
	<mads:identifier type="ctsurn">urn:cts:greekLit:fhg0138.fhg001</mads:identifier>
	<mads:extension>
	   <rdf:RDF>
	      <rdf:Description rdf:about="urn:cts:greekLit:fhg0138.fhg001">
		 <rdfs:label>Fragmenta</rdfs:label>
		 <dcterms:isPartOf>urn:cts:greekLit:fhg0138</dcterms:isPartOf>
		 <rdf:type rdf:resource="http://erlangen-crm.org/efrbroo/F15_Complex_Work"/>
		 <efrbroo:P48_has_preferred_identifier>urn:cts:greekLit:fhg0138.fhg001</efrbroo:P48_has_preferred_identifier>
		 <efrbroo:R10i_is_member_of rdf:resource="urn:cts:greekLit:fhg0138"/>
	      </rdf:Description>
	   </rdf:RDF>
	</mads:extension>
    </mads:mads>

A few things to notice:

- The MADS record contains a CTS URN that acts as the key linking a MODS record to the work expression(s) it contains (more on that in a moment).
- This is a standard MADS authority record, so it could contain lots of other information about the work, pulled from various sources, including alternative IDs.
- We take advantage of the mads:extension element to embed some linked data in the record. This data serves two purposes:
  - it augments the MADS record with information that can be used by applications to locate the record;
  - it can be extracted and put into an RDF store for LOD applications.
- There is no link to an author/creator here. There is an authority record for every creator in the system, too (the PrimaryAuthor records, or a derivative of them); in the scheme I’m building, we use independent RDF statements to link creators to their works.
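
As a rough illustration of the second purpose, here is a minimal Python sketch that pulls the CTS URN and the embedded rdf:RDF element out of a MADS record using only the standard library. The SAMPLE string is a trimmed copy of the record above, and the function names cts_urn and embedded_rdf are made up for this example; loading the result into an actual RDF store is left out.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-style lookups below.
NS = {
    "mads": "http://www.loc.gov/mads/v2",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

# Trimmed copy of the sample record, for illustration only.
SAMPLE = """<mads:mads xmlns:mads="http://www.loc.gov/mads/v2"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <mads:identifier type="ctsurn">urn:cts:greekLit:fhg0138.fhg001</mads:identifier>
  <mads:extension>
    <rdf:RDF>
      <rdf:Description rdf:about="urn:cts:greekLit:fhg0138.fhg001">
        <rdfs:label>Fragmenta</rdfs:label>
      </rdf:Description>
    </rdf:RDF>
  </mads:extension>
</mads:mads>"""


def cts_urn(mads_root):
    """Return the CTS URN identifier of a MADS record, or None if absent."""
    node = mads_root.find("mads:identifier[@type='ctsurn']", NS)
    return node.text.strip() if node is not None and node.text else None


def embedded_rdf(mads_root):
    """Return the rdf:RDF element inside mads:extension, or None if absent."""
    return mads_root.find("mads:extension/rdf:RDF", NS)


root = ET.fromstring(SAMPLE)
rdf = embedded_rdf(root)
print(cts_urn(root))
print(ET.tostring(rdf, encoding="unicode"))
```

The serialized rdf:RDF fragment could then be handed to whatever RDF tooling the LOD application uses.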

But what about the MODS records for the catalog?

MODS is intended for manifestation-level data: specific publications, not abstract works. A manifestation is a carrier of one or more expressions of one or more works. So rather than have one MODS record for each expression (as we do now), the new scheme turns the existing MODS inside out: the <relatedItem type="host"> element is hoisted up to become its own MODS record. This is the proper use of MODS: to encode metadata about particular publications. The main element of the old MODS now becomes a constituent expression of the publication. And since we have authority records for all those works, the MODS constituent is little more than a pointer to the MADS record (though other information about the expression, like its language or links to specific pages in online editions, could be encoded there).

Here’s an example:

     <?xml version="1.0" encoding="utf-8"?>
     <mods xmlns="http://www.loc.gov/mods/v3"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.loc.gov/mods/v3 https://www.loc.gov/standards/mods/mods.xsd">
     <recordInfo>
	<recordIdentifier>d42f6101-0810-4011-989f-641c95562f0d</recordIdentifier>
     </recordInfo>
     <titleInfo>
	<title>Grammaticae Romanae fragmenta collegit</title>
     </titleInfo>
     <name xmlns:ns2="http://www.w3.org/1999/xlink"
	    authority="naf"
	    type="personal"
	    ns2:href="http://errol.oclc.org/laf/nr97-15146.html">
	<namePart>Funaioli, Gino</namePart>
	<namePart type="date">1878-1958</namePart>
	<role>
	  <roleTerm type="text" authority="marcrelator">compiler</roleTerm>
	</role>
	<role>
	  <roleTerm type="text" authority="marcrelator">editor</roleTerm>
	</role>
     </name>
     <typeOfResource>text</typeOfResource>
     <originInfo>
	<place>
	  <placeTerm type="code" authority="marccountry">gw</placeTerm>
	</place>
	<place>
	  <placeTerm type="text">Lipsiae</placeTerm>
	</place>
	<publisher>in aedibus B. G. Teubneri</publisher>
	<dateIssued>1907</dateIssued>
	<dateIssued point="start" encoding="marc">1907</dateIssued>
	<issuance>monographic</issuance>
     </originInfo>
     <language>
	<languageTerm type="code" authority="iso639-2b">lat</languageTerm>
     </language>
     <physicalDescription>
	<form authority="marcform">print</form>
	<extent>xxx, 610 p.</extent>
     </physicalDescription>
     <note type="statement of responsibility">recensuit Hyginus Funaioli.</note>
     <classification authority="lcc">PA6103 .G7 1907</classification>
     <subject authority="lcsh">
	<topic>Latin language</topic>
	<topic>Grammar</topic>
	<topic>Early works to 1500</topic>
     </subject>
     <relatedItem type="series">
	<titleInfo>
	  <title>Bibliotheca scriptorum Graecorum et Romanorum Teubneriana. [Scriptores Romani]</title>
	</titleInfo>
     </relatedItem>
     <identifier type="lccn">08002169</identifier>
     <identifier type="oclc">46348511</identifier>
     <location>
	<url displayLabel="LC Permalink">https://lccn.loc.gov/08002169</url>
     </location>
     <location>
	<url displayLabel="WorldCat">http://www.worldcat.org/oclc/46348511</url>
     </location>
     <location>
	<url displayLabel="HathiTrust">https://hdl.handle.net/2027/hvd.32044012755260</url>
     </location>
     <location>
	<url displayLabel="GoogleBooks">http://books.google.com/books?id=WZofAAAAMAAJ</url>
     </location>
     <relatedItem type="constituent" otherType="expression"
		   xlink:href="urn:cts:latinLit:phi0400.phi001"
		   xlink:role="ctsurn"/>

     <relatedItem type="constituent" otherType="expression"
		   xlink:href="urn:cts:greekLit:tlg1810.tlg002"
		   xlink:role="ctsurn"/>

     <!-- pointers to the other fragments expressed in this book go here  -->
     </mods>

Now, there are currently 86 separate MODS records with Funaioli’s Grammaticae Romanae fragmenta collegit as a host; all of those get collapsed into a single record with 86 constituent elements, pointing to the work authorities.
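
The collapsing step can be sketched in a few lines of Python. This is only an outline under stated assumptions: old_records is a hypothetical stand-in for data extracted from the existing per-expression MODS files, the second host entry is invented to show grouping, and matching hosts by title is a simplification (a real conversion would want a sturdier key, such as an OCLC number).

```python
from collections import defaultdict

# Hypothetical (host title, expression CTS URN) pairs, as they might be
# extracted from the old one-record-per-expression MODS files.
old_records = [
    ("Grammaticae Romanae fragmenta collegit", "urn:cts:latinLit:phi0400.phi001"),
    ("Grammaticae Romanae fragmenta collegit", "urn:cts:greekLit:tlg1810.tlg002"),
    ("Grammaticae Romanae fragmenta collegit", "urn:cts:latinLit:phi0400.phi001"),
    ("Hypothetical second host", "urn:cts:greekLit:fhg0138.fhg001"),
]


def collapse_by_host(records):
    """Group expression URNs under their host, dropping duplicates
    while keeping first-seen order."""
    hosts = defaultdict(list)
    for host_key, urn in records:
        if urn not in hosts[host_key]:
            hosts[host_key].append(urn)
    return dict(hosts)


for host, urns in collapse_by_host(old_records).items():
    print(f"{host}: {len(urns)} constituent(s)")
```

Each key in the result corresponds to one new manifestation-level MODS record, and each URN in its list becomes one constituent element.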

But where do these new MODS records go?

The old workflow entailed creating a new MODS record for every expression, and giving the record a name based on the expression via some sort of algorithm that calculated a suffix based on the language of the expression, the number of previous expressions already represented, etc.

Because the MODS records are no longer about expressions, it no longer makes sense to name (and organize) the MODS files based on expressions. File-naming is always a pain; as an interim solution, I’m simply giving each record a UUID as its record identifier and saving the file under that name, which is a pretty common work-around for the unique-name problem. UUIDs are long and ugly, but they are effectively guaranteed to be unique; at some point Perseus may want to set up something with EZID (https://ezid.cdlib.org) and mint ARKs, or stand up its own NOID minter (the original code is in Perl, but there are also implementations in Python and Ruby, among others).
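
For what it’s worth, the UUID work-around amounts to something like the following Python sketch. The function name save_with_uuid and the use of a temporary directory are assumptions made for the example, not a description of the actual conversion scripts.

```python
import tempfile
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("", MODS_NS)  # serialize MODS as the default namespace


def save_with_uuid(mods_root, directory):
    """Stamp a fresh UUID into recordInfo/recordIdentifier and save the
    record as <uuid>.xml in the given directory. Returns the file path."""
    record_id = str(uuid.uuid4())
    info = mods_root.find(f"{{{MODS_NS}}}recordInfo")
    if info is None:
        info = ET.SubElement(mods_root, f"{{{MODS_NS}}}recordInfo")
    ident = info.find(f"{{{MODS_NS}}}recordIdentifier")
    if ident is None:
        ident = ET.SubElement(info, f"{{{MODS_NS}}}recordIdentifier")
    ident.text = record_id
    path = Path(directory) / f"{record_id}.xml"
    ET.ElementTree(mods_root).write(str(path), encoding="utf-8", xml_declaration=True)
    return path


# Usage: an otherwise empty MODS record, written to a throwaway directory.
record = ET.Element(f"{{{MODS_NS}}}mods")
with tempfile.TemporaryDirectory() as tmp:
    saved = save_with_uuid(record, tmp)
    print(saved.name)
```

Swapping in ARKs later would only mean replacing the uuid.uuid4() call with a request to whatever minter is chosen.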

In short, file-naming and file-management concerns shouldn’t drive the way the metadata is expressed; ways can be found.

What’s Next?

This new scheme will require re-writing the eXist-based Catalog app and the API, but the good news is that these implementations will be much simpler and almost certainly faster than the current code.

And the existing metadata will have to be converted. I’m confident most of the conversion can be automated, but this is real data, so....

I think you’ll find the result will be worth the effort, though. Going forward, record creation and ingestion into the catalog will be vastly simplified: adding a new edition entails nothing more than augmenting a plain MODS record (pulled down from WorldCat, for example) with expression constituents and adding it to the database.
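
To make the “nothing more than augmenting a plain MODS record” claim concrete, here is a hedged Python sketch. PLAIN_MODS is a cut-down stand-in for a record pulled from an external source, and add_constituents is a hypothetical helper; the attributes it writes mirror the constituent elements in the example record above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", MODS_NS)
ET.register_namespace("xlink", XLINK_NS)

# Cut-down stand-in for a plain MODS record fetched from elsewhere.
PLAIN_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Grammaticae Romanae fragmenta collegit</title></titleInfo>
</mods>"""


def add_constituents(mods_root, urns):
    """Append one constituent relatedItem per CTS URN, skipping any URN
    the record already points to."""
    existing = {
        item.get(f"{{{XLINK_NS}}}href")
        for item in mods_root.findall(f"{{{MODS_NS}}}relatedItem")
        if item.get("type") == "constituent"
    }
    for urn in urns:
        if urn in existing:
            continue
        ET.SubElement(
            mods_root,
            f"{{{MODS_NS}}}relatedItem",
            {
                "type": "constituent",
                "otherType": "expression",
                f"{{{XLINK_NS}}}href": urn,
                f"{{{XLINK_NS}}}role": "ctsurn",
            },
        )
        existing.add(urn)
    return mods_root


record = add_constituents(
    ET.fromstring(PLAIN_MODS),
    ["urn:cts:latinLit:phi0400.phi001", "urn:cts:greekLit:tlg1810.tlg002"],
)
print(ET.tostring(record, encoding="unicode"))
```

The augmented record is then ready to be given an identifier and added to the database.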

Looking Forward

MODS, like MADS, has an <extension> element, so there’s no reason why these MODS records couldn’t also be augmented with RDF. And as linked bibliographic data matures, it may eventually be possible to more or less do away with maintaining manifestation metadata at all; one could simply write RDF that expresses the appropriate relationships of our authors and works to these various manifestations. It would be very interesting to know what percentage of the versions/editions now in the Perseus Catalog already have durable URIs (via WorldCat, VIAF, LoC/BIBFRAME, etc.).
