CERN bibliography data (XML)

vogelsgesang edited this page Apr 24, 2014 · 5 revisions

We decided to integrate CERN's bibliography data, which is available at http://library.web.cern.ch/library_projects/bookdata under a CC0 license. The data is encoded as XML/Marc21, i.e. the older MARC 21 format transcoded to XML. The original MARC 21 field names are preserved in the transcoded XML. Their meaning is documented at http://www.loc.gov/marc/bibliographic/ecbdhome.html.

The data sources are hosted on the web in XML format. To download them via HTTP, we will add an intermediate layer to the platform that supports asynchronous interaction with NoSQL data stores. This layer serves as the communication channel between sites and the APIs hosted on the server: it lets us manipulate the documents and subsequently store them in our database.

As the data repository, we have chosen eXist-db, a native XML database. It provides robust and efficient indexing for managing large amounts of unstructured data, documents, and collections. This lets us take advantage of its support for XML queries and reduce the number of data transformations.

For querying, we will use XQuery, which offers powerful features for manipulating XML databases, in combination with XPath expressions, which make it much easier to navigate the document structure. To manage our data collections as well as possible, we propose to represent the bibliography data in a well-known schema, which should also speed up the evaluation of path expressions.

Technology

We use eXist-db to access the XML sources. There are two ways to integrate eXist-db into our Node-based application. First, we can write XQuery documents and store them in the database. The result of evaluating such an XQuery document on the database is returned in response to a GET request to

<existDb-server>/exist/rest/<path of the document>?parameters

The XQuery document can contain placeholders whose values are specified as GET parameters. Hence, we can retrieve the data by querying the appropriate document with the corresponding parameters. Alternatively, we can send an XQuery document to the relevant XML document using a POST request. eXist-db then returns the result of evaluating this XQuery against that XML document.
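The first option (a stored XQuery document called via GET) can be sketched as a small URL builder. This is a minimal sketch, not the project's actual code: the server address, the document path `/db/cern/books-by-language.xql`, and the parameter names `lang` and `limit` are hypothetical, and the stored XQuery is assumed to read them with something like `request:get-parameter`.

```javascript
// Builds the REST URL for a stored XQuery document in eXist-db.
// serverUrl, documentPath and the parameter names are illustrative assumptions.
function buildStoredQueryUrl(serverUrl, documentPath, params = {}) {
  // Strip leading slashes so the path is appended to /exist/rest/ exactly once.
  const path = documentPath.replace(/^\/+/, "");
  const url = new URL(`/exist/rest/${path}`, serverUrl);
  // Each entry becomes a GET parameter that the stored XQuery can read.
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

// Hypothetical usage: the resulting URL would then be requested with HTTP GET.
const exampleUrl = buildStoredQueryUrl(
  "http://localhost:8080",
  "/db/cern/books-by-language.xql",
  { lang: "eng", limit: 10 }
);
console.log(exampleUrl);
```

Using `URLSearchParams` (via `url.searchParams`) takes care of percent-encoding parameter values, which matters once values contain spaces or non-ASCII characters such as author names.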

In both cases, the database is accessed by sending HTTP requests to the eXist-db server. We use the second option, since constructing the correct XPath in JavaScript is more straightforward than XQuery for this job. The results are sent back by the eXist-db server as XML.
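The POST-based option might look roughly like the following. This is a sketch under several assumptions: it relies on the global `fetch` of Node 18 or later, the helper names `buildQueryEnvelope` and `runQuery` are made up for illustration, and the query is wrapped in the `<query>` envelope that the eXist-db REST interface accepts for POST requests. The example XPath uses the `*:` namespace wildcard because it is not stated here whether the CERN data declares the MARCXML namespace.

```javascript
// Namespace of the eXist-db REST query envelope.
const EXIST_NS = "http://exist.sourceforge.net/NS/exist";

// Wraps an XPath/XQuery string in the XML envelope sent in the POST body.
// start/max control which slice of the result sequence is returned.
function buildQueryEnvelope(queryText, { start = 1, max = 20 } = {}) {
  // A literal "]]>" would end the CDATA section early, so split it up.
  const safeText = queryText.split("]]>").join("]]]]><![CDATA[>");
  return (
    `<query xmlns="${EXIST_NS}" start="${Number(start)}" max="${Number(max)}">` +
    `<text><![CDATA[${safeText}]]></text>` +
    `</query>`
  );
}

// Sends the query to the given document or collection and resolves with the
// raw XML response text. serverUrl and documentPath are caller-supplied.
async function runQuery(serverUrl, documentPath, queryText, options) {
  const path = documentPath.replace(/^\/+/, "");
  const url = new URL(`/exist/rest/${path}`, serverUrl);
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/xml" },
    body: buildQueryEnvelope(queryText, options),
  });
  if (!response.ok) {
    throw new Error(`eXist-db returned HTTP ${response.status}`);
  }
  return response.text();
}

// Example envelope for "all title statements" (MARC field 245, subfield $a).
const titleQuery = '//*:datafield[@tag="245"]/*:subfield[@code="a"]';
console.log(buildQueryEnvelope(titleQuery, { max: 5 }));
```

Keeping the envelope construction separate from the network call makes the XPath-building part easy to test without a running database.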

MARC 21 format

The MARC 21 communication formats are standards for the representation and exchange of bibliographic, authority, holdings, and classification data in machine-readable form.

Bibliographic Data format.

MARC 21 for Bibliographic Data is designed to be a carrier for bibliographic information about materials, which commonly includes titles, names, subjects, notes, and the description of an item.

The following list shows the descriptors that we will use in our data schema:

Leader 07 - Bibliographic level

  • a - Monographic component part
  • c - Collection
  • d - Subunit
  • i - Integrating resource

041 - Language Code

Indicator

  • # - No information provided
  • 0 - Item not a translation/does not include a translation
  • 1 - Item is or includes a translation

Subfield Codes

  • $a - Language code of text/sound track or separate title
  • $h - Language code of original

044 - Country of Publishing/Producing Entity Code

Indicator

  • # - Undefined

Subfield Codes

  • $a - MARC country code

100 - Main Entry - Personal Name

Indicator

  • 0 - Forename
  • 1 - Surname

Subfield Codes

  • $a - Personal name
  • $c - Titles and words associated with a name
  • $d - Dates associated with a name

245 - Title Statement

Indicator

  • 0 - No added entry
  • 1 - Added entry

Subfield Codes

  • $a - Title
  • $b - Remainder of title
  • $g - Bulk dates
  • $s - Version

260 - Publication, Distribution

Indicator

  • # - Not applicable/No information provided/Earliest available publisher
  • 3 - Current/latest publisher

Subfield Codes

  • $a - Place of publication, distribution, etc.
  • $b - Name of publisher, distributor, etc.
  • $c - Date of publication, distribution, etc.
  • $e - Place of manufacture
  • $f - Manufacturer

336 - Content Type

Indicator

  • # - Undefined

Subfield Codes

  • $a - Content type term
  • $b - Content type code
  • $2 - Source

520 - Summary

Indicator

  • # - Summary
  • 0 - Subject
  • 1 - Review
  • 3 - Abstract

Subfield Codes

  • $a - Summary, etc.
  • $u - Uniform Resource Identifier
  • $2 - Source

773 - Host Item Entry

Indicator

  • 0 - Display note
  • 1 - Do not display note

Subfield Codes

  • $a - Main entry heading
  • $b - Edition
  • $d - Place, publisher, and date of publication
  • $g - Related parts
  • $t - Title
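To illustrate how the descriptors listed above map onto an XML/Marc21 record, here is a dependency-free sketch that pulls a few of them out of a record string. The sample record is invented for illustration and is not taken from the CERN data. The helper names (`getSubfields`, `summarize`) are hypothetical, and the regular-expression matching is a simplification that assumes double-quoted attributes and no nested markup inside subfields. A real XML parser would be the more robust choice.

```javascript
// Invented sample record, for illustration only (not from the CERN dump).
const sampleRecord = `
<record>
  <leader>00000naa a2200000 a 4500</leader>
  <datafield tag="041" ind1="0" ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
    <subfield code="b">with a subtitle &amp; more</subfield>
  </datafield>
</record>`;

// Decodes the five predefined XML entities; &amp; goes last to avoid double decoding.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, "&");
}

// Returns one { code: value } object per datafield with the given tag.
// An optional namespace prefix (e.g. marc:) on element names is tolerated.
function getSubfields(recordXml, tag) {
  const fieldPattern = new RegExp(
    `<(?:\\w+:)?datafield\\b[^>]*\\btag="${tag}"[^>]*>([\\s\\S]*?)</(?:\\w+:)?datafield>`,
    "g"
  );
  const subfieldPattern =
    /<(?:\w+:)?subfield\b[^>]*\bcode="([^"]+)"[^>]*>([\s\S]*?)<\/(?:\w+:)?subfield>/g;
  const fields = [];
  for (const field of recordXml.matchAll(fieldPattern)) {
    const subfields = {};
    for (const [, code, text] of field[1].matchAll(subfieldPattern)) {
      subfields[code] = decodeEntities(text.trim());
    }
    fields.push(subfields);
  }
  return fields;
}

// Collects a few of the descriptors from the list above into a plain object.
function summarize(recordXml) {
  const leader = (recordXml.match(/<(?:\w+:)?leader>([^<]*)</) || [])[1] || "";
  const first = (tag, code) => (getSubfields(recordXml, tag)[0] || {})[code];
  return {
    bibliographicLevel: leader.charAt(7), // Leader position 07
    language: first("041", "a"),
    author: first("100", "a"),
    title: first("245", "a"),
    subtitle: first("245", "b"),
  };
}

console.log(summarize(sampleRecord));
```

Since the results returned by eXist-db are XML as well, the same kind of extraction could be applied to query responses before the values are passed on to the rest of the application.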

References