CERN bibliography data (XML)
We decided to integrate the CERN bibliography data, which is available at http://library.web.cern.ch/library_projects/bookdata under a CC0 license. The data is encoded as XML/MARC 21, i.e. a transcoding of the older MARC 21 format into XML. The original MARC 21 field names are preserved in the transcoded XML. Documentation of these field names can be found at http://www.loc.gov/marc/bibliographic/ecbdhome.html.
The data sources will be hosted on the web in XML format. To download them via HTTP, we will implement an intermediate layer in the platform that supports asynchronous interaction with NoSQL data stores. This layer serves as a communication channel between external sites and the APIs hosted on our server: it retrieves the documents, manipulates them, and then stores them in our database.
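A minimal sketch of the asynchronous download step, under stated assumptions: Node 18 or later (where `fetch` is a global), and the helper names `downloadXml` and `downloadAll`, which are hypothetical and not part of the project. The `fetchImpl` parameter exists only so the function can be exercised without network access.

```javascript
// Sketch only: downloadXml and downloadAll are hypothetical helper names.
// Assumes Node 18+, where fetch is available as a global.

// Download one XML source and return its body as text.
async function downloadXml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url, {
    headers: { Accept: "application/xml" },
  });
  if (!response.ok) {
    throw new Error(`Download of ${url} failed with HTTP ${response.status}`);
  }
  return response.text();
}

// Download several sources concurrently; one failure does not abort the rest.
async function downloadAll(urls, fetchImpl = globalThis.fetch) {
  const settled = await Promise.allSettled(
    urls.map((url) => downloadXml(url, fetchImpl))
  );
  return settled.map((result, index) => ({
    url: urls[index],
    ok: result.status === "fulfilled",
    xml: result.status === "fulfilled" ? result.value : null,
    error: result.status === "rejected" ? result.reason.message : null,
  }));
}
```

Using `Promise.allSettled` rather than `Promise.all` is a design choice for this sketch: a single unreachable source is reported in the result list instead of rejecting the whole batch.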
As our data repository, we have chosen eXist-db, a native XML database. It provides robust and efficient indexing for managing large amounts of unstructured data, whether individual documents or collections. This lets us take advantage of its support for XML queries and reduce the number of data transformations.
For querying, we will use XQuery, whose features are well suited to manipulating XML databases, in combination with XPath expressions, which make it much easier to navigate the document structure. To manage our data collections as well as possible, we propose to use a well-known schema to represent the bibliography data, which should also speed up the evaluation of path expressions.
We use eXist-db to access the XML sources. There are two ways to integrate eXist-db into our Node-based application. First, we can write XQuery documents and store them in the database. The results of evaluating such an XQuery document against the database are returned in response to a GET request to
<existDb-server>/exist/rest/<path of the document>?parameters
We can use placeholders in the XQuery document whose values are supplied as GET parameters. Hence, we can retrieve the data by requesting the appropriate XQuery document with the corresponding parameters. Alternatively, we can send an XQuery document to the relevant XML document using a POST request. eXist-db then returns the results of evaluating this XQuery against that XML document.
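The GET variant can be sketched as a small URL builder. The function name `buildStoredQueryUrl`, the server address, the stored query path `/db/cern/by-language.xql` and the parameter names are all hypothetical; only the `/exist/rest/` prefix comes from the URL pattern above. On the database side, the stored query would presumably read the values with eXist-db's `request:get-parameter` function.

```javascript
// Sketch only: function name, server address, query path and parameter
// names are illustrative assumptions, not part of the project.

// Build the URL of a stored XQuery document, passing placeholder values
// as GET parameters.
function buildStoredQueryUrl(server, documentPath, params = {}) {
  const base = server.replace(/\/+$/, ""); // drop trailing slashes
  const path = documentPath
    .split("/")
    .filter(Boolean) // ignore empty segments from leading or double slashes
    .map(encodeURIComponent)
    .join("/");
  const url = new URL(`${base}/exist/rest/${path}`);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return url.toString();
}

const exampleUrl = buildStoredQueryUrl(
  "http://localhost:8080/",
  "/db/cern/by-language.xql",
  { lang: "eng", year: 2010 }
);
console.log(exampleUrl);
```

Going through `URL` and `searchParams` means parameter values are percent-encoded for us, which matters as soon as a title or name containing spaces or ampersands is passed.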
In both cases, the database is accessed by sending HTTP requests to the eXist-db server. We use the second option, since constructing the correct XPath in JavaScript is more straightforward for this job than writing XQuery documents. The eXist-db server sends the results back as XML.
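A sketch of the POST variant follows. eXist-db's REST interface accepts a POST body consisting of a `query` element in the `http://exist.sourceforge.net/NS/exist` namespace, with the query in a `text` child and optional `start` and `max` attributes. The helper names `escapeXml`, `buildExistQuery` and `postQuery`, the default paging values, and the use of Node 18's global `fetch` are assumptions made for illustration.

```javascript
// Sketch only: helper names and default paging values are assumptions.
const EXIST_NS = "http://exist.sourceforge.net/NS/exist";

// Escape characters that are not allowed verbatim in XML text content.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Wrap an XPath/XQuery expression in the request document for a POST.
function buildExistQuery(xquery, { start = 1, max = 20 } = {}) {
  return (
    `<query xmlns="${EXIST_NS}" start="${start}" max="${max}">` +
    `<text>${escapeXml(xquery)}</text>` +
    `</query>`
  );
}

// POST the query to an XML document or collection and return the XML result.
async function postQuery(server, resourcePath, xquery, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(`${server}/exist/rest${resourcePath}`, {
    method: "POST",
    headers: { "Content-Type": "application/xml" },
    body: buildExistQuery(xquery),
  });
  if (!response.ok) {
    throw new Error(`Query failed with HTTP ${response.status}`);
  }
  return response.text();
}
```

A call such as `postQuery("http://localhost:8080", "/db/cern/books.xml", "//record")` would then resolve to the XML result; the server address and document path in this call are again hypothetical.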
The MARC 21 communication formats are standards for the representation and exchange of bibliographic, authority, holdings, and classification data in machine-readable form.
MARC 21 for Bibliographic Data is designed to be a carrier for bibliographic information about materials, which commonly includes titles, names, subjects, notes, and a description of the item.
The following list shows the descriptors that we will use in our data schema:
Leader 07 - Bibliographic level
- a - Monographic component part
- c - Collection
- d - Subunit
- i - Integrating resource
041 - Language Code
Indicator
- 0 - Item not a translation/does not include a translation
- 1 - Item is or includes a translation
Subfield Codes
- $a - Language code of text/sound track or separate title
- $h - Language code of original
044 - Country of Publishing/Producing Entity Code
Subfield Codes
- $a - MARC country code
100 - Main Entry - Personal Name
Indicator
- 0 - Forename
- 1 - Surname
Subfield Codes
- $a - Personal name
- $c - Titles and words associated with a name
- $d - Dates associated with a name
245 - Title Statement
Indicator
- 0 - No added entry
- 1 - Added entry
Subfield Codes
- $a - Title
- $b - Remainder of title
- $g - Bulk dates
- $s - Version
260 - Publication, Distribution
Indicator
- 3 - Current/latest publisher
Subfield Codes
- $a - Place of publication, distribution, etc.
- $b - Name of publisher, distributor, etc.
- $c - Date of publication, distribution, etc.
- $e - Place of manufacture
- $f - Manufacturer
336 - Content Type
Subfield Codes
- $a - Content type term
- $b - Content type code
- $2 - Source
520 - Summary
Indicator
- 0 - Subject
- 1 - Review
- 3 - Abstract
Subfield Codes
- $a - Summary, etc.
- $u - Uniform Resource Identifier
- $2 - Source
773 - Host Item Entry
Indicator
- 0 - Display note
- 1 - Do not display note
Subfield Codes
- $a - Main entry heading
- $b - Edition
- $d - Place, publisher, and date of publication
- $g - Related parts
- $t - Title
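The descriptors above map naturally onto XPath expressions. The sketch below assumes the CERN data follows the MARCXML layout, in which `datafield` elements carry `tag` and `ind1` attributes and contain `subfield` elements with a `code` attribute, all in the `http://www.loc.gov/MARC21/slim` namespace. Whether the CERN export actually declares that namespace is an assumption to verify against the files, and `marcFieldXPath` and `marcQuery` are hypothetical helper names.

```javascript
// Sketch only: assumes MARCXML element names and namespace; the helper
// names are illustrative and not part of the project.
const MARC_NS = "http://www.loc.gov/MARC21/slim";

// Build an XPath selecting a data field, optionally restricted by its
// first indicator and narrowed to one subfield.
function marcFieldXPath(tag, { ind1, subfield } = {}) {
  if (!/^\d{3}$/.test(tag)) {
    throw new RangeError(`MARC tag must be three digits, got "${tag}"`);
  }
  if (ind1 !== undefined && !/^[0-9 ]$/.test(ind1)) {
    throw new RangeError(`Indicator must be a digit or blank, got "${ind1}"`);
  }
  if (subfield !== undefined && !/^[a-z0-9]$/.test(subfield)) {
    throw new RangeError(`Subfield code must be one character, got "${subfield}"`);
  }
  let path = `//marc:datafield[@tag = "${tag}"]`;
  if (ind1 !== undefined) path += `[@ind1 = "${ind1}"]`;
  if (subfield !== undefined) path += `/marc:subfield[@code = "${subfield}"]`;
  return path;
}

// Prefix the path with the namespace declaration so it forms a complete XQuery.
function marcQuery(path) {
  return `declare namespace marc = "${MARC_NS}";\n${path}`;
}

// 245 $a: title; 520 with first indicator 3, $a: abstract text.
const titlePath = marcFieldXPath("245", { subfield: "a" });
const abstractPath = marcFieldXPath("520", { ind1: "3", subfield: "a" });
console.log(marcQuery(titlePath));
```

Validating the tag, indicator and subfield code before interpolating them keeps malformed input out of the generated query string.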
- http://exist-db.org/exist/apps/doc/devguide_rest.xml (description of the REST API of eXist-db)
- http://exist-db.org/exist/apps/doc/xquery.xml (contains information about JSON serialization)
- http://www.loc.gov/marc/bibliographic/
- http://www.loc.gov/standards/marcxml/