Home
This project is to create a data set consisting of a collection of valid EAD 2002 DTD and EAD 2002 Schema XML specimens that are made available under a CC0 1.0 Public Domain Dedication.
The collection seeks to capture specimens that follow a wide diversity of valid encoding practice. The primary use case of the collection is the design and testing of systems that need to handle a variety of EAD markup features.
As in typography, greeking involves inserting nonsense text or, commonly, Greek or Latin text in prototypes of visual media projects (such as in graphic and web design) to check the layout of the final version before the actual text is available, or to enhance layout assessment by eliminating the distraction of readable text. -- http://en.wikipedia.org/wiki/Greeking
Typical EAD files contain many names and paragraphs of text. EAD systems often convert EAD to HTML and make it available to search engines such as Google. Specimens are systematically obscured by scrambling nouns in XML text nodes, so that the donated specimens don't show up in common Google searches or get confused with the original collection description.
The greeking process is not cryptographically secure. A dedicated person probably could recover the original EAD file from the greeked version (but why would they?).
So that the end product maintains some readability (it reads sort of like a mad lib), only nouns are obscured. Noun inflection and capitalization are preserved. The greeking algorithm will always return the same result for a given input, so a name or term that appears in several places is obscured the same way each time. It should therefore be possible to build test interfaces with this collection that can browse or facet controlled access terms.
The Python Natural Language Toolkit (NLTK) is used to identify nouns.
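As a rough illustration of the noun-identification step, the sketch below filters part-of-speech tagged tokens down to nouns. It is not the project's actual code: the function names `nltk_tagger` and `find_nouns` are hypothetical, and it assumes the tagger emits Penn Treebank tags, where noun tags start with `NN`. Running the default tagger also assumes NLTK is installed and its tagger model data has been downloaded.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def nltk_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag tokens with NLTK's default part-of-speech tagger."""
    import nltk  # imported lazily so the rest of the sketch runs without NLTK
    return nltk.pos_tag(tokens)


def find_nouns(tokens: List[str], tagger: Tagger = nltk_tagger) -> List[str]:
    """Return the tokens whose tag marks them as nouns (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagger(tokens) if tag.startswith("NN")]
```

Passing the tagger in as a parameter is only a convenience for this sketch; it lets the noun filter be exercised with hand-tagged tokens when NLTK's model data is not available.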
(In a later phase, a "stop word" list of common archival terms exempted from noun obscuring will be evaluated to see if this makes the end product more useful for testing EAD systems and interfaces.)
Upper and lower case English vowels and consonants from the basic ASCII character set are scrambled in a way that maintains letter case and vowel/consonant positions in words. This technique will not work with non-Latin scripts, but it seems to work with other languages written predominantly in basic ASCII letters, such as Dutch.
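One way to get a scramble with these properties is sketched below. This is an assumption-laden illustration, not the project's algorithm: the name `greek_word` is hypothetical, and seeding the substitution from a SHA-256 hash of the lowercased word is simply one convenient way to make the output repeatable. Vowels are swapped for vowels and consonants for consonants, case is kept, and anything outside the ASCII letters (digits, punctuation, accented characters) passes through untouched. It does not attempt to preserve noun inflection.

```python
import hashlib
import string

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def greek_word(word: str) -> str:
    """Deterministically scramble ASCII letters, keeping case and vowel/consonant slots."""
    # Hash the lowercased word so "Smith" and "SMITH" scramble to the same letters.
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    out = []
    for i, ch in enumerate(word):
        if ch not in string.ascii_letters:
            out.append(ch)  # digits, punctuation, non-ASCII: left unaltered
            continue
        pool = VOWELS if ch.lower() in VOWELS else CONSONANTS
        # Pick a replacement from the same pool using one byte of the digest.
        new = pool[digest[i % len(digest)] % len(pool)]
        out.append(new.upper() if ch.isupper() else new)
    return "".join(out)
```

Because the replacement depends only on the word itself, calling `greek_word("Berkeley")` twice gives the same string, and a numeric date such as `1923-05` comes back unchanged.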
Digits are ignored by the systematic greeking and are left unaltered. Phone numbers, ZIP codes, and other identifiable numbers could also be greeked (all phone numbers to the 555 exchange?) but in the current data set these are left unaltered. Dates expressed in numbers are not changed, but spelled out month names are obscured if identified as a noun.
In the current data set, data in XML attributes and XML comments are not obscured.
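The restriction to text nodes can be sketched with the standard library's `xml.etree.ElementTree`. The helper name `transform_text_nodes` and the `transform` callback are hypothetical, and this is not how the project itself walks the XML. The sketch assumes Python 3.8 or later (for `insert_comments`), applies the callback to element text and tail text only, and skips comment bodies and attribute values.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def transform_text_nodes(xml_source: str, transform: Callable[[str], str]) -> str:
    """Apply `transform` to text nodes only; attributes and comments are kept as-is."""
    # Keep comments in the tree so they survive serialization unchanged.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_source, parser=parser)
    for elem in root.iter():
        is_comment = elem.tag is ET.Comment
        if elem.text and not is_comment:
            elem.text = transform(elem.text)  # text inside an element
        if elem.tail:
            elem.tail = transform(elem.tail)  # text following an element or comment
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree drops the DOCTYPE declaration and may rewrite namespace prefixes on output, so a real pipeline that must emit valid EAD 2002 DTD or Schema files would likely need a different parser or serializer.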
Why use the Creative Commons Public Domain Dedication, rather than retaining copyright but allowing anyone to use the collection? Retaining copyright is common in even some of the most permissive open source software licenses. For some reason (that I can't remember/don't know) software licenses are not appropriate for content, and vice versa. Creative Commons Zero is the least restrictive content license: it imposes no restrictions on the use of the systematically obscured content in the collection, and at the same time it makes no warranties about the collection of specimens and disclaims liability for all uses of the data set. Contributors' copyright in the original files is fully retained.
The collection of noun-obscured EAD specimen files is maintained in a revision control system repository (specifically, a git repository on GitHub: https://github.com/tingletech/ead-test-col ).
An EAD file "in the wild" is submitted to a specimen processor. The submitter asserts they have the right to submit the original specimen for the purpose of it being processed and included in the collection. The specimen processor then conducts the noun-obscuring "greeking" transformation procedure on the file and commits the transformed file to the GitHub repository or a fork. In the commit message, the specimen processor references the source of the original file.
original post to EAD listserv about project http://bit.ly/rPV1hJ → http://listserv.loc.gov/cgi-bin/wa?A2=ind1112&L=ead&T=0&P=1437
code4lib discussion of greeking technique http://bit.ly/tshBF9 → http://www.mail-archive.com/[email protected]/msg12410.html
Here is a paper about data papers: http://www.escholarship.org/uc/item/9jw4964t
created with https://github.com/tingletech/ead_basic_xslt/