This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District between the early seventeenth century and the early twentieth century.
The full set of XML tags and symbols used, together with the transcription guidelines, is available in the transcription_guidelines folder. The set of 80 transcribed files is available in LD80_transcribed.
The full set of automatically geoparsed files is in LD80_geoparsed.
The gold standard subset includes a representative sample of 28 texts, which were selected from the different genres and historical periods included in the corpus. The subset comprises approximately 242,000 word tokens: about one-sixth of the entire corpus. It was hand checked and coded using XML tags in order to mark every place-name entity it contained. These place-name entities included the names of a variety of different regional, national and international locations, landmarks and geographical formations. All of the identified place-name entities were marked up with a customised tag (<cdplace>). The gold standard subcorpus is available in gold_standard.
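As an illustration of how the <cdplace> markup might be used, the following minimal sketch counts the place-names tagged in a piece of text. It assumes that each file is well-formed XML and that <cdplace> is an element whose text content is the place-name; neither assumption is confirmed here, so check the transcription guidelines first. The sample text and the function name count_place_names are invented for this example and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical snippet written for this example; NOT an extract from the corpus.
SAMPLE = """<text>
<p>We left <cdplace>Keswick</cdplace> early and walked by
<cdplace>Derwent Water</cdplace> before returning to
<cdplace>Keswick</cdplace>.</p>
</text>"""


def count_place_names(xml_string):
    """Count the text content of every <cdplace> element (assumed structure)."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for element in root.iter("cdplace"):
        # Join any nested text and collapse runs of whitespace or line breaks.
        name = " ".join("".join(element.itertext()).split())
        if name:
            counts[name] += 1
    return counts


counts = count_place_names(SAMPLE)
for name, total in counts.most_common():
    print(name, total)
```

To apply the same function to a file in gold_standard, read the file's contents into a string and pass it in. If the files turn out not to be well-formed XML, a more tolerant parser would be needed.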
The full corpus metadata is available in spreadsheet and CSV formats in LD80_metadata.
We would like to acknowledge the support of three projects, together with their funders and members, that have contributed to this research. Original explorations into mapping two textual accounts of journeys through the landscape of the Lake District began in the Mapping the Lakes: Towards a Literary GIS project, funded by the British Academy and involving Dr David Cooper, now at Manchester Metropolitan University. The CLDW was created in the Spatial Humanities: Texts, GIS, Places research project, which was funded by the European Research Council (ERC) under the European Union’s Seventh Framework Programme (FP7/2007-2013) (agreement number 283850) from 2012-16. Finally, we extended the corpus research in the Geospatial Innovation in the Digital Humanities: A Deep Map of the Lake District project, which was funded by a Leverhulme Trust Research Project Grant (RPG-2015-230) from 2015-18. We also acknowledge the research assistants who contributed to the preparation of the Corpus of Lake District Writing: Karen Donnelly, Joel Evans, Sayeed Ferdous, Chris Fletcher, Rachael Holland, Ann-Kathrin Marchlewski, Annegret Nissen, Amanda Pullan, Eliza Skakel, Enrico Torre, Tereza Valny, Alex Wilkinson, and Lynsey Wood.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
For further details of the corpus and to reference this dataset, please refer to the paper:
Rayson, P., Reinhold, A., Butler, J., Donaldson, C. E., Gregory, I. N., & Taylor, J. E. (2017). A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing. In GeoHumanities'17: 1st ACM SIGSPATIAL Workshop on Geospatial Humanities: 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. Association for Computing Machinery (ACM). DOI: 10.1145/3149858.3149865