- ETL
This repository contains scripts for accessing, extracting and transforming epigraphic datasets from the Epigraphic Database Clauss-Slaby. We have developed a series of scripts that merge the data together and streamline them for quantitative analysis of epigraphic trends.
- Petra Heřmánková SDAM project, [email protected]
DATASET 2022: Heřmánková, Petra. (2022). EDCS_text_cleaned_2022_09_12 (v2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.7072337
https://zenodo.org/record/7072337
SCRIPTS 2022: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.7072355
The 2022 dataset contains 537,286 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2022/09/12, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 27 attributes with original and streamlined data. Compared to the 2021 dataset, there are 36,750 more inscriptions and 2 fewer attributes containing redundant legacy data, so the entire dataset is approximately the same size (465.5 MB in 2022 compared to 451.5 MB in 2021), but some of the attributes are streamlined: some attribute names have changed for better consistency, e.g. Material > material, Latitude > latitude; some attributes are no longer available due to improvements of the LatEpig tool, e.g. start_yr, notes_dating, inscription_stripped_final; and some new attributes were added due to improvements of the cleaning process, e.g. clean_text_conservative. For a full overview, see the Metadata section.
Metadata
EDCS 2022 dataset metadata with descriptions for all attributes.
DATASET 2021: Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817
https://zenodo.org/record/4888817
SCRIPTS 2021: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6497148
The 2021 dataset contains 500,536 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2021/03/01, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 29 attributes with original and streamlined data. For a full overview, see the Metadata section.
Metadata
EDCS 2021 dataset metadata with descriptions for all attributes.
The EDCS data is published as HTML via the web interface at www.manfredclauss.de. The output of the web interface is accessed by a third-party tool, Lat Epig 2.0, developed at Macquarie University in Sydney, and saved as a series of CSV files, one per province.
The scripts access the main dataset via a web interface, transform the data into one dataframe object and save the outcome to the SDAM project directory on sciencedata.dk and on Zenodo. Since the most important data files are in a public folder, you can use and re-run our analyses even without a sciencedata.dk account and access to our team folder. A separate Python package, sddk, was created specifically for accessing sciencedata.dk from Python (see https://github.com/sdam-au/sddk). If you want to save the dataset in a different location, the scripts can be easily modified. You can access the file without having to log in to sciencedata.dk. Here is the path to the file on sciencedata.dk:
SDAM_root/SDAM_data/EDCS/public/EDCS_text_cleaned[timestamp].json
or https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/EDCS_text_cleaned_[timestamp].json
To access the files created in previous steps of the ETL process, you can use the dataset from the public folder, or you have to rerun all scripts on your own.
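As a minimal sketch of the public access route described above, the following Python snippet builds the public URL for a given timestamp and loads the JSON using only the standard library. The function names are illustrative assumptions and are not part of this repository or of the sddk package; the timestamp you pass must match a file that actually exists in the public folder.

```python
import json
from urllib.request import urlopen

# Public folder URL taken from this README.
PUBLIC_BASE = "https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/"


def public_url(timestamp: str) -> str:
    """Build the public URL of EDCS_text_cleaned_[timestamp].json (hypothetical helper)."""
    return f"{PUBLIC_BASE}EDCS_text_cleaned_{timestamp}.json"


def load_public_dataset(timestamp: str, timeout: float = 60.0):
    """Download and parse the dataset; the file is large, so this may take a while."""
    with urlopen(public_url(timestamp), timeout=timeout) as response:
        return json.load(response)


# Example call (requires network access; the timestamp is an assumption):
# data = load_public_dataset("2022_09_12")
```

Keeping the URL construction separate from the download makes it easy to check the path before fetching several hundred megabytes.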
The final dataset produced by the scripts in this repository is called EDCS_text_cleaned_[timestamp].json and is published on Zenodo in all its versions; for details and links, see the How to cite us section above.
Additionally, the identical dataset can be accessed via sciencedata.dk, either in the SDAM_root/SDAM_data/EDCS/public folder or at https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/.
The data is accessed via a third-party tool, Lat Epig 2.0, and saved as a series of TSV files, one per Roman province, in the folder data. We then use R to read the series of TSVs and combine them into one dataframe, exported as a JSON file. Subsequently, we use a series of R scripts for further cleaning and transforming of the data. The scripts can be found in the folder scripts and are named according to the sequence in which they should be run.
If you are trying to access the ETL scripts created in 2020-2021 that produced version 1.0 of the dataset (Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817, https://zenodo.org/record/4888817), we refer you to releases 1.0 to 1.3 on Zenodo. Because of external dependencies and changes in third-party software and the databases between 2020 and 2022, the ETL scripts have changed since then (release v2.0).
- Clone the https://github.com/mqAncientHistory/Lat-Epig repository to your local computer.
- Change the branch to scrapeprovinces.
- Make sure you have Docker installed; if not, follow the installation instructions for your OS (https://docs.docker.com/engine/install/) and the post-installation steps (https://docs.docker.com/engine/install/linux-postinstall/, Linux).
- Run in the terminal: bash dockerScraperAll.sh
- The scraper will run on its own (for several hours, depending on your internet connection and your computer, usually around 4-5 hours). When it is done, the data will appear in the main folder labelled full_scrape_[today's-date]. All inscriptions are saved as a TSV file and a JSON file, labelled with their metadata containing the date of access, the source, the name of the province and their number.
- Copy the entire folder to the EDCS_ETL repository for further processing (don't forget to rename the folder to YYYY_MM_allProvinces or make the necessary changes in the following scripts).
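The copy-and-rename step at the end of the list can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the scrape date is passed in explicitly (the exact format of the date in the full_scrape_ folder name is not assumed here), and the destination is whichever directory in your EDCS_ETL clone holds the input data.

```python
import shutil
from datetime import date
from pathlib import Path


def target_folder_name(scrape_date: date) -> str:
    """Return the YYYY_MM_allProvinces name expected by the following scripts."""
    return f"{scrape_date:%Y_%m}_allProvinces"


def copy_scrape(scrape_dir: Path, etl_data_dir: Path, scrape_date: date) -> Path:
    """Copy the scraper output folder into the ETL repository under the new name.

    Raises FileExistsError if the destination already exists, so an earlier
    scrape is never silently overwritten.
    """
    destination = etl_data_dir / target_folder_name(scrape_date)
    shutil.copytree(scrape_dir, destination)
    return destination
```

For a scrape finished in September 2022, the destination folder would be named 2022_09_allProvinces, matching the input folder used by the merging script.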
Alternatively, if you are using the old (pre-2022) version of the tool, you would use the script 1_0_LatEpig_2_0_search_by_provinces.bsh to access the data; in the 2022 version this file is deprecated. The bash script programmatically extracted all non-empty inscriptions from individual provinces into separate CSV files, with a run time of ca. 16-20 hrs. The script was meant to be used within a local instance of the Lat Epig 2.0 tool. The CSV files were saved within that repository in the folder output.
Merging TSV files and cleaning attributes
The current script works with TSV files stored in the YYYY_MM_allProvinces folder. If you wish to work with JSON files, amend the script.
| | File | Source commentary |
|---|---|---|
| input | 2022_09_allProvinces in folder data | containing TSVs with inscriptions in individual provinces, accessed via Epigraphy Scraper Jupyter Notebook |
| output | EDCS_merged_cleaned_attrs_[timestamp].json | |
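The merging step itself is implemented in R in this repository. Purely as an illustration of the idea, and not as a port of the actual script, a minimal Python sketch that reads every TSV in a folder into one list of records and writes it out as JSON could look like this. The function names and the source_file column are hypothetical, and no attribute cleaning is performed.

```python
import csv
import json
from pathlib import Path


def merge_tsv_folder(folder: Path) -> list[dict]:
    """Read every *.tsv file in `folder` and return all rows as one list of dicts."""
    rows = []
    for tsv_path in sorted(folder.glob("*.tsv")):
        with tsv_path.open(encoding="utf-8", newline="") as handle:
            # QUOTE_NONE: treat quote characters in inscription text as plain text.
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                row["source_file"] = tsv_path.name  # hypothetical provenance column
                rows.append(row)
    return rows


def save_json(rows: list[dict], output_path: Path) -> None:
    """Write the merged rows as a JSON array, keeping non-ASCII characters readable."""
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(rows, handle, ensure_ascii=False)
```

Recording the source file for each row keeps the province-level provenance visible after the files are combined.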
Cleaning text of an inscription
| | File | Source commentary |
|---|---|---|
| input | EDCS_merged_cleaned_attrs_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes. |
| output | EDCS_text_cleaned_[timestamp].json | |
The following scripts are exploratory only (do not change the dataset, only explore the contents of the dataset)
Exploration of the entire dataset
| | File | Source commentary |
|---|---|---|
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | NA | |
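The exploratory scripts in this repository are written in R. As a rough Python sketch of the same kind of read-only exploration, the helpers below count how many records have a non-empty value for each attribute and tabulate the values of a single attribute. The function names are hypothetical, and the code assumes the JSON has been loaded as a list of dictionaries, one per inscription; that layout is not stated in this README.

```python
from collections import Counter


def attribute_coverage(records: list[dict]) -> dict[str, int]:
    """Count, per attribute, how many records hold a non-empty value."""
    coverage = Counter()
    for record in records:
        for name, value in record.items():
            # None, empty strings and empty lists are treated as missing.
            if value not in (None, "", []):
                coverage[name] += 1
    return dict(coverage)


def value_counts(records: list[dict], attribute: str) -> Counter:
    """Tabulate the values of one attribute, e.g. "material"."""
    return Counter(record.get(attribute) for record in records)
```

Neither helper modifies the records, in keeping with the exploratory-only nature of these scripts.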
Exploration of the text of inscriptions
| | File | Source commentary |
|---|---|---|
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | NA | |
Lemmatization of the text of inscriptions with UDpipe tool. However, upon closer inspection, the results of such lemmatization were unsatisfactory.
| | File | Source commentary |
|---|---|---|
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | EDCS_text_lemmatized_udpipe_[timestamp].json | |
Heřmánková, P., Kaše, V., & Sobotkova, A. (2021). Inscriptions as data: Digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1), 99. https://doi.org/10.1515/jdh-2021-1004
- The article works with version 1 of the dataset, but version 2 follows the same principles. Some attribute names, as well as the contents of the dataset, may differ in version 2, reflecting changes made by the EDCS.