- Vojtěch Kaše, SDAM project
- Petra Hermankova, SDAM project
- Adela Sobotkova, SDAM project
This repository is used to generate two datasets: LIST (Latin Inscriptions in Space and Time, https://zenodo.org/record/7587556#.ZEor6i9BxhF) and LIRE (Latin Inscriptions of the Roman Empire, https://zenodo.org/record/5776109#.ZEosBC9BxpQ), where the latter is a filtered, spatio-temporally more restricted version of the former. Both were created by aggregating the EDH and EDCS epigraphic datasets and enriching them with additional metadata. The repository does not contain the datasets themselves, only the scripts used to generate them (see the scripts subdirectory).
For inscriptions covered by both the EDCS and EDH source datasets, the dataset contains attributes from both of them. For inscriptions available in only one source, it contains attributes only from that source. Some crucial attributes are shared by both sources:

- `clean_text_interpretive_word`: text of the inscription
- `not_before`: start of the dating interval
- `not_after`: end of the dating interval
- `geography`: latitude/longitude defining the geospatial position in the form of a point

For other metadata attributes, the information cannot be easily transferred between the two sources. For instance, EDCS has the attribute `inscr_type`, which should carry approximately the same information as `type_of_inscription_clean` in EDH. However, `inscr_type` uses a different classification system than EDH, relies on Latin labels for inscription types, etc. This project overcomes this issue by developing and applying a machine learning classification model (see `scripts/CLASSIFIER_TRAINING&TESTING.ipynb` and `scripts/CLASSIFIER-APPLICATION.ipynb`). This way the dataset is enriched by two additional attributes: `type_of_inscription_auto` and `type_of_inscription_prob`.
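As an illustration of how these attributes might be used together, here is a minimal sketch (not part of the repository's scripts) that selects inscriptions whose dating interval overlaps a given period and whose automatic type classification reaches a chosen probability. The attribute names are the ones listed above; the toy records, the type labels, the helper names `overlaps` and `select_inscriptions`, and the 0.8 threshold are assumptions made for the example only.

```python
def overlaps(not_before, not_after, start, end):
    """Return True if the interval [not_before, not_after] overlaps [start, end].

    A missing bound (None) is treated as unknown and never matches.
    """
    if not_before is None or not_after is None:
        return False
    return not_before <= end and not_after >= start


def select_inscriptions(records, start, end, insc_type, min_prob=0.8):
    """Keep records dated within the period and classified as insc_type."""
    selected = []
    for rec in records:
        if not overlaps(rec.get("not_before"), rec.get("not_after"), start, end):
            continue
        if rec.get("type_of_inscription_auto") != insc_type:
            continue
        prob = rec.get("type_of_inscription_prob")
        if prob is None or prob < min_prob:
            continue
        selected.append(rec)
    return selected


# Toy records with hypothetical values (not taken from LIST or LIRE)
records = [
    {"id": "a", "not_before": 1, "not_after": 100,
     "type_of_inscription_auto": "epitaph", "type_of_inscription_prob": 0.95},
    {"id": "b", "not_before": 150, "not_after": 250,
     "type_of_inscription_auto": "epitaph", "type_of_inscription_prob": 0.90},
    {"id": "c", "not_before": 50, "not_after": 120,
     "type_of_inscription_auto": "epitaph", "type_of_inscription_prob": 0.40},
    {"id": "d", "not_before": None, "not_after": 80,
     "type_of_inscription_auto": "votive inscription", "type_of_inscription_prob": 0.99},
]

first_century = select_inscriptions(records, start=1, end=100, insc_type="epitaph")
print([rec["id"] for rec in first_century])  # ['a']
```

With the dataset loaded as a GeoDataFrame, the same conditions could presumably be written as a boolean mask over the corresponding columns; plain dictionaries are used here only to keep the sketch free of dependencies.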
For an overview of all metadata, see `LIST_v0.4_metadata.csv`. For an overview of the data, see the Jupyter notebook `5_DATA-OVERVIEW.ipynb` in the scripts subdirectory.
The final datasets are available via Zenodo:
- LIST dataset: https://zenodo.org/record/7870085#.ZEoyjy9BxhE. Using the geopandas library (imported as `gpd`), you can load the data directly into your Python environment with the following command:
  `LIST = gpd.read_parquet("https://zenodo.org/record/7870085/files/LIST_v0-4.parquet?download=1")`
- LIRE dataset: https://zenodo.org/record/7577788#.ZEo3rS9BxhE. Using the geopandas library (imported as `gpd`), you can load the data directly into your Python environment with the following command:
  `LIRE = gpd.read_parquet("https://zenodo.org/record/7577788/files/LIRE_v2-1.parquet?download=1")`
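Both download links follow the same shape: a record number, a file name, and a `?download=1` suffix. The following sketch builds such links from their parts, which may be convenient when switching between dataset versions. The helper name `zenodo_file_url` is hypothetical, and the URL pattern is inferred only from the two links given here, not from Zenodo's documentation, so it may not hold for other records or in the future.

```python
from urllib.parse import quote


def zenodo_file_url(record_id, filename):
    """Build a direct-download URL following the pattern of the two links in this README."""
    # int() guards against accidentally passing a non-numeric record id;
    # quote() percent-encodes characters that are not safe in a URL path
    return f"https://zenodo.org/record/{int(record_id)}/files/{quote(filename)}?download=1"


LIST_URL = zenodo_file_url(7870085, "LIST_v0-4.parquet")
LIRE_URL = zenodo_file_url(7577788, "LIRE_v2-1.parquet")
print(LIST_URL)
print(LIRE_URL)
```

The resulting strings can then be passed to `gpd.read_parquet` in the same way as the literal URLs.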
The EDCS dataset is accessed and transformed by a series of Python and R scripts in the EDCS ETL repository, created by the SDAM project. The latest version of the dataset (as a JSON file) can be accessed via Sciencedata.dk at the following URL: https://sciencedata.dk/shared/1f5f56d09903fe259c0906add8b3a55e.
The EDH dataset is accessed and transformed by a series of Python and R scripts in the EDH ETL repository and in the EDH exploration repository, created by the SDAM project. The latest version of the dataset (as a JSON file) can be accessed via Sciencedata.dk at the following URL: https://sciencedata.dk/shared/b6b6afdb969d378b70929e86e58ad975.
- Python 3
- Jupyter notebooks app/JupyterLab/JupyterHub
- Additional Python 3 libraries listed in `requirements.txt`
After you clone the repository, we recommend creating a virtual environment `li_venv` using the `virtualenv` library and running the notebooks with it as their kernel:
virtualenv li_venv
source li_venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt # install anything in requirements.txt
python -m ipykernel install --user --name=li_venv # add to kernels
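Once the kernel is registered, it can be useful to confirm from inside a notebook that the code really runs in the virtual environment. The snippet below is a small sketch of one common way to check this; the helper name `in_virtual_environment` is hypothetical and is not part of this repository.

```python
import sys


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Inside a virtual environment sys.prefix points to the environment,
    # while sys.base_prefix points to the base installation;
    # older virtualenv releases set sys.real_prefix instead
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return sys.prefix != base


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

If the notebook uses the `li_venv` kernel, the interpreter path should presumably point into the `li_venv` directory.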