Omniscribe training data
Project data for Omniscribe: https://github.com/collectionslab/Omniscribe
Omniscribe was developed to detect annotations (marginalia, interlinear markings, provenance marks, etc.) in digitized printed books hosted via the International Image Interoperability Framework (IIIF).
Files and Directories
- rawData.csv: This CSV file stores all the labeled data created by Zooniverse users. The data includes the regions of interest (ROIs) that were labeled, along with some information about the users who marked them. This data needs further processing before it can be used for training.
- extractROIs.py: This script takes the rawData.csv file (hard-coded) and generates data.json, a JSON file that contains all the images listed on Zooniverse along with any regions labeled on them. The JSON is a relatively complex object that stores many images, each of which may have a list of ROIs. Put simply, every image has a list of ROIs, and every ROI is made up of an all_points_x array and an all_points_y array such that all_points_x[i] and all_points_y[i] form a coordinate point. Every region has four of these coordinate points, which make a rectangle that captures the ROI. The ROIs are constructed this way to fit Mask R-CNN structure requirements.
- data.json: The file generated by extractROIs.py. It contains all the images with their labeled annotations from rawData.csv. It is to be used with datasetGenerator.py to generate datasets that are ready for training.
- datasetGenerator.py: This script reads data.json and generates three JSON files for training, validation, and testing. Each of these files has to be renamed to via_region_data.json and placed in the same directory as the images it represents. Note that changing the SEED value will create different datasets.
- annotation-datasets/: Contains a training set and a validation set of images that contain handwriting. Note that training the model assumes subdirectories "train" and "val" where those directories contain only images.
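
To make the ROI layout described for extractROIs.py more concrete, here is a minimal sketch of how a rectangle can be stored as parallel all_points_x and all_points_y arrays. Only those two array names come from this README. The helper names, the "filename" and "regions" keys, and the sample values are assumptions for illustration and may not match the actual data.json schema.

```python
def rect_to_roi(x, y, width, height):
    """Store a rectangle as two parallel arrays of corner coordinates.

    Corners are listed starting at the top-left and going clockwise
    (in image coordinates, where y grows downward).
    """
    return {
        "all_points_x": [x, x + width, x + width, x],
        "all_points_y": [y, y, y + height, y + height],
    }


def roi_points(roi):
    """Pair all_points_x[i] with all_points_y[i] to recover (x, y) points."""
    return list(zip(roi["all_points_x"], roi["all_points_y"]))


roi = rect_to_roi(10, 20, 100, 50)

# Hypothetical image entry: the key names here are assumptions,
# not the confirmed structure of data.json.
image_entry = {"filename": "page_001.jpg", "regions": [roi]}

print(roi_points(roi))  # [(10, 20), (110, 20), (110, 70), (10, 70)]
```

Keeping x and y in separate arrays means the i-th entries of both arrays always have to be read together, which is what roi_points does.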
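
The note about SEED in datasetGenerator.py can be illustrated with a seeded shuffle followed by a three-way split. This is a sketch under stated assumptions, not the script's actual logic: the 70/15/15 proportions, the split_images helper, the SEED value, and the shape of the sample data are all hypothetical.

```python
import random

SEED = 42  # hypothetical value; the README only says that changing SEED changes the datasets


def split_images(images, seed=SEED, train_pct=70, val_pct=15):
    """Split a dict of images into train/val/test dicts, reproducibly.

    The percentages are assumptions; whatever remains after the train
    and val portions goes to the test set.
    """
    keys = sorted(images)               # fixed starting order, independent of dict order
    random.Random(seed).shuffle(keys)   # same seed -> same shuffle -> same split
    n_train = len(keys) * train_pct // 100
    n_val = len(keys) * val_pct // 100
    parts = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {name: {k: images[k] for k in ks} for name, ks in parts.items()}


# Hypothetical stand-in for the contents of data.json.
images = {f"page_{i:03d}.jpg": {"regions": []} for i in range(20)}

splits = split_images(images)
print({name: len(part) for name, part in splits.items()})
# {'train': 14, 'val': 3, 'test': 3}
```

Using a dedicated random.Random(seed) instance keeps the split reproducible without touching the global random state, so rerunning with the same seed yields the same three sets.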