This project is designed to research possibilities for automated or semi-automated transcription of herbarium sheet text, both handwritten and typed.
- Create a Python 3.8 virtual environment. For example, in Anaconda Terminal:
conda create -n <envname> python=3.8 pip
conda activate <envname>
pip install -r requirements.txt
- Install the needed nltk corpora by running the requirements-nltk.py script. For example, in the same Anaconda window, execute: python requirements-nltk.py
- Set up Google Cloud Vision credentials. (Optional, but required to generate new GCV analyses.)
  - Download your Google Cloud Service Account Key. Save the file in this main directory, e.g. service_account_token.json. (See the Google Cloud help center for more guidance with creating a key.)
  - Copy the Configuration-plain.cfg file as Configuration.cfg.
    - Edit the file to add the name of your service account token (in the GOOGLE_CLOUD_VISION_API section, under serviceAccountTokenPath={name-of-your-service-account-token}).
    - Update any other settings if desired.
- Set up Amazon Web Services credentials. (Optional, but required to generate new AWS analyses.)
  - Set up your Amazon Web Services account and store your credentials in the proper location. (This location is OS-dependent; please consult the AWS documentation for full instructions.)
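As a rough illustration of how the Google Cloud Vision setting above could be consumed, the following minimal sketch reads Configuration.cfg with Python's standard configparser module and exports the key path through the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the Google Cloud client libraries check for credentials. The section and option names come from the steps above; the helper names load_gcv_token_path and export_gcv_credentials are hypothetical, and the project's own loading code may work differently.

```python
import configparser
import os


def load_gcv_token_path(config_file="Configuration.cfg"):
    """Return the service account token path stored in the config file.

    Hypothetical helper: the section and option names follow the setup
    steps, but the project's actual loading code may differ.
    """
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse.
    if not parser.read(config_file):
        raise FileNotFoundError(f"Could not read {config_file}")
    return parser.get("GOOGLE_CLOUD_VISION_API", "serviceAccountTokenPath")


def export_gcv_credentials(config_file="Configuration.cfg"):
    """Point the Google client libraries at the configured key file."""
    token_path = os.path.abspath(load_gcv_token_path(config_file))
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = token_path
    return token_path
```

Calling export_gcv_credentials() once at startup, before any Google client is created, would be enough under these assumptions.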
Currently available: Google Cloud Vision and Amazon Web Services Textract
- Download the ground truth information for your dataset, plus URLs of the images.
  - In a web browser, log into the Fern Portal and download an occurrence file, which should contain human-created transcriptions of the herbarium sheet text:
    - As an authenticated user, click "Crowdsource", then click the pencil icon next to the desired dataset.
    - Click the "Exporter" tab. Create a search query (e.g. Collector/Observer CONTAINS Steyermark).
    - For "Processing Status" select "Reviewed." (This ensures that you have ground truth data, i.e. transcribed by humans and reviewed by staff, to compare to the OCR.)
    - For "Structure," select "Darwin Core."
    - For "Data Extensions," deselect "include Determination History" and select "include Image Records."
    - (Compression should already be checked, and "CSV" selected for file format.)
    - For "Character Set" select "UTF-8 (unicode)."
    - Click the "Download Records" button.
  - Place this downloaded ZIP file in your working directory.
- Download your image set.
  - Run the script utilities\join_occurrence_file_with_image_urls.py, pointing to the ZIP file you just downloaded. A new CSV file (e.g. "occurrence_file_with_images.csv") is created in the same directory.
  - Run the script utilities\download_images_from_csv.py, pointing to (1) the "occurrence_file_with_images.csv" file, and (2) the desired directory for the downloaded image set.
- Retrieve and save OCR data for your image set.
  - Run the script gather_ocr_data_from_cloud_platforms.py, pointing to the folder of images downloaded in the previous step and the "occurrence_with_image_urls" file.
  - (To cut down on cloud usage, the program first looks for existing OCR response objects, and any new OCR responses are saved in the ocr_responses folder, with one subfolder for each cloud platform.) For a brand-new set of images, this script will take 23-30 seconds per image.
  - The file "occurrence_with_ocr-<yyyy_mm_dd-hh_mm_ss>.csv" will be saved in the folder test_results.
  - To generate annotated images for each image and cloud platform, e.g. if you want to visualize and manually compare, add the flag 'True' to the script (e.g. python gather_ocr_data_from_cloud_platforms.py images True). These images are saved within a new subfolder called cloud_ocr-[yyyy_mm_dd-hh_mm_ss].
- Compare OCR data to ground truth data.
  - Run the script prep_comparison_data.py with the "occurrence_with_ocr" file generated in the previous step. This script saves 2 new files to the test_results folder:
    - The occurrence file with 3 added columns, saved as "occurrence_with_ocr_and_scores-<yyyy_mm_dd-hh-mm-ss>.csv":
      - labelText - ground truth data (compiled from the human-created transcriptions in the occurrence file)
      - awsMatchingScore - The total "score" for the AWS Textract platform's OCR text found in this image. (Roughly, this score gives 1 pt for an exact match, 0.5 pt for a 60-99% match, and no points for any match <60%.)
      - gcvMatchingScore - The total "score" for the GCV platform's OCR text found in this image.
    - A file called "compare_word_by_word-<yyyy_mm_dd-hh_mm_ss>.csv", which shows the best OCR match (fuzzy match ratio) for each word in the ground truth text (taken from the occurrence file).
  - (Note: because of the label-finding feature, this script takes roughly 23-30 seconds per image on an average personal computer.)
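The matching score described in the comparison step can be approximated as follows. This is only a sketch of the rough rule stated above (1 point for an exact match, 0.5 for a 60-99% match, 0 otherwise); it uses difflib.SequenceMatcher from the standard library as a stand-in for whichever fuzzy-matching routine prep_comparison_data.py actually uses, ignores case, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def best_match_ratio(word, ocr_words):
    """Return the highest similarity (0.0-1.0) between word and any OCR word."""
    return max(
        (SequenceMatcher(None, word.casefold(), candidate.casefold()).ratio()
         for candidate in ocr_words),
        default=0.0,  # no OCR words at all means no match
    )


def matching_score(ground_truth_words, ocr_words):
    """Sum per-word points: 1 for exact, 0.5 for a 60-99% match, else 0."""
    total = 0.0
    for word in ground_truth_words:
        ratio = best_match_ratio(word, ocr_words)
        if ratio == 1.0:
            total += 1.0
        elif ratio >= 0.6:
            total += 0.5
    return total
```

For instance, scoring the ground truth words "Plants", "of", "Guatemala" against OCR output containing "PLANTS", "of", "Guatemela" gives two exact matches and one partial match under this sketch.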
/
- gather_ocr_data_from_cloud_platforms.py - see "Comparing OCR platforms on analyzing herbarium sheet labels" in Example Workflows
- prep_comparison_data.py - see "Comparing OCR platforms on analyzing herbarium sheet labels" in Example Workflows
- calculate_changes.py - quick visualization (in terminal), comparing OCR platforms' performance.
- compare_ocrs.py - deprecated?
- create_images_for_zooniverse.py - Given a folder of images, create images to spec for the latest Zooniverse project.
- crop_images_of_words.py - (in development) Use Zooniverse results and herbarium images to create a dataset of labeled word images.
- Configuration-plain.cfg - see Environment Setup
- requirements.txt - see Environment Setup
/imageprocessor/
- Contains classes for handling, parsing, and visualizing OCR data.
image_annotator.py
image_processor.py
/labelcorpus/
- (not in use) Contains files for creating and applying text corpora.
analyze_corpus.py
make_corpus_from_occurence_file.py
/nameresolution/
- (not in use) Contains files for fuzzy text matching/error correction, specifically for scientific names and synonym resolution.
fuzzy_text_matching.py
taxon_binomial_name_matching.py
/utilities/
- Contains files for quickly loading and saving commonly used data types, as well as some scripts which have specific uses, such as for parsing files and batch downloading.
/image_preparation/
- data_loader.py - Quickly load various data types which are common in this repo.
- data_processor.py - Quickly save various data types which are common in this repo.
- join_occurrence_file_with_image_urls.py - Given a Fern Portal occurrence file, find and verify URLs for each image.
- download_images_from_csv.py - After running the previous script, download the images for a given URL.
- quick_crop_labels.py - Quickly and roughly crop the bottom right corner of a set of herbarium sheet images.
- timer.py - Quick timer class for tracking program execution time.
/imageprocessor_objects/
- Stores all pickled ImageProcessor objects.
/test_results/
- Stores various processed CSV files and annotated images.
Use this program to send each image file to all available cloud-based platforms, for OCR processing.
Input:
- A single image file on the local computer
- OR
- A folder of image files on the local computer
N.B. about file naming and cloud server usage: To reduce cloud computing costs, the program always searches the ocr_responses folder for an existing response object before sending a query to the cloud service. Responses are stored in that folder under the base of the image file name, e.g. the OCR response for cat-and-dog.jpg is saved as cat-and-dog.pickle. If cat-and-dog.jpg is run through this script again, the program imports the pickle and prints a message to the console: "Using previously pickled response object for cat-and-dog".
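The lookup described in this note could be sketched as below. This is an illustration rather than the script's actual code: load_or_query and its query_fn parameter are hypothetical names, and query_fn stands in for whatever call sends the image to a cloud platform and returns a picklable response.

```python
import os
import pickle


def load_or_query(image_path, platform, query_fn, cache_root="ocr_responses"):
    """Return a cached OCR response if one exists, else query and cache it.

    query_fn is a hypothetical callable that takes the image path and
    returns a picklable response object from the cloud service.
    """
    # cat-and-dog.jpg -> ocr_responses/<platform>/cat-and-dog.pickle
    base_name = os.path.splitext(os.path.basename(image_path))[0]
    cache_path = os.path.join(cache_root, platform, base_name + ".pickle")
    if os.path.exists(cache_path):
        print(f"Using previously pickled response object for {base_name}")
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)
    # No cached response: pay for one cloud query, then save the result.
    response = query_fn(image_path)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "wb") as cache_file:
        pickle.dump(response, cache_file)
    return response
```

Because the cache key is only the base file name, two different images that share a name would reuse the same response under this sketch.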
Outputs:
- The responses for any new cloud queries are saved in the folder ocr_responses, with one sub-folder for each cloud service, e.g. aws and gcv.
- The other outputs are all saved to the test_results folder, in a new subfolder called cloud_ocr-[timestamp].
- The complete text output is saved as ocr_texts.csv, with one row per image. For AWS and GCV respectively, a line break (\n) character separates each "line" or "paragraph" of OCR data. (Both AWS and GCV will generate extended character sets and non-Latin characters, such as Latin letters with diacritics, Korean, Arabic, etc.)
- Unless flagged false (see example usage), one copy of each image is generated per cloud platform, with annotations indicating the "words" found by all platforms. The program is configured to draw (1) a thin black box around each line/paragraph, (2) a green line at the start of each detected word, and (3) a red line at the end of each detected word. (This can be adjusted in the draw_comparison_image function.)
Example usage:
python gather_ocr_data_from_cloud_platforms.py oneimage.jpg
python gather_ocr_data_from_cloud_platforms.py image_folder
python gather_ocr_data_from_cloud_platforms.py image_folder True (Same functionality as the previous example.)
python gather_ocr_data_from_cloud_platforms.py image_folder false (Optional second argument to skip the creation of the annotated images. Case-insensitive; will detect "false", "no", or "n".)
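The handling of the optional second argument might look something like the following sketch. The function name should_annotate is hypothetical; only the accepted values ("false", "no", "n", case-insensitive) are taken from the usage notes above.

```python
def should_annotate(argv):
    """Return False only if the optional second argument means "no".

    argv follows the sys.argv layout: script name, image or folder,
    then the optional flag.
    """
    if len(argv) > 2 and argv[2].strip().lower() in {"false", "no", "n"}:
        return False
    # Missing flag, "True", or anything else keeps annotation enabled.
    return True
```

In a script this would be called as should_annotate(sys.argv) after importing sys.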
Example output for python gather_ocr_data_from_cloud_platforms.py oneimage.jpg:
Saved in the folder ./test_results/cloud_ocr-<yyyy-mm-dd_hh-mm-ss>/:
- Image saved as oneimage-annotated<datestamp>.jpg in sub-folder aws.
- Image saved as oneimage-annotated<datestamp>.jpg in sub-folder gcv.
- CSV file ocr_texts.csv:
barcode | gcv | aws |
---|---|---|
C12345678F | 31160 PLANTS OF GUATEMALA ...(etc) | 31160 PLANTS OF GUATEMALA ...(etc) |
... | ... | ... |
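A file with this layout can be produced with Python's csv module, which quotes fields containing line breaks so that each image still occupies one logical row. The sketch below assumes the three columns shown in the table; the function name write_ocr_texts and the shape of its rows argument are hypothetical.

```python
import csv


def write_ocr_texts(rows, csv_path="ocr_texts.csv"):
    """Write one row per image; rows maps barcode -> {platform: [lines]}."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["barcode", "gcv", "aws"])
        for barcode, texts in rows.items():
            writer.writerow([
                barcode,
                # Each OCR "line" or "paragraph" is separated by \n.
                "\n".join(texts.get("gcv", [])),
                "\n".join(texts.get("aws", [])),
            ])
```

Reading the file back with csv.reader (again opened with newline="") restores the embedded line breaks inside each cell.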
Using the downloaded occurrence information (as a ZIP file), this program joins the full occurrence record with the URL of the high resolution image for each row.
Input:
- A ZIP file exported from the Fern Portal. See workflows above for detailed information.
Output:
- The results are saved as a new file in the same directory, with the file name occurrence_file_with_images-[timestamp].csv. This file is the same as the occurrences.csv file, with one additional column, image_url, taken from the images.csv file.
Example usage:
python utilities/join_occurrence_file_with_image_urls.py resources/occur_download.zip
Example output:
Saved in resources/ as occurrence_file_with_images-2021_04_14-10_58_13.csv.
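The join described in this section could be sketched with the standard library alone, as below. The file names occurrences.csv and images.csv and the output column image_url come from the description above; the join keys ("id" in occurrences.csv, "coreid" in images.csv) and the URL column ("accessURI") are assumptions about the export and are exposed as parameters so they can be changed. Unlike the real script, this sketch does not verify that the URLs resolve.

```python
import csv
import io
import zipfile


def join_occurrences_with_images(zip_path, output_path,
                                 occurrence_key="id",
                                 image_key="coreid",
                                 url_column="accessURI"):
    """Write occurrences.csv rows plus an image_url column to output_path.

    The key and URL column names are assumptions; adjust them to match
    the headers in your export.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open("images.csv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            # Keep the first URL seen for each occurrence.
            urls = {}
            for row in csv.DictReader(text):
                urls.setdefault(row[image_key], row[url_column])
        with archive.open("occurrences.csv") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.DictReader(text)
            fieldnames = list(reader.fieldnames) + ["image_url"]
            with open(output_path, "w", newline="", encoding="utf-8") as out:
                writer = csv.DictWriter(out, fieldnames=fieldnames)
                writer.writeheader()
                for row in reader:
                    # Occurrences without an image get an empty URL.
                    row["image_url"] = urls.get(row[occurrence_key], "")
                    writer.writerow(row)
```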
This program uses fuzzy match ratio to find the closest name match based on World Flora Online.
Input:
- A text file of generated binomial names (genus and species), e.g. as generated by OCR, with one name per line.
Output:
- The results are saved with the name [original_filename]-name_match_results.csv in the current working directory.
- This file has 3 columns:
  - The original text string from the input file
  - A list showing the highest ratio match (or multiple options, if tied)
  - The highest ratio achieved by those matches (an integer value 0-100, representing %)
Example usage:
python nameresolution/taxonomic_name.py file_of_OCR_names_to_match.txt
Example output:
Saved as file_of_OCR_names_to_match-name_match_results.csv :
search_query | best_matches | best_match_ratio |
---|---|---|
Adiantum pedatum | ['adiantum pedatum'] | 100 |
Polypodium virginiangan | ['polypodium virginianum'] | 89 |
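A minimal sketch of this kind of lookup is shown below. It uses difflib.SequenceMatcher as a stand-in for the project's fuzzy match ratio, compares names in lowercase, and takes the reference names as a plain list rather than loading World Flora Online data; the function name best_name_matches is hypothetical, and the ratios it produces may differ slightly from the project's.

```python
from difflib import SequenceMatcher


def best_name_matches(query, reference_names):
    """Return (list of best-matching names, best ratio as an integer 0-100)."""
    best_ratio = -1
    best_matches = []
    for name in reference_names:
        candidate = name.lower()
        ratio = round(100 * SequenceMatcher(None, query.lower(), candidate).ratio())
        if ratio > best_ratio:
            # New best score: start a fresh list of matches.
            best_ratio, best_matches = ratio, [candidate]
        elif ratio == best_ratio:
            # Tie with the current best: keep every option.
            best_matches.append(candidate)
    return best_matches, max(best_ratio, 0)
```

Scanning every reference name for every query is slow for a full taxonomic backbone, so a real run would likely need some pre-filtering, for example by genus.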
This project is being developed for the Grainger Bioinformatics Center at the Field Museum by Beth McDonald (@emcdona1), under the guidance of Dr. Rick Ree and Dr. Matt von Konrat.
Original codebase for a GUI system with a local database developed by Keshab Panthi (@kpanthi), Northeastern Illinois University.