You might find this Google Drive link useful. It contains a terrible Jupyter notebook and our presentation.
It also contains the PDF that analyzes the accuracy of our model.
Our dataset is described below. It is also stored on the SCC at:
/restricted/projectnb/cs501t2/ian_thomas/LiberatorProject/CS501-Liberator-Project/data_downloader/output_images
- 1. Table Of Contents
- 2. Newspaper Article Segmentation
- 3. Why is this tool needed
- 4. End goal of this tool
- 5. Big Overview of Tool
- 6. Getting Started
- 7. Known Bugs
- 8. Input File Requirements
- 9. Command Line Interface
- 10. Pipeline Architecture
- 11. Deep Dive Into The Pipeline
- 12. Dataset and Sample Results
- 13. Alternative Methods Tried
- 14. Extra Resources
This repository contains tools for segmenting archives of newspapers, in the form of digital images, into individual articles along with article-level information such as content, author, and title. It utilizes a DNN and a set of post-processing scripts to segment images into articles, producing a JSON output that contains the detected information.
Currently, it supports segmenting an input image into several regions identified as articles; using this data, it can then perform basic OCR.
In the future, features such as title, author, and content extraction would greatly improve the utility of the generated data.
There has been a large effort to digitize old literary works, including historic newspapers. Although having digital copies of this data is a step forward, without a method for searching it the data is significantly less accessible to the average person, academic researchers, and others.
Due to both the time required to manually label data and the sheer volume of data that exists, an automatic method for segmenting and labeling data and for extracting authors, titles, and content is required. This problem is not a new one; there exists research that attempts to solve it using a variety of approaches.
<Add Sources>
The "ideal" version of this tool would be able to take as input a set of images that represent a number of historical newspaper issues. Segment those images into individual issues, extracting the location of each article on the image. Then, among each article extracting:
- Article Content
- Article Title
- Article Author
- Published Date (if applicable)
Further, it would provide a set of topics/keywords for each article to aid in search, i.e., performing some form of topic analysis on the extracted text to aid in classifying individual articles and issues.
With the end goal in mind, the tool as of right now is able to:
- Extract article location (with about 70% accuracy)
- Perform basic OCR on detected regions.
Clearly, there are several points left to complete, and the article location extraction needs some polishing to improve accuracy, especially on more "difficult" to read archives (partially damaged images, more worn pages, etc.).
It cannot be stressed enough that this is an alpha version of the tool. It needs much refining before it can produce data that would be useful to archive systems. We have identified the following as possible next steps for the next person(s) who want to work on this project.
- Train the model on data in the dataset, i.e., manually label some set of images (20-50) and perform transfer learning.
- Train a model (this one or another) to detect the titles of articles, to help extract this data.
- Optimize the method used to take the labeled image and generate a list of polygons representing article regions. It is currently quite slow and generally underperforms.
- Perform content/topic analysis on extracted text.
The tool operates in three principal steps, all wrapped by a single command line entry point.
- Input data is run through a TensorFlow-based DNN and images are labeled.
- Labeled images pass through a post-processing script that attempts to segment images using the labels from the previous step.
- Using the segmentation data (which could also be called bounding boxes), each identified article is run through OCR.
Here we provide a quick overview of how to get started with the project and process data through it.
This project contains a git submodule, so you will need to initialize it in addition to cloning the project. See the git book for more info on how submodules work.
git clone repo.url.git
- In the cloned directory run:
git submodule update --init --recursive
- Python 3.7 (Python >= 3.8 is not supported)
- Tesseract 4.x
- Pipenv
Python dependencies are handled by Pipenv and are thus easy to install and keep updated. The Pipfile located in the root of this repo contains all dependencies needed to run the project end-to-end -- including the ML model.
Simply run in the project directory:
pipenv install
pipenv shell
python main.py
Congrats, you're now 100% set up to run the project!
There are three principal external dependencies: the pre-trained model, tesseract-ocr, and libgl (Linux specific).
To utilize the first part of the pipeline you will need the pre-trained model. As the ML model of this project is based on bbz-segment, you can use their model, which is provided here on Dropbox.
For convenience, we also provide an archive of the complete model above here.
In addition, here is a link to just the separator model, the only one currently utilized in this version of the pipeline. This is the recommended model file to download.
Once downloaded, the model should be extracted to a convenient location. You will need to provide the path to the model to the CLI tool.
tesseract is the OCR engine used for the third step in the pipeline. It must be downloaded and/or installed. Ultimately, you just need to know the path to the appropriate executable for your system.
See the tesseract-ocr documentation for installing the required files. Remember to keep the path of your tesseract install handy, as you will need to provide it to the CLI application.
Most Unix systems require libGL for OpenCV; for Ubuntu:
sudo apt-get install libgl1-mesa-glx
Happily, this section is quite short! The main bug we have noticed seems to be connected to the old version of TensorFlow required to run the ML model. At least on the Shared Computing Cluster (the only GPU we have access to) we could not get GPU-accelerated labeling to work, so for right now the model can only be run on a CPU.
Although this pipeline has been designed with next steps in mind, it was built for the Liberator Dataset specifically, so there are certain expectations for the input files, their naming format, etc.
Input images must be JPGs of at least 2400x1200 (height, width). They must be named with the following convention: issueid_imageid.jpg
A given image in the Liberator dataset carries two key pieces of information:
- issue ID
- image ID
Both are globally unique, and an image has both. That is to say, a given image has an ID and also belongs to a given issue (which typically contains some number of images comprising an entire issue).
As such, both IDs are used extensively throughout the pipeline for identifying images. Input files must use the following naming convention:
issueid_imageid.jpg
Neither the issue ID nor the image ID may contain an underscore, because the underscore is used to separate the two fields.
In theory, the pipeline supports other file extensions, but for right now we're limiting the input dataset to JPEG format. Under the hood, Pillow and OpenCV are used to read images, so they should support other formats, but those are currently untested.
Lastly, the input size must be at least 2400x1200 (height, width). This is a soft lower limit that we have set for this version of the pipeline. It could possibly be changed in the future.
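For illustration, here is a minimal sketch of how these input requirements could be checked before running the pipeline. The helper names are our own, not part of the CLI:

```python
# Sketch: checking an input file against the conventions above.
# parse_filename and check_size are hypothetical helpers, not part of the CLI.
import os
from PIL import Image

MIN_HEIGHT, MIN_WIDTH = 2400, 1200

def parse_filename(path):
    """Split 'issueid_imageid.jpg' into its two IDs."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() != ".jpg":
        raise ValueError(f"expected a .jpg file, got '{ext}'")
    issue_id, sep, image_id = stem.partition("_")
    if not sep or not issue_id or not image_id or "_" in image_id:
        raise ValueError("filename must be issueid_imageid.jpg with exactly one underscore")
    return issue_id, image_id

def check_size(path):
    """Ensure the image meets the 2400x1200 (height, width) minimum."""
    with Image.open(path) as im:
        width, height = im.size  # Pillow reports (width, height)
    if height < MIN_HEIGHT or width < MIN_WIDTH:
        raise ValueError(f"image is {height}x{width} (h x w), below the minimum {MIN_HEIGHT}x{MIN_WIDTH}")
```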
As mentioned previously, main.py is the overall wrapper for this program!
Ensure you have started the Pipenv environment with Python 3.7 and installed all dependencies, PLUS the external dependencies such as Tesseract and the model.
Once you have installed all dependencies you are ready to run the below commands:
Understanding all of the flags available:
python main.py -h
Required Arguments:
- image_directory: the path to the directory containing all the images you would like to pass into this pipeline
- image_extensions: the file format of the images you are working with! NOTE: we only accept .jpg format as of right now
- output_directory: the path to the directory where the results of each step of the pipeline are saved, which is also the input directory for the next stage in the pipeline! NOTE: this MUST be a full path, NOT a relative path
- model_directory: the path to the directory containing the segmentation model! NOTE: inside this directory must be a folder named "v3", then inside that a directory named "sep", and finally inside that directories named "1", "2", "3", "4" and "5". This is how the model will be downloaded on your computer, so just don't mess with the directories it lives in
Required Flags:
- -t: the Tesseract flag. Specify the path to the Tesseract executable on your computer here. NOTE: the directory containing the tesseract executable should also contain all of its dependencies, so DO NOT move the executable file away from its dependencies.
Example Usage:
python main.py "./image_directory" "jpg" "<full_path>/output_directory" "./model" -t "D:/PyTesseract/tesseract.exe"
In the above command I had the directories image_directory and model in the same directory main.py was contained in. Regardless of where the output_directory directory is, you want to put in the full path to that directory. Finally, I used the required -t flag with the full path of where my tesseract executable is.
As of right now, we have not included in the command line interface the ability to run an individual stage of the pipeline. A workaround is to go into the main.py file and comment out the pipeline stages you don't want to run in the main() method! Example below:
The following are excerpts from main.py
Running All Pipelines:
### Step 1: Generate .npy file using bbz-segment and the model
bulk_generate_separators(args['image_directory'], args['image_extensions'], args['output_directory'], args['model_directory'], args['regenerate'], args['debug'], args['verbose'])
### Step 2: Get bounding boxes from .npy file
segment_all_images(args['output_directory'], args['image_directory'], args['output_directory'], args['debug']) # TODO: When this directory and file exist uncomment this
### Step 3: Run OCR on the generated bounding boxes
JSON_NAME = 'data.json' # NOTE: This is the name of the JSON file saved at Step 2 of the pipeline
json_path = os.path.join(args['output_directory'], JSON_NAME)
image_to_article_OCR(json_path, args['output_directory'], args['image_directory'], "tesseract")
Running Last 2 Pipelines:
### Step 1: Generate .npy file using bbz-segment and the model
# NOTE: I commented out Step 1 so the code will only run Steps 2 and 3 of the pipeline, but make sure the dependent files in Step 1 exist already
# bulk_generate_separators(args['image_directory'], args['image_extensions'], args['output_directory'], args['model_directory'], args['regenerate'], args['debug'], args['verbose'])
### Step 2: Get bounding boxes from .npy file
segment_all_images(args['output_directory'], args['image_directory'], args['output_directory'], args['debug']) # TODO: When this directory and file exist uncomment this
### Step 3: Run OCR on the generated bounding boxes
JSON_NAME = 'data.json' # NOTE: This is the name of the JSON file saved at Step 2 of the pipeline
json_path = os.path.join(args['output_directory'], JSON_NAME)
image_to_article_OCR(json_path, args['output_directory'], args['image_directory'], "tesseract")
The pipeline was designed in a manner that enables independent development of each part and/or the addition of new steps. Since the output data from each step is saved, sub-sections of the pipeline can be re-run without re-running the entire pipeline.
For example, if one wanted to improve the article region detection algorithm (which leverages the output from the ML model), there would be no need to re-run the labeling model, since those files have already been generated.
Outside of the first step (the ML model that labels images), the common exchange format between steps is the JSON file containing the detected information about the articles/issues. If a new feature is to be added, the author should try as much as possible to continue this design choice: a new "step" in the pipeline should read the input from the previous steps, perform the new operations, and output a new JSON file for use by the next step of the pipeline, always preserving both its input and output.
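As a rough sketch of that read/extend/write pattern: only data.json and the articles list are documented in this README, so the other field and file names below are illustrative assumptions:

```python
# Sketch of how a hypothetical new step could extend the pipeline's JSON output.
# Only "data.json" and the "articles" list come from this document; other names are illustrative.
import json
import os

def run_new_step(output_directory):
    in_path = os.path.join(output_directory, "data.json")               # output of the previous step
    out_path = os.path.join(output_directory, "data_with_titles.json")  # hypothetical new output

    with open(in_path, "r") as f:
        data = json.load(f)

    # Add new information without discarding anything already present.
    for article in data.get("articles", []):
        article["title"] = None  # placeholder for whatever the new step would detect

    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```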
Here we hope to take a deep dive into each component of the pipeline, talk about the code a bit more specifically, and provide insights into choices that were made and possible room for improvement.
// TODO: Update the name
This part of the code handles the labeling of images using the DNN model. This model is based on other research work, thus this part of the code has been kept in an open-source location to comply with the license.
The original work can be found at this GitHub repository and was published in this paper.
The model actually comprises two distinct parts, called sep and blkx in the original repository and academic paper. The first, sep, attempts to segment a given image into the following components: background, horizontal separator, vertical separator, and table separators. The second, blkx, attempts to classify regions of the image into one of the following categories: background, text, tabular data, illustrations.
In this pipeline we only use the first one, sep, of which we only utilize the identified vertical and horizontal separators to help segment the images.
In both types of models, the model performs pixel-level labeling of the image. That is, each pixel in the image is assigned a particular label, an integer value. In addition, images are resized to 2400x1200 before processing.
For more in-depth information, it would be useful to read the paper linked above.
This section gives a general overview of how an image is taken from its original input and passed through the ML model, finally outputting the pixel-level labels.
First, input images are resized to 2400x1200; this size is a property set in the pre-trained model and is loaded from the model file at runtime. Images are then broken into smaller tiles, which allows images to be processed with lower video memory requirements.
After passing through the model, the output is a numpy array where each (row, col) pair represents a pixel in the resized image and the label assigned by the model. Depending on the command line options used, this data is saved in the following format:
Always saved (as a pickle via numpy):
- Labeled numpy array
- Image original size
- Input filename
If a debug parameter was passed, the following is also saved:
- A reproduction of the labeled image, where each label gets its own color. This allows inspection of how the model labeled the input image.
This information is then saved for the next part of the pipeline.
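For illustration, here is a minimal sketch of how a downstream consumer might load this saved output and render a colorized debug view. The exact layout of the pickled object is set by the labeling code, so the keys used here are assumptions:

```python
# Sketch: loading the pickled label output and rendering a colorized debug image.
# Assumes the saved object is a dict pickled via numpy with these keys; adjust to the real format.
import numpy as np
from PIL import Image

saved = np.load("example_image.npy", allow_pickle=True).item()
labels = saved["labels"]                  # 2D int array, one label per pixel (assumed key)
original_size = saved["original_size"]    # original (height, width), saved alongside the labels

# Map each integer label to an arbitrary color for visual inspection.
palette = np.array([
    [255, 255, 255],  # background
    [255, 0, 0],      # horizontal separator
    [0, 0, 255],      # vertical separator
    [0, 255, 0],      # table separator
], dtype=np.uint8)
debug = palette[np.clip(labels, 0, len(palette) - 1)]
Image.fromarray(debug).save("example_image_labels.png")
```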
This step of the pipeline attempts to utilize the data from the previous step to create a set of polygons that represent the detected article regions of the input image. This is done via a simple rule-based algorithm, which exclusively utilizes the detected horizontal and vertical separators to make these detections.
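To give a feel for what such a rule-based approach can look like, here is an illustrative sketch (not the exact algorithm implemented in this repository; the separator label values are assumptions) that masks out separator pixels and takes bounding boxes of the remaining connected regions:

```python
# Sketch: deriving candidate article regions from a separator label map.
# Integer labels for the separator classes are assumed; adjust to the real model output.
import cv2
import numpy as np

H_SEP, V_SEP = 1, 2  # assumed labels for horizontal/vertical separators

def candidate_regions(labels):
    """Return bounding boxes of areas enclosed by the detected separators."""
    # Pixels that are NOT separators form the candidate article area.
    free = ((labels != H_SEP) & (labels != V_SEP)).astype(np.uint8)
    n, components = cv2.connectedComponents(free)
    boxes = []
    for comp_id in range(1, n):
        ys, xs = np.where(components == comp_id)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x0, y0, x1, y1)
    return boxes
```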
The output from this step is a partially filled JSON file in the form defined in the output section of this document. It contains a list of articles, each with a set of images and polygons that make up that article. At this step, no attempt to join articles across multiple pages has been made; thus, each article object only contains a single image, although it may contain multiple polygons within that image.
This step also outputs a debug image, if specified at invocation. This image is the original image annotated with colored boxes drawn around each identified article region, which allows inspection of how the algorithm chose individual articles.
Check the to-do list for current issues with this step
Despite the reasonably simplistic nature of this step, it takes a considerable amount of time to run, typically about 30-45s per image. No attempt to improve this speed has been made; the easiest would be to process multiple images in parallel, but there also exist other opportunities for improving the speed of the pipeline.
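A minimal sketch of the parallelization idea, assuming a hypothetical per-image wrapper (segment_one_image) around the existing segmentation logic that only touches its own input and output files:

```python
# Sketch: processing images in parallel with a process pool.
# segment_one_image is a hypothetical per-image wrapper around the existing step.
from multiprocessing import Pool
from pathlib import Path

def segment_one_image(npy_path):
    # ... call the existing single-image segmentation logic here ...
    return npy_path

if __name__ == "__main__":
    npy_files = sorted(Path("output_directory").glob("*.npy"))
    with Pool(processes=4) as pool:
        for done in pool.imap_unordered(segment_one_image, npy_files):
            print(f"finished {done}")
```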
This step of the pipeline takes as input the partially filled JSON file from the previous step, performs OCR on the detected article regions, and fills the JSON file with the detected text.
Currently this module uses tesseract-ocr with default settings to perform the text detection. No attempt at detecting keywords, authors, article titles, etc. is made.
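For reference, OCR on a single cropped article region with default Tesseract settings looks roughly like this; the sketch uses the pytesseract wrapper and placeholder crop coordinates:

```python
# Sketch: running Tesseract with default settings on one article region.
import pytesseract
from PIL import Image

# Point pytesseract at the executable, mirroring the -t flag of the CLI.
pytesseract.pytesseract.tesseract_cmd = r"D:/PyTesseract/tesseract.exe"

image = Image.open("12345_67890.jpg")
region = image.crop((100, 200, 900, 1600))  # (left, upper, right, lower) placeholder box
text = pytesseract.image_to_string(region)
print(text)
```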
We provide links to all data utilized for this project. Specific resources are broken out below, but here is the link to the entire Google Drive folder:
Example data is provided in the following two folders; it contains the full output from the entire pipeline for a single image.
We provide a tool to scrape the Digital Commonwealth website for all images in this dataset. It can be found in the data_downloader folder, which contains its own readme and install instructions.
The complete dataset can be downloaded from Google Drive here, or re-scraped using the data_downloader tool if the latest data is required. One could also jump-start the scraping by downloading our archived dataset and re-running the data_downloader, which will only download images not already in the dataset.
CSV with image links and metadata
We also provide the sample dataset, which contains 25 issues comprised of 100 images. This is our test data and contains annotations for the number of ground-truth articles per issue. It also contains the output from the current version of the model, including debug data (annotated images, etc.), article segmentation, and OCR results.
Sample Images w/Output at each step
We briefly explored other methods for accomplishing this task, mainly AWS Textract. Although Textract tends to do very well with structured data, we did not see good performance on our dataset. About the only thing it could infer was rough column/row data; these divisions often did not indicate the actual location of columns/rows and did not do as good a job as the model we utilized.
Some of our EDA work lives in EDA. May or may not be useful for future work.
Origami is a tool built upon the same model we used (built by the same researcher) that attempts to solve a very similar problem. Results from this tool look promising, but we were unable to get it to run on our systems. In-code documentation is a bit lacking as well.
Here we provide a list of extra resources that anyone else working on this project might find helpful.
An Auto-Encoder Strategy for Adaptive Image Segmentation:
This paper presents a method that utilizes an auto-encoder to segment MRI images. Its two main features are its ability to learn with hardly any reference data (1 labeled image) and its greater computational efficiency compared to other approaches. Code is available on GitHub.
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
Similar approach and model to the paper we based our work on. Code also available on github.
Clustering-Based Article Identification in Historical Newspapers
This paper explores the use of NLP approaches for segmenting articles, finding reading order, etc. Their method takes the OCR data and utilizes various NLP techniques in an attempt to identify where one article starts and another ends.
Logical segmentation for article extraction in digitized old newspapers
Fully Convolutional Neural Networks for Newspaper Article Segmentation