You might find this Google Drive link useful. It contains a terrible Jupyter notebook and our presentation.
It also contains the PDF that analyzes the accuracy of our model.
Our dataset is described below. It is also stored on the SCC at:
/restricted/projectnb/cs501t2/ian_thomas/LiberatorProject/CS501-Liberator-Project/data_downloader/output_images
- 1. Table Of Contents
- 2. Newspaper Article Segmentation
- 3. Why is this tool needed
- 4. End goal of this tool
- 5. Big Overview of Tool
- 6. Getting Started
- 7. Known Bugs
- 8. Input File Requirements
- 9. Command Line Interface
- 10. Pipeline Architecture
- 11. Deep Dive Into The Pipeline
- 12. Dataset and Sample Results
- 13. Alternative Methods Tried
- 14. Extra Resources
This repository contains tools for segmenting archives of newspapers, in the form of digital images, into individual articles along with article-level information such as content, author, and title. It utilizes a DNN and a set of post-processing scripts to segment images into articles, producing a JSON output that contains the detected information.
Currently, it supports segmenting an input image into several regions identified as articles; using this data, it can then perform basic OCR.
In the future, features such as title, author, and content extraction would greatly improve the utility of the generated data.
There has been a large effort to digitize old literary works, including historic newspapers. Although having digital copies of this data is a step forward, without a method for searching it the data is significantly less accessible to the average person, academic researchers, and others.
Due to both the time required to manually label data and the sheer volume of data that exists, an automatic method for segmenting and labeling data and for extracting authors, titles, and content is required. This problem is not a new one; there exists research that attempts to solve it using a variety of approaches.
<Add Sources>
The "ideal" version of this tool would be able to take as input a set of images that represent a number of historical newspaper issues. Segment those images into individual issues, extracting the location of each article on the image. Then, among each article extracting:
- Article Content
- Article Title
- Article Author
- Published Date (if applicable)
Further, it would provide a set of topics/keywords for each article to aid in search, i.e., performing some form of topic analysis on the extracted text to aid in classifying individual articles and issues.
With the end goal in mind, the tool as of right now is able to:
- Extract article location (with about 70% accuracy)
- Perform basic OCR on detected regions.
Clearly, there are several points left to complete, and the article location extraction needs some polishing to improve accuracy, especially on more "difficult" to read archives (partially damaged images, more worn pages, etc.).
It cannot be stressed enough that this is an alpha version of the tool. It needs much refining before it can produce data that would be useful to archive systems. We have identified the following as possible next steps for the next person(s) who want to work on this project.
- Train the model on data in the dataset, i.e., manually label some set of images (20-50) and perform transfer learning.
- Train a model (this one or another) to detect the titles of articles, to help extract this data.
- Optimize the method used to take the labeled image and generate a list of polygons representing article regions. It is currently quite slow and generally underperforms.
- Perform content/topic analysis on extracted text.
The tool operates in three principal steps, all wrapped by a single command line entry point.
- Input data is run through a TensorFlow-based DNN and images are labeled.
- Labeled images pass through a post-processing script that attempts to segment images using the labels from the previous step.
- Using the segmentation data (which could also be called bounding boxes), each identified article is run through OCR.
Here we provide a quick overview of how to get started with the project and process data through it.
This project contains a git submodule, so you will need to initialize it in addition to cloning the project. See the git book for more info on how submodules work.
git clone repo.url.git
- In the cloned directory run:
git submodule update --init --recursive
- Python 3.7 (Python >= 3.8 is not supported)
- Tesseract 4.x
- Pipenv
Python dependencies are handled by Pipenv and are thus easy to install and keep updated. The Pipfile located in the root of this repo contains all dependencies needed to run the project end-to-end -- including the ML model.
Simply run in the project directory:
pipenv install
pipenv shell
python main.py
Congrats, you're now 100% set up to run the project!
There are three principal external dependencies: the pre-trained model, tesseract-ocr, and libgl (Linux specific).
To utilize the first part of the pipeline you will need the pre-trained model. As the ML model of this project is based on bbz-segment, you can use their model, which is provided here on Dropbox.
For convenience, we also provide an archive of the complete model above here.
In addition, here is a link to just the separator model, the only one currently utilized in this version of the pipeline. This is the recommended model file to download.
Once downloaded, the model should be extracted to a convenient location. You will need to provide the path to the model to the CLI tool.
tesseract is the OCR engine used for the third step in the pipeline. It must be downloaded and/or installed. Ultimately, you just need to know the path to the appropriate executable for your system.
See the tesseract-ocr documentation for installing the required files. Remember to keep the path of your tesseract install handy, as you will need to provide it to the CLI application.
Most Unix systems require libGL for OpenCV; for Ubuntu:
sudo apt-get install libgl1-mesa-glx
Happily, this section is quite short! The main bug we have noticed seems to be connected to the old version of TensorFlow required to run the ML model. At least on the Shared Computing Cluster (the only GPU we have access to) we could not get GPU-accelerated labeling to work, so for right now the model can only be run on a CPU.
Although this pipeline has been designed with next steps in mind, it was built for the Liberator Dataset specifically, so there are certain expectations for the input files, their naming format, etc.
Input images must be JPGs of at least 2400x1200 (height, width). They must be named with the following convention: issueid_imageid.jpg
A given image in the Liberator dataset carries two key pieces of information:
- issue ID
- image ID
Both are globally unique, and an image has both. That is to say, a given image has an ID and also belongs to a given issue (which typically contains some number of images comprising an entire issue).
As such, both IDs are used extensively throughout the pipeline for identifying images. Input files must use the following naming convention:
issueid_imageid.jpg
Neither the issue ID nor the image ID may contain an underscore, because the underscore is used to separate the two fields.
In theory, the pipeline supports other file extensions, but for right now we're limiting the input dataset to JPEG format. Under the hood, Pillow and OpenCV are used to read images, so they should support other formats, but those are currently untested.
Lastly, the input size must be at least 2400x1200 (height, width). This is a soft lower limit that we have set for this version of the pipeline. It could possibly be changed in the future.
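For illustration, here is a minimal sketch of how these input requirements could be checked before running the pipeline. The helper names are our own, not part of the CLI:

```python
# Sketch: checking an input file against the conventions above.
# parse_filename and check_size are hypothetical helpers, not part of the CLI.
import os
from PIL import Image

MIN_HEIGHT, MIN_WIDTH = 2400, 1200

def parse_filename(path):
    """Split 'issueid_imageid.jpg' into its two IDs."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() != ".jpg":
        raise ValueError(f"expected a .jpg file, got '{ext}'")
    issue_id, sep, image_id = stem.partition("_")
    if not sep or not issue_id or not image_id or "_" in image_id:
        raise ValueError("filename must be issueid_imageid.jpg with exactly one underscore")
    return issue_id, image_id

def check_size(path):
    """Ensure the image meets the 2400x1200 (height, width) minimum."""
    with Image.open(path) as im:
        width, height = im.size  # Pillow reports (width, height)
    if height < MIN_HEIGHT or width < MIN_WIDTH:
        raise ValueError(f"image is {height}x{width} (h x w), below the minimum {MIN_HEIGHT}x{MIN_WIDTH}")
```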
As mentioned previously, main.py is the overall wrapper for this program!
Ensure you have started the Pipenv environment with Python 3.7 and installed all dependencies, PLUS the external dependencies such as Tesseract and the model.
Once you have installed all dependencies you are ready to run the below commands:
Understanding all of the flags available:
python main.py -h
Required Arguments:
- image_directory: the path to the directory containing all the images you would like to pass into this pipeline
- image_extensions: the file format of the images you are working with! NOTE: we only accept .jpg format as of right now
- output_directory: the path to the directory where the results of each step of the pipeline are saved, which is also the input directory for the next stage in the pipeline! NOTE: this MUST be a full path, NOT a relative path
- model_directory: the path to the directory containing the segmentation model! NOTE: inside this directory must be a folder named "v3", then inside that a directory named "sep", and finally inside that directories named "1", "2", "3", "4" and "5". This is how the model will be downloaded on your computer, so just don't mess with the directories it lives in
Required Flags:
- -t: the Tesseract flag. Specify the path to the Tesseract executable on your computer here. NOTE: the directory containing the tesseract executable should also contain all of its dependencies, so DO NOT move the executable file away from its dependencies.
Example Usage:
python main.py "./image_directory" "jpg" "<full_path>/output_directory" "./model" -t "D:/PyTesseract/tesseract.exe"
In the above command I had the directories image_directory and model in the same directory main.py was contained in. Regardless of where the output_directory directory is, you want to put in the full path to that directory. Finally, I used the required -t flag with the full path of where my tesseract executable is.
As of right now, we have not included in the command line interface the ability to run an individual stage of the pipeline. A workaround is to go into the main.py file and comment out the pipeline stages you don't want to run in the main() method! Example below:
The following are excerpts from main.py
Running All Pipelines:
### Step 1: Generate .npy file using bbz-segment and the model
bulk_generate_separators(args['image_directory'], args['image_extensions'], args['output_directory'], args['model_directory'], args['regenerate'], args['debug'], args['verbose'])
### Step 2: Get bounding boxes from .npy file
segment_all_images(args['output_directory'], args['image_directory'], args['output_directory'], args['debug']) # TODO: When this directory and file exist uncomment this
### Step 3: Run OCR on the generated bounding boxes
JSON_NAME = 'data.json' # NOTE: This is the name of the JSON file saved at Step 2 of the pipeline
json_path = os.path.join(args['output_directory'], JSON_NAME)
image_to_article_OCR(json_path, args['output_directory'], args['image_directory'], "tesseract")
Running Last 2 Pipelines:
### Step 1: Generate .npy file using bbz-segment and the model
# NOTE: I commented out Step 1 so the code will only run Steps 2 and 3 of the pipeline, but make sure the dependent files in Step 1 exist already
# bulk_generate_separators(args['image_directory'], args['image_extensions'], args['output_directory'], args['model_directory'], args['regenerate'], args['debug'], args['verbose'])
### Step 2: Get bounding boxes from .npy file
segment_all_images(args['output_directory'], args['image_directory'], args['output_directory'], args['debug']) # TODO: When this directory and file exist uncomment this
### Step 3: Run OCR on the generated bounding boxes
JSON_NAME = 'data.json' # NOTE: This is the name of the JSON file saved at Step 2 of the pipeline
json_path = os.path.join(args['output_directory'], JSON_NAME)
image_to_article_OCR(json_path, args['output_directory'], args['image_directory'], "tesseract")
The pipeline was designed in a manner that enables independent development of each part and/or the addition of new steps. Since the output data from each step is saved, sub-sections of the pipeline can be re-run without re-running the entire pipeline.
For example, if one wanted to improve the article region detection algorithm (which leverages the output from the ML model), there would be no need to re-run the labeling model, since those files have already been generated.
Outside of the first step (the ML model that labels images), the common exchange format between steps is the JSON file containing the detected information about the articles/issues. If a new feature is to be added, the author should try as much as possible to continue this design choice: a new "step" in the pipeline should read the input from the previous steps, perform the new operations, and output a new JSON file for use by the next step of the pipeline, always preserving both its input and output.
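As a rough sketch of that read/extend/write pattern: only data.json and the articles list are documented in this README, so the other field and file names below are illustrative assumptions:

```python
# Sketch of how a hypothetical new step could extend the pipeline's JSON output.
# Only "data.json" and the "articles" list come from this document; other names are illustrative.
import json
import os

def run_new_step(output_directory):
    in_path = os.path.join(output_directory, "data.json")               # output of the previous step
    out_path = os.path.join(output_directory, "data_with_titles.json")  # hypothetical new output

    with open(in_path, "r") as f:
        data = json.load(f)

    # Add new information without discarding anything already present.
    for article in data.get("articles", []):
        article["title"] = None  # placeholder for whatever the new step would detect

    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```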
Here we hope to take a deep dive into each component of the pipeline, talk about the code a bit more specifically, and provide insights into choices that were made and possible room for improvement.
// TODO: Update the name
This part of the code handles the labeling of images using the DNN model. This model is based on other research work, thus this part of the code has been kept in an open-source location to comply with the license.
The original work can be found at this GitHub repository and was published in this paper.
The model actually comprises two distinct parts, called sep and blkx in the original repository and academic paper. The first, sep, attempts to segment a given image into the following components: background, horizontal separator, vertical separator, and table separators. The second, blkx, attempts to classify regions of the image into one of the following categories: background, text, tabular data, illustrations.
In this pipeline we only use the first one, sep, of which we only utilize the identified vertical and horizontal separators to help segment the images.
In both types of models, the model performs pixel-level labeling of the image. That is, each pixel in the image is assigned a particular label, an integer value. In addition, images are resized to 2400x1200 before processing.
For more in-depth information, it would be useful to read the paper linked above.
This section gives a general overview of how an image is taken from its original input and passed through the ML model, finally outputting the pixel-level labels.
First, input images are resized to 2400x1200; this size is a property set in the pre-trained model and is loaded from the model file at runtime. Images are then broken into smaller tiles, which allows images to be processed with lower video memory requirements.
After passing through the model, the output is a numpy array where each (row, col) pair represents a pixel in the resized image and the label assigned by the model. Depending on the command line options used, this data is saved in the following format:
Always saved (as a pickle via numpy):
- Labeled numpy array
- Image original size
- Input filename
If a debug parameter was passed, the following is also saved:
- A reproduction of the labeled image, where each label gets its own color. This allows inspection of how the model labeled the input image.
This information is then saved for the next part of the pipeline.
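For illustration, here is a minimal sketch of how a downstream consumer might load this saved output and render a colorized debug view. The exact layout of the pickled object is set by the labeling code, so the keys used here are assumptions:

```python
# Sketch: loading the pickled label output and rendering a colorized debug image.
# Assumes the saved object is a dict pickled via numpy with these keys; adjust to the real format.
import numpy as np
from PIL import Image

saved = np.load("example_image.npy", allow_pickle=True).item()
labels = saved["labels"]                  # 2D int array, one label per pixel (assumed key)
original_size = saved["original_size"]    # original (height, width), saved alongside the labels

# Map each integer label to an arbitrary color for visual inspection.
palette = np.array([
    [255, 255, 255],  # background
    [255, 0, 0],      # horizontal separator
    [0, 0, 255],      # vertical separator
    [0, 255, 0],      # table separator
], dtype=np.uint8)
debug = palette[np.clip(labels, 0, len(palette) - 1)]
Image.fromarray(debug).save("example_image_labels.png")
```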
This step of the pipeline attempts to utilize the data from the previous step to create a set of polygons that represent the detected article regions of the input image. This is done via a simple rule-based algorithm, which exclusively utilizes the detected horizontal and vertical separators to make these detections.
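To give a feel for what such a rule-based approach can look like, here is an illustrative sketch (not the exact algorithm implemented in this repository; the separator label values are assumptions) that masks out separator pixels and takes bounding boxes of the remaining connected regions:

```python
# Sketch: deriving candidate article regions from a separator label map.
# Integer labels for the separator classes are assumed; adjust to the real model output.
import cv2
import numpy as np

H_SEP, V_SEP = 1, 2  # assumed labels for horizontal/vertical separators

def candidate_regions(labels):
    """Return bounding boxes of areas enclosed by the detected separators."""
    # Pixels that are NOT separators form the candidate article area.
    free = ((labels != H_SEP) & (labels != V_SEP)).astype(np.uint8)
    n, components = cv2.connectedComponents(free)
    boxes = []
    for comp_id in range(1, n):
        ys, xs = np.where(components == comp_id)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x0, y0, x1, y1)
    return boxes
```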
The output from this step is a partially filled JSON file in the form defined in the output section of this document. It contains a list of articles, each with a set of images and polygons that make up that article. At this step, no attempt to join articles across multiple pages has been made; thus, each article object only contains a single image, although it may contain multiple polygons within that image.
This step also outputs a debug image, if specified at invocation. This image is the original image annotated with colored boxes drawn around each identified article region, which allows inspection of how the algorithm chose individual articles.
Check the to-do list for current issues with this step
Despite the reasonably simplistic nature of this step, it takes a considerable amount of time to run, typically about 30-45s per image. No attempt to improve this speed has been made; the easiest would be to process multiple images in parallel, but there also exist other opportunities for improving the speed of the pipeline.
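A minimal sketch of the parallelization idea, assuming a hypothetical per-image wrapper (segment_one_image) around the existing segmentation logic that only touches its own input and output files:

```python
# Sketch: processing images in parallel with a process pool.
# segment_one_image is a hypothetical per-image wrapper around the existing step.
from multiprocessing import Pool
from pathlib import Path

def segment_one_image(npy_path):
    # ... call the existing single-image segmentation logic here ...
    return npy_path

if __name__ == "__main__":
    npy_files = sorted(Path("output_directory").glob("*.npy"))
    with Pool(processes=4) as pool:
        for done in pool.imap_unordered(segment_one_image, npy_files):
            print(f"finished {done}")
```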
This step of the pipeline takes as input the partially filled JSON file from the previous step, performs OCR on the detected article regions, and fills the JSON file with the detected text.
Currently this module uses tesseract-ocr with default settings to perform the text detection. No attempt at detecting keywords, authors, article titles, etc. is made.
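For reference, OCR on a single cropped article region with default Tesseract settings looks roughly like this; the sketch uses the pytesseract wrapper and placeholder crop coordinates:

```python
# Sketch: running Tesseract with default settings on one article region.
import pytesseract
from PIL import Image

# Point pytesseract at the executable, mirroring the -t flag of the CLI.
pytesseract.pytesseract.tesseract_cmd = r"D:/PyTesseract/tesseract.exe"

image = Image.open("12345_67890.jpg")
region = image.crop((100, 200, 900, 1600))  # (left, upper, right, lower) placeholder box
text = pytesseract.image_to_string(region)
print(text)
```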
We provide links to all data utilized for this project. Specific resources are broken out below, but here is the link to the entire Google Drive folder:
Example data is provided in the following two folders; it contains the full output from the entire pipeline for a single image.
We provide a tool to scrape the Digital Commonwealth website for all images in this dataset. It can be found in the data_downloader folder, which contains its own readme and install instructions.
The complete dataset can be downloaded from Google Drive here, or re-scraped using the data_downloader tool if the latest data is required. One could also jump-start the scraping by downloading our archived dataset and re-running the data_downloader, which will only download images not already in the dataset.
CSV with image links and metadata
We also provide the sample dataset, which contains 25 issues comprised of 100 images. This is our test data and contains annotations for the number of ground-truth articles per issue. It also contains the output from the current version of the model, including debug data (annotated images, etc.), article segmentation, and OCR results.
Sample Images w/Output at each step
We briefly explored other methods for accomplishing this task, mainly AWS Textract. Although Textract tends to do very well with structured data, we did not see good performance on our dataset. About the only thing it could infer was rough column/row data; these divisions often did not indicate the actual location of columns/rows and did not do as good a job as the model we utilized.
Some of our EDA work lives in EDA. May or may not be useful for future work.
Origami is a tool built upon the same model we used (built by the same researcher) that attempts to solve a very similar problem. Results from this tool look promising, but we were unable to get it to run on our systems. In-code documentation is a bit lacking as well.
Here we provide a list of extra resources that anyone else working on this project might find helpful.
An Auto-Encoder Strategy for Adaptive Image Segmentation:
This paper presents a method that utilizes an auto-encoder to segment MRI images. Its two main features are its ability to learn with hardly any reference data (1 labeled image) and its greater computational efficiency compared to other approaches. Code is available on GitHub.
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
Similar approach and model to the paper we based our work on. Code also available on github.
Clustering-Based Article Identification in Historical Newspapers
This paper explores the use of NLP approaches for segmenting articles, finding reading order, etc. Their method takes the OCR data and utilizes various NLP techniques in an attempt to identify where one article starts and another ends.
Logical segmentation for article extraction in digitized old newspapers
Fully Convolutional Neural Networks for Newspaper Article Segmentation