Semantic structure identification model for general contract documents using Mask R-CNN. It generates a JSON file containing the semantic structure of each component in a document.
The semantic labels are:
- Title
- Subtitle
- Paragraph
- Footnotes
- Header
- Footer
- Page
- Signature
git clone https://github.com/kasimebrahim/instance_segmentation
cd instance_segmentation
# Open conda_environment.yml and change the last line to your own conda environment path.
conda env create -f conda_environment.yml
pip install -r requirements.txt
You will need Python > 3.7 and, optionally, conda > 4.7.10.
To inspect and visualize your dataset, use dataset_inspect.
To evaluate your model or visualize its output, use model_eval.
To train your model:
# To train starting from a pretrained model:
python Segmentation.py train --datasets=datasets --log=log --model=models/mask_rcnn_pub_lay_seg_0100.h5
# To train without a pretrained model:
python Segmentation.py train --datasets=datasets --log=log
# To resume a stopped training run:
python Segmentation.py train --datasets=datasets --log=log --pickup=true
To segment your documents:
python Segmentation.py segment --model=models/mask_rcnn_doc_seg_0100.h5 --image=infer
# --image is the directory where the documents to be segmented are stored.
# The documents can be laid out in that directory in one of two ways.
# One: each document has its own directory, and every page of the
# document is stored in that directory, e.g. ds/doc/p01.jpg
# Two: all pages of all documents are in a single directory, and every
# image is named as the document name concatenated with the page
# name/number, e.g. ds/doc_p01.jpg
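The two input layouts can be checked before running segmentation. The following is a minimal sketch, not code from this repository: the helper `group_pages`, the set of accepted image suffixes, and the rule of splitting file names on the last underscore are all assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import Path

# Assumption: which file suffixes count as page images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def group_pages(root):
    """Map each document name to a sorted list of its page image paths.

    Hypothetical helper: supports both layouts, ds/doc/p01.jpg and
    ds/doc_p01.jpg, inside the same root directory.
    """
    documents = defaultdict(list)
    for path in sorted(Path(root).iterdir()):
        if path.is_dir():
            # Layout one: the directory name is the document name.
            for page in sorted(path.iterdir()):
                if page.suffix.lower() in IMAGE_SUFFIXES:
                    documents[path.name].append(page)
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            # Layout two: split "doc_p01" on the last underscore (assumption).
            doc_name, sep, _page = path.stem.rpartition("_")
            documents[doc_name if sep else path.stem].append(path)
    return dict(documents)
```

Calling `group_pages("infer")` on the directory passed to `--image` returns a dictionary whose keys are the document names found, which makes it easy to spot pages that were grouped under an unexpected document.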
The output of the segmentation is stored in a JSON file named "documents.json".
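The layout of "documents.json" is not described here, so a first step after segmentation may be to look at its overall shape. The sketch below assumes nothing about the schema: the helpers `describe` and `load_and_describe` are hypothetical and only report the nesting of objects, arrays, and value types found in the file.

```python
import json


def describe(value, max_depth=2, _depth=0):
    """Return a short, schema-agnostic description of a parsed JSON value."""
    if isinstance(value, dict):
        if _depth >= max_depth:
            return f"object({len(value)} keys)"
        inner = ", ".join(
            f"{key!r}: {describe(item, max_depth, _depth + 1)}"
            for key, item in value.items()
        )
        return "{" + inner + "}"
    if isinstance(value, list):
        if not value or _depth >= max_depth:
            return f"array({len(value)} items)"
        # Only the first element is described, as a sample of the array.
        sample = describe(value[0], max_depth, _depth + 1)
        return f"array({len(value)} items, first: {sample})"
    return type(value).__name__


def load_and_describe(path="documents.json", max_depth=2):
    """Load a JSON file and describe its structure down to max_depth."""
    with open(path, encoding="utf-8") as handle:
        return describe(json.load(handle), max_depth)
```

Running `print(load_and_describe())` in the directory that contains "documents.json" prints a one-line outline of the file; raising `max_depth` shows deeper levels.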
This repository heavily reuses code from the amazing TensorFlow Mask R-CNN implementation by @waleedka. Many thanks to all the contributors of that project. You are encouraged to check out https://github.com/matterport/Mask_RCNN for documentation on many other aspects of this code.
Kasim Ebrahim [email protected]