Parsing Hindi tables better #284
Comments
Hi! Please update the ticket.
@GautamR-Samagra I would like to work on this issue, kindly assign it to me.
@35C4n0r Thanks for showing interest :)
Is this the requirement? https://colab.research.google.com/drive/1pCYJ4H9vwV3GoriQ6UNoIj8N42_i3tMv?usp=sharing After: (a slightly rough marking of the start (top) and end (bottom) of each line.)
@Amit0617 this looks great. We also need to be able to detect the columns.
Hi, I'm planning to have a meeting today (22-Jan-2024) at 5:30 pm to discuss progress on this and clarify any doubts. @Amit0617 @35C4n0r @SarveshAtawane let me know if you can join on this link.
Oops! I missed it.
@GautamR-Samagra Here are the results and my Colab notebook. I've also added the logic for converting it to a Python dictionary.
merged_tables.pdf
@GautamR-Samagra What should be my next step? Have you checked the notebook?
@35C4n0r can you manually estimate for how many of the 26 tables it's working? Can you verify whether it's working for all the tables that don't have row/column line boundaries?
@35C4n0r You provide the pdf and image as input along with the DETR model and the Pytesseract language. For example, here you get as output: Can you create another ticket on ai-tools with these details and assign it to yourself? I'll add to the details and decide the community points etc. Once the performance of Table Transformers or any other object detection model improves, we can use this in conjunction with that.
Since this is not being done with the above approaches, let's try to solve this from the ground up heuristically, just to pass the test cases on the pdf. Then we'll abstract the ideas to make them generic and useful. I have a bunch of ideas here. We can keep adding ideas around models for row/column detection and figure out how to ensemble them all.
@GautamR-Samagra, sorry for the huge delay. I got some time today and made a little bit of progress; I've used the classic edge-detection approach. What I would like to have is something like @naina35's solution. If possible, kindly share your approach with us @naina35. Once I have the bboxes for the text objects, I'll be able to clean up the edge detection.
Hey @35C4n0r, can you pause work on this for now? I have made some progress myself and will clean and update that here in a couple of hours.
Sure @GautamR-Samagra.
Hey @35C4n0r, this was my first time trying my hand at Python, so I couldn't do much. This is what I managed to do:

```python
import cv2
import numpy as np

file = 'C:/Users/hp5cd/Desktop/parse hindi tables/hindi5.jpg'
im = cv2.imread(file)
im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

ret, thresh_value = cv2.threshold(im_gray, 180, 255, cv2.THRESH_BINARY_INV)
kernel = np.ones((2, 5), np.uint8)
dilated_value = cv2.dilate(thresh_value, kernel, iterations=1)  # assumed step: the original comment was truncated here

contours, hierarchy = cv2.findContours(dilated_value, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)  # assumed loop body: draw a box around each detected contour
    cv2.rectangle(im, (x, y), (x + w, y + h), (0, 0, 255), 1)

cv2.imshow('detecttable78', im)
```
@35C4n0r @naina35 sorry for the delay. Here is my effort at parsing the tables. It uses the same classic edge detection techniques you highlighted. For tables with borderlines, it is able to detect the rows and columns with 100% accuracy in our test cases. It fails if the lines are blurry (we have one case like that). It also fails gracefully, i.e. when there are no column/row lines or they are blurry, it doesn't detect either rows or columns, and we can use alternative approaches in those cases. @35C4n0r It would be very useful if you could continue with the logic you had written, in line with the comment I detailed above, to extend the capability to tackle any table. Also, clean up the code to get the table into a dataframe once we have detected the rows and columns accurately. I have also kept all the images of the tables here for easy use and testing (after table detection). Cleaned up here.
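For context, a minimal sketch of the kind of classic line detection described in that comment; this is not the linked notebook's code, and the kernel fractions and the 0.5 coverage threshold below are illustrative assumptions:

```python
import cv2
import numpy as np

def detect_table_lines(image_path):
    """Find y positions of row lines and x positions of column lines in a bordered table image."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Invert-threshold so lines and text become white on black.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Long, thin kernels keep only near-continuous horizontal / vertical strokes
    # (the table borders) and suppress the text.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (img.shape[1] // 20, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, img.shape[0] // 20))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel, iterations=2)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel, iterations=2)

    # Rows are y values where horizontal-line pixels cover most of the width; likewise columns.
    # Consecutive indices belong to the same (thick) line and should be grouped.
    row_ys = np.where(horizontal.sum(axis=1) > 0.5 * 255 * img.shape[1])[0]
    col_xs = np.where(vertical.sum(axis=0) > 0.5 * 255 * img.shape[0])[0]
    return row_ys, col_xs
```

If the returned arrays are empty (no long lines survive the morphological opening), the table has blurry or missing borders and the fallback heuristics discussed below apply, which matches the "fails gracefully" behaviour described above.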
Hello @Shruti3004, is this task active? I was looking to contribute to projects in the domain of AI and ML.
Description
We are currently struggling to parse Hindi tables so that we can replicate each table in a structured, useful form (like json/.md), and we need help figuring out the bounding boxes for each cell of a detected table.
Sample pdfs for testing:
Test case: merged_tables.pdf
Implementation Details
Detecting whether the page has a table and separating the table image from the rest of the page:
This is done using Table Transformers, which works pretty well on most tested use cases. We need to set the detection confidence pretty high (more than 99%), and it's able to detect whether the page has a table, and the correct boundary for it, with reasonable accuracy. Code to set this up is here.
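As a rough sketch of such a setup (this is not the linked code; the checkpoint name, file name, and the 0.99 threshold are assumptions based on the description above):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Assumed checkpoint; the linked setup code may use a different one.
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

page = Image.open("page.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only very confident detections (> 99%), as suggested above.
target_sizes = torch.tensor([page.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.99, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    x0, y0, x1, y1 = (int(v) for v in box.tolist())
    print(model.config.id2label[label.item()], round(score.item(), 3))
    table_image = page.crop((x0, y0, x1, y1))  # the table separated from the rest of the page
```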
Detection of the text in the table:
We have seen the best results using Pytesseract with the Hindi language pack.
In our limited testing, we have noticed that Pytesseract works best when we give it the page in its entirety rather than after detection and separation of rows/columns/cells. Something like this, which gives the words as well as their bounding boxes:
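A minimal sketch of that full-page OCR step, assuming the Tesseract Hindi traineddata ("hin") is installed; the file name is a placeholder:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# OCR the whole page at once rather than pre-cropped cells, as noted above.
page = Image.open("page.png")
data = pytesseract.image_to_data(page, lang="hin", output_type=Output.DICT)

words = []
for i, text in enumerate(data["text"]):
    if text.strip() and float(data["conf"][i]) > 0:  # skip empty / low-confidence entries
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        words.append({"text": text, "bbox": (x, y, x + w, y + h)})

print(words[:5])  # each entry: the word plus its bounding box on the page
```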
This is what we need to focus on.
We tried using Table Transformers for this, but it's giving poor results. Any OCR-heuristic logic (e.g. column boundaries can be identified by the ability to draw a straight vertical line without hitting a word) seems more plausible as a first solution here than fine-tuning image transformers. The ideal initial step would be a step in this direction.
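To make that heuristic concrete, here is a minimal sketch under stated assumptions: it takes word bounding boxes (e.g. from the OCR sketch above) and returns x positions where a straight vertical line crosses no word. The function name and the min_gap parameter are illustrative, not an existing API.

```python
def column_separators(word_bboxes, page_width, min_gap=15):
    """Return x positions where a straight vertical line hits no word.

    word_bboxes: (x0, y0, x1, y1) tuples, e.g. from the OCR sketch above.
    min_gap: minimum whitespace width in pixels to count as a column boundary; tune per scan.
    """
    covered = [False] * page_width
    for x0, _, x1, _ in word_bboxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] = True

    separators, gap_start = [], None
    for x, hit in enumerate(covered):
        if not hit and gap_start is None:
            gap_start = x                                    # a whitespace run begins
        elif hit and gap_start is not None:
            if x - gap_start >= min_gap:
                separators.append((gap_start + x) // 2)      # midpoint of the gap
            gap_start = None
    if gap_start is not None and page_width - gap_start >= min_gap:
        separators.append((gap_start + page_width) // 2)

    # Note: the left/right page margins also show up as gaps and can be dropped.
    return separators
```

With the `words` list from the OCR sketch above, this could be called as `column_separators([w["bbox"] for w in words], page_width=page.size[0])`; row boundaries can be found the same way along the y axis.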
The sample pdfs show different kinds of tables (some with columns marked by lines, others without lines demarcating the columns).
Guidelines for proposed solutions:
Product Name
AI Tools
Organization Name
SamagraX
Domain
NA
Tech Skills Needed
Pytorch/ Python, ML
Category
Feature
Mentor(s)
@GautamR-Samagra
Complexity
High