Parsing Hindi tables better #284
Comments
Hi! Please update the ticket.
@GautamR-Samagra I would like to work on this issue, kindly assign it to me.
@35C4n0r Thanks for showing interest :)
Is this the requirement? https://colab.research.google.com/drive/1pCYJ4H9vwV3GoriQ6UNoIj8N42_i3tMv?usp=sharing After: (a slightly rough marking of the start (top) and end (bottom) of each line.)
@Amit0617 this looks great. We also need to be able to detect the columns.
Hi, I'm planning to have a meeting today (22-Jan-2024) at 5:30 pm to discuss progress on this and clarify any doubts. @Amit0617 @35C4n0r @SarveshAtawane let me know if you can join on this link.
Oops! I missed it.
@GautamR-Samagra Here are the results and my Colab notebook. I've also added the logic for converting it to a Python dictionary.
merged_tables.pdf
@GautamR-Samagra What should be my next step? Have you checked the notebook?
@35C4n0r can you manually estimate for how many of the 26 tables it's working? Can you verify whether it's working for all the tables that don't have row/column line boundaries?
@35C4n0r You provide the pdf and image as input along with the DETR model and the Pytesseract language. For example, here you get as output: Can you create another ticket on ai-tools with these details and assign it to yourself? I'll add to the details and decide the community points etc. Once the performance of Table Transformers or any other object detection model improves, we can use this in conjunction with that.
Since this is not being done with the above approaches, let's try to solve this from the ground up heuristically, just to pass the test cases on the pdf. Then we'll abstract the ideas to make them generic and useful. I have a bunch of ideas here. We can keep adding ideas around models for row/column detection and figure out how to ensemble them all.
@GautamR-Samagra, sorry for the huge delay. I got some time today and made a little bit of progress; I've used the classic edge-detection approach. What I would like to have is something like @naina35's solution. If possible, kindly share your approach with us @naina35. Once I have the bboxes for the text objects, I'll be able to clean up the edge detection.
Hey @35C4n0r, can you pause work on this for now? I have made some progress myself and will clean and update that here in a couple of hours.
Sure @GautamR-Samagra.
Hey @35C4n0r, this was my first time trying my hand at Python, so I couldn't do much. This is what I managed to do:

```python
import cv2
import numpy as np

file = 'C:/Users/hp5cd/Desktop/parse hindi tables/hindi5.jpg'
im = cv2.imread(file)
im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

ret, thresh_value = cv2.threshold(im_gray, 180, 255, cv2.THRESH_BINARY_INV)
kernel = np.ones((2, 5), np.uint8)
dilated_value = cv2.dilate(thresh_value, kernel, iterations=1)  # assumed step: the original comment was truncated here

contours, hierarchy = cv2.findContours(dilated_value, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)  # assumed loop body: draw a box around each detected contour
    cv2.rectangle(im, (x, y), (x + w, y + h), (0, 0, 255), 1)

cv2.imshow('detecttable78', im)
```
@35C4n0r @naina35 sorry for the delay. Here is my effort at parsing the tables. It uses the same classic edge detection techniques you highlighted. For tables with borderlines, it is able to detect the rows and columns with 100% accuracy in our test cases. It fails if the lines are blurry (we have one case like that). It also fails gracefully, i.e. when there are no column/row lines or they are blurry, it doesn't detect either rows or columns, and we can use alternative approaches in those cases. @35C4n0r It would be very useful if you could continue with the logic you had written, in line with the comment I detailed above, to extend the capability to tackle any table. Also, clean up the code to get the table into a dataframe once we have detected the rows and columns accurately. I have also kept all the images of the tables here for easy use and testing (after table detection). Cleaned up here.
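For context, a minimal sketch of the kind of classic line detection described in that comment; this is not the linked notebook's code, and the kernel fractions and the 0.5 coverage threshold below are illustrative assumptions:

```python
import cv2
import numpy as np

def detect_table_lines(image_path):
    """Find y positions of row lines and x positions of column lines in a bordered table image."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Invert-threshold so lines and text become white on black.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Long, thin kernels keep only near-continuous horizontal / vertical strokes
    # (the table borders) and suppress the text.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (img.shape[1] // 20, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, img.shape[0] // 20))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel, iterations=2)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel, iterations=2)

    # Rows are y values where horizontal-line pixels cover most of the width; likewise columns.
    # Consecutive indices belong to the same (thick) line and should be grouped.
    row_ys = np.where(horizontal.sum(axis=1) > 0.5 * 255 * img.shape[1])[0]
    col_xs = np.where(vertical.sum(axis=0) > 0.5 * 255 * img.shape[0])[0]
    return row_ys, col_xs
```

If the returned arrays are empty (no long lines survive the morphological opening), the table has blurry or missing borders and the fallback heuristics discussed below apply, which matches the "fails gracefully" behaviour described above.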
Hello @Shruti3004, is this task active? I was looking to contribute to projects in the domain of AI and ML.
Description
We are currently struggling to parse Hindi tables so that we can replicate each table in a structured, useful form (like json/.md), and we need help figuring out the bounding boxes for each cell of a detected table.
Sample pdfs for testing:
Test case: merged_tables.pdf
Implementation Details
Detecting whether the page has a table and separating the table image from the rest of the page:
This is done using Table Transformers, which works pretty well on most tested use cases. We need to set the detection confidence pretty high (more than 99%), and it's able to detect whether the page has a table, and the correct boundary for it, with reasonable accuracy. Code to set this up is here.
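As a rough sketch of such a setup (this is not the linked code; the checkpoint name, file name, and the 0.99 threshold are assumptions based on the description above):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

# Assumed checkpoint; the linked setup code may use a different one.
processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

page = Image.open("page.png").convert("RGB")
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only very confident detections (> 99%), as suggested above.
target_sizes = torch.tensor([page.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.99, target_sizes=target_sizes
)[0]

for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    x0, y0, x1, y1 = (int(v) for v in box.tolist())
    print(model.config.id2label[label.item()], round(score.item(), 3))
    table_image = page.crop((x0, y0, x1, y1))  # the table separated from the rest of the page
```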
Detection of the text in the table:
We have seen the best results using Pytesseract with the Hindi language pack.
In our limited testing, we have noticed that Pytesseract works best when we give it the page in its entirety rather than after detection and separation of rows/columns/cells. Something like this, which gives the words as well as their bounding boxes:
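A minimal sketch of that full-page OCR step, assuming the Tesseract Hindi traineddata ("hin") is installed; the file name is a placeholder:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

# OCR the whole page at once rather than pre-cropped cells, as noted above.
page = Image.open("page.png")
data = pytesseract.image_to_data(page, lang="hin", output_type=Output.DICT)

words = []
for i, text in enumerate(data["text"]):
    if text.strip() and float(data["conf"][i]) > 0:  # skip empty / low-confidence entries
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        words.append({"text": text, "bbox": (x, y, x + w, y + h)})

print(words[:5])  # each entry: the word plus its bounding box on the page
```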
This is what we need to focus on.
We tried using Table Transformers for this, but it's giving poor results. Any OCR-heuristic logic (e.g. column boundaries can be identified by the ability to draw a straight vertical line without hitting a word) seems more plausible as a first solution here than fine-tuning image transformers. The ideal initial step would be a step in this direction.
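To make that heuristic concrete, here is a minimal sketch under stated assumptions: it takes word bounding boxes (e.g. from the OCR sketch above) and returns x positions where a straight vertical line crosses no word. The function name and the min_gap parameter are illustrative, not an existing API.

```python
def column_separators(word_bboxes, page_width, min_gap=15):
    """Return x positions where a straight vertical line hits no word.

    word_bboxes: (x0, y0, x1, y1) tuples, e.g. from the OCR sketch above.
    min_gap: minimum whitespace width in pixels to count as a column boundary; tune per scan.
    """
    covered = [False] * page_width
    for x0, _, x1, _ in word_bboxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] = True

    separators, gap_start = [], None
    for x, hit in enumerate(covered):
        if not hit and gap_start is None:
            gap_start = x                                    # a whitespace run begins
        elif hit and gap_start is not None:
            if x - gap_start >= min_gap:
                separators.append((gap_start + x) // 2)      # midpoint of the gap
            gap_start = None
    if gap_start is not None and page_width - gap_start >= min_gap:
        separators.append((gap_start + page_width) // 2)

    # Note: the left/right page margins also show up as gaps and can be dropped.
    return separators
```

With the `words` list from the OCR sketch above, this could be called as `column_separators([w["bbox"] for w in words], page_width=page.size[0])`; row boundaries can be found the same way along the y axis.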
The sample pdfs show different kinds of tables (some with columns marked by lines, others without lines demarcating the columns).
Guidelines for proposed solutions:
Product Name
AI Tools
Organization Name
SamagraX
Domain
NA
Tech Skills Needed
Pytorch/ Python, ML
Category
Feature
Mentor(s)
@GautamR-Samagra
Complexity
High