Improvements to text extraction needed #186
I think this sounds really hard, and before we build something with pdfplumber, I think we should look into doing "structural PDF extraction," which I think has come a long way. Can you survey what kinds of tools are already out there and see if there are ones that would already work for us before we go down what I think is going to be a very scary road? I'm pretty terrified that the corner cases on this issue could bog you down for months and still not be nailed down.
I think maybe some inspiration can be taken from the following repo: https://github.com/VikParuchuri/surya. Vik has been doing some awesome work on OCR, document structure recognition, reading order, etc. His twitter:
@mlissner here are some of the improvements for you.
Those are some very nice improvements!
One more push coming momentarily with the last few changes.
@flooie, this is still open. Can you please figure out what's left to do here and make a comment or a new issue if anything remains (and close this otherwise)?
The needs-OCR function needs to be improved. Currently, we do the following to determine whether something that is OCR-eligible should be OCR'd.
The Situation
The content is generated from `pdftotext` using this code.
Later, downstream on CL, we take the content and ask: are we sure we didn't need to OCR this? We do so with a check that looks for any row that doesn't appear to be a Bates stamp. As long as we find any text, garbled or otherwise, we say we are good to go.
Unfortunately, this leads to some seriously garbled plain text in our RECAP archive, and potentially our opinion DB.
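The check described above boils down to something like the following sketch. This is a hypothetical re-creation, not the actual CL code: `BATES_RE` is a stand-in pattern and `page_needs_ocr` is a name invented here.

```python
import re

# Hypothetical stand-in for the Bates-stamp pattern the CL check uses;
# the real expression differs.
BATES_RE = re.compile(r"^\s*Case\b.*\bPage\s+\d+\s+of\s+\d+\s*$", re.IGNORECASE)


def page_needs_ocr(text: str) -> bool:
    """Return True only when no line survives the Bates-stamp filter.

    Any surviving text -- garbled or otherwise -- counts as "good to go",
    which is exactly the weakness described above.
    """
    for line in text.splitlines():
        if line.strip() and not BATES_RE.match(line):
            return False  # found some text, so we assume no OCR is needed
    return True
```

The weakness is visible in the third case below: gibberish passes the check just as easily as real text does.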
Examples
I don't want to rag on `pdftotext`; it has done an admirable job for the most part, but I do not think it is the best way to approach what we are dealing with now. For one, we are attempting to extract content and place it into a plain-text DB field. This is challenging because a good number of documents contain PDF objects, such as `/widgets`, `/annotations`, `/freetext`, `/Stamp`, and `/Popup`. This is not an exhaustive list; we also see links and signatures, and I'm sure more types.
In addition to the complexity of handling documents that contain PDF stream objects, we also have to deal with images inserted into PDFs, or even worse, the first or maybe just the last page being a rasterized PDF page while the middle 30-odd pages are vector PDFs.
In this case, our checks fail and have no way to catch the problem, because after we iterate past the Bates stamp on page 2 we get good text. See:
gov.uscourts.nysd.411264.100.0.pdf
This also fails when, for example, a free-text widget that crosses out content or adds content to the page is added onto a PDF page that is an image.
Here is an example of a non-image PDF page containing a Free Text widget (a widget, I think; it could be something different) meant to cross out the PROPOSED part.
This is not the perfect example, because the underlying content appears to contain text, but it is corrupted and looks like this
In fact, williams-v-t-mobile
Side-by-side comparison of Williams v. T-Mobile
Note that the PROPOSED text is incorrectly added here, frustrating the adjustment made by the court, which is noted in the document itself.
Angled, Circular, and Sideways Text
Not to be outdone, many judges (👋 CAND) like to use stamps with circular text. These stamps are often at the end of the document, but not exclusively. In doing so, the courts introduce gibberish into our documents when we extract the text or OCR them.
For example, gov.uscourts.cand.16711.1203.0.pdf and another file have them adjacent to the text. One of these is stamped into an image PDF and the other is in a regular PDF, which it garbles. In both cases, the content that is generated makes the needs-OCR test fail to identify a needed OCR.
Sideways Text
We also run into a problem where `pdftotext` does an amazing job of figuring out the text on the side and writing it into the output; this is just a fancy thing some courts and some firms like to do. But look at the result: it unnaturally expands the plain text and certainly frustrates plain-text searches. This happens in this case and in others; see below.
Margin Text
Occasionally, the use of margin text in a small font causes some weird creations in the extracted text, which again produce extra-wide text that is hard to view and display, and which I think makes it hard to query or search for the content you may be looking for.
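One possible mitigation for both the rotated stamps and the sideways margin text: pdfplumber's character objects carry an `upright` flag, and `Page.filter()` returns a derived page containing only the objects a test function accepts. A sketch (the function name is invented here; this is not the project's code):

```python
def extract_upright_text(page) -> str:
    """Extract text from a pdfplumber Page, dropping rotated characters
    (circular stamps, sideways margin text).

    Page.filter() yields a derived page holding only the objects the
    lambda accepts; non-char objects (lines, rects) pass through untouched.
    """
    upright_page = page.filter(
        lambda obj: obj["object_type"] != "char" or obj["upright"]
    )
    return upright_page.extract_text() or ""
```

Whether to drop that text entirely or move it to a footnote-style appendix is a separate policy question, but filtering at the character level keeps it out of the main text flow.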
Final complaint (Bates Stamps)
Bates stamps on every page are ingested into the content and don't reflect the document that was originally generated. I would not expect to see Bates stamps or sidebar content in a published book, so I don't think we should display them in the plain text.
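Stripping the per-page stamp lines from extracted text could look something like this. The pattern below is a guess at the common CM/ECF header shape ("Case ... Document ... Filed ... Page N of M") and would need tuning per court; `strip_bates_lines` is a name invented here.

```python
import re

# Hypothetical CM/ECF header/Bates-line pattern; real filings vary by court.
BATES_LINE = re.compile(
    r"^\s*Case\s+\S+\s+Document\s+\d+(-\d+)?\s+Filed\s+\S+\s+"
    r"Page\s+\d+\s+of\s+\d+",
    re.IGNORECASE,
)


def strip_bates_lines(text: str) -> str:
    """Drop lines that look like CM/ECF page stamps from extracted text."""
    return "\n".join(
        line for line in text.splitlines() if not BATES_LINE.match(line)
    )
```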
What should we do
If you've read this far, @mlissner, I know you must be dying to hear what I think the solution is.
We should (I think) drop `pdftotext` for, you guessed it, `pdfplumber`. pdfplumber can better sample PDFs to determine whether an entire page is likely an image, while correctly recognizing that lines or signatures in the document should be left alone. Additionally, we can easily extract the pure text of the document while avoiding the pitfalls described above.
We should also drop the check in CL and make all of these assessments here in doctor instead.
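As a sketch of what that pdfplumber-based sampling might look like: pdfplumber pages expose `.width`, `.height`, `.images`, and `.chars`, so a page can be flagged as rasterized when images cover most of its area and no text layer exists. The function name and 0.9 threshold are my own assumptions, not anything from the project.

```python
def page_is_mostly_image(page, threshold: float = 0.9) -> bool:
    """Heuristic: treat a page as rasterized when embedded images
    cover at least `threshold` of its area."""
    page_area = page.width * page.height
    image_area = sum(
        (img["x1"] - img["x0"]) * (img["bottom"] - img["top"])
        for img in page.images
    )
    return page_area > 0 and image_area / page_area >= threshold


# Typical use (requires `pip install pdfplumber`):
#
#   import pdfplumber
#   with pdfplumber.open("gov.uscourts.nysd.411264.100.0.pdf") as pdf:
#       pages_needing_ocr = [
#           i for i, page in enumerate(pdf.pages, start=1)
#           if page_is_mostly_image(page) and not page.chars
#       ]
```

Because this runs per page, it would catch the mixed case above, where only the first or last page is rasterized, something the all-or-nothing downstream check cannot do.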
Solutions coming in the next post.