-
Do these invoices have a consistent format? This seems like a case where
you'd be better served by regular expressions. There is the
tokensregex package if you want to give that a try.
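To make the regular-expression suggestion concrete, here is a minimal sketch in plain Python `re` (not TokensRegex itself). It assumes the invoices really do share the layout quoted in the question: a label row followed by a value row. The pattern and the `extract_header_fields` helper are illustrative names, not part of CoreNLP, and the character classes are guesses based on the single sample.

```python
import re

# Two lines copied from the Tika output quoted in the question.
INVOICE_TEXT = """Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
"""

# Assumption: the label row is always followed directly by the value row,
# with the values in the same order as the labels.
HEADER_PATTERN = re.compile(
    r"Client no:\s+Invoice no:\s+Invoice date:\s+Due date:\s*\n"
    r"(?P<client_no>\d+)\s+"
    r"(?P<invoice_no>[A-Z0-9]+)\s+"
    r"(?P<invoice_date>\d{2}-[A-Za-z]{3}-\d{4})\s+"
    r"(?P<due_date>\d{2}-[A-Za-z]{3}-\d{4})"
)


def extract_header_fields(text):
    """Return the four header fields as a dict, or None if the layout differs."""
    match = HEADER_PATTERN.search(text)
    return match.groupdict() if match else None


fields = extract_header_fields(INVOICE_TEXT)
print(fields)
```

If the layout varies between suppliers, one pattern per supplier (or per label) is probably easier to maintain than a single large expression.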
On Wed, Jul 27, 2022 at 12:02 AM Ralph Soika wrote:
I have a question about how best to use CoreNLP for Named Entity
Recognition on documents that lack natural-language text blocks.
My goal is to extract named entities from an invoice document. As far as I
understand, an invoice does not provide the unstructured plain text that
NLP usually expects.
A typical example of invoice text analyzed with NLP/ML, which I have
often found on the Internet, looks like this:
“Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we
shipped on 15th August to London from the Make Believe Town depot. INV2345
is for the balance.. Customer contact (Sigourney) says they will pay this
on the usual credit terms (30 days).”
I understand that NLP loves this kind of text. But this example is
not typical for an invoice document. Text extracted from an invoice PDF
(using Apache Tika) usually looks more like this:
Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
Invoice to: Booking Reference
LOGISTCS GMBH Client Reference :
DEMOSTRASSE 2-6 Comments:
28195 BREMEN
Germany
Vessel : Voy : Place of Receipt : POL: B/LNo:
XXX JUBILEE NUBBBW SAV33NAH, GA ME000243
ETA: Final Destination : POD:
15-Jul-2022 ANTWERP, BELGIUM
Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00
Is NLP in general the wrong approach for recognizing metadata
in an invoice document?
Is there a way to set up a pipeline in CoreNLP that is better suited to
the structured text of an invoice document?
Thanks for any tips
===
Ralph
-
In terms of the CRF, currently no. You could theoretically add it as a
feature to the model. This doesn't really seem like a situation where you
want to use machine learning techniques to find the last row of a table,
though.
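For what it's worth, a rule-based version of "find the last row of the table" can be quite short. The sketch below assumes the charge table starts at the line beginning with `Charge ` (as in the Tika output quoted earlier in the thread) and runs until a blank line or the end of the text. `last_table_row` and `parse_decimal` are hypothetical helpers, and reading the final token as the VAT amount is an assumption based on the column header.

```python
from decimal import Decimal

# Table lines copied from the Tika output quoted in the question.
CHARGE_TABLE = """Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00
"""


def last_table_row(text, header_prefix="Charge "):
    """Return the last row after the header line, or None if no table is found."""
    lines = [line.strip() for line in text.splitlines()]
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(header_prefix)),
        None,
    )
    if start is None:
        return None
    rows = []
    for line in lines[start + 1:]:
        if not line:  # assumption: a blank line ends the table
            break
        rows.append(line)
    return rows[-1] if rows else None


def parse_decimal(token):
    """Convert a comma-decimal amount such as '30,00' to a Decimal."""
    return Decimal(token.replace(",", "."))


row = last_table_row(CHARGE_TABLE)
# Assumption: the final whitespace-separated token is the VAT column.
vat = parse_decimal(row.split()[-1])
print(row)
print(vat)
```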