How to use Named Entity Recognition in structure text like invoices? #1287

rsoika · 2022-07-27T07:02:17Z


I have a question about how best to use CoreNLP for Named Entity Recognition on documents that lack natural-language text blocks.

My goal is to extract named entities from an invoice document. As far as I understand, an invoice does not provide the unstructured plain text which is usually expected from NLP.

A typical example of invoice text analyzed with NLP/ML, which I have often found on the Internet, looks like this:

“Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).”

I understand that NLP loves this kind of text. But this example is not typical of an invoice document. Text extracted from an invoice PDF (using Apache Tika) usually looks more like this:

Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
Invoice to: Booking Reference
LOGISTCS GMBH Client Reference :
DEMOSTRASSE 2-6 Comments:
28195 BREMEN
Germany
Vessel : Voy : Place of Receipt : POL: B/LNo:
XXX JUBILEE NUBBBW SAV33NAH, GA ME000243
ETA: Final Destination : POD:
15-Jul-2022 ANTWERP, BELGIUM
Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00

Is NLP in general the wrong approach for training the recognition of metadata from an invoice document?
Is there a way to set up a pipeline in CoreNLP that is more useful for the structured text in an invoice document?

Thanks for any tips

===
Ralph

AngledLuffa · 2022-07-27T17:39:28Z

Maintainer

Do these invoices have a consistent format? This seems like a case where you'd be better served by regular expressions. There is the TokensRegex package if you want to give that a try.
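
For illustration, here is a minimal sketch of the regular-expression route for one fixed layout. It uses Python's standard `re` module rather than TokensRegex, and the pattern and the names `HEADER_RE` and `extract_header_fields` are assumptions made up for this example. It only handles the label line and value line shown in the question.

```python
import re

# Two lines copied from the Tika output quoted in the question.
SAMPLE = (
    "Client no: Invoice no: Invoice date: Due date:\n"
    "1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022\n"
)

# Hypothetical pattern for this one layout: the label line, then a line
# holding the four values in the same order.
HEADER_RE = re.compile(
    r"Client no:\s*Invoice no:\s*Invoice date:\s*Due date:[ \t]*\n"
    r"(?P<client_no>\d+)\s+"
    r"(?P<invoice_no>\S+)\s+"
    r"(?P<invoice_date>\d{2}-[A-Za-z]{3}-\d{4})\s+"
    r"(?P<due_date>\d{2}-[A-Za-z]{3}-\d{4})"
)


def extract_header_fields(document: str) -> dict:
    """Return the four header fields, or an empty dict if the layout differs."""
    match = HEADER_RE.search(document)
    return match.groupdict() if match else {}


fields = extract_header_fields(SAMPLE)
print(fields)
```

A pattern like this is tied to one supplier's layout, so it only pays off when the same format recurs.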



rsoika Jul 27, 2022
Author

Hi @AngledLuffa, no, the invoices do not have a consistent format. It is more an endless flow of incoming invoices from different suppliers. Of course, some things are typically identical across all invoices, like the IBAN, BIC/SWIFT, or the VAT-ID.
So using regular expressions probably makes sense for such entities. And some suppliers send invoices repeatedly, so there are somewhat recurring formats.
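
As a rough sketch of what format-independent patterns for such entities could look like, the Python example below matches a compact IBAN (no spaces), a BIC that follows a `BIC` or `SWIFT` label, and a German VAT-ID (`DE` plus nine digits). The sample text, the patterns, and the function names are assumptions for illustration. Other countries' VAT-IDs and IBANs printed in groups of four would need more patterns. The IBAN candidates are additionally checked with the standard mod-97 test.

```python
import re

# Made-up sample text; the values are illustrative only.
SAMPLE = (
    "VAT-ID: DE123456789\n"
    "IBAN: DE89370400440532013000\n"
    "BIC/SWIFT: COBADEFFXXX\n"
)

# Compact IBAN: country code, two check digits, 11 to 30 alphanumerics.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
# BIC anchored on its label, because a bare 8-letter pattern would also
# match ordinary upper-case words such as company names.
BIC_RE = re.compile(
    r"\b(?:BIC|SWIFT)\b[^A-Z0-9\n]*([A-Z]{6}[A-Z0-9]{2}(?:[A-Z0-9]{3})?)\b"
)
# German VAT-ID only (assumption: other countries are out of scope here).
VAT_DE_RE = re.compile(r"\bDE\d{9}\b")


def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map A-Z to 10-35."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_bank_entities(document: str) -> dict:
    return {
        "IBAN": [m for m in IBAN_RE.findall(document) if iban_checksum_ok(m)],
        "BIC": BIC_RE.findall(document),
        "VAT_ID": VAT_DE_RE.findall(document),
    }


entities = find_bank_entities(SAMPLE)
print(entities)
```

The checksum step helps filter out long alphanumeric strings that merely look like an IBAN.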

What I do not understand about CoreNLP is that a training model seems to provide data only line by line. But in an invoice, the total amount, for example, is usually somewhere in the last lines of the document. Is CoreNLP able to take into account where in a document the entity is located?

What I mean is: an invoice can contain a lot of separate amounts (many, many numbers). Is CoreNLP able to learn that the total tends to be at the end of an invoice document, or is CoreNLP unable to see a difference:

Part   Quantity  Rate  Currency  Total
X1            5  1,50  EUR        7,50
X2           10  1,00  EUR       10,00
Total                  EUR       17,50

AngledLuffa · 2022-07-27T20:03:04Z

Maintainer

In terms of the CRF, currently no. You could theoretically add the position in the document as a feature to the model. This doesn't really seem like a situation where you want to use machine learning techniques to find the last row of a table, though.
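
To illustrate the non-ML alternative, here is a minimal Python sketch that picks the last row starting with `Total` from the example table in the question. The regular expression, the assumed number format (comma as decimal separator, optional `.` as thousands separator), and the name `find_total` are assumptions for this example, not part of CoreNLP.

```python
import re
from decimal import Decimal

# The example table from the question.
TABLE = (
    "Part Quantity  Rate Currency Total\n"
    "X1   5             1,50     EUR   7,50\n"
    "X2  10             1,00     EUR   10,00\n"
    "Total                       EUR   17,50\n"
)

# A row that starts with "Total" and ends with a currency code and an amount.
TOTAL_ROW_RE = re.compile(
    r"^[ \t]*Total\b[^\n]*?([A-Z]{3})[ \t]+(\d+(?:\.\d{3})*,\d{2})[ \t]*$",
    re.MULTILINE,
)


def find_total(document: str):
    """Return (currency, amount) from the last 'Total' row, or None if absent."""
    matches = TOTAL_ROW_RE.findall(document)
    if not matches:
        return None
    currency, amount = matches[-1]  # the last match is closest to the end
    # Convert "1.234,50" style numbers to a Decimal.
    return currency, Decimal(amount.replace(".", "").replace(",", "."))


total = find_total(TABLE)
print(total)
```

Taking the last match encodes the "total is near the end" rule directly, without a model having to learn it.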
