Spacy hanging for badly formatted texts #13600
morbidCode
asked this question in
Help: Coding & Implementations
Hello all!
I am using the latest version of spaCy, installed from pip, with the "en_core_web_sm" model. I am chunking lots of documents using sentence detection (the "senter" component). The documents have no structure and no specific formatting, and a few of them are extremely badly formatted.
Most of the time my code works well. However, in rare cases spaCy encounters a badly formatted document and hangs indefinitely. For example:
The problem is that the sentence chunking happens inside a loop, so if one document hangs, the rest are never processed.
Is there a way for spaCy to throw an error when it encounters a text that it can't handle, so that I can skip it gracefully and move on to the next document? Or what would be the ideal approach?
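One possible workaround, as a rough sketch and not a confirmed fix: run each document in a short-lived child process and kill that process if it exceeds a time limit. The names run_with_timeout, _worker and chunk_documents and the 30-second limit are made up for illustration. The sketch assumes a platform where the "fork" start method is available (Linux or macOS, not Windows), and it only uses the standard multiprocessing module plus the usual spacy.load / doc.sents calls.

```python
import multiprocessing as mp


def _worker(func, arg, conn):
    """Run func(arg) in the child and send back ("ok", result) or ("error", message)."""
    try:
        conn.send(("ok", func(arg)))
    except Exception as exc:  # report ordinary failures instead of dying silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, arg, timeout_s):
    """Call func(arg) in a forked child process; give up after timeout_s seconds.

    Returns (status, payload) where status is "ok", "error" or "timeout".
    Assumption: the "fork" start method exists on this platform.
    """
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(func, arg, child_conn))
    proc.start()
    child_conn.close()  # the parent only reads

    status, payload = "timeout", None
    try:
        if parent_conn.poll(timeout_s):
            status, payload = parent_conn.recv()
    except EOFError:
        status, payload = "error", "worker exited without sending a result"

    if status == "timeout":
        proc.terminate()  # the child is presumed hung, so stop it
    proc.join()
    parent_conn.close()
    return status, payload


def chunk_documents(texts, timeout_s=30.0):
    """Split each text into sentences, recording None for documents that fail or hang."""
    import spacy  # imported here so the helper above has no spaCy dependency

    nlp = spacy.load("en_core_web_sm")

    def split_sentences(text):
        return [sent.text for sent in nlp(text).sents]

    results = []
    for i, text in enumerate(texts):
        status, payload = run_with_timeout(split_sentences, text, timeout_s)
        if status == "ok":
            results.append(payload)
        else:
            print(f"skipping document {i}: {status}")
            results.append(None)
    return results
```

Forking one process per document adds overhead, so this trades speed for the guarantee that a single hanging document cannot block the loop. A signal-based timeout inside the same process would be cheaper, but it may never fire if the hang is inside compiled code.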
Thanks!