Add timeout limit to document parsing job. #270
Comments
Having checked the attached PDF, the very long conversion time is no surprise. The document is fully scanned and has a lot of pages, which is very slow to process, at least on CPU. In general, there are multiple strategies to keep such samples from clogging a bulk conversion pipeline.
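One such strategy, sketched here under stated assumptions, is to run each conversion in its own child process and enforce a hard wall-clock limit from the outside, so a slow document is killed instead of blocking the batch. The helper name `run_with_hard_timeout` and the status strings are hypothetical; only Python's standard `subprocess` module is used, and the command is whatever the pipeline runs per document.

```python
import subprocess
import sys
from typing import List


def run_with_hard_timeout(cmd: List[str], timeout_s: float) -> str:
    """Run one conversion command; return 'ok', 'failed' or 'timeout'.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires, so a stuck document cannot hold up the rest of a batch.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,  # keep the child's logs out of the batch output
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; in a real batch the
    # command would be the converter CLI, e.g. ["docling", "<path-to-pdf>"].
    fast = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_hard_timeout(fast, timeout_s=30))
    print(run_with_hard_timeout(slow, timeout_s=1))
```

The trade-off of this approach is that each document pays the process start-up cost (including model loading, if any), so it suits pipelines where robustness matters more than per-document latency.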
I am interested in this issue. Can you please assign this to me? Thanks :)
Are you working on this, @nikos-livathinos?
@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review.
Great; thanks @nikos-livathinos. Let me get on this asap :)
Testing:

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 87584.07it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 29037.49it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 88197.98it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source

Arguments:
  * input_sources  source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]

Options:
  --from [docx|pptx|html|image|pdf|asciidoc|md]   Specify input formats to convert from. Defaults to all formats. [default: None]
  --to [md|json|text|doctags]                     Specify output formats. Defaults to Markdown. [default: None]
  --ocr / --no-ocr                                If enabled, the bitmap content will be processed using OCR. [default: ocr]
  --force-ocr / --no-force-ocr                    Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]
  --ocr-engine [easyocr|tesseract_cli|tesseract]  The OCR engine to use. [default: easyocr]
  --pdf-backend [pypdfium2|dlparse_v1|dlparse_v2] The PDF backend to use. [default: dlparse_v1]
  --table-mode [fast|accurate]                    The mode to use in the table structure model. [default: fast]
  --artifacts-path PATH                           If provided, the location of the model artifacts. [default: None]
  --abort-on-error / --no-abort-on-error          If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]
  --output PATH                                   Output directory where results are saved. [default: .]
  --version                                       Show version information.
  --document-timeout INTEGER                      The timeout for processing each document, in seconds. [default: None]
  --help                                          Show this message and exit.
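In the `--document-timeout=5` run above, the warning reports 6 s against a 5 s limit. That is what one would expect if the elapsed time is checked between units of work rather than by interrupting a unit in progress. The sketch below illustrates that kind of cooperative check; it is an assumption about the general technique, not a copy of the code in the PR, and `process_with_deadline`, `process_page` and `clock` are hypothetical names.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def process_with_deadline(
    pages: Iterable[P],
    process_page: Callable[[P], R],
    timeout_s: Optional[float],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[R], bool]:
    """Process pages in order and stop once the time budget is exceeded.

    Returns (results_so_far, completed). The check runs only after a page
    has finished, so the reported elapsed time can overshoot timeout_s.
    A timeout_s of None means no limit.
    """
    start = clock()
    results: List[R] = []
    for page in pages:
        results.append(process_page(page))
        elapsed = clock() - start
        if timeout_s is not None and elapsed > timeout_s:
            # Budget exceeded: give up on the remaining pages.
            return results, False
    return results, True


if __name__ == "__main__":
    def slow_page(n: int) -> int:
        time.sleep(0.2)  # stand-in for real per-page work
        return n * 10

    done, completed = process_with_deadline(range(10), slow_page, timeout_s=0.5)
    print(len(done), completed)
```

A check of this kind cannot stop a single unit of work that hangs; it only bounds how many further units are started once the limit has passed.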
Requested feature
We need a way to set a timeout parameter when processing a document. Currently, in very rare cases, certain documents take a very long time to convert. In a batch processing job, this can become problematic.
example use case:
temp.pdf