Commit
enhancement: Add timeout limit to document parsing job. DS4SD#270
Testing:

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 87584.07it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 29037.49it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 88197.98it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling
Usage: docling [OPTIONS] source

Arguments:
  * input_sources  source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]

Options:
  --from              [docx|pptx|html|image|pdf|asciidoc|md]  Specify input formats to convert from. Defaults to all formats. [default: None]
  --to                [md|json|text|doctags]                  Specify output formats. Defaults to Markdown. [default: None]
  --ocr / --no-ocr                                            If enabled, the bitmap content will be processed using OCR. [default: ocr]
  --force-ocr / --no-force-ocr                                Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]
  --ocr-engine        [easyocr|tesseract_cli|tesseract]       The OCR engine to use. [default: easyocr]
  --pdf-backend       [pypdfium2|dlparse_v1|dlparse_v2]       The PDF backend to use. [default: dlparse_v1]
  --table-mode        [fast|accurate]                         The mode to use in the table structure model. [default: fast]
  --artifacts-path    PATH                                    If provided, the location of the model artifacts. [default: None]
  --abort-on-error / --no-abort-on-error                      If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]
  --output            PATH                                    Output directory where results are saved. [default: .]
  --version                                                   Show version information.
  --document-timeout  INTEGER                                 The timeout for processing each document, in seconds. [default: None]
  --help                                                      Show this message and exit.
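
In the --document-timeout=5 run, the warning reports 6 s of processing while the conversion only finishes at 10.82 s. That is consistent with a cooperative check made between units of work rather than a hard interruption, though the log alone does not confirm how docling implements it. The following is a minimal sketch of such a check under that assumption. The function `process_with_timeout`, its parameters, and the notion of a "batch" are hypothetical names chosen for illustration and are not docling's API.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_with_timeout(
    batches: Iterable[T],
    process_batch: Callable[[T], R],
    timeout_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[R], bool]:
    """Hypothetical helper: process batches in order and stop once the
    elapsed time exceeds timeout_s.

    The check runs only after a batch completes, so a running batch is
    never interrupted and the total time can overshoot the limit.
    Returns (results_so_far, timed_out).
    """
    start = clock()
    results: List[R] = []
    for batch in batches:
        results.append(process_batch(batch))
        elapsed = clock() - start
        # timeout_s=None mirrors the CLI default: no limit is enforced.
        if timeout_s is not None and elapsed > timeout_s:
            return results, True
    return results, False


if __name__ == "__main__":
    # Each fake batch sleeps briefly to stand in for real page processing.
    def slow_batch(n: int) -> int:
        time.sleep(0.05)
        return n * 2

    done, timed_out = process_with_timeout(range(10), slow_batch, timeout_s=0.12)
    print(f"processed {len(done)} batches, timed_out={timed_out}")
```

A caller would treat `timed_out=True` as a failed conversion, which matches the "failed to convert" line in the second run. Passing `clock` as a parameter is only there to make the behavior easy to test with a fake clock.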