Add timeout limit to document parsing job. #270

Open
PeterStaar-IBM opened this issue Nov 7, 2024 · 5 comments · May be fixed by #320
Labels: enhancement (New feature or request), priority:high

Comments

@PeterStaar-IBM
Contributor

Requested feature

We need a way to set a timeout parameter when processing a document. Currently, in very rare cases, certain documents take a very long time to convert. In a batch processing job, this can become problematic.

example use case:

temp.pdf

PeterStaar-IBM added the enhancement and priority:high labels Nov 7, 2024
@cau-git
Contributor

cau-git commented Nov 7, 2024

Having checked the attached PDF, it is no surprise that we see a very long conversion time. It is fully scanned and has a lot of pages, which is very slow to process, at least on CPU.

Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline.

  1. Convert all docs with OCR off, then rerun with OCR only those docs whose conversion result is empty (i.e. those that may need OCR). This is already possible with the current version.
  2. We can extend docling to optionally stop converting a doc when a timeout is reached. The timeout can only be checked after each page batch (i.e. after every multiple of 4 pages with the defaults). A timed-out conversion would be reported with the status PARTIAL_SUCCESS. User code could then either export the partial result or drop the document.
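
The first strategy can be sketched roughly as follows. This is a minimal illustration, not docling code: `two_pass_convert`, the `Converter` signature and `fake_convert` are hypothetical stand-ins, and a real pipeline would call the actual converter with OCR switched off or on.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical converter signature: (path, use_ocr) -> extracted text.
Converter = Callable[[str, bool], str]


def two_pass_convert(paths: Iterable[str], convert: Converter) -> Dict[str, str]:
    """Convert everything without OCR first, then retry only empty results with OCR."""
    results: Dict[str, str] = {}
    needs_ocr: List[str] = []

    # Pass 1: fast path with OCR disabled.
    for path in paths:
        text = convert(path, False)
        if text.strip():
            results[path] = text
        else:
            # An empty result suggests a scanned document that may need OCR.
            needs_ocr.append(path)

    # Pass 2: slow path with OCR enabled, only for documents that came back empty.
    for path in needs_ocr:
        results[path] = convert(path, True)

    return results


def fake_convert(path: str, use_ocr: bool) -> str:
    """Stand-in converter: 'scanned' files yield text only when OCR is on."""
    if "scanned" in path and not use_ocr:
        return ""
    return f"text of {path}"
```

With this split, the expensive OCR pass only ever touches the documents that produced nothing in the cheap pass, so a few large scanned files no longer slow down the first sweep over the batch.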

@ab-shrek

I am interested in this issue. Can you please assign this to me? Thanks :)

nikos-livathinos self-assigned this Nov 11, 2024
@ab-shrek

Are you working on this, @nikos-livathinos?

@nikos-livathinos
Contributor

@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review.
Here are some hints:

  1. Introduce a new parameter (e.g. pdf_document_timeout) in PdfPipelineOptions (class PdfPipelineOptions(PipelineOptions):).
  2. Implement the timeout logic in PaginatedPipeline._build_document() (def _build_document(self, conv_res: ConversionResult) -> ConversionResult:).
    • The timeout should apply to the PDF pipeline and cover the time needed to convert the entire document.
    • We should check for a timeout after the conversion of each page chunk (the check measures the elapsed time for the whole document, not only for the current page chunk).
    • When a timeout happens, the loop exits and conv_res.status should be set to ConversionStatus.PARTIAL_SUCCESS.
  3. Extend the docling CLI (https://github.com/DS4SD/docling/blob/main/docling/cli/main.py) to expose a cmd argument (e.g. --document-timeout ) that sets the pdf_document_timeout inside the PdfPipelineOptions.
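
The timeout logic from hint 2 might look something like the sketch below. It is self-contained and does not use docling's classes: `Status` is a local stand-in for ConversionStatus, and `build_document`, `process_chunk`, `chunk_size`, `document_timeout` and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class Status(Enum):
    """Local stand-in for a conversion status enum (not docling's own class)."""
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


def build_document(
    pages: Sequence[int],
    process_chunk: Callable[[Sequence[int]], None],
    chunk_size: int = 4,
    document_timeout: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Status, List[int]]:
    """Process pages chunk by chunk, stopping once the document-level timeout is exceeded."""
    start = clock()  # one start time for the whole document, not per chunk
    done: List[int] = []

    for i in range(0, len(pages), chunk_size):
        chunk = pages[i : i + chunk_size]
        process_chunk(chunk)
        done.extend(chunk)

        # The check runs only after a chunk has finished, so the elapsed time
        # can overshoot the timeout by up to one chunk's processing time.
        elapsed = clock() - start
        if document_timeout is not None and elapsed > document_timeout:
            return Status.PARTIAL_SUCCESS, done

    return Status.SUCCESS, done
```

Because the clock is passed in, the behaviour can be exercised without real waiting, for example by letting a fake `process_chunk` advance a counter that the fake clock reads. Following the hint literally, this sketch reports PARTIAL_SUCCESS whenever the timeout is exceeded, even if that happens on the last chunk.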

@ab-shrek

Great; thanks @nikos-livathinos. Let me get on this asap :)

ab-shrek pushed a commit to ab-shrek/docling that referenced this issue Nov 12, 2024
Testing:
(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=100
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 87584.07it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 24.12 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 24.13 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062 --document-timeout=5
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 29037.49it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
WARNING:docling.pipeline.base_pipeline:Document processing time (6 s) exceeded the specified timeout of 5 s
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 10.82 sec.
WARNING:docling.cli.main:Document /var/folders/d7/dsfkllxs0xs8x2t4fcjknj4c0000gn/T/tmpzedg349h/2206.01062v1.pdf failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 10.82 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling https://arxiv.org/pdf/2206.01062
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 88197.98it/s]
INFO:docling.pipeline.base_pipeline:Processing document 2206.01062v1.pdf
INFO:docling.document_converter:Finished converting document 2206.01062v1.pdf in 22.59 sec.
INFO:docling.cli.main:writing Markdown output to 2206.01062v1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 22.60 seconds.

(.venv) mario@Abhisheks-MacBook-Air docling % docling

 Usage: docling [OPTIONS] source

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    input_sources      source  PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --from                                       [docx|pptx|html|image|pdf|asciidoc|md]  Specify input formats to convert from. Defaults to all formats. [default: None]                                     │
│ --to                                         [md|json|text|doctags]                  Specify output formats. Defaults to Markdown. [default: None]                                                       │
│ --ocr                 --no-ocr                                                       If enabled, the bitmap content will be processed using OCR. [default: ocr]                                          │
│ --force-ocr           --no-force-ocr                                                 Replace any existing text with OCR generated text over the full content. [default: no-force-ocr]                    │
│ --ocr-engine                                 [easyocr|tesseract_cli|tesseract]       The OCR engine to use. [default: easyocr]                                                                           │
│ --pdf-backend                                [pypdfium2|dlparse_v1|dlparse_v2]       The PDF backend to use. [default: dlparse_v1]                                                                       │
│ --table-mode                                 [fast|accurate]                         The mode to use in the table structure model. [default: fast]                                                       │
│ --artifacts-path                             PATH                                    If provided, the location of the model artifacts. [default: None]                                                   │
│ --abort-on-error      --no-abort-on-error                                            If enabled, the bitmap content will be processed using OCR. [default: no-abort-on-error]                            │
│ --output                                     PATH                                    Output directory where results are saved. [default: .]                                                              │
│ --version                                                                            Show version information.                                                                                           │
│ --document-timeout                           INTEGER                                 The timeout for processing each document, in seconds. [default: None]                                               │
│ --help                                                                               Show this message and exit.                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
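
For illustration only, an optional integer option like the `--document-timeout` entry in the help output above could be declared as in the following sketch. It uses the standard library's argparse as a stand-in rather than the code in docling/cli/main.py, and `build_parser`, `timeout_from_args` and the `docling-sketch` program name are hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with one positional source list and an optional timeout."""
    parser = argparse.ArgumentParser(prog="docling-sketch")
    parser.add_argument("source", nargs="+", help="Files, directories or URLs to convert.")
    parser.add_argument(
        "--document-timeout",
        type=int,
        default=None,  # None means no timeout is applied
        help="The timeout for processing each document, in seconds.",
    )
    return parser


def timeout_from_args(argv: List[str]) -> Optional[int]:
    """Parse argv and return the timeout in seconds, or None when the flag is absent."""
    args = build_parser().parse_args(argv)
    return args.document_timeout
```

Keeping the default at None lets the pipeline distinguish "no limit" from any numeric value, which matches the `[default: None]` shown in the help text.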
ab-shrek linked a pull request Nov 12, 2024 that will close this issue
4 tasks