How to install ocrmypdf? #361

Closed
VooDisss opened this issue Nov 18, 2024 · 2 comments

Comments

VooDisss commented Nov 18, 2024

I'm running Windows 10. The current directions are not detailed enough. Please try installing ocrmypdf and using it with marker without getting errors. Where the hell do I find the root folder of marker if I installed it with the command below rather than through some kind of Visual Studio environment (#316)?

  1. Installed marker:
    pip install marker-pdf
    https://github.com/VikParuchuri/marker/blob/master/docs/install_ocrmypdf.md
  2. Installed ocrmypdf:
    winget install -e --id Python.Python.3.11
    winget install -e --id UB-Mannheim.TesseractOCR
    Installed Ghostscript
    python3 -m pip install ocrmypdf
  3. Set two variables so that ocrmypdf is used:
    set OCR_ALL_PAGES=true
    set OCR_ENGINE=ocrmypdf
  4. Tried to launch marker_single:
    marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
  5. Got errors.
    5.1 Tried to resolve them by introducing a new variable:
    set TESSDATA_PREFIX="C:\Program Files\Tesseract-OCR\tessdata"
  6. Errors again, frustration starts and hopefully ends here (with your help).
    Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
    Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
    Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
    Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
    Loaded texify model to cpu with torch.float32 dtype
    Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
    Detecting bboxes: 100%|██████████| 1/1 [00:18<00:00, 18.75s/it]
    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "C:\Users\wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
      File "C:\Users\wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
        full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
        pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
        new_pages = tesseract_recognition(doc, ocr_idxs, langs)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
        pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
        yield _result_or_cancel(fs.pop())
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
        return fut.result(timeout)
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
        return self.__get_result()
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
        raise self._exception
      File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
        new_doc = pdfium.PdfDocument(f.name)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
        self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
      File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
        raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
    pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
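
A quick preflight check can rule out two common setup problems before running marker_single: a tool missing from PATH, and a TESSDATA_PREFIX value that carries literal quotes. In cmd.exe, `set NAME="value"` stores the double quotes as part of the value (`set "NAME=value"` does not). The sketch below is only an illustration: the helper names `find_tools` and `tessdata_problems` are made up here, and `gswin64c` is assumed to be the Ghostscript executable name because that is what the 64-bit Windows installer normally provides.

```python
import os
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def tessdata_problems(value):
    """Return a list of human-readable problems with a TESSDATA_PREFIX value."""
    if not value:
        return ["TESSDATA_PREFIX is not set"]
    problems = []
    if value.startswith('"') or value.endswith('"'):
        # cmd.exe keeps the quotes when you write: set NAME="value"
        problems.append("value contains literal double quotes")
        value = value.strip('"')
    if not os.path.isdir(value):
        problems.append(f"not an existing directory: {value}")
    return problems


if __name__ == "__main__":
    # Executable names are assumptions; adjust them to your installation.
    for tool, path in find_tools(["tesseract", "gswin64c", "ocrmypdf"]).items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
    for problem in tessdata_problems(os.environ.get("TESSDATA_PREFIX")):
        print("TESSDATA_PREFIX:", problem)
```

Run it from the same cmd window in which the variables were set, since `set` only affects the current session.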
VooDisss commented Nov 18, 2024

In CMD I ran (found in #162):
pip show marker-pdf
Got this (the Location line is what I needed):

Name: marker-pdf
Version: 0.3.10
Summary: Convert PDF to markdown with high speed and accuracy.
Home-page: https://github.com/VikParuchuri/marker
Author: Vik Paruchuri
Author-email: [email protected]
License: GPL-3.0-or-later
Location: C:\Users\Wasup\miniconda3\Lib\site-packages
Requires: filetype, ftfy, pdftext, Pillow, pydantic, pydantic-settings, python-dotenv, rapidfuzz, regex, surya-ocr, tabled-pdf, tabulate, texify, torch, tqdm, transformers
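
The same location can also be found from Python itself, which avoids guessing which interpreter `pip` belongs to. This is a minimal sketch using only the standard library; `package_dir` is a hypothetical helper name, and `marker` is assumed to be the import name because the traceback paths show `site-packages\marker\...`.

```python
import importlib.util
from pathlib import Path


def package_dir(import_name):
    """Return the directory of an importable package, or None if it is not found."""
    spec = importlib.util.find_spec(import_name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent


if __name__ == "__main__":
    # Prints the folder that contains settings.py, if marker is installed
    # for the interpreter running this script.
    print(package_dir("marker"))
```

Run it with the same Python that provides `marker_single` (here the miniconda3 one), otherwise it may report that the package is not installed.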

Went to C:\Users\Wasup\miniconda3\Lib\site-packages and found an undocumented settings.py file (in the marker subfolder). Opened it with Notepad++, found the relevant lines, and filled in the TESSDATA_PREFIX value:

OCR_PARALLEL_WORKERS: int = 2 # How many CPU workers to use for OCR
TESSERACT_TIMEOUT: int = 20 # When to give up on OCR
TESSDATA_PREFIX: str = "C:\Program Files\Tesseract-OCR\tessdata"
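
A note on the path literal above: in a normal Python string a backslash starts an escape sequence. `\P` is not a valid escape, so Python keeps it and emits a warning, but `\t` is valid and becomes a TAB character, so the `\tessdata` part of the path is silently changed. The sketch below shows the effect and two ways to write the path safely; it says nothing about how marker itself reads the setting.

```python
# "\t" is a valid escape (TAB), so this fragment no longer contains "\tessdata".
broken = "Tesseract-OCR\tessdata"

# Raw string: backslashes are kept literally.
raw = r"C:\Program Files\Tesseract-OCR\tessdata"

# Forward slashes are usually accepted for Windows paths as well.
forward = "C:/Program Files/Tesseract-OCR/tessdata"

print("TAB inside broken:", "\t" in broken)
print("TAB inside raw:", "\t" in raw)
print(raw)
```

Using the raw-string or forward-slash form in settings.py avoids both the SyntaxWarning and the hidden TAB.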

Now at least marker recognizes that there is some kind of TESSDATA_PREFIX:

C:\Users\Wasup>marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
C:\Users\Wasup\miniconda3\Lib\site-packages\marker\settings.py:59: SyntaxWarning: invalid escape sequence '\P'
  TESSDATA_PREFIX: str = "C\Program Files\Tesseract-OCR\tessdata"
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
Detecting bboxes: 100%|██████████| 1/1 [00:16<00:00, 16.33s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
    new_pages = tesseract_recognition(doc, ocr_idxs, langs)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
    pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\Wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
    new_doc = pdfium.PdfDocument(f.name)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "C:\Users\Wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).

@VooDisss

Launched cmd as administrator: problem fixed (PdfiumError: Failed to load document (PDFium: File access error)).
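
The thread does not say why elevation helped. The failing call in the traceback is `pdfium.PdfDocument(f.name)`, which opens a file by name, so one thing worth checking in a non-elevated session is whether temporary files behave as expected. The sketch below is a diagnostic under that assumption only, with hypothetical helper names; it does not reproduce what marker does internally.

```python
import tempfile


def temp_dir_is_writable():
    """Return True if a file can be created in the default temp directory."""
    try:
        with tempfile.NamedTemporaryFile(dir=tempfile.gettempdir()) as f:
            f.write(b"%PDF-")
        return True
    except OSError:
        return False


def can_reopen_open_tempfile():
    """Return True if a still-open NamedTemporaryFile can be opened again by name.

    The Python docs note that this works on Unix but not on Windows
    (with the default settings), where the second open is refused.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf") as f:
        f.write(b"%PDF-")
        f.flush()
        try:
            with open(f.name, "rb") as again:
                again.read()
            return True
        except OSError:
            return False


if __name__ == "__main__":
    print("temp dir:", tempfile.gettempdir())
    print("temp dir writable:", temp_dir_is_writable())
    print("open temp file can be reopened by name:", can_reopen_open_tempfile())
```

If the temp directory is not writable without elevation, fixing its permissions (or pointing the TEMP variable at a writable folder) may be a safer fix than always running cmd as administrator.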
