How to install ocrmypdf? #361
In CMD I wrote (found it in #162):
Went to
Now at least marker recognizes that there is some kind of TESSDATA_PREFIX:
Launched cmd as administrator - problem
I'm running Windows 10. The directions currently given are not detailed enough. Please try to install ocrmypdf and use it with marker without getting errors. Where the hell do I find the root folder of marker if I installed it using the command given below, and not through some kind of Visual Studio environment (#316)?
pip install marker-pdf
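On the question of where the marker root folder ends up after a pip install: the traceback in this issue shows the package living under `site-packages\marker`. A minimal sketch for printing that location on any machine, assuming the top-level package is importable as `marker` (as the traceback paths suggest); `package_dir` is a hypothetical helper name, not part of marker:

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the folder a top-level package or module was installed into, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not installed in this interpreter
    if spec.submodule_search_locations:
        # Regular package: the folder that holds its __init__.py
        return Path(next(iter(spec.submodule_search_locations)))
    if spec.origin and spec.origin not in ("built-in", "frozen"):
        # Single-file module: the folder containing that file
        return Path(spec.origin).parent
    return None


if __name__ == "__main__":
    # Prints None if marker is not installed for the Python that runs this script.
    print(package_dir("marker"))
```

Run it with the same interpreter that pip installed into (here apparently the miniconda3 one), otherwise it may report that the package is missing.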
https://github.com/VikParuchuri/marker/blob/master/docs/install_ocrmypdf.md
winget install -e --id Python.Python.3.11
winget install -e --id UB-Mannheim.TesseractOCR
installed ghostscript
python3 -m pip install ocrmypdf
set OCR_ALL_PAGES=true
set OCR_ENGINE=ocrmypdf
marker_single input.pdf C:/output/folder --langs Greek,Lithuanian
5.1 Trying to resolve:
Introduced a new variable:
set TESSDATA_PREFIX="C:\Program Files\Tesseract-OCR\tessdata"
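One thing worth checking here: in CMD, `set NAME="value"` stores the double quotes as part of the value, so a path set this way may not resolve as a folder. The sketch below is a hedged diagnostic, not something marker or Tesseract ships; `check_tessdata_prefix` is a hypothetical helper, and it assumes the variable is meant to point at the folder that directly contains the `*.traineddata` files, which can differ between Tesseract versions.

```python
import os
from pathlib import Path
from typing import List, Optional


def check_tessdata_prefix(value: Optional[str]) -> List[str]:
    """Return a list of likely problems with a TESSDATA_PREFIX value (empty if none found)."""
    if not value:
        return ["TESSDATA_PREFIX is not set"]
    problems = []
    if value.startswith('"') or value.endswith('"'):
        problems.append("value contains literal double quotes")
        value = value.strip('"')  # keep checking the path without the quotes
    folder = Path(value)
    if not folder.is_dir():
        problems.append(f"not an existing folder: {folder}")
    elif not any(folder.glob("*.traineddata")):
        problems.append(f"no *.traineddata files in: {folder}")
    return problems


if __name__ == "__main__":
    for line in check_tessdata_prefix(os.environ.get("TESSDATA_PREFIX")) or ["looks fine"]:
        print(line)
```

If the quotes turn out to be the issue, the usual CMD form is `set "TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR\tessdata"`, with the quotes around the whole assignment.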
Loaded detection model vikp/surya_det3 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout3 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec2 on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
Loaded recognition model vikp/surya_tablerec on device cpu with dtype torch.float32
Detecting bboxes: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:18<00:00, 18.75s/it]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\wasup\miniconda3\Scripts\marker_single.exe\__main__.py", line 7, in <module>
  File "C:\Users\wasup\miniconda3\Lib\site-packages\convert_single.py", line 33, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\convert.py", line 98, in convert_single_pdf
    pages, ocr_stats = run_ocr(doc, pages, langs, ocr_model, batch_multiplier=batch_multiplier, ocr_all_pages=ocr_all_pages)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 55, in run_ocr
    new_pages = tesseract_recognition(doc, ocr_idxs, langs)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 136, in tesseract_recognition
    pages = list(executor.map(_tesseract_recognition, pdf_pages, repeat(langs, len(pdf_pages))))
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\wasup\miniconda3\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\marker\ocr\recognition.py", line 177, in _tesseract_recognition
    new_doc = pdfium.PdfDocument(f.name)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "C:\Users\wasup\miniconda3\Lib\site-packages\pypdfium2\_helpers\document.py", line 678, in _open_pdf
    raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: File access error).
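The failing call is `pdfium.PdfDocument(f.name)`, where `f.name` looks like the path of a temporary file. One possible cause, which I have not confirmed against marker's source, is a Windows-specific limitation: a file created with `tempfile.NamedTemporaryFile` generally cannot be opened a second time by name while the first handle is still open, although the same code works on Unix. Another possibility is that the OCR step never wrote a valid output file. The sketch below only illustrates a portable way to hand a temp file to a second reader; `write_then_reopen` is a hypothetical name, and plain `open()` stands in for the PDF loader.

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write data to a temp file, release the handle, then read it back by path."""
    # delete=False so the file survives close(); we remove it ourselves below.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release the handle before anything else opens the path
        with open(tmp.name, "rb") as reader:  # stand-in for a loader that takes a path
            return reader.read()
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)


if __name__ == "__main__":
    print(len(write_then_reopen(b"%PDF-1.4 placeholder bytes")))
```

The design choice is simply to close the writer before the reader opens the path, which behaves the same on Windows and Unix.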