
Commit

Start rewriting the README
VikParuchuri committed Nov 22, 2024
1 parent fdb5564 commit d043dc6
Showing 2 changed files with 74 additions and 87 deletions.
157 changes: 74 additions & 83 deletions README.md
@@ -1,40 +1,40 @@
# Marker

Marker converts PDF to markdown quickly and accurately.
Marker converts PDFs to markdown, JSON, and HTML quickly and accurately.

- Supports a wide range of documents (optimized for books and scientific papers)
- Supports a wide range of documents
- Supports all languages
- Removes headers/footers/other artifacts
- Formats tables and code blocks
- Extracts and saves images along with the markdown
- Converts most equations to latex
- Converts equations to latex
- Easily extensible with your own formatting and logic
- Works on GPU, CPU, or MPS

## How it works

Marker is a pipeline of deep learning models:

- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya), tesseract)
- Extract text, OCR if necessary (heuristics, [surya](https://github.com/VikParuchuri/surya))
- Detect page layout and find reading order ([surya](https://github.com/VikParuchuri/surya))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify))
- Combine blocks and postprocess complete text (heuristics, [pdf_postprocessor](https://huggingface.co/vikp/pdf_postprocessor_t5))
- Clean and format each block (heuristics, [texify](https://github.com/VikParuchuri/texify), [tabled](https://github.com/VikParuchuri/tabled))
- Combine blocks and postprocess complete text

It only uses models where necessary, which improves speed and accuracy.

## Examples

| PDF | Type | Marker | Nougat |
|-----------------------------------------------------------------------|-------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkpython.md) |
| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/thinkos.md) |
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/switch_transformers.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/multicolcnn.md) | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |
| PDF | Type | Markdown |
|-----|------|----------|
| [Think Python](https://greenteapress.com/thinkpython/thinkpython.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkpython.md) |
| [Think OS](https://greenteapress.com/thinkos/thinkos.pdf) | Textbook | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/marker/thinkos.md) |
| [Switch Transformers](https://arxiv.org/pdf/2101.03961.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/switch_transformers.md) |
| [Multi-column CNN](https://arxiv.org/pdf/1804.07821.pdf) | arXiv paper | [View](https://github.com/VikParuchuri/marker/blob/master/data/examples/nougat/multicolcnn.md) |

## Performance

![Benchmark overall](data/images/overall.png)

The above results are with marker and nougat setup so they each take ~4GB of VRAM on an A6000.
The above results are with marker set up so that it takes ~4GB of VRAM on an A6000.

See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instructions on how to run your own benchmarks.

@@ -60,11 +60,10 @@ There's a hosted API for marker available [here](https://www.datalab.to/):

PDF is a tricky format, so marker will not always work perfectly. Here are some known limitations that are on the roadmap to address:

- Marker will not convert 100% of equations to LaTeX. This is because it has to detect then convert.
- Tables are not always formatted 100% correctly - text can be in the wrong column.
- Whitespace and indentations are not always respected.
- Not all lines/spans will be joined properly.
- This works best on digital PDFs that won't require a lot of OCR. It's optimized for speed, and limited OCR is used to fix errors.
- Marker will not convert inline equations.
- Tables are not always formatted 100% correctly - multiline cells are sometimes split into multiple rows.
- Forms are not converted optimally.
- Very complex layouts, with nested tables and forms, may not work.

# Installation

@@ -76,20 +75,12 @@ Install with:
pip install marker-pdf
```

## Optional: OCRMyPDF

Only needed if you want to use the optional `ocrmypdf` as the ocr backend. Note that `ocrmypdf` includes Ghostscript, an AGPL dependency, but calls it via CLI, so it does not trigger the license provisions.

See the instructions [here](docs/install_ocrmypdf.md)

# Usage

First, some configuration:

- Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. It also doesn't require you to specify the languages in the document. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
- Some PDFs, even digital ones, have bad text in them. Set `OCR_ALL_PAGES=true` to force OCR if you find bad output from marker.
- Some PDFs, even digital ones, have bad text in them. Set the `force_ocr` flag on the CLI or via configuration to ensure your PDF runs through OCR.

## Interactive App

@@ -103,62 +94,81 @@ marker_gui
## Convert a single file

```shell
marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --max_pages 10
marker_single /path/to/file.pdf
```

- `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
- `--start_page` is the page to start from (default is None, will start from the first page).
- `--langs` is an optional comma separated list of the languages in the document, for OCR. Optional by default, required if you use tesseract.
Options:
- `--output_dir PATH`: Directory where output files will be saved. Defaults to the value specified in settings.OUTPUT_DIR.
- `--debug`: Enable debug mode for additional logging and diagnostic information.
- `--output_format [markdown|json|html]`: Specify the format for the output results.
- `--page_range TEXT`: Specify which pages to process. Accepts comma-separated page numbers and ranges. Example: `--page_range "0,5-10,20"` will process pages 0, 5 through 10, and page 20.
- `--force_ocr`: Force OCR processing on the entire document, even for pages that might contain extractable text.
- `--processors TEXT`: Override the default processors by providing their full module paths, separated by commas. Example: `--processors "module1.processor1,module2.processor2"`
- `--config_json PATH`: Path to a JSON configuration file containing additional settings.
- `--languages TEXT`: Optionally specify which languages to use for OCR processing. Accepts a comma-separated list. Example: `--languages "eng,fra,deu"` for English, French, and German.
- `-l`: List all available builders, processors, and converters, and their associated configuration. These values can be used to build a JSON configuration file for additional tweaking of marker defaults.

The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`. If you don't need OCR, marker can work with any language.
The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you don't need OCR, marker can work with any language.

## Convert multiple files

```shell
marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
marker /path/to/input/folder --workers 10
```

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
- `--max` is the maximum number of pdfs to convert. Omit this to convert all pdfs in the folder.
- `--min_length` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)
- `--metadata_file` is an optional path to a json file with metadata about the pdfs. If you provide it, it will be used to set the language for each pdf. Setting language is optional for surya (default), but required for tesseract. The format is:

```
{
  "pdf1.pdf": {"languages": ["English"]},
  "pdf2.pdf": {"languages": ["Spanish", "Russian"]},
  ...
}
```

You can use language names or codes. The exact codes depend on the OCR engine. See [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py) for a full list for surya codes, and [here](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) for tesseract.
- `marker` supports all the same options from `marker_single` above.
- `--workers` is the number of conversion workers to run simultaneously. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.

## Convert multiple files on multiple GPUs

```shell
METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
- `NUM_DEVICES` is the number of GPUs to use. Should be `2` or greater.
- `NUM_WORKERS` is the number of parallel processes to run on each GPU.
- `MIN_LENGTH` is the minimum number of characters that need to be extracted from a pdf before it will be considered for processing. If you're processing a lot of pdfs, I recommend setting this to avoid OCRing pdfs that are mostly images. (slows everything down)

Note that the env variables above are specific to this script, and cannot be set in `local.env`.
## Use from python

See the `PdfConverter` class at `marker/converters/pdf.py` for additional arguments that can be passed.

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")
text, _, images = text_from_rendered(rendered)
```

`rendered` will be a pydantic basemodel with different properties depending on the output type requested. With markdown output (default), you'll have the properties `markdown`, `metadata`, and `images`. For json output, you'll have `children`, `block_type`, and `metadata`.
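
For example, with the default markdown output you can work with those properties directly. A minimal sketch; the output file name is illustrative, and `rendered.images` is assumed to map image names to PIL images:

```python
from pathlib import Path

# Continuing from the example above, with the default markdown output.
print(rendered.metadata)  # document-level metadata

# Save the markdown text.
Path("output.md").write_text(rendered.markdown, encoding="utf-8")

# Save the extracted images (assumed to be PIL images keyed by file name).
for name, image in rendered.images.items():
    image.save(name)
```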

### Custom configuration

You can also pass configuration using the `ConfigParser`:

```python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser

config = {
    "output_format": "json",
    "ADDITIONAL_KEY": "VALUE"
}
config_parser = ConfigParser(config)

converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer()
)
rendered = converter("FILEPATH")
```
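
With `output_format` set to `"json"` as above, `rendered` is a block tree rather than markdown. A minimal sketch of walking it, assuming child blocks expose the same `children` and `block_type` fields described above:

```python
# Recursively print the block types in the JSON output tree. Assumes child
# blocks expose `block_type` and `children` like the top-level result.
def print_blocks(block, depth=0):
    print("  " * depth + str(block.block_type))
    for child in (getattr(block, "children", None) or []):
        print_blocks(child, depth + 1)

print_blocks(rendered)
```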

# Output format
@@ -202,14 +212,6 @@ marker_server --port 8001

This will start a fastapi server that you can access at `localhost:8001`. You can go to `localhost:8001/docs` to see the endpoint options.

Note that this is not a very robust API, and is only intended for small-scale use. If you want to use this server, but want a more robust conversion option, you can run against the hosted [Datalab API](https://www.datalab.to/plans). You'll need to register and get an API key, then run:

```shell
marker_server --port 8001 --api_key API_KEY
```

Note: This is not the recommended way to use the Datalab API - it's only provided as a convenience for people wrapping the marker repo. The recommended way is to make a post request to the endpoint directly from your code vs proxying through this server.

You can send requests like this:

@@ -224,30 +226,19 @@

```
post_data = {
    # ...
}
requests.post("http://localhost:8001/marker", data=json.dumps(post_data)).json()
```
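
For reference, a slightly fuller client sketch. The request field name below is an assumption (the snippet above is abbreviated), so check `localhost:8001/docs` for the actual schema:

```python
import json

import requests

# "filepath" is an assumed field name for the PDF path on the server machine;
# see localhost:8001/docs for the real request schema.
post_data = {
    "filepath": "/path/to/file.pdf",
}

response = requests.post("http://localhost:8001/marker", data=json.dumps(post_data))
print(response.json())
```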

Note that this is not a very robust API, and is only intended for small-scale use. If you want to use this server, but want a more robust conversion option, you can use the hosted [Datalab API](https://www.datalab.to/plans).

# Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

- `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if there is garbled text in the output of marker.
- Make sure to set `force_ocr` if you see garbled text - this will re-OCR the document.
- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
- `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
- Verify that you set the languages correctly, or passed in a metadata file.
- If you're getting out of memory errors, decrease worker count (increased the `VRAM_PER_TASK` setting). You can also try splitting up long PDFs into multiple files.

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.
- If you're getting out of memory errors, decrease worker count. You can also try splitting up long PDFs into multiple files.

## Debugging

Set `DEBUG=true` to save data to the `debug` subfolder in the marker root directory. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

## Useful settings

These settings can improve/change output quality:

- `OCR_ALL_PAGES` will force OCR across the document. Many PDFs have bad text embedded due to older OCR engines being used.
- `PAGINATE_OUTPUT` will put a horizontal rule between pages. Default: False. The horizontal rule will be `\n\n`, then `{PAGE_NUMBER}`, then 48 single dashes `-`, then `\n\n`. The separator can be configured via the `PAGE_SEPARATOR` setting.
- `EXTRACT_IMAGES` will extract images and save separately. Default: True.
- `BAD_SPAN_TYPES` specifies layout blocks to remove from the markdown output.
Pass the `debug` option to activate debug mode. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

# Benchmarks

4 changes: 0 additions & 4 deletions marker/renderers/json.py
@@ -1,16 +1,12 @@
from __future__ import annotations

import base64
import io
from typing import List, Dict

from bs4 import BeautifulSoup
from pydantic import BaseModel

from marker.schema.blocks import Block
from marker.renderers import BaseRenderer
from marker.schema import BlockTypes
from marker.schema.blocks import BlockId
from marker.schema.registry import get_block_class


Expand Down
