Text recognition errors #339

Open
Batapha opened this issue Nov 10, 2024 · 2 comments
Comments

Batapha commented Nov 10, 2024

In the attached file, see pages 48 and 49 of the PDF. The generated markdown file has no heading 1.10, part of the content on these two pages is recognized out of order, and part of it is not recognized at all.

Theories of Truth - Richard Kirkham (... (Z-Library).zip

Theories of Truth - Richard Kirkham (... (Z-Library).pdf

Batapha commented Nov 10, 2024

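Adding another case: an English document should be recognized line by line, but in the output, text from different lines is mixed up.

As the screenshots below show, in the generated markdown file line 4 is merged with the last line of the first paragraph, while lines 2 and 3 are split off into a separate paragraph.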

英文测试 4.pdf

截屏2024-11-10 08 23 46
截屏2024-11-10 08 25 07
英文测试 4.zip

Batapha commented Nov 13, 2024

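Setting --max_pages to 9999 forces every page to be recognized, but this only works when running through marker_single, so I wrote an automation script (.sh) to process files one at a time. The code is below.

You can also vary batch_multiplier to experiment with GPU utilization and avoid running out of memory.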

```bash
#!/bin/bash

# Fixed input and output paths
INPUT_DIR="/Users/User/Documents/Github/maker/Input"
OUTPUT_DIR="/Users/user/Documents/Github/maker/Output"

# Default batch_multiplier: first argument, or 1 if omitted
BATCH_MULTIPLIER=${1:-1}

# Validate batch_multiplier: must be a non-negative integer...
if ! [[ "$BATCH_MULTIPLIER" =~ ^[0-9]+$ ]]; then
    echo "Error: batch_multiplier must be a number"
    exit 1
fi

# ...and must be 1 or 2
if [ "$BATCH_MULTIPLIER" -lt 1 ] || [ "$BATCH_MULTIPLIER" -gt 2 ]; then
    echo "Error: batch_multiplier must be between 1 and 2"
    exit 1
fi

# Check that the input directory exists
if [ ! -d "$INPUT_DIR" ]; then
    echo "Error: Input directory '$INPUT_DIR' does not exist"
    exit 1
fi

# Create the output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Counters for progress output (tr strips the padding that BSD/macOS wc adds)
total_files=$(ls -1 "$INPUT_DIR"/*.pdf 2>/dev/null | wc -l | tr -d ' ')
current=0

echo "Starting processing with batch_multiplier = $BATCH_MULTIPLIER"

# Process each PDF file
for file in "$INPUT_DIR"/*.pdf; do
    # Skip the literal pattern when no PDF matches
    if [ -f "$file" ]; then
        current=$((current + 1))
        filename=$(basename "$file")
        echo "Processing ($current/$total_files): $filename"

        marker_single "$file" "$OUTPUT_DIR" --max_pages 9999 --batch_multiplier "$BATCH_MULTIPLIER"
    fi
done

echo "All PDF files have been processed!"
```

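How to run the script

  1. Before first use, add execute permission:
    chmod +x process_pdfs.sh

  2. There are two ways to run the script:
    Option 1 (requires execute permission): ./process_pdfs.sh 2
    Option 2 (no execute permission needed): bash process_pdfs.sh 2

Usage examples:
Valid: ./process_pdfs.sh 2 (higher GPU utilization)
Valid: ./process_pdfs.sh 1 (lower GPU utilization)
Invalid: ./process_pdfs.sh 3 (prints an error message)
Invalid: ./process_pdfs.sh 0 (prints an error message)

Notes:

  1. The numeric argument is batch_multiplier; the allowed range is 1-2
  2. A larger batch_multiplier raises GPU utilization and speeds up processing
  3. If you hit an out-of-memory error, use 1
  4. Start with 1, and move to 2 only if it runs without problems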
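
The notes above amount to a manual retry: use 2 when it works and drop back to 1 when it does not. A minimal sketch of automating that fallback follows. It is not part of the original script. The function name `convert_with_fallback` and the `CONVERT_CMD` variable are hypothetical, and it assumes marker_single exits with a non-zero status when a conversion fails (for example on an out-of-memory error), which I have not verified.

```bash
#!/bin/bash

# Hypothetical: command used for conversion. Defaults to marker_single;
# the variable exists only so the command can be swapped out.
CONVERT_CMD="${CONVERT_CMD:-marker_single}"

# Hypothetical helper: try batch_multiplier 2 first, then retry with 1.
# Assumes the conversion command returns non-zero on failure.
convert_with_fallback() {
    local file="$1" out_dir="$2" multiplier
    for multiplier in 2 1; do
        if "$CONVERT_CMD" "$file" "$out_dir" --max_pages 9999 --batch_multiplier "$multiplier"; then
            echo "OK: $file (batch_multiplier=$multiplier)"
            return 0
        fi
        echo "Failed with batch_multiplier=$multiplier: $file" >&2
    done
    return 1
}
```

Inside the processing loop, the direct marker_single call could then be replaced with `convert_with_fallback "$file" "$OUTPUT_DIR"`, so one file that does not fit at the larger batch size does not need a rerun of the whole batch by hand.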
