Text recognition errors #339
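Adding another case: an English document should be recognized by scanning line by line, but in the output, text from different lines is scrambled. In the screenshot below, you can see that in the generated markdown file the 4th line is merged with the last line of the 1st paragraph, while lines 2 and 3 start a new paragraph.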
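Setting --max_pages to 9999 forces every page to be recognized, but this only works when running through marker_single, so I wrote an automation script (.sh), shown below. You can also vary batch_multiplier to experiment with GPU utilization and avoid running out of memory.

```bash
#!/bin/bash
# Fixed input and output paths
INPUT_DIR="/Users/User/Documents/Github/maker/Input"
OUTPUT_DIR="/path/to/Output"  # placeholder: the original value is not shown in the post

# Default batch_multiplier (first argument, falls back to 1)
BATCH_MULTIPLIER=${1:-1}

# Validate batch_multiplier (error messages and exits are reconstructed)
if ! [[ "$BATCH_MULTIPLIER" =~ ^[0-9]+$ ]]; then
    echo "Error: batch_multiplier must be an integer" >&2
    exit 1
fi
if [ "$BATCH_MULTIPLIER" -lt 1 ] || [ "$BATCH_MULTIPLIER" -gt 2 ]; then
    echo "Error: batch_multiplier must be 1 or 2" >&2
    exit 1
fi

# Check that the input directory exists
if [ ! -d "$INPUT_DIR" ]; then
    echo "Error: input directory does not exist: $INPUT_DIR" >&2
    exit 1
fi

# Create the output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Counter
total_files=$(ls -1 "$INPUT_DIR"/*.pdf 2>/dev/null | wc -l)
echo "Starting processing with batch_multiplier = $BATCH_MULTIPLIER"

# Process each PDF file (loop body reconstructed from the description above)
for file in "$INPUT_DIR"/*.pdf; do
    marker_single "$file" "$OUTPUT_DIR" --batch_multiplier "$BATCH_MULTIPLIER" --max_pages 9999
done
echo "All PDF files have been processed!"
```

How to run the script

Usage example:

Note: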
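Since a full run over every page can take a long time, it may help to preview the commands before launching them. The following is a minimal dry-run sketch, not part of the original script: the function name `print_marker_commands` is hypothetical, and the argument order for `marker_single` (input file, output folder, then flags) is an assumption based on the description in this post.

```bash
#!/bin/bash
# Dry-run helper (hypothetical): print the marker_single command for each PDF
# in a folder instead of executing it.
# Assumption: marker_single takes <input.pdf> <output_dir> followed by the
# --batch_multiplier and --max_pages flags mentioned in this post.
print_marker_commands() {
    local input_dir="$1" output_dir="$2" multiplier="${3:-1}"
    local file
    for file in "$input_dir"/*.pdf; do
        # When no PDF matches, the glob stays literal; skip it.
        [ -e "$file" ] || continue
        printf 'marker_single "%s" "%s" --batch_multiplier %s --max_pages 9999\n' \
            "$file" "$output_dir" "$multiplier"
    done
}
```

Calling `print_marker_commands ./Input ./Output 2` lists one command per PDF, which can be inspected or piped to `bash` once the paths and the batch_multiplier value look right.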
In the attached files, see pages 48 and 49 of the PDF. The markdown output has no heading for section 1.10, part of the content on these 2 pages is scrambled, and part of it was not recognized at all.
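A quick way to confirm which section headings are missing from the converted file is to search for them directly. This is a rough sketch under stated assumptions: the function name `missing_headings` is hypothetical, and it assumes headings appear in the markdown output as lines starting with `#` followed by the section number, which may not hold for every document.

```bash
#!/bin/bash
# Print every expected section number that has no matching heading line
# in a Markdown file. Hypothetical helper; assumes headings look like
# "## 1.10 Title" (one or more '#', whitespace, then the number).
missing_headings() {
    local md_file="$1"; shift
    local num pattern
    for num in "$@"; do
        # Escape the dots so they match literally, and require a non-digit
        # (or end of line) after the number so "1.1" does not match "1.10".
        pattern="^#+[[:space:]]+${num//./\\.}([^0-9]|\$)"
        grep -Eq "$pattern" "$md_file" || echo "$num"
    done
}
```

For example, `missing_headings output.md 1.9 1.10 1.11` prints only the section numbers that have no heading in `output.md` (the file name here is a placeholder).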
Theories of Truth - Richard Kirkham (... (Z-Library).zip
Theories of Truth - Richard Kirkham (... (Z-Library).pdf