Wrong classification of title level #376

Open
kubni opened this issue Nov 20, 2024 · 1 comment
kubni commented Nov 20, 2024

Hello.
I'm trying out marker-pdf and noticed that it sometimes assigns the wrong level to titles.
Here is how page 63 looks:
image

That title is classified as level 1. Shouldn't it be level 0? It is the title of one of the document's main sections, as shown on the contents page of this PDF:
image

Meanwhile, the same title appears on page 25 like this:
image

That one is classified as level 0. It is a link to the page 63 title, so I'm not sure whether that is what confuses the detection.

Also, OCR somehow detects a second instance of the same title on page 25 and marks it as level 1, which is odd because the title appears only once on that page.

kubni commented Nov 20, 2024

Here is another example of odd title detection, from this page:
image

OCR correctly detects this title as a level 0 title: {'title': '7. Institutional stakeholders and their roles', 'level': 0, 'page': 62}

However, the same title is also found on page 0, this time as level 1, along with a lot of titles with empty text (also on page 0) that don't exist there at all.

Is there a way to improve title detection?
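
To make the problem easier to inspect, here is a minimal sketch of a post-processing check. It assumes the detected titles are available as a list of dicts shaped like the entry quoted above ('title', 'level', 'page'); how that list is obtained from marker's output is not shown here, and the function names and sample data are hypothetical. It drops titles with empty text and reports titles that were detected with more than one level. It does not decide which level is the correct one.

```python
from collections import defaultdict


def drop_empty_titles(toc):
    """Return only the entries whose title contains non-whitespace text."""
    return [entry for entry in toc if (entry.get("title") or "").strip()]


def find_conflicting_titles(toc):
    """Map each title detected with more than one level to its (level, page) pairs."""
    seen = defaultdict(list)
    for entry in toc:
        seen[entry["title"].strip()].append((entry["level"], entry["page"]))
    return {
        title: hits
        for title, hits in seen.items()
        if len({level for level, _ in hits}) > 1
    }


# Hypothetical sample mirroring the entries described in this issue.
sample_toc = [
    {"title": "", "level": 1, "page": 0},
    {"title": "7. Institutional stakeholders and their roles", "level": 1, "page": 0},
    {"title": "7. Institutional stakeholders and their roles", "level": 0, "page": 62},
]

cleaned = drop_empty_titles(sample_toc)
conflicts = find_conflicting_titles(cleaned)
print(conflicts)
```

This only works around the symptoms after the fact; the duplicate and empty detections themselves would still need to be fixed in the title detection.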
