Chunking Hierarchy Identification #287

Shubhamkumar782 · 2024-11-09T19:55:13Z

Question
I am working on a custom chunking method where I need to identify headings, subheadings, and child headings separately. Here's the detailed explanation:

Current Issue:

I am using Docling to tag headings in a PDF.
Currently, all nested headings (subheadings and child headings) are marked as regular headings with ##.
There is no differentiation between parent headings and sub-level headings.

Objective:

I want to store section headings and titles as metadata for the content under each subheading.
Example: For a PDF with 3 chapters, each having multiple subheadings, the chunk should have:
Chapter Name as the Title.
Subheading as the Section Heading.

Current Limitation:

While I can extract the lowest level of headings, I am unable to identify the parent headings since the tags do not differentiate between them.

Assumption for Hierarchy:

I assume that chapter names are typically larger in font size compared to subheadings and child headings.
A hierarchy based on text size or boldness could be useful to identify different levels of headings.

Question:

Is there a way to distinguish headings, subheadings, and child headings separately based on these characteristics (e.g., font size, boldness)?
Any solution or guidance to achieve this would be highly appreciated.
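
To make the font-size assumption concrete, here is a minimal sketch of ranking headings by size. It is not part of Docling: the `Heading` class, its `font_size` field, and the `tolerance` parameter are hypothetical, and it assumes the font size of each heading has already been read from the PDF text layer by some other means. The heuristic breaks down when headings share the body text size.

```python
from dataclasses import dataclass


@dataclass
class Heading:
    text: str
    font_size: float  # hypothetical: point size read from the PDF text layer


def assign_levels(headings: list[Heading], tolerance: float = 0.5) -> list[tuple[int, str]]:
    """Rank headings by font size: the largest size becomes level 1."""
    # Collect distinct sizes, largest first, merging sizes closer than `tolerance`.
    tiers: list[float] = []
    for size in sorted({h.font_size for h in headings}, reverse=True):
        if not tiers or tiers[-1] - size >= tolerance:
            tiers.append(size)

    def level_of(size: float) -> int:
        # The index of the closest tier, shifted to a 1-based level.
        closest = min(range(len(tiers)), key=lambda i: abs(tiers[i] - size))
        return closest + 1

    return [(level_of(h.font_size), h.text) for h in headings]


headings = [
    Heading("Chapter 1", 18.0),
    Heading("1.1 Scope", 14.0),
    Heading("1.1.1 Details", 12.0),
    Heading("Chapter 2", 18.1),  # small size jitter still lands on level 1
]
levels = assign_levels(headings)
```

The tolerance absorbs small size differences that PDF extraction often introduces between headings of the same visual rank.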

AlessandroSpallina · 2024-11-13T10:46:32Z

+1

PeterStaar-IBM · 2024-11-13T12:19:46Z

@Shubhamkumar782 Docling produces a DoclingDocument data-structure, which can be used in the HierarchicalChunker.

I think this solves your problem, just load the HierarchicalChunker from docling-core and leverage the chunk method.

Close the issue if this helps you!
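
As an illustration of the idea behind hierarchy-aware chunking, the sketch below attaches the path of enclosing headings to every text chunk. It is not docling-core's implementation and does not use its API: the `Chunk` class, `chunk_with_headings`, and the `(level, text)` input format are hypothetical, and it assumes the heading levels are already known.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Path of enclosing headings, from the top-level heading down.
    headings: list[str] = field(default_factory=list)


def chunk_with_headings(items: list[tuple[int, str]]) -> list[Chunk]:
    """items: (level, text) pairs; level 0 is body text, 1 and up is a heading."""
    stack: list[tuple[int, str]] = []  # currently open headings, outermost first
    chunks: list[Chunk] = []
    for level, text in items:
        if level == 0:
            chunks.append(Chunk(text=text, headings=[h for _, h in stack]))
            continue
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
    return chunks


items = [
    (1, "Chapter 1"),
    (2, "Background"),
    (0, "Some text."),
    (2, "Method"),
    (0, "More text."),
    (1, "Chapter 2"),
    (0, "Intro text."),
]
chunks = chunk_with_headings(items)
```

With this layout, the first entry of `headings` plays the role of the title and the last entry the role of the section heading.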

qianyue76 · 2024-11-14T06:19:41Z

@PeterStaar-IBM I see the chunk method in the HierarchicalChunker from docling-core, but the section-header level comes from the dl_doc: DLDocument it is given. And #77 says:

At the moment, the system is detecting only one level of section headers.

Are deeper levels of section headers supported now, or did I miss something?

PeterStaar-IBM · 2024-11-14T06:40:09Z

@qianyue76 Yes, this is correct. We need to add a new feature to identify the level of the section-headers in PDF (for docx, html and pptx, this comes for free). Within a PDF, it is hard to identify the section-header level, but we have some ideas on how to tackle this.

Of course, if you want to collaborate on that, please let us know!

qianyue76 · 2024-11-14T07:29:15Z

@PeterStaar-IBM Sounds great. I'm also trying to figure out how to solve this problem, so collaborating might be a good way forward.

PeterStaar-IBM · 2024-11-14T07:49:08Z

OK, please look at the email address in MAINTAINERS.md. We can set up a quick sync and discuss next steps!

qianyue76 · 2024-11-14T08:09:07Z

@PeterStaar-IBM Got it! Should I do anything first?

PeterStaar-IBM · 2024-11-14T09:20:25Z

Just write us an email, and we will follow up!

Shubhamkumar782 · 2024-11-14T09:29:14Z

@PeterStaar-IBM,

I have tried HierarchicalChunker, but the section headings are still not being captured.

I am also trying to solve this problem. I tried analysing font size and bold information, but that is not enough to decide whether a line is a section heading: in some documents the headings have the same size as the body text and are not bold either.

For now I have written regex-based chunking, which is very efficient, but it only works for the documents in my current use case. It is patchwork and tough to generalize.

There are some possible approaches with vision-based models, but I have not tried them yet; I have only seen a few examples.

If I find a way, I will post an update here.

PeterStaar-IBM · 2024-11-14T09:39:30Z

@Shubhamkumar782 we have ideas to solve this holistically, and are open to collaboration. Regex and other methods will only work for very specific use cases: not bad, but also not very satisfying.

If you have a working regex, you could always update the section-headers in the DoclingDocument so that their level is set correctly.
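
One way such a regex could produce a level is sketched below, assuming headings carry dotted numbering such as "2.1.3". The pattern, `infer_level`, and the `default` parameter are hypothetical and only cover that one numbering style. Writing the level back into the DoclingDocument is left out, since the exact field is not stated in this thread.

```python
import re

# Assumed pattern: headings start with dotted numbering such as "2", "2.1" or "2.1.3",
# optionally followed by a trailing dot, then whitespace and the heading text.
NUMBERED_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def infer_level(heading_text: str, default: int = 1) -> int:
    """Return the nesting depth implied by the numbering, or `default` if there is none."""
    match = NUMBERED_HEADING.match(heading_text)
    if match is None:
        return default
    # "2" has depth 1, "2.1" depth 2, "2.1.3" depth 3.
    return match.group(1).count(".") + 1


inferred = [infer_level(t) for t in ["1 Introduction", "2.1 Data", "2.1.3. Details", "Conclusion"]]
```

Unnumbered headings fall back to `default`, which is exactly where a document-specific rule like this stops generalizing.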

Shubhamkumar782 · 2024-11-14T09:59:29Z

@PeterStaar-IBM,

Good to hear.

Actually, the regex I have only works for specific documents, so even if I use it to update the DoclingDocument, I will not be able to achieve what I want.

Shubhamkumar782 added the question Further information is requested label Nov 9, 2024

PeterStaar-IBM self-assigned this Nov 11, 2024

vagenas added the PDF parsing label Nov 14, 2024
