`AttributeError: 'int' object has no attribute 'strip'` - HierarchicalChunker - V2 DoclingDocument #181

leviataniac · 2024-10-26T07:37:48Z

Dear Docling team, we have a problem in a LlamaIndex pipeline with the HierarchicalChunker. Please find the documents attached. The latest version used is 2.2.0, with an inlined DoclingParser taken from the feature request for the official LlamaIndex DoclingParser.

[Greenalia_.pdf](https://github.com/user-attachments/files/17529473/Greenalia_.pdf)
[ACCIONA_compressed-1-1.pdf](https://github.com/user-attachments/files/17529462/ACCIONA_compressed-1-1.pdf)

Parsing seems to be fine, so the converter with the JSON ExportType is working. Adding the result to the Milvus vector store via the node parser/transformations, which is based on the HierarchicalChunker, then fails with the following error:

2024-10-25 16:51:57,840 - ERROR - 58ebd550-a879-466f-9903-007bf468ebf5 - Exception details:
Traceback (most recent call last):
  File "/Users/C/test/processor.py", line 336, in process_single_pdf
    index = self.upload_to_milvus(pdf_path, milvus_url, cleaned_milvus_coll, ingest)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/C/test/processor.py", line 292, in upload_to_milvus
    index = VectorStoreIndex.from_documents(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/indices/base.py", line 112, in from_documents
    nodes = run_transformations(
            ^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 100, in run_transformations
    nodes = transform(nodes, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 311, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 193, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 165, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 311, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/C/test/processor.py", line 211, in _parse_nodes
    for i, chunk in enumerate(chunk_iter):
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/docling_core/transforms/chunker/hierarchical_chunker.py", line 211, in chunk
    text = self._triplet_serialize(table_df=table_df)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/docling_core/transforms/chunker/hierarchical_chunker.py", line 132, in _triplet_serialize
    rows = [item.strip() for item in table_df.iloc[:, 0].to_list()]
            ^^^^^^^^^^
AttributeError: 'int' object has no attribute 'strip'

Some converted PDF documents are fine, but others fail with this error. The failing ones could be converted with the 1.x inline version and JSON export.

Do you have any idea where I could look? Thank you.

vagenas · 2024-10-26T08:36:19Z

Hi @leviataniac, the fix is probably going to be as simple as this. Will let you know once it's released.

leviataniac · 2024-10-26T08:50:23Z

Thx @vagenas. Much appreciated!!

vagenas · 2024-10-26T11:21:36Z

@leviataniac please update your docling-core and check again.
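For anyone following along: the thread does not say which docling-core release contains the fix, so after upgrading it can help to confirm which version is actually installed in the environment that runs the pipeline. A minimal sketch using the standard library; the helper name `installed_version` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the docling-core version in the current environment, or None if missing.
print(installed_version("docling-core"))
```

Running this with the same interpreter that runs the ingestion code (here Python 3.12, per the traceback) avoids checking the wrong site-packages.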

leviataniac · 2024-10-26T21:36:09Z

@vagenas thx, now it's working :) Simple fix, helps a lot.

vagenas linked a pull request Oct 26, 2024 that will close this issue

fix: fix non-string table cell handling in chunker DS4SD/docling-core#58

Merged

vagenas closed this as completed in DS4SD/docling-core#58 Oct 26, 2024
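
The actual change in the linked PR is not shown in this thread. As a rough illustration of the failure and of one possible guard, the sketch below uses a plain list as a stand-in for what `table_df.iloc[:, 0].to_list()` might return when a table cell holds a number. The data and both helper names are hypothetical, and coercing with `str()` is an assumption, not necessarily what the PR does:

```python
# Hypothetical stand-in for the first table column: a mix of strings and an int.
first_column = [" Revenue ", 2023, " Total "]


def strip_cells_naive(cells):
    # Mirrors the failing line in the traceback: assumes every cell is a str.
    return [item.strip() for item in cells]


def strip_cells_safe(cells):
    # Assumed guard: convert each cell to str before stripping whitespace.
    return [str(item).strip() for item in cells]


error_message = None
try:
    strip_cells_naive(first_column)
except AttributeError as exc:
    # The int cell has no .strip(), which is the error reported in this issue.
    error_message = str(exc)
    print(error_message)

safe_rows = strip_cells_safe(first_column)
print(safe_rows)
```

This would also explain why only some PDFs fail: the error only appears when a table's first column contains a non-string value.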
