AttributeError: 'int' object has no attribute 'strip' - HierarchicalChunker - V2 DoclingDocument #181

Closed
leviataniac opened this issue Oct 26, 2024 · 4 comments · Fixed by DS4SD/docling-core#58

Comments

leviataniac commented Oct 26, 2024

Dear Docling team, we have a problem in our LlamaIndex pipeline with the hierarchical chunker. Please find the documents attached. The latest version used is 2.2.0, with an inlined DoclingParser taken from the feature request for an official LlamaIndex DoclingParser.

[Greenalia_.pdf](https://github.com/user-attachments/files/17529473/Greenalia_.pdf)
[ACCIONA_compressed-1-1.pdf](https://github.com/user-attachments/files/17529462/ACCIONA_compressed-1-1.pdf)

  1. Parsing seems to be fine, so the converter with the JSON ExportType is working.
  2. Then adding to the Milvus vector store with the node parser/transformations, which are based on HierarchicalChunker, fails with the following error:

2024-10-25 16:51:57,840 - ERROR - 58ebd550-a879-466f-9903-007bf468ebf5 - Exception details:
Traceback (most recent call last):
  File "/Users/C/test/processor.py", line 336, in process_single_pdf
    index = self.upload_to_milvus(pdf_path, milvus_url, cleaned_milvus_coll, ingest)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/C/test/processor.py", line 292, in upload_to_milvus
    index = VectorStoreIndex.from_documents(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/indices/base.py", line 112, in from_documents
    nodes = run_transformations(
            ^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/ingestion/pipeline.py", line 100, in run_transformations
    nodes = transform(nodes, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 311, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 193, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/node_parser/interface.py", line 165, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/llama_index/core/instrumentation/dispatcher.py", line 311, in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/C/test/processor.py", line 211, in _parse_nodes
    for i, chunk in enumerate(chunk_iter):
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/docling_core/transforms/chunker/hierarchical_chunker.py", line 211, in chunk
    text = self._triplet_serialize(table_df=table_df)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/docling_core/transforms/chunker/hierarchical_chunker.py", line 132, in _triplet_serialize
    rows = [item.strip() for item in table_df.iloc[:, 0].to_list()]
            ^^^^^^^^^^
AttributeError: 'int' object has no attribute 'strip'

Some converted PDF documents seem to be fine, but some fail with this error. They could be converted with the inlined 1.x version and JSON export.

Do you have any idea where I can look? Thank you.

@vagenas
Contributor

vagenas commented Oct 26, 2024

Hi @leviataniac, the fix is probably going to be as simple as this. Will let you know once it's released.
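
For anyone hitting the same traceback before upgrading, here is a minimal sketch of the kind of guard that avoids it. It is not the actual patch from DS4SD/docling-core#58: the helper name `strip_cells` and the sample values are hypothetical, and a plain list stands in for the result of `table_df.iloc[:, 0].to_list()` seen in the traceback. The idea is simply to coerce each cell to `str` before calling `.strip()`.

```python
from typing import Any, Iterable, List


def strip_cells(cells: Iterable[Any]) -> List[str]:
    """Coerce every cell to str before stripping, so non-string cells do not raise."""
    return [str(cell).strip() for cell in cells]


# Stand-in for `table_df.iloc[:, 0].to_list()`; the mixed values are made up.
first_column = [" Revenue ", 2023, 4.5, "Total "]

# The pattern from the traceback fails as soon as it meets a non-string cell.
error_message = None
try:
    [cell.strip() for cell in first_column]
except AttributeError as exc:
    error_message = str(exc)

# The guarded version handles the same input without raising.
rows = strip_cells(first_column)
print(error_message)
print(rows)
```

Coercing with `str()` keeps numeric cells as text rather than dropping them, which seems like the safer default for a chunker that serializes tables; whether the released fix does exactly this is an assumption.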

@leviataniac
Author

Thx @vagenas. Much appreciated!!

@vagenas
Contributor

vagenas commented Oct 26, 2024

@leviataniac please update your docling-core and check again.

@leviataniac
Author

@vagenas thx. Now it's working. :) Simple fix, helps a lot.
