Is the purpose of this project to interpret and comprehensively analyze the content of PDF documents? #43

Bruce337f · 2024-05-06T08:46:35Z

Description

When I installed the project and ran the code from the example, I easily obtained the text content of the PDF. This is a very convenient project!

What puzzles me is that the developer also provides sample code for OpenAI. Does this mean OpenAI can be used to generate summaries of the PDF content, or to analyze its themes?

Bruce337f · 2024-05-06T08:48:13Z

Author

I would also like to ask whether the following projects can edit specific content in a PDF in detail, and how they relate to this project:

Dealing with PDFs:

pdfminer.six: fully open source.

Extracting tables:

PyMuPDF has some table detection functionality. Please see their license.
Table Transformer is a deep learning approach.
unitable is another transformers-based approach with state-of-the-art performance.

This is a good project!

Filimoa · 2024-05-06T14:12:43Z

Maintainer

The first step of a RAG pipeline is splitting the document into "semantic" chunks (parts of the doc that talk about the same thing). This library is primarily aimed at that use case; embeddings are a nice way of doing this.
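To make the idea of "semantic" chunks concrete, here is a minimal sketch, not this library's implementation. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and a hypothetical `semantic_chunks` helper with an arbitrary similarity `threshold`. Consecutive text blocks are merged while each one stays similar to the block before it.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(blocks: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive blocks while each stays similar to the previous one."""
    chunks: list[list[str]] = []
    for block in blocks:
        if chunks and cosine(embed(chunks[-1][-1]), embed(block)) >= threshold:
            chunks[-1].append(block)  # same topic: extend the current chunk
        else:
            chunks.append([block])  # topic shift: start a new chunk
    return chunks


blocks = [
    "Revenue grew because subscription revenue increased.",
    "Subscription revenue is the largest share of revenue.",
    "The office cafeteria now serves vegetarian lunch.",
]
chunks = semantic_chunks(blocks)
for chunk in chunks:
    print(chunk)
```

With a real embedding model the threshold would need tuning, and a real pipeline would typically also enforce minimum and maximum chunk sizes.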

1 reply

Bruce337f May 7, 2024
Author

Thanks for your reply, I will go through the code again!

To be honest, it feels very similar to AI customer service: extraction, summarization, delivery.

Bruce337f · 2024-05-07T10:08:59Z

Author

About the semantic processing example:

from openparse import processing, DocumentParser
semantic_pipeline = processing.SemanticIngestionPipeline(
    openai_api_key=OPEN_AI_KEY,
    model="text-embedding-3-large",
    min_tokens=64,
    max_tokens=1024,
)
parser = DocumentParser(
    processing_pipeline=semantic_pipeline,
)
parsed_content = parser.parse(basic_doc_path)

Could you please tell me whether it can be combined with the code below

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes=nodes)
query_engine = index.as_query_engine()
response = query_engine.query("What do they do to make money?")
print(response)

to form an "answer any question" PDF analysis assistant?
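For reference, the retrieval step that such an assistant depends on can be sketched without either library. This is only an illustration under stated assumptions: `tokens`, `jaccard`, and `retrieve` are hypothetical helpers, word overlap stands in for embedding similarity, and the chunk texts are made up. Note also that the second snippet above uses `nodes` without defining it, so the parsed chunks would first have to be converted into nodes.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = tokens(question)
    ranked = sorted(chunks, key=lambda c: jaccard(q, tokens(c)), reverse=True)
    return ranked[:top_k]


# Made-up chunks standing in for the output of a document parser.
chunks = [
    "The company makes money by selling software subscriptions.",
    "The headquarters moved to a new building last year.",
]
best = retrieve("How do they make money?", chunks)
print(best)
```

In a full pipeline, the retrieved chunks would then be passed to a language model together with the question to produce the final answer.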
