Method to convert ParsedDocument object to LlamaIndex Document object #63

Open
mjspeck opened this issue Aug 22, 2024 · 1 comment

mjspeck commented Aug 22, 2024

Description

In addition to the existing to_llama_index_nodes method, it would be great to have a to_llama_index_document method on the openparse.schemas.ParsedDocument class that returns a valid llama_index.core.schema.Document object.

Owner

Filimoa commented Aug 22, 2024

Can you point me to documentation that explains how Nodes and Documents are related in llama_index? From what I understand, a Document is just a parent Node.

This is the current implementation.

    def to_llama_index_nodes(self):
        try:
            from llama_index.core.schema import Document as LlamaIndexDocument
        except ImportError as err:
            raise ImportError(
                "llama_index is not installed. Please install it with `pip install llama-index`."
            ) from err

        li_doc = LlamaIndexDocument(
            id_=self.id_,
            metadata={
                "file_name": self.filename,
                "file_size": self.file_size,
                "creation_date": self.creation_date.isoformat(),
                "last_modified_date": self.last_modified_date.isoformat(),
            },
            excluded_embed_metadata_keys=[
                "file_size",
                "creation_date",
                "last_modified_date",
            ],
            excluded_llm_metadata_keys=[
                "file_name",
                "file_size",
                "creation_date",
                "last_modified_date",
            ],
        )
        li_nodes = self._nodes_to_llama_index(li_doc)

        return li_nodes

    def _nodes_to_llama_index(self, llama_index_doc):
        try:
            from llama_index.core.schema import NodeRelationship
        except ImportError as err:
            raise ImportError(
                "llama_index is not installed. Please install it with `pip install llama-index`."
            ) from err

        li_nodes = [node.to_llama_index() for node in sorted(self.nodes)]
        for i in range(len(li_nodes) - 1):
            li_nodes[i].relationships[NodeRelationship.NEXT] = li_nodes[
                i + 1
            ].as_related_node_info()

            li_nodes[i + 1].relationships[NodeRelationship.PREVIOUS] = li_nodes[
                i
            ].as_related_node_info()

        for li_node in li_nodes:
            li_node.relationships[NodeRelationship.PARENT] = (
                llama_index_doc.as_related_node_info()
            ) # NOTE: A DOC IS JUST A NODE?

        return li_nodes
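For discussion, here is a rough sketch of what the requested to_llama_index_document could look like. It is not part of openparse. It assumes the Document should carry the same id and metadata as the parent Document built above, plus the text of all nodes joined by blank lines. The names StubNode, StubParsedDocument, document_kwargs and the separator parameter are hypothetical, and the only thing assumed about an openparse node is a text attribute. The stubs exist so the keyword-building step runs without llama_index installed; the final function needs it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class StubNode:
    """Hypothetical stand-in for an openparse node; only `text` is assumed."""

    text: str


@dataclass
class StubParsedDocument:
    """Hypothetical stand-in with the attributes used by to_llama_index_nodes."""

    id_: str
    filename: str
    file_size: int
    creation_date: datetime
    last_modified_date: datetime
    nodes: List[StubNode] = field(default_factory=list)


def document_kwargs(parsed: Any, separator: str = "\n\n") -> Dict[str, Any]:
    """Build keyword arguments for a llama_index Document.

    Assumption: the Document text is the node texts joined in the order
    given. The real method would likely iterate over sorted(self.nodes).
    """
    hidden_keys = ["file_size", "creation_date", "last_modified_date"]
    return {
        "id_": parsed.id_,
        "text": separator.join(node.text for node in parsed.nodes),
        "metadata": {
            "file_name": parsed.filename,
            "file_size": parsed.file_size,
            "creation_date": parsed.creation_date.isoformat(),
            "last_modified_date": parsed.last_modified_date.isoformat(),
        },
        # Same exclusions as the parent Document in to_llama_index_nodes.
        "excluded_embed_metadata_keys": list(hidden_keys),
        "excluded_llm_metadata_keys": ["file_name", *hidden_keys],
    }


def to_llama_index_document(parsed: Any):
    """Hypothetical method body; requires llama_index to be installed."""
    try:
        from llama_index.core.schema import Document as LlamaIndexDocument
    except ImportError as err:
        raise ImportError(
            "llama_index is not installed. "
            "Please install it with `pip install llama-index`."
        ) from err

    return LlamaIndexDocument(**document_kwargs(parsed))
```

Whether the Document should hold the full text, or stay empty and act only as the PARENT reference for the nodes as it does today, is the open design question here.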
