How to check if a document exists before inserting it into the index?
Replies: 1 comment
To check if a document exists before inserting it into the index, you can use the `_handle_duplicates` or `_handle_upserts` methods from the `IngestionPipeline` class. These methods check for existing documents based on their hashes or IDs.

Here is an example of a hash-based existence check in the style of `_handle_duplicates`:

```python
from typing import List

# Import paths assume llama-index >= 0.10 (the `llama_index.core` layout).
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.schema import BaseNode


def check_document_exists(pipeline: IngestionPipeline, nodes: List[BaseNode]) -> List[BaseNode]:
    """Check if documents exist in the docstore before inserting.

    Args:
        pipeline: The ingestion pipeline instance.
        nodes: The nodes (documents) to check.

    Returns:
        List of nodes that do not exist in the docstore.
    """
    if pipeline.docstore is None:
        raise ValueError("Docstore is not configured in the pipeline.")

    # Hashes of every document already tracked by the docstore.
    existing_hashes = pipeline.docstore.get_all_document_hashes()

    # Keep only the nodes whose content hash has not been seen before.
    nodes_to_insert = []
    for node in nodes:
        if node.hash not in existing_hashes:
            nodes_to_insert.append(node)
    return nodes_to_insert


# Example usage
pipeline = IngestionPipeline(...)  # Initialize your pipeline
nodes = [...]  # List of nodes (documents) to check

nodes_to_insert = check_document_exists(pipeline, nodes)
if nodes_to_insert:
    # Proceed with inserting nodes_to_insert into the index
    pass
else:
    print("All documents already exist in the index.")
```

This function compares each node's hash against the hashes already stored in the docstore and returns only the nodes that are not there yet. It raises a `ValueError` if the pipeline has no docstore configured.
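As a rough illustration of how a check by both ID and hash can work, here is a standalone sketch that does not use LlamaIndex at all. `Doc` and `split_by_existence` are hypothetical names made up for this example, and the logic is an assumption about the general upsert idea, not the actual `_handle_upserts` implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a node: an ID plus its text."""

    doc_id: str
    text: str

    @property
    def hash(self) -> str:
        # Content hash used to detect whether the text has changed.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def split_by_existence(
    stored_hashes: Dict[str, str], docs: List[Doc]
) -> Tuple[List[Doc], List[Doc], List[Doc]]:
    """Classify docs as new, changed, or unchanged.

    stored_hashes maps doc_id -> hash of the previously stored version.
    """
    new: List[Doc] = []
    changed: List[Doc] = []
    unchanged: List[Doc] = []
    for doc in docs:
        old_hash = stored_hashes.get(doc.doc_id)
        if old_hash is None:
            new.append(doc)        # ID never seen: insert
        elif old_hash != doc.hash:
            changed.append(doc)    # ID seen, content differs: update
        else:
            unchanged.append(doc)  # ID and content match: skip
    return new, changed, unchanged


# Pretend the store already holds documents "a" and "b".
stored = {"a": Doc("a", "hello").hash, "b": Doc("b", "old text").hash}
incoming = [Doc("a", "hello"), Doc("b", "new text"), Doc("c", "fresh")]

new, changed, unchanged = split_by_existence(stored, incoming)
print([d.doc_id for d in new])        # ['c']
print([d.doc_id for d in changed])    # ['b']
print([d.doc_id for d in unchanged])  # ['a']
```

A hash-only check skips exact duplicates but cannot tell an edited document from a brand new one. Tracking the ID alongside the hash is what lets a changed document be updated instead of being inserted a second time.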