[DERCBOT-1168] Indexing improvements #1766

Open · wants to merge 1 commit into master (from feature/dercbot-1168)

Conversation

@assouktim (Contributor) commented Oct 15, 2024

This pull request makes several changes to the gen-ai/orchestrator-server project: it renames metadata keys, updates the indexing tool to handle new arguments, improves logging, and updates the documentation accordingly.

  • Metadata key updates
  • Indexing tool improvements
  • Documentation updates
  • Logging enhancements

@Benvii (Member) left a comment:

Thanks for this PR

df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
    df_filtered['source'] = None
Member:

With this applied to all PDF chunks, we completely lose the file name / location the chunk came from, which will make any debugging / analysis of RAG traces a real nightmare.

It should at least be kept in a metadata field.
Do you have an explanation for why we couldn't keep a file path as a source?
Is it because of the AnyUrl here?

The AnyUrl type is based on the url Rust crate; it supports file URLs (but only absolute ones). For instance, the following example works:

from pydantic import AnyUrl
file_url = AnyUrl('file:///tmp/ah.gif')

Why not keep the URL using the file scheme? If needed, we could fix Goulven's original PDF parsing tool script.

assouktim (Contributor, Author):

I don't know if you remember, but we discussed the fact that the PDF URLs point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and mark the PDF documents as unsourced.

I see two things:

  • Yes, I'm in favor of keeping this information in the metadata.
  • We need to modify the code that processes the PDFs so that the source is the Google Drive URL of the PDF, which can be given/exposed to the end user.

assouktim (Contributor, Author):

And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether or not to ignore the source during indexing.

Member:

Ok, so we could add the source as metadata, for instance "original_source" (every time, so that when it's ignored we still have the PDF filename present in the metadata).
Can you just add this metadata?

It will then be used when we debug RAG traces to understand where the chunk came from.
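
A minimal sketch of what keeping the original source as extra metadata could look like (the 'source' and 'text' columns come from the snippet above; the 'original_source' key, the function wrapper, and the import path are illustrative assumptions, not the merged implementation):

import pandas as pd
from langchain_community.document_loaders import DataFrameLoader

def load_chunks(df: pd.DataFrame, ignore_source: bool):
    # Always keep the raw source (e.g. a file:// path) so RAG traces can be debugged
    df['original_source'] = df['source']
    if ignore_source:
        # Hide the source from end users; 'original_source' still records it
        df['source'] = None
    else:
        df['source'] = df['source'].replace('UNKNOWN', None)
    # Every column except 'text' ends up in each Document's metadata
    loader = DataFrameLoader(df, page_content_column='text')
    return loader.load()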

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
Member:

It should be documented in the README.md that logs are now written to this folder. Thanks for adding it 👍️
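
For reference, a minimal sketch of how such a timestamped log file could be wired into the standard logging module (the basicConfig call, log level, and format are assumptions; the actual script may configure its handlers differently):

import logging
from datetime import datetime
from pathlib import Path

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)
log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"

# Write log records both to the timestamped file and to the console
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[logging.FileHandler(log_file_name), logging.StreamHandler()],
)
logging.debug('Index chunks in DB')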

@assouktim assouktim changed the title Indexing improvements [DERCBOT-1168] Indexing improvements Nov 12, 2024
@assouktim assouktim marked this pull request as ready for review November 12, 2024 13:50
@Benvii (Member) left a comment:

Make it clearer that indexing_documents.py isn't async, and see my comment about adding an optional argument to configure the bulk size; it will be helpful to adjust it depending on the embedding server's supported TPM.

Some conflicts also need to be resolved.

@@ -14,14 +14,14 @@ pandas = "^2.2.1"
 openpyxl = "^3.1.2"
 beautifulsoup4 = "^4.12.2"
 langchain = "^0.3.3"
-langsmith = "^0.1.132"
+langsmith = "^0.1.134"
Member:

Good to update it, but do we still need langsmith? It's not officially a supported observability provider.

assouktim (Contributor, Author):

Yes, we're keeping langsmith for now.

 # Index all chunks in vector DB
 logging.debug('Index chunks in DB')
-# Index respecting bulk_size (500 is from_documents current default: it is described for clarity only)
-bulk_size = 500
+bulk_size = 100  # Adjust bulk_size to suit your use case
Member:

As I understand it, this is the only way to limit the request token rate?

The number of concurrent tasks in the asyncio loop would be another way to limit it, but I see that this code, despite having an async main, isn't async at all. Could you remove all the async/await code, so that people don't mistake this script for an async one, and maybe add a comment to explain why?

Can you make it configurable with an optional argument using docopt:

"""
Options:
  --embedding-bulk-size=<bs>  Number of chunks sent in each embedding request [default: 100].
"""

@assouktim force-pushed the feature/dercbot-1168 branch 3 times, most recently from 6011648 to 41df4eb, on November 21, 2024, 13:39
Project status: Review in progress

2 participants