[DERCBOT-1168] Indexing improvements #1766
base: master
Conversation
Thanks for this PR
```python
df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
    df_filtered['source'] = None
```
Having this for all PDF chunks, we completely lose the file name / location the chunk came from, which will make any debugging / analysis of RAG traces a real nightmare. It should at least be kept in the metadata.
Do you have an explanation for why we couldn't keep a file path as a source? Is it because of the AnyUrl here?
The AnyUrl type is based on the Rust url crate; it supports file URLs (but only absolute URLs). For instance, the following example works:

```python
from pydantic import AnyUrl

file_url = AnyUrl('file:///tmp/ah.gif')
```

Why not keep the URL using the file scheme? If needed, we could fix Goulven's original PDF parsing tool script.
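For the record, a quick sketch (standard library plus pydantic; the path is a hypothetical example) of how an absolute local path can be turned into a file:// URL that AnyUrl accepts:

```python
from pathlib import Path
from pydantic import AnyUrl

# Path.as_uri() only works on absolute paths, which matches AnyUrl's
# restriction to absolute URLs.
pdf_path = Path('/data/docs/report.pdf')  # hypothetical location
source = AnyUrl(pdf_path.resolve().as_uri())
print(source)  # file:///data/docs/report.pdf
```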
I don't know if you remember, but we discussed the fact that the PDF URLs point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and treat the PDF documents as unsourced.
I see two things:
- Yes, I'm in favor of keeping this information in the metadata.
- We need to modify the code that processes the PDFs so it records the Google Drive URL of each PDF, which can be given/exposed to the end user.
And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether or not to ignore the source during indexing.
OK, so we could add the source as metadata, for instance "original_source" (every time, so that when the source is ignored we still have the PDF filename in the metadata).
Can you just add this metadata?
It will then be used when we debug RAG traces to understand where a chunk came from.
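A minimal sketch of what that could look like around the snippet quoted above (the column names and the use of a single df are assumptions, not the final implementation):

```python
from langchain_community.document_loaders import DataFrameLoader

# df and args come from the existing script; always preserve the raw source.
df['original_source'] = df['source']
if bool(args['<ignore_source>']):
    df['source'] = None  # user-facing source is dropped, original_source remains

loader = DataFrameLoader(df, page_content_column='text')
docs = loader.load()
# Every non page-content column, including 'original_source', ends up in Document.metadata.
```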
```python
log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
```
It should be documented in the README.md that logs are now written to this folder. Thanks for adding it 👍️
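For reference, a minimal sketch (not the exact PR code) of routing that log file to both file and console handlers:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
    handlers=[
        logging.FileHandler(log_file_name),  # written under logs/, see snippet above
        logging.StreamHandler(),             # mirrored to the console
    ],
)
logging.debug('Logging initialised')
```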
Make it clearer that index_documents.py isn't async, and see my comment about adding an optional argument to configure the bulk size; it will be helpful to adjust it depending on the TPM supported by the embedding server.
Some conflicts also need to be resolved.
```diff
@@ -14,14 +14,14 @@ pandas = "^2.2.1"
 openpyxl = "^3.1.2"
 beautifulsoup4 = "^4.12.2"
 langchain = "^0.3.3"
-langsmith = "^0.1.132"
+langsmith = "^0.1.134"
```
Good to update it, but do we still need langsmith? It's not officially a supported observability provider.
Yes, we're keeping langsmith for now.
```diff
 # Index all chunks in vector DB
 logging.debug('Index chunks in DB')
 # Index respecting bulk_size (500 is from_documents current default: it is described for clarity only)
-bulk_size = 500
+bulk_size = 100  # Adjust bulk_size to suit your use case
```
As I understand it, this is the only way to limit the request token rate?
The number of concurrent tasks in the asyncio loop would be another way to limit it, but I see that this code, despite its async main, isn't async at all. Could you remove all the async / await code, so that people don't mistakenly assume this script is async, and maybe add a comment to explain why?
Can you make it configurable with an optional argument using docopt:

```
"""
Options:
    --embedding-bulk-size=<bs>   Number of chunks sent in each embedding request [default: 100].
"""
```
This pull request includes several changes to the gen-ai/orchestrator-server project, focusing on updating metadata keys, improving the indexing tool, and enhancing logging capabilities. The most important changes include renaming metadata keys, modifying the indexing tool to handle new arguments, and updating the documentation accordingly.

Metadata key updates:
- gen-ai/orchestrator-server/src/main/python/server/src/gen_ai_orchestrator/services/langchain/qa_chain.py: Changed the metadata key from `url` to `source` in the `execute_qa_chain` function.
- gen-ai/orchestrator-server/src/main/python/server/tests/services/test_qa_chain.py: Updated tests to reflect the change from `url` to `source` in metadata.

Indexing tool improvements:
- gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/index_documents.py: Added an `ignore_source` argument to handle cases where sources aren't valid URLs, and improved handling of CSV input and logging.

Documentation updates:
- gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/README.md: Updated the documentation to reflect the change from `url` to `source` in CSV columns and other relevant sections.

Logging enhancements:
- gen-ai/orchestrator-server/src/main/python/tock-llm-indexing-tools/index_documents.py: Enhanced the logging setup with file and console handlers, and added detailed logging for the indexing process.
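To illustrate the first point, the rename means consumers of the RAG chain now read a chunk's location from the `source` metadata key instead of `url` (a hypothetical helper for illustration, not the actual qa_chain.py code):

```python
from langchain_core.documents import Document

def chunk_location(doc):
    # Previously: doc.metadata.get('url')
    return doc.metadata.get('source')

doc = Document(page_content='…', metadata={'source': 'file:///tmp/file.pdf'})
print(chunk_location(doc))  # file:///tmp/file.pdf
```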