[DERCBOT-1168] Indexing improvements #1766

Open · wants to merge 1 commit into master (from feature/dercbot-1168)

Conversation

@assouktim (Contributor) commented Oct 15, 2024

This pull request makes several changes to the gen-ai/orchestrator-server project: it renames metadata keys, updates the indexing tool to handle new arguments, improves logging, and updates the documentation accordingly.

  • Metadata key updates
  • Indexing tool improvements
  • Documentation updates
  • Logging enhancements

@Benvii (Member) left a comment:

Thanks for this PR

df['source'] = df['source'].replace('UNKNOWN', None)
loader = DataFrameLoader(df, page_content_column='text')
if bool(args['<ignore_source>']):
    df_filtered['source'] = None
Member:

With this applied to all PDF chunks, we completely lose the file name / location the chunk came from, which will make any debugging / analysis of RAG traces a real nightmare.

It should at least be kept in a metadata field.
Do you have an explanation for why we couldn't keep a file path as a source?
Is it because of the AnyUrl here?

The AnyUrl type is based on the url Rust crate; it supports file URLs (but only absolute ones). For instance, the following example works:

from pydantic import AnyUrl
file_url = AnyUrl('file:///tmp/ah.gif')

Why not keep the URL using the file scheme? If needed, we could fix Goulven's original PDF parsing tool script.

assouktim (Contributor, Author):

I don't know if you remember, but we discussed the fact that the PDF URLs point to Goulven's personal folder, and we can't consider that a valid link for end users, since they don't have access to that path. So we decided to remove them and mark the PDF documents as unsourced.

I see two things:

  • Yes, I'm in favor of keeping this information in the metadata.
  • We need to modify the code that processes the PDFs so that the source is the Google Drive URL of the PDF, which can be given/exposed to the end user.

assouktim (Contributor, Author):

And yes, the source can be a URI (file:///tmp/file.pdf); this flag lets you choose whether or not to ignore the source during indexing.

Member:

Ok, so we could add the source as metadata, for instance "original_source" (every time, so that when it's ignored we still have the PDF filename present in the metadata).
Can you just add this metadata?

It will then be used when we debug RAG traces to understand where the chunk came from.
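
A minimal sketch of what keeping the original source as extra metadata could look like (the 'source' and 'text' columns come from the snippet above; the 'original_source' key, the function wrapper, and the import path are illustrative assumptions, not the merged implementation):

import pandas as pd
from langchain_community.document_loaders import DataFrameLoader

def load_chunks(df: pd.DataFrame, ignore_source: bool):
    # Always keep the raw source (e.g. a file:// path) so RAG traces can be debugged
    df['original_source'] = df['source']
    if ignore_source:
        # Hide the source from end users; 'original_source' still records it
        df['source'] = None
    else:
        df['source'] = df['source'].replace('UNKNOWN', None)
    # Every column except 'text' ends up in each Document's metadata
    loader = DataFrameLoader(df, page_content_column='text')
    return loader.load()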

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)

log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
Member:

It should be documented in the README.md that logs are now written to this folder. Thanks for adding it 👍️
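
For reference, a minimal sketch of how such a timestamped log file could be wired into the standard logging module (the basicConfig call, log level, and format are assumptions; the actual script may configure its handlers differently):

import logging
from datetime import datetime
from pathlib import Path

log_dir = Path('logs')
log_dir.mkdir(exist_ok=True)
log_file_name = log_dir / f"index_documents_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"

# Write log records both to the timestamped file and to the console
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(message)s',
    handlers=[logging.FileHandler(log_file_name), logging.StreamHandler()],
)
logging.debug('Index chunks in DB')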

@assouktim assouktim changed the title Indexing improvements [DERCBOT-1168] Indexing improvements Nov 12, 2024
@assouktim assouktim marked this pull request as ready for review November 12, 2024 13:50
@Benvii (Member) left a comment:

Make it clearer that indexing_documents.py isn't async, and see my comment about adding an optional argument to configure the bulk size; it will be helpful to adjust it depending on the embedding server's supported TPM.

Some conflicts also need to be resolved.

@@ -14,14 +14,14 @@ pandas = "^2.2.1"
 openpyxl = "^3.1.2"
 beautifulsoup4 = "^4.12.2"
 langchain = "^0.3.3"
-langsmith = "^0.1.132"
+langsmith = "^0.1.134"
Member:

Good to update it, but do we still need langsmith? It's not officially a supported observability provider.

assouktim (Contributor, Author):

Yes, we're keeping langsmith for now.

 # Index all chunks in vector DB
 logging.debug('Index chunks in DB')
-# Index respecting bulk_size (500 is from_documents current default: it is described for clarity only)
-bulk_size = 500
+bulk_size = 100  # Adjust bulk_size to suit your use case
Member:

As I understand it, this is the only way to limit the request token rate?

The number of concurrent tasks in the asyncio loop would be another way to limit it, but I see that this code, despite having an async main, isn't async at all. Could you remove all the async/await code, so that people don't mistake this script for an async one, and maybe add a comment to explain why?

Can you make it configurable with an optional argument using docopt:

"""
Options:
  --embedding-bulk-size=<bs>  Number of chunks sent in each embedding request [default: 100].
"""

@assouktim force-pushed the feature/dercbot-1168 branch 3 times, most recently from 6011648 to 41df4eb, on November 21, 2024, 13:39
Project status: Review in progress

2 participants