[GDrive] Add caching of downloaded files #492
Open
What's being changed:
This PR adds an optional caching feature to the Google Drive connector. Downloaded documents can be stored either in Redis or in the Python process itself using cachetools, which needs no additional service.
By default, downloaded documents are cached for 1 hour.
The purpose of this change is to speed up the connector's response time when a user keeps asking questions whose search queries return the same documents repeatedly. In my testing, downloading the documents was the most time-consuming part of responding to a search request, since Google responds relatively quickly to the Google Drive search query itself.
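For readers who want a picture of the behaviour described above, here is a minimal sketch of a download path wrapped in a time-limited cache. It is not the code from this PR: `TTLDocumentCache`, `fetch_document`, and the `download` callable are hypothetical names, and the small standard-library class stands in for what cachetools or Redis key expiry would provide. Only the 1 hour default comes from the description.

```python
import time
from typing import Callable, Dict, Optional, Tuple

DEFAULT_TTL_SECONDS = 60 * 60  # 1 hour, the default stated in this PR


class TTLDocumentCache:
    """Hypothetical in-process cache whose entries expire after a fixed TTL."""

    def __init__(
        self,
        ttl: float = DEFAULT_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, file_id: str) -> Optional[bytes]:
        entry = self._entries.get(file_id)
        if entry is None:
            return None
        expires_at, content = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[file_id]
            return None
        return content

    def set(self, file_id: str, content: bytes) -> None:
        self._entries[file_id] = (self._clock() + self._ttl, content)


def fetch_document(
    file_id: str,
    cache: TTLDocumentCache,
    download: Callable[[str], bytes],
) -> bytes:
    """Return the cached content if present, otherwise download and cache it."""
    cached = cache.get(file_id)
    if cached is not None:
        return cached
    content = download(file_id)  # assumed to be the slow call to Google Drive
    cache.set(file_id, content)
    return content
```

With this shape, repeated searches that return the same file within the TTL only pay the download cost once; after the TTL elapses the next request downloads a fresh copy.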
How did you test this change (include any code snippets, API requests, screenshots, or gifs):
I made requests from Coral to the connector running in my local environment, via ngrok. Locally, I ran the connector with the new `GDRIVE_CACHE_TYPE` env var unset, with a blank value, and with the `memory` and `redis` options. In `provider/async_download.py` I had log statements showing which files were being downloaded and how long each download took, and with the debug logging I could see whether it was getting cache hits. I have not committed the debug logging that I was using.
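The four configurations exercised above can be summarised as a small sketch of how the env var might be interpreted. This is an assumption about the behaviour, not the code in this PR: `resolve_cache_type` is a hypothetical helper, and both the case-insensitive matching and the error on unknown values are illustrative choices. What the description does state is that an unset or blank `GDRIVE_CACHE_TYPE` was tested alongside `memory` and `redis`; here the first two are assumed to mean caching is off.

```python
import os
from typing import Mapping, Optional

SUPPORTED_CACHE_TYPES = ("memory", "redis")


def resolve_cache_type(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Map GDRIVE_CACHE_TYPE to a backend name, or None when caching is off."""
    raw = env.get("GDRIVE_CACHE_TYPE", "").strip().lower()
    if not raw:
        # Unset or blank: assumed to leave caching disabled.
        return None
    if raw not in SUPPORTED_CACHE_TYPES:
        # Assumption: fail loudly rather than silently ignore a typo.
        raise ValueError(f"Unsupported GDRIVE_CACHE_TYPE: {raw!r}")
    return raw
```

Passing the environment in as a mapping keeps the helper easy to exercise for each of the tested cases without modifying the real process environment.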