docs: Add Semantic Caching Tutorial #118
supported feature in Triton Inference Server.

We value your input! If you're interested in seeing semantic caching as a
supported feature in future releases, we encourage you to [FILL IN]
FILL IN
reminder for self
Clearly, the latter 2 requests are semantically similar to the first one, which
resulted in a cache hit scenario, which reduced the latency of our model from
approx 1.1s to the average of 0.048s per request.
Do you have a rough idea of the cache-miss cost, e.g. 1 request without semantic caching vs. 1 request with semantic caching? Just curious about the rough magnitude of the overhead.
I can probably do some estimations for that.
I'm leaving this unresolved to remind myself to do this study as a follow-up.
This looks great! It was very enjoyable to read 🤓
Co-authored-by: Ryan McCormick <[email protected]>
Nice tutorial! Really fun to try it out 🚀
LGTM!
Left some nits and suggested a PR title change - feel free to change
Co-authored-by: Ryan McCormick <[email protected]>
This PR adds a reference implementation of a local semantic caching mechanism.
Note:
Adding a CPU-based index, since that seems sufficient for the current tutorial. A GPU-based index makes more sense when there is a large number of vectors to process.
Opens:
I've added a section called "Interested in This Feature?", which is not finished at the moment. My suggestion for community engagement is to create a GitHub issue and encourage readers and users to vote on it if they are interested.
If there are no objections, I'll proceed with this idea.
[Edit 1] discussion opened -> triton-inference-server/server#7742
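
For readers skimming this thread, the idea of a local semantic cache backed by a CPU-side index can be sketched roughly as below. This is not the code added in this PR. The `embed` function is a toy bag-of-words stand-in for a real sentence encoder, the linear scan stands in for a proper vector index, and `SemanticCache`, `cached_infer`, and the `threshold` value are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity, not exact text."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs, scanned linearly on CPU

    def get(self, prompt):
        """Return the response of the most similar stored prompt, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))


def cached_infer(cache, prompt, model):
    """Return (response, hit); the model is only called on a cache miss."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached, True
    response = model(prompt)
    cache.put(prompt, response)
    return response, False
```

On a miss, the request pays for the embedding and the index lookup on top of the model call, which is the overhead asked about in the review comments. On a hit, the model call is skipped entirely.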