feat(ai-proxy): add streaming support and transformers #12792
Conversation
@RobSerafini @flrgh @locao It is done. This is (I really think...) the largest PR in AI plugins phase 2. Is it okay for someone to do the quality/standards pass whilst I'm writing the docs and the tests, especially if big changes are suggested? We can then meet in the middle.
Force-pushed from c3f0936 to ee70e4b
I have no idea why it thinks the changelog isn't done.
Force-pushed from b8110f1 to 1c4139c
Hey @ttyS0e
That's because you included:
You can check the required format in this doc: https://github.com/Kong/gateway-changelog/blob/v1.0.0/README.md
Force-pushed from 1c4139c to c01b0ab
@locao Yep, I double-broke it myself, trying to figure out what it was. I fixed it now.
Force-pushed from c01b0ab to db197ab
Force-pushed from db197ab to f0ca2cb
spec/fixtures/ai-proxy/unit/real-stream-frames/openai/llm-v1-completions.txt
I have absolutely NO IDEA where these extra 8 commits got picked up.
Force-pushed from e5c70eb to 19033b4
@flrgh Fixed all comments.
Co-authored-by: Michael Martin <[email protected]>
Force-pushed from 624cdde to 48fff98
okay @flrgh NOW I think it's all done.
There are more optimizations that could be made to the SSE loop in kong/llm/init.lua, but I'd rather not get into the weeds until there's good reason to.
This is looking ready to me. 👍
Successfully created cherry-pick PR for |
Summary
Adds "streaming" support to AI Proxy plugin.
Streaming is a mode where a client can specify "stream": true in their request, and the LLM server will stream each piece of the response text (usually token-by-token) as server-sent events. We need to capture each (batch of) event(s) in order to translate them back into our inference format, so that all providers are compatible with the same framework that our users will create on their side.
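As a rough illustration (not taken from this PR), a chat-style request that opts into streaming only differs from a normal one by the "stream" flag:

```json
{
  "messages": [
    { "role": "user", "content": "Tell me a short story." }
  ],
  "stream": true
}
```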
Where "streaming=false" requests proxy directly to the LLM, and look like this:
the new streaming framework captures each event and sends each chunk back to the client as it arrives, like this:
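Again as an illustration rather than the PR's own capture, the streamed variant is a sequence of server-sent events, each carrying a small delta and terminated by a [DONE] sentinel:

```
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"Once"}}]}

data: {"choices":[{"index":0,"delta":{"content":" upon a time..."}}]}

data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```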
and then it exits early. The docs will describe the limitations of this (no response transformer, etc.).
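A minimal sketch of the capture step described above, assuming OpenAI-style "data:" frames; this is illustrative only and is not the actual code in kong/llm/init.lua:

```lua
-- Illustrative only: split a buffered network chunk into SSE events and
-- pull out each "data:" payload so it can be translated into the common
-- inference format. Not the plugin's real implementation.
local cjson = require "cjson.safe"

local function extract_deltas(buffer)
  local deltas = {}
  for frame in buffer:gmatch("(.-)\n\n") do
    local data = frame:match("^data:%s*(.*)")
    if data and data ~= "[DONE]" then
      local event = cjson.decode(data)
      if event and event.choices and event.choices[1].delta then
        deltas[#deltas + 1] = event.choices[1].delta.content or ""
      end
    end
  end
  return deltas
end

-- Example: two frames arriving in one network chunk
local chunk = 'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
           .. 'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
print(table.concat(extract_deltas(chunk)))  -- prints "Hello"
```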
It will also count/estimate tokens for LLM services that have decided not to stream back the token utilisation counts when the message has completed.
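As a sketch of the kind of estimate that makes this possible (an assumed heuristic, not the plugin's actual accounting code), one can approximate roughly four characters per token for English text:

```lua
-- Assumed heuristic, for illustration only: estimate token usage for a
-- provider that never sends a "usage" object at the end of the stream.
local function estimate_tokens(text)
  if not text or #text == 0 then
    return 0
  end
  -- ~4 characters per token is a common rule of thumb for English text
  return math.ceil(#text / 4)
end

-- Accumulate the streamed deltas, then estimate once the stream ends.
local streamed_parts = { "Hello", ", how can I help you today?" }
local full_response = table.concat(streamed_parts)
print(estimate_tokens(full_response))  -- 32 characters -> estimate of 8 tokens
```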
Checklist
changelog/unreleased/kong, or skip-changelog label added on PR if changelog is unnecessary.
README.md

Can this get reviewed for code standards and design, whilst I'm writing tests and docs?
Issue reference
Fix #12680
https://konghq.atlassian.net/browse/KAG-4124