
WIP: Parallel generation implementation #1209

Open
wants to merge 3 commits into main

Conversation

iamlemec
Contributor

This is a basic and relatively simple implementation of parallel generation, both streaming and non-streaming, as considered in #771. It sticks mostly to using the existing high level API functions, except when messing around with the KV cache. The only real optimization is detecting a common prefix among the sequences and decoding that in a single pass, which would cover things like system prompts.

Rather than trying to fit this into the existing functions, it adds new functions: eval_parallel, generate_parallel, etc. Right now it just outputs lists of string results rather than the full JSON formatting; I wanted to see whether this is viable before going down that road.

The existing top-level state variables such as n_tokens, scores, and input_ids are geared toward single-stream generation. The decoding here is position aligned, so n_tokens keeps essentially the same meaning. scores now stores the logits for a single batch across multiple sequences, but in a way that sample still works unchanged. input_ids and draft_model are ignored for now.
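
To make the common-prefix optimization concrete, here is a minimal sketch (not the PR's actual code; the helper name and token ids are made up) of finding the shared prefix among the tokenized prompts so it can be decoded in a single pass:

```python
from typing import List

def common_prefix_length(token_lists: List[List[int]]) -> int:
    """Length of the longest token prefix shared by all sequences."""
    if not token_lists:
        return 0
    limit = min(len(toks) for toks in token_lists)
    for i in range(limit):
        first = token_lists[0][i]
        if any(toks[i] != first for toks in token_lists[1:]):
            return i
    return limit

# Hypothetical tokenized prompts that share a system-prompt-like prefix.
prompts = [
    [1, 887, 526, 263, 8444, 20255, 29889, 1724],
    [1, 887, 526, 263, 8444, 20255, 29889, 11644],
]
n_prefix = common_prefix_length(prompts)  # 7
# The idea in the PR: decode those n_prefix tokens once, share/copy the
# resulting KV cache entries to every sequence id, then batch-decode the
# divergent suffixes position-aligned across sequences.
```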

@abetlen
Owner

abetlen commented Feb 23, 2024

Hey @iamlemec thank you for starting on this!

I think this approach makes sense for batching embeddings but less so for text generation. In the case of embeddings, usually a single user is generating embeddings for several documents and batches up the data before making requests. In the case of text completion, however, you have either 1) a user making a single request for multiple text completions given a single input prompt (i.e. chain of thought) or 2) a server serving concurrent, unaligned requests which can start and stop independently.

I've been working on 2), but it's quite complex and requires care not to break existing APIs. 1) is more straightforward and would be useful on its own; do you think we can adapt this PR in that direction?

@iamlemec
Contributor Author

Thanks for the comments @abetlen!

Yeah, so I think this is basically a superset of (1) right now. If you call create_completion_parallel(n*[prompt]) you'll get back n independent responses for that prompt, computed in parallel. It will even do so efficiently, because it looks for the longest common prefix and computes that in one sequence. To make the interface simpler, you could add an n_parallel argument that kicks in when prompt is just a string. But it's basically free to also handle the case where someone wants a few different prompts in parallel. Does that sound right?
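
As a usage sketch under the current draft interface (the keyword arguments are assumed to mirror create_completion and aren't finalized), case (1) would just be:

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # path is just a placeholder

prompt = "Q: Name a prime number between 10 and 20.\nA:"
n = 4

# Case (1): n independent completions of one prompt, decoded in parallel.
# The repeated prompt is the longest common prefix, so it is evaluated once.
same_prompt_results = llm.create_completion_parallel(n * [prompt], max_tokens=16)

# Different prompts work too; only whatever prefix they share is deduplicated.
mixed_results = llm.create_completion_parallel(
    [prompt, "Q: Name an even prime.\nA:"], max_tokens=16
)
```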

In terms of the JSON responses, I guess we just generate a unique ID for each stream and yield OpenAI-like dictionaries for each new stream token. Did you want to try to integrate it into the core routines or keep them as _parallel counterparts?
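
For concreteness, a sketch of what such a per-stream chunk could look like (the field layout follows the OpenAI text-completion format; the helper name is made up and not what the PR currently emits):

```python
import time
import uuid

def make_stream_chunk(model: str, index: int, text: str, completion_id: str) -> dict:
    """One OpenAI-style streaming chunk for a single parallel stream."""
    return {
        "id": completion_id,              # unique id generated per stream/request
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": index,           # which parallel sequence this token belongs to
                "text": text,             # text of the newly sampled token
                "logprobs": None,
                "finish_reason": None,    # set on the final chunk of the stream
            }
        ],
    }

chunk = make_stream_chunk(
    model="llama-2-7b", index=2, text=" world",
    completion_id=f"cmpl-{uuid.uuid4().hex}",
)
```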

@abetlen
Owner

abetlen commented Feb 23, 2024

@iamlemec for the argument to specify parallelism we would just use the OpenAI n parameter for create_completion and create_chat_completions. I think we should stick to this for now and keep everything else private (_eval_parallel / _generate_parallel) just for the time being. I think the only change to your existing implementation would be adding the option to stop a sequence on something other than reaching the eos token.

Once we have a good way to do that, I can take the token streams and figure out how to merge them into the choices key of the dictionaries returned from create_completion, and adapt the chat completion conversion functions to support this as well.
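
A rough sketch of that merge step, assuming each parallel sequence ends up as a final text plus a finish reason (the helper name and fields here are illustrative, not an existing function):

```python
from typing import Dict, List

def merge_streams_into_choices(texts: List[str], finish_reasons: List[str]) -> Dict:
    """Fold per-sequence outputs into the choices key of one completion dict."""
    return {
        "object": "text_completion",
        "choices": [
            {
                "index": i,
                "text": text,
                "logprobs": None,
                "finish_reason": reason,   # "stop" for eos / stop string, "length" for max_tokens
            }
            for i, (text, reason) in enumerate(zip(texts, finish_reasons))
        ],
    }

response = merge_streams_into_choices(
    ["13 is a prime number.", "17 is a prime number, and"],
    ["stop", "length"],
)
```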
