
WIP: Parallel generation implementation #1209

Open
wants to merge 3 commits into main

Conversation

iamlemec
Contributor

This is a basic and relatively simple implementation of parallel generation, both streaming and non-streaming, as considered in #771. It sticks mostly to using the existing high level API functions, except when messing around with the KV cache. The only real optimization is detecting a common prefix among the sequences and decoding that in a single pass, which would cover things like system prompts.

Rather than trying to fit this into the existing functions, it adds new functions: eval_parallel, generate_parallel, etc. Right now it just outputs lists of string results rather than the full JSON formatting; I wanted to see whether this is viable before going down that road.

The existing top-level state variables such as n_tokens, scores, and input_ids are geared toward single-stream generation. The decoding here is position aligned, so n_tokens keeps essentially the same meaning. scores now stores the logits for a single batch across multiple sequences, but in a way that sample still works unchanged. input_ids and draft_model are ignored for now.
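
To make the common-prefix optimization concrete, here is a minimal sketch (not the PR's actual code; the helper name and token ids are made up) of finding the shared prefix among the tokenized prompts so it can be decoded in a single pass:

```python
from typing import List

def common_prefix_length(token_lists: List[List[int]]) -> int:
    """Length of the longest token prefix shared by all sequences."""
    if not token_lists:
        return 0
    limit = min(len(toks) for toks in token_lists)
    for i in range(limit):
        first = token_lists[0][i]
        if any(toks[i] != first for toks in token_lists[1:]):
            return i
    return limit

# Hypothetical tokenized prompts that share a system-prompt-like prefix.
prompts = [
    [1, 887, 526, 263, 8444, 20255, 29889, 1724],
    [1, 887, 526, 263, 8444, 20255, 29889, 11644],
]
n_prefix = common_prefix_length(prompts)  # 7
# The idea in the PR: decode those n_prefix tokens once, share/copy the
# resulting KV cache entries to every sequence id, then batch-decode the
# divergent suffixes position-aligned across sequences.
```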

@abetlen
Owner

abetlen commented Feb 23, 2024

Hey @iamlemec thank you for starting on this!

I think this approach makes sense for batching embeddings but less so for text generation. In the case of embeddings, usually a single user is generating embeddings for several documents and batches up the data before making requests. In the case of text completion, however, you have either 1) a user making a single request for multiple text completions given a single input prompt (i.e. chain of thought) or 2) a server serving concurrent, unaligned requests which can start and stop independently.

I've been working on 2), but it's quite complex and requires care not to break existing APIs. 1) is more straightforward and would be useful on its own; do you think we can adapt this PR in that direction?

@iamlemec
Contributor Author

Thanks for the comments @abetlen!

Yeah, so I think this is basically a superset of (1) right now. If you call create_completion_parallel(n*[prompt]) you'll get back n independent responses for that prompt, computed in parallel. It will even do so efficiently, because it looks for the longest common prefix and computes that in one sequence. To make the interface simpler, you could add an n_parallel argument that kicks in when prompt is just a string. But it's basically free to also handle the case where someone wants a few different prompts in parallel. Does that sound right?
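
As a usage sketch under the current draft interface (the keyword arguments are assumed to mirror create_completion and aren't finalized), case (1) would just be:

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # path is just a placeholder

prompt = "Q: Name a prime number between 10 and 20.\nA:"
n = 4

# Case (1): n independent completions of one prompt, decoded in parallel.
# The repeated prompt is the longest common prefix, so it is evaluated once.
same_prompt_results = llm.create_completion_parallel(n * [prompt], max_tokens=16)

# Different prompts work too; only whatever prefix they share is deduplicated.
mixed_results = llm.create_completion_parallel(
    [prompt, "Q: Name an even prime.\nA:"], max_tokens=16
)
```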

In terms of the JSON responses, I guess we just generate a unique ID for each stream and yield OpenAI-like dictionaries for each new stream token. Did you want to try to integrate it into the core routines or keep them as _parallel counterparts?
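
For concreteness, a sketch of what such a per-stream chunk could look like (the field layout follows the OpenAI text-completion format; the helper name is made up and not what the PR currently emits):

```python
import time
import uuid

def make_stream_chunk(model: str, index: int, text: str, completion_id: str) -> dict:
    """One OpenAI-style streaming chunk for a single parallel stream."""
    return {
        "id": completion_id,              # unique id generated per stream/request
        "object": "text_completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": index,           # which parallel sequence this token belongs to
                "text": text,             # text of the newly sampled token
                "logprobs": None,
                "finish_reason": None,    # set on the final chunk of the stream
            }
        ],
    }

chunk = make_stream_chunk(
    model="llama-2-7b", index=2, text=" world",
    completion_id=f"cmpl-{uuid.uuid4().hex}",
)
```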

@abetlen
Owner

abetlen commented Feb 23, 2024

@iamlemec for the argument to specify parallelism we would just use the OpenAI n parameter for create_completion and create_chat_completions. I think we should stick to this for now and keep everything else private (_eval_parallel / _generate_parallel) just for the time being. I think the only change to your existing implementation would be adding the option to stop a sequence on something other than reaching the eos token.

Once we have a good way to do that, I can take the token streams and figure out how to merge them into the choices key of the dictionaries returned from create_completion, and adapt the chat completion conversion functions to support this as well.
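
A rough sketch of that merge step, assuming each parallel sequence ends up as a final text plus a finish reason (the helper name and fields here are illustrative, not an existing function):

```python
from typing import Dict, List

def merge_streams_into_choices(texts: List[str], finish_reasons: List[str]) -> Dict:
    """Fold per-sequence outputs into the choices key of one completion dict."""
    return {
        "object": "text_completion",
        "choices": [
            {
                "index": i,
                "text": text,
                "logprobs": None,
                "finish_reason": reason,   # "stop" for eos / stop string, "length" for max_tokens
            }
            for i, (text, reason) in enumerate(zip(texts, finish_reasons))
        ],
    }

response = merge_streams_into_choices(
    ["13 is a prime number.", "17 is a prime number, and"],
    ["stop", "length"],
)
```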
