dummy output for fastchat inference #688

Closed
leiwen83 opened this issue Nov 1, 2023 · 17 comments

@leiwen83
Contributor

leiwen83 commented Nov 1, 2023

Hi @wsxiaoys ,

I am currently investigating the slow response issue reported by @itlackey in #421 (comment).

I tried StarCoder-1B with the fastchat interface, and here is a summary of my findings:

  1. fastchat serving itself is fast enough, and there is actually no API flooding as @itlackey suspected. But it is true that the current response is very slow.
  2. The root cause is that fastchat generates many unnecessary tokens, which are then discarded by the tabby server.
        // I used tabby's own source code for the completion test; the prompt ends at:
        let state = completions::CompletionState::new

        // The completion result returned from fastchat is:
       '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.', logprobs=None, finish_reason='length'

        // tabby chooses to keep only:
        (
            engine.clone(),
            prompt_template,
            args.model.clone(),

So nearly 600 characters are generated, which takes the 3090 about 1.8s, while only ~100 of them are useful. If we could make it stop right at those 100 characters, the response time would drop to roughly 1/6, which would make it fast enough.

There may be two approaches to achieving this, and I'd like to hear your suggestion.

  1. Let the tabby server explicitly pass some stop words to fastchat, like maybe a closing bracket?
  2. Currently in fastchat.rs, I call the fastchat API in a blocking way, i.e. I expect the full response once the call returns. Maybe a streaming call is preferable for code completion, so that the tabby server could cancel further generation after ~100 characters in this case? (A rough sketch follows below.)
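
For option 2, here is a minimal sketch of what a streamed call with early cancellation could look like on the tabby side. This is not the real fastchat.rs code; the endpoint shape assumes an OpenAI-compatible completions API with `stream: true`, and the model name is a placeholder.

    use futures_util::StreamExt;

    // Stream the completion and stop reading once `max_chars` characters have
    // arrived; dropping the stream closes the connection, which lets the
    // server abort further generation. Requires reqwest's `json` and `stream`
    // features.
    async fn complete_streaming(
        api_endpoint: &str,
        prompt: &str,
        max_chars: usize,
    ) -> Result<String, reqwest::Error> {
        let client = reqwest::Client::new();
        let body = serde_json::json!({
            "model": "starcoder-1b",   // placeholder model name
            "prompt": prompt,
            "max_tokens": 128,
            "stream": true,
        });

        let response = client.post(api_endpoint).json(&body).send().await?;
        let mut stream = response.bytes_stream();
        let mut text = String::new();

        while let Some(chunk) = stream.next().await {
            // A real client would parse the SSE "data:" lines here; this only
            // shows the early-exit idea.
            text.push_str(&String::from_utf8_lossy(&chunk?));
            if text.len() >= max_chars {
                break;
            }
        }
        Ok(text)
    }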

Thx~

@wsxiaoys
Member

wsxiaoys commented Nov 1, 2023

Two things you might consider verifying:

  1. Is there any reason the fastchat server doesn't obey the max_tokens output length?
     https://github.com/TabbyML/tabby/blob/main/crates/http-api-bindings/src/fastchat.rs#L65C9-L65C9
     .max_decoding_length(128)
     For /v1/completion, it was set to 128 as default.

  2. If fastchat supports stop words, you might pass per-language stop words to the fastchat server (a rough sketch of the request body is below).
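
For point 2, a rough sketch of attaching per-language stop words to an OpenAI-style request body; the stop-word lists and field names here are illustrative assumptions, not something fastchat is known to require.

    use serde_json::{json, Value};

    // Hypothetical helper: pick stop words by language and put them into the
    // request body alongside the usual sampling options.
    fn build_request(prompt: &str, language: &str) -> Value {
        let stop: &[&str] = match language {
            "rust" => &["\n\n", "\nfn ", "\nimpl "],
            "python" => &["\n\n", "\ndef ", "\nclass "],
            _ => &["\n\n"],
        };

        json!({
            "prompt": prompt,
            "max_tokens": 128,
            "temperature": 0.1,
            "stop": stop,
        })
    }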

@leiwen83
Contributor Author

leiwen83 commented Nov 2, 2023

For 1, I checked max_tokens; it is correctly passed into fastchat, and fastchat does stop at 128 tokens.

For the output I mentioned, although it is over 600 characters, it is actually only 128 tokens.

       '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.clone(),\n            args.model.', logprobs=None, finish_reason='length'

However, from the plugin log, I see that what the plugin received is much shorter than what completions.rs sent out.

INFO - #com.tabbyml.intellijtabby.agent.Agent - Parsed agent output:        '(\n            engine.clone(),\n            prompt_template,\n            args.model.clone(),\n    '

So where does the truncation of the completion text take place?

For 2, fastchat supports stop words, but I am not sure whether they would really help, since in this case there seems to be no valid stop word to use.

@leiwen83
Contributor Author

leiwen83 commented Nov 3, 2023

I root-caused the issue: http-api-bindings currently doesn't read the prompt template from the model's tabby.json, which makes the output far from what we need. #696 fixes it.
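
For reference, the fix boils down to applying the model's FIM template before building the request, roughly along these lines (a minimal sketch; the {prefix}/{suffix} placeholder names are assumptions):

    // Apply a FIM prompt template such as
    // "<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>" taken from the
    // model's tabby.json before the prompt is sent to the HTTP backend.
    fn apply_prompt_template(template: &str, prefix: &str, suffix: &str) -> String {
        template
            .replace("{prefix}", prefix)
            .replace("{suffix}", suffix)
    }

With the model's template applied, the HTTP backend receives the same kind of FIM prompt that the ggml path already builds from tabby.json.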

@wsxiaoys
Member

wsxiaoys commented Nov 3, 2023

I'm glad you've identified the issue, but I don't believe that using #696 is the correct approach to address it.

You can find the relevant code here: https://github.com/TabbyML/tabby/blob/main/crates/http-api-bindings/src/fastchat.rs#L64

In this code, we already have tokens[0] as the prefix and tokens[1] as the suffix. The best course of action would be to simply pass the suffix into the suffix field of fastchat and let fastchat handle the prompt template.
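
If fastchat honored an OpenAI-style suffix field, the request built in fastchat.rs could look roughly like this. This is a sketch only: the split marker and field names are assumptions, and whether fastchat respects a suffix field is exactly what is being debated below.

    use serde_json::{json, Value};

    // Sketch: forward prefix and suffix separately and let the backend build
    // the final FIM prompt. "<MID>" is a placeholder for however fastchat.rs
    // actually derives tokens[0]/tokens[1].
    fn build_fim_request(prompt: &str) -> Value {
        let tokens: Vec<&str> = prompt.split("<MID>").collect();
        let prefix = tokens.first().copied().unwrap_or("");
        let suffix = tokens.get(1).copied().unwrap_or("");

        json!({
            "prompt": prefix,
            "suffix": suffix,
            "max_tokens": 128,
        })
    }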

@leiwen83
Contributor Author

leiwen83 commented Nov 3, 2023

fastchat doesn't recognize prefix/suffix. It only passes the incoming prompt as-is to the next level, e.g. vllm, which does the real inference. So using tokens[0] here would make the inference results differ from what ggml produces, since ggml uses the prompt template from tabby.json.

@wsxiaoys
Member

wsxiaoys commented Nov 3, 2023

I'd suggest patching fastchat to achieve that.

@leiwen83
Contributor Author

leiwen83 commented Nov 4, 2023

I'd suggest patching fastchat to achieve that.

You mean adding a patch on the fastchat side? But that is handled with different logic than how we currently deal with ggml.
In the ggml case, tabby already assembles the FIM prompt before calling inference. Maybe we should put ggml and http-binding at the same level?

@leiwen83
Contributor Author

leiwen83 commented Nov 5, 2023

@wsxiaoys, shall we consider unifying the current prompt generation across ggml/http-binding/ctranslate so that they share a common interface?
I see great demand for http-binding: LLMs are developing very fast, and we may need to change the serving backend from time to time to meet needs such as performance, large-scale serving, or simply enabling a new model quickly.
Making the prompt-handling logic the same would greatly help with identifying issues, with long-term support, and with comparing different solutions.

Once the logic is unified, other serving backends could easily be integrated, like gRPC? For large-scale deployment and performance, an HTTP interface may still carry some overhead compared with gRPC.
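
A rough sketch of the kind of shared interface suggested above; the names are invented for illustration and this is not an existing Tabby trait:

    use async_trait::async_trait;

    // Shared decoding options every backend would receive, so prompt assembly
    // and stop handling live in one place in the tabby server.
    pub struct DecodingOptions {
        pub max_tokens: usize,
        pub stop_words: Vec<String>,
    }

    // One trait implemented by ggml, the HTTP bindings, ctranslate, or a
    // future gRPC backend; the server hands each of them the same
    // already-assembled prompt.
    #[async_trait]
    pub trait CompletionBackend: Send + Sync {
        async fn generate(&self, prompt: &str, options: &DecodingOptions) -> String;
    }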

@leiwen83
Contributor Author

leiwen83 commented Nov 6, 2023

@wsxiaoys, I tested several more cases and found that even with <fim_prefix> <fim_suffix> <fim_middle> set correctly, it still generates dummy output. So shall we consider adding an option to use "\n" as a stop word?

@wsxiaoys
Member

wsxiaoys commented Nov 6, 2023

As mentioned earlier, given that experimental-http can connect to any HTTP API endpoint, I recommend implementing the FastChat-specific logic in an external HTTP server.

@leiwen83
Contributor Author

leiwen83 commented Nov 6, 2023

As mentioned earlier, given that experimental-http can connect to any HTTP API endpoint, I recommend implementing the FastChat-specific logic in an external HTTP server.

fastchat itself is not a good place for such logic. In our local practice, we use fastchat as the control plane for model serving, with vllm as the data-plane backend. That means fastchat only does its job of serving an OpenAI-like interface; it doesn't change the prompt but passes it as-is to vllm.

So do you mean having some other logic sit between the tabby server and fastchat? That seems like quite a burden for code completion. If this logic were kept only inside fastchat.rs, without changing the current experimental-http logic, would that be more acceptable?

@wsxiaoys
Member

wsxiaoys commented Nov 6, 2023

I still prefer the implementation in Tabby's repository to be something generic rather than specific to a vendor (in this case, if you want to customize stop words). Therefore, I suggest either attempting to resolve the issue on FastChat or routing requests through an intermediate API server.
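
The intermediate-server route could be as small as a shim that rewrites the request before forwarding it. A minimal sketch, with the web-framework plumbing omitted; the URL, model name, and stop words are placeholders:

    use serde_json::{json, Value};

    // Receive tabby's generic completion request, add the FastChat-specific
    // details (stop words, model name), and forward it to the FastChat
    // endpoint, returning FastChat's JSON response unchanged.
    async fn forward_to_fastchat(
        client: &reqwest::Client,
        fastchat_url: &str,        // e.g. "http://localhost:8000/v1/completions" (placeholder)
        tabby_request: Value,
    ) -> Result<Value, reqwest::Error> {
        let prompt = tabby_request["prompt"].as_str().unwrap_or("");

        let rewritten = json!({
            "model": "starcoder-1b",   // placeholder model name
            "prompt": prompt,
            "max_tokens": 128,
            "stop": ["\n\n"],          // vendor-specific stop words live here, not in tabby
        });

        client
            .post(fastchat_url)
            .json(&rewritten)
            .send()
            .await?
            .json::<Value>()
            .await
    }

This keeps the vendor-specific behavior outside Tabby's repository, which is the point of the suggestion above.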

@leiwen83
Contributor Author

leiwen83 commented Nov 6, 2023

I see. So I will keep the change on my local side. But do you have any plan to fix this dummy output in general?

@wsxiaoys
Member

wsxiaoys commented Nov 6, 2023

No, experimental features are not actively being considered for maintenance by the team. I also plan to remove the experimental-http device from the default feature set and retain it solely as a reference implementation.

@leiwen83
Contributor Author

leiwen83 commented Nov 6, 2023

No, experimental features are not actively being considered for maintenance by the team. I also plan to remove the experimental-http device from the default feature set and retain it solely as a reference implementation.

Sorry for the confusion; the dummy output I mentioned, even with FIM in place, also exists in ggml serving. That is why I asked whether it could be fixed.

@wsxiaoys
Member

wsxiaoys commented Nov 6, 2023

I don't see that in the discussion in this thread. Feel free to file a bug if you find a supported device (GPU, CUDA, or Metal) is not functioning in any scenario. Thank you!

@leiwen83
Contributor Author

leiwen83 commented Nov 6, 2023

I see. I will try to put together some test cases that can be made public, and then open a separate thread for that.

Thx~

@leiwen83 leiwen83 closed this as completed Nov 6, 2023