RFC: Improve developer experience by anchoring on multimodal use-case #7093

Open
mergennachin opened this issue Nov 26, 2024 · 8 comments

@mergennachin
Contributor

mergennachin commented Nov 26, 2024

🚀 The feature, motivation and pitch

Let's build an example demo app, perhaps in pytorch-labs, which will serve as a forcing function for improving developer experience from the user's perspective. A positive outcome of this demo app would be defining and building new higher-level abstractions (e.g., similar to Pipelines).

At a high level, here's the app we would like to build: an LLM-based assistant with voice input and output. In terms of implementation, it's a three-step process (a minimal Python sketch follows the list below):

  • Given a voice input, convert to text (e.g., Whisper)
  • Run text based LLM (e.g., Llama 1B)
  • Convert text output to voice (e.g., using T5)
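
As a rough illustration of the "prototype in Python first" requirement below, here is a minimal sketch of this flow using HuggingFace `transformers` pipelines. The model checkpoints, task names, and the `voice_chat` helper are illustrative assumptions, not a committed design:

```python
# Hypothetical prototype of the voice -> text -> voice flow using HuggingFace
# pipelines. Model IDs are placeholders and are meant to be swappable
# (Whisper -> Seamless, Llama 1B -> Qwen, etc.).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
# Note: SpeechT5 typically also needs speaker embeddings passed at call time.
tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")

def voice_chat(audio_path: str):
    text_in = asr(audio_path)["text"]                                  # 1. speech -> text
    text_out = llm(text_in, max_new_tokens=128)[0]["generated_text"]  # 2. run the text LLM
    return tts(text_out)                                              # 3. text -> speech ({"audio": ..., "sampling_rate": ...})
```

Each stage could then be exported separately (e.g., to .pte) and reassembled on device with the same three-step structure.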

Here are the requirements:

  • Be able to run on iOS, Android, and desktop.
  • Be able to prototype the e2e flow in Python first, using HuggingFace.
  • Be able to deploy on a laptop easily, without a Python runtime, for testing purposes.
  • Be able to swap the underlying models easily (e.g., Whisper -> Seamless, Llama 1B -> Qwen).
  • Easy to swap Sampler/Tokenizer/KVCache implementations in the LLM (perhaps building on this issue).
  • Easy deployment process to mobile and desktop apps.
  • Everything in OSS.
  • Easy to do performance optimization and debugging (e.g., using mobile accelerators, quantization).

Here's a positive outcome of this demo app:

  • Define and build new higher-level abstractions to make these possible (a rough API sketch follows this list).
  • ExecuTorch and torchchat use this abstraction for text-based LLMs.
  • Llava and multimodal image models use this abstraction.
  • The community can build completely new apps using these new abstractions.
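
To make the "higher-level abstractions" point more concrete, here is a rough, purely hypothetical API sketch; none of these class names exist in ExecuTorch or torchchat today. The idea is that each stage wraps an exported .pte program plus pluggable components, so models, tokenizers, samplers, and KV-cache strategies can be swapped independently:

```python
# Purely hypothetical sketch -- not an existing ExecuTorch/torchchat API.
# Each Stage wraps an exported .pte program plus pluggable components
# (tokenizer, sampler, KV cache), and a Pipeline chains stages together.
from dataclasses import dataclass, field
from typing import Any, Sequence

@dataclass
class Stage:
    pte_path: str                                    # exported ExecuTorch program for this model
    components: dict = field(default_factory=dict)   # e.g. {"tokenizer": ..., "sampler": ..., "kv_cache": ...}

    def run(self, x: Any) -> Any:
        # Placeholder: a real implementation would load and execute the .pte program.
        raise NotImplementedError

@dataclass
class Pipeline:
    stages: Sequence[Stage]

    def __call__(self, x: Any) -> Any:
        for stage in self.stages:                    # e.g. audio -> text -> text -> audio
            x = stage.run(x)
        return x

# Swapping a model, sampler, or KV-cache strategy would then be a one-line change, e.g.:
# app = Pipeline([Stage("whisper.pte"),
#                 Stage("llama_1b.pte", {"sampler": "top_p"}),
#                 Stage("tts.pte")])
```

The same structure could be shared by the text-only LLM flow and multimodal flows (e.g., Llava), per the outcomes listed above.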

Alternatives

No response

Additional context

There is already another RFC, but it is specifically in the context of LLMs.

RFC (Optional)

No response

@shoumikhin
Contributor

Another success story, for iOS specifically (and hopefully for Android too), should look roughly like this video: the clients could just add an executorch-llm Swift PM package, write a few lines to create a Pipeline using an exported .pte file, add a text-edit field and a button to the UI, and then just run LLaMA inference out of the box.

@mergennachin
Contributor Author

@shoumikhin

Yeah, that's cool. Two additional comments:

  • One thing is to think about not only LLMs but other models as well... Usually an AI application is a combination of multiple models orchestrated together (voice, image, text, etc.).
  • Users would like to experiment in Python first by combining multiple models... And when they're satisfied with the result, they can "just click a button" and deploy to iOS and Android.

@iseeyuan
Contributor

It's great to have this kind of experience in general, but we may need to think more about how the framework can really help. For multimodal, we need to learn more about the common patterns. Note that different components may not work together directly out of the box due to:

  • different multimodal architectures (for example, with or without cross-attention)
  • training dependencies: some encoders may be trained on a certain LLM foundation model, or different training steps may be required, like training the encoder with a frozen foundation model and then fine-tuning the foundation model with a frozen encoder.

As a framework, we may think about how we can help users hook their components into a working and robust pipeline.

@kimishpatel
Contributor

I think it would be great if the issues/pain points, the solution space, and a potential way to validate this actually came from users a level higher than framework devs. My fear is that we will make incremental improvements that aesthetically please us based on our own experience. Even if such issues and/or the solution space are not driven by other users or product engineers, those personas have to be a close part of iterating over the solution space. Maybe this would be the people from the Paris hackathon.

@shoumikhin
Contributor

@kimishpatel I imagine that if, for users, it looks similar to HF transformers or the OpenAI API, that should be good enough?

@kimishpatel
Contributor

@kimishpatel I imagine that if, for users, it looks similar to HF transformers or the OpenAI API, that should be good enough?

What is the OpenAI API?

Why do you believe that users of HF transformers cover the span of users who interact with torchchat/ET? Any examples?

@shoumikhin
Contributor

HF transformers and the OpenAI API are sort of de facto standards for how devs and clients interact with LLMs these days. I guess if TC provides a similar interface, it at least wouldn't be worse.
The real question is: can we do better, and what would such "better" look like? Agreed, that's something the researchers or consumers can help us define, along with our own iterations.

@kimishpatel
Contributor

HF transformers and the OpenAI API are sort of de facto standards for how devs and clients interact with LLMs these days. I guess if TC provides a similar interface, it at least wouldn't be worse. The real question is: can we do better, and what would such "better" look like? Agreed, that's something the researchers or consumers can help us define, along with our own iterations.

@shoumikhin The OpenAI API, AFAIU, is about endpoint APIs, whereas building a pipeline of components, with customizability across aspects such as the tokenizer, KV-cache management, long-context management, etc., might be different. I don't know enough about HF in this space; I assume that would more closely align with some of the objectives here.

And generally, @mergennachin, I would also want to understand how different end users envision deploying models. The requirements listed here make sense, but I can't place them in a larger context or see where they are coming from. @shoumikhin's comment regarding HF users does make sense, though, but does the same exist for on-device use cases?
