RFC: Improve developer experience by anchoring on multimodal use-case #7093

Open
mergennachin opened this issue Nov 26, 2024 · 8 comments

@mergennachin
Contributor

mergennachin commented Nov 26, 2024

🚀 The feature, motivation and pitch

Let's build an example demo app, perhaps in pytorch-labs, which will serve as a forcing function for improving developer experience from the user's perspective. A positive outcome of this demo app would be defining and building new higher-level abstractions (e.g., similar to Pipelines).

At a high level, here's the app we would like to build: an LLM-based assistant with voice input and output. In terms of implementation, it's a three-step process (a minimal Python sketch follows the list below):

  • Given a voice input, convert to text (e.g., Whisper)
  • Run text based LLM (e.g., Llama 1B)
  • Convert text output to voice (e.g., using T5)
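
As a rough illustration of the "prototype in Python first" requirement below, here is a minimal sketch of this flow using HuggingFace `transformers` pipelines. The model checkpoints, task names, and the `voice_chat` helper are illustrative assumptions, not a committed design:

```python
# Hypothetical prototype of the voice -> text -> voice flow using HuggingFace
# pipelines. Model IDs are placeholders and are meant to be swappable
# (Whisper -> Seamless, Llama 1B -> Qwen, etc.).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
# Note: SpeechT5 typically also needs speaker embeddings passed at call time.
tts = pipeline("text-to-speech", model="microsoft/speecht5_tts")

def voice_chat(audio_path: str):
    text_in = asr(audio_path)["text"]                                  # 1. speech -> text
    text_out = llm(text_in, max_new_tokens=128)[0]["generated_text"]  # 2. run the text LLM
    return tts(text_out)                                              # 3. text -> speech ({"audio": ..., "sampling_rate": ...})
```

Each stage could then be exported separately (e.g., to .pte) and reassembled on device with the same three-step structure.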

Here are the requirements:

  • Be able to run on iOS, Android, and desktop.
  • Be able to prototype the e2e flow in Python first, using HuggingFace.
  • Be able to deploy on a laptop easily, without a Python runtime, for testing purposes.
  • Be able to swap the underlying models easily (e.g., Whisper -> Seamless, Llama 1B -> Qwen).
  • Easy to swap Sampler/Tokenizer/KVCache implementations in the LLM (perhaps building on this issue).
  • Easy deployment process to mobile and desktop apps.
  • Everything in OSS.
  • Easy to do performance optimization and debugging (e.g., using mobile accelerators, quantization).

Here's a positive outcome of this demo app:

  • Define and build new higher-level abstractions to make these possible (a rough API sketch follows this list).
  • ExecuTorch and torchchat use this abstraction for text-based LLMs.
  • Llava and multimodal image models use this abstraction.
  • The community can build completely new apps using these new abstractions.
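
To make the "higher-level abstractions" point more concrete, here is a rough, purely hypothetical API sketch; none of these class names exist in ExecuTorch or torchchat today. The idea is that each stage wraps an exported .pte program plus pluggable components, so models, tokenizers, samplers, and KV-cache strategies can be swapped independently:

```python
# Purely hypothetical sketch -- not an existing ExecuTorch/torchchat API.
# Each Stage wraps an exported .pte program plus pluggable components
# (tokenizer, sampler, KV cache), and a Pipeline chains stages together.
from dataclasses import dataclass, field
from typing import Any, Sequence

@dataclass
class Stage:
    pte_path: str                                    # exported ExecuTorch program for this model
    components: dict = field(default_factory=dict)   # e.g. {"tokenizer": ..., "sampler": ..., "kv_cache": ...}

    def run(self, x: Any) -> Any:
        # Placeholder: a real implementation would load and execute the .pte program.
        raise NotImplementedError

@dataclass
class Pipeline:
    stages: Sequence[Stage]

    def __call__(self, x: Any) -> Any:
        for stage in self.stages:                    # e.g. audio -> text -> text -> audio
            x = stage.run(x)
        return x

# Swapping a model, sampler, or KV-cache strategy would then be a one-line change, e.g.:
# app = Pipeline([Stage("whisper.pte"),
#                 Stage("llama_1b.pte", {"sampler": "top_p"}),
#                 Stage("tts.pte")])
```

The same structure could be shared by the text-only LLM flow and multimodal flows (e.g., Llava), per the outcomes listed above.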

Alternatives

No response

Additional context

There is already another RFC, but it is specifically in the context of LLMs.

RFC (Optional)

No response

@shoumikhin
Contributor

Another success story, for iOS specifically (and hopefully for Android too), should look roughly like this video: the clients could just add an executorch-llm Swift PM package, write a few lines to create a Pipeline using an exported .pte file, add a text-edit field and a button to the UI, and then just run LLaMA inference out of the box.

@mergennachin
Contributor Author

@shoumikhin

Yeah, that's cool. Two additional comments:

  • One thing is to think about not only LLMs but other models as well... Usually an AI application is a combination of multiple models orchestrated together (voice, image, text, etc.).
  • Users would like to experiment in Python first by combining multiple models... And when they're satisfied with the result, they can "just click a button" and deploy to iOS and Android.

@iseeyuan
Contributor

It's great to have this kind of experience in general, but we may need to think more about how the framework can really help. For multimodal, we need to learn more about the common patterns. Note that different components may not work together directly out of the box due to:

  • different multimodal architectures (for example, with or without cross-attention)
  • training dependencies: some encoders may be trained on a certain LLM foundation model, or different training steps may be required, like training the encoder with a frozen foundation model and then fine-tuning the foundation model with a frozen encoder.

As a framework, we may think about how we can help users hook their components into a working and robust pipeline.

@kimishpatel
Contributor

I think it would be great if the issues/pain points, the solution space, and a potential way to validate this actually came from users a level higher than framework devs. My fear is that we will make incremental improvements that aesthetically please us based on our own experience. Even if such issues and/or the solution space are not driven by other users or product engineers, those personas have to be a close part of iterating over the solution space. Maybe this would be the people from the Paris hackathon.

@shoumikhin
Contributor

@kimishpatel I imagine that if, for users, it looks similar to HF transformers or the OpenAI API, that should be good enough?

@kimishpatel
Contributor

@kimishpatel I imagine that if, for users, it looks similar to HF transformers or the OpenAI API, that should be good enough?

What is the OpenAI API?

Why do you believe that users of HF transformers cover the span of users who interact with torchchat/ET? Any examples?

@shoumikhin
Contributor

HF transformers and the OpenAI API are sort of de facto standards for how devs and clients interact with LLMs these days. I guess if TC provides a similar interface, it at least wouldn't be worse.
The real question is: can we do better, and what would such "better" look like? Agreed, that's something the researchers or consumers can help us define, along with our own iterations.

@kimishpatel
Contributor

HF transformers and the OpenAI API are sort of de facto standards for how devs and clients interact with LLMs these days. I guess if TC provides a similar interface, it at least wouldn't be worse. The real question is: can we do better, and what would such "better" look like? Agreed, that's something the researchers or consumers can help us define, along with our own iterations.

@shoumikhin The OpenAI API, AFAIU, is about endpoint APIs, whereas building a pipeline of components, with customizability across aspects such as the tokenizer, KV-cache management, long-context management, etc., might be different. I don't know enough about HF in this space; I assume that would more closely align with some of the objectives here.

And generally, @mergennachin, I would also want to understand how different end users envision deploying models. The requirements listed here make sense, but I can't place them in a larger context or see where they are coming from. @shoumikhin's comment regarding HF users does make sense, though, but does the same exist for on-device use cases?
