
Add docker container support #1271

Open

Sing-Li wants to merge 16 commits into main
Conversation

Sing-Li (Contributor) commented Nov 15, 2023

Add frequently requested Docker container support for serving REST APIs (for AI app developers who want to use supported MLC LLMs on their development machines / workstations / clusters).
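
For context, a minimal sketch of what this enables. The image name, port, and model id below are placeholders, not the actual tags defined by this PR, and the sketch assumes the served REST API is OpenAI-compatible:

```python
# Hypothetical usage sketch; image name, port, and model id are placeholders.
# First, start the container (docker commands shown as comments):
#
#   docker run --gpus all -p 8000:8000 <image> \
#       mlc_llm serve <model> --host 0.0.0.0 --port 8000
#
# Then query the served REST API from Python:
import json
import urllib.request

payload = {
    "model": "Llama-3-8B-Instruct-q4f16_1-MLC",  # placeholder model id
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```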

@junrushao junrushao self-assigned this Nov 20, 2023
louis030195 commented

Interested too, so I can integrate it into:

https://github.com/stellar-amenities/assistants

e.g. a one-liner deployment of an open-source assistants API!

Sing-Li (Contributor, Author) commented Apr 16, 2024

@junrushao Just updated for the new SLM JIT flow. Please review, test, and merge soon.

Sing-Li (Contributor, Author) commented Apr 17, 2024

@louis030195 that's one cool project you have going 😍 At long last, these containers are ready. mlc_llm now supports 88+ models, and the count is increasing rapidly (expect hundreds by the end of the year). Batching also works (several concurrent inferences), as does function calling on some models. Please give it a whirl!
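
To see the batching from the client side, here is a rough sketch (endpoint and model id are assumptions, as in the earlier example) that fires several requests concurrently so the engine can schedule them together:

```python
# Sketch: issue several chat requests at once so the server's batching
# can process them concurrently. Endpoint and model id are placeholders.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"

def ask(question: str) -> str:
    payload = {
        "model": "Llama-3-8B-Instruct-q4f16_1-MLC",  # placeholder model id
        "messages": [{"role": "user", "content": question}],
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

questions = ["What is MLC LLM?", "What is continuous batching?", "Why unified memory?"]
with ThreadPoolExecutor(max_workers=len(questions)) as pool:
    for answer in pool.map(ask, questions):
        print(answer[:80])
```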

louis030195 commented

@Sing-Li maybe a stupid question, but how do containers affect performance?

Do containers fully use the NVIDIA GPU, and other AI accelerators?

I know that using the Apple accelerator through Docker is impossible?

Sing-Li (Contributor, Author) commented Apr 17, 2024

@louis030195 great questions!

how do containers affect performance?

From my experience, there is almost no tangible impact. I think this is due to the essentially "pass-through" engineering done for the GPU (there is no virtualization layer for ROCm and CUDA). In fact, if you have a tunable container host, you can get more deterministic performance out of the CPU part of your application (possibly improving overall performance). Outside of ROCm and CUDA, both of which have had over a decade of engineering evolution, I don't know.

do containers fully use the NVIDIA GPU, and other AI accelerators?

GPU only, and only because of current industry pressure and earlier works-better-for-gaming engineering that has been repurposed for AI/ML by open-source efforts like MLC-AI. Unfortunately, I think commercial economics will always prevent open-source container tech from supporting proprietary "competitive differentiator" AI accelerators.

I know that using the Apple accelerator through Docker is impossible?

As far as my research goes: impossible. It is also unlikely that anyone from Apple will do anything to help. So we use a simple forwarding proxy to the actual host-based Metal-accelerated server to maintain deployment compatibility.
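
For the curious, a bare-bones sketch of that forwarding idea (not the code in this PR): a tiny proxy inside the container relays requests to the Metal-accelerated server running on the macOS host, which Docker Desktop exposes as host.docker.internal. The upstream address is an assumption:

```python
# Minimal illustrative forwarding proxy (not the PR's actual implementation):
# relay every POST to the Metal-accelerated server on the macOS host.
# host.docker.internal resolves to the host from inside Docker Desktop.
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://host.docker.internal:8000"  # assumed host server address

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the incoming request body and forward it verbatim upstream.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
        )
        # Relay the upstream response back to the original client.
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", ""))
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

ThreadingHTTPServer(("0.0.0.0", 8000), Proxy).serve_forever()
```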

The cool thing about Apple (and other upcoming unified-memory implementations, such as the flood of Qualcomm Elite machines) is that built-in multitasking at the operating-system level is good enough to run multiple different models concurrently on a single system, as long as you have enough RAM (no need for tightly coupled, expensive GPU memory).
