Add docker container support #1271
base: main
Conversation
Interested too, so I can integrate it into https://github.com/stellar-amenities/assistants -- e.g. a one-liner deployment of an open source Assistants API!
@junrushao Just updated for the new SLM JIT flow. Please review, test, and merge soon.
@louis030195 that's one cool project you have going 😍 At long last, these containers are ready -- mlc_llm now supports 88+ models, with more being added rapidly (expect hundreds by the end of the year). Batching (several concurrent inferences) is also working, as is support for function calling on some models. Please give it a whirl!
@Sing-Li maybe a stupid question, but how do containers affect performance? Do containers fully use the NVIDIA GPU and other AI accelerators? I know that using the Apple accelerator through Docker is impossible?
@louis030195 great questions!
From my experience, there is almost no tangible impact. I think this is due to the essentially "pass-through" engineering done for the GPU (there is no virtualization layer for ROCm or CUDA). In fact, if you have a tunable container host, you can get better deterministic performance out of the CPU part of your application (and possibly improve overall performance). Outside of ROCm and CUDA -- which have both had over a decade of engineering evolution -- I don't know.
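For reference, here is a rough sketch of what that pass-through looks like when launching a serving container via the docker-py SDK (the equivalent of `docker run --gpus all`). The image name and port below are placeholders for illustration, not necessarily what this PR ships:

```python
# Sketch only: start an mlc-llm serving container with all GPUs passed through.
# "mlcai/mlc-llm-serve:latest" and port 8000 are assumed names for illustration.
import docker

client = docker.from_env()

container = client.containers.run(
    "mlcai/mlc-llm-serve:latest",      # assumed image name
    detach=True,
    ports={"8000/tcp": 8000},          # assumed REST API port
    device_requests=[
        # Request every GPU; the NVIDIA container toolkit hands the devices
        # straight to the container, so there is no virtualization layer
        # sitting in the data path.
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
)
print(container.logs(tail=20).decode())
```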
GPU only, and only because of current industry pressure and earlier works-better-for-"gaming" engineering that has been repurposed for AI/ML by open source efforts like MLC-AI. Unfortunately, I think commercial economics will always prevent open source container tech from supporting proprietary "competitive differentiator" AI accelerators.
As far as my research goes -- impossible. It is also unlikely that anyone from Apple will do anything to help. So we use a simple forwarding proxy to the actual host-based, Metal-accelerated server to maintain deployment compatibility. The cool thing about Apple (and other upcoming unified memory implementations -- such as the flood of Qualcomm Elite machines) is that built-in multitasking at the operating system level is good enough to run multiple different models concurrently on a single system, as long as you have enough RAM (no need for tightly coupled, expensive GPU memory).
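To make the forwarding-proxy idea concrete, here is a minimal sketch (not the actual code in this PR): the container cannot touch the Metal GPU, so it simply relays incoming requests to an mlc-llm server running natively on the macOS host. The host address and ports are assumptions.

```python
# Minimal forwarding proxy sketch: runs inside the container and forwards
# requests to a Metal-accelerated server on the host. Ports are assumed.
import http.server
import urllib.request

HOST_SERVER = "http://host.docker.internal:8081"  # assumed host-side server


class ForwardingProxy(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the incoming request body and replay it against the host server.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            HOST_SERVER + self.path,
            data=body,
            headers={"Content-Type": self.headers.get("Content-Type", "application/json")},
        )
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)


if __name__ == "__main__":
    # Clients talk to the container as if it were serving the model itself.
    http.server.HTTPServer(("0.0.0.0", 8000), ForwardingProxy).serve_forever()
```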
Add frequently requested Docker container support for serving REST APIs (for AI app developers who want to use supported mlc-llm models on their development machines / workstations / clusters).
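As a usage sketch, an app developer could talk to the containerized REST server roughly like this. The port, endpoint path, model id, and response shape below are assumptions for illustration; check the container documentation for the actual values.

```python
# Sketch: call the containerized REST API. Endpoint, port, model id, and
# response schema are assumed (OpenAI-style chat completions), not confirmed.
import json
import urllib.request

payload = {
    "model": "Llama-2-7b-chat-hf-q4f16_1",  # assumed model id
    "messages": [{"role": "user", "content": "Hello from inside Docker!"}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])
```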