Direct Streaming of Model Weights from Cloud Storage to GPU Memory #7660
Comments
Given the current limitations in Triton Inference Server when dealing with constrained ephemeral storage, are there any workarounds or best practices you would recommend for efficiently loading large models from cloud storage (e.g., S3) directly into GPU memory without relying on significant local disk space? Any guidance or alternative solutions would be greatly appreciated.
Hi @azsh1725, thanks for your proposal! I've created an internal ticket [DLIS-7365] for the team to prioritize.
@harryskim, @statiraju, @nicolasnoble for viz
Not really sure what streaming would look like. It sounds like you want to attach your S3 object to your pod. I suggest downloading the model weights from blob storage and storing them in a Kubernetes PVC, which you can then mount to multiple pods.
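The PVC workaround described above might look like the manifest below. This is a sketch, not a tested configuration: the claim name, size, access mode, and storage class are all illustrative and depend on your cluster (in particular, mounting the same volume from multiple pods requires a storage class that supports `ReadOnlyMany` or `ReadWriteMany`). A one-off Job or initContainer would populate the volume from S3 before Triton pods mount it.

```yaml
# Hypothetical PVC for pre-downloaded model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-pvc      # illustrative name
spec:
  accessModes:
    - ReadOnlyMany             # requires a storage class that supports multi-pod mounts
  resources:
    requests:
      storage: 200Gi           # size to fit the model repository
  storageClassName: standard   # cluster-specific
```

Triton pods would then mount this claim at the model repository path instead of downloading from S3 into ephemeral storage on every start.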
Is your feature request related to a problem? Please describe.
I’m facing an issue when deploying large models in Kubernetes, especially when the pod’s ephemeral storage is limited. Triton Inference Server seems to download models to local disk (ephemeral storage) before loading them into GPU memory, which poses a problem when the available local storage is insufficient for large models. This issue becomes critical when running Triton in environments with constrained disk space, such as cloud environments where models reside in S3, but ephemeral storage on the pods is too limited for full model downloads.
Describe the solution you'd like
I would like Triton Inference Server to support direct streaming of model weights from cloud storage (e.g., S3, GCS) to GPU memory without storing the model on disk first. This feature would allow Triton to efficiently load large models in resource-constrained environments by bypassing the need for intermediate storage and directly loading the model into GPU memory. The system could stream or partially load the model in memory/GPU as required for inference, optimizing the process for large-scale deployments.
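Triton does not currently expose such an API, but the requested behavior can be sketched in plain Python. The example below shows the chunked-streaming pattern: read a file-like stream piece by piece into a preallocated buffer, never writing the payload to disk. Here an `io.BytesIO` stands in for the S3 response body (with boto3, `s3.get_object(...)["Body"]` yields a similar readable stream), and a `bytearray` stands in for pinned host or GPU memory; in a real implementation each chunk would be copied to the device as it arrives. All names here are illustrative, not part of any Triton or boto3 API.

```python
import io

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; tune for network throughput


def stream_into_buffer(source, total_size, chunk_size=CHUNK_SIZE):
    """Read `source` (any file-like object) chunk by chunk into a
    preallocated buffer of `total_size` bytes, without touching disk.

    In a GPU-loading scenario, `buf` would be pinned host memory (or a
    device allocation) and each chunk would be transferred as it arrives.
    """
    buf = bytearray(total_size)   # stand-in for pinned/GPU memory
    view = memoryview(buf)
    offset = 0
    while offset < total_size:
        chunk = source.read(min(chunk_size, total_size - offset))
        if not chunk:
            raise IOError("stream ended early at byte %d" % offset)
        view[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return bytes(buf)


# Demo: an in-memory stream standing in for an S3 object body.
payload = b"\x01\x02" * 1024
result = stream_into_buffer(io.BytesIO(payload), len(payload))
assert result == payload
```

The key property is that peak local-disk usage stays at zero and peak host-memory usage is bounded by the buffer (or, with device-side copies, by one chunk), which is what makes the approach attractive for pods with constrained ephemeral storage.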
Describe alternatives you've considered
Some alternatives to address this issue include:
Additional context
Many modern large language models, such as DeepSeek-Coder-V2-Instruct or other transformer-based architectures, can be too large to fit into a pod’s ephemeral storage. Allowing Triton to stream models from cloud storage directly into GPU memory would simplify deployment in environments like Kubernetes, where scaling and efficient resource use are critical.