Direct Streaming of Model Weights from Cloud Storage to GPU Memory #7660

Open
azsh1725 opened this issue Sep 26, 2024 · 4 comments
Labels
enhancement (New feature or request)

Comments

@azsh1725

Is your feature request related to a problem? Please describe.
I’m running into a problem when deploying large models in Kubernetes with limited pod ephemeral storage. Triton Inference Server appears to download models to local disk (ephemeral storage) before loading them into GPU memory, which fails when the local storage cannot hold the full model. This becomes critical in cloud environments where the models live in S3 but the pods’ ephemeral storage is too small for a full download.

Describe the solution you'd like
I would like Triton Inference Server to support streaming model weights directly from cloud storage (e.g., S3, GCS) into GPU memory, without writing the model to disk first. This would let Triton load large models in resource-constrained environments by skipping the intermediate storage step, streaming or partially loading the model into host/GPU memory as needed for inference, which would also help large-scale deployments.
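
To illustrate the idea, here is a rough sketch of what streaming weights from S3 straight into GPU memory could look like outside of Triton. It assumes boto3 and PyTorch; the bucket, key, dtype, and tensor shape are hypothetical, and this is only meant to show the shape of the feature, not a proposed implementation.

```python
# Rough sketch only (not Triton's actual loading path): stream an S3 object
# in ranged chunks straight into a pre-allocated GPU tensor, bypassing disk.
# Assumes boto3 and PyTorch; bucket, key, dtype, and shape are hypothetical.
import boto3
import torch

BUCKET = "my-models"              # hypothetical bucket
KEY = "llm/weights.bin"           # hypothetical raw float16 weight blob
SHAPE = (8192, 8192)              # hypothetical tensor shape
CHUNK = 64 * 1024 * 1024          # 64 MiB per ranged GET

s3 = boto3.client("s3")
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

gpu_tensor = torch.empty(SHAPE, dtype=torch.float16, device="cuda")
gpu_bytes = gpu_tensor.view(torch.uint8).view(-1)   # flat byte view of the GPU tensor
staging = torch.empty(CHUNK, dtype=torch.uint8, pin_memory=True)  # pinned host buffer

offset = 0
while offset < size:
    end = min(offset + CHUNK, size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{end}")["Body"].read()
    n = len(body)
    staging[:n] = torch.frombuffer(bytearray(body), dtype=torch.uint8)
    # Synchronous copy of this chunk from pinned host memory into the GPU tensor;
    # the staging buffer can be safely reused on the next iteration.
    gpu_bytes[offset:offset + n].copy_(staging[:n])
    offset += n
```

The key point is that only one chunk-sized staging buffer ever exists on the host, so neither disk nor host RAM needs to hold the full model.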

Describe alternatives you've considered
Some alternatives to address this issue include:

  • Mounting Persistent Volumes (PV) or Persistent Volume Claims (PVC) in Kubernetes to increase available storage, but this introduces additional overhead and complexity.
  • Chunk-based loading or model parallelism techniques, but these require significant changes to model architecture and inference workflows.
  • Mounting models in memory via tmpfs, which works for smaller models but is impractical for very large models.

Additional context
Many modern large language models, such as DeepSeek-Coder-V2-Instruct or other transformer-based architectures, can be too large to fit into a pod’s ephemeral storage. Allowing Triton to stream models from cloud storage directly into GPU memory would simplify deployment in environments like Kubernetes, where scaling and efficient resource use are critical.

@azsh1725
Author

Given the current limitations in Triton Inference Server when dealing with constrained ephemeral storage, are there any workarounds or best practices you would recommend for efficiently loading large models from cloud storage (e.g., S3) directly to GPU memory without relying on significant local disk space? Any guidance or alternative solutions from your side would be greatly appreciated.

@oandreeva-nv added the enhancement (New feature or request) label on Sep 27, 2024
@oandreeva-nv
Contributor

Hi @azsh1725, thanks for your proposal! I've created an internal ticket [DLIS-7365] for the team to prioritize.

@nnshah1
Contributor

nnshah1 commented Oct 3, 2024

@harryskim, @statiraju, @nicolasnoble for visibility

@jyono

jyono commented Oct 9, 2024

I'm not really sure what streaming would look like here. It sounds like you want to attach your S3 object to your pod. I'd suggest downloading the model weights from blob storage and storing them in a Kubernetes PVC; you can then mount that PVC to multiple pods.
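
A rough sketch of that pre-download step, assuming boto3 and a PVC mounted at /models (the bucket and prefix below are placeholders), could run in an init container before Triton starts:

```python
# Rough sketch of the workaround above (not an official Triton utility):
# pre-download model files from S3 into a PVC-backed directory, e.g. from an
# init container, so Triton can read them as a local model repository.
# Assumes boto3; the bucket, prefix, and mount path are placeholders.
import os
import boto3

BUCKET = "my-models"             # placeholder bucket
PREFIX = "deepseek-coder-v2/"    # placeholder prefix holding the model files
DEST = "/models"                 # PVC mount path shared with the Triton pods

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):          # skip "directory" marker objects
            continue
        target = os.path.join(DEST, os.path.relpath(key, PREFIX))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # download_file streams to disk in parts, so host memory use stays low;
        # the PVC supplies the capacity the pod's ephemeral storage lacks.
        s3.download_file(BUCKET, key, target)
```

Triton can then be pointed at the PVC path as a local model repository, and the same PVC can be mounted (read-only) by multiple replicas.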
