
Implement parallel model preloading #211

Draft: aybanda wants to merge 1 commit into main

Conversation

@aybanda commented Sep 9, 2024

@AlexCheema

Implement Parallel Model Preloading

Description

This PR introduces parallel model preloading to significantly reduce startup times for large models distributed across multiple nodes. By leveraging asyncio, we now preload model shards into memory concurrently, followed by a sequential initialization step.

Changes

  • Added preload_model method to the InferenceEngine abstract class
  • Implemented preload_model in MLXDynamicShardInferenceEngine
  • Updated ensure_shard method to work with preloaded models
  • Modified main.py to use parallel preloading

Implementation Details

  1. InferenceEngine now has an abstract preload_model method
  2. MLXDynamicShardInferenceEngine.preload_model loads the model config and weights without full initialization
  3. ensure_shard completes initialization using the preloaded data
  4. The main script uses asyncio.gather for parallel preloading (a rough sketch follows this list)
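
For reference, here's a minimal runnable sketch of the intended shape. The loader and initializer helpers (load_config, load_weights, initialize_model) are stand-ins for exo's actual MLX loading code, not its real API:

```python
import asyncio
from abc import ABC, abstractmethod


# Stand-in loaders; in exo these would be the real MLX config/weight
# loading calls (names here are hypothetical).
def load_config(shard):
    return {"shard": shard}


def load_weights(shard):
    return {}


def initialize_model(config, weights):
    return (config, weights)


class InferenceEngine(ABC):
    @abstractmethod
    async def preload_model(self, shard):
        """Load a shard's config and weights into memory without full initialization."""
        ...


class MLXDynamicShardInferenceEngine(InferenceEngine):
    def __init__(self):
        self._preloaded = {}  # shard -> (config, weights)
        self.model = None

    async def preload_model(self, shard):
        # Run the blocking loads off the event loop so shards preload concurrently.
        config, weights = await asyncio.gather(
            asyncio.to_thread(load_config, shard),
            asyncio.to_thread(load_weights, shard),
        )
        self._preloaded[shard] = (config, weights)

    async def ensure_shard(self, shard):
        # Use preloaded data when available; fall back to loading on demand.
        if shard not in self._preloaded:
            await self.preload_model(shard)
        config, weights = self._preloaded.pop(shard)
        self.model = initialize_model(config, weights)  # sequential init step


async def main():
    engine = MLXDynamicShardInferenceEngine()
    shards = ["shard-0", "shard-1", "shard-2"]
    # Preload all shards in parallel, then initialize one at a time.
    await asyncio.gather(*(engine.preload_model(s) for s in shards))
    for s in shards:
        await engine.ensure_shard(s)


asyncio.run(main())
```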

Performance Improvements

  • Startup time for multi-shard models is expected to decrease significantly
  • Resource utilization during startup should be more efficient, since shards load concurrently instead of waiting on one another

How to Test

  1. Run the main script with a multi-shard model
  2. Observe the logs for parallel preloading followed by sequential initialization
  3. Compare startup times against the previous sequential loading approach (see the timing snippet below)
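
One simple way to compare startup times, assuming the startup path can be driven from Python (main here refers to the hypothetical entry point from the sketch above, not exo's real CLI):

```python
import asyncio
import time

start = time.perf_counter()
asyncio.run(main())  # hypothetical entry point from the sketch above
print(f"Startup took {time.perf_counter() - start:.2f}s")
```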

Future Work

  • Fine-tune the balance between parallel preloading and sequential initialization
  • Implement similar optimizations for other inference engines (e.g., TinyGrad)

If you feel like supporting me:

https://buymeacoffee.com/aybanda

@AlexCheema (Contributor) commented
Hey, is this AI generated?

We don't accept AI-generated PRs.

This doesn't really achieve its intended purpose: calling preload_model in main.py doesn't really make sense, since exo doesn't know up front which shards you are going to use.

@aybanda (Author) commented Sep 9, 2024

Hey @AlexCheema, I got your point, and yes, I generated this using AI.

Instead of preloading in main.py, we could modify the ensure_shard method to implement a more efficient loading process. Here's an approach that might work better with your design: modify ensure_shard in MLXDynamicShardInferenceEngine so that it (rough sketch below):

  • Loads the config and weights concurrently
  • Doesn't require changes to main.py or other parts of exo
  • Keeps the loading process within the ensure_shard method, maintaining your existing architecture
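
Reusing the stand-in helpers from the sketch in the PR description above (load_config, load_weights, and initialize_model are hypothetical names, not exo's actual API), the revised method could look like:

```python
import asyncio


class MLXDynamicShardInferenceEngine:
    def __init__(self):
        self.shard = None
        self.model = None

    async def ensure_shard(self, shard):
        if self.shard == shard:
            return  # already initialized for this shard
        # Load config and weights concurrently rather than one after the other;
        # load_config/load_weights stand in for exo's real MLX loading calls.
        config, weights = await asyncio.gather(
            asyncio.to_thread(load_config, shard),
            asyncio.to_thread(load_weights, shard),
        )
        # Sequential initialization step, unchanged from the existing design.
        self.model = initialize_model(config, weights)
        self.shard = shard
```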

If you are interested in this, let me know and I will change the code accordingly.

aybanda marked this pull request as draft September 9, 2024 17:04