Partial weight loading for reduced RAM utilization #5626
Labels
feature
A request for a proper, new feature.
module: runtime
Issues related to core runtime
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🚀 The feature, motivation and pitch
I'm trying to use ExecuTorch to run LLMs on low-powered (low-RAM) devices. I would like the weights not to be loaded into RAM all at once when the model is initially loaded. Instead, weights should be loaded on demand as the forward pass executes, to improve RAM utilization; a rough sketch of the idea follows.
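A minimal NumPy sketch of the idea, assuming OS-backed memory mapping (this is not ExecuTorch's API; the file layout, offsets, and class name are all hypothetical):

```python
import numpy as np

class LazyLayer:
    """Holds a memory-mapped view of one layer's weights.

    Pages are faulted in from flash only when the forward pass
    actually reads them, so resident RAM stays close to the
    working set rather than the full model size, and the OS can
    evict cold pages under memory pressure.
    """
    def __init__(self, path, offset, shape, dtype=np.float16):
        # mode="r" maps the file read-only; nothing is read yet.
        self.w = np.memmap(path, dtype=dtype, mode="r",
                           offset=offset, shape=shape)

    def forward(self, x):
        # Touching self.w here triggers on-demand page-ins.
        return x @ self.w

def run(layers, x):
    # Only the layer currently executing needs its pages resident.
    for layer in layers:
        x = layer.forward(x)
    return x
```

The key point is that loading becomes a side effect of execution rather than a step at model-load time, which is what keeps peak RAM bounded.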
This can be further extended by predicting which sparse weights to load, as described in Apple's paper "LLM in a Flash", which reported an inference speed-up along with lower RAM usage (a sketch of that idea is below).
Link here
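A hedged sketch of that predictor idea, assuming a cheap low-rank activation proxy and NumPy memmaps (the names `W_pred` and `sparse_ffn`, the file layout, and the top-k selection are illustrative assumptions, not taken from the paper or from ExecuTorch):

```python
import numpy as np

def sparse_ffn(x, up, down, W_pred, top_k=256):
    """One ReLU FFN block that loads only rows predicted to be active.

    x:      activation vector of shape (hidden,)
    up:     memmap of shape (ffn_dim, hidden) -- up-projection rows
    down:   memmap of shape (ffn_dim, hidden) -- down-projection rows
    W_pred: small dense (hidden, ffn_dim) matrix approximating which
            neurons survive the ReLU, so scoring it is cheap.
    """
    scores = x @ W_pred                    # cheap proxy for activations
    active = np.argsort(scores)[-top_k:]   # indices of likely-active neurons
    w_up = np.asarray(up[active])          # flash read: only top_k rows
    w_down = np.asarray(down[active])
    h = np.maximum(x @ w_up.T, 0.0)        # ReLU over active neurons only
    return h @ w_down                      # project back to hidden dim
```

Because ReLU FFNs are highly sparse at inference time, most rows are never read at all, so both flash I/O and resident RAM scale with the active neuron count rather than `ffn_dim`.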
@iseeyuan
Alternatives
No response
Additional context
No response
RFC (Optional)
No response