Partial weight loading for reduced RAM utilization #5626
Labels
feature
A request for a proper, new feature.
module: runtime
Issues related to core runtime
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
🚀 The feature, motivation and pitch
I'm trying to use ExecuTorch to run LLMs on low-powered (low-RAM) devices. I would like the weights not to be loaded into RAM all at once when the model is initially loaded. Instead, weights should be loaded on demand as the forward pass executes, to improve RAM utilization; a rough sketch of the idea follows.
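A minimal NumPy sketch of the idea, assuming OS-backed memory mapping (this is not ExecuTorch's API; the file layout, offsets, and class name are all hypothetical):

```python
import numpy as np

class LazyLayer:
    """Holds a memory-mapped view of one layer's weights.

    Pages are faulted in from flash only when the forward pass
    actually reads them, so resident RAM stays close to the
    working set rather than the full model size, and the OS can
    evict cold pages under memory pressure.
    """
    def __init__(self, path, offset, shape, dtype=np.float16):
        # mode="r" maps the file read-only; nothing is read yet.
        self.w = np.memmap(path, dtype=dtype, mode="r",
                           offset=offset, shape=shape)

    def forward(self, x):
        # Touching self.w here triggers on-demand page-ins.
        return x @ self.w

def run(layers, x):
    # Only the layer currently executing needs its pages resident.
    for layer in layers:
        x = layer.forward(x)
    return x
```

The key point is that loading becomes a side effect of execution rather than a step at model-load time, which is what keeps peak RAM bounded.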
This can be further extended by predicting which sparse weights to load, as described in Apple's paper "LLM in a Flash", which reported an inference speed-up along with lower RAM usage (a sketch of that idea is below).
Link here
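A hedged sketch of that predictor idea, assuming a cheap low-rank activation proxy and NumPy memmaps (the names `W_pred` and `sparse_ffn`, the file layout, and the top-k selection are illustrative assumptions, not taken from the paper or from ExecuTorch):

```python
import numpy as np

def sparse_ffn(x, up, down, W_pred, top_k=256):
    """One ReLU FFN block that loads only rows predicted to be active.

    x:      activation vector of shape (hidden,)
    up:     memmap of shape (ffn_dim, hidden) -- up-projection rows
    down:   memmap of shape (ffn_dim, hidden) -- down-projection rows
    W_pred: small dense (hidden, ffn_dim) matrix approximating which
            neurons survive the ReLU, so scoring it is cheap.
    """
    scores = x @ W_pred                    # cheap proxy for activations
    active = np.argsort(scores)[-top_k:]   # indices of likely-active neurons
    w_up = np.asarray(up[active])          # flash read: only top_k rows
    w_down = np.asarray(down[active])
    h = np.maximum(x @ w_up.T, 0.0)        # ReLU over active neurons only
    return h @ w_down                      # project back to hidden dim
```

Because ReLU FFNs are highly sparse at inference time, most rows are never read at all, so both flash I/O and resident RAM scale with the active neuron count rather than `ffn_dim`.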
@iseeyuan
Alternatives
No response
Additional context
No response
RFC (Optional)
No response