
[FT] Support llama.cpp inference #402

Open
JoelNiklaus opened this issue Nov 22, 2024 · 3 comments
Labels
feature request New feature/request

Comments

@JoelNiklaus
Contributor

Issue encountered

Currently, inference of open models on my Mac is quite slow, since vLLM does not support MPS.

Solution/Feature

Llama.cpp does support MPS and would significantly speed up local evaluation of open models.
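For context, a minimal sketch of what llama.cpp-backed generation could look like through the llama-cpp-python bindings (an assumed dependency; the model path, prompt, and sampling parameters below are placeholders, not part of lighteval):

```python
# Sketch: run a GGUF model with llama-cpp-python, offloading layers to Metal on Apple Silicon.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",  # placeholder: any GGUF checkpoint
    n_gpu_layers=-1,                  # offload all layers to the GPU (Metal on macOS)
    n_ctx=4096,                       # context window
)

out = llm("Q: What is the capital of France? A:", max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])
```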

Possible alternatives

Allowing the use of the MPS device in other model-loading backends would also work.

@JoelNiklaus JoelNiklaus added the feature request New feature/request label Nov 22, 2024
@clefourrier
Member

Hi! Feel free to open a PR for this if you need it fast, as our roadmap for EOY is full :)

@JoelNiklaus
Contributor Author

Sounds good. Might do at some point; for now it is not a priority for me.

@julien-c
Member

would be an awesome feature IMO! cc @gary149
