
ollama support #91

Closed · 1 of 3 tasks
kerthcet opened this issue Aug 17, 2024 · 13 comments
Assignees
Labels
feature · needs-priority · needs-triage

Comments

@kerthcet
Member

What would you like to be added:

ollama provides an SDK for integrations, so we can easily integrate with it. One benefit I can think of is that ollama maintains a bunch of quantized models we can leverage.

Why is this needed:

Ecosystem integration.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kerthcet
Member Author

/kind feature

@InftyAI-Agent added the needs-triage, needs-kind, needs-priority, and feature labels and removed the needs-kind label on Aug 17, 2024
@kerthcet
Member Author

kerthcet commented Aug 17, 2024

Because ollama doesn't provide an HTTP server, one way to integrate with it is to support a URI with the ollama protocol and run inference with llama.cpp.

RE:
It does support a REST server, see https://github.com/ollama/ollama/blob/main/docs/api.md
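For reference, the documented REST API exposes a generate endpoint on port 11434; a quick sketch (the model name here is just an illustration):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:0.5b",
  "prompt": "Why is the sky blue?"
}'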

@qinguoyi
Member

/assign @qinguoyi

@qinguoyi
Member

I will finish this work by 11.2 (Nov 2).

@kerthcet
Member Author

Hey @qinguoyi, if you have any design details, it's better to share them in this issue so we can discuss them and avoid unnecessary refactoring. Thanks!

@qinguoyi
Member

qinguoyi commented Oct 28, 2024

Let's look at some background first:

  • How llmaz runs
    • It first downloads the model file, then runs inference based on that file.
  • What ollama supports
    • Running models directly, loading them from ollama's own model library.
    • Importing custom model files that are not in the ollama library:
      • GGUF
      • safetensors (imported directly, or imported with a tuned adapter)
      • quantizing other model file formats

So, considering llmaz, our goal is to support ollama importing custom model files for inference, including GGUF and safetensors (imported directly).

Now, the difficulty of implementation:

https://github.com/ollama/ollama/blob/main/docs/import.md — according to the official docs, importing a custom model file requires executing some shell commands after starting the ollama server (a minimal sketch follows the list):

  • create a file named Modelfile
  • ollama create modelName -f Modelfile
  • ollama run modelName
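A minimal sketch of those steps, assuming a local GGUF file (the path and model name are placeholders):

# Modelfile pointing at the local weights (placeholder path)
cat <<EOF > Modelfile
FROM /path/to/model.gguf
EOF

# register the model with the running ollama server, then run it
ollama create my-model -f Modelfile
ollama run my-model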

Now let's look at the command of the official ollama image.

(screenshot: image inspect output of the official ollama image)

According to the image inspect, the image only starts the ollama server. So the difficulty is how to execute multiple commands and import custom model files for inference while starting the image.

Let's see how to do it.

If you want to execute multiple commands, use a shell script rather than another language like Python, because almost all images ship with a shell such as sh or bash.

We have two containers: an init container for downloading models and a main container for starting the inference service. So we have two possible solutions:

  • The first is to rebuild a new image based on the official image and inject the shell script into it. This is not flexible enough: if the official image is updated, we have to rebuild ours.
  • The second is to add logic that copies the script file into the models directory when the init container downloads the model; that is, the script directory is mounted alongside the models directory, so it can be extended with more scripts in the future (see the sketch after this list).
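A rough sketch of that extra copy step in the init container (the source path of the script is an assumption; the target path matches the BackendRuntime further down):

# after the model download completes in the init container, copy the
# startup script onto the shared models volume (source path is hypothetical)
mkdir -p /workspace/models/llmaz-scripts
cp /scripts/start_ollama.sh /workspace/models/llmaz-scripts/start_ollama.sh
chmod +x /workspace/models/llmaz-scripts/start_ollama.sh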

(diagram: the init container copies the script onto the shared models volume used by the main ollama container)

In summary, we choose the second method to implement it. The specific script commands are as follows:

#!/bin/bash
# start the ollama server in the background
ollama serve &

# wait for the server to come up
sleep 5

# check input params
if [ -z "$1" ]; then
    echo "please input a GGUF model file path, e.g.: ./start_ollama.sh /path/to/model.gguf mymodel"
    exit 1
fi

MODEL_PATH=$1

# decide whether the input is a file path or a directory path
if [ -f "$MODEL_PATH" ]; then
    echo "input file path: $MODEL_PATH"
    # check whether the file has a .gguf suffix
    if [[ "$MODEL_PATH" == *.gguf ]]; then
        echo "file exists and has a .gguf suffix: $MODEL_PATH"
    else
        echo "file exists but does not have a .gguf suffix: $MODEL_PATH"
        exit 1
    fi

elif [ -d "$MODEL_PATH" ]; then
    echo "input dir path: $MODEL_PATH"

    # check whether the directory contains any .safetensors files
    SAFETENSORS_FILES=$(find "$MODEL_PATH" -type f -name "*.safetensors")

    if [ -z "$SAFETENSORS_FILES" ]; then
        echo "dir exists but contains no .safetensors files"
        exit 1
    else
        echo "dir exists and contains .safetensors files:"
        echo "$SAFETENSORS_FILES"
    fi
else
    echo "input path is neither a file nor a directory: $MODEL_PATH"
    exit 1
fi


# create the Modelfile
MODEL_FILE="Modelfile"
cat <<EOF > $MODEL_FILE
FROM "$MODEL_PATH"
EOF

echo "create Modelfile success"
cat $MODEL_FILE

# run ollama create
if [ -z "$2" ]; then
    echo "please input a model name"
    exit 1
fi
MODEL_NAME=$2
ollama create "$MODEL_NAME" -f Modelfile
if [ $? -ne 0 ]; then
    echo "ollama create failed"
    exit 1
fi

# run the model
ollama run "$MODEL_NAME"

# keep the shell alive so the container does not exit
while true; do
    sleep 3600
done
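For example, the script would be invoked like this (paths and model name are hypothetical):

# GGUF file
./start_ollama.sh /workspace/models/model.gguf mymodel

# safetensors directory
./start_ollama.sh /workspace/models/qwen2-safetensors mymodel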

Let's see the result.

Here we take GGUF file mounting as an example. In order to start faster, we use the minimized image alpine/ollama:latest.

  • playground.yaml
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - /workspace/models/llmaz-scripts/start_ollama.sh
  image: alpine/ollama
  version: latest
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - "{{`{{ .ModelPath }}`}}"
        - "{{`{{ .ModelName }}`}}"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 4
      memory: 8Gi
{{- end }}

Let's port-forward 11434 to 8080 and send a test request:

(screenshots: port-forward and successful inference responses)
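Roughly, that amounts to something like the following (the pod name is a placeholder):

kubectl port-forward pod/<ollama-playground-pod> 8080:11434

curl http://localhost:8080/api/generate -d '{
  "model": "mymodel",
  "prompt": "Why is the sky blue?"
}'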

So, this is my idea for supporting ollama. I would like to hear more ideas on how to support it elegantly. PTAL @kerthcet

@kerthcet
Member Author

Thanks for the detailed information, it's really clear. Given that ollama is mostly designed for local deployment rather than the cloud, and that it's based on llama.cpp, which we already support, my suggestion is to start with the simplest approach, see whether it's popular with users, and then step up to the next level based on feedback, rather than making it perfect on day one. So maybe we can start by ignoring the Modelfile and running the ollama command directly? That way we can leverage the models in the ollama library.

Again, from what I've learned so far, I haven't seen a lot of users deploy ollama in the cloud, so this is a suboptimal solution; I'm making it a TODO mostly because it's easy to integrate with inference backends. WDYT?

@qinguoyi
Member

qinguoyi commented Oct 28, 2024

Thanks for your kind reply.

I figured out what had confused me so much: I thought the Modelfile was mandatory.

In addition, I have no idea how to ignore the Modelfile.

For example, we could add an Ignore field in Playground; when ignore is true, we only run the playground without binding a model?

But the playground, service, and backendruntime controllers have a lot of code binding model and model[0], so there would be a lot of work to ignore the model.

Do you have any suggestions for the implementation?

@kerthcet
Member Author

kerthcet commented Oct 29, 2024

A simple implementation would look like:

  • Use the ollama image as the base image.
  • The model would look like below, so we know we're importing models from ollama:
      source:
        uri: ollama://qwen2:0.5b

  • The command would look like ollama run qwen2:0.5b, templated via the backendRuntime (see the sketch after this list).
  • We can then query the model with a request like below; of course, we need to change the port:
    curl http://localhost:11434/api/generate -d '{
    "model": "qwen2:0.5b",
    "prompt":"Why is the sky blue?"
    }'

  • Once we detect that we're running models from ollama, there is no longer any need for an init container to download the model beforehand. As mentioned, this is the simplest implementation, with no caching for the moment; we can add caching whenever users ask for it.
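Just as a sketch of how that command could be wired in the container, mirroring the pattern of the earlier script (this is an assumption about the wiring, not a final design; the sleep and model name are placeholders):

# start the server in the background and wait briefly for it to come up
ollama serve &
sleep 5
# pull the model from the ollama library (if needed) and load it
ollama run qwen2:0.5b
# keep the container alive while the server keeps listening on 11434
wait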

Any suggestions?

@qinguoyi
Member

I fully agree, this seems like the least invasive solution. I'll work on getting it done as soon as possible.

@qinguoyi mentioned this issue Nov 3, 2024
@qinguoyi
Member

qinguoyi commented Nov 3, 2024

Regarding "I will finish this work by 11.2":

I am sorry for the late commit.
I made a PR: #193. PTAL @kerthcet, thanks.

@qinguoyi
Member

Could we close this issue? @kerthcet

@kerthcet
Member Author

Yes, we can. One tip: you can include fix #xxx in the PR description, and the issue will be closed as soon as the PR is merged. Better not to remove the fix keyword.

/close
