This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

feat(code): switch fully to ollama as LLM provider (#101)
* feat: Adds script to evaluate ollama latency

* refactor: Remove legacy comment

* feat: Adds route to chat with code llm

* docs: Updates copyright notice

* docs: Updates copyright notice

* build(docker): bump ollama to 0.1.23

* feat(ollama): unload model after 30 sec

* build(docker): bump ollama to 0.1.25

* feat(code): add chat endpoint

* refactor(docker): set ollama as default llm provider

* test(code): add unit tests for code router

* docs(readme): update the env configuration description

* docs(swagger): improve the example for Chatrole

* docs(readme): add latency benchmark in the readme

* fix(ollama): fix typo in ollama client

* fix(openai): fix import

* ci(github): remove nvidia driver requirement for docker orchestration

* fix(docker): relax healthcheck on test environment

* fix(docker): update docker healthcheck for tests

* fix(docker): fix docker command for test env

* fix(docker): fix healthcheck

* fix(docker): fix healthcheck

* fix(docker): fix healthcheck of ollama containers

* test(repos): disable repo parsing test
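
Of the changes above, "feat(ollama): unload model after 30 sec" maps onto Ollama's `keep_alive` option (available as of the 0.1.23 image this PR bumps to). A rough sketch of the mechanism — the exact wiring in this repo's client is an assumption, not taken from the diff:

```bash
# Hypothetical request: Ollama's generate API accepts a keep_alive duration,
# after which the model is unloaded from memory (here, 30 s after the last call).
curl http://localhost:11434/api/generate -d '{
  "model": "tinydolphin:1.1b-v2.8-q4_0",
  "prompt": "def fibonacci(n):",
  "keep_alive": "30s"
}'
```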
frgfm authored Feb 20, 2024
1 parent 6bc7921 commit 2aaa508
Showing 19 changed files with 622 additions and 423 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/builds.yml
@@ -32,7 +32,7 @@ jobs:
           POSTGRES_USER: postgres
           POSTGRES_PASSWORD: pg_pwd
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
-        run: docker-compose up -d --build
+        run: docker-compose -f docker-compose.test.yml up -d --build
       - name: Docker sanity check
         run: sleep 20 && nc -vz localhost 8050
       - name: Debug
34 changes: 29 additions & 5 deletions README.md
@@ -62,6 +62,27 @@ In order to stop the service, run:
 make stop
 ```
 
+### Latency benchmark
+
+Do you crave perfect code suggestions, but wonder whether a model fits your latency needs?
+In the table below, you will find a latency benchmark for all the Ollama LLMs we tested:
+
+| Model | Ingestion mean (std) | Generation mean (std) |
+| ------------------------------------------------------------ | ---------------------- | --------------------- |
+| [tinyllama:1.1b-chat-v1-q4_0](https://ollama.com/library/tinyllama:1.1b-chat-v1-q4_0) | 2014.63 tok/s (±12.62) | 227.13 tok/s (±2.26) |
+| [dolphin-phi:2.7b-v2.6-q4_0](https://ollama.com/library/dolphin-phi:2.7b-v2.6-q4_0) | 684.07 tok/s (±3.85) | 122.25 tok/s (±0.87) |
+| [dolphin-mistral:7b-v2.6](https://ollama.com/library/dolphin-mistral:7b-v2.6) | 291.94 tok/s (±0.40) | 60.56 tok/s (±0.15) |
+
+This benchmark was run over 20 iterations on the same input sequence, on a **laptop** to better reflect the performance common users can expect. As a rule of thumb, at ~60 tok/s a 300-token completion takes roughly 5 seconds of generation, on top of prompt ingestion. The hardware setup includes an [Intel(R) Core(TM) i7-12700H](https://ark.intel.com/content/www/us/en/ark/products/132228/intel-core-i7-12700h-processor-24m-cache-up-to-4-70-ghz.html) CPU and an [NVIDIA GeForce RTX 3060](https://www.nvidia.com/fr-fr/geforce/graphics-cards/30-series/rtx-3060-3060ti/) laptop GPU.
+
+You can run this latency benchmark for any Ollama model on your hardware as follows:
+```bash
+python scripts/evaluate_ollama_latency.py dolphin-mistral:7b-v2.6-dpo-laser-q4_0 --endpoint http://localhost:3000
+```
+
+*All script arguments can be checked using `python scripts/evaluate_ollama_latency.py --help`*
+
 ### How is the database organized
 
@@ -88,30 +109,33 @@ The back-end core feature is to interact with the metadata tables. For the servi
 
 The project was designed so that everything runs with Docker orchestration (standalone virtual environment), so you won't need to install any additional libraries.
 
-## Configuration
+### Configuration
 
 In order to run the project, you will need to specify some information, which can be done using a `.env` file.
 This file will have to hold the following information:
-- `POSTGRES_DB`*: a name for the [PostgreSQL](https://www.postgresql.org/) database that will be created
-- `POSTGRES_USER`*: a login for the PostgreSQL database
-- `POSTGRES_PASSWORD`*: a password for the PostgreSQL database
 - `SUPERADMIN_GH_PAT`: the GitHub token of the initial admin access (Generate a new token on [GitHub](https://github.com/settings/tokens?type=beta), with no extra permissions = read-only)
 - `SUPERADMIN_PWD`*: the password of the initial admin access
 - `GH_OAUTH_ID`: the Client ID of the GitHub OAuth app (Create an OAuth app on [GitHub](https://github.com/settings/applications/new), pointing to your Quack dashboard w/ callback URL)
 - `GH_OAUTH_SECRET`: the secret of the GitHub OAuth app (Generate a new client secret on the created OAuth app)
+- `POSTGRES_DB`*: a name for the [PostgreSQL](https://www.postgresql.org/) database that will be created
+- `POSTGRES_USER`*: a login for the PostgreSQL database
+- `POSTGRES_PASSWORD`*: a password for the PostgreSQL database
-- `OPENAI_API_KEY`: your API key for OpenAI (Create a new secret key on [OpenAI](https://platform.openai.com/api-keys))
 
 _* marks the values where you can pick what you want._
 
 Optionally, the following information can be added:
 - `SECRET_KEY`*: if set, tokens can be reused between sessions. All instances sharing the same secret key can use the same token.
+- `OLLAMA_MODEL`: the model tag in the [Ollama library](https://ollama.com/library) that will be used for the API.
 - `SENTRY_DSN`: the DSN for your [Sentry](https://sentry.io/) project, which monitors back-end errors and reports them back.
 - `SERVER_NAME`*: the server tag that will be used to report events to Sentry.
 - `POSTHOG_KEY`: the project API key for [PostHog](https://eu.posthog.com/settings/project-details).
 - `SLACK_API_TOKEN`: the App key for your Slack bot (Create New App on [Slack](https://api.slack.com/apps), go to OAuth & Permissions and generate a bot User OAuth Token).
 - `SLACK_CHANNEL`: the Slack channel where your bot will post events (defaults to `#general`; you have to invite the App to your channel).
 - `SUPPORT_EMAIL`: the email used for support of your API.
 - `DEBUG`: if set to false, silences debug logs.
+- `OPENAI_API_KEY`**: your API key for OpenAI (Create a new secret key on [OpenAI](https://platform.openai.com/api-keys))
+
+_** marks the deprecated values._
 
 So your `.env` file should look something like [`.env.example`](.env.example).
 The file should be placed in the folder of your `./docker-compose.yml`.
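
For illustration, a minimal `.env` covering the values described above could look like the following — every value here is a placeholder you pick yourself, not a real credential:

```bash
# Required
SUPERADMIN_GH_PAT=github_pat_xxxx   # read-only fine-grained GitHub token
SUPERADMIN_PWD=pick-a-password
GH_OAUTH_ID=your_oauth_app_client_id
GH_OAUTH_SECRET=your_oauth_app_secret
POSTGRES_DB=quack
POSTGRES_USER=quack
POSTGRES_PASSWORD=pick-another-password

# Optional
OLLAMA_MODEL=dolphin-mistral:7b-v2.6-dpo-laser-q4_0
```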
91 changes: 0 additions & 91 deletions docker-compose.ollama.yml

This file was deleted.

34 changes: 30 additions & 4 deletions docker-compose.test.yml
@@ -9,21 +9,24 @@ services:
     ports:
       - "8050:8050"
     environment:
+      - POSTGRES_URL=postgresql+asyncpg://dummy_login:dummy_pwd@test_db/dummy_db
+      - OLLAMA_ENDPOINT=http://ollama:11434
+      - OLLAMA_MODEL=tinydolphin:1.1b-v2.8-q4_0
       - SUPERADMIN_GH_PAT=${SUPERADMIN_GH_PAT}
       - SUPERADMIN_PWD=superadmin_pwd
       - GH_OAUTH_ID=${GH_OAUTH_ID}
       - GH_OAUTH_SECRET=${GH_OAUTH_SECRET}
-      - POSTGRES_URL=postgresql+asyncpg://dummy_login:dummy_pwd@test_db/dummy_db
-      - OPENAI_API_KEY=${OPENAI_API_KEY}
       - DEBUG=true
     depends_on:
       test_db:
         condition: service_healthy
+      ollama:
+        condition: service_healthy
 
   test_db:
     image: postgres:15-alpine
-    ports:
-      - "5432:5432"
+    expose:
+      - 5432
     environment:
       - POSTGRES_USER=dummy_login
       - POSTGRES_PASSWORD=dummy_pwd
@@ -33,3 +36,26 @@
       interval: 10s
       timeout: 3s
       retries: 3
+
+  ollama:
+    image: ollama/ollama:0.1.25
+    command: serve
+    volumes:
+      - "$HOME/.ollama:/root/.ollama"
+    expose:
+      - 11434
+    healthcheck:
+      test: ["CMD-SHELL", "ollama pull 'tinydolphin:1.1b-v2.8-q4_0'"]
+      interval: 5s
+      timeout: 1m
+      retries: 3
+    # deploy:
+    #   resources:
+    #     reservations:
+    #       devices:
+    #         - driver: nvidia
+    #           count: 1
+    #           capabilities: [gpu]
+
+volumes:
+  ollama:
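
To sanity-check this test stack locally the way CI now does (the port probe mirrors the workflow change above; using `ollama list` to confirm the pulled model is an assumption, not part of the test suite):

```bash
docker-compose -f docker-compose.test.yml up -d --build
sleep 20 && nc -vz localhost 8050   # same probe as the CI job
docker-compose -f docker-compose.test.yml exec ollama ollama list   # tinydolphin should be listed
```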
35 changes: 27 additions & 8 deletions docker-compose.yml
@@ -9,23 +9,41 @@ services:
     ports:
       - "8050:8050"
     environment:
+      - POSTGRES_URL=postgresql+asyncpg://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db/${POSTGRES_DB}
+      - OLLAMA_ENDPOINT=http://ollama:11434
+      - OLLAMA_MODEL=${OLLAMA_MODEL}
+      - SECRET_KEY=${SECRET_KEY}
       - SUPERADMIN_GH_PAT=${SUPERADMIN_GH_PAT}
       - SUPERADMIN_PWD=${SUPERADMIN_PWD}
       - GH_OAUTH_ID=${GH_OAUTH_ID}
       - GH_OAUTH_SECRET=${GH_OAUTH_SECRET}
-      - POSTGRES_URL=postgresql+asyncpg://${POSTGRES_USER}:${POSTGRES_PASSWORD}@db/${POSTGRES_DB}
-      - OPENAI_API_KEY=${OPENAI_API_KEY}
-      - SECRET_KEY=${SECRET_KEY}
       - SENTRY_DSN=${SENTRY_DSN}
       - SERVER_NAME=${SERVER_NAME}
       - POSTHOG_KEY=${POSTHOG_KEY}
       - SLACK_API_TOKEN=${SLACK_API_TOKEN}
       - SLACK_CHANNEL=${SLACK_CHANNEL}
       - SUPPORT_EMAIL=${SUPPORT_EMAIL}
       - DEBUG=true
     depends_on:
       db:
         condition: service_healthy
+      ollama:
+        condition: service_healthy
+
+  ollama:
+    image: ollama/ollama:0.1.25
+    command: serve
+    volumes:
+      - "$HOME/.ollama:/root/.ollama"
+    expose:
+      - 11434
+    healthcheck:
+      test: ["CMD-SHELL", "ollama pull '${OLLAMA_MODEL}'"]
+      interval: 5s
+      timeout: 1m
+      retries: 3
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
 
   db:
     image: postgres:15-alpine
@@ -71,3 +89,4 @@
 
 volumes:
   postgres_data:
+  ollama:
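
With `OLLAMA_MODEL` set in your `.env`, the healthcheck above makes Compose pull the model before the backend starts. A quick way to bring the stack up and verify this (a sketch, assuming you run Compose directly rather than through the Makefile):

```bash
# assumes .env sits next to docker-compose.yml and defines OLLAMA_MODEL,
# e.g. OLLAMA_MODEL=dolphin-mistral:7b-v2.6-dpo-laser-q4_0
docker-compose up -d --build
docker-compose ps   # the ollama service reports "healthy" once the model is pulled
```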