
ollama.generate() with multimodal llama3.2-vision does not pass images array when raw=True #319

Closed
rylativity opened this issue Nov 9, 2024 · 12 comments
@rylativity

I am using ollama.generate() to process invoice document images. I would like to set raw=True so I can specify my own full prompt without it being passed into the default template.

Here is what I would like to be able to do:

```python
ollama.generate(
    model="llama3.2-vision",
    prompt="Describe the contents of the photo",
    raw=True,
    images=[img_bytes]
)
```

However, when I set raw=True and pass a prompt and image to the model as shown in the screenshot below, the images array does not appear to get passed to the LLM, as evidenced by the model's hallucinated response when asked to describe the image.

[Screenshot: with raw=True, the model hallucinates a description of an image it never received]

In other instances I've even seen it explicitly state "there is no image for me to describe", further suggesting that the images are not making it to the LLM.

The next screenshot (below) shows a response when I set raw=False (the default argument value), where the model is clearly receiving the image and is able to provide an appropriate, expected response.

[Screenshot: with raw=False, the model correctly describes the image]

I know that additional templating and formatting is applied to the inputs when raw=False, based on the model's Ollama configuration, but I would expect there to be a way to pass images to a multimodal model even when passing the prompt through as raw.

I also understand that there are peculiarities with multimodal models in Ollama, so: is this a known limitation, is it a bug, or are there additional steps required to pass images alongside a prompt when raw=True?

Thanks in advance for your help.

@ParthSareen (Contributor)

Might be better to use the chat interface instead in this case:

```python
import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['image.jpg']
    }]
)

print(response)
```

Let me know if this helps! I will check it out on the generate end in the meantime :)

@rylativity (Author)

Hi @ParthSareen thanks for your quick reply.

The example prompt, "What is in this image?", is not the actual prompt for my use case. It was just the simplest prompt I could use both to determine that the image was not getting passed when raw=True and to demonstrate the issue here.

For my actual use case, I would like to pass the prompt through as raw so that I can guide the output and ensure it is parsable. The way I generally do that is by starting the assistant's response for it and setting a reasonable stop sequence. For example, in the prompt below, I include the start/end sequence tokens for llama3, and I begin the assistant's response with "```csv". I would then set a stop sequence of three backticks so that I can force the LLM to respond in a (somewhat) predictable format:

````
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Extract all data from this page image and return them as a CSV table with column headers.

In the table, include date, description, individual name, hours, rate, amount. Return the CSV table only, between backticks as shown below.

Here is an example output row from another image to give you an example

```csv
Date,Description,Individual,Hours,Rate,Amount
3/7/1999,,Bob Ross,3,100,300
```<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```csv
````
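
Concretely, the call I'm aiming for would look something like the sketch below (the file names are illustrative, and it assumes the images actually reach the model when raw=True, which is exactly what doesn't work today):

```python
import ollama

# The fully templated llama3 prompt shown above, ending with the opening
# csv fence so the model continues the table instead of adding preamble.
raw_prompt = open("invoice_prompt.txt").read()  # illustrative file

with open("invoice_page.png", "rb") as f:  # illustrative image
    img_bytes = f.read()

response = ollama.generate(
    model="llama3.2-vision",
    prompt=raw_prompt,
    raw=True,
    images=[img_bytes],
    options={"stop": ["```"]},  # stop at the closing fence so the output stays parsable
)
print(response["response"])
```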

I would like to be able to pass a prompt like the one above, together with an image, into ollama.generate() with raw=True to generate predictably formatted output. If I use ollama.generate() with raw=False, or if I use ollama.chat(), the LLM always begins its response with something like "Sure, here is the data you requested formatted as a table..." (which then requires post-processing to parse), because my prompt is being filled into a chat prompt template.

As a workaround, I've split this into two prompts and two ollama.generate() calls - one with raw=False to just identify the data in the image, and a second with raw=True that takes the text-only output from the first call (but not the image itself) and reformats it into the expected structure. But this isn't ideal, because each page takes approximately twice as long to process with two LLM calls per page.

Ideally, I would like to be able to pass the example prompt above and an image to ollama.generate() with raw=True and still have the model be able to "see" the image, so that I can accomplish the task with a single LLM call.

ParthSareen self-assigned this Nov 12, 2024
@ParthSareen (Contributor)

Dug through our API and the Python SDK - I'd recommend using the .chat functionality even in this case.

I think you'd still be able to get away with one LLM call and just do post-processing after you get the response from the model. If your pipeline is vision model -> structured data, that would also explain why you're not getting great results back. Vision LLM -> cleanup LLM or functions -> structured data makes more sense given Ollama's structure for now, but we do have a lot in the pipeline along similar lines.
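
If it helps, here's a rough sketch of that one-call-plus-post-processing idea (the regex, prompt wording, and file name are illustrative, not anything Ollama-specific):

```python
import re
import ollama

# Single vision call: ask for the CSV between backtick fences.
response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Extract the table from this page image and return it '
                   'as a CSV table between triple backticks.',
        'images': ['invoice_page.jpg'],  # illustrative path
    }]
)

# Post-process: strip any chatty preamble and keep only the fenced CSV.
text = response['message']['content']
match = re.search(r'```(?:csv)?\s*(.*?)```', text, re.DOTALL)
csv_text = match.group(1).strip() if match else text
print(csv_text)
```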

@rylativity (Author)

Thanks @ParthSareen, I appreciate the advice, and that is helpful.

But back to the root issue - images do not seem to be making it to the LLM when calling ollama.generate() with raw=True (e.g. ollama.generate(prompt="abc", images=[img_bytes], raw=True)). I am trying to understand whether that is a bug (i.e., it is supposed to work today), a roadmap item, or a capability that will never be implemented for one reason or another.

Is there a correct way to pass images to a multimodal LLM in Ollama when calling ollama.generate() with raw=True today, or is that simply not possible right now? And if it is not possible, is it something you think might be addressed down the line?

@ParthSareen (Contributor)

Hey! Yeah for sure.

  • It's expected not to work at this time (sorry for not making that clear).
  • raw=True with images is not possible at this time.
  • I think we'll be working on improving our prompt templating and making it clearer - possibly defining some way of doing this. So not a hard yes, but also not a hard no at this time.

@rylativity (Author)

Ok, thanks for clarifying and for all the helpful info!

@jhud commented Dec 21, 2024

I checked the code (https://github.com/ollama/ollama/blob/d8bab8ea4403d3fb05a9bf408e638195b72bebf9/server/routes.go#L234C14-L234C15) and I can't see a reason why images + raw shouldn't work (I don't know Go). This would be super useful to me - do you have any pointers as to why it's not possible, so that I could try forking it? Thank you!

@rylativity (Author)

> I checked the code (https://github.com/ollama/ollama/blob/d8bab8ea4403d3fb05a9bf408e638195b72bebf9/server/routes.go#L234C14-L234C15) and I can't see a reason why images + raw shouldn't work (I don't know Go). This would be super useful to me - do you have any pointers as to why it's not possible, so that I could try forking it? Thank you!

I am also not super familiar with Go, so I could be off base here, but at first glance it seems the issue with images + raw might arise from the following:

In the file you linked, lines 205-231 preprocess the images regardless of whether raw is true or false. However, lines 260-266, which seem to add the template fields needed for the images to be passed into the completion() call later on, only run if raw is false. When raw is true, lines 235-288, which include the code responsible for adding the image fields to the prompt, do not run.

When we finally get down to line 297, which seems to create the completion object passed to the API, the prompt handed over there does not include the template fields the completion API needs in order to add the images to the prompt.

If that is indeed what is happening, then to make images + raw prompts work correctly, either the user would need to add the appropriate image template fields to their raw prompt before calling ollama.generate(), or Ollama would need slightly opinionated logic that adds the necessary image template fields to the raw prompt, perhaps enabled by an argument to ollama.generate() (e.g. ollama.generate(prompt=prompt, images=images, raw=True, add_images_to_prompt=True)).

@rylativity (Author)

As I mentioned, I am not particularly familiar with Go, so I'm wary of spinning my wheels on this.

But I'd be curious to hear @ParthSareen's opinion on whether the solution above might work or other things to look out for, and I'd be happy to try to implement the fix described above if it seems like a reasonable solution.

@rylativity (Author)

I have opened a pull request here for the implemented functionality - ollama/ollama#8209

In the meantime, I have discovered that you can manually pass a placeholder for your image(s) inside your raw prompt and the multimodal model will then be able to see them.

Here is an example request that will work right now even without the PR being merged:

```sh
curl --request POST \
  --url http://localhost:11434/api/generate \
  --header 'content-type: application/json' \
  --data '{
  "model": "llama3.2-vision",
  "prompt": "[img-0]<image>What is in this picture?",
  "raw":true,
  "stream":false,
  "images": ["iVBORw0KGgoAAAANSUhEUgAAAG0AAABmCAYAAADBPx+VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAA3VSURBVHgB7Z27r0zdG8fX743i1bi1ikMoFMQloXRpKFFIqI7LH4BEQ+NWIkjQuSWCRIEoULk0gsK1kCBI0IhrQVT7tz/7zZo888yz1r7MnDl7z5xvsjkzs2fP3uu71nNfa7lkAsm7d++Sffv2JbNmzUqcc8m0adOSzZs3Z+/XES4ZckAWJEGWPiCxjsQNLWmQsWjRIpMseaxcuTKpG/7HP27I8P79e7dq1ars/yL4/v27S0ejqwv+cUOGEGGpKHR37tzJCEpHV9tnT58+dXXCJDdECBE2Ojrqjh071hpNECjx4cMHVycM1Uhbv359B2F79+51586daxN/+pyRkRFXKyRDAqxEp4yMlDDzXG1NPnnyJKkThoK0VFd1ELZu3TrzXKxKfW7dMBQ6bcuWLW2v0VlHjx41z717927ba22U9APcw7Nnz1oGEPeL3m3p2mTAYYnFmMOMXybPPXv2bNIPpFZr1NHn4HMw0KRBjg9NuRw95s8PEcz/6DZELQd/09C9QGq5RsmSRybqkwHGjh07OsJSsYYm3ijPpyHzoiacg35MLdDSIS/O1yM778jOTwYUkKNHWUzUWaOsylE00MyI0fcnOwIdjvtNdW/HZwNLGg+sR1kMepSNJXmIwxBZiG8tDTpEZzKg0GItNsosY8USkxDhD0Rinuiko2gfL/RbiD2LZAjU9zKQJj8RDR0vJBR1/Phx9+PHj9Z7REF4nTZkxzX4LCXHrV271qXkBAPGfP/atWvu/PnzHe4C97F48eIsRLZ9+3a3f/9+87dwP1JxaF7/3r17ba+5l4EcaVo0lj3SBq5kGTJSQmLWMjgYNei2GPT1MuMqGTDEFHzeQSP2wi/jGnkmPJ/nhccs44jvDAxpVcxnq0F6eT8h4ni/iIWpR5lPyA6ETkNXoSukvpJAD3AsXLiwpZs49+fPn5ke4j10TqYvegSfn0OnafC+Tv9ooA/JPkgQysqQNBzagXY55nO/oa1F7qvIPWkRL12WRpMWUvpVDYmxAPehxWSe8ZEXL20sadYIozfmNch4QJPAfeJgW3rNsnzphBKNJM2KKODo1rVOMRYik5ETy3ix4qWNI81qAAirizgMIc+yhTytx0JWZuNI03qsrgWlGtwjoS9XwgUhWGyhUaRZZQNNIEwCiXD16tXcAHUs79co0vSD8rrJCIW98pzvxpAWyyo3HYwqS0+H0BjStClcZJT5coMm6D2LOF8TolGJtK9fvyZpyiC5ePFi9nc/oJU4eiEP0jVoAnHa9wyJycITMP78+eMeP37sXrx44d6+fdt6f82aNdkx1pg9e3Zb5W+RSRE+n+VjksQWifvVaTKFhn5O8my63K8Qabdv33b379/PiAP//vuvW7BggZszZ072/+TJk91YgkafPn166zXB1rQHFvouAWHq9z3SEevSUerqCn2/dDCeta2jxYbr69evk4MHDyY7d+7MjhMnTiTPnz9Pfv/+nfQT2ggpO2dMF8cghuoM7Ygj5iWCqRlGFml0QC/ftGmTmzt3rmsaKDsgBSPh0/8yPeLLBihLkOKJc0jp8H8vUzcxIA1k6QJ/c78tWEyj5P3o4u9+jywNPdJi5rAH9x0KHcl4Hg570eQp3+vHXGyrmEeigzQsQsjavXt38ujRo44LQuDDhw+TW7duRS1HGgMxhNXHgflaNTOsHyKvHK5Ijo2jbFjJBQK9YwFd6RVMzfgRBmEfP37suBBm/p49e1qjEP2mwTViNRo0VJWH1deMXcNK08uUjVUu7s/zRaL+oLNxz1bpANco4npUgX4G2eFbpDFyQoQxojBCpEGSytmOH8qrH5Q9vuzD6ofQylkCUmh8DBAr+q8JCyVNtWQIidKQE9wNtLSQnS4jDSsxNHogzFuQBw4cyM61UKVsjfr3ooBkPSqqQHesUPWVtzi9/vQi1T+rJj7WiTz4Pt/l3LxUkr5P2VYZaZ4URpsE+st/dujQoaBBYokbrz/8TJNQYLSonrPS9kUaSkPeZyj1AWSj+d+VBoy1pIWVNed8P0Ll/ee5HdGRhrHhR5GGN0r4LGZBaj8oFDJitBTJzIZgFcmU0Y8ytWMZMzJOaXUSrUs5RxKnrxmbb5YXO9VGUhtpXldhEUogFr3IzIsvlpmdosVcGVGXFWp2oU9kLFL3dEkSz6NHEY1sjSRdIuDFWEhd8KxFqsRi1uM/nz9/zpxnwlESONdg6dKlbsaMGS4EHFHtjFIDHwKOo46l4TxSuxgDzi+rE2jg+BaFruOX4HXa0Nnf1lwAPufZeF8/r6zD97WK2qFnGjBxTw5qNGPxT+5T/r7/7RawFC3j4vTp09koCxkeHjqbHJqArmH5UrFKKksnxrK7FuRIs8STfBZv+luugXZ2pR/pP9Ois4z+TiMzUUkUjD0iEi1fzX8GmXyuxUBRcaUfykV0YZnlJGKQpOiGB76x5GeWkWWJc3mOrK6S7xdND+W5N6XyaRgtWJFe13GkaZnKOsYqGdOVVVbGupsyA/l7emTLHi7vwTdirNEt0qxnzAvBFcnQF16xh/TMpUuXHDowhlA9vQVraQhkudRdzOnK+04ZSP3DUhVSP61YsaLtd/ks7ZgtPcXqPqEafHkdqa84X6aCeL7YWlv6edGFHb+ZFICPlljHhg0bKuk0CSvVznWsotRu433alNdFrqG45ejoaPCaUkWERpLXjzFL2Rpllp7PJU2a/v7Ab8N05/9t27Z16KUqoFGsxnI9EosS2niSYg9SpU6B4JgTrvVW1flt1sT+0ADIJU2maXzcUTraGCRaL1Wp9rUMk16PMom8QhruxzvZIegJjFU7LLCePfS8uaQdPny4jTTL0dbee5mYokQsXTIWNY46kuMbnt8Kmec+LGWtOVIl9cT1rCB0V8WqkjAsRwta93TbwNYoGKsUSChN44lgBNCoHLHzquYKrU6qZ8lolCIN0Rh6cP0Q3U6I6IXILYOQI513hJaSKAorFpuHXJNfVlpRtmYBk1Su1obZr5dnKAO+L10Hrj3WZW+E3qh6IszE37F6EB+68mGpvKm4eb9bFrlzrok7fvr0Kfv727dvWRmdVTJHw0qiiCUSZ6wCK+7XL/AcsgNyL74DQQ730sv78Su7+t/A36MdY0sW5o40ahslXr58aZ5HtZB8GH64m9EmMZ7FpYw4T6QnrZfgenrhFxaSiSGXtPnz57e9TkNZLvTjeqhr734CNtrK41L40sUQckmj1lGKQ0rC37x544r8eNXRpnVE3ZZY7zXo8NomiO0ZUCj2uHz58rbXoZ6gc0uA+F6ZeKS/jhRDUq8MKrTho9fEkihMmhxtBI1DxKFY9XLpVcSkfoi8JGnToZO5sU5aiDQIW716ddt7ZLYtMQlhECdBGXZZMWldY5BHm5xgAroWj4C0hbYkSc/jBmggIrXJWlZM6pSETsEPGqZOndr2uuuR5
rF169a2HoHPdurUKZM4CO1WTPqaDaAd+GFGKdIQkxAn9RuEWcTRyN2KSUgiSgF5aWzPTeA/lN5rZubMmR2bE4SIC4nJoltgAV/dVefZm72AtctUCJU2CMJ327hxY9t7EHbkyJFseq+EJSY16RPo3Dkq1kkr7+q0bNmyDuLQcZBEPYmHVdOBiJyIlrRDq41YPWfXOxUysi5fvtyaj+2BpcnsUV/oSoEMOk2CQGlr4ckhBwaetBhjCwH0ZHtJROPJkyc7UjcYLDjmrH7ADTEBXFfOYmB0k9oYBOjJ8b4aOYSe7QkKcYhFlq3QYLQhSidNmtS2RATwy8YOM3EQJsUjKiaWZ+vZToUQgzhkHXudb/PW5YMHD9yZM2faPsMwoc7RciYJXbGuBqJ1UIGKKLv915jsvgtJxCZDubdXr165mzdvtr1Hz5LONA8jrUwKPqsmVesKa49S3Q4WxmRPUEYdTjgiUcfUwLx589ySJUva3oMkP6IYddq6HMS4o55xBJBUeRjzfa4Zdeg56QZ43LhxoyPo7Lf1kNt7oO8wWAbNwaYjIv5lhyS7kRf96dvm5Jah8vfvX3flyhX35cuX6HfzFHOToS1H4BenCaHvO8pr8iDuwoUL7tevX+b5ZdbBair0xkFIlFDlW4ZknEClsp/TzXyAKVOmmHWFVSbDNw1l1+4f90U6IY/q4V27dpnE9bJ+v87QEydjqx/UamVVPRG+mwkNTYN+9tjkwzEx+atCm/X9WvWtDtAb68Wy9LXa1UmvCDDIpPkyOQ5ZwSzJ4jMrvFcr0rSjOUh+GcT4LSg5ugkW1Io0/SCDQBojh0hPlaJdah+tkVYrnTZowP8iq1F1TgMBBauufyB33x1v+NWFYmT5KmppgHC+NkAgbmRkpD3yn9QIseXymoTQFGQmIOKTxiZIWpvAatenVqRVXf2nTrAWMsPnKrMZHz6bJq5jvce6QK8J1cQNgKxlJapMPdZSR64/UivS9NztpkVEdKcrs5alhhWP9NeqlfWopzhZScI6QxseegZRGeg5a8C3Re1Mfl1ScP36ddcUaMuv24iOJtz7sbUjTS4qBvKmstYJoUauiuD3k5qhyr7QdUHMeCgLa1Ear9NquemdXgmum4fvJ6w1lqsuDhNrg1qSpleJK7K3TF0Q2jSd94uSZ60kK1e3qyVpQK6PVWXp2/FC3mp6jBhKKOiY2h3gtUV64TWM6wDETRPLDfSakXmH3w8g9Jlug8ZtTt4kVF0kLUYYmCCtD/DrQ5YhMGbA9L3ucdjh0y8kOHW5gU/VEEmJTcL4Pz/f7mgoAbYkAAAAAElFTkSuQmCC"]
}'
```

Note that the [img-0]<image> tag can be placed anywhere in the prompt, and the 0 corresponds to the index of the image in the images array.
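
For anyone using the Python client, the equivalent of the curl call above looks roughly like this (a sketch; it assumes the client forwards the raw prompt and images to the API unchanged):

```python
import ollama

with open('picture.png', 'rb') as f:  # illustrative image path
    img_bytes = f.read()

# [img-0]<image> marks where image 0 from the images array is injected
# into the raw prompt.
response = ollama.generate(
    model='llama3.2-vision',
    prompt='[img-0]<image>What is in this picture?',
    raw=True,
    stream=False,
    images=[img_bytes],
)
print(response['response'])
```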

@jhud commented Dec 22, 2024

> I have opened a pull request here for the implemented functionality - ollama/ollama#8209
>
> In the meantime, I have discovered that you can manually pass a placeholder for your image(s) inside your raw prompt and the multimodal model will then be able to see them.

Thank you - fantastic work! Wonderful to wake up and see that someone has already worked out a solution.

For what it's worth, here is my obvious patch, using your find:

```python
prompt = prompt.replace("<|image|>", "[img-0]<image>")
```

Then the usual API call:

```python
import httpx

# Wrapped in a function so the free variables in the original snippet
# (endpoint, prompt, images, etc.) become parameters.
async def generate_raw(endpoint, prompt, images, temperature, stop_sequences, max_tokens):
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            endpoint + "/generate",
            json={
                "model": "llama3.2-vision:latest",
                "prompt": prompt,
                "raw": True,
                "options": {
                    "temperature": temperature,
                    "stop": stop_sequences,
                    "num_predict": max_tokens,
                },
                "images": images,
            },
            timeout=None,
        ) as response:
            async for chunk in response.aiter_bytes():
                # Each chunk is part of the newline-delimited JSON stream.
                print(chunk.decode("utf-8"), end="")
```

@rylativity (Author)

@jhud glad to hear that this has helped!

I have opened this related PR on the python library to support the proposed new kwarg in the python client - #392
