Add multimodal API such as using image as part of prompt #40

yaoyaoumbc · 2024-09-06T19:27:27Z

Gemini Nano XS claims to be multimodal, but I could not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.

basvandorst · 2024-12-10T15:59:51Z

+1

The current languageModel context/prefix is too specific and doesn't account for future AI capabilities like image/voice/video interactions. I'd suggest having a look at OpenAI (and others) to see how their APIs aren't strictly tied to a "language model".

OpenAI

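import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        // A single message mixes a text part and an image part.
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0].message.content);
}

main();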

Claude

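import Anthropic from "@anthropic-ai/sdk";

// The client reads the API key from the ANTHROPIC_API_KEY environment variable.
const anthropic = new Anthropic();

// image_media_type (e.g. "image/jpeg") and image_data (a base64 string)
// are placeholders that the caller is assumed to define beforehand.
const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type,
            data: image_data,
          },
        },
      ],
    },
  ],
});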

I think this would also be (partially) the solution for #8. Let's not reinvent the wheel too much.
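
To make the common pattern concrete: both examples above put a list of typed content parts inside a single user message. Below is a minimal sketch of that shape, assuming a hypothetical helper called buildUserMessage and a placeholder image URL. Neither is part of any existing Chrome API or vendor SDK; the part layout simply mirrors the OpenAI example.

```javascript
// Hypothetical helper (not an existing Chrome or SDK API): builds one user
// message whose content is a list of typed parts, mirroring the OpenAI shape.
function buildUserMessage(text, imageUrls = []) {
  // Start with the text part, then append one image part per URL.
  const content = [{ type: "text", text }];
  for (const url of imageUrls) {
    content.push({ type: "image_url", image_url: { url } });
  }
  return { role: "user", content };
}

// Placeholder URL, used purely for illustration.
const exampleMessage = buildUserMessage("What's in this image?", [
  "https://example.com/boardwalk.jpg",
]);
console.log(JSON.stringify(exampleMessage, null, 2));
```

Because every part carries its own type, other modalities such as audio or video could in principle be added as new part types without renaming the surrounding API, which is the point about not tying the naming to "language model".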

domenic added the enhancement New feature or request label Oct 9, 2024
