# Vision

> Send images to multimodal models using the OpenAI `image_url` content shape.

Vision-capable models accept images alongside text in the same message. Instead of a plain string, set the message's `content` to a list of content parts: one `text` part and one or more `image_url` parts.

The current vision-capable model is `phi-4-multimodal-instruct`. Any model with `vision` in its [capabilities list](https://flex.ai/models) accepts the same request shape.

## Example

<CodeGroup>
  ```bash cURL
  curl https://tokens.flex.ai/v1/chat/completions \
    -H "Authorization: Bearer $FLEXAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "phi-4-multimodal-instruct",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
        ]
      }]
    }'
  ```

  ```python Python
  import os

  from openai import OpenAI

  # Reuse the OpenAI SDK with a custom base URL; the API key comes from the environment.
  client = OpenAI(base_url="https://tokens.flex.ai/v1", api_key=os.environ["FLEXAI_API_KEY"])

  response = client.chat.completions.create(
      model="phi-4-multimodal-instruct",
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": "What is in this image?"},
              {"type": "image_url", "image_url": {
                  "url": "https://example.com/photo.jpg",
              }},
          ],
      }],
  )
  print(response.choices[0].message.content)
  ```

  ```typescript TypeScript
  import OpenAI from "openai";
  const client = new OpenAI({
    baseURL: "https://tokens.flex.ai/v1",
    apiKey: process.env.FLEXAI_API_KEY,
  });

  const response = await client.chat.completions.create({
    model: "phi-4-multimodal-instruct",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } },
      ],
    }],
  });
  console.log(response.choices[0].message.content);
  ```
</CodeGroup>

## Image sources

The `image_url.url` field accepts either a public HTTPS URL or a base64 data URL. Use a data URL when the image isn't already hosted anywhere: the image bytes travel inside the request itself, so no pre-upload step is required.

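```python Python
import base64

# Read the image as raw bytes and base64-encode it into an ASCII string.
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

# The MIME type in the prefix must match the actual image format.
data_url = f"data:image/jpeg;base64,{b64}"
# Pass data_url as image_url.url in the request above.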
```

Supported formats: JPEG, PNG, WebP, and non-animated GIF.
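
Because the MIME type in the data URL has to match the file, it can help to derive it from the file extension and reject anything outside the formats listed above. The following is a minimal sketch under that assumption; `image_to_data_url` and `EXTENSION_TO_MIME` are hypothetical names, not part of any SDK, and the check looks only at the extension (it does not detect animated GIFs or mislabeled files).

```python
import base64
from pathlib import Path

# Extension-to-MIME mapping for the formats this page lists as supported.
# Hypothetical helper table; extend it only if the supported list changes.
EXTENSION_TO_MIME = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (illustrative helper)."""
    file_path = Path(path)
    mime_type = EXTENSION_TO_MIME.get(file_path.suffix.lower())
    if mime_type is None:
        raise ValueError(f"unsupported image extension: {file_path.suffix!r}")
    # b64encode returns bytes; decode to str so it can be embedded in the URL.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used directly as the `url` value of an `image_url` part.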

## Multiple images

Pass more than one `image_url` part in the same `content` list to ask the model to reason across several images at once. The model receives the parts in the order they appear in the list.
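
As an illustration, the sketch below builds a `content` list with one `text` part followed by one `image_url` part per image. `build_vision_content` is a hypothetical helper name and the example URLs are placeholders; the part shapes are the ones used in the example request earlier on this page.

```python
from typing import Any


def build_vision_content(prompt: str, image_urls: list[str]) -> list[dict[str, Any]]:
    """Return one text part followed by one image_url part per URL (illustrative helper)."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    parts: list[dict[str, Any]] = [{"type": "text", "text": prompt}]
    # Append images in the given order so the list order matches the caller's intent.
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return parts


# Placeholder URLs; HTTPS URLs and base64 data URLs can be mixed in the same list.
messages = [{
    "role": "user",
    "content": build_vision_content(
        "What changed between these two screenshots?",
        ["https://example.com/before.png", "https://example.com/after.png"],
    ),
}]
```

The resulting `messages` list can be passed as the `messages` argument of `client.chat.completions.create`.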

## Usage accounting

Vision tokens count against your key's budget the same way text tokens do — the `usage` block in the response includes any vision-encoded tokens in `prompt_tokens`. There's no separate "per image" charge for vision input (unlike image *generation*, which is [priced per output image](/inference-api/reference/billing)).
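
To see how much an image contributes, one option is to log the `usage` block after each request. This is a minimal sketch: `describe_usage` is a hypothetical helper, and it assumes the response carries the standard OpenAI-style `completion_tokens` and `total_tokens` fields alongside `prompt_tokens`. The numbers in the stand-in object are placeholders, not real measurements.

```python
from types import SimpleNamespace


def describe_usage(usage) -> str:
    """Format token counts from a response's usage block (illustrative helper)."""
    # Vision-encoded tokens are included in prompt_tokens, per the text above.
    return (
        f"prompt={usage.prompt_tokens} "
        f"completion={usage.completion_tokens} "
        f"total={usage.total_tokens}"
    )


# Stand-in for response.usage so the sketch runs without a network call.
# Placeholder values only; real counts depend on the model and the image.
example_usage = SimpleNamespace(prompt_tokens=800, completion_tokens=50, total_tokens=850)
summary = describe_usage(example_usage)
```

With a real response, call `describe_usage(response.usage)` instead of using the stand-in object.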
