Vision-capable models accept images alongside text in the same message. Pass a list of content parts — one text part, one or more image_url parts — instead of a plain string. The current vision-capable model is microsoft/phi-4-multimodal-instruct. Any model with vision in its capabilities list accepts the same request shape.

Example

curl https://tokens.flex.ai/v1/chat/completions \
  -H "Authorization: Bearer $FLEXAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/phi-4-multimodal-instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
      ]
    }]
  }'
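The same request body can be assembled in Python. A minimal sketch: the helper name `build_vision_payload` is illustrative, and the POST itself is left as a comment so the snippet stays network-free.

```python
import json

def build_vision_payload(prompt: str, image_url: str) -> dict:
    # Mirrors the curl body above: one text part plus one image_url
    # part inside a single user message.
    return {
        "model": "microsoft/phi-4-multimodal-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

payload = build_vision_payload("What is in this image?",
                               "https://example.com/photo.jpg")
body = json.dumps(payload)
# POST body to https://tokens.flex.ai/v1/chat/completions with the
# Authorization and Content-Type headers shown in the curl example.
```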

Image sources

The image_url.url field accepts either a public HTTPS URL or a base64 data URL. Data URLs are the right choice when the image isn’t already hosted somewhere — no pre-upload step required.
Python
import base64
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

data_url = f"data:image/jpeg;base64,{b64}"
# pass data_url as image_url.url in the request above
Supported formats: JPEG, PNG, WebP, non-animated GIF.
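Since the data-URL prefix must match the image format, a small helper can pick the MIME type from the file extension. A sketch building on the snippet above; the function name `file_to_data_url` is illustrative.

```python
import base64
import mimetypes

def file_to_data_url(path: str) -> str:
    # Guess the MIME type from the extension; JPEG, PNG, WebP, and
    # non-animated GIF are the supported formats. Fall back to JPEG
    # if the extension is unrecognized.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"
```

The returned string goes straight into `image_url.url`, exactly like a hosted HTTPS URL.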

Multiple images

Pass more than one image_url part in the same content list to ask the model to reason across several images at once. The order is preserved in the model’s context.
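Building such a content list is mechanical. A sketch (the helper name `multi_image_content` is illustrative): one text part followed by one image_url part per image, in the order you want the model to see them.

```python
def multi_image_content(prompt: str, urls: list[str]) -> list[dict]:
    # Text part first, then one image_url part per image.
    # Part order in the list is the order the model sees.
    parts = [{"type": "text", "text": prompt}]
    parts += [{"type": "image_url", "image_url": {"url": u}} for u in urls]
    return parts

content = multi_image_content(
    "What changed between these two photos?",
    ["https://example.com/before.jpg", "https://example.com/after.jpg"],
)
```

Use `content` as the `content` value of the user message in the request above.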

Usage accounting

Vision tokens count against your key’s budget the same way text tokens do — the usage block in the response includes any vision-encoded tokens in prompt_tokens. There’s no separate “per image” charge for vision input (unlike image generation, which is priced per output image).
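In practice that means budget tracking needs no special case for vision requests: read `prompt_tokens` as usual. A sketch assuming the OpenAI-compatible usage shape; the field values below are illustrative, not real billing numbers.

```python
def prompt_token_count(response: dict) -> int:
    # prompt_tokens already includes vision-encoded image tokens
    # alongside the text tokens; there is no separate per-image field.
    return response["usage"]["prompt_tokens"]

# Illustrative response fragment (values are made up):
resp = {"usage": {"prompt_tokens": 812,
                  "completion_tokens": 64,
                  "total_tokens": 876}}
spent = prompt_token_count(resp) + resp["usage"]["completion_tokens"]
```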