text part, one or more image_url parts — instead of a plain string.
The current vision-capable model is microsoft/phi-4-multimodal-instruct. Any model with vision in its capabilities list accepts the same request shape.
Example
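A minimal request body might look like the sketch below. The messages shape follows the text-plus-image_url content-list format described above; the prompt text and image URL are placeholders.

```python
# Sketch of a vision request payload. The URL and prompt are
# illustrative placeholders, not real hosted assets.
payload = {
    "model": "microsoft/phi-4-multimodal-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
}
```

The content value is a list of parts rather than a plain string; any vision-capable model accepts this same shape.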
Image sources
The image_url.url field accepts either a public HTTPS URL or a base64 data URL. Data URLs are the right choice when the image isn’t already hosted somewhere, since they avoid a separate pre-upload step.
Python
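A small helper can turn raw image bytes into a data URL for the image_url.url field. This is a sketch; the short byte string stands in for real image data so the snippet runs standalone.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL for image_url.url."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"


# In practice image_bytes would come from open("photo.jpg", "rb").read();
# a stand-in byte string keeps this example self-contained.
data_url = to_data_url(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
image_part = {"type": "image_url", "image_url": {"url": data_url}}
```

The resulting part drops into the content list exactly like a hosted-URL part would.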
Multiple images
Pass more than one image_url part in the same content list to ask the model to reason across several images at once. Part order is preserved in the model’s context.
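A multi-image prompt is just a longer content list; the two URLs below are placeholders.

```python
# Two image_url parts in one user message; the model sees them
# in the order they appear in the list.
content = [
    {"type": "text", "text": "Which of these two photos is sharper?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
]
message = {"role": "user", "content": content}
```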
Usage accounting
Vision tokens count against your key’s budget the same way text tokens do: the usage block in the response includes any vision-encoded tokens in prompt_tokens. There’s no separate “per image” charge for vision input (unlike image generation, which is priced per output image).
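Reading the accounting out of a response looks like the sketch below; the token counts are hypothetical illustration values.

```python
# Hypothetical response fragment: prompt_tokens already includes the
# vision-encoded image tokens, with no separate per-image field.
response = {
    "usage": {
        "prompt_tokens": 845,       # text tokens + vision-encoded tokens
        "completion_tokens": 62,
        "total_tokens": 907,
    }
}

usage = response["usage"]
print(f"billed prompt tokens (incl. vision): {usage['prompt_tokens']}")
```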