The Inference API implements a subset of OpenAI’s v1 surface. For everything below marked Supported, code written against api.openai.com works by changing only base_url and model.
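As a sketch of the drop-in pattern, using only the standard library — the base URL, model name, and API key below are placeholders, not real values:

```python
import json
import urllib.request

# Placeholders -- substitute your deployment's values. Relative to code
# written for api.openai.com, only these two identifiers change.
BASE_URL = "https://api.example.com"
MODEL = "example-chat-model"
API_KEY = "sk-..."

def build_chat_request(messages):
    """Build an OpenAI-compatible chat completions request (not sent here)."""
    body = json.dumps({"model": MODEL, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it; omitted to keep this offline.
```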

Endpoints

| Endpoint | Supported | Notes |
| --- | --- | --- |
| `POST /v1/chat/completions` | Yes | Streaming, tools, vision (on multimodal models), `seed`, `response_format`. |
| `POST /v1/completions` | Yes | Legacy; prefer chat completions for new integrations. |
| `GET /v1/models` | Yes | Text/chat models only. Media models are in the catalog. |
| `POST /v1/images/generations` | Yes | Returns `b64_json`; the `url` response format is not supported. |
| `POST /v1/videos/generations` | Yes | Async; not in OpenAI's API today but follows the same auth/shape. |
| `POST /v1/audio/speech` | Yes | TTS; voice names differ per model, so check the model page. |
| `POST /v1/audio/transcriptions` | Yes | Whisper-compatible `multipart/form-data`. |
| `POST /v1/embeddings` | No | No embedding model is deployed. The route is reachable through the gateway but returns `400 invalid_request_error` for any `model` value. |
| `POST /v1/audio/translations` | No | Use transcriptions plus a chat call for translation. |
| `POST /v1/fine-tuning` | No | Use FlexAI Fine-Tuning instead. |
| `POST /v1/assistants`, `/v1/threads`, `/v1/runs` | No | Not planned. |
| `POST /v1/images/edits`, `/v1/images/variations` | No | FLUX.1-Kontext-dev is reserved for future image editing support. |
| `POST /v1/batches` | No | Not planned at launch. |
| `GET /v1/files`, `/v1/files/*` | No | Use inline request payloads. |
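Because the images endpoint returns `b64_json` rather than a URL, clients must decode the payload themselves. A minimal sketch, where `response` is a stand-in for the parsed JSON body of `POST /v1/images/generations`:

```python
import base64

# Stand-in for a parsed images response; the real b64_json field holds the
# base64-encoded image bytes (a PNG header is used here for illustration).
response = {"data": [{"b64_json": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode()}]}

image_bytes = base64.b64decode(response["data"][0]["b64_json"])
# open("out.png", "wb").write(image_bytes)  # persist wherever you need
```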

Request fields

| Field on `POST /v1/chat/completions` | Supported | Notes |
| --- | --- | --- |
| `model`, `messages`, `temperature`, `top_p`, `max_tokens`, `stop`, `seed`, `user` | Yes | |
| `stream`, `stream_options.include_usage` | Yes | Set `include_usage: true` when streaming for correct billing. |
| `tools`, `tool_choice` | Yes | On models labeled `tool_use` in the catalog. |
| `response_format: { type: "json_object" }` | Yes | On most text models. |
| `response_format: { type: "json_schema", ... }` | Partial | Supported where the underlying model supports it; falls back to `json_object` otherwise. |
| `logprobs`, `top_logprobs` | No | Not at launch. |
| `n` (multiple completions per request) | No | Returns a single choice. |
| `presence_penalty`, `frequency_penalty` | Yes | |
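A sketch of a streaming request body that follows the billing note above — `include_usage: true` makes the final stream chunk carry a usage object. The model name is a placeholder:

```python
import json

# Streaming request body sketch; stream_options.include_usage ensures the
# final chunk reports token usage so billing can be reconciled client-side.
payload = {
    "model": "example-chat-model",  # placeholder
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
body = json.dumps(payload)
```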

Response fields

| Field | Supported | Notes |
| --- | --- | --- |
| `id`, `created`, `model`, `choices`, `usage` | Yes | |
| `usage.cache_read_input_tokens`, `usage.cache_creation_input_tokens` | Yes | Zero on models without prompt caching; populated for cache-capable models. |
| `system_fingerprint` | No | Not exposed. |
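A small sketch of reading the usage object, tolerating the cache fields being absent or zero (the token values below are illustrative, not real output):

```python
def usage_summary(usage: dict) -> dict:
    """Collect token counts from a response's usage object; the cache fields
    are zero on models without prompt caching, so default them defensively."""
    return {
        "prompt": usage["prompt_tokens"],
        "completion": usage["completion_tokens"],
        "cache_read": usage.get("cache_read_input_tokens", 0),
        "cache_write": usage.get("cache_creation_input_tokens", 0),
    }

# Illustrative usage object, not real API output:
summary = usage_summary({
    "prompt_tokens": 120,
    "completion_tokens": 30,
    "total_tokens": 150,
    "cache_read_input_tokens": 0,
    "cache_creation_input_tokens": 0,
})
```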

Authentication & headers

  • Bearer tokens only (`Authorization: Bearer sk-…`). No OAuth, no organization header.
  • Rate-limited responses include `Retry-After` and `x-ratelimit-*` headers; see the error reference.
  • CORS is restricted; the API is intended for server-to-server calls, not browsers.
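One way to honor the `Retry-After` header on rate-limited calls — a sketch using only the standard library, with exponential backoff as the fallback when the header is missing:

```python
import time
import urllib.error
import urllib.request

def retry_delay(headers, attempt: int) -> float:
    """Seconds to wait before retrying a 429: prefer the server's
    Retry-After value, else back off exponentially (1s, 2s, 4s, ...)."""
    value = headers.get("Retry-After")
    return float(value) if value is not None else float(2 ** attempt)

def call_with_retry(req, max_attempts: int = 3):
    """Send a request, sleeping per retry_delay on rate-limit responses."""
    for attempt in range(max_attempts):
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(retry_delay(err.headers, attempt))
```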

Error shape

We return OpenAI’s error envelope verbatim:
```json
{ "error": { "message": "…", "type": "invalid_request_error", "code": null, "param": null } }
```
See the error reference for the full list of codes.
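Since the envelope matches OpenAI's, existing parsing code carries over unchanged. A minimal sketch (the message text below is illustrative):

```python
import json

def parse_error(body: str):
    """Extract (type, message) from the OpenAI-style error envelope."""
    err = json.loads(body)["error"]
    return err["type"], err["message"]

# Illustrative error body, matching the envelope shown above:
kind, message = parse_error(
    '{"error": {"message": "unknown model", "type": "invalid_request_error",'
    ' "code": null, "param": null}}'
)
```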