Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.flex.ai/llms.txt

Use this file to discover all available pages before exploring further.

The Inference API implements a subset of OpenAI’s v1 surface. For everything below marked Supported, code written against api.openai.com works by changing only base_url and model.

Endpoints

EndpointSupportedNotes
POST /v1/chat/completionsYesStreaming, tools, vision (on multimodal models), seed, response_format.
POST /v1/completionsYesLegacy; prefer chat completions for new integrations.
GET /v1/modelsYesReturns text, chat, and embedding models routed via LiteLLM. Media models (images, video, audio) are routed through portal-api and are not listed here — see the catalog for the complete set.
POST /v1/images/generationsYesReturns b64_json. url response format not supported.
POST /v1/videos/generationsYesAsync; not in OpenAI’s API today but follows the same auth/shape.
POST /v1/audio/speechYesTTS; voice names differ per model — check the model page.
POST /v1/audio/transcriptionsYesWhisper-compatible multipart/form-data.
POST /v1/embeddingsPartialBAAI/bge-m3 at launch (multilingual, 1024-dim, 8K context). Deviation from OpenAI: our backend currently requires encoding_format in the request body, while OpenAI treats it as optional (defaulting to "float"). Set it explicitly for now — SDKs that omit it will get a 400. We plan to default it server-side so this caveat goes away. See the model catalog.
POST /v1/audio/translationsNoUse transcriptions + a chat call for translation.
POST /v1/fine-tuningNoUse FlexAI Fine-Tuning instead.
POST /v1/assistants, /v1/threads, /v1/runsNoNot planned.
POST /v1/images/edits, /v1/images/variationsNoFLUX.1-Kontext-dev reserved for future image editing support.
POST /v1/batchesNoNot planned at launch.
GET /v1/files, /v1/files/*NoUse inline request payloads.

Request fields

Field on POST /v1/chat/completionsSupportedNotes
model, messages, temperature, top_p, max_tokens, stop, seed, userYes
stream, stream_options.include_usageYesSet include_usage: true when streaming for correct billing.
tools, tool_choiceYesOn models labeled tool_use in the catalog.
response_format: { type: "json_object" }YesOn most text models.
response_format: { type: "json_schema", ... }PartialSupported where the underlying model supports it; falls back to json_object otherwise.
logprobs, top_logprobsNoNot at launch.
n (multiple completions per request)NoReturns a single choice.
presence_penalty, frequency_penaltyYes

Response fields

FieldSupportedNotes
id, created, model, choices, usageYes
usage.cache_read_input_tokens, usage.cache_creation_input_tokensYesZero on models without prompt caching; populated for cache-capable models.
system_fingerprintNoNot exposed.

Authentication & headers

  • Bearer tokens only (Authorization: Bearer sk-…). No OAuth, no organization header.
  • Our rate limit response includes Retry-After and x-ratelimit-* headers — see errors.
  • CORS is restricted; the API is intended for server-to-server calls, not browsers.

Error shape

We return OpenAI’s error envelope verbatim:
{ "error": { "message": "…", "type": "invalid_request_error", "code": null, "param": null } }
See the error reference for the full list of codes.