Streaming lets you render tokens as the model produces them instead of waiting for the whole response. Set stream: true on any /v1/chat/completions request.
Always set stream_options.include_usage: true when streaming. Without it you cannot bill the request correctly on your side, and per-key spend tracking on ours silently loses information. Every server-sent event framework we recommend below does this by default.

Example

curl https://tokens.flex.ai/v1/chat/completions \
  -H "Authorization: Bearer $FLEXAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role":"user","content":"Count to five."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'
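
For readers calling the API from code rather than a shell, the same request can be sketched in Python using only the standard library. The helper names build_stream_request and stream_lines are hypothetical and exist only for this illustration; the URL, headers, and body mirror the curl command above.

```python
import json
import urllib.request
from typing import Iterator


def build_stream_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build the same POST as the curl example (hypothetical helper)."""
    body = {
        "model": "Qwen/Qwen2.5-32B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        # Ask for the trailing usage chunk so token counts are available.
        "stream_options": {"include_usage": True},
    }
    return urllib.request.Request(
        "https://tokens.flex.ai/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_lines(request: urllib.request.Request) -> Iterator[str]:
    """Yield decoded lines from the response as they arrive."""
    with urllib.request.urlopen(request) as response:
        for raw in response:
            yield raw.decode("utf-8")
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.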

Anatomy of the stream

The response is text/event-stream. Each event is a line beginning with data: followed by a JSON object, then a blank line. The stream ends with the literal sentinel data: [DONE], which is not JSON:
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"One"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":", "},"index":0,"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}

data: [DONE]
Three things worth knowing:
  1. delta.content is the incremental piece. Concatenate the pieces in arrival order to reconstruct the full response.
  2. The last chunk with a non-empty choices array carries finish_reason (stop, length, tool_calls, or content_filter). Its delta is typically empty.
  3. The last chunk before [DONE] carries usage and an empty choices array (only when you set stream_options.include_usage: true). This is the single source of truth for token counts; do not try to count tokens client-side.
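
The three points above can be sketched as a small reducer over the stream's lines. This is a minimal illustration, not a reference client: consume_stream is a hypothetical name, the input is assumed to be already-decoded text lines, and only a single choice (index 0) is assumed.

```python
import json
from typing import Iterable, Optional


def consume_stream(lines: Iterable[str]) -> dict:
    """Fold decoded SSE lines into full text, finish_reason, and usage."""
    text_parts = []
    finish_reason: Optional[str] = None
    usage: Optional[dict] = None

    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only with include_usage
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                text_parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]

    return {
        "text": "".join(text_parts),
        "finish_reason": finish_reason,
        "usage": usage,
    }
```

If usage comes back as None, the request was most likely sent without stream_options.include_usage, or the stream ended before the usage chunk arrived.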

Tool calls while streaming

Tool call arguments arrive as delta.tool_calls[].function.arguments fragments that you accumulate the same way as delta.content. See the tool use guide for a full example.
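
As a rough sketch of that accumulation, the function below merges fragments from already-parsed chunks. It assumes the OpenAI-compatible shape in which each fragment carries an index identifying its tool call, and in which id and function.name appear only on the first fragment; those fields are assumptions not confirmed by this page. accumulate_tool_calls is a hypothetical name.

```python
import json
from typing import Iterable, List


def accumulate_tool_calls(chunks: Iterable[dict]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls = {}  # index -> {"id", "name", "arguments"}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for frag in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    frag.get("index", 0),  # assumed field, see lead-in
                    {"id": None, "name": None, "arguments": ""},
                )
                if frag.get("id"):
                    slot["id"] = frag["id"]
                fn = frag.get("function") or {}
                if fn.get("name"):
                    slot["name"] = fn["name"]
                # Argument fragments are raw string pieces, like delta.content.
                slot["arguments"] += fn.get("arguments") or ""

    # Parse only after the stream ends; partial fragments are not valid JSON.
    return [
        {
            "id": call["id"],
            "name": call["name"],
            "arguments": json.loads(call["arguments"] or "{}"),
        }
        for _, call in sorted(calls.items())
    ]
```

Deferring json.loads until the stream has finished avoids parse errors on half-received argument strings.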

Handling disconnects

If the connection drops mid-stream, no automatic retry is safe: LiteLLM does not re-bill, but the client has no idempotency key to de-duplicate. Treat a stream as best-effort: surface the partial output to the user and let them re-ask.
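
One way to follow this advice is to catch the connection error, keep whatever text has arrived, and report that the stream did not complete. The sketch below is illustrative only: read_until_disconnect is a hypothetical name, the input is assumed to be an iterable of decoded text lines, and a stream that ends without data: [DONE] is treated as incomplete.

```python
import json
from typing import Iterable, Tuple


def read_until_disconnect(lines: Iterable[str]) -> Tuple[str, bool]:
    """Return (text_so_far, completed); completed is False if the stream broke early."""
    parts = []
    try:
        for raw in lines:
            line = raw.strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return "".join(parts), True
            for choice in json.loads(payload).get("choices", []):
                parts.append((choice.get("delta") or {}).get("content") or "")
    except (OSError, json.JSONDecodeError):
        # OSError covers ConnectionError and timeouts; a truncated final event
        # fails JSON parsing. Deliberately no retry in either case.
        pass
    return "".join(parts), False
```

A caller can show the returned text immediately and, when completed is False, mark the answer as cut off and offer the user a way to ask again.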