Streaming lets you render tokens as the model produces them instead of waiting for the whole response. Set stream: true on any /v1/chat/completions request.
Always set stream_options.include_usage: true when streaming. Without it, the stream omits the final usage chunk, so you cannot bill the request correctly on your side, and per-key spend tracking on ours silently loses information.

Example

curl https://tokens.flex.ai/v1/chat/completions \
  -H "Authorization: Bearer $FLEXAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "Qwen/Qwen2.5-32B-Instruct",
    "messages": [{"role":"user","content":"Count to five."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Anatomy of the stream

The response is text/event-stream. Each event is a line beginning with data: followed by a JSON object, then a blank line:
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"One"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":", "},"index":0,"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}

data: [DONE]
Three things worth knowing:
  1. delta.content is the incremental piece — concatenate them to reconstruct the full response.
  2. Each choice's last chunk carries finish_reason (stop, length, tool_calls, or content_filter); its delta is typically empty. When include_usage is set, the usage chunk still follows it, so it is not the last chunk before [DONE].
  3. The final-before-[DONE] chunk carries usage (only when you set stream_options.include_usage: true). This is the one source of truth for token counts; do not try to count tokens client-side.
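The three rules above can be sketched as a small parser. A minimal example, assuming the raw SSE lines have already been read off the socket (the function name parse_stream is illustrative, not part of the API):

```python
import json

def parse_stream(lines):
    """Accumulate delta.content from SSE lines; return (text, finish_reason, usage)."""
    text_parts, finish_reason, usage = [], None, None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # only present with include_usage: true
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, usage
```

Feeding it the example events above yields the reconstructed text, the finish reason from the empty-delta chunk, and the usage object from the final chunk.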

Tool calls while streaming

Tool call arguments arrive as delta.tool_calls[].function.arguments fragments that you accumulate the same way as delta.content. See the tool use guide for a full example.

Handling disconnects

If the connection drops mid-stream, no retry is safe: LiteLLM does not refund the tokens already streamed, and the client has no idempotency key to de-duplicate a resend, so retrying generates and bills a second, possibly different completion. Treat the stream as best-effort: surface the partial output to the user and let them re-ask.
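One way to structure this, as a sketch: keep the accumulated text outside the read loop, so a dropped connection still leaves you holding the partial output (the function name and the choice of ConnectionError are illustrative; substitute whatever your HTTP client raises on disconnect):

```python
def consume_stream(chunks):
    """Drain a stream of decoded chunks; on disconnect, return what arrived so far."""
    parts, complete = [], False
    try:
        for chunk in chunks:
            for choice in chunk.get("choices", []):
                content = choice.get("delta", {}).get("content")
                if content:
                    parts.append(content)
                if choice.get("finish_reason") == "stop":
                    complete = True
    except ConnectionError:
        pass  # dropped mid-stream: fall through with whatever we collected
    return "".join(parts), complete
```

The complete flag tells you whether a finish_reason of stop ever arrived, which is how you distinguish a clean finish from a truncated stream before deciding to show the partial text.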