Streaming lets you render tokens as the model produces them instead of waiting for the whole response. Set stream: true on any /v1/chat/completions request.
Always set stream_options.include_usage: true when streaming. Without it you cannot bill the request correctly on your side, and per-key spend tracking on ours silently loses information. Every server-sent event framework we recommend below does this by default.

Example

curl https://tokens.flex.ai/v1/chat/completions \
  -H "Authorization: Bearer $FLEXAI_API_KEY" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "Qwen2.5-32B-Instruct-FP8",
    "messages": [{"role":"user","content":"Count to five."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Anatomy of the stream

The response is text/event-stream. Each event is a line beginning with data: followed by a JSON object, then a blank line. The one exception is the final event, data: [DONE], which is a literal sentinel rather than JSON:
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"One"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":", "},"index":0,"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}

data: [DONE]
Three things worth knowing:
  1. delta.content is the incremental piece of text in each chunk; concatenate the pieces in order to reconstruct the full response.
  2. The last chunk with a non-empty choices array carries finish_reason (stop, length, tool_calls, or content_filter). Its delta is typically empty.
  3. The last chunk before [DONE] carries usage and has an empty choices array. It is sent only when you set stream_options.include_usage: true. This is the one source of truth for token counts; do not try to count tokens client-side.
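
As a rough illustration of the three points above, the following Python sketch folds a stream into its text, finish_reason, and usage. It assumes the lines have already been decoded to strings and that only choice index 0 matters; consume_stream is a hypothetical helper name, not part of any SDK.

```python
import json


def consume_stream(lines):
    """Fold an iterable of SSE lines into (text, finish_reason, usage)."""
    parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and anything that is not a data line.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # The sentinel is not JSON, so check for it before parsing.
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # The usage chunk has an empty choices list, so iterate instead of indexing.
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue  # assumption: a single choice is requested
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return "".join(parts), finish_reason, usage
```

If the stream ends without a usage chunk, the function returns None for usage. A caller can treat that as a sign that include_usage was not set or that the stream was cut short.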

Tool calls while streaming

Tool call arguments arrive as delta.tool_calls[].function.arguments fragments that you accumulate the same way as delta.content. See the tool use guide for a full example.
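
The accumulation can be sketched roughly as follows. This page only confirms the function.arguments fragments. The index, id, and function.name fields used here are assumptions based on the common OpenAI-compatible chunk shape, and accumulate_tool_calls is a hypothetical helper name. The tool use guide remains the authoritative example.

```python
def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls.

    `deltas` is an iterable of the `delta` objects taken from each chunk.
    Assumption: every fragment carries an `index` that identifies which
    tool call it belongs to.
    """
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            slot = calls.setdefault(
                frag.get("index", 0),
                {"id": None, "name": None, "arguments": ""},
            )
            if frag.get("id"):
                slot["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                slot["name"] = fn["name"]
            # Argument fragments are partial JSON text; append them in arrival order.
            slot["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]
```

The accumulated arguments string is only guaranteed to be complete JSON once the stream has finished, so it is safest to call json.loads on it after the chunk carrying finish_reason has arrived.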

Handling disconnects

No resume protocol

There is no Last-Event-ID-style resume. A dropped TCP connection ends the stream — reconnecting issues a brand-new request with a fresh chatcmpl-… id, not a continuation of the old one. Tokens emitted before the disconnect are not retransmitted. If you need durability across flaky networks, persist the partial output your client has already received and re-prompt with the conversation so far when you reconnect; the model will not pick up exactly where it left off, but it will continue the conversation.
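
One way to build the re-prompt described above is sketched below. The helper name build_resume_messages and the wording of the continuation instruction are assumptions for illustration, not something the API prescribes.

```python
def build_resume_messages(messages, partial_output):
    """Return a new message list: the original conversation plus the partial reply.

    The original list is not mutated, so the caller can still retry from scratch.
    """
    resumed = list(messages)
    if partial_output:
        # Replay what the client already received as an assistant turn.
        resumed.append({"role": "assistant", "content": partial_output})
        # Hypothetical wording; adjust it to your application.
        resumed.append({
            "role": "user",
            "content": "Your previous reply was cut off. Please continue from where it stopped.",
        })
    return resumed
```

The returned list goes into the messages field of a new request. That request is billed independently and gets its own id, as described above.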

Cancellation billing

When the client closes the connection mid-generation, the server propagates the cancellation to the model backend and bills only the tokens actually emitted up to the disconnect, not the requested max_tokens. Generation stops once the connection is closed, so it is safe to disconnect early to cap latency or cost. The matching warehouse row will reflect the tokens you received, not the budget you requested.
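
A minimal sketch of disconnecting early, assuming you already have an iterator of text pieces and a callable that closes the underlying connection (for example, the close method of your HTTP client's response object). The name read_with_budget and the character budget are hypothetical. The budget is a crude client-side cap, not a token count.

```python
def read_with_budget(pieces, max_chars, close):
    """Collect streamed text until max_chars is reached, then close the connection.

    `pieces` yields text fragments (e.g. each delta.content);
    `close` is a zero-argument callable that drops the connection.
    """
    received = []
    total = 0
    try:
        for piece in pieces:
            received.append(piece)
            total += len(piece)
            if total >= max_chars:
                break  # stop reading; the finally block closes the connection
    finally:
        # Close in every case so the server sees the disconnect promptly.
        close()
    return "".join(received)
```

The piece that crosses the budget is kept, so the result can slightly exceed max_chars.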

Retrying

A retry is a fresh, independently billed request. There is no server-side idempotency key, so we cannot deduplicate. Pragmatic guidance:
  • Short prompts: retry directly; the duplicate cost is bounded.
  • Long generations: surface the partial output to the user and let them decide whether to continue. Echo what you already have back as context if they choose to retry.
See Concurrency for the per-key rate limit headers that govern how aggressively you can retry.