
# Streaming

> Stream tokens as they are generated and collect the final usage block for accurate spend tracking.

Streaming lets you render tokens as the model produces them instead of waiting for the whole response. Set `stream: true` on any `/v1/chat/completions` request.

<Note>
  **Always set `stream_options.include_usage: true` when streaming.** Without it you cannot bill the request correctly on your side, and per-key spend tracking on ours silently loses information. Every example below sets it explicitly.
</Note>

## Example

<CodeGroup>
  ```bash cURL theme={null}
  curl https://tokens.flex.ai/v1/chat/completions \
    -H "Authorization: Bearer $FLEXAI_API_KEY" \
    -H "Content-Type: application/json" \
    -N \
    -d '{
      "model": "Qwen2.5-32B-Instruct-FP8",
      "messages": [{"role":"user","content":"Count to five."}],
      "stream": true,
      "stream_options": {"include_usage": true}
    }'
  ```

  ```python Python theme={null}
  import os
  from openai import OpenAI
  client = OpenAI(base_url="https://tokens.flex.ai/v1", api_key=os.environ["FLEXAI_API_KEY"])

  stream = client.chat.completions.create(
      model="Qwen2.5-32B-Instruct-FP8",
      messages=[{"role": "user", "content": "Count to five."}],
      stream=True,
      stream_options={"include_usage": True},
  )

  for chunk in stream:
      if chunk.choices and chunk.choices[0].delta.content:
          print(chunk.choices[0].delta.content, end="", flush=True)
      if chunk.usage:
          print(f"\n\nprompt={chunk.usage.prompt_tokens} completion={chunk.usage.completion_tokens}")
  ```

  ```typescript TypeScript theme={null}
  import OpenAI from "openai";
  const client = new OpenAI({
    baseURL: "https://tokens.flex.ai/v1",
    apiKey: process.env.FLEXAI_API_KEY,
  });

  const stream = await client.chat.completions.create({
    model: "Qwen2.5-32B-Instruct-FP8",
    messages: [{ role: "user", content: "Count to five." }],
    stream: true,
    stream_options: { include_usage: true },
  });

  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      process.stdout.write(chunk.choices[0].delta.content);
    }
    if (chunk.usage) {
      console.log(`\n\nprompt=${chunk.usage.prompt_tokens} completion=${chunk.usage.completion_tokens}`);
    }
  }
  ```
</CodeGroup>

## Anatomy of the stream

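The response is `text/event-stream`. Each event is a line beginning with `data: ` followed by a JSON object, then a blank line. The one exception is the last event, whose payload is the literal string `[DONE]` rather than JSON: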

```
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"One"},"index":0,"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":", "},"index":0,"finish_reason":null}]}

...

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}

data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}

data: [DONE]
```

Three things worth knowing:

1. **`delta.content` is the incremental piece.** Concatenate the pieces in arrival order to reconstruct the full response.
2. **The last chunk with a non-empty `choices` array carries `finish_reason`** (`stop`, `length`, `tool_calls`, or `content_filter`). Its `delta` is typically empty.
3. **The last chunk before `[DONE]` carries `usage`** (only when you set `stream_options.include_usage: true`). Its `choices` array is empty. This is the one source of truth for token counts; do not try to count tokens client-side.
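
If you read the raw event stream yourself instead of using an SDK, the three points above can be folded into one loop. The following is a minimal sketch, not a full SSE parser: `consume_sse` is a hypothetical helper name, it assumes one `data:` line per event and a single choice per request, and the sample lines are copied from the trace above.

```python
import json


def consume_sse(lines):
    """Fold raw SSE lines into (text, finish_reason, usage)."""
    text_parts = []
    finish_reason = None
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel: the stream is complete
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                text_parts.append(piece)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
        if chunk.get("usage"):
            usage = chunk["usage"]  # only present with include_usage: true
    return "".join(text_parts), finish_reason, usage


sample = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"One"},"index":0,"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":", "},"index":0,"finish_reason":null}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}',
    "",
    'data: {"id":"chatcmpl-abc","choices":[],"usage":{"prompt_tokens":12,"completion_tokens":8,"total_tokens":20}}',
    "",
    "data: [DONE]",
]

text, finish_reason, usage = consume_sse(sample)
print(repr(text), finish_reason, usage)
```

Note that the loop never indexes `choices[0]` directly, so the usage chunk with its empty `choices` array passes through without a special case.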

## Tool calls while streaming

Tool call arguments arrive as `delta.tool_calls[].function.arguments` fragments that you accumulate the same way as `delta.content`. See the [tool use guide](/inference-api/guides/tool-use) for a full example.
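
As a rough illustration of that accumulation, the sketch below merges fragments into complete calls. It assumes the OpenAI-compatible fragment shape, in which each fragment carries an `index` identifying the call it belongs to and the `id` and `function.name` arrive once on the first fragment; verify this against the tool use guide. `accumulate_tool_calls` and the `get_weather` tool are hypothetical names, and the deltas are plain dicts so the example runs offline.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls, ordered by index."""
    calls = {}
    for delta in deltas:
        for frag in delta.get("tool_calls") or []:
            call = calls.setdefault(
                frag["index"], {"id": None, "name": None, "arguments": ""}
            )
            if frag.get("id"):
                call["id"] = frag["id"]
            fn = frag.get("function") or {}
            if fn.get("name"):
                call["name"] = fn["name"]
            # Argument text is a JSON string split at arbitrary points:
            # concatenate first, parse only once the stream has finished.
            call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


# Hypothetical deltas for a single call to a made-up get_weather tool.
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city":'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": ' "Paris"}'}}]},
]

calls = accumulate_tool_calls(deltas)
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args)
```

Individual fragments are generally not valid JSON on their own, which is why parsing is deferred until the whole argument string has been assembled.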

## Handling disconnects

### No resume protocol

There is no `Last-Event-ID`-style resume. A dropped TCP connection ends the stream — reconnecting issues a brand-new request with a fresh `chatcmpl-…` id, not a continuation of the old one. Tokens emitted before the disconnect are not retransmitted.

If you need durability across flaky networks, persist the partial output your client has already received and re-prompt with the conversation so far when you reconnect; the model will not pick up exactly where it left off, but it will continue the conversation.
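
One possible way to build the follow-up request is sketched below. `build_resume_messages` is a hypothetical helper, and the wording of the continuation instruction is only an illustration; tune it for your model and use case.

```python
def build_resume_messages(history, partial_output):
    """Return a new message list that replays the conversation plus the partial reply."""
    messages = list(history)  # copy so the caller's history is not mutated
    if partial_output:
        # Echo back what the client already received as an assistant turn...
        messages.append({"role": "assistant", "content": partial_output})
        # ...then ask the model to carry on. Illustrative wording only.
        messages.append({
            "role": "user",
            "content": "Your previous reply was cut off. "
                       "Continue from where it stopped without repeating it.",
        })
    return messages


history = [{"role": "user", "content": "Count to five."}]
resumed = build_resume_messages(history, "One, two, ")
for message in resumed:
    print(message["role"], "->", message["content"])
```

The result is sent as the `messages` of a brand-new request, so it gets its own id and its own usage block.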

### Cancellation billing

When the client closes the connection mid-generation, the server propagates the cancellation to the model backend and **bills only the tokens actually emitted up to the disconnect** — not the requested `max_tokens`. The model stops generating when you stop reading.

This means it is safe to disconnect early to cap latency or cost. The usage recorded for the request will reflect the tokens you received, not the budget you requested.
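
A client-side budget can be sketched as follows. Here `stream` stands for any iterator of text pieces with an optional `close()` method, and a stand-in generator replaces the network so the example runs offline. In a real client, closing the underlying HTTP response is what signals the cancellation; how your SDK exposes that is an assumption to verify. `read_with_budget` is a hypothetical helper name.

```python
def read_with_budget(stream, max_chars):
    """Read text pieces until max_chars is reached, then close the stream."""
    parts = []
    received = 0
    try:
        for piece in stream:
            parts.append(piece)
            received += len(piece)
            if received >= max_chars:
                break  # budget reached: stop reading
    finally:
        # Close explicitly so the connection is dropped instead of left idle.
        close = getattr(stream, "close", None)
        if callable(close):
            close()
    return "".join(parts)


def fake_stream():
    """Stand-in for a network stream of delta.content pieces."""
    for piece in ["One", ", ", "two", ", ", "three", ", ", "four"]:
        yield piece


text = read_with_budget(fake_stream(), max_chars=8)
print(text)
```

Because the connection is closed before the usage chunk arrives, an early disconnect leaves the client without a usage block for that request.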

### Retrying

A retry is a fresh, independently billed request. There is no server-side idempotency key, so we cannot deduplicate. Pragmatic guidance:

* **Short prompts:** retry directly; the duplicate cost is bounded.
* **Long generations:** surface the partial output to the user and let them decide whether to continue. Echo what you already have back as context if they choose to retry.

See [Concurrency](/inference-api/guides/concurrency) for the per-key rate limit headers that govern how aggressively you can retry.
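
The retry guidance above could be encoded roughly as follows. This is a sketch under stated assumptions: `stream_with_retry`, `start_stream`, and the `auto_retry_max_chars` threshold are hypothetical, the length of the partial output stands in for "long generation", a dropped connection is assumed to surface as `ConnectionError`, and backoff between attempts is omitted for brevity.

```python
def stream_with_retry(start_stream, max_attempts=3, auto_retry_max_chars=200):
    """Return (text, complete). Each attempt is a fresh, separately billed request."""
    partial = ""
    for _ in range(max_attempts):
        parts = []
        try:
            for piece in start_stream():
                parts.append(piece)
            return "".join(parts), True
        except ConnectionError:
            partial = "".join(parts)
            if len(partial) > auto_retry_max_chars:
                # Long generation: stop retrying and hand the partial output
                # back so the user can decide whether to continue.
                break
    return partial, False


attempts = {"count": 0}


def flaky_stream():
    """Stand-in stream that drops the connection on the first attempt only."""
    attempts["count"] += 1
    yield "One, "
    if attempts["count"] == 1:
        raise ConnectionError("simulated drop")
    yield "two"


text, complete = stream_with_retry(flaky_stream)
print(text, complete, attempts["count"])
```

When `complete` is `False`, the returned text is whatever the last failed attempt delivered, ready to be shown to the user or echoed back as context.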
