stream: true on any /v1/chat/completions request.
Always set
stream_options.include_usage: true when streaming. Without it you cannot bill the request correctly on your side, and per-key spend tracking on ours silently loses information. Every server-sent event framework we recommend below does this by default.Example
Anatomy of the stream
The response istext/event-stream. Each event is a line beginning with data: followed by a JSON object, then a blank line:
delta.contentis the incremental piece — concatenate them to reconstruct the full response.- The penultimate chunk carries
finish_reason(stop,length,tool_calls, orcontent_filter). Itsdeltais typically empty. - The final-before-
[DONE]chunk carriesusage(only when you setstream_options.include_usage: true). This is the one source of truth for token counts; do not try to count tokens client-side.
Tool calls while streaming
Tool call arguments arrive asdelta.tool_calls[].function.arguments fragments that you accumulate the same way as delta.content. See the tool use guide for a full example.
Handling disconnects
No resume protocol
There is noLast-Event-ID-style resume. A dropped TCP connection ends the stream — reconnecting issues a brand-new request with a fresh chatcmpl-… id, not a continuation of the old one. Tokens emitted before the disconnect are not retransmitted.
If you need durability across flaky networks, persist the partial output your client has already received and re-prompt with the conversation so far when you reconnect; the model will not pick up exactly where it left off, but it will continue the conversation.
Cancellation billing
When the client closes the connection mid-generation, the server propagates the cancellation to the model backend and bills only the tokens actually emitted up to the disconnect — not the requestedmax_tokens. The model stops generating when you stop reading.
This means it is safe to disconnect early to cap latency or cost. The matching warehouse row will reflect the tokens you received, not the budget you requested.
Retrying
A retry is a fresh, independently-billed request — there is no server-side idempotency key, so we cannot deduplicate. Pragmatic guidance:- Short prompts: retry directly; the duplicate cost is bounded.
- Long generations: surface the partial output to the user and let them decide whether to continue. Echo what you already have back as context if they choose to retry.