# Batching

> The Inference API does not expose a `/v1/batches` endpoint today. Use client-side fan-out instead; this page shows the recommended pattern.

<Note>
  **There is no `/v1/batches` endpoint.** The OpenAI Batch API surface is not supported and is not on the launch roadmap. If you are migrating Batch-API code, replace it with the client-side fan-out below.
</Note>

## Why no batch endpoint

OpenAI's Batch API trades latency for cost: jobs complete within 24 hours at a discount. We don't gate capacity that way. Every model is served from a live pool, and concurrent online requests on a single key already run independently, up to your tier's per-minute limit. See [Concurrency](/inference-api/guides/concurrency).

If you need bulk throughput, fan out concurrent requests. The cost is the same as sending the requests one by one, because we bill per token, not per call.

## Recommended pattern: bounded fan-out

Cap the number of in-flight requests at about your tier's RPM divided by 60 (or whatever your job's per-request latency budget supports), retry on `429`, and key each result by your own correlation id, because requests do not finish in submission order.

<CodeGroup>
  ```python Python theme={null}
  import asyncio
  import random

  from openai import AsyncOpenAI, RateLimitError

  # The SDK reads the API key from OPENAI_API_KEY unless you pass api_key=.
  client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
  # Size the semaphore to match your tier: free≈1, elevated≈1, paid≈5.
  # Higher values just produce more 429s and the same wall-clock time.
  sem = asyncio.Semaphore(5)

  async def run_one(job: dict) -> dict:
      async with sem:
          for attempt in range(6):
              try:
                  r = await client.chat.completions.create(
                      model="Qwen2.5-32B-Instruct-FP8",
                      messages=[{"role": "user", "content": job["prompt"]}],
                  )
                  return {"id": job["id"], "output": r.choices[0].message.content}
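              except RateLimitError as e:
                  # Retry-After may be missing or non-numeric; fall back to 1 s, then add jitter.
                  try:
                      delay = float(e.response.headers.get("retry-after", 1))
                  except ValueError:
                      delay = 1.0
                  await asyncio.sleep(delay + random.random())
          raise RuntimeError(f"job {job['id']} exhausted retries")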

  async def run_all(jobs: list[dict]) -> list[dict]:
      return await asyncio.gather(*(run_one(j) for j in jobs))

  # Each `job` carries a correlation id so you can match results back to inputs.
  results = asyncio.run(run_all([
      {"id": "row-1", "prompt": "Summarise: …"},
      {"id": "row-2", "prompt": "Summarise: …"},
      # …
  ]))
  ```

  ```typescript TypeScript theme={null}
  import OpenAI from "openai";
  const client = new OpenAI({ baseURL: "https://tokens.flex.ai/v1" });

  type Job = { id: string; prompt: string };

  async function runOne(job: Job): Promise<{ id: string; output: string }> {
    for (let attempt = 0; attempt < 6; attempt++) {
      try {
        const r = await client.chat.completions.create({
          model: "Qwen2.5-32B-Instruct-FP8",
          messages: [{ role: "user", content: job.prompt }],
        });
        return { id: job.id, output: r.choices[0].message.content ?? "" };
      } catch (err: any) {
        if (err?.status !== 429) throw err;
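        // Retry-After may be missing or non-numeric; fall back to 1 s, then add jitter.
        const header = Number(err.headers?.["retry-after"]);
        const retry = (Number.isFinite(header) ? header : 1) + Math.random();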
        await new Promise((r) => setTimeout(r, retry * 1000));
      }
    }
    throw new Error(`job ${job.id} exhausted retries`);
  }

  async function runAll(jobs: Job[], concurrency = 5) {
    const queue = [...jobs];
    const out: Array<{ id: string; output: string }> = [];
    await Promise.all(
      Array.from({ length: concurrency }, async () => {
        while (queue.length) {
          const job = queue.shift()!;
          out.push(await runOne(job));
        }
      }),
    );
    return out;
  }
  ```
</CodeGroup>
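
The "RPM divided by 60" rule amounts to assuming each request takes about one second. As a rough sketch of how you might size the semaphore when your requests are slower, the hypothetical helper below multiplies the per-second rate by an average latency that you measure yourself. The RPM values in the example are illustrative, not actual tier limits, and the result is only a starting point: lower it if you still see `429`s.

```python
import math


def pick_concurrency(rpm: int, avg_latency_s: float = 1.0) -> int:
    """Estimate how many requests can be in flight without exceeding `rpm`.

    Hypothetical helper: in-flight requests ~= requests per second * seconds per request.
    """
    if rpm <= 0 or avg_latency_s <= 0:
        raise ValueError("rpm and avg_latency_s must be positive")
    per_second = rpm / 60
    # Round down to stay under the limit, but never go below one worker.
    return max(1, math.floor(per_second * avg_latency_s))


# Illustrative numbers only; check your tier's real RPM.
print(pick_concurrency(300))       # 5
print(pick_concurrency(300, 4.0))  # 20
print(pick_concurrency(30))        # 1
```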

## What you give up vs. a true batch API

* **No 24-hour discount.** Bulk jobs cost the same per token as online traffic.
* **No server-side job state.** If your client crashes mid-run, you re-issue the missing rows yourself. Persist progress (correlation id → result) as you go.
* **Per-key RPM still applies.** The fan-out pattern doesn't bypass the rate limit — it just keeps you under it. See [Concurrency](/inference-api/guides/concurrency).
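
Because there is no server-side job state, one way to make a run resumable is to append each result to a local file as it completes and skip those ids on restart. The sketch below is one possible approach, not a prescribed one: the JSONL file and the helper names (`load_done`, `save_result`, `pending`) are assumptions for illustration. In the fan-out pattern above, you could call `save_result` inside `run_one` just before it returns.

```python
import json
import tempfile
from pathlib import Path


def load_done(path: Path) -> dict[str, str]:
    """Read saved results (one JSON object per line) into {correlation id: output}."""
    done: dict[str, str] = {}
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    row = json.loads(line)
                except json.JSONDecodeError:
                    continue  # unreadable line (e.g. cut off by a crash): that job is re-issued
                done[row["id"]] = row["output"]
    return done


def save_result(path: Path, result: dict) -> None:
    """Append one result as soon as it completes."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")


def pending(jobs: list[dict], done: dict[str, str]) -> list[dict]:
    """Return only the jobs whose correlation id has no saved result yet."""
    return [j for j in jobs if j["id"] not in done]


# Stand-in data, no API calls: row-1 finished before a simulated crash.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "results.jsonl"
    save_result(checkpoint, {"id": "row-1", "output": "done"})
    jobs = [{"id": "row-1", "prompt": "…"}, {"id": "row-2", "prompt": "…"}]
    todo = pending(jobs, load_done(checkpoint))
    print([j["id"] for j in todo])  # ['row-2']
```

On restart, pass only the `pending` jobs to the fan-out and merge the new results with the ones returned by `load_done`.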

If a true batch endpoint becomes available, we will document it on this page.
