There is no /v1/batches endpoint. The OpenAI Batch API surface is not supported and is not on the launch roadmap. If you are migrating Batch-API code, replace it with the client-side fan-out below.

Why no batch endpoint

The Batch API trades latency for cost: jobs complete within 24 hours at a discount. We don’t gate capacity that way. Every model is served from a live pool, and concurrent online requests on a single key already run independently up to your tier’s per-minute limit. See Concurrency.

If you need bulk throughput, fan out concurrent requests. Cost is the same as one-by-one because we bill per token, not per call. Bound concurrency to your tier’s RPM divided by 60 (or whatever your job’s per-request latency budget supports), retry on 429, and write results back keyed by your own correlation id, because completion order is not preserved.
import asyncio, random
from openai import AsyncOpenAI, RateLimitError

client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
# Pick the semaphore size to match your tier: free≈1, elevated≈1, paid≈5.
# Higher values just produce more 429s and the same wall-clock time.
sem = asyncio.Semaphore(5)

async def run_one(job: dict) -> dict:
    async with sem:
        for attempt in range(6):
            try:
                r = await client.chat.completions.create(
                    model="Qwen/Qwen2.5-32B-Instruct",
                    messages=[{"role": "user", "content": job["prompt"]}],
                )
                return {"id": job["id"], "output": r.choices[0].message.content}
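            except RateLimitError as e:
                # Retry-After may be absent or not a plain number; fall back to 1 s.
                try:
                    delay = float(e.response.headers.get("retry-after", 1))
                except ValueError:
                    delay = 1.0
                # Add jitter so retries from parallel jobs do not line up.
                await asyncio.sleep(delay + random.random())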
        raise RuntimeError(f"job {job['id']} exhausted retries")

async def run_all(jobs: list[dict]) -> list[dict]:
    return await asyncio.gather(*(run_one(j) for j in jobs))

# Each `job` carries a correlation id so you can match results back to inputs.
results = asyncio.run(run_all([
    {"id": "row-1", "prompt": "Summarise: …"},
    {"id": "row-2", "prompt": "Summarise: …"},
    # …
]))
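
One thing to watch in the pattern above: a plain `asyncio.gather` raises as soon as any job exhausts its retries, and the results of the other jobs are lost to the caller. A minimal sketch of one way around that, assuming you are happy to collect failures per correlation id, is below. `gather_partial` and `flaky_worker` are hypothetical names used only for illustration; in real use you would pass `run_one` as the worker.

```python
import asyncio

async def gather_partial(jobs: list[dict], worker) -> tuple[dict, dict]:
    """Run every job; return (successes, failures), both keyed by job id."""
    outcomes = await asyncio.gather(
        *(worker(j) for j in jobs),
        return_exceptions=True,  # exceptions come back as values, not raised
    )
    ok: dict = {}
    failed: dict = {}
    # gather returns outcomes in input order, so zip pairs each job with its outcome.
    for job, outcome in zip(jobs, outcomes):
        if isinstance(outcome, BaseException):
            failed[job["id"]] = outcome
        else:
            ok[job["id"]] = outcome["output"]
    return ok, failed

# Hypothetical stand-in for run_one(): no network, fails for one row.
async def flaky_worker(job: dict) -> dict:
    await asyncio.sleep(0)
    if job["id"] == "row-2":
        raise RuntimeError(f"job {job['id']} exhausted retries")
    return {"id": job["id"], "output": job["prompt"].upper()}

partial_jobs = [
    {"id": "row-1", "prompt": "one"},
    {"id": "row-2", "prompt": "two"},
]
ok, failed = asyncio.run(gather_partial(partial_jobs, flaky_worker))
```

The ids left in `failed` are the rows to re-issue on the next pass.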

What you give up vs. a true batch API

  • No 24-hour discount. Bulk jobs cost the same per token as online traffic.
  • No server-side job state. If your client crashes mid-run, you re-issue the missing rows yourself. Persist progress (correlation id → result) as you go.
  • Per-key RPM still applies. The fan-out pattern doesn’t bypass the rate limit — it just keeps you under it. See Concurrency.
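
The "persist progress" advice above could be sketched as follows, assuming a local JSON Lines file is an acceptable store. The file layout, `load_done`, `run_with_checkpoint`, and `fake_worker` are all illustrative assumptions, not part of the API; substitute `run_one` for the worker and your own storage as needed.

```python
import asyncio, json, os, tempfile

def load_done(path: str) -> dict:
    """Read saved results (one JSON object per line), keyed by correlation id."""
    done: dict = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    row = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or truncated lines left by a crash
                done[row["id"]] = row["output"]
    return done

async def run_with_checkpoint(jobs: list[dict], worker, path: str) -> dict:
    """Run only the jobs not yet on disk, appending each result as it completes."""
    done = load_done(path)
    pending = [j for j in jobs if j["id"] not in done]
    with open(path, "a", encoding="utf-8") as f:
        # as_completed yields in completion order, so rows are saved promptly.
        for fut in asyncio.as_completed([worker(j) for j in pending]):
            row = await fut
            f.write(json.dumps(row) + "\n")
            f.flush()
            done[row["id"]] = row["output"]
    return done

# Hypothetical stand-in for run_one(): counts calls instead of hitting the network.
calls = 0
async def fake_worker(job: dict) -> dict:
    global calls
    calls += 1
    await asyncio.sleep(0)
    return {"id": job["id"], "output": job["prompt"][::-1]}

ckpt_jobs = [{"id": "row-1", "prompt": "abc"}, {"id": "row-2", "prompt": "xyz"}]
ckpt_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
first = asyncio.run(run_with_checkpoint(ckpt_jobs, fake_worker, ckpt_path))
second = asyncio.run(run_with_checkpoint(ckpt_jobs, fake_worker, ckpt_path))
```

The second call finds both ids already on disk and issues no requests, which is the behavior you want after a crash and restart. In this sketch a worker exception stops the run; rows already written stay in the file.
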
If a true batch endpoint becomes available we will document it on this page.