Multiple in-flight requests on the same API key run independently and in parallel. There is no head-of-line blocking, no per-key concurrency cap, and no implicit queuing — each request is dispatched to a backend as soon as it arrives. The only bound on a single key is the per-minute request ceiling described below. Inside that ceiling, fan out as wide as your workload needs.

Per-key rate limit

Each key carries a requests-per-minute (RPM) tier. When you exceed it the API returns 429 Too Many Requests with a Retry-After header — see Rate limits for the full headers.
These numbers are generated from backend/api_rate_limiter.py and backend/lago_client.py in the token-service repo. They update automatically when those files change.
Tier                         Requests per minute
Free (default on signup)     10
Elevated (approved users)    60
Paid                         100
Every new signup is granted $10 of free credit. The limits above are per key, per minute, and do not aggregate across keys on the same account today. The tier is stored on the key; request an upgrade from support@flex.ai if your steady-state load needs more.
We may introduce account-level concurrency or aggregate RPM caps as the platform scales. We will notify customers before tightening any existing per-key limit; we will not silently lower it.
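A client can also count its own requests so that it stops before the server answers with 429. The following is a rough sketch under stated assumptions: the RPM values are copied from the tier table, while the `MinuteWindow` class, its method names, and the sliding 60-second window are illustrative choices. This page does not say how the server measures a minute, so treat the window as an approximation.

```python
import collections
import time

# RPM values copied from the tier table; the dict itself is illustrative.
TIER_RPM = {"free": 10, "elevated": 60, "paid": 100}


class MinuteWindow:
    """Client-side sliding-window request counter for a single API key.

    Hypothetical helper: it only approximates the server's accounting.
    """

    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock                 # injectable so tests can fake time
        self.sent = collections.deque()    # timestamps of recent requests

    def wait_time(self):
        """Return seconds to wait before one more request fits in the window."""
        now = self.clock()
        # Forget requests that are at least 60 seconds old.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.rpm:
            return 0.0
        # Window is full: wait until the oldest request ages out.
        return 60.0 - (now - self.sent[0])

    def record(self):
        """Note that a request was just sent."""
        self.sent.append(self.clock())
```

Call `wait_time()` before each request, sleep for the returned number of seconds, then call `record()`. Because limits are per key, each key would need its own `MinuteWindow`.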

Backpressure pattern

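Bound your in-flight count to roughly RPM / 60 to stay within a steady-state window. That figure assumes each request takes about one second; longer requests can sustain proportionally more in flight (requests per second times average latency in seconds). Retry on 429 with exponential backoff that respects Retry-After: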
import asyncio, random
from openai import AsyncOpenAI, RateLimitError

client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
# Bound concurrency to your tier (RPM / 60, rounded up, minimum 1):
# free=10/60≈0.17 so use 1, elevated=60/60=1, paid=100/60≈1.67 so use 2.
sem = asyncio.Semaphore(2)

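MAX_RETRIES = 5  # retries allowed after the first attempt

async def one(prompt: str) -> str:
    for attempt in range(MAX_RETRIES + 1):
        try:
            # Hold a slot only while the request is in flight, never while
            # sleeping; retrying inside the slot can deadlock the semaphore.
            async with sem:
                r = await client.chat.completions.create(
                    model="Qwen/Qwen2.5-32B-Instruct",
                    messages=[{"role": "user", "content": prompt}],
                )
            return r.choices[0].message.content
        except RateLimitError as e:
            if attempt == MAX_RETRIES:
                raise  # out of retries: surface the 429 to the caller
            try:
                retry_after = float(e.response.headers.get("retry-after", 1))
            except ValueError:
                retry_after = 1.0  # header was not a plain number of seconds
            # Exponential backoff, never shorter than the server's Retry-After.
            delay = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
    raise AssertionError("unreachable")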
If you still see sustained 429s after backoff, you are above your tier: ask for an upgrade rather than spinning. Tight retry loops do not move you through the limit faster; they just burn your quota on rejected requests.
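The delay calculation can be kept in small pure functions, which makes it testable without any network calls. This is a minimal sketch, not the platform's own logic: `parse_retry_after` and `backoff_delay` are hypothetical helper names, the 60-second cap is an arbitrary choice, and the parser accepts both forms of Retry-After that HTTP permits (a number of seconds or an HTTP-date) because this page does not state which form the API sends.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return the Retry-After delay in seconds, or None if it is unusable."""
    if value is None:
        return None
    value = value.strip()
    try:
        return max(0.0, float(value))          # "7" or "1.5": seconds
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)    # e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    except (TypeError, ValueError):
        return None                            # neither a number nor a date
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def backoff_delay(attempt, retry_after=None, cap=60.0, rng=random.random):
    """Exponential backoff with jitter, never shorter than Retry-After.

    attempt is 0 for the first retry; cap bounds only the exponential term.
    """
    backoff = min(cap, 2.0 ** attempt)
    floor = retry_after if retry_after is not None else 0.0
    return max(backoff, floor) + rng()         # rng() adds up to 1 s of jitter
```

In a retry loop, the result of `parse_retry_after(e.response.headers.get("retry-after"))` would be passed as `retry_after`, and the returned delay handed to `asyncio.sleep`.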

Ordering

Concurrent requests on one key are independent — completion order is not guaranteed to match submission order. If you need to correlate responses back to inputs, carry your own correlation id in the prompt or response metadata; don’t rely on arrival order. The id we return (chatcmpl-…) is unique per request and safe to use as a join key in your logs.
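One way to join responses back to inputs is to attach a client-side id to each task and key the results by it. The sketch below is illustrative only: `fake_completion` is a stand-in that sleeps instead of calling the API, and the `req-N` ids, `tagged`, and `fan_out` are hypothetical names. It uses `asyncio.as_completed`, which yields results in completion order rather than submission order.

```python
import asyncio


async def fake_completion(prompt: str, delay: float) -> str:
    """Stand-in for a real API call: sleeps, then echoes the prompt upper-cased."""
    await asyncio.sleep(delay)
    return prompt.upper()


async def tagged(correlation_id: str, prompt: str, delay: float):
    """Return the result together with the id it was submitted under."""
    return correlation_id, await fake_completion(prompt, delay)


async def fan_out(jobs):
    """jobs is a list of (prompt, simulated_delay) pairs.

    Returns (inputs, outputs, arrival): inputs and outputs are keyed by
    correlation id; arrival lists the ids in completion order.
    """
    inputs = {f"req-{i}": job for i, job in enumerate(jobs)}
    tasks = [
        asyncio.create_task(tagged(cid, prompt, delay))
        for cid, (prompt, delay) in inputs.items()
    ]
    outputs, arrival = {}, []
    for finished in asyncio.as_completed(tasks):
        cid, text = await finished
        outputs[cid] = text          # join on the id, not on arrival position
        arrival.append(cid)
    return inputs, outputs, arrival


if __name__ == "__main__":
    inputs, outputs, arrival = asyncio.run(fan_out([("slow", 0.05), ("fast", 0.0)]))
    for cid in arrival:
        print(cid, inputs[cid][0], "->", outputs[cid])
```

Keying `outputs` by id means the join stays correct whatever order the responses arrive in. With a real client, the returned `chatcmpl-…` id could be stored alongside the client-side id for log correlation.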
  • Batching — there is no /v1/batches endpoint today; client-side fan-out is the recommended pattern.
  • Billing & quotas — tier limits and rate-limit response headers.
  • Errors — the full 429 body.