Multiple in-flight requests on the same API key run independently and in parallel. There is no head-of-line blocking, no per-key concurrency cap, and no implicit queuing — each request is dispatched to a backend as soon as it arrives.
The only bound on a single key is the per-minute request ceiling described below. Inside that ceiling, fan out as wide as your workload needs.
Per-key rate limit
Each key carries a requests-per-minute (RPM) tier. When you exceed it, the API returns 429 Too Many Requests with a Retry-After header; see Rate limits for the full headers.
| Tier | Requests per minute |
|---|---|
| Free (default on signup) | 10 |
| Elevated (approved users) | 60 |
| Paid | 100 |
Every new signup is granted $10 of free credit.
These limits apply per key, per minute. They do not aggregate across keys on the same account today. The tier is stored on the key; request an upgrade from support@flex.ai if your steady-state load needs more.
We may introduce account-level concurrency or aggregate RPM caps as the platform scales. We will notify customers before tightening any existing per-key limit; we will not silently lower it.
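As a rough worked example, a per-minute ceiling can be read as an average gap between request starts. The sketch below uses the tier values from the table above; the helper name `min_spacing_seconds` is hypothetical, and how the service actually windows each minute is not specified here, so treat the spacing as a conservative pacing target rather than a guarantee.

```python
# Tier values copied from the table above.
TIER_RPM = {"free": 10, "elevated": 60, "paid": 100}


def min_spacing_seconds(rpm: int) -> float:
    """Average gap between request starts that keeps one key at or under `rpm`."""
    return 60.0 / rpm


for tier, rpm in TIER_RPM.items():
    print(f"{tier}: {rpm} RPM -> one request every {min_spacing_seconds(rpm):.1f} s")
```

On the free tier this works out to one request every 6 seconds, so a client that starts requests faster than that for a full minute should expect 429s.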
Backpressure pattern
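Bound your in-flight count to roughly RPM / 60, your tier's allowed requests per second, to stay under the ceiling in steady state. This is a conservative bound whenever a request takes a second or longer. Retry on 429 with exponential backoff that respects Retry-After: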
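import asyncio
import random

from openai import AsyncOpenAI, RateLimitError

# The client reads the API key from OPENAI_API_KEY unless api_key= is passed.
client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")

# Bound concurrency to roughly RPM / 60 for your tier, using the table above:
# free = 10/60 ≈ 0.2, elevated = 60/60 = 1, paid = 100/60 ≈ 1.7. Never go below 1.
RPM = 100  # paid tier; set this to your key's tier
sem = asyncio.Semaphore(max(1, round(RPM / 60)))
MAX_RETRIES = 5


async def one(prompt: str) -> str:
    for attempt in range(MAX_RETRIES + 1):
        try:
            # Hold a slot only while the request is in flight, not while backing off.
            async with sem:
                r = await client.chat.completions.create(
                    model="Qwen2.5-32B-Instruct-FP8",
                    messages=[{"role": "user", "content": prompt}],
                )
            return r.choices[0].message.content
        except RateLimitError as e:
            if attempt == MAX_RETRIES:
                raise  # out of retries; surface the 429 to the caller
            # Retry-After is assumed here to be a number of seconds.
            retry_after = float(e.response.headers.get("retry-after", 1))
            # Exponential backoff with jitter, never shorter than Retry-After.
            delay = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
    raise AssertionError("unreachable: the loop either returns or raises")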
If you keep getting 429s after backing off, your load exceeds your tier; ask for an upgrade rather than spinning. Tight retry loops do not move you through the limit faster; they just burn your quota on rejected requests.
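HTTP allows a Retry-After value to be either a number of seconds or an HTTP-date. Whether this API ever sends the date form is not stated here, so the sketch below is a defensive assumption rather than a description of the service. Both helper names (`parse_retry_after`, `backoff_delay`) and the 60-second cap are hypothetical choices for illustration.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], default: float = 1.0) -> float:
    """Return the wait in seconds from a Retry-After value (seconds or HTTP-date)."""
    if not value:
        return default
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "3"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def backoff_delay(attempt: int, retry_after: float, cap: float = 60.0) -> float:
    """Exponential backoff with jitter that never undercuts the server's hint."""
    exponential = min(cap, 2.0 ** attempt)  # 1, 2, 4, 8, ... capped at `cap`
    return max(retry_after, exponential) + random.uniform(0, 1)
```

Taking the larger of the server's hint and the exponential term means repeated 429s slow the client down even if the header keeps reporting a short wait.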
Ordering
Concurrent requests on one key are independent — completion order is not guaranteed to match submission order. If you need to correlate responses back to inputs, carry your own correlation id in the prompt or response metadata; don’t rely on arrival order. The id we return (chatcmpl-…) is unique per request and safe to use as a join key in your logs.
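One way to correlate results without relying on arrival order is to carry each input's index alongside its result on the client side. The sketch below is a minimal illustration: `fake_completion` is a hypothetical stand-in for a real API call, with random latency so that completions arrive out of order.

```python
import asyncio
import random


async def fake_completion(prompt: str) -> str:
    # Stand-in for a real API call; random latency shuffles completion order.
    await asyncio.sleep(random.uniform(0, 0.05))
    return prompt.upper()


async def tagged(index: int, prompt: str) -> tuple[int, str]:
    # Carry the input's position with the result so arrival order is irrelevant.
    return index, await fake_completion(prompt)


async def run_all(prompts: list[str]) -> list[str]:
    results: list[str] = [""] * len(prompts)
    tasks = [asyncio.create_task(tagged(i, p)) for i, p in enumerate(prompts)]
    for finished in asyncio.as_completed(tasks):
        index, text = await finished  # tasks finish in arbitrary order
        results[index] = text         # but land in their input slot
    return results


print(asyncio.run(run_all(["a", "b", "c"])))  # → ['A', 'B', 'C']
```

If you do not need to act on results as they arrive, `asyncio.gather` is simpler, since it returns results in the order the awaitables were passed in.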
- Batching: there is no /v1/batches endpoint today; client-side fan-out is the recommended pattern.
- Billing & quotas: tier limits and rate-limit response headers.
- Errors: the full 429 body.