Concurrency - FlexAI Docs

Multiple in-flight requests on the same API key run independently and in parallel. There is no head-of-line blocking, no per-key concurrency cap, and no implicit queuing — each request is dispatched to a backend as soon as it arrives. The only bound on a single key is the per-minute request ceiling described below. Inside that ceiling, fan out as wide as your workload needs.

Per-key rate limit

Each key carries a requests-per-minute (RPM) tier. When you exceed it, the API returns 429 Too Many Requests with a Retry-After header; see Rate limits for the full headers.

Tier	Requests per minute
Free (default on signup)	10
Paid	100
Custom	Contact FlexAI for higher limits

These are per-key, per-minute. They do not aggregate across keys on the same account today. The tier is stored on the key — request an upgrade from support@flex.ai if your steady-state load needs more.

We may introduce account-level concurrency or aggregate RPM caps as the platform scales. We will notify customers before tightening any existing per-key limit; we will not silently lower it.
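A 429 carries a Retry-After header, and HTTP allows that header to be either a number of seconds or an HTTP-date. This page does not say which form the API sends, so the sketch below accepts both. `retry_after_seconds` is a hypothetical helper name, and the one-second default is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], default: float = 1.0) -> float:
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if value is None:
        return default
    try:
        # Common case: delta-seconds such as "2".
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # Fallback: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the assumed default.
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
```

Pass it the raw header string from the 429 response and sleep for at least the returned number of seconds before retrying.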

Backpressure pattern

Bound your in-flight count to roughly RPM / 60 if you want to stay in a steady-state window, and retry on 429 with exponential backoff that respects Retry-After:

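import asyncio, random
from openai import AsyncOpenAI, RateLimitError

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
# Bound concurrency to roughly RPM / 60 for your tier: free = 10/60 (use 1), paid = 100/60 (use 2).
sem = asyncio.Semaphore(2)

async def one(prompt: str, *, max_retries: int = 5) -> str:
    attempt = 0
    while True:
        try:
            # Hold a slot only while the request is actually in flight.
            async with sem:
                r = await client.chat.completions.create(
                    model="Llama-3.3-70B-Instruct-FP8",
                    messages=[{"role": "user", "content": prompt}],
                )
            return r.choices[0].message.content
        except RateLimitError as e:
            if attempt >= max_retries:
                raise
            try:
                retry_after = float(e.response.headers.get("retry-after", 1))
            except ValueError:
                retry_after = 1.0  # header was not a plain number of seconds
            # Exponential backoff that never undercuts Retry-After, plus jitter.
            # The sleep happens outside the semaphore, so a waiting retry does not block other requests.
            await asyncio.sleep(max(retry_after, 2 ** attempt) + random.uniform(0, 1))
            attempt += 1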

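import OpenAI from "openai";
// The client reads the API key from the OPENAI_API_KEY environment variable.
const client = new OpenAI({ baseURL: "https://tokens.flex.ai/v1" });

// Retry fn on 429 with exponential backoff that never undercuts Retry-After.
async function withBackoff<T>(fn: () => Promise<T>, max = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= max) throw err;
      // err.headers may be a Headers object or a plain record, depending on the SDK version.
      const raw = err.headers?.get?.("retry-after") ?? err.headers?.["retry-after"];
      const parsed = Number(raw ?? 1);
      const retryAfter = Number.isFinite(parsed) ? parsed : 1; // non-numeric header: wait 1 s
      const delay = Math.max(retryAfter, 2 ** attempt) + Math.random();
      await new Promise((r) => setTimeout(r, delay * 1000));
    }
  }
}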

If you’re sustaining 429s after backoff, you’re above your tier’s steady-state capacity — request an upgrade rather than retrying harder. Tight retry loops don’t move you through the limit faster; they just burn your quota on rejected requests.
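One way to avoid sustained 429s in the first place is to pace sends on the client instead of recovering after rejection. The sketch below spaces requests evenly, 60 / RPM seconds apart. `Pacer` is a hypothetical helper, not part of any SDK, and it assumes evenly spaced sends fit the server's rate-limit window, which this page does not specify.

```python
import asyncio
import time


class Pacer:
    """Release callers no faster than `rpm` requests per minute, evenly spaced."""

    def __init__(self, rpm: int) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self._interval = 60.0 / rpm  # seconds between consecutive sends
        self._next_slot = 0.0        # monotonic time of the next free slot

    async def wait(self) -> None:
        # No await occurs between reading and updating _next_slot, so
        # concurrent callers on one event loop each reserve a distinct slot.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)
```

Call `await pacer.wait()` immediately before each request, for example `Pacer(rpm=100)` on the paid tier. Pacing complements the retry loop rather than replacing it, since a 429 can still occur if another process shares the key.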

Ordering

Concurrent requests on one key are independent — completion order is not guaranteed to match submission order. If you need to correlate responses back to inputs, carry your own correlation id in the prompt or response metadata; don’t rely on arrival order. The id we return (chatcmpl-…) is unique per request and safe to use as a join key in your logs.
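A simple way to correlate responses without relying on arrival order is to tag each task with its input index on the client side. The sketch below does that with `asyncio.as_completed`. `run_tagged` and its `call` parameter are hypothetical names; `call` stands in for any coroutine that sends one prompt and returns the reply text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def run_tagged(
    prompts: List[str],
    call: Callable[[str], Awaitable[str]],
) -> Dict[int, str]:
    """Run `call` on every prompt concurrently; key each result by input index."""

    async def tagged(index: int, prompt: str) -> Tuple[int, str]:
        # The index travels with the task, so completion order does not matter.
        return index, await call(prompt)

    results: Dict[int, str] = {}
    tasks = [asyncio.create_task(tagged(i, p)) for i, p in enumerate(prompts)]
    for finished in asyncio.as_completed(tasks):
        index, text = await finished  # yielded in completion order
        results[index] = text
    return results
```

The returned dictionary maps each input position to its reply, so `results[i]` always belongs to `prompts[i]` regardless of which request finished first.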

Related

Batching: batch processing is coming soon.
Billing & quotas: tier limits and rate-limit response headers.
Errors: the full 429 body.
