
# Concurrency

> How concurrent requests on a single API key behave, what the per-key rate limit is, and how to back off cleanly when you hit it.

Multiple in-flight requests on the same API key run **independently and in parallel**. There is no head-of-line blocking, no per-key concurrency cap, and no implicit queuing — each request is dispatched to a backend as soon as it arrives.

The only bound on a single key is the per-minute request ceiling described below. Inside that ceiling, fan out as wide as your workload needs.

## Per-key rate limit

Each key carries a requests-per-minute (RPM) tier. When you exceed it, the API returns `429 Too Many Requests` with a `Retry-After` header; see [Rate limits](/inference-api/reference/billing#rate-limits) for the full set of response headers.

| Tier                      | Requests per minute |
| ------------------------- | ------------------- |
| Free (default on signup)  | 10                  |
| Elevated (approved users) | 60                  |
| Paid                      | 100                 |

Every new signup is granted **\$10 of free credit**.

These limits apply per key, per minute. They do not aggregate across keys on the same account today. The tier is stored on the key; request an upgrade from [support@flex.ai](mailto:support@flex.ai) if your steady-state load needs more.

<Note>
  We may introduce account-level concurrency or aggregate RPM caps as the platform scales. We will notify customers before tightening any existing per-key limit; we will not silently lower it.
</Note>
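
If you would rather pace requests on the client than discover the ceiling through `429`s, a sliding-window limiter is one option. The sketch below is illustrative only: `MinuteWindowLimiter` is a hypothetical helper, not part of any SDK, and it assumes a rolling 60-second window. How the server actually windows requests is not stated here, so treat a rolling window as the conservative choice.

```python
import asyncio
import time
from collections import deque


class MinuteWindowLimiter:
    """Client-side sliding window: at most `rpm` request starts per window."""

    def __init__(self, rpm: int, window_s: float = 60.0, clock=time.monotonic):
        self.rpm = rpm
        self.window_s = window_s
        self._clock = clock          # injectable so the limiter can be tested
        self._starts = deque()       # timestamps of recent request starts
        self._lock = asyncio.Lock()  # serializes waiters

    def _wait_needed(self) -> float:
        """Seconds until another request may start (0.0 if one may start now)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._starts and now - self._starts[0] >= self.window_s:
            self._starts.popleft()
        if len(self._starts) < self.rpm:
            return 0.0
        return self.window_s - (now - self._starts[0])

    async def acquire(self) -> None:
        """Wait until a request may start, then record its start time."""
        async with self._lock:
            while (wait := self._wait_needed()) > 0:
                await asyncio.sleep(wait)
            self._starts.append(self._clock())
```

Create one limiter per key (for example `MinuteWindowLimiter(rpm=10)` for the free tier) and `await limiter.acquire()` immediately before each request. Keep the `429` handling in place as well, since client and server clocks will not line up exactly.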

## Backpressure pattern

To hold a steady request rate, bound your in-flight count to roughly `RPM / 60` (the per-second rate, which matches the in-flight count when each request takes about a second), and retry on `429` with exponential backoff that respects `Retry-After`:

<CodeGroup>
  ```python Python theme={null}
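  import asyncio, random
  from openai import AsyncOpenAI, RateLimitError

  client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
  # Bound concurrency to roughly RPM / 60 for your tier, rounded up to at least 1:
  # free=10/60≈0.2, elevated=60/60=1, paid=100/60≈1.7.
  sem = asyncio.Semaphore(2)

  async def one(prompt: str, *, attempt: int = 0) -> str:
      try:
          # Hold a semaphore slot only while the request is in flight.
          async with sem:
              r = await client.chat.completions.create(
                  model="Qwen2.5-32B-Instruct-FP8",
                  messages=[{"role": "user", "content": prompt}],
              )
          return r.choices[0].message.content
      except RateLimitError as e:
          if attempt >= 5:
              raise
          # Retry-After is read as seconds; fall back to 1s if it is missing or not numeric.
          try:
              retry_after = float(e.response.headers.get("retry-after", 1))
          except ValueError:
              retry_after = 1.0
          # Exponential backoff with jitter, never shorter than Retry-After.
          delay = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
          await asyncio.sleep(delay)
          # The slot was released above, so the retry cannot deadlock on the semaphore.
          return await one(prompt, attempt=attempt + 1)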
  ```

  ```typescript TypeScript theme={null}
  import OpenAI from "openai";
  const client = new OpenAI({ baseURL: "https://tokens.flex.ai/v1" });

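  async function withBackoff<T>(fn: () => Promise<T>, max = 5): Promise<T> {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn();
      } catch (err: any) {
        if (err?.status !== 429 || attempt >= max) throw err;
        // SDK versions differ: headers may be a Fetch `Headers` object or a plain record.
        const h = err.headers;
        const raw = typeof h?.get === "function" ? h.get("retry-after") : h?.["retry-after"];
        // Fall back to 1s when the header is missing or is not a number of seconds.
        const parsed = Number(raw ?? 1);
        const retryAfter = Number.isFinite(parsed) ? parsed : 1;
        // Exponential backoff with jitter, never shorter than Retry-After.
        const delaySec = Math.max(retryAfter, 2 ** attempt) + Math.random();
        await new Promise((r) => setTimeout(r, delaySec * 1000));
      }
    }
  }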
  ```
</CodeGroup>
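
The `RPM / 60` rule of thumb is the one-second case of Little's law (in-flight count = request rate × average latency). The sketch below generalizes it. The function name and the `avg_latency_s` parameter are assumptions made for illustration; measure your own latency rather than trusting the default.

```python
import math


def in_flight_bound(rpm: int, avg_latency_s: float = 1.0) -> int:
    """Approximate concurrency that sustains `rpm` at the given average latency.

    Little's law: in-flight = (rpm / 60 requests per second) * latency in seconds.
    Rounded up, and never below 1 so that work can always make progress.
    """
    return max(1, math.ceil(rpm * avg_latency_s / 60))
```

For example, a 60 RPM key with two-second requests gives `in_flight_bound(60, 2.0) == 2`. Rounding up can overshoot the tier slightly, which the `429` backoff absorbs; swap in `math.floor` if you prefer to stay strictly under.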

If you still see sustained 429s after backoff, your load is above your tier; ask for an upgrade rather than spinning. Tight retry loops do not get you through the limit any faster; they just burn your quota on rejected requests.
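
The examples above read `Retry-After` as a number of seconds. HTTP also permits an HTTP-date in that header, and this page does not say which form the API sends, so a defensive parser may be worthwhile. The helper below is a hypothetical sketch using only the Python standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str], default: float = 1.0,
                        now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a non-negative delay in seconds."""
    if not value:
        return default
    value = value.strip()
    try:
        # delay-seconds form, e.g. "7"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Pass the raw header string (or `None` when it is absent) and feed the result into your backoff calculation in place of the direct numeric conversion.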

## Ordering

Concurrent requests on one key are independent — completion order is not guaranteed to match submission order. If you need to correlate responses back to inputs, carry your own correlation id in the prompt or response metadata; don't rely on arrival order. The `id` we return (`chatcmpl-…`) is unique per request and safe to use as a join key in your logs.
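
One way to avoid depending on arrival order is to tag each request with its input index on the client and sort afterwards. This is a minimal sketch: `fake_request` is a stand-in for a real API call and `run_all` is a hypothetical helper name.

```python
import asyncio
import random


async def fake_request(prompt: str) -> str:
    """Stand-in for an API call; finishes after a random delay."""
    await asyncio.sleep(random.uniform(0, 0.05))
    return prompt.upper()


async def run_all(prompts: list[str]) -> list[tuple[int, str, str]]:
    async def tagged(i: int, prompt: str) -> tuple[int, str, str]:
        # Carry the input index alongside the result.
        return i, prompt, await fake_request(prompt)

    results = []
    # as_completed yields in completion order, which may differ from submission order.
    for fut in asyncio.as_completed([tagged(i, p) for i, p in enumerate(prompts)]):
        results.append(await fut)
    # Restore submission order using the carried index.
    return sorted(results, key=lambda r: r[0])
```

If you do not need to process results as they arrive, `asyncio.gather` is simpler: it returns results in the order the awaitables were passed, regardless of which finished first.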

## Related

* [Batching](/inference-api/guides/batch) — there is no `/v1/batches` endpoint today; client-side fan-out is the recommended pattern.
* [Billing & quotas](/inference-api/reference/billing) — tier limits and rate-limit response headers.
* [Errors](/inference-api/reference/errors) — the full 429 body.
