# Concurrency

> How concurrent requests on a single API key behave, what the per-key rate limit is, and how to back off cleanly when you hit it.

Multiple in-flight requests on the same API key run **independently and in parallel**. There is no head-of-line blocking, no per-key concurrency cap, and no implicit queuing — each request is dispatched to a backend as soon as it arrives.

The only bound on a single key is the per-minute request ceiling described below. Inside that ceiling, fan out as wide as your workload needs.

## Per-key rate limit

Each key carries a requests-per-minute (RPM) tier. When you exceed it, the API returns `429 Too Many Requests` with a `Retry-After` header; see [Rate limits](/inference-api/reference/billing#rate-limits) for the full set of rate-limit headers.

| Tier                     | Requests per minute                                        |
| ------------------------ | ---------------------------------------------------------- |
| Free (default on signup) | 10                                                         |
| Paid                     | 100                                                        |
| Custom                   | [Contact FlexAI](mailto:support@flex.ai) for higher limits |

These limits apply per key, per minute. They do not currently aggregate across keys on the same account. The tier is stored on the key, so request an upgrade from [support@flex.ai](mailto:support@flex.ai) if your steady-state load needs more.

<Note>
  We may introduce account-level concurrency or aggregate RPM caps as the platform scales. We will notify customers before tightening any existing per-key limit; we will not silently lower it.
</Note>
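
When a `429` arrives, the `Retry-After` value has to be turned into a delay before you can sleep on it. The sketch below is one way to do that defensively. It assumes the header may carry either a number of seconds or an HTTP-date (both forms are allowed by the HTTP specification; which one this API sends is not stated here), and `parse_retry_after` is a hypothetical helper name, not part of any SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=1.0, now=None):
    """Return a non-negative delay in seconds from a Retry-After header value.

    Accepts delta-seconds ("7") or an HTTP-date. Falls back to `default`
    when the header is missing or cannot be parsed.
    """
    if value is None:
        return default
    value = value.strip()
    try:
        # Most common form: a plain number of seconds.
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Assumption: treat a date without an offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Clamping at zero means a date that is already in the past results in an immediate retry rather than a negative sleep.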

## Backpressure pattern

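To stay within your tier in steady state, bound your in-flight count to roughly `RPM / 60`. This is a rule of thumb that assumes each request takes about one second, so scale it by your typical latency in seconds. Retry on `429` with exponential backoff that respects `Retry-After`: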

<CodeGroup>
  ```python Python theme={null}
  import asyncio, random
  from openai import AsyncOpenAI, RateLimitError

  client = AsyncOpenAI(base_url="https://tokens.flex.ai/v1")
  # Bound concurrency to roughly RPM / 60 for your tier: free = 10/60 (use 1), paid = 100/60 (use 2).
  sem = asyncio.Semaphore(2)

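  async def one(prompt: str, *, max_retries: int = 5) -> str:
      attempt = 0
      while True:
          try:
              # Hold the semaphore only while the request is in flight, so a
              # sleeping retry never blocks other requests (or itself).
              async with sem:
                  r = await client.chat.completions.create(
                      model="Llama-3.3-70B-Instruct-FP8",
                      messages=[{"role": "user", "content": prompt}],
                  )
              return r.choices[0].message.content
          except RateLimitError as e:
              if attempt >= max_retries:
                  raise
              try:
                  retry_after = float(e.response.headers.get("retry-after", 1))
              except ValueError:
                  retry_after = 1.0  # header was not a plain number of seconds
              # Exponential backoff, never shorter than Retry-After, plus jitter.
              delay = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
              await asyncio.sleep(delay)
              attempt += 1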
  ```

  ```typescript TypeScript theme={null}
  import OpenAI from "openai";
  const client = new OpenAI({ baseURL: "https://tokens.flex.ai/v1" });

  async function withBackoff<T>(fn: () => Promise<T>, max = 5): Promise<T> {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn();
      } catch (err: any) {
        if (err?.status !== 429 || attempt >= max) throw err;
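        // Headers may be a plain object or a Headers instance depending on SDK version.
        const raw = err.headers?.get?.("retry-after") ?? err.headers?.["retry-after"];
        const parsed = Number(raw);
        const retryAfter = Number.isFinite(parsed) && parsed > 0 ? parsed : 1;
        // Exponential backoff, never shorter than Retry-After, plus jitter.
        const delay = Math.max(retryAfter, 2 ** attempt) + Math.random();
        await new Promise((r) => setTimeout(r, delay * 1000));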
      }
    }
  }
  ```
</CodeGroup>

If you're sustaining 429s after backoff, you're above your tier's steady-state capacity — request an upgrade rather than retrying harder. Tight retry loops don't move you through the limit faster; they just burn your quota on rejected requests.
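
A semaphore caps how many requests are in flight, not how many start per minute, so fast responses can still push a small semaphore past the RPM ceiling. One way to avoid that is to pace request starts on the client. The sketch below spaces starts at least `60 / rpm` seconds apart. `Pacer` is a hypothetical helper, not part of any SDK, and it assumes all requests for the key go through one event loop in one process. How the server counts its one-minute window is not specified here, so treat this as a conservative approximation.

```python
import asyncio
import time


class Pacer:
    """Space out request starts so that at most `rpm` begin per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = 60.0 / rpm  # minimum gap between request starts
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()   # earliest time the next request may start

    async def wait(self):
        # There is no await between reading and updating the slot, so this
        # reservation is atomic with respect to other coroutines on the loop.
        now = self._clock()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self.interval
        if slot > now:
            await self._sleep(slot - now)
```

To use it, create one `Pacer(rpm=...)` per key with that key's tier and call `await pacer.wait()` immediately before each request. It complements the retry logic rather than replacing it: pacing lowers the chance of a `429`, and backoff handles the ones that still occur.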

## Ordering

Concurrent requests on one key are independent — completion order is not guaranteed to match submission order. If you need to correlate responses back to inputs, carry your own correlation id in the prompt or response metadata; don't rely on arrival order. The `id` we return (`chatcmpl-…`) is unique per request and safe to use as a join key in your logs.
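
A simple way to correlate responses without relying on arrival order is to keep your own key next to each pending request on the client. The sketch below does this with `asyncio`. `tagged`, `run_all`, and `fake_call` are hypothetical names, and `fake_call` is a stand-in for a real API call so the example runs without network access.

```python
import asyncio


async def tagged(key, coro):
    """Await `coro` and return its result paired with the caller's key."""
    return key, await coro


async def run_all(inputs, call):
    """Run `call(value)` for every (key, value) pair and return {key: result}.

    Results are collected in completion order, but each one carries its key,
    so the mapping does not depend on which request finishes first.
    """
    tasks = [asyncio.create_task(tagged(k, call(v))) for k, v in inputs.items()]
    results = {}
    for finished in asyncio.as_completed(tasks):
        key, value = await finished
        results[key] = value
    return results


async def fake_call(delay):
    # Stand-in for a real request: a shorter delay finishes earlier.
    await asyncio.sleep(delay)
    return f"slept {delay}"


if __name__ == "__main__":
    # "b" completes before "a", yet each result lands under its own key.
    print(asyncio.run(run_all({"a": 0.03, "b": 0.01}, fake_call)))
```

With a real client, `call` would wrap the request function, and you could log the returned `id` alongside your key to join client and server records.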

## Related

* [Batching](/inference-api/guides/batch) — batch processing is coming soon.
* [Billing & quotas](/inference-api/reference/billing) — tier limits and rate-limit response headers.
* [Errors](/inference-api/reference/errors) — the full 429 body.