The Inference API is a shared, multi-tenant platform. The capacity behind it — GPUs running real models — is finite, so a single key that consumes a disproportionate share degrades latency and availability for everyone else. This page describes what we consider fair use, the patterns we monitor for, and what we do when a key crosses a line. The goal is transparency: you should know the triggers before you hit them.

Fair use

Normal application traffic is always welcome: interactive chat, batch jobs, evaluation runs, production workloads. You do not need to ask permission to ramp up legitimate usage. We ask only that your traffic reflect real work rather than patterns that exist only to consume capacity.
Each API key carries explicit rate limits (requests per minute and tokens per minute) tied to your tier. Those limits are the hard ceiling and are enforced inline: requests over the limit receive 429 Too Many Requests. The policy described below applies in addition to those limits. It covers sustained patterns that may stay under the per-minute limits yet still indicate abuse or a runaway client.
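A well-behaved client backs off when it receives 429 rather than retrying immediately. The following is a minimal sketch of one common approach (exponential backoff with full jitter), not a description of an official client. The `send` callable, the `(status, headers, body)` response shape, and the presence of a `Retry-After` header are all assumptions made for illustration; this page does not state which headers accompany a 429.

```python
import random
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def backoff_delay(attempt: int, headers: Mapping[str, str],
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Assumption: the server may send Retry-After in seconds; it may be absent.
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # not a plain number of seconds; fall back to backoff
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(send: Callable[[], Response], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send`, retrying only on 429, for at most `max_retries` retries."""
    attempt = 0
    while True:
        response = send()
        if response[0] != 429 or attempt >= max_retries:
            return response  # success, a non-429 error, or retries exhausted
        sleep(backoff_delay(attempt, response[1]))
        attempt += 1
```

Capping the number of retries matters as much as the delay: a client that retries forever can itself become the sustained pattern this policy describes.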

What we monitor

We continuously review per-key activity for a small set of patterns. At launch these are observed and reviewed, not automatically blocked: a flagged key triggers a human review, not an instant cutoff. The patterns are:
  • Sustained excessive request rate. A single key sending requests far above what its tier and normal application behavior would produce, held over a sustained window.
  • Sustained excessive token consumption. A single key generating tokens at a rate well beyond normal usage, sustained over time — the clearest signal of one key burning a disproportionate share of GPU budget.
  • Runaway or hung requests. A key accumulating many very long-running requests. This usually points at a misconfigured client (for example, an unbounded max_tokens on a non-streaming call that runs until it is cut off) rather than deliberate abuse, but the effect on shared capacity is the same.
For long-running non-streaming requests specifically, the most reliable fix is on your side: set a sensible max_tokens, or stream the response. Streaming keeps the connection alive token-by-token and avoids the timeouts that long non-streamed generations hit.
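As a rough illustration of that advice, the sketch below always attaches a bounded `max_tokens` to a request body and stops reading a streamed response after a client-side deadline. Only `max_tokens` is named on this page; the `prompt` and `stream` field names, the default cap of 1024, and the idea that a stream arrives as an iterable of text chunks are assumptions for the example, not documented API details.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional

DEFAULT_MAX_TOKENS = 1024  # illustrative cap, not a documented value


def build_request(prompt: str, max_tokens: Optional[int] = None,
                  stream: bool = True) -> Dict[str, Any]:
    """Build a request body that always carries a bounded max_tokens."""
    if max_tokens is None:
        max_tokens = DEFAULT_MAX_TOKENS
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "prompt": prompt,          # hypothetical field name
        "max_tokens": max_tokens,  # named in the text above
        "stream": stream,          # hypothetical field name
    }


def collect_stream(chunks: Iterable[str], deadline_s: float,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Join streamed text chunks, giving up once deadline_s seconds have passed."""
    start = clock()
    parts = []
    for chunk in chunks:
        parts.append(chunk)
        if clock() - start > deadline_s:
            break  # stop reading; the caller should close the connection
    return "".join(parts)
```

The deadline is checked only when a chunk arrives, so it bounds a slow stream but not a stalled one; a real client would also want a socket-level read timeout.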

What happens when a key is flagged

When a key trips one of these patterns, the typical sequence is:
  1. Review. We look at the account and the traffic to distinguish a legitimate ramp (a batch job, a load test, real growth) from abuse or a runaway client. A legitimate ramp needs no action from you — if anything, it is a signal to talk about a higher tier.
  2. Contact. If the pattern looks like abuse or a misconfigured client that is degrading the service for others, we reach out.
  3. Throttle or disable. If a key is actively degrading availability for other customers and the situation cannot wait, we may throttle it or temporarily disable it. A disabled key receives 402 Payment Required on its next call; in-flight work is allowed to finish, but no new requests are accepted until the key is re-enabled. We re-enable the key as soon as the issue is resolved.
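On the client side, a disabled key is best treated as a stop signal rather than something to retry. The sketch below, under the assumption that a hypothetical `send(job)` callable returns a `(status, body)` pair, works through a queue and halts at the first 402, leaving the remaining jobs unsent so they can be resubmitted once the key is re-enabled.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical shapes for this sketch: a job is a string, a reply is (status, body).
Reply = Tuple[int, str]
Done = Tuple[str, int, str]


def run_until_disabled(jobs: Iterable[str],
                       send: Callable[[str], Reply]) -> Tuple[List[Done], List[str]]:
    """Send jobs in order; stop at the first 402 and return (results, unsent)."""
    pending = list(jobs)
    results: List[Done] = []
    while pending:
        status, body = send(pending[0])
        if status == 402:
            break  # key disabled: keep this job and all later ones queued
        results.append((pending.pop(0), status, body))
    return results, pending
```

Retrying a 402 in a tight loop gains nothing, since no new requests are accepted until the key is re-enabled.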
If your application has a legitimate need for sustained high throughput, get in touch before you ramp — we would much rather provision for you than flag you. You can manage your keys and see your usage from the dashboard, and reach support from the same place.