LLM Rate Limit Planner

Plan your API capacity before hitting 429 errors in production.

Supports OpenAI GPT-5.4, GPT-4.1, o3, o4-mini, Anthropic Claude Opus 4.7, Sonnet 4.6, Haiku 4.5, Google Gemini 3.1, Gemini 2.5, Groq Llama 4, DeepSeek V3, Mistral, xAI Grok. Calculate RPM, TPM, concurrent users, and monthly API costs.

Provider & model

Provider
Model
Tier / plan ?Your tier sets the per-minute request and token limits your provider grants you. It usually upgrades automatically as your cumulative spend increases. Check your provider's console to confirm your current tier.

Concurrent users ?Users actively firing API requests at the same time — not your total user base. For 1,000 registered users, real concurrency is typically 5–15% of that.
Input tokens / msg ?Tokens in your request: system prompt + user message combined. Rule of thumb: 1 token ≈ 0.75 English words. A 500-token message ≈ 375 words.
Output tokens / msg ?Tokens the model generates per response. Short answers: 50–150, detailed replies: 300–800. Output tokens are typically 3–5x more expensive than input tokens.
Avg. request duration (seconds) ?Time from sending the request to receiving the full response. This determines how many requests per minute each user generates. For streaming, use total duration — not time-to-first-token.
Typical: GPT-4o mini ~1–2 s · GPT-5.4 ~2–5 s · Claude Sonnet ~2–5 s · Gemini Flash ~1–3 s · Groq ~0.5–1 s

Required RPM ?RPM = Requests Per Minute. How many API calls your users generate each minute. Each message sent = 1 request. If your required RPM exceeds the tier limit, you'll get 429 errors.
limit: —
Required TPM ?TPM = Tokens Per Minute. Total input tokens processed per minute across all users. Longer prompts or more users burn through TPM faster. Separate from RPM — either one can trigger a 429 first.
limit: —
Max concurrent users
on this tier
RPM usage ?How much of your tier's RPM limit you're consuming. Above 90% means traffic spikes will cause 429 errors.
TPM usage (input) ?How much of your tier's input TPM limit you're consuming. Often the binding constraint for apps with long system prompts or large context windows.
Bottleneck ?Which limit you'll hit first — RPM or TPM. Apps with many short messages are RPM-bound. Apps with long prompts or large contexts are TPM-bound.
Estimated cost / minute (USD)

Data from official provider docs — April 2026. Always verify at your provider's console before finalizing architecture decisions.

Monthly budget (USD) ?How much you're willing to spend on this model per month. The planner will calculate the maximum concurrent users this budget supports at your chosen usage profile.
Input tokens / msg ?Tokens in your request including system prompt + user message. 1 token ≈ 0.75 English words.
Output tokens / msg ?Tokens the model generates per response. Output is typically 3–5x more expensive than input.
Messages per user / day ?How many API messages an average active user sends per day. For a chat app: 10–30 typical. For an agent running tasks: could be 100+.
Active days / month ?How many days per month each user is active. 22 = weekdays only, 30 = daily usage.

With your budget you can support
monthly active users
Cost per user / month
Cost per message
Messages per user / month
Total messages / month (at max users)
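The budget outputs above follow from straightforward arithmetic. Here is a minimal sketch of the calculation in Python — the per-million-token prices in the example are hypothetical placeholders, not any provider's actual rates:

```python
def budget_plan(monthly_budget, input_tokens, output_tokens,
                msgs_per_day, active_days,
                input_price_per_mtok, output_price_per_mtok):
    """How many monthly active users a budget supports.
    Ignores prompt caching and batch discounts, as the planner does."""
    # Prices are quoted per million tokens (MTok).
    cost_per_msg = (input_tokens * input_price_per_mtok
                    + output_tokens * output_price_per_mtok) / 1_000_000
    msgs_per_user = msgs_per_day * active_days
    cost_per_user = cost_per_msg * msgs_per_user
    max_users = int(monthly_budget / cost_per_user)
    return cost_per_msg, cost_per_user, max_users

# Hypothetical example: $500/month, 500 in / 200 out tokens per message,
# 20 messages/day, 22 active days, $0.50 in / $2.00 out per MTok.
cost_msg, cost_user, users = budget_plan(
    monthly_budget=500, input_tokens=500, output_tokens=200,
    msgs_per_day=20, active_days=22,
    input_price_per_mtok=0.50, output_price_per_mtok=2.00)
# 500 × $0.50 + 200 × $2.00 = 650 micro-dollars → $0.00065 per message
```

At these assumed rates each user costs about $0.29/month (440 messages × $0.00065), so a $500 budget supports roughly 1,700 users.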

Cost estimate only — excludes prompt caching savings, batch API discounts, and infrastructure costs. Always verify pricing at your provider's documentation.

How it works

LLM APIs enforce two independent rate limits: RPM (Requests Per Minute) and TPM (Tokens Per Minute). Hitting either one returns a 429 error and blocks your users until the window resets.

This planner calculates how much of your tier capacity your application will consume based on concurrent users, average message size, and request duration. It tells you which limit you will hit first — and how far away you are from the ceiling.

1. Select your model
Choose your provider, model, and current API tier.
2. Enter usage profile
Set concurrent users, tokens per message, and request duration.
3. Read the results
See required RPM/TPM, your bottleneck, and estimated cost per minute.
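The calculation behind these three steps can be sketched in a few lines. This is a simplified steady-state model, and the tier limits in the example are illustrative, not any provider's published numbers:

```python
def plan(concurrent_users, input_tokens, request_seconds,
         rpm_limit, tpm_limit):
    """Estimate required RPM/TPM and which rate limit binds first."""
    # Each user completes one request every `request_seconds`,
    # so each generates 60 / request_seconds requests per minute.
    required_rpm = concurrent_users * 60 / request_seconds
    # Input TPM is simply RPM times input tokens per request.
    required_tpm = required_rpm * input_tokens

    rpm_usage = required_rpm / rpm_limit
    tpm_usage = required_tpm / tpm_limit
    bottleneck = "RPM" if rpm_usage >= tpm_usage else "TPM"
    return required_rpm, required_tpm, bottleneck

# 50 users, 500 input tokens, 3 s requests, on an illustrative
# 500 RPM / 2M TPM tier: RPM is the binding constraint.
rpm, tpm, limit_hit = plan(50, 500, 3, rpm_limit=500, tpm_limit=2_000_000)
```

Note that the bottleneck is whichever limit has the higher *utilization ratio*, not the higher absolute number — a low RPM limit can bind even when TPM headroom looks huge.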

Example scenarios

Early-stage SaaS — 50 concurrent users, GPT-4o mini, OpenAI Tier 1
50 users, each completing a 3 s request → 50 × (60 ÷ 3) = 1,000 RPM needed. Tier 1 allows 500 RPM. Result: critical — you will hit the RPM limit immediately. Upgrade to Tier 2 or reduce concurrency.
Production chatbot — 200 concurrent users, Claude Sonnet 4.6, Anthropic Tier 3
200 users with 800-token prompts and 4 s requests → 200 × (60 ÷ 4) = 3,000 RPM needed. Tier 3 allows 2,000 RPM, while TPM usage is only 40%. Bottleneck is RPM — consider upgrading to Tier 4 or batching requests.
Document processing pipeline — 30 concurrent jobs, Gemini 2.0 Flash, Pay-as-you-go
30 jobs with 8,000-token inputs and 10 s runs → 30 × (60 ÷ 10) = 180 RPM and 180 × 8,000 = 1.44M TPM. Pay-as-you-go limits are 2,000 RPM / 4M TPM. Result: comfortable — plenty of headroom on both dimensions.

Frequently asked questions

What is a 429 error?
A 429 Too Many Requests error means you have exceeded your API provider's rate limit. The response includes a Retry-After header indicating how long to wait. Your application should implement exponential backoff to handle these gracefully.
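A minimal retry wrapper along those lines might look like this — `send_request` is a hypothetical callable returning a requests-style response object with `.status_code` and `.headers`:

```python
import random
import time

def call_with_backoff(send_request, max_retries=5):
    """Retry on HTTP 429, honoring the Retry-After header when present,
    otherwise falling back to exponential backoff with jitter."""
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            # Provider told us exactly how long to wait.
            delay = float(retry_after)
        else:
            # 1 s, 2 s, 4 s, ... capped at 60 s, plus random jitter
            # so that many clients don't retry in lockstep.
            delay = min(2 ** attempt, 60) + random.random()
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries")

```

Most official SDKs ship retry logic with similar behavior built in; check yours before rolling your own.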
What is the difference between RPM and TPM?
RPM (Requests Per Minute) limits how many API calls you can make, regardless of size. TPM (Tokens Per Minute) limits the total volume of text processed. Either limit can trigger a 429 — whichever you hit first is your bottleneck. Apps with many short messages are RPM-bound. Apps with long system prompts or large context windows are typically TPM-bound.
How do I find my current tier?
OpenAI: platform.openai.com → Settings → Limits. Anthropic: console.anthropic.com → Settings → Limits. Google: aistudio.google.com → API keys. Tiers are usually assigned automatically based on cumulative spend.
What is "concurrent users" — is it the same as total users?
No. Concurrent users means users actively making API requests at the same moment. For most SaaS apps, real concurrency is 5–15% of your total user base. If you have 1,000 registered users, expect 50–150 concurrent at peak hours.
How often is the pricing data updated?
We update the model and pricing data manually when providers make changes. Data was last verified in April 2026. Always cross-check with your provider's official documentation before making architecture or budget decisions.
Does prompt caching affect TPM limits?
Yes — significantly. Anthropic and OpenAI both exclude cached input tokens from TPM calculations. If your system prompt is large and consistent across requests, enabling prompt caching can effectively multiply your TPM capacity by 5–10x. This planner does not account for caching, so real-world limits will be higher if you use it.
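As a rough illustration of that effect — assuming cached input tokens are fully excluded from TPM accounting, as described above:

```python
def effective_tpm_multiplier(input_tokens, cached_tokens):
    """Factor by which prompt caching stretches an input-TPM limit,
    assuming cached tokens are excluded from the TPM count."""
    counted = input_tokens - cached_tokens
    return input_tokens / counted

# A 4,500-token prompt where a 4,000-token system prompt is cached:
# only 500 tokens count toward TPM per request.
mult = effective_tpm_multiplier(4_500, 4_000)
```

In this hypothetical case, effective input-TPM capacity is multiplied by 9 — which is why caching a large, stable system prompt is often the cheapest way to escape a TPM bottleneck.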