
API Rate Limit & Throttling Planner (Burst + Backoff)

Plan your API rate limits, burst capacity, and throttling strategy based on expected traffic. Get recommended client pacing, 429 risk estimates, and retry/backoff guidance.

Reviewed by Waqar Kaleem Khan, Founder & Lead AI Engineer

Your new endpoint goes live Monday. Marketing sends a launch blast to 80,000 subscribers, who hit the API within an hour. Without a throttle, the first 5,000 concurrent calls saturate the database pool and the rest get a blank page. API rate limiting stands between a smooth launch and a self-inflicted outage. Most teams bolt it on a week before shipping and wonder why legitimate users eat 429s.

Enter average and peak request rates to size a bucket plan, estimate 429 risk, and generate a retry schedule.

Token Bucket vs Leaky Bucket: Pick the Right Model for Your Traffic

A token bucket accumulates tokens at a fixed refill rate up to a cap. Each request consumes one. When empty, requests get rejected. Spikes up to the bucket size pass through, which is good for user-facing APIs where short bursts are normal.
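The refill-and-consume cycle described above fits in a few lines. This is a minimal single-process sketch, not the planner's implementation. The class name `TokenBucket`, the `allow` method, and the injectable `clock` parameter are illustrative assumptions.

```python
import time


class TokenBucket:
    """Refills at a fixed rate up to a cap; each request costs one token."""

    def __init__(self, refill_rate, capacity, clock=time.monotonic):
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.capacity = float(capacity)        # cap on stored tokens (burst size)
        self.clock = clock                     # injectable so tests can control time
        self.tokens = float(capacity)          # assumption: the bucket starts full
        self.last = clock()

    def allow(self):
        """Consume one token and return True, or return False when empty."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Add tokens for the elapsed time, never exceeding the cap.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket built with `TokenBucket(refill_rate=100, capacity=1000)` would let a burst of up to 1,000 requests through, then settle to 100 per second. A rejected call is where a server would return 429.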

A leaky bucket queues requests and drains at a constant rate. Traffic smooths, but latency rises under load. Pick leaky when downstream can't tolerate bursts: payment processors and webhook receivers are typical cases.
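The smoothing effect, and the latency it costs, shows up clearly if you model the queue as a series of drain slots. The sketch below is a simplified model under stated assumptions: it computes when each request would be forwarded instead of running a real queue and worker, and `LeakyBucket`, `schedule`, and `max_queue` are hypothetical names.

```python
class LeakyBucket:
    """Forwards requests at a constant drain rate; rejects when the backlog is full."""

    def __init__(self, drain_rate, max_queue):
        self.interval = 1.0 / drain_rate  # seconds between forwarded requests
        self.max_queue = max_queue        # how many requests may wait ahead of a new one
        self.next_free = 0.0              # time of the next unclaimed drain slot

    def schedule(self, now):
        """Return the forward time for a request arriving at `now`, or None if full."""
        start = max(now, self.next_free)
        backlog = (start - now) / self.interval  # requests already waiting ahead
        if backlog >= self.max_queue:
            return None  # queue full: reject instead of growing latency without bound
        self.next_free = start + self.interval
        return start
```

With `drain_rate=2`, four requests arriving together are forwarded half a second apart. The wait between arrival and forward time is the latency cost of the smoothing.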

Burst Capacity: The Window Between Smooth and Rejected

A dashboard loading six panels fires six requests at once. That looks like a burst. Set burst too low and normal behaviour triggers 429s. The other extreme is just as bad: a single misbehaving client drains the pool before everyone else gets a turn. Size burst from real traffic logs, not gut feeling.

429 Risk and What Happens When Clients Don’t Back Off

HTTP 429, defined in RFC 6585, tells the client to slow down. But poorly written clients ignore Retry-After and hammer the endpoint in a tight loop, turning mild overload into a feedback storm. If even 5% of clients retry without backoff, their rejected requests come straight back as new load, so throttling adds traffic instead of shedding it.

Retry Strategy: Exponential Backoff with Jitter, Not Blind Loops

Exponential backoff doubles the wait: 1 s, 2 s, 4 s, 8 s. Without jitter, every throttled client retries at the same instant, producing a synchronized stampede. Random jitter (±30%) spreads retries and breaks the herd. Document this in your API docs. Clients follow whatever retry pattern you publish.

Sustained RPS vs Peak RPS: Sizing for the Spike

Average RPS sets baseline capacity. Peak RPS is what hits at 9 AM Monday when every cron fires. Size to the average and every spike becomes a throttle event. Sizing to peak with a 1.1 to 1.3× margin lets burst cover the gap, so normal spikes pass without 429s.

Result Snapshot: What Your Rate Limit Numbers Mean

The planner outputs refill rate, bucket capacity, client pace, and estimated 429 risk at the traffic you entered. If risk exceeds tolerance, raise the ceiling or reduce peak via queue-based ingestion. Numbers assume one gateway node. Multiple nodes with local counters need a shared store like Redis.

Before You Ship It: Rate Limit Deployment Mistakes

  • Distributed counters without sync. Two nodes each allowing 100 RPS means the backend sees 200. Use a central counter, or consistent hashing so each client always lands on the same node.
  • Clock skew on sliding windows. A drifting client clock lands requests outside the allocated window and gets throttled for no real reason.
  • Webhook fan-out that mimics DDoS. One event triggers 500 parallel callbacks. Stagger delivery with a short random delay per hook.
  • Per-IP limits on shared NATs. An entire office behind one public IP shares a single limit. Per-API-key granularity handles this better.

Oversights that surface in production: shipping without a Retry-After header so clients can't implement backoff, and setting identical limits for reads and writes when writes cost 10× more on the backend.

Related on EverydayBudd's developer utilities hub: the SLA Uptime Calculator for availability targets that interact with rate-limit policy, and the Password Strength & Entropy Estimator for auth-tier rate-limiting context.

Rate limit plans from this tool are capacity-planning estimates. They don't replace load testing, production monitoring, or architectural review of your gateway infrastructure.

Frequently Asked Questions

What's a 429 response actually telling the client to do?

HTTP 429 Too Many Requests, defined in RFC 6585, signals that the client has exceeded the rate limit and should slow down. The response should include a Retry-After header, either a seconds count (`Retry-After: 30`) or an HTTP-date (`Retry-After: Wed, 21 Oct 2026 07:28:00 GMT`). Well-behaved clients (Stripe SDK, GitHub Octokit, AWS SDK) parse Retry-After and back off automatically. Naive clients ignore it and hammer immediately, which is why server-side rate limiters need to be defensive. A 429 isn't a polite request, it's a contract.
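Handling both Retry-After forms on the client side takes only the standard library. The sketch below is one possible approach. The function name `retry_after_seconds` and the `now` parameter are illustrative assumptions, and malformed values are left to raise so the caller can fall back to its own backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value to a non-negative delay in seconds."""
    value = value.strip()
    if value.isdigit():
        # Seconds form, e.g. "30".
        return float(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat a zone-less date as UTC
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means "retry now", not a negative wait.
    return max(0.0, (when - now).total_seconds())
```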

Token bucket or leaky bucket: which one for a public REST API?

Token bucket. It allows clients to spike up to a defined burst capacity, then settles to a steady refill rate, which matches how real API traffic behaves (clients send a flurry of requests, then go quiet). Leaky bucket smooths output at a constant drain rate, which is better for protecting downstream systems that genuinely can't handle bursts (databases without connection pooling, payment processors with strict per-second limits). For a typical REST API serving frontends and integrations, token bucket with a 5 to 15 second burst capacity gives the best client experience.

How do I size token bucket capacity from my expected traffic?

Capacity = refill rate × burst seconds. If you want to allow 100 RPS sustained with 10-second bursts, you need a capacity of 1,000 tokens. The burst-seconds parameter is the operational lever. 5 seconds is conservative. 30 seconds is generous. More than 60 seconds usually means you should raise the steady-state limit instead. The calculator does this math directly. The hard part isn't the math, it's deciding what burst behavior your downstream services can actually absorb.
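The capacity formula above is easy to check for a few burst windows. This is a small illustration, and `bucket_capacity` is a hypothetical helper name, not part of the calculator.

```python
def bucket_capacity(refill_rate_rps, burst_seconds):
    """Capacity = refill rate × burst seconds."""
    return refill_rate_rps * burst_seconds


# Compare the conservative, worked-example, and generous burst windows at 100 RPS.
for burst_seconds in (5, 10, 30):
    print(f"{burst_seconds:>2} s burst -> {bucket_capacity(100, burst_seconds)} tokens")
```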

My peak RPS is 300 but my hard cap is 275. What's my 429 risk?

Oversubscription = (peak - cap) / peak × 100. So (300 - 275) / 300 × 100 = 8.3% of peak-load requests will hit 429. That's not "8.3% of all requests" because peaks are usually a small fraction of the total request volume. If your peak duration is 30 seconds out of every 5 minutes (10% of the time), the actual 429 rate is closer to 8.3% × 10% = 0.83% of all requests. The calculator surfaces both numbers because the right one to optimize against depends on whether your customers care about peak-time experience or aggregate experience.
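Both numbers from the answer above can be computed as follows. This sketch assumes the same time-weighted approximation the text uses (peak share measured as a fraction of time, not of requests), and the function names are illustrative.

```python
def oversubscription_pct(peak_rps, cap_rps):
    """Share of peak-load requests that exceed the cap, as a percentage."""
    if peak_rps <= cap_rps:
        return 0.0  # the cap covers the peak, so nothing is rejected
    return (peak_rps - cap_rps) / peak_rps * 100.0


def aggregate_429_pct(peak_rps, cap_rps, peak_time_fraction):
    """Approximate overall 429 rate when peaks occupy a fraction of the time."""
    return oversubscription_pct(peak_rps, cap_rps) * peak_time_fraction


peak_risk = oversubscription_pct(300, 275)          # about 8.3
overall_risk = aggregate_429_pct(300, 275, 0.10)    # about 0.83
```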

Should I implement client-side pacing or just rely on retry-on-429?

Pace if you can. The pacing math is `recommended_pace_ms = 1000 / allowed_RPS`. For a 100 RPS limit, send a request every 10ms. Pacing prevents 429s from happening, which is better than handling them gracefully because each 429 is a wasted round trip plus a retry that doubles your client's effective latency. Retry-on-429 is the safety net for when your traffic profile changes faster than your pacing logic does. Both belong in production code. Pacing alone is fragile, retry-only is wasteful.
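The pacing formula translates directly into a send schedule. The sketch below only computes the timestamps. A real client would sleep until each one, and `pace_interval_ms` and `send_times_ms` are hypothetical names.

```python
def pace_interval_ms(allowed_rps):
    """recommended_pace_ms = 1000 / allowed_RPS"""
    return 1000.0 / allowed_rps


def send_times_ms(count, allowed_rps, start_ms=0.0):
    """Evenly spaced send times (in ms) that stay at or under the allowed rate."""
    step = pace_interval_ms(allowed_rps)
    return [start_ms + i * step for i in range(count)]
```

For a 100 RPS limit, `send_times_ms(3, 100)` spaces the requests 10 ms apart, starting at `start_ms`.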

Exponential backoff with jitter. What's the actual formula?

wait = base × (2 ^ attempt) ± jitter. With base = 250ms: attempt 1 waits 500ms ± jitter, attempt 2 waits 1000ms ± jitter, attempt 3 waits 2000ms, and so on. Jitter is typically ±50% of the computed wait. Without it, all clients that got rate-limited at the same moment retry at the same moment, recreating the thundering herd that caused the rate limit in the first place. Cap the maximum wait somewhere reasonable (30 to 60 seconds for user-facing flows, 5+ minutes for background jobs).
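As a function, the formula above might look like this. It is a sketch under a few assumptions: jitter is applied as a uniform multiplier in the range 1 ± `jitter`, the cap is applied before jitter so capped clients still spread out, and the `rng` parameter exists only to make the behaviour testable.

```python
import random


def backoff_ms(attempt, base_ms=250.0, jitter=0.5, cap_ms=30_000.0, rng=random.random):
    """wait = base × (2 ^ attempt) ± jitter, with the un-jittered wait capped."""
    raw = min(cap_ms, base_ms * (2 ** attempt))
    # rng() is uniform in [0, 1), so the factor is uniform in [1 - jitter, 1 + jitter).
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return raw * factor
```

Because the cap comes first, a capped wait can still land up to 50% above `cap_ms`. Apply the cap after the jitter instead if the maximum has to be strict.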

Fixed window or sliding window: which counter to pick?

Fixed windows are simpler. Count requests in the current minute (or hour, or day), reset at the boundary. The downside is the boundary effect. A client can send the full quota in the last second of a window and again in the first second of the next, effectively doubling the limit for two seconds. Sliding windows weight requests by recency (or use a rolling counter) and prevent the boundary spike, but cost more memory and CPU. For most applications, fixed windows with appropriate limits work fine. Use sliding only when strict compliance demands it.
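The boundary effect is easiest to see by running the same traffic through both counters. The sketch below uses a sliding-window log, which is one of several sliding variants, and the class names are illustrative.

```python
from collections import deque


class FixedWindow:
    """Counts requests per window; the count resets at each boundary."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.current, self.count = None, 0

    def allow(self, now):
        index = int(now // self.window_s)
        if index != self.current:  # crossed a boundary: start a fresh count
            self.current, self.count = index, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps a timestamp per accepted request; memory grows with the limit."""

    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.stamps = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the rolling window.
        while self.stamps and self.stamps[0] <= now - self.window_s:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False
```

Send five requests at 59.5 s and five more at 60.5 s against a limit of 5 per 60 s. The fixed window accepts all ten, while the sliding log accepts only the first five.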

What headers should my API return so clients can pace themselves?

Three widely used headers (a de facto convention, not a formal standard): `X-RateLimit-Limit` (max allowed in the current window), `X-RateLimit-Remaining` (how many remain), and `X-RateLimit-Reset` (Unix timestamp or seconds-until-reset). On 429 responses, add `Retry-After`. GitHub's REST API returns these headers. Stripe uses a slightly different convention (`Stripe-Should-Retry`, `Retry-After`). Pick a convention, document it publicly, and never change it without a major version bump. Clients hard-code the header names.
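A server-side helper for emitting these headers could look like the sketch below. It assumes the Unix-timestamp form of `X-RateLimit-Reset` and the seconds form of `Retry-After`. The function name and parameters are hypothetical.

```python
import math


def rate_limit_headers(limit, remaining, reset_epoch, now_epoch, throttled=False):
    """Build rate-limit response headers; Retry-After is added only on 429s."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),  # assumption: Unix timestamp form
    }
    if throttled:
        # Round up so clients never retry a fraction of a second too early.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now_epoch)))
    return headers
```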

How do I size a concurrency limit alongside the RPS limit?

Concurrency = RPS × average request duration (in seconds). For 100 RPS with 200ms average latency: 100 × 0.2 = 20 concurrent requests in flight on average. Add 50 to 100% headroom for tail latency, capping concurrency at 30 to 40. Concurrency limits matter more than RPS limits for protecting downstream resources. A database with 20 connections doesn't care about your RPS, it cares whether you're trying to hold 50 connections at once. Both limits apply simultaneously. The tighter one wins.
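The sizing rule above, with headroom, as a short calculation. `concurrency_limit` is an illustrative helper, and rounding up to a whole request is an assumption.

```python
import math


def concurrency_limit(rps, avg_latency_s, headroom=0.5):
    """Concurrency = RPS × average request duration, plus headroom for tail latency."""
    in_flight = rps * avg_latency_s  # average requests in flight
    # Round away float noise first, then round up to a whole request.
    return math.ceil(round(in_flight * (1.0 + headroom), 6))


low = concurrency_limit(100, 0.2, headroom=0.5)   # 50% headroom
high = concurrency_limit(100, 0.2, headroom=1.0)  # 100% headroom
```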

What's the right rate limit for a brand-new API I haven't launched yet?

Start conservative, observe, raise. Pick a limit that's 5 to 10x your expected normal load, generous enough that legitimate clients don't hit it, low enough that abusive clients trip it within minutes. Watch the 429 rate during launch week. If less than 1% of traffic is hitting limits, the limit is too loose. If more than 5% is hitting limits and the affected clients are legitimate, the limit is too tight. Iterate on the actual data, not the prediction. Most APIs that launched with "we'll figure out limits later" never figured them out.

When should distributed rate limiting (Redis-backed) replace local in-memory counters?

When you have more than one server instance handling the same logical rate limit. A 100 RPS limit running locally on three load-balanced instances is actually a 300 RPS limit because each instance counts independently. Redis INCR with TTL gives you a single counter all instances share. The cost is one Redis round trip per request, which adds 1 to 3 ms latency. Worth it once you run two or more instances, overkill on a single-instance service. The patterns to research: Redis token bucket, Lua-scripted atomic decrement, sliding-window counters with sorted sets.
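The shared-counter idea can be sketched against any store that offers an atomic increment and a key expiry. The code below assumes a client object with `incr(name)` and `expire(name, seconds)` methods, as the redis-py client provides. `DictStore` is an in-memory stand-in so the example runs without a Redis server, and the key format is made up for illustration.

```python
class DictStore:
    """In-memory stand-in for a shared store (illustration only, never expires keys)."""

    def __init__(self):
        self.data = {}

    def incr(self, name):
        self.data[name] = self.data.get(name, 0) + 1
        return self.data[name]

    def expire(self, name, seconds):
        return True  # a real store would delete the key after `seconds`


def allow_request(store, key, limit, window_s, now_s):
    """Fixed-window check against one counter shared by every instance."""
    bucket = f"ratelimit:{key}:{int(now_s // window_s)}"
    count = store.incr(bucket)  # one round trip per request
    if count == 1:
        # First hit in this window: set a TTL so stale counters clean themselves up.
        store.expire(bucket, window_s)
    return count <= limit
```

The increment and expiry here are two separate calls, so a crash between them could leave a counter without a TTL. That gap is one reason to look at the Lua-scripted atomic variants mentioned above.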

Should I send 429 responses or just queue requests internally?

Both have a place. Return 429 with a Retry-After header (RFC 6585) when the request would exceed the published limit. That's the contract clients can program against, and well-behaved clients will back off correctly. Internal queuing makes sense for request types that are time-insensitive and idempotent (batch ingestion, webhook redelivery), where the alternative would be a thundering herd of retries. Queue when the work tolerates latency. Reject with 429 when the contract demands predictable behavior. Don't queue silently. Silent queuing makes upstream timeouts your problem.
