Rate Limiting
Also known as: throttling, rate throttling, token bucket, leaky bucket
Controlling how many requests a client or service can make in a time window — protecting backends from overload, preventing abuse, and enforcing fair use across tenants.
- Primary domain
- Concurrency & Parallelism
- Sub-category
- Distributed Computing
In simple terms
Rate limiting puts a cap on how often a caller can invoke an API or resource in a given time window. Exceed the cap and you get a 429 Too Many Requests response. It protects the service from being overwhelmed — by a misbehaving client, a scraper, or a traffic spike — and ensures that no single tenant starves others.
The Visual Map
flowchart LR
R["requests arrive"] --> B{"token bucket:<br/>tokens available?"}
B -->|"yes — take 1 token"| P["allow -> backend"]
B -->|"no — bucket empty"| D["reject: 429 Too Many Requests<br/>+ Retry-After header"]
T["refill: +100 tokens/sec,<br/>capped at bucket size"] -.-> B
note["bucket capacity = max burst;<br/>refill rate = sustained rate"] -.-> B
More detail
Algorithms for counting and enforcing:
- Fixed window counter — increment a counter per client per time window (e.g. 1-minute buckets). Simple, but prone to burst at window boundaries: a client can fire limit requests at the end of one window and limit more at the start of the next, doubling the effective rate.
- Sliding window log — store timestamps of all requests in a rolling window. Precise but memory-heavy for high-traffic clients.
- Sliding window counter — a weighted hybrid of two fixed windows that approximates the sliding window with O(1) memory. Stripe and Cloudflare use this.
- Token bucket — each client has a bucket that refills at a fixed rate (e.g. 100 tokens/second). Each request costs 1 token; if the bucket is empty, reject. Allows short bursts up to bucket capacity.
- Leaky bucket — requests enter a queue and are processed at a fixed rate. Smooths traffic to a constant rate; excess is queued or dropped. Useful for output shaping (sending to downstream).
Where to enforce:
- Client-side — SDK retries with backoff; reduces wasted requests to the server.
- API gateway / reverse proxy — Nginx, Kong, AWS API Gateway, Cloudflare all implement rate limiting in the request path.
- Application level — in-process counters (per-process, not cluster-wide) or distributed counters in Redis/Memcached (cluster-wide, adds latency).
Distributed rate limiting requires a shared counter store. Redis INCR with expiry is the most common implementation. Approximate algorithms (token bucket in Redis Lua scripts, or a service like Envoy’s rate-limit service) avoid hot-key contention.
Quota types: per-IP, per-user, per-API key, per-tenant, per-endpoint, per-cost (different operations cost different tokens).
Without rate limiting, a single misbehaving client — intentional or not — can exhaust your server’s capacity and degrade service for all other users. Rate limiting is a fundamental reliability mechanism alongside circuit breakers and load balancing, and it also enables monetisation: tiered API plans charge for higher rate limits.
Under the Hood
The token bucket — the most-used algorithm — is a few lines, and its elegance is that it stores no history, just two numbers and a timestamp:
import time
class TokenBucket:
def __init__(self, rate, capacity):
self.rate = rate # tokens added per second (sustained rate)
self.capacity = capacity # max tokens (burst allowance)
self.tokens = capacity
self.last = time.monotonic()
def allow(self, cost=1):
now = time.monotonic()
# refill lazily: no background timer, just compute elapsed * rate
self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
self.last = now
if self.tokens >= cost:
self.tokens -= cost
return True
return False
bucket = TokenBucket(rate=5, capacity=10) # 5/s sustained, burst of 10
allowed = sum(bucket.allow() for _ in range(15)) # hammer 15 at once
print(f'instant burst: {allowed} allowed, {15 - allowed} threw 429')
time.sleep(1)
print('after 1s refill:', sum(bucket.allow() for _ in range(8)), 'of 8 allowed')
Lazy refill is the trick: instead of a timer adding tokens, allow computes how many would have accrued since the last call. That makes it O(1) memory and trivially portable to a Redis Lua script for cluster-wide limits — the same five lines, shared state.
Engineering Trade-offs
- Burst tolerance vs smoothness. Token bucket allows bursts up to capacity (good for bursty-but-legitimate clients); leaky bucket forces a constant output rate (good for protecting a fragile downstream). The choice is whether you’re guarding yourself or someone you call.
- Accuracy vs memory. Sliding-window logs are exact but store every timestamp; fixed windows are O(1) but allow 2× bursts at boundaries; sliding-window counters split the difference — the pragmatic default Stripe and Cloudflare ship.
- Local speed vs global correctness. Per-process counters add zero latency but let N replicas each grant the full limit (effective N× the intended cap); a shared Redis counter is globally correct but adds a network hop and a hot-key bottleneck per limited entity.
- Throttling vs rejecting. A 429 with
Retry-Afteris honest and cheap; queuing excess requests (leaky bucket) hides the limit but risks unbounded latency and memory. Rejecting early is usually kinder than accepting work you’ll drop later.
Real-world examples
- GitHub API limits unauthenticated clients to 60 req/hour and authenticated clients to 5,000 req/hour.
- Stripe’s API uses a token bucket per key; bursts are allowed up to a capacity, then throttled.
- Cloudflare uses sliding-window rate limiting at the edge to absorb DDoS traffic before it reaches origin servers.
- AWS API Gateway applies per-stage and per-method rate limits to protect Lambda backends.
Common misconceptions
- “Rate limiting stops DDoS attacks.” It throttles them — a sufficiently distributed DDoS can still saturate bandwidth before rate limits apply. Rate limiting is one layer of defence, not the only one.
- “Fixed window counting is good enough.” For most use cases it is; but APIs used by sophisticated clients (bots, automated pipelines) can exploit window boundaries for burst traffic.
Try it yourself
Expose the fixed-window boundary exploit — the bug that motivates sliding windows — with a burst straddling the reset:
python3 -c "
LIMIT = 5 # 5 requests per 10-second window
# a client fires 5 at the END of window 1, then 5 at the START of window 2
events = [(9.8, 1), (9.8, 1), (9.8, 1), (9.8, 1), (9.8, 1),
(10.1, 2), (10.1, 2), (10.1, 2), (10.1, 2), (10.1, 2)]
counts = {}
allowed = 0
for t, window in events:
counts[window] = counts.get(window, 0) + 1
if counts[window] <= LIMIT:
allowed += 1
print(f'fixed-window allowed {allowed} requests in a ~0.3s span')
print(f'intended max for any 10s window: {LIMIT} -> client got {allowed/LIMIT:.0f}x the cap')
"
Ten requests in a third of a second, every one allowed, because they split across two windows the counter treats independently. Sliding-window counters weight the previous window to close exactly this gap.
Learn next
- Circuit breaker — fail fast when a dependency is already sick.
- Idempotency — what makes the client’s retry-after-429 safe.
- Load balancer — where edge rate limits are often enforced.
Relationships
- Requires
- Related
Neighborhood
A visual companion to the relationships above. Click any node to visit that topic.