Backpressure

Rate limiting and request queueing for load protection


Pounce provides two complementary load protection mechanisms: rate limiting (per-client) and request queueing (global). Use both together for comprehensive protection.

           Rate Limiting             Request Queueing
Purpose    Prevent per-client abuse  Handle global overload
Scope      Per IP address            All clients
Response   429 Too Many Requests     503 Service Unavailable
Algorithm  Token bucket              Bounded semaphore

Rate Limiting

Per-IP token bucket rate limiting. Each client IP gets its own bucket that refills at a steady rate and allows configurable burst.

Configuration

from pounce import ServerConfig

config = ServerConfig(
    rate_limit_enabled=True,
    rate_limit_requests_per_second=100.0,  # sustained rate per IP
    rate_limit_burst=200,                   # max burst capacity
)

Option                          Default  Description
rate_limit_enabled              False    Enable per-IP rate limiting
rate_limit_requests_per_second  100.0    Token refill rate
rate_limit_burst                200      Maximum bucket capacity

How It Works

  • New clients start with a full bucket
  • Each request consumes one token
  • Tokens refill at the requests_per_second rate
  • Empty bucket = 429 response with a Retry-After header
  • Inactive buckets are cleaned up every 5 minutes
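The refill-and-consume cycle above can be sketched as a standalone token bucket. This is illustrative only, not Pounce's internal implementation; the class name and parameter values are arbitrary:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec, capped at `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst            # new clients start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0         # each request consumes one token
            return True
        return False                   # empty bucket -> caller sends 429

bucket = TokenBucket(rate=1.0, burst=3)
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
```

The burst parameter is simply the bucket's capacity, which is why a burst of 2-5x the sustained rate absorbs short spikes without raising the long-term limit.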

Choosing Limits

Profile           Rate            Burst
Public API        10-50 req/s     2-5x rate
Web app           50-100 req/s    2x rate
Internal service  100-1000 req/s  5-10x rate

Behind a Proxy

When behind nginx/HAProxy, configure trusted_hosts so Pounce sees real client IPs:

config = ServerConfig(
    rate_limit_enabled=True,
    trusted_hosts=frozenset({"10.0.0.0/8"}),
)

Request Queueing

Global bounded queue with load shedding. When all workers are busy, requests queue up to a maximum depth. Beyond that, new requests get an immediate 503.
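The bounded-semaphore pattern behind this can be sketched with asyncio. This is a minimal illustration, not Pounce's implementation; the depth of 2 and the sleep-based "work" are arbitrary:

```python
import asyncio

async def main():
    max_depth = 2                       # stand-in for request_queue_max_depth
    slots = asyncio.Semaphore(max_depth)
    results = []

    async def handle(i: int) -> None:
        if slots.locked():              # no free slot: shed load immediately
            results.append((i, 503))
            return
        async with slots:               # hold a queue slot while "working"
            await asyncio.sleep(0.01)
            results.append((i, 200))

    # Four simultaneous requests against a queue depth of 2
    await asyncio.gather(*(handle(i) for i in range(4)))
    return results

results = asyncio.run(main())
print(results)
```

The key property is that rejection is immediate: requests beyond the depth limit never wait, so overload produces fast 503s rather than ever-growing latency.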

Configuration

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=1000,  # 0 = unlimited (not recommended)
)

Option                   Default  Description
request_queue_enabled    False    Enable request queueing
request_queue_max_depth  1000     Max queued requests (0 = unlimited)

Choosing Queue Depth

queue_depth = peak_rps * acceptable_wait_seconds
  • Conservative (predictable load): 100-500
  • Moderate (variable load): 500-1000
  • Aggressive (bursty traffic): 1000-5000
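Plugging illustrative numbers into the formula above:

```python
peak_rps = 200           # hypothetical peak requests/second
acceptable_wait = 2.5    # seconds a client may tolerably wait in queue

queue_depth = int(peak_rps * acceptable_wait)
print(queue_depth)  # 500, in the moderate band
```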

Capacity Planning

Monitor 503 rates to inform scaling:

  • > 5% rejection rate = scale up
  • 0.1-1% rejection = right-sized
  • Queue frequently full = add replicas
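As a sketch, the rejection rate can be computed from request counters; the numbers here are hypothetical:

```python
def rejection_rate(rejected: int, total: int) -> float:
    """Fraction of requests shed with 503 (or 429) over a window."""
    return rejected / total if total else 0.0

# e.g. 120 rejections out of 10,000 requests in the window
rate = rejection_rate(120, 10_000)
print(f"{rate:.1%}")  # 1.2% -- just above the 0.1-1% comfort band
```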

Combined Example

config = ServerConfig(
    # Per-client protection
    rate_limit_enabled=True,
    rate_limit_requests_per_second=100.0,
    rate_limit_burst=200,

    # Global overload protection
    request_queue_enabled=True,
    request_queue_max_depth=500,
)

Client Handling

Both 429 and 503 responses include a Retry-After header. Clients should implement exponential backoff:

import time

import requests

max_retries = 5  # tune for your client's latency budget
for attempt in range(max_retries):
    response = requests.get(url)
    if response.status_code not in (429, 503):
        break
    # Honor the server's Retry-After hint, backing off exponentially on repeats
    retry_after = int(response.headers.get("Retry-After", 1))
    time.sleep(retry_after * (2 ** attempt))

Performance

  • Rate limiting: ~5-10 µs/request, ~100 bytes per active IP
  • Request queueing: ~1-5 µs/request (async semaphore acquire/release)
  • Both are thread-safe under free-threading mode