Request Queueing

Bounded request queues with load shedding for graceful overload handling


Application-level request queueing with bounded capacity for graceful overload handling.

Overview

Request queueing provides graceful degradation under load:

  • Buffer requests when workers are busy
  • Shed load when queue fills up (503 responses)
  • Monitor queue depth and wait times
  • Prevent resource exhaustion during traffic spikes

When to Use

Request queueing is ideal for:

  • Bursty traffic - Handle temporary traffic spikes
  • Background processing - Queue requests during high load
  • Graceful degradation - Return 503 instead of timing out
  • Capacity planning - Monitor queue to identify scaling needs

Difference from Rate Limiting

Feature    Rate Limiting             Request Queueing
Purpose    Prevent per-client abuse  Handle global server overload
Scope      Per client IP             Global (all clients)
Response   429 (rate limited)        503 (overloaded)
When       Client exceeds limit      Server at capacity

Use both together for comprehensive protection:

  • Rate limiting stops abusive clients
  • Request queueing handles legitimate traffic spikes

Quick Start

Basic Configuration

Enable request queueing with default settings (queue up to 1000 requests):

from pounce import ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
)

Custom Queue Depth

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=500,  # Queue up to 500 requests
)

Unlimited Queue

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=0,  # Unlimited
)

Warning: Unlimited queues can lead to memory exhaustion under sustained overload!

Configuration Options

Parameter                Type  Default  Description
request_queue_enabled    bool  False    Enable request queueing
request_queue_max_depth  int   1000     Maximum queued requests (0 = unlimited)

How It Works

Request Flow

1. Request arrives -> Check queue capacity
2. If capacity available -> Acquire queue slot -> Process request -> Release slot
3. If queue full -> Return 503 Service Unavailable
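This flow can be sketched as a minimal ASGI wrapper built on an asyncio Semaphore. Note this is an illustrative sketch, not pounce's internal code: `queue_middleware` and `send_503` are made-up names.

```python
import asyncio

async def send_503(send):
    # Reject immediately with 503 and a Retry-After hint
    await send({
        "type": "http.response.start",
        "status": 503,
        "headers": [(b"content-type", b"text/plain"),
                    (b"retry-after", b"5")],
    })
    await send({"type": "http.response.body",
                "body": b"Service Unavailable - Server Overloaded"})

def queue_middleware(app, max_depth=1000):
    # Semaphore capacity models the bounded queue
    slots = asyncio.Semaphore(max_depth)

    async def wrapper(scope, receive, send):
        if scope["type"] != "http":
            await app(scope, receive, send)
            return
        if slots.locked():          # queue full -> shed load
            await send_503(send)
            return
        async with slots:           # acquire slot; released on completion
            await app(scope, receive, send)

    return wrapper
```

The slot is released whether the wrapped app succeeds or raises, since `async with` releases on exit either way.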

Queue Mechanics

Bounded Queue:

  • Maximum depth set by request_queue_max_depth
  • When full, new requests get 503 immediately
  • Slots released when requests complete (success or error)

Unbounded Queue (max_depth=0):

  • No limit on queued requests
  • All requests eventually processed
  • Risk of memory exhaustion

Concurrency Control

Uses asyncio Semaphore for slot management:

  • Semaphore capacity = max_depth
  • acquire() - Try to get a queue slot (non-blocking)
  • release() - Return the slot when done
  • Safe for concurrent requests within a single event loop (asyncio primitives are task-safe, not thread-safe)
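asyncio's Semaphore has no dedicated non-blocking acquire, so the try-lock pattern can be expressed with `locked()`: the check-then-acquire sequence is race-free on a single event loop because `acquire()` does not yield when a permit is free. A sketch (not pounce's API):

```python
import asyncio

async def try_acquire(sem: asyncio.Semaphore) -> bool:
    # locked() is True when no slot is free; otherwise acquire()
    # returns immediately because a permit is guaranteed
    if sem.locked():
        return False
    await sem.acquire()
    return True

async def demo():
    sem = asyncio.Semaphore(1)
    first = await try_acquire(sem)   # slot available
    second = await try_acquire(sem)  # queue full
    sem.release()                    # slot returned on completion
    return first, second

print(asyncio.run(demo()))  # (True, False)
```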

Response Codes

503 Service Unavailable

Requests rejected when queue is full receive:

HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Retry-After: 5

Service Unavailable - Server Overloaded

The Retry-After header tells clients to wait 5 seconds before retrying.

Examples

Web Application

from pounce import run, ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=1000,
    max_connections=100,
    workers=4,
)

run("webapp:app", config=config)

API Server

Combine with rate limiting:

config = ServerConfig(
    # Rate limiting per client
    rate_limit_enabled=True,
    rate_limit_requests_per_second=100.0,
    rate_limit_burst=200,

    # Global request queueing
    request_queue_enabled=True,
    request_queue_max_depth=500,
)

Background Worker

Conservative queue for long-running tasks:

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=100,
    max_connections=10,
    request_timeout=300.0,
)

Best Practices

Choosing Queue Depth

Conservative (predictable load):

  • Queue depth: 100-500
  • Reject requests quickly when overloaded

Moderate (variable load):

  • Queue depth: 500-1000 (default)
  • Buffer moderate traffic spikes

Aggressive (bursty traffic):

  • Queue depth: 1000-5000
  • Handle large traffic bursts

Formula:

queue_depth = peak_rps * acceptable_wait_seconds
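For example, applying the formula to a service peaking at 2,000 req/s whose clients tolerate roughly 250 ms of queueing (the numbers are illustrative):

```python
peak_rps = 2000                  # observed peak request rate
acceptable_wait_seconds = 0.25   # latency budget for time spent queued
queue_depth = int(peak_rps * acceptable_wait_seconds)
print(queue_depth)  # 500
```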

Monitoring

Track queue health with Prometheus metrics:

http_requests_total{status="503"}  # Queue rejections

Client Handling

Parse Retry-After:

import requests
import time

response = requests.get("https://api.example.com/users")
if response.status_code == 503:
    retry_after = int(response.headers.get("Retry-After", 5))
    time.sleep(retry_after)
    response = requests.get("https://api.example.com/users")  # retry once

Exponential Backoff:

def make_request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)

        if response.status_code != 503:
            return response

        retry_after = int(response.headers.get("Retry-After", 5))
        backoff = retry_after * (2 ** attempt)
        time.sleep(min(backoff, 60))

    raise Exception("Server overloaded after retries")

Capacity Planning

Use queue metrics for scaling decisions:

Under-provisioned (scale up):

  • High rejection rate (>5%)
  • High average wait time (>500ms)
  • Queue frequently full

Right-sized:

  • Rejection rate: 0.1-1%
  • Average wait: 50-200ms
  • Queue absorbs spikes
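These thresholds can be folded into a small helper for alerting. This is an illustrative sketch, not part of pounce; the `scaling_signal` name and the in-between "monitor" bucket are assumptions:

```python
def scaling_signal(rejected: int, total: int, avg_wait_ms: float) -> str:
    """Classify queue health using the thresholds above."""
    rejection_rate = rejected / total if total else 0.0
    if rejection_rate > 0.05 or avg_wait_ms > 500:
        return "scale up"       # under-provisioned
    if rejection_rate <= 0.01 and avg_wait_ms <= 200:
        return "right-sized"
    return "monitor"            # in between: keep watching

print(scaling_signal(60, 1000, 600))  # scale up
print(scaling_signal(5, 1000, 120))   # right-sized
```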

Advanced Usage

Per-Route Queues

Different queue depths per route:

from pounce._request_queue import RequestQueue, create_queue_wrapper

# A small queue for expensive routes, a larger one for everything else
expensive_queue = RequestQueue(max_depth=100)
cheap_queue = RequestQueue(max_depth=1000)

async def route_aware_queueing(scope, receive, send):
    # `app` is your ASGI application
    if scope["path"].startswith("/api/expensive"):
        wrapper = create_queue_wrapper(app, expensive_queue)
    else:
        wrapper = create_queue_wrapper(app, cheap_queue)

    await wrapper(scope, receive, send)

Performance Impact

Request queueing adds minimal overhead:

  • ~1-5 us per request - Queue acquire/release
  • Task-safe - asyncio Semaphore on a single event loop
  • Memory efficient - ~50 bytes per queued request
  • No blocking - Async acquire (try-lock pattern)

Troubleshooting

High Rejection Rate

If many requests get 503:

  1. Increase queue depth - Allow more buffering
  2. Scale workers - Add more processing capacity
  3. Optimize app - Reduce request processing time
  4. Add replicas - Horizontal scaling

No Load Shedding

If 503s aren't being sent:

  1. Check config - Ensure request_queue_enabled=True
  2. Verify integration - Check logs for "Request queueing enabled"
  3. Test capacity - Send more requests than max_connections

See Also