# Request Queueing

URL: /docs/deployment/request-queueing/
Section: deployment
Tags: deployment, queueing, load-shedding, backpressure

Application-level request queueing with bounded capacity for graceful overload handling.

## Overview

Request queueing provides graceful degradation under load:

- Buffer requests when workers are busy
- Shed load when the queue fills up (503 responses)
- Monitor queue depth and wait times
- Prevent resource exhaustion during traffic spikes

## When to Use

Request queueing is ideal for:

- **Bursty traffic** - handle temporary traffic spikes
- **Background processing** - queue requests during high load
- **Graceful degradation** - return 503 instead of timing out
- **Capacity planning** - monitor the queue to identify scaling needs

## Difference from Rate Limiting

| Feature  | Rate Limiting            | Request Queueing              |
|----------|--------------------------|-------------------------------|
| Purpose  | Prevent per-client abuse | Handle global server overload |
| Scope    | Per client IP            | Global (all clients)          |
| Response | 429 (rate limited)       | 503 (overloaded)              |
| When     | Client exceeds its limit | Server at capacity            |

Use both together for comprehensive protection:

- Rate limiting stops abusive clients
- Request queueing absorbs legitimate traffic spikes

## Quick Start

### Basic Configuration

Enable request queueing with default settings (queue up to 1000 requests):

```python
from pounce import ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
)
```

### Custom Queue Depth

```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=500,  # Queue up to 500 requests
)
```

### Unlimited Queue

```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=0,  # Unlimited
)
```

**Warning:** Unlimited queues can lead to memory exhaustion under sustained overload.

## Configuration Options

| Parameter                 | Type | Default | Description                             |
|---------------------------|------|---------|-----------------------------------------|
| `request_queue_enabled`   | bool | `False` | Enable request queueing                 |
| `request_queue_max_depth` | int  | `1000`  | Maximum queued requests (0 = unlimited) |

## How It Works

### Request Flow

1. Request arrives -> check queue capacity
2. If capacity is available -> acquire a queue slot -> process the request -> release the slot
3. If the queue is full -> return 503 Service Unavailable

### Queue Mechanics

**Bounded queue:**

- Maximum depth is set by `request_queue_max_depth`
- When full, new requests get a 503 immediately
- Slots are released when requests complete (success or error)

**Unbounded queue (`max_depth=0`):**

- No limit on queued requests
- All requests are eventually processed
- Risk of memory exhaustion

### Concurrency Control

Uses an asyncio `Semaphore` for slot management:

- Semaphore capacity = `max_depth`
- `acquire()` - try to get a queue slot (non-blocking)
- `release()` - return the slot when done
- Safe for concurrent requests within a single event loop

## Response Codes

### 503 Service Unavailable

Requests rejected when the queue is full receive:

```
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Retry-After: 5

Service Unavailable - Server Overloaded
```

The `Retry-After` header tells clients to wait 5 seconds before retrying.

## Examples

### Web Application

```python
from pounce import run, ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=1000,
    max_connections=100,
    workers=4,
)

run("webapp:app", config=config)
```

### API Server

Combine with rate limiting:

```python
config = ServerConfig(
    # Rate limiting per client
    rate_limit_enabled=True,
    rate_limit_requests_per_second=100.0,
    rate_limit_burst=200,
    # Global request queueing
    request_queue_enabled=True,
    request_queue_max_depth=500,
)
```

### Background Worker

Conservative queue for long-running tasks:

```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=100,
    max_connections=10,
    request_timeout=300.0,
)
```

## Best Practices

### Choosing Queue Depth

**Conservative (predictable load):**

- Queue depth: 100-500
- Reject requests quickly when overloaded

**Moderate (variable load):**

- Queue depth: 500-1000 (default)
- Buffer moderate traffic spikes

**Aggressive (bursty traffic):**

- Queue depth: 1000-5000
- Handle large traffic bursts

**Formula:** `queue_depth = peak_rps * acceptable_wait_seconds`. For example, at a peak of 2000 requests/second with 0.25 s of acceptable queueing delay, `queue_depth = 2000 * 0.25 = 500`.

### Monitoring

Track queue health with Prometheus metrics:
```
http_requests_total{status="503"}  # Queue rejections
```

### Client Handling

Parse `Retry-After`:

```python
import time

import requests

response = requests.get("https://api.example.com/users")
if response.status_code == 503:
    retry_after = int(response.headers.get("Retry-After", 5))
    time.sleep(retry_after)
```

Exponential backoff:

```python
def make_request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 503:
            return response
        retry_after = int(response.headers.get("Retry-After", 5))
        backoff = retry_after * (2 ** attempt)
        time.sleep(min(backoff, 60))
    raise Exception("Server overloaded after retries")
```

### Capacity Planning

Use queue metrics for scaling decisions.

**Under-provisioned (scale up):**

- High rejection rate (>5%)
- High average wait time (>500 ms)
- Queue frequently full

**Right-sized:**

- Rejection rate: 0.1-1%
- Average wait: 50-200 ms
- Queue absorbs spikes

## Advanced Usage

### Per-Route Queues

Different queue depths per route:

```python
from pounce._request_queue import RequestQueue, create_queue_wrapper

expensive_queue = RequestQueue(max_depth=100)
cheap_queue = RequestQueue(max_depth=1000)


async def route_aware_queueing(scope, receive, send):
    if scope["path"].startswith("/api/expensive"):
        wrapper = create_queue_wrapper(app, expensive_queue)
    else:
        wrapper = create_queue_wrapper(app, cheap_queue)
    await wrapper(scope, receive, send)
```

## Performance Impact

Request queueing adds minimal overhead:

- **~1-5 µs per request** - queue acquire/release
- **Concurrency-safe** - asyncio `Semaphore` within the event loop
- **Memory efficient** - ~50 bytes per queued request
- **No blocking** - async acquire (try-lock pattern)

## Troubleshooting

### High Rejection Rate

If many requests get 503:

1. **Increase queue depth** - allow more buffering
2. **Scale workers** - add more processing capacity
3. **Optimize the app** - reduce request processing time
4. **Add replicas** - horizontal scaling

### No Load Shedding

If 503s aren't being sent:

1. **Check config** - ensure `request_queue_enabled=True`
2. **Verify integration** - check logs for "Request queueing enabled"
3. **Test capacity** - send more requests than `max_connections`

## See Also

- Rate Limiting - per-client abuse prevention
- Observability - monitor queue metrics
- Graceful Shutdown - handle queued requests during shutdown
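For reference, the try-acquire flow described under How It Works can be sketched as a standalone ASGI middleware. This is a minimal illustration, not pounce's implementation: `SimpleRequestQueue` and `queue_middleware` are hypothetical names, and a plain counter stands in for the semaphore since all access happens on a single event loop.

```python
class SimpleRequestQueue:
    """Bounded slot pool illustrating the try-acquire pattern.

    Hypothetical stand-in for pounce's internal queue; a plain counter
    suffices here because all access happens on one event loop.
    """

    def __init__(self, max_depth: int = 1000):
        self.max_depth = max_depth  # 0 = unlimited
        self.depth = 0

    def try_acquire(self) -> bool:
        # Non-blocking: fail fast instead of waiting for a slot
        if self.max_depth and self.depth >= self.max_depth:
            return False
        self.depth += 1
        return True

    def release(self) -> None:
        self.depth -= 1


def queue_middleware(app, queue: SimpleRequestQueue):
    """Wrap an ASGI app; shed load with 503 + Retry-After when full."""

    async def wrapped(scope, receive, send):
        if scope["type"] != "http":
            await app(scope, receive, send)
            return
        if not queue.try_acquire():
            # Queue full: reject immediately, as described in Request Flow
            await send({
                "type": "http.response.start",
                "status": 503,
                "headers": [(b"content-type", b"text/plain"),
                            (b"retry-after", b"5")],
            })
            await send({
                "type": "http.response.body",
                "body": b"Service Unavailable - Server Overloaded",
            })
            return
        try:
            await app(scope, receive, send)
        finally:
            queue.release()  # slot returned on success or error

    return wrapped
```

Because the acquire is non-blocking, an overloaded server answers immediately instead of holding connections open, and the `finally` block guarantees the slot is returned whether the request succeeds or raises.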