Application-level request queueing with bounded capacity for graceful overload handling.
## Overview
Request queueing provides graceful degradation under load:
- Buffer requests when workers are busy
- Shed load when queue fills up (503 responses)
- Monitor queue depth and wait times
- Prevent resource exhaustion during traffic spikes
## When to Use
Request queueing is ideal for:
- Bursty traffic - Handle temporary traffic spikes
- Background processing - Queue requests during high load
- Graceful degradation - Return 503 instead of timing out
- Capacity planning - Monitor queue to identify scaling needs
## Difference from Rate Limiting
| Feature | Rate Limiting | Request Queueing |
|---|---|---|
| Purpose | Prevent per-client abuse | Handle global server overload |
| Scope | Per client IP | Global (all clients) |
| Response | 429 (rate limited) | 503 (overloaded) |
| When | Client exceeds limit | Server at capacity |
Use both together for comprehensive protection:
- Rate limiting stops abusive clients
- Request queueing handles legitimate traffic spikes
## Quick Start

### Basic Configuration
Enable request queueing with default settings (queue up to 1000 requests):
```python
from pounce import ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
)
```
### Custom Queue Depth
```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=500,  # Queue up to 500 requests
)
```
### Unlimited Queue
```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=0,  # Unlimited
)
```
Warning: Unlimited queues can lead to memory exhaustion under sustained overload!
## Configuration Options

| Parameter | Type | Default | Description |
|---|---|---|---|
| `request_queue_enabled` | bool | `False` | Enable request queueing |
| `request_queue_max_depth` | int | `1000` | Maximum queued requests (0 = unlimited) |
## How It Works

### Request Flow
1. Request arrives -> Check queue capacity
2. If capacity available -> Acquire queue slot -> Process request -> Release slot
3. If queue full -> Return 503 Service Unavailable
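The flow above can be sketched as a tiny bounded queue. This is a simplified illustration, not pounce's actual implementation; the class and function names are hypothetical:

```python
import asyncio

class BoundedQueue:
    """Minimal sketch of the bounded-queue mechanics described above."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth  # 0 = unlimited
        self.depth = 0

    def try_acquire(self) -> bool:
        # Steps 1-2: non-blocking capacity check; take a slot if one is free.
        if self.max_depth and self.depth >= self.max_depth:
            return False
        self.depth += 1
        return True

    def release(self) -> None:
        # The slot is returned whether the request succeeded or failed.
        self.depth -= 1

async def handle(queue: BoundedQueue, process):
    if not queue.try_acquire():
        return 503  # Step 3: shed load when the queue is full
    try:
        return await process()
    finally:
        queue.release()
```

The `try`/`finally` is what guarantees slots are released on both success and error paths.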
### Queue Mechanics

Bounded Queue:
- Maximum depth set by `request_queue_max_depth`
- When full, new requests get 503 immediately
- Slots released when requests complete (success or error)

Unbounded Queue (`max_depth=0`):
- No limit on queued requests
- All requests eventually processed
- Risk of memory exhaustion
### Concurrency Control

Uses an asyncio `Semaphore` for slot management:
- Semaphore capacity = `max_depth`
- `acquire()` - try to get a queue slot (non-blocking)
- `release()` - return the slot when done
- Safe for concurrent requests within a single event loop
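The non-blocking try-acquire pattern on an asyncio `Semaphore` looks roughly like this (a sketch of the technique; pounce's internal names may differ):

```python
import asyncio

async def try_acquire(sem: asyncio.Semaphore) -> bool:
    # locked() is True when no slots remain; checking it first means
    # acquire() never suspends, so a full queue is detected immediately.
    if sem.locked():
        return False
    await sem.acquire()  # completes without waiting: a slot is free
    return True

async def demo() -> list:
    sem = asyncio.Semaphore(1)       # capacity = max_depth
    first = await try_acquire(sem)   # takes the only slot
    second = await try_acquire(sem)  # queue full -> rejected
    sem.release()                    # request finished, slot returned
    third = await try_acquire(sem)
    return [first, second, third]
```

Because `acquire()` does not suspend when a slot is free, the check-then-acquire pair is safe among tasks on one event loop; it is not safe across threads.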
## Response Codes

### 503 Service Unavailable

Requests rejected when queue is full receive:

```http
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Retry-After: 5

Service Unavailable - Server Overloaded
```

The `Retry-After` header tells clients to wait 5 seconds before retrying.
## Examples

### Web Application
```python
from pounce import run, ServerConfig

config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=1000,
    max_connections=100,
    workers=4,
)

run("webapp:app", config=config)
```
### API Server
Combine with rate limiting:
```python
config = ServerConfig(
    # Rate limiting per client
    rate_limit_enabled=True,
    rate_limit_requests_per_second=100.0,
    rate_limit_burst=200,
    # Global request queueing
    request_queue_enabled=True,
    request_queue_max_depth=500,
)
```
### Background Worker
Conservative queue for long-running tasks:
```python
config = ServerConfig(
    request_queue_enabled=True,
    request_queue_max_depth=100,
    max_connections=10,
    request_timeout=300.0,
)
```
## Best Practices

### Choosing Queue Depth
Conservative (predictable load):
- Queue depth: 100-500
- Reject requests quickly when overloaded
Moderate (variable load):
- Queue depth: 500-1000 (default)
- Buffer moderate traffic spikes
Aggressive (bursty traffic):
- Queue depth: 1000-5000
- Handle large traffic bursts
Formula:
```
queue_depth = peak_rps * acceptable_wait_seconds
```
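A worked example of the sizing formula above (the traffic numbers are illustrative):

```python
def queue_depth(peak_rps: float, acceptable_wait_seconds: float) -> int:
    # Requests that can accumulate during a spike before a newly queued
    # request would wait longer than the acceptable threshold.
    return int(peak_rps * acceptable_wait_seconds)

# 2000 req/s peak, willing to let clients wait up to 0.5 s:
depth = queue_depth(2000, 0.5)  # 1000
```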
### Monitoring
Track queue health with Prometheus metrics:
```
http_requests_total{status="503"}  # Queue rejections
```
### Client Handling

Parse `Retry-After`:

```python
import requests
import time

response = requests.get("https://api.example.com/users")
if response.status_code == 503:
    retry_after = int(response.headers.get("Retry-After", 5))
    time.sleep(retry_after)
```
Exponential Backoff:

```python
import requests
import time

def make_request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 503:
            return response
        retry_after = int(response.headers.get("Retry-After", 5))
        backoff = retry_after * (2 ** attempt)
        time.sleep(min(backoff, 60))
    raise Exception("Server overloaded after retries")
```
### Capacity Planning
Use queue metrics for scaling decisions:
Under-provisioned (scale up):
- High rejection rate (>5%)
- High average wait time (>500ms)
- Queue frequently full
Right-sized:
- Rejection rate: 0.1-1%
- Average wait: 50-200ms
- Queue absorbs spikes
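The thresholds above can be folded into a simple helper for dashboards or autoscaling hooks. This function and its name are hypothetical, not part of pounce:

```python
def scaling_signal(rejection_rate: float, avg_wait_ms: float) -> str:
    """Map queue-health metrics to a scaling decision.

    rejection_rate is a fraction (0.05 == 5%); the cutoffs follow the
    guidance above.
    """
    if rejection_rate > 0.05 or avg_wait_ms > 500:
        return "scale up"       # under-provisioned
    if rejection_rate <= 0.01 and avg_wait_ms <= 200:
        return "right-sized"
    return "watch"              # in between: keep monitoring
```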
## Advanced Usage

### Per-Route Queues
Different queue depths per route:
```python
from pounce._request_queue import RequestQueue, create_queue_wrapper

expensive_queue = RequestQueue(max_depth=100)
cheap_queue = RequestQueue(max_depth=1000)

async def route_aware_queueing(scope, receive, send):
    if scope["path"].startswith("/api/expensive"):
        wrapper = create_queue_wrapper(app, expensive_queue)
    else:
        wrapper = create_queue_wrapper(app, cheap_queue)
    await wrapper(scope, receive, send)
```
## Performance Impact
Request queueing adds minimal overhead:
- ~1-5 µs per request - Queue acquire/release
- Event-loop safe - asyncio `Semaphore` (safe among tasks on one loop, not across threads)
- Memory efficient - ~50 bytes per queued request
- No blocking - Async acquire (try-lock pattern)
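Using the ~50 bytes/request figure above, queue bookkeeping stays small even at aggressive depths. A rough estimate (real per-request memory is dominated by the application's own buffers):

```python
BYTES_PER_QUEUED_REQUEST = 50  # approximate bookkeeping cost from above

def queue_memory_kib(max_depth: int) -> float:
    # Worst-case bookkeeping memory when the queue is completely full.
    return max_depth * BYTES_PER_QUEUED_REQUEST / 1024

print(round(queue_memory_kib(5000), 1))  # 244.1 KiB for a depth-5000 queue
```

This is why the unlimited-queue warning is about queued request *payloads* and connection state accumulating, not the slot accounting itself.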
## Troubleshooting

### High Rejection Rate
If many requests get 503:
- Increase queue depth - Allow more buffering
- Scale workers - Add more processing capacity
- Optimize app - Reduce request processing time
- Add replicas - Horizontal scaling
### No Load Shedding
If 503s aren't being sent:
- Check config - Ensure `request_queue_enabled=True`
- Verify integration - Check logs for "Request queueing enabled"
- Test capacity - Send more requests than `max_connections`
## See Also
- Rate Limiting — Per-client abuse prevention
- Observability — Monitor queue metrics
- Graceful Shutdown — Handle queued requests during shutdown