Performance

The Fast Path

Pounce's sync workers use a built-in HTTP/1.1 parser optimized for the common request-head path. Local benchmark snapshots have measured it at about 3 µs per request, versus about 22 µs for h11 on the same parser microbenchmark, but public performance claims should include the command, hardware, Python build, workload, and variance. This isn't a C extension; it's pure Python using direct bytes.find() operations on a memoryview buffer.

The fast parser has explicit tests for:

Method token validation
Header size limit (16 KB, matching nginx default)
Null byte and control character injection detection
Duplicate Content-Length rejection (request smuggling vector)
Content-Length + Transfer-Encoding conflict detection (RFC 7230 section 3.3.3)
Negative or non-numeric Content-Length rejection

On free-threaded Python, sync workers handle simple request/response at full thread parallelism. When a response requires streaming or WebSocket, the worker hands off to a dedicated async pool — asyncio overhead is only paid when needed.

Streaming-First Design

The dominant response patterns of modern web applications — chunked HTML, server-sent events, AI token delivery — are all streaming. Pounce's response pipeline is designed around this reality:

No buffering: Response body chunks flow from send() directly to the socket
Per-chunk compression: Zstd and gzip compressors operate in streaming mode
Immediate delivery: Each chunk is written to the wire as soon as it's ready

This means time-to-first-byte (TTFB) is determined by your application, not by server buffering.
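
As an illustration of per-chunk compression, the sketch below compresses a stream of chunks with the standard-library zlib module and flushes after each one, so every chunk is decodable as soon as it is sent. The function name gzip_stream is hypothetical and this is not Pounce's pipeline; gzip is used here only because zlib is available on every Python version.

```python
import zlib
from collections.abc import Iterable, Iterator


def gzip_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Compress each chunk as it arrives and yield the output immediately."""
    # wbits=31 selects the gzip container (16 + a 15-bit window).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        # Z_SYNC_FLUSH forces out everything buffered so far, so the client
        # can decode this chunk without waiting for the next one.
        out = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if out:
            yield out
    # The final flush writes the gzip trailer (CRC32 and length).
    yield compressor.flush()
```

Flushing per chunk costs some compression ratio compared with compressing the whole body at once; that is the trade made in exchange for low time-to-first-byte.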

Memory Model

The shared-memory architecture provides a fundamental advantage over fork-based servers:

Workers	Pounce (threads)	Fork-based (processes)
1	1x app memory	1x app memory
4	~1x app memory	~4x app memory
8	~1x app memory	~8x app memory

On Python 3.14t, workers share the same interpreter, application object, and frozen server configuration. Pounce keeps request-local mutable state separate from shared server-owned objects.

The Chirp-shaped sustained GitHub Actions snapshot measured 196.7 MiB aggregate median peak RSS for four Pounce process workers and 108.0 MiB for four Pounce thread workers, about 45% less in this run. Both modes completed the fixed 1,000 req/s schedule with zero errors. Those figures are three-sample Ubuntu CI results, not a general memory ratio; application state, allocator behavior, and workload shape can change the comparison.

Compression

Pounce negotiates content-encoding automatically via Accept-Encoding:

Encoding	Library	Priority	Notes
zstd	`compression.zstd` (stdlib)	Highest	PEP 784, zero-dependency
gzip	`zlib` (stdlib)	Medium	Universal browser support
identity	—	Fallback	No compression

Zstd provides better compression ratios than gzip at lower CPU cost — and in Python 3.14, it's in the standard library.

Compression is skipped for:

Responses smaller than compression_min_size (default: 500 bytes)
Already-compressed content types (images, video, archives)
WebSocket frames
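
The negotiation and skip rules above can be sketched as a single decision function. This is an illustrative approximation, not Pounce's code: choose_encoding, parse_accept_encoding, and the content-type lists are hypothetical, and the sketch assumes the server picks its highest-priority coding that the client has not excluded with q=0.

```python
SERVER_PREFERENCE = ("zstd", "gzip")  # priority order from the table above
# Illustrative lists only; the real set of skipped types is an assumption.
COMPRESSED_PREFIXES = ("image/", "video/")
COMPRESSED_TYPES = ("application/zip", "application/gzip")


def parse_accept_encoding(header: str) -> dict[str, float]:
    """Map each offered coding to its q-value (1.0 when omitted)."""
    offers: dict[str, float] = {}
    for item in header.split(","):
        coding, _, params = item.strip().partition(";")
        coding = coding.strip().lower()
        if not coding:
            continue
        q = 1.0
        name, _, value = params.strip().partition("=")
        if name.strip().lower() == "q":
            try:
                q = float(value)
            except ValueError:
                q = 0.0  # treat an unparseable weight as "not acceptable"
        offers[coding] = q
    return offers


def choose_encoding(accept_encoding: str, content_type: str, body_size: int,
                    compression_min_size: int = 500) -> str:
    """Return "zstd", "gzip", or "identity" for one response."""
    if body_size < compression_min_size:
        return "identity"
    if content_type.startswith(COMPRESSED_PREFIXES) or content_type in COMPRESSED_TYPES:
        return "identity"
    offers = parse_accept_encoding(accept_encoding)
    for coding in SERVER_PREFERENCE:
        # Fall back to the "*" wildcard when the coding is not listed.
        if offers.get(coding, offers.get("*", 0.0)) > 0:
            return coding
    return "identity"
```

For example, choose_encoding("gzip, deflate, br", "text/html", 2000) selects gzip because the client did not offer zstd, while the same call with a 100-byte body returns identity.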

Server-Timing

When server_timing=True, Pounce injects a Server-Timing header into every response:

Server-Timing: parse;dur=0.12, app;dur=4.56, encode;dur=0.34

This appears directly in browser DevTools (Network tab → Timing), enabling zero-config latency profiling.
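
For readers who want to see how such a header value is assembled, here is a small sketch that times named phases and renders them in the format shown above. The helpers timed and format_server_timing are hypothetical and do not reflect Pounce's internals; durations are in milliseconds, as the Server-Timing dur parameter expects.

```python
import time


def format_server_timing(phases: dict[str, float]) -> str:
    """Render phase durations (milliseconds) as a Server-Timing header value."""
    return ", ".join(f"{name};dur={ms:.2f}" for name, ms in phases.items())


def timed(phases: dict[str, float], name: str, func, *args):
    """Run func(*args) and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    try:
        return func(*args)
    finally:
        phases[name] = (time.perf_counter() - start) * 1000.0


phases: dict[str, float] = {}
total = timed(phases, "app", sum, [1, 2, 3])  # stand-in for the application call
header_value = format_server_timing(phases)   # e.g. "app;dur=..." (value varies)
```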

Connection Handling

Backpressure — Per-worker connection limits prevent overload
Keep-alive — Configurable timeout (default: 5s) to reuse TCP connections
SO_REUSEPORT — Kernel-level load balancing across workers
Graceful shutdown — In-flight requests complete before workers exit
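
As a rough sketch of the SO_REUSEPORT item, each worker can open its own listening socket on the same address when the option is set before bind(). The function make_listener is hypothetical, not a Pounce API. The option is guarded with hasattr because the constant is not defined on every platform, and its presence alone does not guarantee that the kernel balances connections across listeners.

```python
import socket


def make_listener(host: str, port: int, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket, enabling SO_REUSEPORT where available."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        # Must be set before bind() on every socket that shares the port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock
```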

HTTP Parsing

Pounce uses two HTTP/1.1 parser paths: _fast_h1 for sync workers and h11 for async workers. Both are pure Python and designed for free-threading use. The fast parser handles the common request-head path; chunked body decoding, obs-fold, and trailer headers fall through to h11 or the async pool.

CPU Affinity (Linux)

On Linux, you can pin each worker to a dedicated CPU core with --cpu-affinity. This reduces cache thrashing and can improve throughput on multi-core systems:

pounce serve --app myapp:app --workers 8 --cpu-affinity

This is a no-op on non-Linux platforms or when sched_setaffinity fails (e.g., restricted cpusets in containers).
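
The no-op behavior can be sketched with the standard-library os.sched_setaffinity call. The function pin_worker is hypothetical, and the choice to wrap around when there are more workers than allowed CPUs is an assumption of this sketch, not documented Pounce behavior.

```python
import os


def pin_worker(worker_index: int) -> bool:
    """Pin the caller to one CPU; return False when pinning is a no-op."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not available on this platform
    try:
        # Only choose among CPUs the process is actually allowed to use,
        # which matters inside containers with restricted cpusets.
        allowed = sorted(os.sched_getaffinity(0))
        cpu = allowed[worker_index % len(allowed)]
        # pid 0 targets the caller (on Linux, the calling thread).
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return False
    return True
```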

AcceptDistributor

On macOS and Windows, where kernel-level SO_REUSEPORT load balancing is unavailable, multi-worker servers suffer from the thundering herd problem: all workers wake on every new connection, and only one wins.

Pounce solves this with the AcceptDistributor, a single thread that calls accept() and distributes connections via per-worker queues. Distribution is fair, and workers no longer contend for accept().

This activates automatically when running multi-worker thread mode with a shared socket. On Linux with SO_REUSEPORT, the kernel handles distribution natively.
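
The single-acceptor pattern described above can be sketched in a few lines. This is not Pounce's AcceptDistributor: run_distributor is a hypothetical name, and round-robin is assumed here as the simplest fair policy.

```python
import queue
import socket
import threading


def run_distributor(listener: socket.socket,
                    worker_queues: list[queue.Queue],
                    stop: threading.Event) -> None:
    """Accept on one thread and hand connections to workers round-robin."""
    listener.settimeout(0.1)  # wake periodically to check the stop flag
    next_worker = 0
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except TimeoutError:
            continue  # no pending connection; re-check the stop flag
        except OSError:
            break  # listener was closed
        worker_queues[next_worker].put(conn)
        next_worker = (next_worker + 1) % len(worker_queues)
```

Each worker then blocks on its own queue's get() instead of on the shared socket, so a new connection wakes exactly one worker.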

Benchmarking

Pounce includes a built-in benchmark command:

# Run standard benchmarks (hello, json, body echo)
pounce bench --workers 4 --duration 10

# Compare against uvicorn
pounce bench --workers 4 --compare

The command reports throughput (req/s), latency percentiles (p50, p95, p99), error rates, and RSS memory usage.

For PR or release evidence, use the repository benchmark runner and write an artifact metadata file:

python benchmarks/run_benchmark.py --workload chirp --repeat 5 --artifact-output artifacts/chirp.json

For sustained tail-latency evidence, use the built-in fixed-rate driver. It includes scheduled queue delay in latency (avoiding coordinated omission) and reports p50, p99, and p999 alongside the existing RSS/CPU time series:

python benchmarks/run_benchmark.py --workload chirp --duration 120 --repeat 3 \
  --workers 4 --connections 4 --rate 1000 \
  --servers pounce,uvicorn,hypercorn,granian \
  --artifact-output benchmarks/artifacts/<date>/chirp-sustained.json
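
To show what "includes scheduled queue delay in latency" means, here is a simplified single-connection sketch of a fixed-rate loop. It is not the repository's driver: run_fixed_rate, percentile, and send_request are hypothetical names, and unlike the driver described in this section the sketch never drops late requests. The key point is that latency is measured from the time a request was scheduled, not from the time it was actually sent.

```python
import math
import time


def run_fixed_rate(send_request, rate: float, duration: float) -> list[float]:
    """Issue requests on a fixed schedule; return latencies in seconds.

    Latency is measured from the scheduled send time, so time spent
    waiting behind a slow earlier request counts against the server.
    """
    interval = 1.0 / rate
    total = int(rate * duration)
    start = time.perf_counter()
    latencies = []
    for i in range(total):
        scheduled = start + i * interval
        delay = scheduled - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        send_request()  # hypothetical blocking call that performs one request
        latencies.append(time.perf_counter() - scheduled)
    return latencies


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

A load generator that instead starts its timer when each request is sent would report a single slow response as one outlier, hiding the delay it imposed on every request queued behind it.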

The weekly/manual/release benchmark workflow runs this shape on Python 3.14 (Pounce process workers) and Python 3.14t (Pounce thread workers). Published GitHub releases receive the schema-validated JSON artifacts as release assets. Four persistent connections match the four workers so the free-threaded sync lane measures active request handling rather than queued connections waiting for a connection-owning worker.

The current sustained run and its full Pounce, uvicorn, Hypercorn, and Granian comparison are linked from the repository benchmark notes. It is a fixed-rate tail-latency snapshot rather than a maximum-throughput ranking, and the notes retain scheduler drops rather than hiding comparisons that miss the target.

An artifact file records the command, server command, git SHA, Python/GIL mode, OS/hardware, workload, worker count, duration, connections, load tool, samples, grouped variance, best-effort server RSS, raw output, and summary. A single artifact sample is not enough for a regression claim; use repeated samples before promoting a number in README, site docs, or release notes.