Lessons from building Pounce
Python 3.14t removes the Global Interpreter Lock, enabling true parallelism across threads sharing a single interpreter. This document distills the architectural patterns Pounce uses to exploit free-threading safely and efficiently. These patterns are not HTTP-specific -- they apply to any concurrent Python infrastructure: task schedulers, message brokers, data pipelines, game servers.
Each pattern targets one goal: eliminate shared mutable state, or make the remaining shared state trivially correct.
1. Frozen Configuration as Lock Elimination
The pattern. Declare all configuration as a frozen, slotted dataclass. Validate exhaustively at construction time. Share the single instance across every worker thread by reference.
@dataclass(frozen=True, slots=True, kw_only=True)
class ServerConfig:
    host: str = "127.0.0.1"
    port: int = 8000
    workers: int = 1
    keep_alive_timeout: float = 5.0
    max_request_size: int = 1_048_576
    compression: bool = True
    # ... 60+ fields
Why it works. frozen=True makes every attribute read-only after __init__.
Multiple threads reading the same frozen object require zero synchronization --
there is no write to race against. slots=True eliminates __dict__, preventing
accidental monkey-patching at runtime. kw_only=True forces explicit construction,
catching misconfiguration at boot rather than under load.
In Pounce. ServerConfig carries 60+ fields with 93 validations at boot.
Every worker thread holds a reference to the same object. No per-access locking,
no defensive copies, no stale-config bugs.
Anti-pattern. A mutable config dict protected by a lock on every read. Under
free-threading, a lock-per-read on the hot path (every request checks config.keep_alive_timeout)
introduces contention that grows with core count.
Generalization. Any state that is read frequently and written never (or only at startup) belongs in a frozen dataclass. Feature flags, route tables, TLS contexts, database connection parameters -- freeze them at boot.
2. Immutable Events as Thread-Safe Communication
The pattern. Model every observable side effect as a frozen dataclass with a nanosecond monotonic timestamp. Events cross thread boundaries without copying, serialization, or locking.
@dataclass(frozen=True, slots=True, kw_only=True)
class ConnectionOpened:
    connection_id: int
    worker_id: int
    client_addr: str
    client_port: int
    server_addr: str
    server_port: int
    protocol: str  # "h1", "h2", "websocket"
    timestamp_ns: int

@dataclass(frozen=True, slots=True, kw_only=True)
class ResponseCompleted:
    connection_id: int
    worker_id: int
    status: int
    bytes_sent: int
    duration_ms: float
    timestamp_ns: int
Why it works. A frozen dataclass is immutable after creation. The producing
thread creates it; any number of consuming threads can read it concurrently.
time.monotonic_ns() provides a high-resolution, monotonic clock that is
immune to NTP adjustments, giving events a consistent ordering across the process.
In Pounce. Five event types (ConnectionOpened, RequestStarted,
ResponseCompleted, ClientDisconnected, ConnectionCompleted) flow from
worker threads to a LifecycleCollector. The BufferedCollector accumulates
events under a single lock at the collector boundary -- the events themselves
need no protection.
Generalization. Any observer, event bus, or audit log pattern becomes trivially thread-safe when events are frozen value objects. This applies to domain event sourcing, distributed tracing spans, and metrics collection.
3. Sans-I/O Protocol Design
The pattern. Protocol handlers are pure state machines. They consume bytes
and produce typed events plus bytes to send. No socket access, no asyncio
import, no I/O of any kind.
@runtime_checkable
class ProtocolHandler(Protocol):
    def receive_data(self, data: bytes) -> list[ProtocolEvent]: ...
    def send_response(self, status: int, headers: list[tuple[bytes, bytes]]) -> bytes: ...
    def send_body(self, data: bytes, *, more: bool) -> bytes: ...
    def start_new_cycle(self) -> None: ...
The worker feeds raw bytes in, reads parsed events and serialized bytes out:
events = handler.receive_data(raw_bytes)
for event in events:
    match event:
        case RequestReceived(method=method, target=target, headers=headers):
            response_bytes = handler.send_response(200, response_headers)
            body_bytes = handler.send_body(body, more=False)
Why it works. Each worker thread creates its own protocol handler instance.
No shared state means no contention. The handler is a pure function of its
accumulated input -- deterministic, reproducible, and testable with plain
pytest (no event loop, no mock sockets).
In Pounce. H1, H2, and WebSocket each implement sans-I/O handlers. The
same handler works under both the sync worker (blocking recv/send) and
the async worker (asyncio streams). The worker is the I/O adapter; the
protocol is the logic.
Generalization. Any parser or encoder benefits from this separation: database wire protocols, message queue framing, serialization codecs. The sans-I/O pattern is especially powerful under free-threading because it guarantees thread isolation by construction rather than by discipline.
4. Queue-Based Thread Handoff
The pattern. When work must transfer between specialized threads, use a
typed queue.Queue with frozen or slotted handoff objects. No shared mutable
state; just message passing.
@dataclass(slots=True)
class StreamingHandoff:
    conn: socket.socket
    scope: dict[str, Any]
    body: bytes
    request_id: str | None

@dataclass(slots=True)
class WebSocketHandoff:
    conn: socket.socket
    request: RequestReceived
    client: tuple[str, int]
    server: tuple[str, int]
    scope: dict[str, Any]
type HandoffRequest = StreamingHandoff | WebSocketHandoff
# In SyncWorker: enqueue handoff
handoff_queue.put(StreamingHandoff(conn=conn, scope=scope, body=body, request_id=rid))
# In AsyncPool: dequeue and continue
handoff = handoff_queue.get(timeout=0.1)
Why it works. queue.Queue is internally synchronized. The handoff object
captures everything the receiving thread needs -- no back-references to the
sender's state. Ownership transfers cleanly: the sync worker stops touching
the socket after enqueuing.
In Pounce. SyncWorkers handle fast request-response cycles in tight
blocking loops. When an ASGI app returns a streaming response or WebSocket
upgrade, the SyncWorker hands the live socket to the AsyncPool via a typed
handoff. The AsyncPool wraps it in asyncio streams and continues the ASGI
lifecycle. Two different execution models cooperate without sharing mutable
state.
Generalization. Producer-consumer pipelines, work stealing, and staged-event-driven architectures all reduce to typed queue handoffs. Under free-threading, this is faster than shared-state-with-locks because the queue lock is held only for the enqueue/dequeue, not for the entire processing duration.
5. Accept Distributor (Thundering Herd Fix)
The pattern. A single dedicated thread calls accept() on the listening
socket and enqueues connections into a shared Queue. Worker threads pull
from the queue -- the first idle worker wins.
class AcceptDistributor:
    def __init__(self, sock, conn_queue, *, shutdown_event=None, ssl_context=None):
        self._sock = sock
        self._conn_queue = conn_queue
        self._ext_shutdown = shutdown_event
        self._ssl_context = ssl_context

    def run(self):
        self._sock.settimeout(0.25)
        while not (self._ext_shutdown and self._ext_shutdown.is_set()):
            try:
                conn, addr = self._sock.accept()
            except TimeoutError:
                continue
            conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            self._conn_queue.put((conn, addr))
Why it works. Without SO_REUSEPORT-style load balancing (effectively
Linux-specific; macOS and Windows lack it), multiple threads blocking
on accept() on the same fd cause a thundering herd -- the kernel wakes
all threads, but only one gets the connection. A single accept thread
eliminates this entirely. The Queue provides natural load balancing:
idle workers dequeue first.
In Pounce. The supervisor detects whether workers share the same socket
(no SO_REUSEPORT) and starts an AcceptDistributor thread automatically.
On Linux with SO_REUSEPORT, each worker gets its own socket and accepts
directly.
Generalization. Any multi-consumer socket pattern on platforms without kernel-level load balancing benefits from this. It also applies to file descriptor distribution in database connection pools and task queue brokers.
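The wiring can be demonstrated end to end on localhost. This is a simplified sketch of the same idea as AcceptDistributor above (inline functions instead of a class, one worker instead of a pool):

```python
import queue
import socket
import threading

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # ephemeral port
listener.listen()
port = listener.getsockname()[1]

conn_queue: queue.Queue = queue.Queue()
shutdown = threading.Event()

def accept_loop() -> None:
    # Single accept thread: no thundering herd, ever.
    listener.settimeout(0.1)
    while not shutdown.is_set():
        try:
            conn, addr = listener.accept()
        except TimeoutError:
            continue
        conn_queue.put((conn, addr))

def worker() -> None:
    # First idle worker wins the dequeue.
    conn, _addr = conn_queue.get(timeout=2.0)
    with conn:
        conn.sendall(b"hello")

threading.Thread(target=accept_loop, daemon=True).start()
threading.Thread(target=worker, daemon=True).start()

with socket.create_connection(("127.0.0.1", port), timeout=2.0) as client:
    data = client.recv(5)

shutdown.set()
listener.close()
print(data)  # b'hello'
```

The listening socket has exactly one reader; everything downstream of the queue is ordinary message passing.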
6. The Brotli Principle: C Extensions Are the Enemy
The pattern. Audit every dependency for GIL re-acquisition. A single C extension that takes the GIL under free-threading collapses your parallelism back to serial execution.
# compression.py -- Pounce's encoding priority
# zstd: stdlib (PEP 784), GIL-free on 3.14t
# gzip: stdlib zlib, GIL-free on 3.14t
# brotli: EXCLUDED -- C extension re-enables GIL
try:
    from compression import zstd as _zstd
    _HAS_ZSTD = True
except ImportError:
    _HAS_ZSTD = False
_ENCODING_PRIORITY: Final[tuple[str, ...]] = _build_encoding_priority()
# Result: ("zstd", "gzip") -- never "br"
Why it matters. On CPython 3.14t, C extensions that have not been updated
for free-threading will re-enable the GIL for the entire process when imported.
This is silent and catastrophic: your sys._is_gil_enabled() check at startup
returns True, and all your threading gains vanish.
In Pounce. Brotli is intentionally excluded despite being the most popular
web compression format. The brotli C extension re-enables the GIL on 3.14t.
Pounce prefers zstd (stdlib in 3.14 via PEP 784) and gzip (stdlib zlib),
both of which are GIL-free.
The audit checklist:
- Run python -X gil=0 -c "import sys, your_dep; print(sys._is_gil_enabled())" for every dependency.
- If it prints True, that dependency re-enables the GIL.
- Find a pure Python alternative, a stdlib replacement, or vendor a GIL-free fork.
- Add a CI check that asserts sys._is_gil_enabled() == False after all imports.
Generalization. This is the most important pattern in this document. A
single careless import can silently negate your entire free-threading
architecture. Treat GIL-reacquiring C extensions as you would a security
vulnerability: audit, detect, and eliminate.
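The per-dependency audit step can be scripted. A sketch, assuming only the stdlib; gil_enabled_after_import is an illustrative helper, not a Pounce API:

```python
import subprocess
import sys

def gil_enabled_after_import(module: str) -> bool:
    """Import a dependency in a child interpreter and report whether
    the GIL ended up enabled there. On builds without the probe
    (pre-3.13), default to True, since the GIL is always on."""
    code = (
        "import sys, importlib; "
        f"importlib.import_module({module!r}); "
        "print(getattr(sys, '_is_gil_enabled', lambda: True)())"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip() == "True"

# 'json' is pure stdlib: on a GIL build this still reports True,
# while on 3.14t it should report False.
result = gil_enabled_after_import("json")
print(result)
```

Running this over every entry in your lockfile, and failing CI on any unexpected True, turns the audit from a one-time check into a regression guard.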
7. Functional State Machine (Elm Architecture)
The pattern. Model lifecycle transitions as an immutable state plus a pure reducer function. Dispatch actions to advance state. Render views from state.
@dataclass(frozen=True, slots=True, kw_only=True)
class ServerModel:
    phase: Phase = Phase.INIT
    effective_workers: int = 0
    mode_label: str = ""
    gil_status: str = ""
    connections: int = 0
    generation: int = 0

def server_reducer(state: ServerModel | None, action: Action) -> ServerModel:
    if state is None:
        state = ServerModel()
    match action.type:
        case "BANNER":
            return replace(state, phase=Phase.STARTUP,
                           effective_workers=action.payload["effective_workers"],
                           mode_label=action.payload["mode_label"],
                           gil_status=action.payload["gil_status"])
        case "READY":
            return replace(state, phase=Phase.READY)
        case "SHUTDOWN_START":
            return replace(state, phase=Phase.SHUTTING_DOWN,
                           connections=action.payload.get("connections", 0))
        case "RELOAD_COMPLETE":
            p = action.payload or {}
            return replace(state, phase=Phase.SERVING,
                           generation=p.get("generation", state.generation))
        case _:
            return state
Why it works. replace() on a frozen dataclass returns a new instance --
the old state is never mutated. The reducer is a pure function: the same
input always produces the same output. This makes lifecycle transitions
deterministic, testable (call the reducer directly in unit tests), and
safe to invoke from any thread.
In Pounce. The server lifecycle flows through Phase.INIT -> STARTUP -> READY -> SERVING -> SHUTTING_DOWN -> STOPPED. Actions like BANNER, READY,
SHUTDOWN_START, and RELOAD_COMPLETE drive transitions. A render middleware
produces branded terminal output on each dispatch.
Generalization. Any workflow or state machine benefits: deployment pipelines, connection pool states, circuit breakers, retry policies. The Elm Architecture makes concurrent state transitions trivially correct because there is no mutable state to corrupt.
8. Per-Request Fresh Instances
The pattern. Create a fresh compressor, parser, or handler for each request. Never pool. Never share.
class Compressor(Protocol):
    def compress(self, data: bytes) -> bytes: ...
    def flush(self) -> bytes: ...

def create_compressor(encoding: str, config: ServerConfig) -> Compressor:
    match encoding:
        case "zstd":
            return ZstdCompressor(level=config.compression_level)
        case "gzip":
            return GzipCompressor(level=config.compression_level)
Each request gets its own compressor:
# In the request handler (per-request, per-thread)
encoding = negotiate_encoding(accept_encoding_header)
compressor = create_compressor(encoding, config) # fresh instance
compressed = compressor.compress(body)
compressed += compressor.flush()
Why it works. With no sharing, there is no contention. Each thread owns its compressor for the lifetime of one request. When the request completes, the compressor is garbage collected. No reset logic, no cleanup bugs, no use-after-return errors.
Anti-pattern. Object pooling with locks. Under free-threading, a pool of reusable compressors protected by a lock creates contention at both checkout and checkin. The lock cost often exceeds the allocation cost, especially for lightweight objects.
When pooling is still justified. Pool only when construction is genuinely expensive (database connections, TLS handshakes) and the object is long-lived. For anything that lives for a single request, fresh allocation wins.
Generalization. JSON encoders, template renderers, serialization buffers, validation contexts -- create fresh, use once, discard. Modern allocators make this cheap. Free-threading makes it necessary.
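The whole pattern fits in a few lines using stdlib zlib directly (GzipCompressor above is Pounce's wrapper; handle_request here is an illustrative name):

```python
import zlib

def handle_request(body: bytes, level: int = 6) -> bytes:
    # A brand-new compressobj per request: no reset logic, no pool,
    # and no thread can ever observe another request's stream state.
    compressor = zlib.compressobj(level)
    return compressor.compress(body) + compressor.flush()

compressed = handle_request(b"hello " * 100)

# Round-trips correctly, and the instance is simply garbage collected.
assert zlib.decompress(compressed) == b"hello " * 100
print(len(compressed) < 600)  # True: 600 input bytes shrink substantially
```

Allocating a compressobj per request is cheap relative to the compression work itself, which is why fresh-per-request beats a locked pool here.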
9. Monotonic ID Generation
The pattern. Use a lock-protected counter for globally unique IDs. Keep the critical section minimal.
_id_counter = 0
_id_lock = threading.Lock()

def next_connection_id() -> int:
    """Globally unique, monotonically increasing connection ID."""
    global _id_counter
    with _id_lock:
        _id_counter += 1
        return _id_counter
Why it works. The lock protects a single integer increment -- the critical section is nanoseconds. This is one of the few places where a lock is the right tool, because the shared state (the counter) genuinely must be mutated by multiple threads and must never produce duplicates.
In Pounce. Every accepted connection gets a unique connection_id from
this generator. The ID appears in lifecycle events, access logs, and error
traces, enabling correlation across threads.
Generalization. Request IDs, trace IDs, sequence numbers for ordered delivery, epoch counters for optimistic concurrency -- any global counter that must be unique across threads follows this pattern.
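The uniqueness guarantee is easy to verify under contention. The same minimal critical section as above, hammered from eight threads:

```python
import threading

_counter = 0
_lock = threading.Lock()

def next_id() -> int:
    global _counter
    with _lock:  # lock held for one integer increment, nothing more
        _counter += 1
        return _counter

ids: list[int] = []
ids_lock = threading.Lock()

def burst() -> None:
    # Generate locally, then append under a separate lock, so the
    # ID lock itself stays uncontended by list bookkeeping.
    local = [next_id() for _ in range(1000)]
    with ids_lock:
        ids.extend(local)

threads = [threading.Thread(target=burst) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 8 threads x 1000 IDs: no duplicates, no gaps.
print(len(ids) == len(set(ids)) == 8000)  # True
```

Note that the return must happen inside the with block: reading the counter after releasing the lock could observe another thread's increment and return a duplicate.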
10. Adaptive Runtime Detection
The pattern. Check the GIL state once at startup. Branch your concurrency strategy based on the result. Never check at runtime per-request.
def is_gil_enabled() -> bool:
    return getattr(sys, "_is_gil_enabled", lambda: True)()

def detect_worker_mode() -> WorkerMode:
    return WorkerMode.PROCESS if is_gil_enabled() else WorkerMode.THREAD
Why it works. A single boolean check at startup selects the entire
concurrency strategy. The same codebase, the same tests, and the same CI
pipeline work on both GIL and free-threaded builds. No #ifdef-style
conditionals, no separate branches, no conditional imports.
In Pounce. The supervisor calls detect_worker_mode() once. On 3.14t
(nogil), it spawns worker threads that share the interpreter. On GIL builds,
it spawns worker processes via multiprocessing. The worker implementation
is identical in both cases -- only the spawning differs.
Key design rule. Use feature detection, not version checking:
# Correct: detect capability
if not is_gil_enabled():
    spawn_threads()

# Wrong: check version (breaks on 3.14 non-t builds)
if sys.version_info >= (3, 14):
    spawn_threads()
Generalization. This pattern applies to any capability that varies across
Python builds: asyncio backends, memory allocators, JIT availability. Detect
the capability, branch once, and run a uniform code path thereafter.
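Put together, the detect-once rule reduces to a module-level constant -- a sketch where WORKER_MODE stands in for whatever strategy object your supervisor selects:

```python
import sys

def is_gil_enabled() -> bool:
    # Capability probe: sys._is_gil_enabled exists on 3.13+; its
    # absence means an older build, where the GIL is always on.
    return getattr(sys, "_is_gil_enabled", lambda: True)()

# Decided once at startup and stored as plain data; the hot path
# reads a constant instead of re-probing on every request.
WORKER_MODE = "thread" if not is_gil_enabled() else "process"

print(WORKER_MODE in ("thread", "process"))  # True
```

On a free-threaded interpreter this selects "thread"; on a GIL build (or any pre-3.13 Python), "process".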
Summary
| # | Pattern | Thread-Safety Guarantee | Lock Required |
|---|---|---|---|
| 1 | Frozen Configuration | Immutable after construction -- no write races possible | None |
| 2 | Immutable Events | Frozen value objects -- safe to share across any number of readers | None (at event level) |
| 3 | Sans-I/O Protocols | Thread-local instances -- no sharing by construction | None |
| 4 | Queue-Based Handoff | queue.Queue internal synchronization -- ownership transfer | Built into Queue |
| 5 | Accept Distributor | Single writer (accept thread) + Queue -- no thundering herd | Built into Queue |
| 6 | C Extension Audit | Process-level -- one bad import disables free-threading globally | N/A (prevention) |
| 7 | Functional State Machine | Immutable state + pure reducer -- no mutation possible | None |
| 8 | Per-Request Instances | Thread-local by lifetime -- never shared | None |
| 9 | Monotonic ID Generation | Minimal critical section -- lock held for one integer increment | threading.Lock |
| 10 | Adaptive Detection | Decided once at startup -- no per-request branching | None |
The overarching principle: free-threading rewards architectures that minimize shared mutable state. Eight of the ten patterns above require zero locks. The remaining two (ID generation and queue handoff) confine locking to the smallest possible scope. This is not a coincidence -- it is the design target.
If you find yourself reaching for threading.Lock on a hot path, step back and
ask: can this state be frozen, this object be per-request, or this communication
be a queue? In free-threaded Python, the fastest lock is the one you eliminated.