Pounce — Thread-Based ASGI Workers on Free-Threaded Python

How Pounce runs thread-based ASGI workers on free-threaded Python with shared immutable config, per-request compressors, and automatic GIL detection — one process, N threads, no IPC.

Pounce is built around a simple operational promise: on free-threaded Python, one process can run many worker threads against one in-memory app.

That promise matters because ASGI servers usually make you choose a process model up front. Pounce keeps the command the same and lets the runtime decide the worker strategy.

On GIL Python, threads take turns. On Python 3.14t, they run in parallel. One command, one config, two different operating realities depending on your interpreter. Pounce detects the runtime and picks the right worker model automatically.


Series context

Part 5 of 6 in Free-Threading in the Bengal Ecosystem. Pounce is the ASGI server — it runs Chirp apps in production, serving pages built with Kida, Patitas, and Rosettes.


Run it

uv python install 3.14t
uv run --python=3.14t pounce myapp:app --workers 4

On Python 3.14t, that means four threads, shared memory, and one app load. On standard Python, it means four processes. Same command either way.


Threads vs processes — automatic

import sys
from typing import Literal

WorkerMode = Literal["thread", "process"]

def is_gil_enabled() -> bool:
    # sys._is_gil_enabled() exists on 3.13+; older interpreters always have the GIL
    return getattr(sys, "_is_gil_enabled", lambda: True)()

def detect_worker_mode() -> WorkerMode:
    return "process" if is_gil_enabled() else "thread"

The important part is that the request flow does not split into two codebases. Same Worker class, same ServerConfig, same request flow. Only the spawning mechanism differs.

flowchart TB
    Start["pounce myapp:app --workers 4"] --> Detect{"GIL enabled?"}
    Detect -->|"Yes (standard Python)"| Process["4 Processes"]
    Detect -->|"No (3.14t)"| Thread["4 Threads"]
    Process --> P1["Process 1 (own interpreter)"]
    Process --> P2["Process 2 (own interpreter)"]
    Thread --> T1["Thread 1 (shared interpreter)"]
    Thread --> T2["Thread 2 (shared interpreter)"]
Thread mode (3.14t):
  • One process, shared memory
  • One copy of the app loaded
  • Lower RSS (~60–80 MB for 4 workers)
  • No IPC needed for shared state
  • Graceful rolling restart available

Process mode (standard Python):
  • N processes, isolated memory
  • App loaded N times
  • Higher RSS (~100–150 MB for 4 workers)
  • IPC needed for any shared state
  • Brief-downtime restart only

Shared immutable config

Workers need config: host, port, timeouts, limits, compression settings. Mutating a shared dict from multiple threads is a race, so Pounce makes config immutable instead:

@dataclass(frozen=True, slots=True)
class ServerConfig:
    """Immutable server configuration.
    Created once at startup, shared across all worker threads.
    """
    host: str = "127.0.0.1"
    port: int = 8000
    workers: int = 1
    keep_alive_timeout: float = 5.0
    request_timeout: float = 30.0
    compression: bool = True
    # ... 30+ fields, all immutable

Created once at startup, passed to every worker, and never mutated. That removes an entire category of lock and coordination problems.


Per-request compressors

Compression formats such as gzip and zstd require stateful compressor objects. Sharing one across requests would be a race, so Pounce creates a fresh compressor per request:

import zlib

class GzipCompressor:
    def __init__(self, *, level: int = 6) -> None:
        # wbits=31 selects a gzip container rather than raw deflate
        self._compressor = zlib.compressobj(level, zlib.DEFLATED, 31)

The cost of creating a compressor is small compared to the lifetime of a request. The alternative, locking around a shared compressor, would serialize compression across workers.
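The per-request pattern can be sketched directly with stdlib `zlib` (the `handle_response` helper is illustrative, not Pounce's API). Each call gets its own compressor, so two workers compressing concurrently never touch shared state:

```python
import zlib


def handle_response(chunks: list[bytes]) -> bytes:
    # A fresh compressor for this request only — no sharing, no locks.
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31: gzip container
    body = b"".join(comp.compress(chunk) for chunk in chunks)
    return body + comp.flush()  # flush emits the gzip trailer
```

The object lives for exactly one response and is garbage-collected afterward, which is why the allocation cost stays negligible next to the rest of the request's work.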


The Brotli exclusion

Warning

Pounce supports zstd (stdlib, PEP 784) and gzip (stdlib zlib). Brotli is intentionally excluded — the brotli C extension re-enables the GIL on Python 3.14t. Using it in a free-threaded server would serialize all worker threads whenever any thread compresses a response. Clients that send only Accept-Encoding: br receive uncompressed responses.

This is the free-threading ecosystem in miniature: "has wheels" and "works correctly under contention" are different bars. Audit your C extensions. Prefer stdlib or verified free-threading-safe libraries.
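One way to audit this in your own service is a startup assertion, run after all imports, that checks whether anything flipped the GIL back on. This is a sketch under the assumption of a 3.13+ interpreter; the helper name is not part of Pounce's API:

```python
import sys
import sysconfig


def assert_still_free_threaded() -> None:
    """Raise if a C extension re-enabled the GIL on a free-threaded build.

    Call this after importing your full dependency set. On a standard
    (GIL) build it is a no-op, since the check is only meaningful when
    the interpreter was built with Py_GIL_DISABLED.
    """
    if not sysconfig.get_config_var("Py_GIL_DISABLED"):
        return  # standard build: the GIL is expected
    if sys._is_gil_enabled():
        raise RuntimeError(
            "GIL was re-enabled at import time — an imported C extension "
            "does not declare free-threading support"
        )
```

Run it once at startup and a bad wheel fails loudly instead of silently serializing every worker thread.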


Graceful reload

In thread mode, Pounce supports zero-downtime rolling restart:

  1. Spawn new workers
  2. Mark old workers for draining (finish existing connections, reject new)
  3. Wait for old workers to become idle
  4. Shut down old workers

This works because threads share memory, so the supervisor can signal workers directly. In process mode, workers run in separate address spaces, so Pounce falls back to brief-downtime restart.

That makes graceful reload a concrete operational benefit of the thread-based model, not just an architectural nicety.


What this means in practice

On free-threaded Python 3.14t, pounce myapp:app --workers 4 runs four threads sharing one interpreter. One app load. Shared immutable config. No fork, no IPC. Compression uses stdlib only.

On standard Python, the same command runs four processes. Same behavior, higher memory, and no rolling restart. Upgrade to free-threaded Python and you get the thread-mode benefits without changing your deployment command.


Further reading