Kida is a template engine built for a world where multiple threads may render at the same time.
That matters because template engines carry more shared state than they first appear to: filters, globals, compiled templates, render context, caches. On free-threaded Python, those are all places a design can quietly fall apart.
Jinja2's filter dict is a shared whiteboard. Every thread reads from it. When you call add_filter(), you pick up the marker and write on it while other threads may be reading the same surface. Under the GIL, threads take turns at the whiteboard. On Python 3.14t, they do not.
Kida avoids that by copying shared configuration before mutation. The thread mid-render keeps reading the old board. The caller of add_filter() gets the fresh one. Nobody fights over the marker because nobody is using the same board at the same moment.
That is copy-on-write. It is one of four patterns that make Kida safe under free-threaded Python and help explain why it is faster than Jinja2 on large templates.
Series context
Part 2 of 6 — Free-Threading in the Bengal Ecosystem. Kida is the template layer used by both Bengal (SSG) and Chirp (web framework).
Run it
uv python install 3.14t
uv run --python=3.14t python -c "
from kida import Environment
env = Environment()
t = env.from_string('Hello, {{ name }}!')
print(t.render(name='World'))
"
Kida declares itself GIL-independent via _Py_mod_gil = 0. The useful takeaway for readers is simpler: the public APIs are designed to stay correct when many threads render concurrently.
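You can check which world your interpreter is in from Python itself. This is a stdlib sketch, not Kida code: `sys._is_gil_enabled()` was added in CPython 3.13, so the snippet falls back to assuming the GIL is on for older versions.

```python
import sys

# sys._is_gil_enabled() exists on CPython 3.13+; on older interpreters
# (or any build where it is absent) we assume the GIL is enabled.
gil_on = getattr(sys, "_is_gil_enabled", lambda: True)()
print("GIL enabled:", gil_on)  # False on a 3.14t free-threaded build
```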
Performance
The single-threaded advantage comes from AST-native compilation: Kida generates ast.Module directly instead of generating source strings first. The concurrent table measures benign renders with no config mutations.
That caveat matters. If you add a concurrent add_filter() call to the Jinja2 column, the question stops being "which one is faster?" and starts being "which one is still well-defined?"
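A minimal stdlib illustration of "no longer well-defined" (not Kida or Jinja2 code): even on a single thread, CPython refuses to let a dict be mutated while it is being iterated. A reader thread walking a shared filter dict while another thread inserts into it hits exactly this hazard.

```python
# Mutating a dict during iteration raises RuntimeError even single-threaded.
# A shared filter dict under concurrent add_filter() has the same problem.
filters = {"upper": str.upper}
caught = None
try:
    for name in filters:
        filters["lower"] = str.lower  # simulates a concurrent add_filter()
except RuntimeError as exc:
    caught = exc
print(caught)  # dictionary changed size during iteration
```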
Copy-on-write for config
The first pattern is copy-on-write for shared configuration.
def add_filter(self, name: str, func: Callable[..., Any]) -> None:
    new_filters = self._filters.copy()
    new_filters[name] = func
    self._filters = new_filters  # Atomic swap
The same principle applies to add_test(), add_global(), update_filters(), and update_tests(). The thread mid-render keeps reading the old board. The thread calling add_filter() gets the new one. Neither sees a torn state because they were never reading and writing the same object in place.
The cost is one extra copy per config change — rare compared to render volume. The benefit is that the hot path, every render, never touches a lock.
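The pattern can be exercised in isolation. This is a hypothetical `MiniEnv` class sketching the copy-then-swap idea under assumed names, not Kida's implementation: a writer thread hammers `add_filter()` while a reader thread renders against whatever snapshot it sees.

```python
import threading

class MiniEnv:
    """Minimal copy-on-write config holder (illustrative, not Kida's code)."""

    def __init__(self):
        self._filters = {"upper": str.upper}

    def add_filter(self, name, func):
        # Copy, mutate the copy, then rebind the attribute in one step.
        new_filters = self._filters.copy()
        new_filters[name] = func
        self._filters = new_filters  # readers see the old dict or the new one

    def apply(self, name, value):
        # A "render" takes one snapshot and uses it throughout.
        filters = self._filters
        return filters[name](value)

env = MiniEnv()

def writer():
    for i in range(1000):
        env.add_filter(f"f{i}", str.strip)

def reader():
    for _ in range(1000):
        assert env.apply("upper", "hi") == "HI"

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("no torn reads")
```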
Immutable AST nodes
The second pattern is immutability after compilation:
@dataclass(frozen=True, slots=True)
class Output(Node):
    """Output expression: {{ expr }}"""

    expr: Expr
    escape: bool = True
Once the template is compiled, it becomes a fixed artifact. A hundred threads can hold the same compiled node, read it, and render from it without changing its shape. This is also what makes fragment caching safe: a frozen node can be reused by any thread that needs it because it was finished the moment it was created.
ContextVar for per-render state
The third pattern is keeping render state out of user data.
Template engines accumulate internal state during a render: current template name, line number, include depth, block cache. Older designs often smuggle that into the user's context dict. It works, but it mixes engine state with application state.
Kida keeps that state in its own place:
_render_context: ContextVar[RenderContext | None] = ContextVar(
    "render_context", default=None
)
RenderContext holds template name, line, include depth, block cache, and template stack. Each render installs it at entry and removes it at exit. The user's context dict stays exactly as they left it, with no _template, no _line, and no surprise entries. Because ContextVar propagates into asyncio.to_thread(), the state also follows the render into async contexts without extra plumbing.
StringBuilder for throughput
The fourth pattern is simpler: keep the hot path cheap.
Kida generates two render modes from one template: render() and render_stream(). The fast path appends output pieces to a list and joins once at the end:
def render(ctx, _blocks=None):
    buf = []
    _append = buf.append
    _append("Hello, ")
    _append(_e(_s(ctx["name"])))
    return ''.join(buf)
That is O(n) instead of O(n²) for repeated string concatenation. Each render call uses its own local buffer, with nothing shared between threads.
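The generated code above relies on helpers (`_e` for escaping, `_s` for stringification) that aren't shown. A runnable stand-in, with `html.escape` and `str` assumed in those roles, shows the same shape end to end:

```python
import html

def render(ctx):
    # Same shape as the generated fast path: local list, bound append, one join.
    buf = []
    _append = buf.append
    _append("Hello, ")
    _append(html.escape(str(ctx["name"])))  # stands in for _e(_s(...))
    _append("!")
    return "".join(buf)

print(render({"name": "<World>"}))  # Hello, &lt;World&gt;!
```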
Compute outside the lock
The cache still uses a lock, but the expensive work happens outside it:
# Compile first, fully, at the workbench
value = factory(key)
with self._lock:
    if key in self._cache:
        return self._cache[key]  # Someone filed it while we were working
    self._cache[key] = value
    return value
Acquire the lock only long enough to store the finished value. If someone else stored the same template while we were compiling, discard ours and return theirs. That duplicated work is acceptable. Putting compilation inside the lock would turn the cache into a bottleneck.
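Wrapped in a class, the whole pattern fits in a few lines. `CompileCache` is a hypothetical name for this sketch, not Kida's actual cache type:

```python
import threading

class CompileCache:
    """Double-checked insert: the expensive factory call runs outside the lock.
    Illustrative sketch of the pattern, not Kida's actual cache class."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def get_or_create(self, key, factory):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        # Compile first, fully, outside the lock.
        value = factory(key)
        with self._lock:
            if key in self._cache:
                # Someone stored it while we were compiling; keep theirs.
                return self._cache[key]
            self._cache[key] = value
            return value

cache = CompileCache()
compiled = cache.get_or_create("hello.html", lambda k: f"<compiled {k}>")
print(compiled)  # <compiled hello.html>
```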
What this means in practice
On free-threaded Python 3.14t, multiple threads can render Kida templates concurrently without leaning on the GIL for safety. Each of the four patterns eliminates a different hazard: in-place shared config mutation, mutable compiled structures, render-state leakage, and lock-heavy cache work.
On GIL builds, the same patterns still eliminate bugs — the races just don't manifest until the GIL goes away. The code is correct in both worlds. It just goes faster in one of them.
Further reading
- Static Analysis for Templates — what Kida can tell you about a template before you render it, and how Bengal uses it for incremental builds
- Kida documentation — full API reference
- Next in series: Patitas — A Markdown Parser Built for Parallel Parsing
Related
- Kida vs Jinja2 — when to choose each template engine
- The Python Free-Threading Ecosystem in 2026 — who's ready for NoGIL