Performance

Benchmarks and optimization tips


Kida is designed for high-performance template rendering.

Benchmarks

Methodology: pytest-benchmark, identical templates and contexts, auto_reload=false, bytecode cache enabled. Reported values are mean times. JSON exports live in .benchmarks/.

Numbers from benchmarks/test_benchmark_render.py (file-based templates, Python 3.14.2 free-threading build, Apple Silicon).

CPython 3.14t (free-threaded, PYTHON_GIL=0, 3.14.2+ft)

Template                   Kida
Minimal (hello)            3.48µs
Small (12 vars)            8.88µs
Medium (~100 vars)         0.395ms
Large (1000 loop items)    1.91ms
Complex (inheritance)      21.4µs

Large templates benefit from the StringBuilder pattern. Medium templates are dominated by HTML escaping, which Kida implements in pure Python to stay dependency-free.

Concurrent Performance (Free-Threading)

Numbers from benchmarks/test_benchmark_full_comparison.py (inline medium template, 100 total renders distributed across workers).

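Kida's thread-safe design lets renders run concurrently across workers. In this run the time was lowest with 2 workers, while 4 and 8 workers landed close to the single-worker result: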

Workers    Kida
1          1.80ms
2          1.12ms
4          1.62ms
8          1.76ms
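A benchmark of this shape can be sketched with the standard library alone. The render function below is a stand-in rather than Kida's API, and the way the 100 renders are split across workers is an assumption about how the benchmark script distributes them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def render(context):
    # Stand-in for template.render(**context); hypothetical, not Kida's API.
    return "".join(f"<li>{key}={value}</li>" for key, value in context.items())


def run_batch(workers, total_renders=100):
    """Spread total_renders across a thread pool; return (seconds, outputs)."""
    context = {f"var{i}": i for i in range(100)}
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(render, [context] * total_renders))
    return time.perf_counter() - start, outputs


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        elapsed, _ = run_batch(workers)
        print(f"{workers} workers: {elapsed * 1000:.2f}ms")
```

With a batch this small, pool startup and scheduling can outweigh the work itself, so timings at higher worker counts are noisy on any interpreter build.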

Lexer Optimization

The lexer uses a compiled regex for delimiter detection, which makes _find_next_construct() 5-24x faster than the previous approach of multiple str.find() calls:

Method                   Speedup
re.search() (current)    5-24x
str.find() × 3 (old)     baseline
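The two strategies can be compared side by side in a minimal sketch. The delimiter set ({{, {% and {#) and both function names are assumptions for illustration; Kida's actual _find_next_construct() is not shown in this document.

```python
import re

# One compiled alternation finds whichever delimiter comes first in a single scan.
_DELIM_RE = re.compile(r"\{\{|\{%|\{#")


def find_next_regex(source, pos=0):
    """Return (index, delimiter) of the next delimiter, or (-1, None)."""
    match = _DELIM_RE.search(source, pos)
    return (match.start(), match.group()) if match else (-1, None)


def find_next_str_find(source, pos=0):
    """Same result via three separate scans, keeping the earliest hit."""
    best = (-1, None)
    for delim in ("{{", "{%", "{#"):
        index = source.find(delim, pos)
        if index != -1 and (best[0] == -1 or index < best[0]):
            best = (index, delim)
    return best
```

The str.find() version may scan to the end of the source once per delimiter even when another delimiter sits close to the start position, which is the cost the single regex scan avoids.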

Render Loop Optimizations

Type-aware escaping: Numeric types (int, float, bool) bypass HTML escaping since they can't contain special characters:

Value Type              Time (10k escapes)    Speedup
Numeric (optimized)     0.90ms                1.9x
String (full escape)    1.74ms                baseline

Lazy LoopContext: When loop.* properties aren't used, Kida iterates directly over items without creating a LoopContext wrapper:

Loop Type         Time (10k items)    Speedup
Without loop.*    202.7ms             1.80x
With loop.*       365.7ms             baseline

Compilation Time

Numbers from the benchmarks/test_benchmark_render.py compile benchmarks.

Template    Kida
Small       4.03ms
Medium      6.04ms
Large       4.62ms
Complex     5.08ms

Kida builds a full AST (lexer → parser → AST → compiler → Python code → exec()). This cost is amortized by the bytecode cache: recompilation only happens when the template source changes.
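The last two stages of that pipeline (Python code → exec()) can be illustrated with a minimal sketch. The generated source here is written by hand and compile_template is a hypothetical helper; a real compiler would emit the source from the template AST.

```python
def compile_template(variable_name):
    """Build Python source for a render function, then compile and exec it."""
    source = (
        "def render(ctx):\n"
        "    _out = []\n"
        "    _out.append('Hello, ')\n"
        f"    _out.append(str(ctx[{variable_name!r}]))\n"
        "    _out.append('!')\n"
        "    return ''.join(_out)\n"
    )
    # compile() returns a code object; this is the part a bytecode cache can persist.
    code = compile(source, "<template>", "exec")
    namespace = {}
    exec(code, namespace)
    return namespace["render"]


render = compile_template("name")
print(render({"name": "World"}))
```

Once the code object exists, calling render() repeatedly never touches the lexer or parser again, which is why compile time matters mainly for cold starts.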

Where to Improve Next

  • Medium templates: HTML escaping overhead dominates (pure Python)
  • Cold-start: lazy analysis imports cut import time from 60ms to 31ms (48% improvement)
  • Compilation: amortized by bytecode cache

Run locally:

# GIL on
uv run pytest benchmarks/benchmark_render.py --benchmark-only \
  --benchmark-json .benchmarks/render_auto_reload_off.json \
  --override-ini "python_files=benchmark_*.py test_*.py"

# Free-threaded
PYTHON_GIL=0 uv run --python 3.14t pytest benchmarks/benchmark_render.py \
  --benchmark-only --benchmark-json .benchmarks/render_free_thread.json \
  --override-ini "python_files=benchmark_*.py test_*.py"

Why Kida is Fast

StringBuilder Pattern

Kida uses list.append() + "".join() for O(n) rendering:

# Kida's approach
def render():
    _out = []
    _out.append("Hello, ")
    _out.append(name)
    _out.append("!")
    return "".join(_out)

Compared with yielding fragments from a generator, the StringBuilder pattern has lower overhead:

  • No generator/iterator protocol
  • Single memory allocation for final string
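To check the difference on your own machine, a minimal sketch can time both styles against the same input. Both functions are illustrative stand-ins, not code generated by Kida.

```python
import timeit


def render_builder(items):
    # StringBuilder style: append fragments to a list, join once at the end.
    _out = []
    for item in items:
        _out.append("<li>")
        _out.append(item)
        _out.append("</li>")
    return "".join(_out)


def _chunks(items):
    # Generator style: every fragment goes through the iterator protocol.
    for item in items:
        yield "<li>"
        yield item
        yield "</li>"


def render_generator(items):
    return "".join(_chunks(items))


if __name__ == "__main__":
    items = [str(i) for i in range(1000)]
    for fn in (render_builder, render_generator):
        seconds = timeit.timeit(lambda: fn(items), number=200)
        print(f"{fn.__name__}: {seconds / 200 * 1e6:.1f}µs per render")
```

Both functions produce identical output, so any gap in the timings comes from how the fragments are collected.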

Local Variable Caching

Frequently-used functions are bound to locals once at the top of each render function:

_e = _escape   # Local alias for escape function
_s = _str      # Local alias for str()
_append = buf.append  # Local alias for list.append
# ... rest of render uses _e, _s, _append (LOAD_FAST)
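A complete render function in this style might look like the sketch below. It uses html.escape from the standard library as a stand-in for Kida's escape function, and the context keys are made up for illustration.

```python
from html import escape as _escape


def render(ctx):
    buf = []
    # Bind once: later uses are local-variable lookups instead of
    # global lookups (_escape, str) or attribute lookups (buf.append).
    _e = _escape
    _s = str
    _append = buf.append

    _append("<h1>")
    _append(_e(_s(ctx["title"])))
    _append("</h1>")
    for tag in ctx["tags"]:
        _append("<span>")
        _append(_e(_s(tag)))
        _append("</span>")
    return "".join(buf)
```

The saving per call is small, so the aliasing pays off mainly inside loops where the same function is called thousands of times.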

O(1) Operator Dispatch

Token handling uses dict-based dispatch, not if/elif chains:

HANDLERS = {
    "if": handle_if,
    "for": handle_for,
    # ...
}
handler = HANDLERS.get(token.value)
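Filled out into something runnable, the dispatch might look like this sketch. The Token class, the handler bodies, and the error raised for unknown tags are all assumptions; only the dict lookup mirrors the snippet above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    value: str


def handle_if(token):
    return "parsed if-block"


def handle_for(token):
    return "parsed for-loop"


# One hash lookup regardless of how many tags are registered.
HANDLERS = {
    "if": handle_if,
    "for": handle_for,
}


def dispatch(token):
    handler = HANDLERS.get(token.value)
    if handler is None:
        raise SyntaxError(f"unknown tag: {token.value!r}")
    return handler(token)
```

An if/elif chain compares against each keyword in turn, so its cost grows with the number of tags, while the dict lookup stays constant.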

Type-Aware HTML Escaping

Kida skips escaping for numeric types that can't contain HTML special characters:

def html_escape(value):
    # Skip numeric types - cannot contain <>&"'
    if type(value) is int or type(value) is float or type(value) is bool:
        return str(value)  # Fast path
    return str(value).translate(_ESCAPE_TABLE)
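The snippet above relies on an _ESCAPE_TABLE that is not shown. A self-contained version could define it with str.maketrans as in this sketch; the exact entities chosen for the quote characters are an assumption.

```python
# Translation table covering the five characters named in the comment above.
_ESCAPE_TABLE = str.maketrans({
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
})


def html_escape(value):
    # Exact type checks: subclasses take the full path below, because a
    # subclass may define __str__ and return arbitrary text.
    if type(value) is int or type(value) is float or type(value) is bool:
        return str(value)  # Fast path
    # translate() makes a single pass, so "&" inside a replacement is not re-escaped.
    return str(value).translate(_ESCAPE_TABLE)
```

For example, html_escape(42) returns "42" without touching the table, while html_escape("a & b") goes through translate().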

Lazy LoopContext

When loop.* properties aren't used, Kida generates direct iteration:

# When loop.index, loop.first, etc. are NOT used:
for item in _loop_items:  # Direct iteration (1.80x faster)
    ...

# When loop.* IS used:
loop = _LoopContext(_loop_items)  # Full context tracking
for item in loop:
    ...

Compile-Time Optimization

Kida is AST-native and uses several compile-time passes:

  • Python’s optimizer for constant folding
  • Dead code elimination: removes branches whose conditions are provably constant (e.g. {% if false %}...{% end %}, {% if 1+1==2 %}...{% end %}). Skips inlining when the body contains block-scoped nodes (Set, Let, Capture, Export).
  • Partial evaluation: when static_context is provided, evaluates static expressions and replaces them with constants. Supports Filter and Pipeline for pure filters (e.g. {{ site.title | default("x") }}).
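Dead code elimination can be illustrated on Python's own ast module. This is a sketch of the general technique, not Kida's pass: it works on Python syntax rather than Kida's template nodes, it only handles literal constant tests (an expression such as 1+1==2 would need folding first), and it does not implement the block-scoped-node rule described above.

```python
import ast


class DeadBranchRemover(ast.NodeTransformer):
    """Replace `if <constant>:` with whichever branch can actually run."""

    def visit_If(self, node):
        self.generic_visit(node)  # simplify nested branches first
        if isinstance(node.test, ast.Constant):
            # Returning a list splices those statements in place of the `if`.
            # An empty list removes it, so callers must not rely on this
            # when the `if` is the only statement in its block.
            return node.body if node.test.value else node.orelse
        return node


def eliminate_dead_branches(source):
    tree = DeadBranchRemover().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(eliminate_dead_branches("if False:\n    x = 1\nelse:\n    x = 2\ny = 3"))
```

Branches whose test depends on a runtime value are left untouched, which matches the "provably constant" condition.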

Caching Strategies

Template Cache

Compiled templates are cached in memory:

from kida import Environment

env = Environment(
    cache_size=400,  # Max templates
    auto_reload=False,  # Skip mtime checks
)

Bytecode Cache

Persist compiled bytecode for cold starts:

from kida import Environment
from kida.bytecode_cache import BytecodeCache

env = Environment(
    bytecode_cache=BytecodeCache("__pycache__/kida/"),
)

Cold-start improvement: bytecode cache saves ~7-8% on first render. Lazy analysis imports (added in v0.x) reduced the import time (from kida import Environment) from ~60ms to ~31ms (48% faster) by deferring kida.nodes (974 lines of AST definitions) until analysis is actually needed.

Fragment Cache

Cache expensive template sections:

{% cache "sidebar-" + user.id %}
    {{ render_expensive_sidebar(user) }}
{% end %}
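Conceptually, a cache block computes a key, returns the stored output on a hit, and renders the body only on a miss. The sketch below shows that logic in plain Python; the module-level dict and the cached_fragment helper are hypothetical, and expiry and eviction are left out because this document does not describe how Kida handles them.

```python
_fragment_cache = {}


def cached_fragment(key, render_fn):
    """Return the cached output for key, calling render_fn only on a miss."""
    try:
        return _fragment_cache[key]
    except KeyError:
        html = _fragment_cache[key] = render_fn()
        return html


render_calls = []


def render_sidebar():
    # Stand-in for an expensive template section.
    render_calls.append("sidebar")
    return "<aside>sidebar</aside>"


first = cached_fragment("sidebar-42", render_sidebar)
second = cached_fragment("sidebar-42", render_sidebar)
```

The second call returns the stored string without invoking render_sidebar again, which is why the key must include everything the fragment depends on, such as the user id in the example above.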

Optimization Tips

  1. Disable auto-reload in production

    # Production
    env = Environment(auto_reload=False)
    
  2. Pre-warm template cache

    def warmup():
        for name in ["base.html", "home.html", "post.html"]:
            env.get_template(name)
    
    warmup()  # On startup
    
  3. Use bytecode cache

    env = Environment(
        bytecode_cache=BytecodeCache("__pycache__/kida/"),
    )
    
  4. Precompute in Python

    # ✅ Python handles complexity
    template.render(
        items=sorted_items,
        total=calculated_total,
        formatted_date=format_date(date),
    )
    
    # ❌ Complex logic in template
    # {% set total = 0 %}{% for i in items %}...
    
  5. Use fragment caching

    {# Cache expensive computations #}
    {% cache "recent-posts" %}
        {% for post in get_recent_posts() %}
            {{ render_post_card(post) }}
        {% end %}
    {% end %}
    
  6. Minimize filter chains

    {# Less efficient: multiple passes #}
    {{ text | lower | trim | truncate(100) }}
    
    {# More efficient: single Python call #}
    {{ preprocess(text) }}
    

Profiling

Render Accumulator (Opt-in)

Profile template rendering with detailed block timings:

from kida.render_accumulator import profiled_render

with profiled_render() as metrics:
    html = template.render(page=page, site=site)

print(metrics.summary())
# {
#   "total_ms": 12.5,
#   "blocks": {
#     "content": {"ms": 8.2, "calls": 1},
#     "nav": {"ms": 2.1, "calls": 1},
#   },
#   "includes": {"partials/sidebar.html": 2},
#   "macros": {"render_card": 15},
#   "filters": {"escape": 45, "truncate": 12},
# }

Zero overhead when disabled—profiling only runs inside profiled_render().

Template Cache Stats

info = env.cache_info()
print(f"Hit rate: {info['template']['hit_rate']:.1%}")

Time Individual Renders

import time

start = time.perf_counter()
html = template.render(**context)
elapsed = time.perf_counter() - start
print(f"Render: {elapsed*1000:.2f}ms")
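A single measurement is sensitive to noise such as cache warm-up and garbage collection. The sketch below repeats the call and reports the mean and standard deviation; time_render is a hypothetical helper, and it takes any zero-argument callable so it does not depend on Kida's API.

```python
import statistics
import time


def time_render(render_fn, runs=100):
    """Call render_fn `runs` times (at least 2); return (mean_ms, stdev_ms)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        render_fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Stand-in workload; with Kida you might pass
    # lambda: template.render(**context) instead.
    mean_ms, stdev_ms = time_render(lambda: "".join(str(i) for i in range(1000)))
    print(f"Render: {mean_ms:.3f}ms ± {stdev_ms:.3f}ms")
```

A large standard deviation relative to the mean usually means the first few runs were still warming caches and should be measured separately.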

Memory Usage

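import sys

template = env.get_template("page.html")
# sys.getsizeof() is shallow: it counts the template object itself,
# not the compiled code or other objects it references.
print(f"Shallow size: {sys.getsizeof(template)} bytes")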
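Because sys.getsizeof() reports only the size of the object itself, a fuller picture needs allocation tracking. One option is the standard library's tracemalloc, sketched below; measure_allocations is a hypothetical helper and the workload is a stand-in.

```python
import tracemalloc


def measure_allocations(build):
    """Return (result, retained_bytes, peak_bytes) for calling build()."""
    tracemalloc.start()
    try:
        result = build()
        # Memory still allocated now, and the high-water mark since start().
        retained, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, retained, peak


if __name__ == "__main__":
    # Stand-in workload; with Kida you might pass
    # lambda: env.get_template("page.html") on a cold cache.
    _, retained, peak = measure_allocations(lambda: [str(i) * 10 for i in range(1000)])
    print(f"Retained: {retained} bytes, peak: {peak} bytes")
```

If the template is already in the in-memory cache, loading it again allocates very little, so measure on a fresh Environment to see the real cost.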
