Worker Auto-Tuning

Workload-aware parallelization for free-threaded Python


Kida provides a workload-aware worker pool toolkit for framework authors who need to parallelize template rendering. It is calibrated for free-threaded Python (3.14t) where CPU-bound rendering achieves true parallelism.

from kida.utils.workers import get_optimal_workers, should_parallelize

Quick Start

from concurrent.futures import ThreadPoolExecutor
from kida.utils.workers import get_optimal_workers, should_parallelize

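# `template` is assumed to be a compiled Kida template loaded elsewhere
contexts = [{"name": f"User {i}"} for i in range(100)]

if should_parallelize(len(contexts)):
    workers = get_optimal_workers(len(contexts))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # Unpack each context as keyword arguments, matching the serial branch
        results = list(executor.map(lambda ctx: template.render(**ctx), contexts))
else:
    results = [template.render(**ctx) for ctx in contexts]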

Core Functions

should_parallelize

Determine if parallelization is worthwhile. Thread pool overhead (~1-2ms per task) only pays off above a threshold.

from kida.utils.workers import should_parallelize

should_parallelize(5)    # False — below threshold
should_parallelize(100)  # True — above threshold

# With work size estimate (bytes of template output)
should_parallelize(100, total_work_estimate=500)  # False — too small
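
To make the decision rule concrete, here is a minimal sketch of a threshold check that reproduces the three calls above. It is not Kida's implementation: the name `should_parallelize_sketch` and both parameters are hypothetical, the task threshold of 10 is borrowed from the render profile shown later on this page, and the 1000-byte work floor is an assumption chosen only so the example matches.

```python
def should_parallelize_sketch(
    task_count,
    total_work_estimate=None,
    task_threshold=10,      # taken from the render profile example on this page
    min_work_bytes=1_000,   # assumed value, for illustration only
):
    """Return True when a thread pool is likely to pay for its own overhead."""
    # Too few tasks: pool startup and dispatch cost more than they save.
    if task_count < task_threshold:
        return False
    # Enough tasks, but the total output is tiny: still not worth it.
    if total_work_estimate is not None and total_work_estimate < min_work_bytes:
        return False
    return True


print(should_parallelize_sketch(5))
print(should_parallelize_sketch(100))
print(should_parallelize_sketch(100, total_work_estimate=500))
```

The point of the second check is that task count alone can mislead: 100 renders of a one-line template may finish faster serially than the pool can be started.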

get_optimal_workers

Calculate the optimal worker count based on workload type, environment, CPU cores, and free-threading status.

from kida.utils.workers import get_optimal_workers, WorkloadType

# Template rendering (default)
get_optimal_workers(100)  # 4 (local, free-threading)

# Template compilation
get_optimal_workers(100, workload_type=WorkloadType.COMPILE)  # 2

# Override auto-tuning
get_optimal_workers(100, config_override=16)  # 16

# Weight heavy templates higher
get_optimal_workers(50, task_weight=2.0)  # Adjusts for heavy work

Workload Types

| Type | Use Case | Parallelism |
|------|----------|-------------|
| WorkloadType.RENDER | Template rendering (CPU-bound) | High: benefits from free-threading |
| WorkloadType.COMPILE | Template compilation (CPU-bound) | Moderate: shared cache limits scaling |
| WorkloadType.IO_BOUND | File loading, network | High: threads wait on I/O |

Environment Detection

The toolkit auto-detects the execution environment to tune worker counts:

| Environment | Detection | Worker Strategy |
|-------------|-----------|-----------------|
| CI | CI, GITHUB_ACTIONS, etc. | Conservative (2 workers max) |
| Local | Default | Moderate (up to 4 workers) |
| Production | KIDA_ENV=production | Aggressive (up to 8 workers) |

Override detection with the KIDA_ENV environment variable:

export KIDA_ENV=production  # or "ci" or "local"
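
The detection order described above can be sketched as follows. This is an illustration, not Kida's code: `detect_environment`, `CI_VARS`, and `MAX_WORKERS` are hypothetical names, only the two CI variables named in the table are checked (the real list is longer and not given here), and the worker caps are copied from the table.

```python
import os

# Only the variables named in the table; the actual list is not shown on this page.
CI_VARS = ("CI", "GITHUB_ACTIONS")

# Worker ceilings per environment, copied from the table above.
MAX_WORKERS = {"ci": 2, "local": 4, "production": 8}


def detect_environment(environ=None):
    """Return "ci", "local", or "production" for the given environment mapping."""
    env = os.environ if environ is None else environ
    # An explicit KIDA_ENV wins over any auto-detection.
    explicit = env.get("KIDA_ENV", "").strip().lower()
    if explicit in MAX_WORKERS:
        return explicit
    # Otherwise, any truthy CI marker means we are on a CI runner.
    if any(env.get(name) for name in CI_VARS):
        return "ci"
    # Nothing detected: assume a developer machine.
    return "local"


print(detect_environment({"GITHUB_ACTIONS": "true"}))
print(detect_environment({"CI": "true", "KIDA_ENV": "production"}))
```

Passing a plain dict instead of reading `os.environ` directly keeps the function easy to test.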

Free-Threading Detection

The toolkit detects whether the GIL is disabled and scales worker counts accordingly:

from kida.utils.workers import is_free_threading_enabled

if is_free_threading_enabled():
    print("GIL disabled — true parallelism available")

On free-threaded Python, render workloads get a 1.5x multiplier on the CPU-based worker count.
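
For readers who want to check GIL status without Kida, the standard library exposes two signals. The sketch below shows one way to read them; the helper names `gil_is_disabled` and `built_free_threaded` are hypothetical, and this is not necessarily how `is_free_threading_enabled` works internally.

```python
import sys
import sysconfig


def gil_is_disabled():
    """Runtime check: True only if the interpreter is running without the GIL."""
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always have a GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    if probe is None:
        return False
    return not probe()


def built_free_threaded():
    """Build-time check: True if this interpreter was compiled with free-threading."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


print("free-threaded build:", built_free_threaded())
print("GIL disabled at runtime:", gil_is_disabled())
```

The two can differ: a free-threaded build may re-enable the GIL at runtime, for example when an extension module that does not declare free-threading support is imported, so the runtime check is the safer basis for sizing a pool.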

Template Scheduling

For optimal throughput, schedule heavy templates first to avoid the "straggler effect" where one slow render delays overall completion.

estimate_template_weight

Estimate relative complexity of a template:

from kida.utils.workers import estimate_template_weight

weight = estimate_template_weight(template)
# 1.0 = average, >1 = heavy, <1 = light (capped at 5.0)

Weight factors:

  • Source size: +0.5 per 5KB above 5KB threshold
  • Block count: +0.1 per block above 3
  • Macro count: +0.2 per macro
  • Inheritance: +0.5 if extends another template
  • Includes: +0.15 per include statement
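
As a worked example, the factors above can be combined as shown below. This is a sketch under stated assumptions, not the library's implementation: `estimate_weight_sketch` and its parameters are hypothetical, the baseline is assumed to be 1.0, and the size bonus is assumed to scale continuously rather than in 5KB steps. How the real function produces weights below 1.0 is not described here, so this sketch never does.

```python
KB = 1024


def estimate_weight_sketch(source_bytes, blocks=0, macros=0, extends=False, includes=0):
    """Combine the listed weight factors into one relative-cost number."""
    weight = 1.0  # assumed baseline for an "average" template
    # Source size: +0.5 per 5KB above the 5KB threshold (continuous, by assumption).
    weight += 0.5 * max(0, source_bytes - 5 * KB) / (5 * KB)
    # Block count: +0.1 per block above 3.
    weight += 0.1 * max(0, blocks - 3)
    # Macro count: +0.2 per macro.
    weight += 0.2 * macros
    # Inheritance: flat +0.5 if the template extends another.
    if extends:
        weight += 0.5
    # Includes: +0.15 per include statement.
    weight += 0.15 * includes
    # The result is capped at 5.0.
    return min(weight, 5.0)


# 15KB source, 5 blocks, 2 macros, extends a base template, 2 includes:
# 1.0 + 1.0 + 0.2 + 0.4 + 0.5 + 0.3
print(round(estimate_weight_sketch(15 * KB, blocks=5, macros=2, extends=True, includes=2), 2))
```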

order_by_complexity

Sort templates for optimal parallel execution:

from kida.utils.workers import order_by_complexity

# Heavy templates first (default — best for parallel execution)
ordered = order_by_complexity(templates)

# Light templates first (useful for testing)
ordered = order_by_complexity(templates, descending=False)
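
The benefit of heavy-first ordering can be seen in a small simulation that does not depend on Kida at all. The sketch below models a pool in which each task goes to whichever worker frees up first, which approximates how a thread pool drains its queue; `makespan` and the task durations are made up for illustration.

```python
import heapq


def makespan(durations, workers):
    """Total wall time when tasks are taken in order by the first free worker."""
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)           # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


durations = [1, 1, 1, 1, 1, 1, 8]  # six light renders and one heavy one

light_first = makespan(sorted(durations), workers=2)
heavy_first = makespan(sorted(durations, reverse=True), workers=2)

print(light_first)  # 11: the heavy task starts last and runs alone
print(heavy_first)  # 8: light tasks fill the other worker while the heavy one runs
```

With light tasks first, both workers finish the small renders at time 3 and then one of them runs the 8-unit task alone. With heavy tasks first, the same work overlaps and finishes three units earlier.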

Workload Profiles

Inspect the tuning parameters for any workload/environment combination:

from kida.utils.workers import get_profile, WorkloadType

profile = get_profile(WorkloadType.RENDER)
print(profile.parallel_threshold)        # 10
print(profile.max_workers)               # 4
print(profile.free_threading_multiplier) # 1.5

WorkloadProfile Fields

| Field | Type | Description |
|-------|------|-------------|
| parallel_threshold | int | Minimum tasks before parallelizing |
| min_workers | int | Floor for worker count |
| max_workers | int | Ceiling for worker count |
| cpu_fraction | float | Fraction of cores to use (0.0-1.0) |
| free_threading_multiplier | float | Extra scaling when GIL is disabled |
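
One plausible way these fields combine into a worker count is sketched below. It is an illustration only: `ProfileSketch` and `workers_for` are hypothetical names, the real formula in `get_optimal_workers` may differ, and the `min_workers=1` and `cpu_fraction=0.5` values are assumptions (the threshold, ceiling, and multiplier come from the render profile printed above).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    parallel_threshold: int
    min_workers: int
    max_workers: int
    cpu_fraction: float
    free_threading_multiplier: float


def workers_for(profile, task_count, cpu_count=None, free_threaded=False):
    """Scale by cores, boost when the GIL is off, then clamp to the profile limits."""
    if task_count < profile.parallel_threshold:
        return 1  # below the threshold, run serially
    cores = cpu_count or os.cpu_count() or 1
    scaled = cores * profile.cpu_fraction
    if free_threaded:
        scaled *= profile.free_threading_multiplier
    clamped = max(profile.min_workers, min(profile.max_workers, int(scaled)))
    # Never start more workers than there are tasks.
    return max(1, min(clamped, task_count))


# min_workers and cpu_fraction are assumed values for this example.
render_local = ProfileSketch(
    parallel_threshold=10,
    min_workers=1,
    max_workers=4,
    cpu_fraction=0.5,
    free_threading_multiplier=1.5,
)

print(workers_for(render_local, 100, cpu_count=8, free_threaded=True))   # 8*0.5*1.5 = 6, capped to 4
print(workers_for(render_local, 100, cpu_count=4, free_threaded=False))  # 4*0.5 = 2
```

Clamping after scaling means the multiplier only matters on machines where the CPU-based count is still under the ceiling.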

Complete Example

from concurrent.futures import ThreadPoolExecutor
from kida import Environment, FileSystemLoader
from kida.utils.workers import (
    get_optimal_workers,
    order_by_complexity,
    should_parallelize,
    WorkloadType,
)

env = Environment(loader=FileSystemLoader("templates/"))

# Load and schedule templates
templates = [env.get_template(name) for name in env.loader.list_templates()]
ordered = order_by_complexity(templates)

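# Build render tasks. `pages` is assumed to be defined elsewhere, with one
# entry per template in the same order as `ordered`.
tasks = [(tmpl, {"page": page}) for tmpl, page in zip(ordered, pages, strict=True)]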

if should_parallelize(len(tasks)):
    workers = get_optimal_workers(
        len(tasks),
        workload_type=WorkloadType.RENDER,
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: t[0].render(**t[1]), tasks))
else:
    results = [tmpl.render(**ctx) for tmpl, ctx in tasks]

See Also