Module

_parallel

Parallel tokenization for free-threaded Python (3.14t+).

Enables true parallel tokenization of large files by splitting at safe boundaries and processing chunks concurrently.

Design Philosophy:

This module exists for one purpose: maximum throughput on Python 3.14t.

On GIL Python (3.13 and earlier), threads cannot truly parallelize CPU-bound work. On free-threaded Python 3.14t they can, and because Rosettes lexers use only local variables, chunks can be tokenized truly in parallel.

Architecture:

  1. Split Detection: Find safe split points (newlines) to avoid cutting tokens in half
  2. Chunking: Divide code into ~64KB chunks with position metadata
  3. Parallel Execution: Tokenize chunks using ThreadPoolExecutor
  4. Line Adjustment: Fix line numbers for chunks after the first
  5. Ordered Merge: Yield tokens in original source order
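
The five steps above compose roughly as in the sketch below. The ~64KB chunk size comes from the list above; the lexer method name (lexer.tokenize) and the omitted token line fields are assumptions for illustration, not the module's exact internals.

    from concurrent.futures import ThreadPoolExecutor

    def _pipeline_sketch(lexer, code):
        splits = _find_safe_splits(code, target_chunk_size=64 * 1024)  # 1. split detection
        chunks = _make_chunks(code, splits)                            # 2. chunking
        with ThreadPoolExecutor() as pool:                             # 3. parallel execution
            # pool.map() preserves input order, which gives the ordered merge for free.
            results = list(pool.map(lambda c: list(lexer.tokenize(c.text)), chunks))
        for chunk, tokens in zip(chunks, results):
            # 4. line adjustment: tokens from chunks after the first need their
            #    line numbers shifted by chunk.start_line - 1 (token fields omitted here)
            yield from tokens                                          # 5. ordered merge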

When to Use:

  • ✅ Large files (>128KB) on Python 3.14t

  • ✅ Batch processing many files with highlight_many()

  • ❌ Small files (<128KB) — sequential is faster (thread overhead)

  • ❌ GIL Python — no parallelism benefit

Performance:

  • Sequential: ~50µs per 100-line file
  • Parallel (4 workers, 3.14t): ~15µs per file for batches of 100+

The crossover point is ~8 items or ~128KB of code.
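
A caller-side dispatch based on those crossover figures might look like the sketch below. The 128KB threshold is the one quoted above; tokenize_best is a hypothetical helper name and the sequential lexer.tokenize call is an assumption.

    PARALLEL_THRESHOLD = 128 * 1024  # the ~128KB crossover quoted above

    def tokenize_best(lexer, code):
        # Hypothetical dispatch helper: go parallel only when it pays off.
        if is_free_threaded() and len(code) >= PARALLEL_THRESHOLD:
            return tokenize_parallel(lexer, code)
        return lexer.tokenize(code)  # sequential path (method name assumed)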

Thread-Safety:

Safe by design:

  • Lexers use only local variables
  • Chunks are independent (no shared state)
  • Token lists are created per-chunk, then merged

Limitations:

  • Splitting at newlines may not be safe for all languages (e.g., heredocs spanning lines; see the example after this list). This is rare in practice.
  • Memory: Holds all chunk results before yielding
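
To illustrate the first limitation, here is a small Python snippet in which a triple-quoted string plays the role of a heredoc:

    query = """
    SELECT *
    FROM users
    """
    # A chunk boundary at any newline between the opening and closing quotes
    # would start the next chunk inside the string; tokenized on its own, that
    # chunk reads as bare SQL followed by an unterminated string.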

See Also:

  • rosettes.highlight_many: High-level parallel API
  • rosettes.tokenize_many: Parallel tokenization without formatting

Classes

_Chunk

A chunk of source code with position metadata.

Attributes

Name          Type  Description
text          str   This chunk's slice of the source code.
start_offset  int   Character offset of the chunk within the original source.
start_line    int   Line number at which the chunk starts in the original source.
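
A minimal sketch of what this container might look like, assuming a frozen dataclass; the real class may differ in details such as slots or defaults.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class _Chunk:
        text: str          # this chunk's slice of the source code
        start_offset: int  # character offset of the chunk in the original source
        start_line: int    # line number (assumed 1-based) where the chunk begins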

Functions

is_free_threaded

def is_free_threaded() -> bool

Check if running on free-threaded Python (3.14t+).

Returns
bool
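
A minimal sketch of how such a check can be written, using CPython's sys._is_gil_enabled() (added in 3.13); the actual implementation may differ.

    import sys

    def is_free_threaded() -> bool:
        # sys._is_gil_enabled() exists on CPython 3.13+; on older interpreters
        # assume the GIL is enabled.
        gil_check = getattr(sys, "_is_gil_enabled", None)
        return gil_check is not None and not gil_check()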
_find_safe_splits

def _find_safe_splits(code: str, target_chunk_size: int) -> list[int]

Find safe split points (newlines) for parallel tokenization.

We split at newlines to avoid splitting in the middle of tokens. This is a heuristic that works for most languages.

Parameters
Name               Type  Description
code               str   Source code to split.
target_chunk_size  int   Target size for each chunk.

Returns
list[int]
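
A hedged sketch of this heuristic: walk forward in target_chunk_size steps and snap each boundary to the next newline. The real implementation may scan differently or handle edge cases it does not show.

    def _find_safe_splits(code: str, target_chunk_size: int) -> list[int]:
        splits: list[int] = []
        pos = target_chunk_size
        while pos < len(code):
            newline = code.find("\n", pos)  # snap forward to the next newline
            if newline == -1:
                break                       # no newline left; last chunk runs to the end
            splits.append(newline + 1)      # split just after the newline
            pos = newline + 1 + target_chunk_size
        return splits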
_make_chunks

def _make_chunks(code: str, splits: list[int]) -> list[_Chunk]

Split code into chunks at the given positions.

Parameters
Name    Type       Description
code    str        Source code to split.
splits  list[int]  List of positions to split at.

Returns
list[_Chunk]
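
A sketch of how chunking with position metadata might look, pairing each slice with its offset and starting line; it assumes the _Chunk shape sketched above.

    def _make_chunks(code: str, splits: list[int]) -> list[_Chunk]:
        boundaries = [0, *splits, len(code)]
        chunks: list[_Chunk] = []
        for start, end in zip(boundaries, boundaries[1:]):
            chunks.append(
                _Chunk(
                    text=code[start:end],
                    start_offset=start,
                    # Lines assumed 1-based: count newlines before the chunk starts.
                    start_line=code.count("\n", 0, start) + 1,
                )
            )
        return chunks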
tokenize_parallel

def tokenize_parallel(lexer: StateMachineLexer, code: str) -> Iterator[Token]

Parallel tokenization for large files.

Only beneficial on free-threaded Python (3.14t+). Falls back to sequential on GIL Python.

Parameters
Name   Type               Description
lexer  StateMachineLexer  The lexer to use.
code   str                Source code to tokenize.

Returns
Iterator[Token]
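
A hedged usage sketch. Constructing a lexer is outside this module, so my_lexer and big_module.py below are placeholders for a StateMachineLexer instance and input file your code already has.

    from pathlib import Path

    code = Path("big_module.py").read_text(encoding="utf-8")

    # my_lexer: any StateMachineLexer instance your application already uses.
    for token in tokenize_parallel(my_lexer, code):
        ...

On GIL Python this falls back to sequential tokenization, so the call is safe either way.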