rosettes

Rosettes — Modern syntax highlighting for Python 3.14t.

A pure-Python syntax highlighter designed for free-threaded Python. All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Public API:

  • highlight(): Highlight code and return formatted HTML/terminal output
  • tokenize(): Get raw tokens for analysis or custom formatting
  • highlight_many(): Parallel highlighting for multiple code blocks
  • tokenize_many(): Parallel tokenization for multiple code blocks

Example:

>>> from rosettes import highlight
>>> html = highlight("def foo(): pass", "python")
>>> print(html)
<div class="highlight">...</div>

Design Philosophy:

Why hand-written state machines instead of regex?

  1. Security: Regex-based lexers are vulnerable to ReDoS attacks where crafted input causes exponential backtracking. State machines have O(n) guaranteed performance — no input can cause slowdown.

  2. Thread-Safety: Each tokenize() call uses only local variables. No shared mutable state means true parallelism on Python 3.14t without locks or synchronization.

  3. Predictability: Single-pass, character-by-character processing. No hidden backtracking, no surprising performance cliffs.

  4. Debuggability: Explicit state transitions are easier to trace than regex match failures.
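As a toy illustration of the approach (this is not rosettes' actual lexer), a hand-written state machine scans each character exactly once, so runtime is linear in the input no matter what the input contains:

```python
# Toy single-pass state-machine lexer: classifies runs of characters as
# NAME, NUMBER, or OP. Illustrative only; rosettes' real lexers are richer.
def toy_tokenize(code: str) -> list[tuple[str, str]]:
    tokens: list[tuple[str, str]] = []
    i, n = 0, len(code)
    while i < n:  # each character is examined exactly once: O(n)
        ch = code[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (code[i].isalnum() or code[i] == "_"):
                i += 1
            tokens.append(("NAME", code[start:i]))
        elif ch.isdigit():
            start = i
            while i < n and code[i].isdigit():
                i += 1
            tokens.append(("NUMBER", code[start:i]))
        else:
            tokens.append(("OP", ch))
            i += 1
    return tokens
```

Because there is no backtracking, no crafted input can force the loop to revisit a character, which is the property the regex-based approach cannot guarantee.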

Architecture:

Lexer Pipeline:

code → Lexer.tokenize() → Token stream → Formatter.format() → HTML/ANSI
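The pipeline can be sketched with stand-in components (the real Lexer and Formatter protocols live in rosettes._protocol and may differ in detail):

```python
from typing import NamedTuple

class Token(NamedTuple):
    # hypothetical token shape for this sketch
    type: str
    value: str

def tokenize(code: str) -> list[Token]:
    # stand-in for Lexer.tokenize(): code -> token stream
    return [Token("word" if w.isalpha() else "other", w) for w in code.split()]

def format_html(tokens: list[Token]) -> str:
    # stand-in for Formatter.format(): token stream -> HTML
    spans = "".join(f'<span class="{t.type}">{t.value}</span>' for t in tokens)
    return f'<div class="highlight">{spans}</div>'

html = format_html(tokenize("def foo"))
```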

Key Components:

  • rosettes._registry: Lazy lexer loading with O(1) alias lookup
  • rosettes._protocol: Lexer/Formatter contracts (Protocol-based)
  • rosettes.lexers._state_machine: Base class for all lexers
  • rosettes.formatters.html: Primary HTML output with semantic classes
  • rosettes.themes: Color palettes and CSS generation

Memory Model:

  • Lexers: Stateless singletons (cached via functools.cache)
  • Formatters: Immutable frozen dataclasses
  • Tokens: NamedTuples (minimal memory, hashable)
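The token memory model can be pictured with a NamedTuple (field names here are assumptions for illustration, not rosettes' exact definition):

```python
from typing import NamedTuple

class Token(NamedTuple):
    # hypothetical fields; the real rosettes Token may differ
    type: str
    value: str

t = Token("name.function", "foo")
# NamedTuples are plain tuples underneath: compact, immutable, and
# hashable, so tokens can be shared across threads and used as dict keys.
assert t == ("name.function", "foo")
assert hash(t) == hash(("name.function", "foo"))
```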

Thread-Safety:

All public APIs are thread-safe by design:

  • Lexers use only local variables during tokenization
  • Formatter state is immutable (frozen dataclasses)
  • Registry uses functools.cache for thread-safe memoization
Because the registry already memoizes lexers, wrapping get_lexer() in your own cache is redundant:

# ❌ WRONG: Caching lexer instances (already cached internally)
my_lexer_cache = {}
my_lexer_cache["python"] = get_lexer("python")

# ✅ CORRECT: Just call get_lexer() — it's cached via functools.cache
lexer = get_lexer("python")

Parallel Processing (3.14t):

Rosettes supports parallel tokenization for maximum performance on free-threaded Python. Use highlight_many() for multiple code blocks.
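The parallel path can be sketched with concurrent.futures; the highlight function below is a stand-in for rosettes.highlight, and the size threshold reflects the benchmark note that threads only pay off for larger batches:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_highlight(code: str, language: str) -> str:
    # stand-in for rosettes.highlight(); purely illustrative
    return f'<div class="highlight">{language}:{len(code)}</div>'

def highlight_many_sketch(items: list[tuple[str, str]], max_workers: int = 4) -> list[str]:
    # For small batches, thread startup outweighs the win (see
    # Performance Notes), so fall back to a sequential loop.
    if len(items) < 8:
        return [fake_highlight(code, lang) for code, lang in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(lambda it: fake_highlight(*it), items))

results = highlight_many_sketch([("def foo(): pass", "python")] * 16)
```

This only achieves true parallelism on 3.14t because each call touches no shared mutable state, which is exactly what the hand-written lexers guarantee.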

Performance Notes:

  • Sequential: ~50µs per small code block
  • Parallel (4 workers): ~15µs per block for batches of 100+
  • Thread overhead makes parallel slower for < 8 items

Free-Threading Declaration:

This module declares itself safe for free-threaded Python via the Py_mod_gil attribute (PEP 703).

See Also:

  • rosettes._protocol: Lexer and Formatter protocol definitions
  • rosettes._registry: Language registry with alias support
  • rosettes.formatters.html: HTML formatter with semantic CSS classes
  • rosettes.themes: Theme palettes and CSS generation

Classes

HighlightItem

Item for highlight_many() with optional line highlighting.

Use for blocks that need hl_lines or show_linenos. Simple (code, lang) tuples remain supported for backward compatibility.

Attributes

  • code: str
  • language: str
  • hl_lines: frozenset[int] | None
  • show_linenos: bool
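A stand-in class with the documented attributes (the real one ships in rosettes) shows how per-block options travel with the code, alongside plain tuples:

```python
from __future__ import annotations
from typing import NamedTuple

class HighlightItem(NamedTuple):
    # mirrors the documented attributes; the defaults are assumptions
    code: str
    language: str
    hl_lines: frozenset[int] | None = None
    show_linenos: bool = False

# Plain (code, language) tuples and rich items can be mixed in one batch:
items = [
    ("x = 1", "python"),  # simple tuple, backward compatible
    HighlightItem("def foo(): pass", "python",
                  hl_lines=frozenset({1}), show_linenos=True),
]
```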

Functions

content_hash
def content_hash(code: str, language: str, hl_lines: frozenset[int] | set[int] | list[int] | None = None, show_linenos: bool = False) -> str

Compute deterministic hash for (code, language, hl_lines, show_linenos) for cache keys.

Use for block-level caching: same inputs always yield same hash. No normalization — whitespace changes produce different hashes (correct for cache invalidation).

Parameters

  • code (str): Source code string.
  • language (str): Language identifier.
  • hl_lines (frozenset[int] | set[int] | list[int] | None, default None): Optional set of 1-based line numbers to highlight.
  • show_linenos (bool, default False): Whether line numbers are shown.

Returns

str
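One way to use it for block-level caching; the hash construction below is a stand-in that only shares the documented signature and contract (identical inputs yield identical keys, any change, even whitespace, yields a new key):

```python
import hashlib

def content_hash_sketch(code, language, hl_lines=None, show_linenos=False):
    # stand-in with the documented signature; rosettes' real hashing
    # may differ, but the contract is the same
    key = (code, language, tuple(sorted(hl_lines or ())), show_linenos)
    return hashlib.sha256(repr(key).encode()).hexdigest()

cache: dict[str, str] = {}

def cached_highlight(code: str, language: str) -> str:
    key = content_hash_sketch(code, language)
    if key not in cache:
        cache[key] = f"<rendered {language}>"  # would call highlight(...)
    return cache[key]
```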
highlight
def highlight(code: str, language: str, formatter: str | Formatter = 'html', *, hl_lines: set[int] | frozenset[int] | None = None, show_linenos: bool = False, css_class: str | None = None, css_class_style: Literal['semantic', 'pygments', 'semantic-hybrid'] = 'semantic', start: int = 0, end: int | None = None) -> str

Highlight source code and return formatted output.

This is the primary high-level API for syntax highlighting. Thread-safe and suitable for concurrent use.

All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Parameters

  • code (str): The source code to highlight.
  • language (str): Language name or alias (e.g., 'python', 'py', 'js').
  • formatter (str | Formatter, default 'html'): Formatter name ('html', 'terminal', 'null') or instance.
  • hl_lines (set[int] | frozenset[int] | None, default None): Optional set of 1-based line numbers to highlight.
  • show_linenos (bool, default False): If True, include line numbers in output.
  • css_class (str | None, default None): Base CSS class for the code container (HTML only). Defaults to "rosettes" for semantic style, "highlight" for pygments.
  • css_class_style (Literal['semantic', 'pygments', 'semantic-hybrid'], default 'semantic'): Class naming style (HTML only):
      - "semantic" (default): readable classes like .syntax-function
      - "semantic-hybrid": role + token-type classes (e.g. .syntax-function .syntax-name-builtin)
      - "pygments": Pygments-compatible classes like .nf
  • start (int, default 0): Starting index in the source string.
  • end (int | None, default None): Optional ending index in the source string.

Returns

str
tokenize
def tokenize(code: str, language: str, start: int = 0, end: int | None = None) -> list[Token]

Tokenize source code without formatting.

Useful for analysis, custom formatting, or testing. Thread-safe.

All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Parameters

  • code (str): The source code to tokenize.
  • language (str): Language name or alias.
  • start (int, default 0): Starting index in the source string.
  • end (int | None, default None): Optional ending index in the source string.

Returns

list[Token]
highlight_many
def highlight_many(items: Iterable[HighlightItemInput], *, formatter: str | Formatter = 'html', max_workers: int | None = None, css_class_style: Literal['semantic', 'pygments', 'semantic-hybrid'] = 'semantic') -> list[str]

Highlight multiple code blocks in parallel.

This is the recommended way to highlight many code blocks concurrently. On Python 3.14t (free-threaded), this provides true parallelism. On GIL Python, it still provides benefits via I/O overlapping.

Thread-safe by design: each lexer uses only local variables.

Parameters

  • items (Iterable[HighlightItemInput]): Iterable of (code, language) tuples or HighlightItem instances. HighlightItem supports hl_lines and show_linenos.
  • formatter (str | Formatter, default 'html'): Formatter name or instance.
  • max_workers (int | None, default None): Maximum number of threads. Defaults to min(4, CPU count), which benchmarking shows to be optimal.
  • css_class_style (Literal['semantic', 'pygments', 'semantic-hybrid'], default 'semantic'): Class naming style (HTML only).

Returns

list[str]
tokenize_many
def tokenize_many(items: Iterable[tuple[str, str]], *, max_workers: int | None = None) -> list[list[Token]]

Tokenize multiple code blocks in parallel.

Similar to highlight_many() but returns raw tokens instead of HTML. Useful for analysis, custom formatting, or when you need token data.

Thread-safe by design: each lexer uses only local variables.

Parameters

  • items (Iterable[tuple[str, str]]): Iterable of (code, language) tuples.
  • max_workers (int | None, default None): Maximum number of threads. Defaults to min(4, CPU count).

Returns

list[list[Token]]