rosettes

Rosettes — Modern syntax highlighting for Python 3.14t.

A pure-Python syntax highlighter designed for free-threaded Python. All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Public API:

  • highlight(): Highlight code and return formatted HTML/terminal output
  • tokenize(): Get raw tokens for analysis or custom formatting
  • highlight_many(): Parallel highlighting for multiple code blocks
  • tokenize_many(): Parallel tokenization for multiple code blocks

Example:

>>> from rosettes import highlight
>>> html = highlight("def foo(): pass", "python")
>>> print(html)
<div class="highlight">...</div>

Design Philosophy:

Why hand-written state machines instead of regex?

  1. Security: Regex-based lexers are vulnerable to ReDoS attacks where crafted input causes exponential backtracking. State machines have O(n) guaranteed performance — no input can cause slowdown.

  2. Thread-Safety: Each tokenize() call uses only local variables. No shared mutable state means true parallelism on Python 3.14t without locks or synchronization.

  3. Predictability: Single-pass, character-by-character processing. No hidden backtracking, no surprising performance cliffs.

  4. Debuggability: Explicit state transitions are easier to trace than regex match failures.
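As a toy illustration of the approach (this is not rosettes' actual lexer), a hand-written state machine scans each character exactly once, so runtime is linear in the input no matter what the input contains:

```python
# Toy single-pass state-machine lexer: classifies runs of characters as
# NAME, NUMBER, or OP. Illustrative only; rosettes' real lexers are richer.
def toy_tokenize(code: str) -> list[tuple[str, str]]:
    tokens: list[tuple[str, str]] = []
    i, n = 0, len(code)
    while i < n:  # each character is examined exactly once: O(n)
        ch = code[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (code[i].isalnum() or code[i] == "_"):
                i += 1
            tokens.append(("NAME", code[start:i]))
        elif ch.isdigit():
            start = i
            while i < n and code[i].isdigit():
                i += 1
            tokens.append(("NUMBER", code[start:i]))
        else:
            tokens.append(("OP", ch))
            i += 1
    return tokens
```

Because there is no backtracking, no crafted input can force the loop to revisit a character, which is the property the regex-based approach cannot guarantee.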

Architecture:

Lexer Pipeline:

code → Lexer.tokenize() → Token stream → Formatter.format() → HTML/ANSI
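The pipeline can be sketched with stand-in components (the real Lexer and Formatter protocols live in rosettes._protocol and may differ in detail):

```python
from typing import NamedTuple

class Token(NamedTuple):
    # hypothetical token shape for this sketch
    type: str
    value: str

def tokenize(code: str) -> list[Token]:
    # stand-in for Lexer.tokenize(): code -> token stream
    return [Token("word" if w.isalpha() else "other", w) for w in code.split()]

def format_html(tokens: list[Token]) -> str:
    # stand-in for Formatter.format(): token stream -> HTML
    spans = "".join(f'<span class="{t.type}">{t.value}</span>' for t in tokens)
    return f'<div class="highlight">{spans}</div>'

html = format_html(tokenize("def foo"))
```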

Key Components:

  • rosettes._registry: Lazy lexer loading with O(1) alias lookup
  • rosettes._protocol: Lexer/Formatter contracts (Protocol-based)
  • rosettes.lexers._state_machine: Base class for all lexers
  • rosettes.formatters.html: Primary HTML output with semantic classes
  • rosettes.themes: Color palettes and CSS generation

Memory Model:

  • Lexers: Stateless singletons (cached via functools.cache)
  • Formatters: Immutable frozen dataclasses
  • Tokens: NamedTuples (minimal memory, hashable)
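The token memory model can be pictured with a NamedTuple (field names here are assumptions for illustration, not rosettes' exact definition):

```python
from typing import NamedTuple

class Token(NamedTuple):
    # hypothetical fields; the real rosettes Token may differ
    type: str
    value: str

t = Token("name.function", "foo")
# NamedTuples are plain tuples underneath: compact, immutable, and
# hashable, so tokens can be shared across threads and used as dict keys.
assert t == ("name.function", "foo")
assert hash(t) == hash(("name.function", "foo"))
```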

Thread-Safety:

All public APIs are thread-safe by design:

  • Lexers use only local variables during tokenization
  • Formatter state is immutable (frozen dataclasses)
  • Registry uses functools.cache for thread-safe memoization
Because the registry already memoizes lexers, wrapping get_lexer() in your own cache is redundant:

# ❌ WRONG: Caching lexer instances (already cached internally)
my_lexer_cache = {}
my_lexer_cache["python"] = get_lexer("python")

# ✅ CORRECT: Just call get_lexer() — it's cached via functools.cache
lexer = get_lexer("python")

Parallel Processing (3.14t):

Rosettes supports parallel tokenization for maximum performance on free-threaded Python. Use highlight_many() for multiple code blocks.
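The parallel path can be sketched with concurrent.futures; the highlight function below is a stand-in for rosettes.highlight, and the size threshold reflects the benchmark note that threads only pay off for larger batches:

```python
from concurrent.futures import ThreadPoolExecutor

def fake_highlight(code: str, language: str) -> str:
    # stand-in for rosettes.highlight(); purely illustrative
    return f'<div class="highlight">{language}:{len(code)}</div>'

def highlight_many_sketch(items: list[tuple[str, str]], max_workers: int = 4) -> list[str]:
    # For small batches, thread startup outweighs the win (see
    # Performance Notes), so fall back to a sequential loop.
    if len(items) < 8:
        return [fake_highlight(code, lang) for code, lang in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results
        return list(pool.map(lambda it: fake_highlight(*it), items))

results = highlight_many_sketch([("def foo(): pass", "python")] * 16)
```

This only achieves true parallelism on 3.14t because each call touches no shared mutable state, which is exactly what the hand-written lexers guarantee.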

Performance Notes:

  • Sequential: ~50µs per small code block
  • Parallel (4 workers): ~15µs per block for batches of 100+
  • Thread overhead makes parallel slower for < 8 items

Free-Threading Declaration:

This module declares itself safe for free-threaded Python via the Py_mod_gil attribute (PEP 703).

See Also:

  • rosettes._protocol: Lexer and Formatter protocol definitions
  • rosettes._registry: Language registry with alias support
  • rosettes.formatters.html: HTML formatter with semantic CSS classes
  • rosettes.themes: Theme palettes and CSS generation

Classes

HighlightItem

Item for highlight_many() with optional line highlighting.

Use for blocks that need hl_lines or show_linenos. Simple (code, lang) tuples remain supported for backward compatibility.

Attributes

  • code: str
  • language: str
  • hl_lines: frozenset[int] | None
  • show_linenos: bool
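A stand-in class with the documented attributes (the real one ships in rosettes) shows how per-block options travel with the code, alongside plain tuples:

```python
from __future__ import annotations
from typing import NamedTuple

class HighlightItem(NamedTuple):
    # mirrors the documented attributes; the defaults are assumptions
    code: str
    language: str
    hl_lines: frozenset[int] | None = None
    show_linenos: bool = False

# Plain (code, language) tuples and rich items can be mixed in one batch:
items = [
    ("x = 1", "python"),  # simple tuple, backward compatible
    HighlightItem("def foo(): pass", "python",
                  hl_lines=frozenset({1}), show_linenos=True),
]
```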

Functions

content_hash
def content_hash(code: str, language: str, hl_lines: frozenset[int] | set[int] | list[int] | None = None, show_linenos: bool = False) -> str

Compute deterministic hash for (code, language, hl_lines, show_linenos) for cache keys.

Use for block-level caching: same inputs always yield same hash. No normalization — whitespace changes produce different hashes (correct for cache invalidation).

Parameters

  • code (str): Source code string.
  • language (str): Language identifier.
  • hl_lines (frozenset[int] | set[int] | list[int] | None, default None): Optional set of 1-based line numbers to highlight.
  • show_linenos (bool, default False): Whether line numbers are shown.

Returns

str
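One way to use it for block-level caching; the hash construction below is a stand-in that only shares the documented signature and contract (identical inputs yield identical keys, any change, even whitespace, yields a new key):

```python
import hashlib

def content_hash_sketch(code, language, hl_lines=None, show_linenos=False):
    # stand-in with the documented signature; rosettes' real hashing
    # may differ, but the contract is the same
    key = (code, language, tuple(sorted(hl_lines or ())), show_linenos)
    return hashlib.sha256(repr(key).encode()).hexdigest()

cache: dict[str, str] = {}

def cached_highlight(code: str, language: str) -> str:
    key = content_hash_sketch(code, language)
    if key not in cache:
        cache[key] = f"<rendered {language}>"  # would call highlight(...)
    return cache[key]
```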
highlight
def highlight(code: str, language: str, formatter: str | Formatter = 'html', *, hl_lines: set[int] | frozenset[int] | None = None, show_linenos: bool = False, css_class: str | None = None, css_class_style: Literal['semantic', 'pygments', 'semantic-hybrid'] = 'semantic', start: int = 0, end: int | None = None) -> str

Highlight source code and return formatted output.

This is the primary high-level API for syntax highlighting. Thread-safe and suitable for concurrent use.

All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Parameters

  • code (str): The source code to highlight.
  • language (str): Language name or alias (e.g., 'python', 'py', 'js').
  • formatter (str | Formatter, default 'html'): Formatter name ('html', 'terminal', 'null') or instance.
  • hl_lines (set[int] | frozenset[int] | None, default None): Optional set of 1-based line numbers to highlight.
  • show_linenos (bool, default False): If True, include line numbers in output.
  • css_class (str | None, default None): Base CSS class for the code container (HTML only). Defaults to "rosettes" for semantic style, "highlight" for pygments.
  • css_class_style (Literal['semantic', 'pygments', 'semantic-hybrid'], default 'semantic'): Class naming style (HTML only):
      - "semantic" (default): readable classes like .syntax-function
      - "semantic-hybrid": role + token-type classes (e.g. .syntax-function .syntax-name-builtin)
      - "pygments": Pygments-compatible classes like .nf
  • start (int, default 0): Starting index in the source string.
  • end (int | None, default None): Optional ending index in the source string.

Returns

str
tokenize
def tokenize(code: str, language: str, start: int = 0, end: int | None = None) -> list[Token]

Tokenize source code without formatting.

Useful for analysis, custom formatting, or testing. Thread-safe.

All lexers are hand-written state machines with O(n) guaranteed performance and zero ReDoS vulnerability.

Parameters

  • code (str): The source code to tokenize.
  • language (str): Language name or alias.
  • start (int, default 0): Starting index in the source string.
  • end (int | None, default None): Optional ending index in the source string.

Returns

list[Token]
highlight_many
def highlight_many(items: Iterable[HighlightItemInput], *, formatter: str | Formatter = 'html', max_workers: int | None = None, css_class_style: Literal['semantic', 'pygments', 'semantic-hybrid'] = 'semantic') -> list[str]

Highlight multiple code blocks in parallel.

This is the recommended way to highlight many code blocks concurrently. On Python 3.14t (free-threaded), this provides true parallelism. On GIL Python, it still provides benefits via I/O overlapping.

Thread-safe by design: each lexer uses only local variables.

Parameters

  • items (Iterable[HighlightItemInput]): Iterable of (code, language) tuples or HighlightItem instances. HighlightItem supports hl_lines and show_linenos.
  • formatter (str | Formatter, default 'html'): Formatter name or instance.
  • max_workers (int | None, default None): Maximum number of threads. Defaults to min(4, CPU count), which benchmarking shows to be optimal.
  • css_class_style (Literal['semantic', 'pygments', 'semantic-hybrid'], default 'semantic'): Class naming style (HTML only).

Returns

list[str]
tokenize_many
def tokenize_many(items: Iterable[tuple[str, str]], *, max_workers: int | None = None) -> list[list[Token]]

Tokenize multiple code blocks in parallel.

Similar to highlight_many() but returns raw tokens instead of HTML. Useful for analysis, custom formatting, or when you need token data.

Thread-safe by design: each lexer uses only local variables.

Parameters

  • items (Iterable[tuple[str, str]]): Iterable of (code, language) tuples.
  • max_workers (int | None, default None): Maximum number of threads. Defaults to min(4, CPU count).

Returns

list[list[Token]]