Module rosettes

Rosettes — Modern syntax highlighting for Python 3.14t.

A pure-Python syntax highlighter designed for free-threaded Python. All lexers are hand-written state machines with guaranteed O(n) performance and no ReDoS vulnerabilities.

Public API:

  • highlight(): Highlight code and return formatted HTML/terminal output
  • tokenize(): Get raw tokens for analysis or custom formatting
  • highlight_many(): Parallel highlighting for multiple code blocks
  • tokenize_many(): Parallel tokenization for multiple code blocks

Example:

>>> from rosettes import highlight
>>> html = highlight("def foo(): pass", "python")
>>> print(html)
<div class="highlight">...</div>

Design Philosophy:

Why hand-written state machines instead of regex?

  1. Security: Regex-based lexers are vulnerable to ReDoS attacks, where crafted input causes exponential backtracking. State machines have guaranteed O(n) performance — no input can cause a slowdown.

  2. Thread-Safety: Each tokenize() call uses only local variables. No shared mutable state means true parallelism on Python 3.14t without locks or synchronization.

  3. Predictability: Single-pass, character-by-character processing. No hidden backtracking, no surprising performance cliffs.

  4. Debuggability: Explicit state transitions are easier to trace than regex match failures.
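
To make the contrast concrete, here is a minimal single-pass lexer in the same spirit. It is purely illustrative: the function name and token kinds are invented for this sketch, and it is far simpler than any real rosettes lexer.

def lex_simple(code: str) -> list[tuple[str, str]]:
    """Toy single-pass lexer: numbers, identifiers, everything else."""
    tokens: list[tuple[str, str]] = []
    i, n = 0, len(code)
    while i < n:  # every character is visited exactly once: O(n), no backtracking
        ch = code[i]
        if ch.isdigit():
            start = i
            while i < n and code[i].isdigit():
                i += 1
            tokens.append(("number", code[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (code[i].isalnum() or code[i] == "_"):
                i += 1
            tokens.append(("name", code[start:i]))
        else:
            tokens.append(("other", ch))
            i += 1
    return tokens  # only local variables were used: safe to call from many threads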

Architecture:

Lexer Pipeline:

code → Lexer.tokenize() → Token stream → Formatter.format() → HTML/ANSI

Key Components:

  • rosettes._registry: Lazy lexer loading with O(1) alias lookup
  • rosettes._protocol: Lexer/Formatter contracts (Protocol-based)
  • rosettes.lexers._state_machine: Base class for all lexers
  • rosettes.formatters.html: Primary HTML output with semantic classes
  • rosettes.themes: Color palettes and CSS generation
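
The Protocol contracts might look roughly like the sketch below; this is an assumption for illustration, and the actual definitions in rosettes._protocol may differ in names and parameters.

from typing import Any, Iterable, Protocol

class Lexer(Protocol):
    def tokenize(self, code: str) -> Iterable[Any]: ...

class Formatter(Protocol):
    def format(self, tokens: Iterable[Any]) -> str: ...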

Memory Model:

  • Lexers: Stateless singletons (cached via functools.cache)
  • Formatters: Immutable frozen dataclasses
  • Tokens: NamedTuples (minimal memory, hashable)
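
Illustrated with hypothetical shapes (the real field names and classes are not shown here):

from dataclasses import dataclass
from typing import NamedTuple

class Token(NamedTuple):  # NamedTuple: minimal memory, hashable
    kind: str
    value: str

@dataclass(frozen=True)  # frozen dataclass: immutable formatter state
class PlainTextFormatter:
    def format(self, tokens: list[Token]) -> str:
        return "".join(tok.value for tok in tokens)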

Thread-Safety:

All public APIs are thread-safe by design:

  • Lexers use only local variables during tokenization
  • Formatter state is immutable (frozen dataclasses)
  • Registry uses functools.cache for thread-safe memoization

Because lexers are already cached internally, there is no need to maintain your own cache (get_lexer is assumed here to be exposed by the registry):

# ❌ WRONG: Caching lexer instances (already cached internally)
my_lexer_cache = {}
my_lexer_cache["python"] = get_lexer("python")

# ✅ CORRECT: Just call get_lexer() — it's cached via functools.cache
lexer = get_lexer("python")

Parallel Processing (3.14t):

Rosettes supports parallel tokenization for maximum performance on free-threaded Python. Use highlight_many() for multiple code blocks.
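
For example:

>>> from rosettes import highlight_many
>>> pages = highlight_many([("def foo(): pass", "python"), ("const x = 1;", "js")])
>>> isinstance(pages, list)  # one formatted string per input block
True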

Performance Notes:

  • Sequential: ~50µs per small code block
  • Parallel (4 workers): ~15µs per block for batches of 100+
  • Thread overhead makes parallel execution slower for fewer than 8 items
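
Given those numbers, a small dispatcher can pick the faster path per batch. This is a sketch built on the threshold quoted above, not part of the rosettes API:

from rosettes import highlight, highlight_many

def highlight_batch(items):
    """Sequential below the ~8-item threshold, parallel above it (sketch)."""
    items = list(items)
    if len(items) < 8:  # thread overhead dominates small batches
        return [highlight(code, lang) for code, lang in items]
    return highlight_many(items)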

Free-Threading Declaration:

This module declares itself safe for free-threaded Python via the _Py_mod_gil attribute (PEP 703).

See Also:

  • rosettes._protocol: Lexer and Formatter protocol definitions
  • rosettes._registry: Language registry with alias support
  • rosettes.formatters.html: HTML formatter with semantic CSS classes
  • rosettes.themes: Theme palettes and CSS generation

Functions

highlight
def highlight(code: str, language: str, formatter: str | Formatter = 'html') -> str

Highlight source code and return formatted output.

This is the primary high-level API for syntax highlighting. Thread-safe and suitable for concurrent use.

All lexers are hand-written state machines with guaranteed O(n) performance and no ReDoS vulnerabilities.

Parameters:

  • code (str): The source code to highlight.
  • language (str): Language name or alias (e.g., 'python', 'py', 'js').
  • formatter (str | Formatter): Formatter name ('html', 'terminal', 'null') or instance. Default: 'html'.

Returns:

  str: The formatted output.
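
Example, using the 'terminal' formatter named above:

>>> from rosettes import highlight
>>> ansi = highlight("def foo(): pass", "python", formatter="terminal")
>>> isinstance(ansi, str)
True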

tokenize
def tokenize(code: str, language: str, start: int = 0, end: int | None = None) -> list[Token]

Tokenize source code without formatting.

Useful for analysis, custom formatting, or testing. Thread-safe.

All lexers are hand-written state machines with guaranteed O(n) performance and no ReDoS vulnerabilities.

Parameters:

  • code (str): The source code to tokenize.
  • language (str): Language name or alias.
  • start (int): Starting index in the source string. Default: 0.
  • end (int | None): Optional ending index in the source string. Default: None.

Returns:

  list[Token]
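
Example (the fields of Token are documented elsewhere, so values are not shown):

>>> from rosettes import tokenize
>>> tokens = tokenize("x = 1\ny = 2\n", "python")
>>> tail = tokenize("x = 1\ny = 2\n", "python", start=6)  # lex only the second line
>>> isinstance(tokens, list)
True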

highlight_many
def highlight_many(items: Iterable[tuple[str, str]]) -> list[str]

Highlight multiple code blocks in parallel.

This is the recommended way to highlight many code blocks concurrently. On Python 3.14t (free-threaded), this provides true parallelism. On GIL-enabled Python, it still provides benefits via I/O overlapping.

Thread-safe by design: each lexer uses only local variables.

Parameters:

  • items (Iterable[tuple[str, str]]): Iterable of (code, language) tuples.

Returns:

  list[str]

tokenize_many
def tokenize_many(items: Iterable[tuple[str, str]]) -> list[list[Token]]

Tokenize multiple code blocks in parallel.

Similar to highlight_many() but returns raw tokens instead of HTML. Useful for analysis, custom formatting, or when you need token data.

Thread-safe by design: each lexer uses only local variables.

Parameters:

  • items (Iterable[tuple[str, str]]): Iterable of (code, language) tuples.

Returns:

  list[list[Token]]
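
Example, counting tokens per block:

>>> from rosettes import tokenize_many
>>> token_lists = tokenize_many([("def f(): pass", "python"), ("const x = 1;", "js")])
>>> counts = [len(toks) for toks in token_lists]  # one token list per input block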

__getattr__
def __getattr__(name: str) -> object

Module-level getattr for free-threading declaration.

This allows Python to query whether this module is safe for free-threaded execution without enabling the GIL.

Parameters:

  • name (str): The attribute name being requested.

Returns:

  object
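
A plausible shape for this hook, assuming the _Py_mod_gil attribute named in the Free-Threading Declaration above (the returned value is an assumption for illustration, not confirmed internals):

def __getattr__(name: str) -> object:
    if name == "_Py_mod_gil":
        return 0  # hypothetical sentinel meaning "this module does not require the GIL"
    raise AttributeError(f"module 'rosettes' has no attribute {name!r}")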