Architecture

State machine lexer design and internals

Rosettes uses hand-written state machine lexers instead of regular expressions. This design guarantees O(n) time complexity and eliminates ReDoS vulnerabilities.

State Machine Lexers

Every lexer is a finite state machine that processes input character-by-character:

┌─────────────────────────────────────────────────────────────┐
│                    State Machine Lexer                       │
│                                                              │
│  ┌─────────┐   char    ┌─────────┐   char    ┌─────────┐   │
│  │ INITIAL │ ────────► │ STRING  │ ────────► │ ESCAPE  │   │
│  │ STATE   │           │ STATE   │           │ STATE   │   │
│  └─────────┘           └─────────┘           └─────────┘   │
│      │                      │                     │         │
│      │ emit                 │ emit                │ emit    │
│      ▼                      ▼                     ▼         │
│  [Token]               [Token]               [Token]        │
└─────────────────────────────────────────────────────────────┘

Key Properties

Property                     Guarantee
Single-character lookahead   O(n) time complexity
No backtracking              No ReDoS possible
Immutable state              Thread-safe
Local variables only         No shared mutable state
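
Because every lexer keeps its working state in local variables, a single instance can be shared safely across threads. A minimal sketch of concurrent use, assuming get_lexer is importable from rosettes._registry (the import path is an assumption; see the Registry section below):

from concurrent.futures import ThreadPoolExecutor

from rosettes._registry import get_lexer  # assumed import path; see the Registry section

lexer = get_lexer("python")  # one shared instance; all tokenizer state lives in locals

def count_tokens(source: str) -> int:
    return sum(1 for _ in lexer.tokenize_fast(source))

sources = ['print("hello")\n', 'x = 1 + 2\n', 'def f(y):\n    return y\n']
with ThreadPoolExecutor() as pool:
    print(list(pool.map(count_tokens, sources)))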

How It Works

1. Character-by-Character Processing

The lexer reads one character at a time, deciding what to do based on the current state and the character:

def tokenize(self, code: str) -> Iterator[Token]:
    state = State.INITIAL
    pos = 0
    
    while pos < len(code):
        char = code[pos]
        
        if state == State.INITIAL:
            if char == '"':
                state = State.STRING
                start = pos
            elif char.isdigit():
                state = State.NUMBER
                start = pos
            # ... more transitions
        
        elif state == State.STRING:
            if char == '\\':
                state = State.ESCAPE
            elif char == '"':
                yield Token(TokenType.STRING, code[start:pos+1])
                state = State.INITIAL

        elif state == State.ESCAPE:
            # Consume the escaped character, then return to the string body
            state = State.STRING

        pos += 1

2. No Backtracking

Unlike regex engines that may backtrack on failed matches, state machines make irrevocable decisions:

Regex:      a+b  on "aaac"
            Greedily matches aaa, fails to match b at 'c', then backtracks to try aa, then a
            
State Machine: 
            Reads 'a' → accumulate
            Reads 'a' → accumulate  
            Reads 'a' → accumulate
            Reads 'c' → accumulate
            End of input → emit NAME token
            No backtracking needed
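
The same idea as runnable code: a minimal name scanner (illustrative only, not part of the Rosettes API) that only ever moves forward through the input:

def scan_name(code: str, pos: int) -> tuple[str, int]:
    """Accumulate identifier characters; each character is examined exactly once."""
    start = pos
    while pos < len(code) and code[pos].isalnum():
        pos += 1                        # always advance, never re-read earlier input
    return code[start:pos], pos         # the NAME lexeme and the position to resume at

lexeme, next_pos = scan_name("aaac", 0)
print(lexeme, next_pos)                 # -> aaac 4, reached in a single pass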

3. Predictable Performance

Because each character is processed exactly once, time complexity is always O(n):

Input Size      Time (State Machine)   Time (Regex worst case)
100 chars       0.01ms                 0.01ms
1,000 chars     0.1ms                  0.1ms
10,000 chars    1ms                    1-10ms
100,000 chars   10ms                   10ms - ∞ (ReDoS)
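
A rough way to observe the linear scaling locally (a sketch: the import path is assumed, and absolute timings depend on the machine):

import timeit

from rosettes._registry import get_lexer  # assumed import path

lexer = get_lexer("python")
for repeats in (20, 200, 2_000, 20_000):
    source = 'x = "hi"\n' * repeats               # roughly 9 * repeats characters
    seconds = timeit.timeit(
        lambda: sum(1 for _ in lexer.tokenize_fast(source)), number=5
    )
    print(f"{len(source):>7} chars: {seconds / 5 * 1000:.2f} ms")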

Lexer Structure

Each lexer follows the same pattern:

class PythonStateMachineLexer(StateMachineLexer):
    name = "python"
    aliases = ("py", "python3", "py3")
    
    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        # State machine implementation
        ...
    
    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Optimized path: yields (type, value) tuples without line tracking
        ...

Two Tokenization Paths

Method            Use Case                      Features
tokenize()        Line highlighting, analysis   Full line/column tracking
tokenize_fast()   Maximum performance           No line tracking

The highlight() function automatically chooses the appropriate path based on options.
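
Both paths can also be called directly on a lexer instance. A sketch, assuming get_lexer is imported from the registry module:

from rosettes._registry import get_lexer  # assumed import path

lexer = get_lexer("python")
code = 'greeting = "hello"\n'

# Full path: Token objects carrying line/column information
for token in lexer.tokenize(code):
    print(token)

# Fast path: bare (TokenType, value) tuples, no position tracking
for token_type, value in lexer.tokenize_fast(code):
    print(token_type, value)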


Registry

Lexers are registered with their canonical name and aliases using a lazy-loading pattern:

from dataclasses import dataclass
from functools import cache

@dataclass(frozen=True, slots=True)
class LexerSpec:
    """Specification for lazy-loading a lexer."""
    module: str
    class_name: str
    aliases: tuple[str, ...] = ()

# Static registry — lexers loaded on-demand
_LEXER_SPECS: dict[str, LexerSpec] = {
    "python": LexerSpec(
        "rosettes.lexers.python_sm",
        "PythonStateMachineLexer",
        aliases=("py", "python3", "py3"),
    ),
    "javascript": LexerSpec(
        "rosettes.lexers.javascript_sm",
        "JavaScriptStateMachineLexer",
        aliases=("js", "jsx"),
    ),
    # ...
}

The registry uses functools.cache for thread-safe memoization:

def get_lexer(name: str) -> StateMachineLexer:
    """Get a lexer by name or alias. Cached for performance."""
    canonical = _normalize_name(name)
    return _get_lexer_by_canonical(canonical)

@cache
def _get_lexer_by_canonical(canonical: str) -> StateMachineLexer:
    """Internal cached loader - keyed by canonical name."""
    spec = _LEXER_SPECS[canonical]
    module = import_module(spec.module)
    lexer_class = getattr(module, spec.class_name)
    return lexer_class()
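
As a consequence, every alias resolves to the same cached instance (a sketch, with the import path assumed as above):

from rosettes._registry import get_lexer  # assumed import path

assert get_lexer("py") is get_lexer("python")        # alias → canonical name → cached instance
assert get_lexer("js") is get_lexer("javascript")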

Adding a New Lexer

To add a new language lexer:

  1. Create rosettes/lexers/mylang_sm.py with a state machine class
  2. Register in rosettes/_registry.py with a LexerSpec entry (see the registry sketch after the example below)
  3. Add tests in tests/lexers/test_mylang.py

Example minimal lexer:

from collections.abc import Iterator

from rosettes._config import LexerConfig
from rosettes._types import Token, TokenType
from rosettes.lexers._state_machine import StateMachineLexer

class MyLangStateMachineLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    
    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        if end is None:
            end = len(code)
        
        pos = start
        line = 1
        col = 1
        
        while pos < end:
            char = code[pos]
            # State machine logic here
            yield Token(TokenType.TEXT, char, line, col)
            pos += 1
            if char == '\n':
                line += 1   # advance line tracking on newlines
                col = 1
            else:
                col += 1
    
    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Yields (type, value) tuples without line/col tracking
        if end is None:
            end = len(code)
        pos = start
        while pos < end:
            yield (TokenType.TEXT, code[pos])
            pos += 1
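
For step 2, the registry entry for the new lexer would follow the same pattern as the existing ones (a sketch of the _LEXER_SPECS entry to add in rosettes/_registry.py):

# Inside the _LEXER_SPECS dict in rosettes/_registry.py:
"mylang": LexerSpec(
    "rosettes.lexers.mylang_sm",
    "MyLangStateMachineLexer",
    aliases=("ml",),
),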

Next Steps