Architecture

State machine lexer design and internals

4 min read 783 words

Rosettes uses hand-written state machine lexers instead of regular expressions. This design guarantees O(n) time complexity and eliminates ReDoS vulnerabilities.

State Machine Lexers

Every lexer is a finite state machine that processes input character-by-character:

flowchart LR INITIAL[INITIAL] -->|char| STRING[STRING] STRING -->|char| ESCAPE[ESCAPE] INITIAL -->|emit| T1[Token] STRING -->|emit| T2[Token] ESCAPE -->|emit| T3[Token]

Key Properties

Property Guarantee
Single-character lookahead O(n) time complexity
No backtracking No ReDoS possible
Immutable state Thread-safe
Local variables only No shared mutable state

How It Works

  1. 1

    Character-by-character processing

    The lexer reads one character at a time, deciding what to do based on the current state and the character:

    def tokenize(self, code: str) -> Iterator[Token]:
        state = State.INITIAL
        pos = 0
        
        while pos < len(code):
            char = code[pos]
            
            if state == State.INITIAL:
                if char == '"':
                    state = State.STRING
                    start = pos
                elif char.isdigit():
                    state = State.NUMBER
                    start = pos
                # ... more transitions
            
            elif state == State.STRING:
                if char == '\\':
                    state = State.ESCAPE
                elif char == '"':
                    yield Token(TokenType.STRING, code[start:pos+1])
                    state = State.INITIAL
            
            pos += 1
    
  2. 2

    No backtracking

    Unlike regex engines that may backtrack on failed matches, state machines make irrevocable decisions:

    Regex:      a+b  on "aaac"
                Tries: a, aa, aaa, then backtracks to try fewer a's
                
    State Machine: 
                Reads 'a' → accumulate
                Reads 'a' → accumulate  
                Reads 'a' → accumulate
                Reads 'c' → emit NAME token, continue
                No backtracking needed
    
  3. 3

    Predictable performance

    Because each character is processed exactly once, time complexity is always O(n):

    Input Size Time (State Machine) Time (Regex worst case)
    100 chars 0.01ms 0.01ms
    1,000 chars 0.1ms 0.1ms
    10,000 chars 1ms 1-10ms
    100,000 chars 10ms 10ms - ∞ (ReDoS)

Lexer Structure

Each lexer follows the same pattern:

class PythonStateMachineLexer(StateMachineLexer):
    name = "python"
    aliases = ("py", "python3", "py3")
    
    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        # State machine implementation
        ...
    
    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Optimized path: yields (type, value) tuples without line tracking
        ...

Two Tokenization Paths

Method Use Case Features
tokenize() Line highlighting, analysis Full line/column tracking
tokenize_fast() Maximum performance No line tracking

Thehighlight()function automatically chooses the appropriate path based on options.


Registry

Lexers are registered with their canonical name and aliases using a lazy-loading pattern:

from dataclasses import dataclass
from functools import cache

@dataclass(frozen=True, slots=True)
class LexerSpec:
    """Specification for lazy-loading a lexer."""
    module: str
    class_name: str
    aliases: tuple[str, ...] = ()

# Static registry — lexers loaded on-demand
_LEXER_SPECS: dict[str, LexerSpec] = {
    "python": LexerSpec(
        "rosettes.lexers.python_sm",
        "PythonStateMachineLexer",
        aliases=("py", "python3", "py3"),
    ),
    "javascript": LexerSpec(
        "rosettes.lexers.javascript_sm",
        "JavaScriptStateMachineLexer",
        aliases=("js", "jsx"),
    ),
    # ...
}

The registry uses functools.cachefor thread-safe memoization:

def get_lexer(name: str) -> StateMachineLexer:
    """Get a lexer by name or alias. Cached for performance."""
    canonical = _normalize_name(name)
    return _get_lexer_by_canonical(canonical)

@cache
def _get_lexer_by_canonical(canonical: str) -> StateMachineLexer:
    """Internal cached loader - keyed by canonical name."""
    spec = _LEXER_SPECS[canonical]
    module = import_module(spec.module)
    lexer_class = getattr(module, spec.class_name)
    return lexer_class()

Adding a New Lexer

To add a new language lexer:

  1. 1

    Create lexer module

    Createrosettes/lexers/mylang_sm.pywith a state machine class.

  2. 2

    Register in registry

    Add aLexerSpec entry in rosettes/_registry.py.

  3. 3

    Add tests

    Add tests intests/lexers/test_mylang.py.

Example minimal lexer:

from collections.abc import Iterator

from rosettes._config import LexerConfig
from rosettes._types import Token, TokenType
from rosettes.lexers._state_machine import StateMachineLexer

class MyLangStateMachineLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    
    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        if end is None:
            end = len(code)
        
        pos = start
        line = 1
        col = 1
        
        while pos < end:
            char = code[pos]
            # State machine logic here
            yield Token(TokenType.TEXT, char, line, col)
            pos += 1
            col += 1
    
    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Yields (type, value) tuples without line/col tracking
        if end is None:
            end = len(code)
        pos = start
        while pos < end:
            yield (TokenType.TEXT, code[pos])
            pos += 1

Next Steps