# Architecture

URL: /docs/about/architecture/
Section: about
Tags: architecture, design

--------------------------------------------------------------------------------

# Architecture

Rosettes uses hand-written state machine lexers instead of regular expressions. This design guarantees O(n) time complexity and eliminates ReDoS (regular-expression denial-of-service) vulnerabilities.

## State Machine Lexers

Every lexer is a finite state machine that processes input character by character:

```
┌─────────────────────────────────────────────────────────────┐
│                     State Machine Lexer                      │
│                                                               │
│  ┌─────────┐   char    ┌─────────┐   char    ┌─────────┐     │
│  │ INITIAL │ ────────► │ STRING  │ ────────► │ ESCAPE  │     │
│  │  STATE  │           │  STATE  │           │  STATE  │     │
│  └─────────┘           └─────────┘           └─────────┘     │
│       │                     │                     │          │
│       │ emit                │ emit                │ emit     │
│       ▼                     ▼                     ▼          │
│    [Token]               [Token]               [Token]       │
└─────────────────────────────────────────────────────────────┘
```

### Key Properties

| Property | Guarantee |
|----------|-----------|
| **Single-character lookahead** | O(n) time complexity |
| **No backtracking** | No ReDoS possible |
| **Immutable state** | Thread-safe |
| **Local variables only** | No shared mutable state |

---

## How It Works

### 1. Character-by-Character Processing

The lexer reads one character at a time, deciding what to do based on the current state and the character:

```python
def tokenize(self, code: str) -> Iterator[Token]:
    state = State.INITIAL
    pos = 0

    while pos < len(code):
        char = code[pos]

        if state == State.INITIAL:
            if char == '"':
                state = State.STRING
                start = pos
            elif char.isdigit():
                state = State.NUMBER
                start = pos
            # ... more transitions

        elif state == State.STRING:
            if char == '\\':
                state = State.ESCAPE
            elif char == '"':
                yield Token(TokenType.STRING, code[start:pos+1])
                state = State.INITIAL

        pos += 1
```

### 2. No Backtracking

Unlike regex engines that may backtrack on failed matches, state machines make irrevocable decisions:

```
Regex: a+b on "aaac"
  Tries: a, aa, aaa, then backtracks to try fewer a's

State Machine:
  Reads 'a' → accumulate
  Reads 'a' → accumulate
  Reads 'a' → accumulate
  Reads 'c' → emit NAME token, continue
  No backtracking needed
```

### 3. Predictable Performance

Because each character is processed exactly once, time complexity is always O(n):

| Input Size | Time (State Machine) | Time (Regex worst case) |
|------------|----------------------|-------------------------|
| 100 chars | 0.01ms | 0.01ms |
| 1,000 chars | 0.1ms | 0.1ms |
| 10,000 chars | 1ms | 1-10ms |
| 100,000 chars | 10ms | 10ms - ∞ (ReDoS) |

---

## Lexer Structure

Each lexer follows the same pattern:

```python
class PythonStateMachineLexer(StateMachineLexer):
    name = "python"
    aliases = ("py", "python3", "py3")

    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        # State machine implementation
        ...

    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Optimized path: yields (type, value) tuples without line tracking
        ...
```

### Two Tokenization Paths

| Method | Use Case | Features |
|--------|----------|----------|
| `tokenize()` | Line highlighting, analysis | Full line/column tracking |
| `tokenize_fast()` | Maximum performance | No line tracking |

The `highlight()` function automatically chooses the appropriate path based on options.
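
For a concrete feel of the difference, here is a minimal usage sketch that runs both paths over the same source string. It assumes `get_lexer` (defined in `rosettes/_registry.py` and described in the Registry section below) is importable from the package root; if it is not re-exported there, import it from the registry module instead.

```python
# Minimal usage sketch: assumes get_lexer is re-exported from the package
# root; it is defined in rosettes/_registry.py (see Registry below).
from rosettes import get_lexer

lexer = get_lexer("py")            # aliases resolve to the canonical lexer
source = 'greeting = "hello"\n'

# Full path: Token objects with line/column information, suitable for
# line highlighting and analysis.
for token in lexer.tokenize(source):
    print(token)

# Fast path: bare (TokenType, value) tuples, no position tracking.
for token_type, value in lexer.tokenize_fast(source):
    print(token_type, value)
```

Both loops consume the same lexer instance; only the amount of positional metadata differs.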

---

## Registry

Lexers are registered with their canonical name and aliases using a lazy-loading pattern:

```python
from dataclasses import dataclass
from functools import cache
from importlib import import_module


@dataclass(frozen=True, slots=True)
class LexerSpec:
    """Specification for lazy-loading a lexer."""

    module: str
    class_name: str
    aliases: tuple[str, ...] = ()


# Static registry — lexers loaded on-demand
_LEXER_SPECS: dict[str, LexerSpec] = {
    "python": LexerSpec(
        "rosettes.lexers.python_sm",
        "PythonStateMachineLexer",
        aliases=("py", "python3", "py3"),
    ),
    "javascript": LexerSpec(
        "rosettes.lexers.javascript_sm",
        "JavaScriptStateMachineLexer",
        aliases=("js", "jsx"),
    ),
    # ...
}
```

The registry uses `functools.cache` for thread-safe memoization:

```python
def get_lexer(name: str) -> StateMachineLexer:
    """Get a lexer by name or alias. Cached for performance."""
    canonical = _normalize_name(name)
    return _get_lexer_by_canonical(canonical)


@cache
def _get_lexer_by_canonical(canonical: str) -> StateMachineLexer:
    """Internal cached loader - keyed by canonical name."""
    spec = _LEXER_SPECS[canonical]
    module = import_module(spec.module)
    lexer_class = getattr(module, spec.class_name)
    return lexer_class()
```

---

## Adding a New Lexer

To add a new language lexer:

1. Create `rosettes/lexers/mylang_sm.py` with a state machine class
2. Register in `rosettes/_registry.py` with a `LexerSpec` entry
3. Add tests in `tests/lexers/test_mylang.py`

Example minimal lexer:

```python
from collections.abc import Iterator

from rosettes._config import LexerConfig
from rosettes._types import Token, TokenType
from rosettes.lexers._state_machine import StateMachineLexer


class MyLangStateMachineLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)

    def tokenize(
        self,
        code: str,
        config: LexerConfig | None = None,
        *,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[Token]:
        if end is None:
            end = len(code)

        pos = start
        line = 1
        col = 1

        while pos < end:
            char = code[pos]
            # State machine logic here
            yield Token(TokenType.TEXT, char, line, col)
            if char == "\n":
                line += 1
                col = 1
            else:
                col += 1
            pos += 1

    def tokenize_fast(
        self,
        code: str,
        start: int = 0,
        end: int | None = None,
    ) -> Iterator[tuple[TokenType, str]]:
        # Yields (type, value) tuples without line/col tracking
        if end is None:
            end = len(code)

        pos = start
        while pos < end:
            yield (TokenType.TEXT, code[pos])
            pos += 1
```

---

## Next Steps

- [[docs/about/thread-safety|Thread Safety]] — How thread safety is achieved
- [[docs/about/performance|Performance]] — Benchmarks and optimization