Module

lexers._state_machine

Base class for hand-written state machine lexers.

Thread-safe, O(n) guaranteed, zero regex.

Design Philosophy:

Rosettes lexers are hand-written state machines rather than regex-based.

This design provides:

  1. Security: No ReDoS vulnerability. Crafted input cannot cause exponential backtracking because there IS no backtracking.

  2. Performance: O(n) guaranteed. Single pass, character by character. Predictable performance regardless of input.

  3. Thread-Safety: Each tokenize() call uses only local variables. No shared mutable state means true parallelism on Python 3.14t.

  4. Debuggability: Explicit state transitions are easier to trace than regex match failures.

Architecture:

StateMachineLexer provides:

  • Base class with shared character sets (DIGITS, IDENT_START, etc.)
  • Default tokenize_fast() implementation
  • Protocol-compatible interface

Helper functions provide common patterns (combined in the sketch after this list):

  • scan_while(): Advance while chars match set
  • scan_until(): Advance until char in set
  • scan_string(): Handle quoted strings with escapes
  • scan_triple_string(): Handle triple-quoted strings
  • scan_line_comment(): Scan to end of line
  • scan_block_comment(): Scan to end marker
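
For instance, a tiny dispatch loop can combine several of these helpers. This is a sketch, not rosettes' reference implementation: the import path, the lex_fragments name, and the assumption that each helper returns the index where scanning stopped are taken from the signatures documented under Functions below.

from rosettes.lexers._state_machine import (  # import path assumed
    scan_line_comment,
    scan_string,
    scan_while,
)

DIGITS = frozenset("0123456789")

def lex_fragments(code):
    """Yield (kind, text) pairs; unknown characters fall through as 'text'."""
    pos, end = 0, len(code)
    while pos < end:
        char = code[pos]
        if char in DIGITS:
            new_pos = scan_while(code, pos, DIGITS)
            yield "number", code[pos:new_pos]
        elif char in ('"', "'"):
            new_pos = scan_string(code, pos + 1, char)  # pos + 1: after the opening quote
            yield "string", code[pos:new_pos]
        elif char == "#":
            new_pos = scan_line_comment(code, pos + 1)
            yield "comment", code[pos:new_pos]
        else:
            new_pos = pos + 1
            yield "text", char
        pos = new_pos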

Adding New Languages:

To add a new language lexer:

  1. Create rosettes/lexers/{language}_sm.py
  2. Subclass StateMachineLexer
  3. Set the name, aliases, filenames, and mimetypes class attributes
  4. Implement the tokenize() method with character-by-character logic
  5. Add an entry to _LEXER_SPECS in rosettes/_registry.py
  6. Add tests in tests/lexers/test_{language}_sm.py

Example skeleton:

class MyLangStateMachineLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    filenames = ("*.ml",)
    mimetypes = ("text/x-mylang",)

    # Language-specific character sets
    KEYWORDS = frozenset({"if", "else", "while"})

    def tokenize(self, code, config=None, start=0, end=None):
        # Your tokenization logic here
        ...

Key rules:

  • Use only local variables (no self.state mutations)
  • Yield tokens as you find them (streaming)
  • Handle all characters (emit TEXT for unknown)
  • Use helper functions for common patterns

Performance Tips (illustrated by the standalone sketch after this list):

  • Use frozenset for keyword/operator lookups (O(1))
  • Use the scan_while/scan_until helpers for common patterns
  • Avoid string slicing in hot loops (use start/end indices)
  • Pre-compute character sets as class attributes
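
A standalone illustration of these tips (no rosettes imports; KEYWORDS, IDENT_CONT, and scan_word are hypothetical names): frozenset keyword lookup and index-based scanning, with a single slice taken only once the run is known.

KEYWORDS = frozenset({"if", "else", "while"})            # O(1) membership test
IDENT_CONT = frozenset("abcdefghijklmnopqrstuvwxyz_")    # pre-computed once

def scan_word(code, pos):
    # Advance using indices only; no per-character slicing inside the loop.
    end = len(code)
    while pos < end and code[pos] in IDENT_CONT:
        pos += 1
    return pos

code = "while x"
stop = scan_word(code, 0)
word = code[0:stop]                                      # slice once, at the end
kind = "keyword" if word in KEYWORDS else "name"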

See Also:

  • rosettes/_protocol.Lexer: Protocol that all lexers must satisfy
  • rosettes/_registry: How lexers are registered and looked up
  • rosettes/lexers/python_sm.py: Reference implementation

Classes

StateMachineLexer

Base class for hand-written state machine lexers.

Thread-safe: tokenize() uses only local variables. O(n) guaranteed: single pass, no backtracking.

Subclasses implement language-specific tokenization by overriding the tokenize() method with character-by-character logic.

Design Principles:

  1. No regex — character matching only
  2. Explicit state — no hidden backtracking
  3. Local variables only — thread-safe by design
  4. Single pass — O(n) guaranteed

Class Attributes:

  • name: Canonical language name (e.g., 'python')
  • aliases: Alternative names for registry lookup (e.g., ('py', 'python3'))
  • filenames: Glob patterns for file detection (e.g., ('*.py',))
  • mimetypes: MIME types for content detection

Shared Character Sets (used by the identifier-scanning sketch after this list):

  • DIGITS: '0'-'9'
  • HEX_DIGITS: '0'-'9', 'a'-'f', 'A'-'F'
  • LETTERS: 'a'-'z', 'A'-'Z'
  • IDENT_START: Letters + '_'
  • IDENT_CONT: IDENT_START + digits
  • WHITESPACE: Space, tab, newline, etc.
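
A sketch of how the shared sets drive an identifier scan. The import path is an assumption; the attribute names are the ones listed above, and scan_identifier is hypothetical.

from rosettes.lexers._state_machine import StateMachineLexer  # path assumed

def scan_identifier(code, pos):
    """Return the index just past an identifier starting at ``pos``."""
    if pos >= len(code) or code[pos] not in StateMachineLexer.IDENT_START:
        return pos
    end = len(code)
    pos += 1
    while pos < end and code[pos] in StateMachineLexer.IDENT_CONT:
        pos += 1
    return pos

assert scan_identifier("_total1 = 0", 0) == 7   # "_total1" spans indices 0-6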

Example Implementation:

class MyLangLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    KEYWORDS = frozenset({"if", "else"})

    def tokenize(self, code, config=None, start=0, end=None):
        pos = start
        end = len(code) if end is None else end
        line, col = 1, 1

        while pos < end:
            char = code[pos]
            # ... tokenization logic ...
            yield Token(TokenType.TEXT, char, line, col)
            pos += 1
            col += 1

Common Mistakes:

# ❌ WRONG: Storing state in instance variables
self.current_line = 1  # NOT thread-safe!

# ✅ CORRECT: Use local variables
line = 1

# ❌ WRONG: Using regex for matching
match = re.match(r'\d+', code[pos:])  # ReDoS vulnerable!

# ✅ CORRECT: Use scan_while helper
end_pos = scan_while(code, pos, self.DIGITS)

Attributes

Name Type Description
name str
aliases tuple[str, ...]
filenames tuple[str, ...]
mimetypes tuple[str, ...]
DIGITS frozenset[str]
HEX_DIGITS frozenset[str]
OCTAL_DIGITS frozenset[str]
BINARY_DIGITS frozenset[str]
LETTERS frozenset[str]
IDENT_START frozenset[str]
IDENT_CONT frozenset[str]
WHITESPACE frozenset[str]

Methods

tokenize → Iterator[Token]
Tokenize source code. Subclasses override this with language-specific logic.
def tokenize(self, code: str, config: LexerConfig | None = None, start: int = 0, end: int | None = None) -> Iterator[Token]
Parameters
Name Type Description
code str

The source code to tokenize.

config LexerConfig | None

Optional lexer configuration.

Default: None
start int

Starting index in the source string.

Default: 0
end int | None

Optional ending index in the source string.

Default: None
Returns
Iterator[Token]
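
Hypothetical usage, assuming a concrete subclass such as the MyLangLexer sketched above with a finished tokenize():

code = "if x:\n    y = 1\n"
lexer = MyLangLexer()
for token in lexer.tokenize(code):
    print(token)

# start/end restrict tokenization to a region of ``code`` without copying it:
for token in lexer.tokenize(code, start=6, end=len(code)):
    print(token)
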
tokenize_fast → Iterator[tuple[TokenType, str]]
def tokenize_fast(self, code: str, start: int = 0, end: int | None = None) -> Iterator[tuple[TokenType, str]]

Fast tokenization without position tracking.

Default implementation strips position info from tokenize(). Subclasses may override for further optimization.

Parameters
Name Type Description
code str

The source code to tokenize.

start int

Starting index in the source string.

Default: 0
end int | None

Optional ending index in the source string.

Default: None
Returns
Iterator[tuple[TokenType, str]]
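
Same assumption as above (a finished MyLangLexer); each yielded item is a (TokenType, str) pair rather than a full Token:

lexer = MyLangLexer()
for token_type, text in lexer.tokenize_fast("y = 1\n"):
    print(token_type, repr(text))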

Functions

scan_while → int
Advance position while characters are in char_set.
def scan_while(code: str, pos: int, char_set: frozenset[str]) -> int
Parameters
Name Type Description
code str

Source code string.

pos int

Starting position.

char_set frozenset[str]

Set of characters to match.

Returns
int
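
A hypothetical call; the import path is assumed, and the comment states the presumed return convention.

from rosettes.lexers._state_machine import scan_while  # path assumed

DIGITS = frozenset("0123456789")
code = "1234 + x"
new_pos = scan_while(code, 0, DIGITS)
# Presumably new_pos == 4: the first index whose character is not in DIGITS.
number = code[0:new_pos]
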
scan_until → int
Advance position until a character in char_set is found.
def scan_until(code: str, pos: int, char_set: frozenset[str]) -> int
Parameters
Name Type Description
code str

Source code string.

pos int

Starting position.

char_set frozenset[str]

Set of characters to stop at.

Returns
int
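
For example, scanning up to a delimiter (import path assumed; the stop index is the presumed return convention):

from rosettes.lexers._state_machine import scan_until  # path assumed

code = "name = value\nrest"
stop = scan_until(code, 0, frozenset("=\n"))
# Presumably stop is the index of '=', the first character found in the set.
lhs = code[0:stop]
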
scan_string → int
Scan a string literal, handling escapes.
def scan_string(code: str, pos: int, quote: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening quote.

quote str

The quote character (' or ").

Returns
int
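
For example (import path assumed; note that pos is the index after the opening quote, per the parameter description):

from rosettes.lexers._state_machine import scan_string  # path assumed

code = '"hello \\" world" + x'   # contains an escaped quote
new_pos = scan_string(code, 1, '"')
# Assumes the return value is the scan position just past the closing quote.
literal = code[0:new_pos]
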
scan_triple_string → int
Scan a triple-quoted string.
def scan_triple_string(code: str, pos: int, quote: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening triple quote.

quote str

The quote character (' or ").

Returns
int
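
For example (import path assumed; pos starts just after the opening triple quote):

from rosettes.lexers._state_machine import scan_triple_string  # path assumed

code = '"""spans\nmultiple lines"""\nx = 1'
new_pos = scan_triple_string(code, 3, '"')
# Assumes the return value is the scan position just past the closing quotes.
literal = code[0:new_pos]
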
scan_line_comment → int
Scan to end of line (for line comments).
def scan_line_comment(code: str, pos: int) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Starting position (after comment marker).

Returns
int
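
For example (import path assumed):

from rosettes.lexers._state_machine import scan_line_comment  # path assumed

code = "# a comment\nx = 1"
new_pos = scan_line_comment(code, 1)   # 1 = just after the '#' marker
# Presumably new_pos stops at the newline (or at end of string).
comment = code[0:new_pos]
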
scan_block_comment → int
Scan a block comment until end marker.
def scan_block_comment(code: str, pos: int, end_marker: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening marker.

end_marker str

The closing marker (e.g., "*/" or "-->").

Returns
int
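
For example (import path assumed; pos starts just after the opening marker):

from rosettes.lexers._state_machine import scan_block_comment  # path assumed

code = "/* block comment */ rest"
new_pos = scan_block_comment(code, 2, "*/")   # 2 = just after the opening "/*"
# Assumes the return value is the scan position just past the "*/" marker.
comment = code[0:new_pos]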