Module

lexers._state_machine

Base class for hand-written state machine lexers.

Thread-safe, O(n) guaranteed, zero regex.

Design Philosophy:

Rosettes lexers are hand-written state machines rather than regex-based.

This design provides:

  1. Security: No ReDoS vulnerability. Crafted input cannot cause exponential backtracking because there IS no backtracking.

  2. Performance: O(n) guaranteed. Single pass, character by character. Predictable performance regardless of input.

  3. Thread-Safety: Each tokenize() call uses only local variables. No shared mutable state means true parallelism on Python 3.14t.

  4. Debuggability: Explicit state transitions are easier to trace than regex match failures.

Architecture:

StateMachineLexer provides:

  • Base class with shared character sets (DIGITS, IDENT_START, etc.)
  • Default tokenize_fast() implementation
  • Protocol-compatible interface

Helper functions provide common patterns (combined in the sketch after this list):

  • scan_while(): Advance while chars match set
  • scan_until(): Advance until char in set
  • scan_string(): Handle quoted strings with escapes
  • scan_triple_string(): Handle triple-quoted strings
  • scan_line_comment(): Scan to end of line
  • scan_block_comment(): Scan to end marker
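
For instance, a tiny dispatch loop can combine several of these helpers. This is a sketch, not rosettes' reference implementation: the import path, the lex_fragments name, and the assumption that each helper returns the index where scanning stopped are taken from the signatures documented under Functions below.

from rosettes.lexers._state_machine import (  # import path assumed
    scan_line_comment,
    scan_string,
    scan_while,
)

DIGITS = frozenset("0123456789")

def lex_fragments(code):
    """Yield (kind, text) pairs; unknown characters fall through as 'text'."""
    pos, end = 0, len(code)
    while pos < end:
        char = code[pos]
        if char in DIGITS:
            new_pos = scan_while(code, pos, DIGITS)
            yield "number", code[pos:new_pos]
        elif char in ('"', "'"):
            new_pos = scan_string(code, pos + 1, char)  # pos + 1: after the opening quote
            yield "string", code[pos:new_pos]
        elif char == "#":
            new_pos = scan_line_comment(code, pos + 1)
            yield "comment", code[pos:new_pos]
        else:
            new_pos = pos + 1
            yield "text", char
        pos = new_pos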

Adding New Languages:

To add a new language lexer:

  1. Create rosettes/lexers/{language}_sm.py
  2. Subclass StateMachineLexer
  3. Set the name, aliases, filenames, and mimetypes class attributes
  4. Implement the tokenize() method with character-by-character logic
  5. Add an entry to _LEXER_SPECS in rosettes/_registry.py
  6. Add tests in tests/lexers/test_{language}_sm.py

Example skeleton:

class MyLangStateMachineLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    filenames = ("*.ml",)
    mimetypes = ("text/x-mylang",)

    # Language-specific character sets
    KEYWORDS = frozenset({"if", "else", "while"})

    def tokenize(self, code, config=None, start=0, end=None):
        # Your tokenization logic here
        ...

Key rules:

  • Use only local variables (no self.state mutations)
  • Yield tokens as you find them (streaming)
  • Handle all characters (emit TEXT for unknown)
  • Use helper functions for common patterns

Performance Tips (illustrated by the standalone sketch after this list):

  • Use frozenset for keyword/operator lookups (O(1))
  • Use the scan_while/scan_until helpers for common patterns
  • Avoid string slicing in hot loops (use start/end indices)
  • Pre-compute character sets as class attributes
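
A standalone illustration of these tips (no rosettes imports; KEYWORDS, IDENT_CONT, and scan_word are hypothetical names): frozenset keyword lookup and index-based scanning, with a single slice taken only once the run is known.

KEYWORDS = frozenset({"if", "else", "while"})            # O(1) membership test
IDENT_CONT = frozenset("abcdefghijklmnopqrstuvwxyz_")    # pre-computed once

def scan_word(code, pos):
    # Advance using indices only; no per-character slicing inside the loop.
    end = len(code)
    while pos < end and code[pos] in IDENT_CONT:
        pos += 1
    return pos

code = "while x"
stop = scan_word(code, 0)
word = code[0:stop]                                      # slice once, at the end
kind = "keyword" if word in KEYWORDS else "name"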

See Also:

  • rosettes/_protocol.Lexer: Protocol that all lexers must satisfy
  • rosettes/_registry: How lexers are registered and looked up
  • rosettes/lexers/python_sm.py: Reference implementation

Classes

StateMachineLexer

Base class for hand-written state machine lexers.

Thread-safe: tokenize() uses only local variables. O(n) guaranteed: single pass, no backtracking.

Subclasses implement language-specific tokenization by overriding the tokenize() method with character-by-character logic.

Design Principles:

  1. No regex — character matching only
  2. Explicit state — no hidden backtracking
  3. Local variables only — thread-safe by design
  4. Single pass — O(n) guaranteed

Class Attributes:

  • name: Canonical language name (e.g., 'python')
  • aliases: Alternative names for registry lookup (e.g., ('py', 'python3'))
  • filenames: Glob patterns for file detection (e.g., ('*.py',))
  • mimetypes: MIME types for content detection

Shared Character Sets (used by the identifier-scanning sketch after this list):

  • DIGITS: '0'-'9'
  • HEX_DIGITS: '0'-'9', 'a'-'f', 'A'-'F'
  • LETTERS: 'a'-'z', 'A'-'Z'
  • IDENT_START: Letters + '_'
  • IDENT_CONT: IDENT_START + digits
  • WHITESPACE: Space, tab, newline, etc.
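
A sketch of how the shared sets drive an identifier scan. The import path is an assumption; the attribute names are the ones listed above, and scan_identifier is hypothetical.

from rosettes.lexers._state_machine import StateMachineLexer  # path assumed

def scan_identifier(code, pos):
    """Return the index just past an identifier starting at ``pos``."""
    if pos >= len(code) or code[pos] not in StateMachineLexer.IDENT_START:
        return pos
    end = len(code)
    pos += 1
    while pos < end and code[pos] in StateMachineLexer.IDENT_CONT:
        pos += 1
    return pos

assert scan_identifier("_total1 = 0", 0) == 7   # "_total1" spans indices 0-6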

Example Implementation:

class MyLangLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    KEYWORDS = frozenset({"if", "else"})

    def tokenize(self, code, config=None, start=0, end=None):
        pos = start
        end = len(code) if end is None else end
        line, col = 1, 1

        while pos < end:
            char = code[pos]
            # ... tokenization logic ...
            yield Token(TokenType.TEXT, char, line, col)
            pos += 1
            col += 1

Common Mistakes:

# ❌ WRONG: Storing state in instance variables
self.current_line = 1  # NOT thread-safe!

# ✅ CORRECT: Use local variables
line = 1

# ❌ WRONG: Using regex for matching
match = re.match(r'\d+', code[pos:])  # ReDoS vulnerable!

# ✅ CORRECT: Use scan_while helper
end_pos = scan_while(code, pos, self.DIGITS)

Attributes

Name Type Description
name str
aliases tuple[str, ...]
filenames tuple[str, ...]
mimetypes tuple[str, ...]
DIGITS frozenset[str]
HEX_DIGITS frozenset[str]
OCTAL_DIGITS frozenset[str]
BINARY_DIGITS frozenset[str]
LETTERS frozenset[str]
IDENT_START frozenset[str]
IDENT_CONT frozenset[str]
WHITESPACE frozenset[str]

Methods

tokenize → Iterator[Token]
Tokenize source code. Subclasses override this with language-specific logic.
def tokenize(self, code: str, config: LexerConfig | None = None, start: int = 0, end: int | None = None) -> Iterator[Token]
Parameters
Name Type Description
code str

The source code to tokenize.

config LexerConfig | None

Optional lexer configuration.

Default: None
start int

Starting index in the source string.

Default: 0
end int | None

Optional ending index in the source string.

Default: None
Returns
Iterator[Token]
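
Hypothetical usage, assuming a concrete subclass such as the MyLangLexer sketched above with a finished tokenize():

code = "if x:\n    y = 1\n"
lexer = MyLangLexer()
for token in lexer.tokenize(code):
    print(token)

# start/end restrict tokenization to a region of ``code`` without copying it:
for token in lexer.tokenize(code, start=6, end=len(code)):
    print(token)
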
tokenize_fast → Iterator[tuple[TokenType, str]]
def tokenize_fast(self, code: str, start: int = 0, end: int | None = None) -> Iterator[tuple[TokenType, str]]

Fast tokenization without position tracking.

Default implementation strips position info from tokenize(). Subclasses may override for further optimization.

Parameters
Name Type Description
code str

The source code to tokenize.

start int

Starting index in the source string.

Default: 0
end int | None

Optional ending index in the source string.

Default: None
Returns
Iterator[tuple[TokenType, str]]
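
Same assumption as above (a finished MyLangLexer); each yielded item is a (TokenType, str) pair rather than a full Token:

lexer = MyLangLexer()
for token_type, text in lexer.tokenize_fast("y = 1\n"):
    print(token_type, repr(text))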

Functions

scan_while → int
Advance position while characters are in char_set.
def scan_while(code: str, pos: int, char_set: frozenset[str]) -> int
Parameters
Name Type Description
code str

Source code string.

pos int

Starting position.

char_set frozenset[str]

Set of characters to match.

Returns
int
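
A hypothetical call; the import path is assumed, and the comment states the presumed return convention.

from rosettes.lexers._state_machine import scan_while  # path assumed

DIGITS = frozenset("0123456789")
code = "1234 + x"
new_pos = scan_while(code, 0, DIGITS)
# Presumably new_pos == 4: the first index whose character is not in DIGITS.
number = code[0:new_pos]
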
scan_until → int
Advance position until a character in char_set is found.
def scan_until(code: str, pos: int, char_set: frozenset[str]) -> int
Parameters
Name Type Description
code str

Source code string.

pos int

Starting position.

char_set frozenset[str]

Set of characters to stop at.

Returns
int
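
For example, scanning up to a delimiter (import path assumed; the stop index is the presumed return convention):

from rosettes.lexers._state_machine import scan_until  # path assumed

code = "name = value\nrest"
stop = scan_until(code, 0, frozenset("=\n"))
# Presumably stop is the index of '=', the first character found in the set.
lhs = code[0:stop]
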
scan_string → int
Scan a string literal, handling escapes.
def scan_string(code: str, pos: int, quote: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening quote.

quote str

The quote character (' or ").

Returns
int
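
For example (import path assumed; note that pos is the index after the opening quote, per the parameter description):

from rosettes.lexers._state_machine import scan_string  # path assumed

code = '"hello \\" world" + x'   # contains an escaped quote
new_pos = scan_string(code, 1, '"')
# Assumes the return value is the scan position just past the closing quote.
literal = code[0:new_pos]
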
scan_triple_string → int
Scan a triple-quoted string.
def scan_triple_string(code: str, pos: int, quote: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening triple quote.

quote str

The quote character (' or ").

Returns
int
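
For example (import path assumed; pos starts just after the opening triple quote):

from rosettes.lexers._state_machine import scan_triple_string  # path assumed

code = '"""spans\nmultiple lines"""\nx = 1'
new_pos = scan_triple_string(code, 3, '"')
# Assumes the return value is the scan position just past the closing quotes.
literal = code[0:new_pos]
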
scan_line_comment → int
Scan to end of line (for line comments).
def scan_line_comment(code: str, pos: int) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Starting position (after comment marker).

Returns
int
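
For example (import path assumed):

from rosettes.lexers._state_machine import scan_line_comment  # path assumed

code = "# a comment\nx = 1"
new_pos = scan_line_comment(code, 1)   # 1 = just after the '#' marker
# Presumably new_pos stops at the newline (or at end of string).
comment = code[0:new_pos]
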
scan_block_comment → int
Scan a block comment until end marker.
def scan_block_comment(code: str, pos: int, end_marker: str) -> int
Parameters
Name Type Description
code str

Source code.

pos int

Position after opening marker.

end_marker str

The closing marker (e.g., "*/" or "-->").

Returns
int
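
For example (import path assumed; pos starts just after the opening marker):

from rosettes.lexers._state_machine import scan_block_comment  # path assumed

code = "/* block comment */ rest"
new_pos = scan_block_comment(code, 2, "*/")   # 2 = just after the opening "/*"
# Assumes the return value is the scan position just past the "*/" marker.
comment = code[0:new_pos]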