Module

lexers.python_sm

Hand-written Python lexer using state machine approach.

O(n) guaranteed, zero regex, thread-safe.

Design Philosophy:

This is the reference implementation for Rosettes lexers. It demonstrates:

  1. State Machine Architecture: Character-by-character processing with explicit state (position, line, column) as local variables.

  2. Frozen Lookup Tables: Keywords, builtins, and operators as frozensets for O(1) membership testing and thread-safety.

  3. Fast Path / Slow Path: Simple cases (identifiers, operators) handled inline; complex cases (strings, numbers) delegated to helper methods.
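The frozen-table idea in point 2 can be sketched for operator matching. This is a minimal illustration, not the module's actual code: the table contents are invented for the example, and the names _THREE_CHAR_OPS and _ONE_CHAR_OPS and the helper match_operator are assumptions (only _TWO_CHAR_OPS is named in this documentation).

```python
# Hypothetical table contents; the real tables are not shown in this document.
_THREE_CHAR_OPS = frozenset({"**=", "//=", ">>=", "<<="})
_TWO_CHAR_OPS = frozenset({":=", "==", "!=", "<=", ">=", "->", "**", "//",
                           "+=", "-=", "*=", "/=", "<<", ">>"})
_ONE_CHAR_OPS = frozenset("+-*/%@&|^~<>=")


def match_operator(code: str, pos: int) -> int:
    """Return the length of the longest operator starting at pos, or 0."""
    # Try the widest table first so ":=" is not split into ":" and "=".
    for width, table in ((3, _THREE_CHAR_OPS), (2, _TWO_CHAR_OPS), (1, _ONE_CHAR_OPS)):
        # A slice near the end of input is simply shorter and fails the lookup.
        if code[pos:pos + width] in table:
            return width
    return 0
```

Each lookup is a hash-set membership test on a slice of at most three characters, so the cost per operator is constant, and because a frozenset cannot be mutated the tables can be shared between threads without locking.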

Architecture:

Main Loop (tokenize):

pos = 0
while pos < length:
    char = code[pos]
    # Dispatch based on first character
    if char is whitespace: ...
    elif char is comment: ...
    elif char is string: ...
    # etc.
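
A runnable version of the loop above might look like the following sketch. It assumes a simplified token shape: Tok is a hypothetical stand-in for the Rosettes Token class, the kind labels are plain strings rather than TokenType members, and only whitespace, comments, and names get their own branch.

```python
from typing import Iterator, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for the real Token class."""
    kind: str
    value: str
    line: int
    col: int


def tokenize_sketch(code: str) -> Iterator[Tok]:
    # All state lives in local variables, as the docstring describes.
    pos, line, col = 0, 1, 1
    length = len(code)
    while pos < length:
        char = code[pos]
        start = pos
        if char == "\n":
            pos += 1
            yield Tok("WHITESPACE", "\n", line, col)
            line += 1
            col = 1
            continue
        if char in " \t":
            while pos < length and code[pos] in " \t":
                pos += 1
            kind = "WHITESPACE"
        elif char == "#":
            # A comment runs up to, but not including, the newline.
            while pos < length and code[pos] != "\n":
                pos += 1
            kind = "COMMENT"
        elif char.isalpha() or char == "_":
            while pos < length and (code[pos].isalnum() or code[pos] == "_"):
                pos += 1
            kind = "NAME"
        else:
            # Anything else is emitted one character at a time in this sketch.
            pos += 1
            kind = "OTHER"
        yield Tok(kind, code[start:pos], line, col)
        col += pos - start
```

Every branch advances pos by at least one character and never moves it backwards, which is what makes the single pass linear in the input length.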

Helper Methods:

  • _scan_string_literal(): Handles prefixed and triple-quoted strings
  • _scan_number(): Handles int, float, hex, octal, binary, complex
  • _classify_word(): Maps identifiers to KEYWORD, BUILTIN, NAME

Python Language Support:

  • All Python 3.x syntax including 3.14
  • F-strings (prefix detection)
  • Type hints (annotations)
  • Walrus operator (:=)
  • Match/case statements (3.10+)
  • Type parameter syntax (3.12+)
  • Unicode identifiers (PEP 3131)

Performance:

  • ~50µs per 100-line file
  • O(n) guaranteed (single pass, no backtracking)
  • ~500 tokens/ms throughput

Thread-Safety:

All state is local to tokenize(). Class attributes are frozen:

  • _KEYWORDS: frozenset
  • _BUILTINS: frozenset
  • _TWO_CHAR_OPS: frozenset
  • etc.
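
The thread-safety argument can be illustrated with a small sketch. WordLexer is a hypothetical toy class, not part of Rosettes; it only mirrors the two properties named above (a frozen class-level table and purely local scanning state), so that one instance can be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor


class WordLexer:
    """Hypothetical toy lexer that splits on whitespace."""

    # Immutable and shared by every thread; never written after class creation.
    _KEYWORDS = frozenset({"if", "else", "while"})

    def tokenize(self, code):
        # pos and length are locals, so concurrent calls cannot interfere.
        pos = 0
        length = len(code)
        while pos < length:
            if code[pos].isspace():
                pos += 1
                continue
            start = pos
            while pos < length and not code[pos].isspace():
                pos += 1
            word = code[start:pos]
            yield ("KEYWORD" if word in self._KEYWORDS else "NAME", word)


lexer = WordLexer()  # one instance shared by all worker threads
sources = ["if x else y", "while z", "a b c"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda s: list(lexer.tokenize(s)), sources))
```

Because tokenize() never assigns to self or to the class, no locks are needed; each call carries its own position counter.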

Adding New Lexers:

Use this file as a template. Key patterns to follow:

  1. Frozen lookup tables (frozenset) as class or module constants
  2. Local variables for all state (pos, line, col)
  3. Character-by-character dispatch in main loop
  4. Helper methods for complex constructs

See Also:

  • rosettes.lexers._state_machine: Base class and helper functions
  • rosettes._registry: How lexers are registered

Classes

PythonStateMachineLexer

Hand-written Python 3 lexer.

O(n) guaranteed, zero regex, thread-safe. Handles all Python 3.x syntax including f-strings, type hints, walrus operator.

This is the reference implementation for Rosettes lexers. Use it as a template when adding new language support.

Performance: ~50µs per 100-line file, ~500 tokens/ms throughput.

Attributes

  • name: Canonical language name ("python")
  • aliases: Alternative names for registry lookup ("py", "python3", "py3")
  • filenames: Glob patterns for file detection ("*.py", "*.pyw", "*.pyi")
  • mimetypes: MIME types ("text/x-python", "application/x-python")

Thread-Safety: All class attributes are frozen (frozenset). The tokenize() method uses only local variables for state (pos, line, col).

Methods

tokenize

def tokenize(self, code: str, config: LexerConfig | None = None) -> Iterator[Token]

Tokenize Python source code.

Single-pass, character-by-character. O(n) guaranteed.

Parameters

  • code: str
  • config: LexerConfig | None (default: None)

Returns

Iterator[Token]

Internal Methods

_scan_string_literal

def _scan_string_literal(self, code: str, pos: int) -> tuple[TokenType, int, int]

Scan a string literal with optional prefix.

Returns (token_type, end_position, newline_count).

Parameters

  • code: str
  • pos: int

Returns

tuple[TokenType, int, int]
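
A sketch of how such a scanner could produce its three return values is shown below. It is an illustration under stated assumptions, not the method's actual body: the prefix alphabet, the string labels "STRING" and "FSTRING" (standing in for TokenType members), and the handling of unterminated strings are all invented for the example.

```python
_PREFIX_CHARS = frozenset("rRbBfFuU")  # assumed prefix alphabet


def scan_string_literal(code: str, pos: int) -> tuple[str, int, int]:
    """Return (token_type, end_position, newline_count) for the literal at pos.

    Assumes the caller has already checked that a quote follows the prefix.
    """
    length = len(code)
    start = pos
    # Optional prefix of at most two letters, e.g. r, b, f, rb, Rf.
    while pos < length and pos - start < 2 and code[pos] in _PREFIX_CHARS:
        pos += 1
    prefix = code[start:pos].lower()

    quote = code[pos]
    delim = quote * 3 if code.startswith(quote * 3, pos) else quote
    triple = len(delim) == 3
    pos += len(delim)

    newlines = 0
    while pos < length:
        ch = code[pos]
        if ch == "\\" and pos + 1 < length:
            # Skip the escaped character so an escaped quote cannot close the string.
            if code[pos + 1] == "\n":
                newlines += 1
            pos += 2
            continue
        if ch == "\n":
            if not triple:
                break  # unterminated single-line string: stop before the newline
            newlines += 1
        elif code.startswith(delim, pos):
            pos += len(delim)
            break
        pos += 1

    token_type = "FSTRING" if "f" in prefix else "STRING"
    return token_type, pos, newlines
```

Returning the newline count lets the caller update its line counter without rescanning the literal, which keeps the main loop single-pass.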

_scan_number

Scan a numeric literal. Returns (token_type, end_position).

def _scan_number(self, code: str, pos: int) -> tuple[TokenType, int]

Parameters

  • code: str
  • pos: int

Returns

tuple[TokenType, int]
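
The dispatch a number scanner has to perform (radix prefix, fraction, exponent, complex suffix) can be sketched as follows. The string labels stand in for TokenType members, and the helper _run and the table _RADIX are hypothetical names; underscore placement is deliberately not validated here.

```python
_DEC = frozenset("0123456789")
# Radix letter -> (assumed token label, allowed digits)
_RADIX = {
    "x": ("HEX", frozenset("0123456789abcdefABCDEF")),
    "o": ("OCT", frozenset("01234567")),
    "b": ("BIN", frozenset("01")),
}


def _run(code: str, pos: int, digits: frozenset) -> int:
    """Advance past digits and underscores; underscore placement is not checked."""
    length = len(code)
    while pos < length and (code[pos] in digits or code[pos] == "_"):
        pos += 1
    return pos


def scan_number(code: str, pos: int) -> tuple[str, int]:
    length = len(code)
    # 0x / 0o / 0b prefixes select a digit set and finish immediately.
    if code[pos] == "0" and pos + 1 < length and code[pos + 1].lower() in _RADIX:
        kind, digits = _RADIX[code[pos + 1].lower()]
        return kind, _run(code, pos + 2, digits)

    kind = "INTEGER"
    end = _run(code, pos, _DEC)
    if end < length and code[end] == ".":
        kind = "FLOAT"
        end = _run(code, end + 1, _DEC)
    # The exponent is consumed only if a digit follows, so "1else" stays "1".
    if end < length and code[end] in "eE":
        look = end + 1
        if look < length and code[look] in "+-":
            look += 1
        if look < length and code[look] in _DEC:
            kind = "FLOAT"
            end = _run(code, look, _DEC)
    if end < length and code[end] in "jJ":
        kind = "COMPLEX"
        end += 1
    return kind, end
```

The exponent check looks ahead at most two characters before committing, so the scanner never has to back up over input it has already consumed.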

_scan_digits_with_underscore

Scan digits with optional underscores.

def _scan_digits_with_underscore(self, code: str, pos: int) -> int

Parameters

  • code: str
  • pos: int

Returns

int

_scan_hex_digits

Scan hex digits with optional underscores.

def _scan_hex_digits(self, code: str, pos: int) -> int

Parameters

  • code: str
  • pos: int

Returns

int

_scan_octal_digits

Scan octal digits with optional underscores.

def _scan_octal_digits(self, code: str, pos: int) -> int

Parameters

  • code: str
  • pos: int

Returns

int

_scan_binary_digits

Scan binary digits with optional underscores.

def _scan_binary_digits(self, code: str, pos: int) -> int

Parameters

  • code: str
  • pos: int

Returns

int
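
The four digit scanners above share one shape and differ only in the allowed digit set. The sketch below shows that shape with a single hypothetical helper, scan_digit_run, and thin wrappers. It assumes the returned int is the end position, and it accepts an underscore only when a digit follows, matching Python's rule for underscores in numeric literals; whether the real helpers are this strict is not stated in this documentation.

```python
_DEC = frozenset("0123456789")
_HEX = frozenset("0123456789abcdefABCDEF")
_OCT = frozenset("01234567")
_BIN = frozenset("01")


def scan_digit_run(code: str, pos: int, digits: frozenset) -> int:
    """Return the position just past a run of digits with single underscores."""
    length = len(code)
    while pos < length:
        ch = code[pos]
        if ch in digits:
            pos += 1
        elif ch == "_" and pos + 1 < length and code[pos + 1] in digits:
            pos += 2  # the underscore plus the digit that follows it
        else:
            break
    return pos


def scan_digits_with_underscore(code: str, pos: int) -> int:
    return scan_digit_run(code, pos, _DEC)


def scan_hex_digits(code: str, pos: int) -> int:
    return scan_digit_run(code, pos, _HEX)


def scan_octal_digits(code: str, pos: int) -> int:
    return scan_digit_run(code, pos, _OCT)


def scan_binary_digits(code: str, pos: int) -> int:
    return scan_digit_run(code, pos, _BIN)
```

A trailing or doubled underscore stops the run, so scanning "1_000_" from position 0 ends at position 5 and leaves the final underscore for the caller to handle.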

_scan_exponent

Scan optional exponent part of number.

def _scan_exponent(self, code: str, pos: int) -> int

Parameters

  • code: str
  • pos: int

Returns

int

_classify_word

Classify an identifier into the appropriate token type.

def _classify_word(self, word: str) -> TokenType

Parameters

  • word: str

Returns

TokenType
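
Classification by set membership can be sketched as below. TokenKind is a hypothetical stand-in for TokenType, and the tables are built from the standard library's keyword and builtins modules purely for illustration; the actual contents of _KEYWORDS and _BUILTINS are not shown in this documentation.

```python
import builtins
import keyword
from enum import Enum


class TokenKind(Enum):
    """Hypothetical stand-in for the real TokenType."""
    KEYWORD = "keyword"
    BUILTIN = "builtin"
    NAME = "name"


# Assumed contents: hard keywords and public builtin names.
_KEYWORDS = frozenset(keyword.kwlist)
_BUILTINS = frozenset(n for n in dir(builtins) if not n.startswith("_"))


def classify_word(word: str) -> TokenKind:
    # Keywords take priority, so names like True and None map to KEYWORD here.
    if word in _KEYWORDS:
        return TokenKind.KEYWORD
    if word in _BUILTINS:
        return TokenKind.BUILTIN
    return TokenKind.NAME
```

The identifier is scanned once in the main loop and then classified with at most two hash lookups, so classification adds constant work per word.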