Module

`lexers.python_sm`

Hand-written Python lexer using a state machine approach.

O(n) guaranteed, zero regex, thread-safe.

Design Philosophy:

This is the reference implementation for Rosettes lexers. It demonstrates:

State Machine Architecture: Character-by-character processing with explicit state (position, line, column) as local variables.
Frozen Lookup Tables: Keywords, builtins, and operators as frozensets for O(1) membership testing and thread-safety.
Fast Path / Slow Path: Simple cases (identifiers, operators) handled inline; complex cases (strings, numbers) delegated to helper methods.

Architecture:

Main Loop (tokenize):

pos = 0
while pos < length:
    char = code[pos]
    # Dispatch based on first character
    if char is whitespace: ...
    elif char is comment: ...
    elif char is string: ...
    # etc.
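
The loop above can be made concrete with a small runnable sketch. This is not the Rosettes implementation: the `Tok` tuple, the token kind strings, and the `OPERATORS` table are hypothetical stand-ins, and only a handful of character classes are handled. It is meant to show the dispatch-on-first-character shape and how position, line, and column stay in local variables.

```python
from typing import Iterator, NamedTuple


class Tok(NamedTuple):
    kind: str   # hypothetical kind names, not the Rosettes TokenType enum
    value: str
    line: int
    col: int


OPERATORS = frozenset("+-*/=<>()[]{}:,.")  # illustrative subset only


def tokenize(code: str) -> Iterator[Tok]:
    pos, line, col = 0, 1, 1  # all scanner state is local to this call
    length = len(code)
    while pos < length:
        start = pos
        char = code[pos]
        # Dispatch on the first character of the token.
        if char == "\n":
            pos += 1
            kind = "NEWLINE"
        elif char in " \t":
            while pos < length and code[pos] in " \t":
                pos += 1
            kind = "WHITESPACE"
        elif char == "#":
            while pos < length and code[pos] != "\n":
                pos += 1
            kind = "COMMENT"
        elif char.isidentifier():
            pos += 1
            # "_" + c is an identifier exactly when c may continue one.
            while pos < length and ("_" + code[pos]).isidentifier():
                pos += 1
            kind = "NAME"
        elif char.isdigit():
            while pos < length and code[pos].isdigit():
                pos += 1
            kind = "NUMBER"
        elif char in OPERATORS:
            pos += 1
            kind = "OPERATOR"
        else:
            pos += 1
            kind = "ERROR"
        yield Tok(kind, code[start:pos], line, col)
        # Update line/column after emitting, so each token reports its start.
        if kind == "NEWLINE":
            line, col = line + 1, 1
        else:
            col += pos - start


for tok in tokenize("x = 1  # set x\n"):
    print(tok)
```

Every branch advances `pos` by at least one character and never moves it backwards, which is what makes the single pass linear in the input length.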

Helper Methods:

_scan_string_literal(): Handles prefixed and triple-quoted strings
_scan_number(): Handles int, float, hex, octal, binary, complex
_classify_word(): Maps identifiers to KEYWORD, BUILTIN, NAME

Python Language Support:

All Python 3.x syntax including 3.14
F-strings (prefix detection)
Type hints (annotations)
Walrus operator (:=)
Match/case statements (3.10+)
Type parameter syntax (3.12+)
Unicode identifiers (PEP 3131)

Performance:

~50µs per 100-line file
O(n) guaranteed (single pass, no backtracking)
~500 tokens/ms throughput
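
Figures like these depend on hardware and input, so it can be useful to measure them locally. The helper below is a hypothetical harness, not part of Rosettes: `tokens_per_ms` accepts any callable that maps source text to an iterable of tokens, and `str.split` is used here only as a stand-in tokenizer so the example runs on its own.

```python
import time
from typing import Callable, Iterable


def tokens_per_ms(tokenize: Callable[[str], Iterable], code: str, repeats: int = 20) -> float:
    """Best-of-N throughput in tokens per millisecond."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        t0 = time.perf_counter()
        count = sum(1 for _tok in tokenize(code))  # drain the iterator
        best = min(best, time.perf_counter() - t0)
    best = max(best, 1e-9)  # guard against a zero reading from the timer
    return count / (best * 1000.0)


sample = "value = compute(a, b)\n" * 100  # a synthetic 100-line input
print(f"{tokens_per_ms(str.split, sample):.0f} tokens/ms (stand-in tokenizer)")
```

Taking the best of several runs reduces noise from the scheduler and from cold caches.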

Thread-Safety:

All state is local to tokenize(). Class attributes are frozen:

_KEYWORDS: frozenset
_BUILTINS: frozenset
_TWO_CHAR_OPS: frozenset
etc.
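
A minimal sketch of why this arrangement is safe to share between threads, assuming a hypothetical `TinyLexer` class (not the Rosettes class) that follows the same two rules: lookup tables are immutable class attributes, and `tokenize()` writes only to local variables. Whitespace splitting stands in for real scanning to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


class TinyLexer:
    # Immutable class-level table: concurrent reads need no locking.
    _KEYWORDS = frozenset({"if", "else", "def", "return"})

    def tokenize(self, code: str) -> list[tuple[str, str]]:
        # Nothing is written to self; all working state lives in locals.
        tokens = []
        for word in code.split():
            kind = "KEYWORD" if word in self._KEYWORDS else "NAME"
            tokens.append((kind, word))
        return tokens


lexer = TinyLexer()  # one instance shared by every worker thread
sources = ["def f", "if x else y", "return z"] * 50

with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(lexer.tokenize, sources))

serial = [lexer.tokenize(s) for s in sources]
assert parallel == serial  # concurrent calls do not interfere
```

Because no call mutates shared data, a single lexer instance can be reused across threads without locks or per-thread copies.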

Adding New Lexers:

Use this file as a template. Key patterns to follow:

Frozen lookup tables as module constants
Local variables for all state (pos, line, col)
Character-by-character dispatch in main loop
Helper methods for complex constructs

See Also:

rosettes.lexers._state_machine: Base class and helper functions
rosettes._registry: How lexers are registered

Classes

PythonStateMachineLexer

Hand-written Python 3 lexer.

O(n) guaranteed, zero regex, thread-safe. Handles all Python 3.x syntax including f-strings, type hints, walrus operator.

This is the reference implementation for Rosettes lexers. Use it as a template when adding new language support.

Performance: ~50µs per 100-line file, ~500 tokens/ms throughput.

Attributes

Name	Type	Description
`name`	—	Canonical language name ("python")
`aliases`	—	Alternative names for registry lookup ("py", "python3", "py3")
`filenames`	—	Glob patterns for file detection ("*.py", "*.pyw", "*.pyi")
`mimetypes`	—	MIME types ("text/x-python", "application/x-python")

Thread-Safety: All class attributes are frozen (frozenset). The tokenize() method uses only local variables for state (pos, line, col).

Methods

tokenize

def tokenize(self, code: str, config: LexerConfig | None = None, *, start: int = 0, end: int | None = None) -> Iterator[Token]

Tokenize Python source code.

Single-pass, character-by-character. O(n) guaranteed.

Parameters

Name	Type	Description
`code`	`str`
`config`	`LexerConfig | None`	Default: `None`
`start`	`int`	Default: `0`
`end`	`int | None`	Default: `None`

Returns

Iterator[Token]

Internal Methods

_scan_string_literal

def _scan_string_literal(self, code: str, pos: int) -> tuple[TokenType, int, int]

Scan a string literal with optional prefix.

Returns (token_type, end_position, newline_count).

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

tuple[TokenType, int, int]
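
To illustrate the documented return shape `(token_type, end_position, newline_count)`, here is a standalone sketch of prefix and triple-quote handling. It is an assumption about how such a scanner could work, not the Rosettes source: the function is a free function rather than a method, the kind strings (`"STRING"`, `"BYTES"`, `"FSTRING"`) are hypothetical, and prefix combinations are not validated.

```python
_PREFIX_CHARS = frozenset("rRbBfFuU")
_QUOTES = frozenset("\"'")


def scan_string_literal(code: str, pos: int) -> tuple[str, int, int]:
    """Return (kind, end_position, newline_count) for the literal at pos."""
    length = len(code)
    start = pos
    # Optional prefix of up to two letters, e.g. r, b, f, rb, fr.
    while pos < length and pos - start < 2 and code[pos] in _PREFIX_CHARS:
        pos += 1
    prefix = code[start:pos].lower()
    if pos >= length or code[pos] not in _QUOTES:
        raise ValueError(f"no string literal at position {start}")

    quote = code[pos]
    delim = quote * 3 if code.startswith(quote * 3, pos) else quote
    pos += len(delim)

    newlines = 0
    while pos < length:
        ch = code[pos]
        if ch == "\\" and pos + 1 < length:
            # Skip the escaped character; an escaped quote does not close
            # the literal (this also holds for raw strings).
            if code[pos + 1] == "\n":
                newlines += 1
            pos += 2
        elif code.startswith(delim, pos):
            pos += len(delim)  # consume the closing delimiter
            break
        elif ch == "\n" and len(delim) == 1:
            break  # unterminated single-line string: stop at end of line
        else:
            if ch == "\n":
                newlines += 1
            pos += 1

    if "f" in prefix:
        kind = "FSTRING"
    elif "b" in prefix:
        kind = "BYTES"
    else:
        kind = "STRING"
    return kind, pos, newlines
```

Returning the newline count lets the caller advance its line counter without rescanning the literal, which keeps the overall pass linear.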

_scan_number

Scan a numeric literal. Returns (token_type, end_position).

def _scan_number(self, code: str, pos: int) -> tuple[TokenType, int]

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

tuple[TokenType, int]
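
The sketch below shows one way a numeric scanner returning `(token_type, end_position)` could be organised, with digit-run and exponent helpers playing the roles of the `_scan_*_digits` and `_scan_exponent` methods listed on this page. It is a hedged illustration rather than the Rosettes code: the functions are free functions, the kind strings are hypothetical, and a single `scan_digits` parameterised by digit set replaces the four separate digit helpers.

```python
_DEC = frozenset("0123456789")
_HEX = frozenset("0123456789abcdefABCDEF")
_OCT = frozenset("01234567")
_BIN = frozenset("01")
_BASES = {"x": (_HEX, "NUMBER_HEX"), "o": (_OCT, "NUMBER_OCT"), "b": (_BIN, "NUMBER_BIN")}


def scan_digits(code: str, pos: int, digits: frozenset) -> int:
    """Advance past digits, allowing an underscore only when a digit follows."""
    length = len(code)
    while pos < length:
        if code[pos] in digits:
            pos += 1
        elif code[pos] == "_" and pos + 1 < length and code[pos + 1] in digits:
            pos += 2
        else:
            break
    return pos


def scan_exponent(code: str, pos: int) -> int:
    """Consume e/E, optional sign, and digits; return pos unchanged otherwise."""
    length = len(code)
    if pos < length and code[pos] in "eE":
        nxt = pos + 1
        if nxt < length and code[nxt] in "+-":
            nxt += 1
        if nxt < length and code[nxt] in _DEC:
            return scan_digits(code, nxt, _DEC)
    return pos


def scan_number(code: str, pos: int) -> tuple[str, int]:
    """Return (kind, end_position) for the numeric literal starting at pos."""
    length = len(code)
    # Base prefixes: 0x / 0o / 0b in either case.
    if code[pos] == "0" and pos + 1 < length and code[pos + 1] in "xXoObB":
        digits, kind = _BASES[code[pos + 1].lower()]
        return kind, scan_digits(code, pos + 2, digits)

    kind = "NUMBER_INTEGER"
    pos = scan_digits(code, pos, _DEC)
    if pos < length and code[pos] == ".":
        kind = "NUMBER_FLOAT"
        pos = scan_digits(code, pos + 1, _DEC)
    after_exp = scan_exponent(code, pos)
    if after_exp != pos:
        kind, pos = "NUMBER_FLOAT", after_exp
    if pos < length and code[pos] in "jJ":
        kind, pos = "NUMBER_COMPLEX", pos + 1
    return kind, pos
```

Each helper takes a position and returns the new position, so the scanner composes them without shared state, and `scan_exponent` leaves a bare trailing `e` (as in `1e`) unconsumed for the caller to tokenize separately.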

_scan_digits_with_underscore

Scan digits with optional underscores.

def _scan_digits_with_underscore(self, code: str, pos: int) -> int

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

int

_scan_hex_digits

Scan hex digits with optional underscores.

def _scan_hex_digits(self, code: str, pos: int) -> int

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

int

_scan_octal_digits

Scan octal digits with optional underscores.

def _scan_octal_digits(self, code: str, pos: int) -> int

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

int

_scan_binary_digits

Scan binary digits with optional underscores.

def _scan_binary_digits(self, code: str, pos: int) -> int

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

int

_scan_exponent

Scan optional exponent part of number.

def _scan_exponent(self, code: str, pos: int) -> int

Parameters

Name	Type	Description
`code`	`str`
`pos`	`int`

Returns

int

_classify_word

Classify an identifier into the appropriate token type.

def _classify_word(self, word: str) -> TokenType

Parameters

Name	Type	Description
`word`	`str`

Returns

TokenType
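
Classification of this kind reduces to a few frozenset lookups. The sketch below assumes the tables are built from the standard library (`keyword.kwlist` and the public names in `builtins`) and returns hypothetical kind strings; the actual contents of `_KEYWORDS` and `_BUILTINS` in Rosettes, and its `TokenType` members, may differ.

```python
import builtins
import keyword

# Frozen tables built once at import time; membership tests are O(1) on average.
_KEYWORDS = frozenset(keyword.kwlist)
_BUILTINS = frozenset(name for name in dir(builtins) if not name.startswith("_"))


def classify_word(word: str) -> str:
    """Map an identifier to a hypothetical KEYWORD / BUILTIN / NAME kind."""
    if word in _KEYWORDS:  # checked first: True, False, None are in both tables
        return "KEYWORD"
    if word in _BUILTINS:
        return "BUILTIN"
    return "NAME"


for word in ("def", "len", "spam"):
    print(word, classify_word(word))
```

Soft keywords such as `match` and `case` are not in `keyword.kwlist`, so a table built this way would classify them as plain names unless the lexer adds them explicitly.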