Classes
StateMachineLexer
Base class for hand-written state machine lexers.
Thread-safe: `tokenize()` uses only local variables.
O(n) guaranteed: single pass, no backtracking.
Subclasses implement language-specific tokenization by overriding the `tokenize()` method with character-by-character logic.
Design Principles:
- No regex — character matching only
- Explicit state — no hidden backtracking
- Local variables only — thread-safe by design
- Single pass — O(n) guaranteed
Class Attributes:
- name: Canonical language name (e.g., 'python')
- aliases: Alternative names for registry lookup (e.g., ('py', 'python3'))
- filenames: Glob patterns for file detection (e.g., ('*.py',))
- mimetypes: MIME types for content detection
Shared Character Sets:
- DIGITS: '0'-'9'
- HEX_DIGITS: '0'-'9', 'a'-'f', 'A'-'F'
- LETTERS: 'a'-'z', 'A'-'Z'
- IDENT_START: Letters + '_'
- IDENT_CONT: IDENT_START + digits
- WHITESPACE: Space, tab, newline, etc.
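A plausible way these sets could be defined (a sketch only; the library's actual values may differ, e.g. in whitespace or Unicode coverage):

```python
import string

DIGITS = frozenset(string.digits)           # '0'-'9'
HEX_DIGITS = frozenset(string.hexdigits)    # '0'-'9', 'a'-'f', 'A'-'F'
LETTERS = frozenset(string.ascii_letters)   # 'a'-'z', 'A'-'Z'
IDENT_START = LETTERS | {"_"}
IDENT_CONT = IDENT_START | DIGITS
WHITESPACE = frozenset(" \t\r\n\f\v")
```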
Example Implementation:
```python
class MyLangLexer(StateMachineLexer):
    name = "mylang"
    aliases = ("ml",)
    KEYWORDS = frozenset({"if", "else"})

    def tokenize(self, code, config=None, start=0, end=None):
        pos = start
        end = len(code) if end is None else end  # `end or len(code)` would mishandle end=0
        line, col = 1, 1
        while pos < end:
            char = code[pos]
            # ... tokenization logic ...
            yield Token(TokenType.TEXT, char, line, col)
            pos += 1
            col += 1
```
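Driving the lexer is then just iteration; a minimal sketch (assuming `Token` exposes the fields it was constructed with):

```python
lexer = MyLangLexer()
for token in lexer.tokenize("if x"):
    print(token)  # one TEXT token per character in this skeleton
```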
Common Mistakes:
```python
# ❌ WRONG: Storing state in instance variables
self.current_line = 1  # NOT thread-safe!

# ✅ CORRECT: Use local variables
line = 1

# ❌ WRONG: Using regex for matching
match = re.match(r'\d+', code[pos:])  # ReDoS vulnerable!

# ✅ CORRECT: Use scan_while helper
end_pos = scan_while(code, pos, self.DIGITS)
```
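Because all state is local, one lexer instance can safely serve many threads; a sketch demonstrating the property:

```python
from concurrent.futures import ThreadPoolExecutor

lexer = MyLangLexer()  # single shared instance; tokenize() keeps no instance state
sources = ["if a", "else b", "if c"]
with ThreadPoolExecutor() as pool:
    token_lists = list(pool.map(lambda src: list(lexer.tokenize(src)), sources))
```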
Attributes
| Name | Type | Description |
|---|---|---|
| name | str | Canonical language name (e.g., 'python'). |
| aliases | tuple[str, ...] | Alternative names for registry lookup (e.g., ('py', 'python3')). |
| filenames | tuple[str, ...] | Glob patterns for file detection (e.g., ('*.py',)). |
| mimetypes | tuple[str, ...] | MIME types for content detection. |
| DIGITS | frozenset[str] | Decimal digits: '0'-'9'. |
| HEX_DIGITS | frozenset[str] | Hexadecimal digits: '0'-'9', 'a'-'f', 'A'-'F'. |
| OCTAL_DIGITS | frozenset[str] | Octal digits: '0'-'7'. |
| BINARY_DIGITS | frozenset[str] | Binary digits: '0' and '1'. |
| LETTERS | frozenset[str] | ASCII letters: 'a'-'z', 'A'-'Z'. |
| IDENT_START | frozenset[str] | Characters that may start an identifier: letters + '_'. |
| IDENT_CONT | frozenset[str] | Characters that may continue an identifier: IDENT_START + digits. |
| WHITESPACE | frozenset[str] | Whitespace: space, tab, newline, etc. |
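Subclasses can widen these sets for language-specific rules; for instance, a hypothetical language allowing '$' in identifiers:

```python
class DollarLangLexer(StateMachineLexer):
    name = "dollarlang"
    IDENT_START = StateMachineLexer.IDENT_START | {"$"}
    IDENT_CONT = StateMachineLexer.IDENT_CONT | {"$"}
```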
Methods
tokenize

```python
def tokenize(self, code: str, config: LexerConfig | None = None, start: int = 0, end: int | None = None) -> Iterator[Token]
```

Tokenize source code.
Subclasses override this with language-specific logic.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | The source code to tokenize. |
| config | LexerConfig \| None | Optional lexer configuration. Default: None |
| start | int | Starting index in the source string. Default: 0 |
| end | int \| None | Optional ending index in the source string. Default: None |
Returns
Iterator[Token]
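A usage sketch tokenizing only a window of the source via start/end:

```python
lexer = MyLangLexer()
tokens = list(lexer.tokenize("if x else y", start=0, end=4))  # only "if x"
```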
tokenize_fast

```python
def tokenize_fast(self, code: str, start: int = 0, end: int | None = None) -> Iterator[tuple[TokenType, str]]
```

Fast tokenization without position tracking.
Default implementation strips position info from tokenize(). Subclasses may override for further optimization.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | The source code to tokenize. |
| start | int | Starting index in the source string. Default: 0 |
| end | int \| None | Optional ending index in the source string. Default: None |
Returns
Iterator[tuple[TokenType, str]]
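The default behavior described above amounts to roughly this sketch (the `type`/`value` attribute names on `Token` are assumptions):

```python
def tokenize_fast(self, code, start=0, end=None):
    for token in self.tokenize(code, start=start, end=end):
        yield (token.type, token.value)
```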
Functions
scan_while

```python
def scan_while(code: str, pos: int, char_set: frozenset[str]) -> int
```

Advance position while characters are in char_set.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code string. |
| pos | int | Starting position. |
| char_set | frozenset[str] | Set of characters to match. |
Returns
int
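Usage sketch, scanning a run of digits:

```python
code = "123abc"
end_pos = scan_while(code, 0, frozenset("0123456789"))
number = code[0:end_pos]  # "123" (end_pos == 3)
```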
scan_until

```python
def scan_until(code: str, pos: int, char_set: frozenset[str]) -> int
```

Advance position until a character in char_set is found.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code string. |
| pos | int | Starting position. |
| char_set | frozenset[str] | Set of characters to stop at. |
Returns
int
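Usage sketch, scanning up to (but not past) a stop character:

```python
code = "abc=def"
end_pos = scan_until(code, 0, frozenset("="))
name = code[0:end_pos]  # "abc"; code[end_pos] == "="
```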
scan_string

```python
def scan_string(code: str, pos: int, quote: str) -> int
```

Scan a string literal, handling escapes.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code. |
| pos | int | Position after opening quote. |
| quote | str | The quote character (' or "). |
Returns
int
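Usage sketch; pos starts just after the opening quote, and an escaped quote does not end the scan (whether the returned position sits at or past the closing quote is the library's convention):

```python
code = '"say \\"hi\\"" rest'
end_pos = scan_string(code, 1, '"')  # 1 = index just after the opening quote
```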
scan_triple_string

```python
def scan_triple_string(code: str, pos: int, quote: str) -> int
```

Scan a triple-quoted string.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code. |
| pos | int | Position after opening triple quote. |
| quote | str | The quote character (' or "). |
Returns
int
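Usage sketch; single occurrences of the quote character inside the literal do not terminate the scan, only the matching triple quote does:

```python
code = '"""has a " inside""" rest'
end_pos = scan_triple_string(code, 3, '"')  # 3 = index just after the opening """
```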
scan_line_comment

```python
def scan_line_comment(code: str, pos: int) -> int
```

Scan to end of line (for line comments).
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code. |
| pos | int | Starting position (after comment marker). |
Returns
int
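Usage sketch, capturing a comment body up to the end of the line:

```python
code = "x = 1  # a comment\ny = 2"
marker = code.index("#")
end_pos = scan_line_comment(code, marker + 1)  # start just after the marker
body = code[marker + 1:end_pos]  # " a comment" (assuming the newline is excluded)
```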
scan_block_comment

```python
def scan_block_comment(code: str, pos: int, end_marker: str) -> int
```

Scan a block comment until end marker.
Parameters
| Name | Type | Description |
|---|---|---|
| code | str | Source code. |
| pos | int | Position after opening marker. |
| end_marker | str | The closing marker (e.g., "*/" or "-->"). |
Returns
int
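Usage sketch for a C-style block comment (whether the returned position sits at or past the end marker is the library's convention):

```python
code = "/* multi\nline */ after"
end_pos = scan_block_comment(code, 2, "*/")  # 2 = index just after the opening /*
```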