Module

lexer

Kida lexer — tokenizes template source code into a token stream.

The lexer scans template source and produces Token objects that the Parser consumes. It operates in four modes based on the current context (see the sketch after the list below):

Modes:

  • DATA: Outside template constructs; collects raw text
  • VARIABLE: Inside {{ }}; tokenizes the expression
  • BLOCK: Inside {% %}; tokenizes the statement
  • COMMENT: Inside {# #}; skips to the closing delimiter
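
For example, a template mixing all three constructs exercises every mode. A minimal sketch (it only collects the stream; the exact token sequence is not asserted here):

>>> from kida.lexer import tokenize
>>> src = "Hi {# greeting #}{% if name %}{{ name }}{% endif %}!"
>>> pairs = [(t.type.name, t.value) for t in tokenize(src)]
>>> # pairs interleaves DATA, BLOCK_*, and VARIABLE_* tokens; nothing is
>>> # emitted for the {# greeting #} comment, per the COMMENT mode above.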

Token Types:

  • Delimiters: BLOCK_BEGIN, BLOCK_END, VARIABLE_BEGIN, VARIABLE_END
  • Literals: STRING, INTEGER, FLOAT
  • Identifiers: NAME (includes keywords like 'if', 'for', 'and')
  • Operators: ADD, SUB, MUL, DIV, EQ, NE, LT, GT, etc.
  • Punctuation: DOT, COMMA, COLON, PIPE, LPAREN, RPAREN, etc.
  • Data: DATA (raw text between template constructs)

Whitespace Control:

Supports Jinja2-style whitespace trimming (see the sketch after this list):

  • {{- expr }}: Strip whitespace before
  • {{ expr -}}: Strip whitespace after
  • {%- stmt %}/{% stmt -%}: Same for blocks
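
A minimal sketch of the trimming behavior (nothing is asserted here; per the rules above, the whitespace adjacent to the trimmed sides should disappear from the neighboring DATA tokens):

>>> from kida.lexer import tokenize
>>> tokens = tokenize("Hello,   {{- name -}}   !")
>>> data_values = [t.value for t in tokens if t.type.name == "DATA"]
>>> # Per the rules above, data_values should be ['Hello,', '!'] rather
>>> # than ['Hello,   ', '   !'].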

Performance:

  • Compiled regex: Patterns are class-level, compiled once
  • O(1) operator lookup: Dict-based, not list iteration
  • Single-pass scanning: No backtracking
  • Generator-based: Memory-efficient for large templates

Thread-Safety: Lexer instances are single-use. Create one per tokenization. The resulting token list is immutable.

Example:

>>> from kida.lexer import Lexer, tokenize
>>> lexer = Lexer("Hello, {{ name }}!")
>>> tokens = list(lexer.tokenize())
>>> [(t.type.name, t.value) for t in tokens]
[('DATA', 'Hello, '), ('VARIABLE_BEGIN', '{{'), ('NAME', 'name'), ('VARIABLE_END', '}}'), ('DATA', '!'), ('EOF', '')]

Convenience function:

>>> tokens = tokenize("{{ x | upper }}")

Classes

LexerMode

Lexer operating mode.

LexerConfig

Lexer configuration for delimiter customization and whitespace control.

Allows customizing template delimiters and enabling automatic whitespace trimming. Frozen for thread-safety (immutable after creation).

Attributes

Name Type Description
block_start str

Block tag opening delimiter (default: '{%')

block_end str

Block tag closing delimiter (default: '%}')

variable_start str

Variable tag opening delimiter (default: '{{')

variable_end str

Variable tag closing delimiter (default: '}}')

comment_start str

Comment opening delimiter (default: '{#')

comment_end str

Comment closing delimiter (default: '#}')

line_statement_prefix str | None

Line statement prefix, e.g., '#' (default: None)

line_comment_prefix str | None

Line comment prefix, e.g., '##' (default: None)

trim_blocks bool

Remove first newline after block tags (default: False)

lstrip_blocks bool

Strip leading whitespace before block tags (default: False)
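
A hedged construction sketch. It assumes LexerConfig is exported from kida.lexer and accepts the attribute names above as keyword arguments (the constructor signature is not shown in these docs):

>>> from kida.lexer import Lexer, LexerConfig
>>> config = LexerConfig(
...     variable_start="<<", variable_end=">>",
...     trim_blocks=True, lstrip_blocks=True,
... )
>>> tokens = list(Lexer("Hello, << name >>!", config=config).tokenize())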

LexerError

Lexer error with source location.

Methods

Internal Methods
__init__
def __init__(self, message: str, source: str, lineno: int, col_offset: int, suggestion: str | None = None)
Parameters
Name Type Description
message
source
lineno
col_offset
suggestion Default: None
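
A hedged handling sketch. It assumes LexerError is exported from kida.lexer and that the constructor arguments above are also available as attributes on the raised error; only the formatted message is documented explicitly:

    from kida.lexer import Lexer, LexerError

    try:
        list(Lexer('{% set x = "hello %}').tokenize())
    except LexerError as exc:
        # The formatted message includes the source snippet, caret, and suggestion.
        print(exc)
        # Attribute access below is an assumption based on the __init__ parameters.
        print(exc.lineno, exc.col_offset, exc.suggestion)
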
Lexer

Template lexer that transforms source into a token stream.

The Lexer is the first stage of template compilation. It scans source text and yields Token objects representing literals, operators, identifiers, and template delimiters.

Thread-Safety: Instance state is mutable during tokenization (position tracking). Create one Lexer per source string; do not reuse across threads.

Operator Lookup: Uses O(1) dict lookup instead of O(k) list iteration:

    _OPERATORS_2CHAR = {"**": TokenType.POW, "//": TokenType.FLOORDIV, ...}
    _OPERATORS_1CHAR = {"+": TokenType.ADD, "-": TokenType.SUB, ...}

Whitespace Control: Handles {{-, -}}, {%-, and -%} modifiers:

  • Left modifier ({{-, {%-): Strips trailing whitespace from the preceding DATA
  • Right modifier (-}}, -%}): Strips leading whitespace from the following DATA

Error Handling: LexerError includes a source snippet with a caret and a suggestion:

    Lexer Error: Unterminated string literal
      --> line 3:15
       |
     3 | {% set x = "hello %}
       |               ^
    Suggestion: Add closing " to end the string

Attributes

Name Type Description
_OPERATORS_3CHAR dict[str, TokenType]
_OPERATORS_2CHAR dict[str, TokenType]
_OPERATORS_1CHAR dict[str, TokenType]

Methods

tokenize
Tokenize the source and yield tokens.
def tokenize(self) -> Iterator[Token]
Returns
Iterator[Token]
Internal Methods
_get_delimiter_pattern
staticmethod
def _get_delimiter_pattern(config: LexerConfig) -> re.Pattern[str]

Get compiled delimiter pattern for config (cached).

Compiles a single regex that matches any of the three delimiter types. Result is cached per unique config for O(1) subsequent lookups.

Performance: Single regex search is 5-24x faster than 3x str.find() (validated in benchmarks/test_benchmark_lexer.py).

Parameters
Name Type Description
config
Returns
re.Pattern[str]
__init__
Initialize lexer with source code.
def __init__(self, source: str, config: LexerConfig | None = None)
Parameters
Name Type Description
source

Template source code

config

Lexer configuration (uses defaults if None)

Default: None
_tokenize_data
Tokenize raw data outside template constructs.
def _tokenize_data(self) -> Iterator[Token]
Returns
Iterator[Token]
_tokenize_code
Tokenize code inside {{ }} or {% %}.
def _tokenize_code(self, end_delimiter: str, end_token_type: TokenType) -> Iterator[Token]
Parameters
Name Type Description
end_delimiter
end_token_type
Returns
Iterator[Token]
_tokenize_comment
Skip comment content until closing delimiter.
def _tokenize_comment(self) -> Iterator[Token]
Returns
Iterator[Token]
_next_code_token
def _next_code_token(self) -> Token

Get the next token from code content.

Complexity: O(1) for operator lookup (dict-based).

Returns
Token
_scan_string
Scan a string literal.
def _scan_string(self) -> Token
Returns
Token
_scan_number
def _scan_number(self) -> Token

Scan a number literal (integer or float).

Note: Special handling for range literals (1..10, 1...11). If we see digits followed by '..' or '...', treat the digits as an integer, not a float, so the range operator can be parsed.

Returns
Token
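
A hedged illustration of the range-literal note above (the token types used for the '..' operator are not documented here, so only the call and the expected integer handling are shown):

>>> from kida.lexer import tokenize
>>> tokens = tokenize("{% for i in 1..10 %}{{ i }}{% endfor %}")
>>> # '1' and '10' should lex as INTEGER tokens rather than a single FLOAT,
>>> # leaving '..' for the parser's range handling.
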
_scan_name
Scan a name or keyword.
def _scan_name(self) -> Token
Returns
Token
_find_next_construct
def _find_next_construct(self) -> tuple[str, int] | None

Find the next template construct ({{ }}, {% %}, or {# #}).

Uses a single compiled regex search instead of 3x str.find() calls. The regex is cached per LexerConfig for O(1) subsequent lookups.

Performance: 5-24x faster than the previous str.find() approach (validated in benchmarks/test_benchmark_lexer.py).

Returns
tuple[str, int] | None
_emit_delimiter
Emit a delimiter token and advance position.
def _emit_delimiter(self, delimiter: str, token_type: TokenType) -> Token
Parameters
Name Type Description
delimiter
token_type
Returns
Token
_skip_whitespace
Skip whitespace characters.
def _skip_whitespace(self) -> None
_advance
def _advance(self, count: int) -> None

Advance position by count characters, tracking line/column.

Optimized to use batch processing with count() for newline detection instead of character-by-character iteration. Provides ~15-20% speedup for templates with long DATA nodes.

Parameters
Name Type Description
count

Functions

tokenize
Convenience function to tokenize source into a list.
def tokenize(source: str, config: LexerConfig | None = None) -> list[Token]
Parameters
Name Type Description
source str

Template source code

config LexerConfig | None

Optional lexer configuration

Default: None
Returns
list[Token]
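
A usage sketch combining the convenience function with a custom configuration (the LexerConfig keyword arguments are an assumption carried over from the earlier sketch):

>>> from kida.lexer import LexerConfig, tokenize
>>> tokens = tokenize(
...     "{% if user %}\n  Hi, {{ user }}!\n{% endif %}\n",
...     config=LexerConfig(trim_blocks=True, lstrip_blocks=True),
... )
>>> # tokens is a flat list[Token]; Lexer.tokenize() yields lazily instead.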