# lexer

URL: /kida/api/lexer/
Section: api
Description: Kida lexer — tokenizes template source code into a token stream.

The lexer scans template source and produces Token objects that the Parser
consumes. It operates in four modes based on current context:

Modes:
- **DATA**: Outside template constructs; collects raw text
- **VARIABLE**: Inside `{{ }}`; tokenizes expression
- **BLOCK**: Inside `{% %}`; tokenizes statement
- **COMMENT**: Inside `{# #}`; skips to closing delimiter
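The four modes can be pictured as a small state machine. The sketch below is not Kida's implementation: `Mode`, `OPENERS`, and `classify` are hypothetical names, the default delimiters are assumed, and it only reports which mode each stretch of source falls under.

```python
from __future__ import annotations

from enum import Enum, auto


class Mode(Enum):  # hypothetical stand-in for LexerMode
    DATA = auto()
    VARIABLE = auto()
    BLOCK = auto()
    COMMENT = auto()


# Opening delimiter -> (mode entered, closing delimiter); defaults assumed.
OPENERS = {
    "{{": (Mode.VARIABLE, "}}"),
    "{%": (Mode.BLOCK, "%}"),
    "{#": (Mode.COMMENT, "#}"),
}


def classify(source: str) -> list[tuple[Mode, str]]:
    """Split source into (mode, text) segments; delimiters are dropped."""
    segments: list[tuple[Mode, str]] = []
    pos = 0
    while pos < len(source):
        # Find the earliest opening delimiter at or after pos.
        hits = [(source.find(opener, pos), opener) for opener in OPENERS]
        hits = [(index, opener) for index, opener in hits if index != -1]
        if not hits:
            segments.append((Mode.DATA, source[pos:]))
            break
        start, opener = min(hits)
        if start > pos:
            segments.append((Mode.DATA, source[pos:start]))
        mode, closer = OPENERS[opener]
        end = source.find(closer, start + 2)
        if end == -1:
            raise ValueError(f"unclosed {opener!r} at offset {start}")
        segments.append((mode, source[start + 2:end].strip()))
        pos = end + 2
    return segments
```

A real lexer would tokenize the VARIABLE and BLOCK segments further and discard COMMENT segments; here they are kept as text so the mode switches stay visible.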

Token Types:
- **Delimiters**: BLOCK_BEGIN, BLOCK_END, VARIABLE_BEGIN, VARIABLE_END
- **Literals**: STRING, INTEGER, FLOAT
- **Identifiers**: NAME (includes keywords like 'if', 'for', 'and')
- **Operators**: ADD, SUB, MUL, DIV, EQ, NE, LT, GT, etc.
- **Punctuation**: DOT, COMMA, COLON, PIPE, LPAREN, RPAREN, etc.
- **Data**: DATA (raw text between template constructs)

Whitespace Control:
Supports Jinja2-style whitespace trimming:
- `{{- expr }}`: Strip whitespace before
- `{{ expr -}}`: Strip whitespace after
- `{%- stmt %}` / `{% stmt -%}`: Same for blocks

Performance:
- **Compiled regex**: Patterns are class-level, compiled once
- **O(1) operator lookup**: Dict-based, not list iteration
- **Single-pass scanning**: No backtracking
- **Generator-based**: Memory-efficient for large templates

Thread-Safety:
Lexer instances are single-use. Create one per tokenization.
The resulting token list is immutable.

Example:

    >>> from kida.lexer import Lexer, tokenize
    >>> lexer = Lexer("Hello, {{ name }}!")
    >>> tokens = list(lexer.tokenize())
    >>> [(t.type.name, t.value) for t in tokens]
    [('DATA', 'Hello, '), ('VARIABLE_BEGIN', '{{'), ('NAME', 'name'),
     ('VARIABLE_END', '}}'), ('DATA', '!'), ('EOF', '')]

Convenience function:

    >>> tokens = tokenize("{{ x | upper }}")

---


## Classes

`LexerMode`

Lexer operating mode.

`LexerConfig`

Lexer configuration for delimiter customization and whitespace control.

Allows customizing template delimiters and enabling automatic whitespace
trimming. Frozen for thread-safety (immutable after creation).

#### Attributes

| Name | Type | Description |
|------|------|-------------|
| `block_start` | `str` | Block tag opening delimiter (default: '{%') |
| `block_end` | `str` | Block tag closing delimiter (default: '%}') |
| `variable_start` | `str` | Variable tag opening delimiter (default: '{{') |
| `variable_end` | `str` | Variable tag closing delimiter (default: '}}') |
| `comment_start` | `str` | Comment opening delimiter (default: '{#') |
| `comment_end` | `str` | Comment closing delimiter (default: '#}') |
| `line_statement_prefix` | `str \| None` | Line statement prefix, e.g., '#' (default: None) |
| `line_comment_prefix` | `str \| None` | Line comment prefix, e.g., '##' (default: None) |
| `trim_blocks` | `bool` | Remove first newline after block tags (default: False) |
| `lstrip_blocks` | `bool` | Strip leading whitespace before block tags (default: False) |
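To illustrate what "frozen" buys here, the sketch below assumes the configuration behaves like a frozen dataclass (the page does not confirm how `LexerConfig` is implemented). `SketchConfig` is a hypothetical class that mirrors only a few of the attributes above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SketchConfig:  # hypothetical mirror of a few LexerConfig fields
    block_start: str = "{%"
    block_end: str = "%}"
    variable_start: str = "{{"
    variable_end: str = "}}"
    trim_blocks: bool = False


default = SketchConfig()

# Customize by building a new instance rather than mutating the old one.
erb_like = replace(default, block_start="<%", block_end="%>")

try:
    default.trim_blocks = True  # frozen dataclasses reject assignment
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# Frozen dataclasses with the default eq=True are hashable,
# so a config can serve as a cache key.
cache = {default: "pattern-for-defaults"}
```

Because instances never change after creation, the same config can be shared between threads and used as a dictionary key without locking.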

`LexerError`

Lexer error with source location.

#### Methods

Internal Methods

`__init__`

`def __init__(self, message: str, source: str, lineno: int, col_offset: int, suggestion: str | None = None)`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `message` | `str` | |
| `source` | `str` | |
| `lineno` | `int` | |
| `col_offset` | `int` | |
| `suggestion` | `str \| None` | Default: `None` |

`Lexer`

Template lexer that transforms source into a token stream.

The Lexer is the first stage of template compilation. It scans source text
and yields Token objects representing literals, operators, identifiers,
and template delimiters.

Thread-Safety:
Instance state is mutable during tokenization (position tracking).
Create one Lexer per source string; do not reuse across threads.

Operator Lookup:
Uses O(1) dict lookup instead of O(k) list iteration:

    _OPERATORS_2CHAR = {"**": TokenType.POW, "//": TokenType.FLOORDIV, ...}
    _OPERATORS_1CHAR = {"+": TokenType.ADD, "-": TokenType.SUB, ...}
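One plausible way to use such tables is longest-match-first: try three characters, then two, then one. The sketch below is an assumption about how the lookup could work, not Kida's code; the table contents and the string token names are hypothetical.

```python
from __future__ import annotations

# Hypothetical tables; token types are plain strings in this sketch.
OPS_3 = {"...": "RANGE_EXCLUSIVE"}
OPS_2 = {"**": "POW", "//": "FLOORDIV", "==": "EQ", "!=": "NE"}
OPS_1 = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def match_operator(source: str, pos: int) -> tuple[str, str] | None:
    """Return (token_type, text) for the longest operator at pos, if any."""
    for width, table in ((3, OPS_3), (2, OPS_2), (1, OPS_1)):
        text = source[pos:pos + width]
        token_type = table.get(text)  # one hash lookup per width
        if token_type is not None:
            return token_type, text
    return None
```

Checking the widest table first keeps `**` from being read as two `*` tokens, and the cost stays at no more than three dictionary lookups regardless of how many operators exist.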

Whitespace Control:
Handles `{{-`, `-}}`, `{%-`, `-%}` modifiers:

- Left modifier (`{{-`, `{%-`): Strips trailing whitespace from preceding DATA
- Right modifier (`-}}`, `-%}`): Strips leading whitespace from following DATA
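The effect of the two modifiers on neighbouring DATA can be sketched as follows. `apply_trim` is a hypothetical helper; it assumes the default two-character delimiters and that all whitespace, including newlines, is stripped (as in Jinja2).

```python
def apply_trim(prev_data: str, tag: str, next_data: str) -> tuple[str, str]:
    """Trim the DATA around one tag according to its '-' modifiers.

    Assumes two-character delimiters, e.g. '{{- x -}}' or '{%- if y -%}'.
    """
    inner = tag[2:-2]
    if inner.startswith("-"):
        prev_data = prev_data.rstrip()  # left modifier: trim before the tag
    if inner.endswith("-"):
        next_data = next_data.lstrip()  # right modifier: trim after the tag
    return prev_data, next_data
```

For example, `apply_trim("a  \n", "{{- x }}", "\n b")` leaves the following DATA untouched and returns `("a", "\n b")`.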

Error Handling:
`LexerError` includes a source snippet with caret and suggestions:

    Lexer Error: Unterminated string literal
      --> line 3:15
       |
     3 | {% set x = "hello %}
       |            ^
    Suggestion: Add closing " to end the string
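A message in that shape can be assembled from the same five values that `LexerError.__init__` accepts. The function below is a hypothetical sketch, not the library's formatter; it assumes `lineno` is 1-based and `col_offset` is 0-based.

```python
from __future__ import annotations


def format_error(message: str, source: str, lineno: int, col_offset: int,
                 suggestion: str | None = None) -> str:
    """Render an error with the offending line and a caret under col_offset."""
    line = source.splitlines()[lineno - 1]
    gutter = " " * len(str(lineno))  # keeps the '|' bars aligned
    parts = [
        f"Lexer Error: {message}",
        f"  --> line {lineno}:{col_offset}",
        f"{gutter} |",
        f"{lineno} | {line}",
        f"{gutter} | {' ' * col_offset}^",
    ]
    if suggestion:
        parts.append(f"Suggestion: {suggestion}")
    return "\n".join(parts)
```

Passing the full source plus a line and column, rather than a pre-cut snippet, lets the formatter decide how much context to show.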

#### Attributes

| Name | Type | Description |
|------|------|-------------|
| `_OPERATORS_3CHAR` | `ClassVar[dict[str, TokenType]]` | |
| `_OPERATORS_2CHAR` | `ClassVar[dict[str, TokenType]]` | |
| `_OPERATORS_1CHAR` | `ClassVar[dict[str, TokenType]]` | |

#### Methods

`tokenize`

Tokenize the source and yield tokens.

`def tokenize(self) -> Iterator[Token]`

##### Returns

`Iterator[Token]`

Internal Methods

`_get_delimiter_pattern`

staticmethod

`def _get_delimiter_pattern(config: LexerConfig) -> re.Pattern[str]`

Get compiled delimiter pattern for config (cached).

Compiles a single regex that matches any of the three delimiter types.
Result is cached per unique config for O(1) subsequent lookups.

Performance: Single regex search is 5-24x faster than 3x str.find()
(validated in benchmarks/test_benchmark_lexer.py).

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `config` | `LexerConfig` | |

##### Returns

`re.Pattern[str]`
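The idea of one cached, combined pattern can be sketched with `functools.lru_cache`. This is an illustration under assumptions, not the library's code: `Delims`, `delimiter_pattern`, and `find_next_construct` are hypothetical names, and the cache mechanism Kida actually uses is not stated on this page.

```python
from __future__ import annotations

import re
from functools import lru_cache
from typing import NamedTuple


class Delims(NamedTuple):  # hypothetical, hashable stand-in for LexerConfig
    block_start: str = "{%"
    variable_start: str = "{{"
    comment_start: str = "{#"


@lru_cache(maxsize=None)
def delimiter_pattern(config: Delims) -> re.Pattern[str]:
    """Compile one alternation that matches any opening delimiter."""
    # Longer delimiters first so a short one never shadows a longer one.
    starts = sorted(set(config), key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in starts))


def find_next_construct(source: str, pos: int,
                        config: Delims = Delims()) -> tuple[str, int] | None:
    """Return (delimiter, index) of the next construct at or after pos."""
    match = delimiter_pattern(config).search(source, pos)
    return (match.group(), match.start()) if match else None
```

A single `search` call scans the text once, whereas three separate `str.find()` calls each scan it independently and the results still have to be compared.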

`__init__`

Initialize lexer with source code.

`def __init__(self, source: str, config: LexerConfig | None = None)`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `source` | `str` | Template source code |
| `config` | `LexerConfig \| None` | Lexer configuration (uses defaults if None). Default: `None` |

`_tokenize_data`

Tokenize raw data outside template constructs.

`def _tokenize_data(self) -> Iterator[Token]`

##### Returns

`Iterator[Token]`

`_tokenize_code`

Tokenize code inside {{ }} or {% %}.

`def _tokenize_code(self, end_delimiter: str, end_token_type: TokenType) -> Iterator[Token]`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `end_delimiter` | `str` | |
| `end_token_type` | `TokenType` | |

##### Returns

`Iterator[Token]`

`_tokenize_comment`

Skip comment content until closing delimiter.

`def _tokenize_comment(self) -> Iterator[Token]`

##### Returns

`Iterator[Token]`

`_next_code_token`

`def _next_code_token(self) -> Token`

Get the next token from code content.

Complexity: O(1) for operator lookup (dict-based).

##### Returns

`Token`

`_scan_string`

Scan a string literal with Python-style escape decoding.

`def _scan_string(self) -> Token`

##### Returns

`Token`

`_decode_string_escapes`

Decode Python-style escape sequences in string literal content.

`def _decode_string_escapes(self, raw: str, quote_char: str, lineno: int, col_offset: int) -> str`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `raw` | `str` | |
| `quote_char` | `str` | |
| `lineno` | `int` | |
| `col_offset` | `int` | |

##### Returns

`str`

`_scan_number`

`def _scan_number(self) -> Token`

Scan a number literal (integer or float).

Note: Special handling for range literals (1..10, 1...11).
If we see digits followed by '..' or '...', treat the digits as
an integer, not a float, so the range operator can be parsed.

##### Returns

`Token`
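The range-literal rule for `_scan_number` amounts to one extra character of lookahead. The sketch below is a hypothetical standalone scanner, not the method itself: it returns plain tuples instead of `Token` objects and handles only decimal digits.

```python
DIGITS = set("0123456789")


def scan_number(source: str, pos: int) -> tuple[str, str, int]:
    """Return (token_type, text, new_pos) for the number starting at pos."""
    end = pos
    while end < len(source) and source[end] in DIGITS:
        end += 1
    if end == pos:
        raise ValueError(f"no number at offset {pos}")
    token_type = "INTEGER"
    # Consume a fraction only when '.' is followed by a digit. In '1..10'
    # the '.' is followed by another '.', so the digits stay an INTEGER
    # and the range operator is left for the next token.
    if source[end:end + 1] == "." and source[end + 1:end + 2] in DIGITS:
        end += 1
        while end < len(source) and source[end] in DIGITS:
            end += 1
        token_type = "FLOAT"
    return token_type, source[pos:end], end
```

Slicing (`source[end:end + 1]`) rather than indexing avoids an `IndexError` when the number sits at the very end of the source.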

`_scan_name`

Scan a name or keyword.

`def _scan_name(self) -> Token`

##### Returns

`Token`

`_find_next_construct`

`def _find_next_construct(self) -> tuple[str, int] | None`

Find the next template construct ({{ }}, {% %}, or {# #}).

Uses a single compiled regex search instead of 3x str.find() calls.
The regex is cached per LexerConfig for O(1) subsequent lookups.

Performance: 5-24x faster than the previous str.find() approach
(validated in benchmarks/test_benchmark_lexer.py).

##### Returns

`tuple[str, int] | None`

`_emit_delimiter`

Emit a delimiter token and advance position.

`def _emit_delimiter(self, delimiter: str, token_type: TokenType) -> Token`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `delimiter` | `str` | |
| `token_type` | `TokenType` | |

##### Returns

`Token`

`_skip_whitespace`

Skip whitespace characters.

`def _skip_whitespace(self) -> None`

`_advance`

`def _advance(self, count: int) -> None`

Advance position by count characters, tracking line/column.

Optimized to use batch processing with count() for newline detection
instead of character-by-character iteration. Provides ~15-20% speedup
for templates with long DATA nodes.

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `count` | `int` | |
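The batch approach described for `_advance` can be sketched as a pure function. This is an assumption about the technique, not the method's actual body: `advance` is a hypothetical name, state is passed in and returned rather than stored on the instance, and `lineno` is taken as 1-based with a 0-based column.

```python
def advance(source: str, pos: int, lineno: int, col: int,
            count: int) -> tuple[int, int, int]:
    """Move forward by count characters and return (pos, lineno, col)."""
    chunk = source[pos:pos + count]
    newlines = chunk.count("\n")  # one C-level scan instead of a Python loop
    if newlines:
        lineno += newlines
        # The column restarts after the last newline in the chunk.
        col = len(chunk) - chunk.rfind("\n") - 1
    else:
        col += len(chunk)
    return pos + len(chunk), lineno, col
```

`str.count` and `str.rfind` run in C, so a long DATA run costs two fast scans rather than one Python-level iteration per character.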

## Functions

`tokenize`

Convenience function to tokenize source into a list.

`def tokenize(source: str, config: LexerConfig | None = None) -> list[Token]`

##### Parameters

| Name | Type | Description |
|------|------|-------------|
| `source` | `str` | Template source code |
| `config` | `LexerConfig \| None` | Optional lexer configuration. Default: `None` |

##### Returns

`list[Token]`