Module

lexers.markdown_sm

Hand-written Markdown lexer using a state-machine approach.

O(n) guaranteed, zero regex, thread-safe.

Language Support:

  • CommonMark syntax
  • Fenced code blocks (```) with language hints
  • Inline code (`code`)
  • Headers (# through ######)
  • Bold (**text**), italic (*text*), strikethrough (~~text~~)
  • Links [text](url) and images ![alt](url)
  • Blockquotes (>)
  • Horizontal rules (---, ***, ___)
  • Unordered lists (-, *, +) and ordered lists (1.)

Design Philosophy:

Markdown lexing is line-oriented. The lexer tracks at_line_start context to distinguish block-level elements (headers, lists, code blocks) from inline elements (bold, links).
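
A minimal sketch of this distinction, using only the entry points shown in the class example below (get_lexer, tokenize, Token.type); how the mid-line '#' is classified is left unchecked:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> src = "# Title\nSee issue #42 for details\n"
>>> types = {t.type.name for t in lexer.tokenize(src)}
>>> "GENERIC_HEADING" in types  # only the line-start '#' is a heading marker
True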

Fenced code blocks receive special handling: the content between ``` markers is yielded as a single token, preserving the language hint for potential nested highlighting.
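
A small sketch, assuming only the documented STRING token type for fenced content; the exact shape of the surrounding fence tokens is not asserted:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> fenced = "```python\nprint('hello')\n```\n"
>>> token_types = {t.type.name for t in lexer.tokenize(fenced)}
>>> "STRING" in token_types  # the fenced content surfaces as a STRING token
True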

Performance:

~50µs per 100-line file.
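
One rough way to check this figure locally (a sketch with a synthetic input, not part of the library's own benchmarks):

>>> import timeit
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> doc = "\n".join(f"- item {i}" for i in range(100))  # synthetic 100-line file
>>> per_run = timeit.timeit(lambda: list(lexer.tokenize(doc)), number=1_000) / 1_000
>>> # per_run holds the average seconds per file on the current machine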

Thread-Safety:

Uses only local variables in tokenize().
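
Since all state lives in locals, a single lexer instance can be shared across threads. A minimal sketch using only the standard library:

>>> from concurrent.futures import ThreadPoolExecutor
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")  # one shared instance, reused by every worker
>>> docs = ["# one", "## two", "### three"]
>>> with ThreadPoolExecutor(max_workers=3) as pool:
...     token_lists = list(pool.map(lambda d: list(lexer.tokenize(d)), docs))
>>> len(token_lists)
3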

See Also:

  • rosettes.lexers.rst_sm: reStructuredText lexer

Classes

MarkdownStateMachineLexer

Markdown lexer with CommonMark syntax support.

Line-oriented lexer that tracks block-level context for accurate tokenization of headers, lists, and code blocks.

Token Types:

  • GENERIC_HEADING: Headers (# through ######)
  • STRING: Fenced code blocks and inline code
  • GENERIC_STRONG: Bold text (**text**)
  • GENERIC_EMPH: Italic text (*text*)
  • NAME_TAG: Link/image markers and URLs

Example:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> tokens = list(lexer.tokenize("# Header"))
>>> tokens[0].type  # '#' is a heading marker
<TokenType.GENERIC_HEADING: 'gh'>

Methods

tokenize → Iterator[Token]

def tokenize(self, code: str, config: LexerConfig | None = None) -> Iterator[Token]

Parameters

  • code (str): Markdown source text to tokenize.
  • config (LexerConfig | None): Optional lexer configuration. Default: None.

Returns

Iterator[Token]
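
Because the result is a lazy Iterator[Token], it can be consumed in a streaming pass. A small usage sketch (config is left at its default, since LexerConfig's fields are not documented on this page):

>>> from collections import Counter
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> counts = Counter(t.type for t in lexer.tokenize("# Title\nSome **bold** text\n"))
>>> # counts maps each TokenType to its frequency; the token stream is consumed
>>> # in a single pass without building an intermediate list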