Module

lexers.markdown_sm

Hand-written Markdown lexer using a state-machine approach.

O(n) guaranteed, zero regex, thread-safe.

Language Support:

  • CommonMark syntax
  • Fenced code blocks (```) with language hints
  • Inline code (`code`)
  • Headers (# through ######)
  • Bold (**text**), italic (*text*), strikethrough (~~text~~)
  • Links [text](url) and images ![alt](url)
  • Blockquotes (>)
  • Horizontal rules (---, ***, ___)
  • Unordered lists (-, *, +) and ordered lists (1.)

Design Philosophy:

Markdown lexing is line-oriented. The lexer tracks at_line_start context to distinguish block-level elements (headers, lists, code blocks) from inline elements (bold, links).
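
A minimal sketch of this distinction, using only the entry points shown in the class example below (get_lexer, tokenize, Token.type); how the mid-line '#' is classified is left unchecked:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> src = "# Title\nSee issue #42 for details\n"
>>> types = {t.type.name for t in lexer.tokenize(src)}
>>> "GENERIC_HEADING" in types  # only the line-start '#' is a heading marker
True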

Fenced code blocks receive special handling: the content between ``` markers is yielded as a single token, preserving the language hint for potential nested highlighting.
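
A small sketch, assuming only the documented STRING token type for fenced content; the exact shape of the surrounding fence tokens is not asserted:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> fenced = "```python\nprint('hello')\n```\n"
>>> token_types = {t.type.name for t in lexer.tokenize(fenced)}
>>> "STRING" in token_types  # the fenced content surfaces as a STRING token
True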

Performance:

~50µs per 100-line file.
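
One rough way to check this figure locally (a sketch with a synthetic input, not part of the library's own benchmarks):

>>> import timeit
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> doc = "\n".join(f"- item {i}" for i in range(100))  # synthetic 100-line file
>>> per_run = timeit.timeit(lambda: list(lexer.tokenize(doc)), number=1_000) / 1_000
>>> # per_run holds the average seconds per file on the current machine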

Thread-Safety:

Uses only local variables in tokenize().
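
Since all state lives in locals, a single lexer instance can be shared across threads. A minimal sketch using only the standard library:

>>> from concurrent.futures import ThreadPoolExecutor
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")  # one shared instance, reused by every worker
>>> docs = ["# one", "## two", "### three"]
>>> with ThreadPoolExecutor(max_workers=3) as pool:
...     token_lists = list(pool.map(lambda d: list(lexer.tokenize(d)), docs))
>>> len(token_lists)
3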

See Also:

  • rosettes.lexers.rst_sm: reStructuredText lexer

Classes

MarkdownStateMachineLexer

Markdown lexer with CommonMark syntax support.

Line-oriented lexer that tracks block-level context for accurate tokenization of headers, lists, and code blocks.

Token Types:

  • GENERIC_HEADING: Headers (# through ######)
  • STRING: Fenced code blocks and inline code
  • GENERIC_STRONG: Bold text (**text**)
  • GENERIC_EMPH: Italic text (*text*)
  • NAME_TAG: Link/image markers and URLs

Example:

>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> tokens = list(lexer.tokenize("# Header"))
>>> tokens[0].type  # '#' is a heading marker
<TokenType.GENERIC_HEADING: 'gh'>

Methods

tokenize → Iterator[Token]

def tokenize(self, code: str, config: LexerConfig | None = None) -> Iterator[Token]

Parameters

  • code (str): Markdown source text to tokenize.
  • config (LexerConfig | None): Optional lexer configuration. Default: None.

Returns

Iterator[Token]
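
Because the result is a lazy Iterator[Token], it can be consumed in a streaming pass. A small usage sketch (config is left at its default, since LexerConfig's fields are not documented on this page):

>>> from collections import Counter
>>> from rosettes import get_lexer
>>> lexer = get_lexer("markdown")
>>> counts = Counter(t.type for t in lexer.tokenize("# Title\nSome **bold** text\n"))
>>> # counts maps each TokenType to its frequency; the token stream is consumed
>>> # in a single pass without building an intermediate list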