Module

lexers.html_sm

Hand-written HTML lexer using a state-machine approach.

O(n) guaranteed, zero regex, thread-safe.

Language Support:

  • HTML5 syntax
  • Comments (<!-- -->)
  • DOCTYPE declarations
  • Tags with attributes (quoted and unquoted values)
  • Self-closing tags
  • Script and style blocks (minimal handling)

Token Classification:

  • Tag names: NAME_TAG (div, span, p, etc.)
  • Attribute names: NAME_ATTRIBUTE (class, id, href, etc.)
  • Attribute values: STRING
  • Comments: COMMENT_MULTILINE
  • Text content: TEXT
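
The classification above can be illustrated with a minimal sketch of a single-pass, regex-free scanner. This is not the module's actual implementation: the `Token` shape and the `sketch_tokenize` name are hypothetical, and DOCTYPE, script, and style handling are left out. It only shows how a forward-moving cursor can emit the token types listed above.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    # Hypothetical token shape; the real Token type may differ.
    type: str
    value: str


def sketch_tokenize(code: str) -> Iterator[Token]:
    """Scan `code` once with a forward-only cursor (no regex)."""
    pos, n = 0, len(code)
    while pos < n:
        if code.startswith("<!--", pos):
            # Comment state: consume through the closing "-->".
            end = code.find("-->", pos + 4)
            end = n if end == -1 else end + 3
            yield Token("COMMENT_MULTILINE", code[pos:end])
            pos = end
        elif code[pos] == "<":
            # Tag state: optional "/", then the tag name.
            pos += 1
            if pos < n and code[pos] == "/":
                pos += 1
            start = pos
            while pos < n and (code[pos].isalnum() or code[pos] == "-"):
                pos += 1
            if pos > start:
                yield Token("NAME_TAG", code[start:pos])
            # Attribute state: names and values until ">".
            expect_value = False
            while pos < n and code[pos] != ">":
                ch = code[pos]
                if ch.isspace() or ch == "=" or (ch == "/" and not expect_value):
                    if ch == "=":
                        expect_value = True
                    pos += 1
                elif ch in "\"'":
                    # Quoted value: consume through the matching quote.
                    end = code.find(ch, pos + 1)
                    end = n if end == -1 else end + 1
                    yield Token("STRING", code[pos:end])
                    pos, expect_value = end, False
                else:
                    # Bare run: an attribute name, or an unquoted value after "=".
                    start = pos
                    while pos < n and not code[pos].isspace() and code[pos] not in "=>":
                        pos += 1
                    kind = "STRING" if expect_value else "NAME_ATTRIBUTE"
                    yield Token(kind, code[start:pos])
                    expect_value = False
            pos += 1  # step past ">"
        else:
            # Text state: everything up to the next "<".
            end = code.find("<", pos)
            end = n if end == -1 else end
            yield Token("TEXT", code[pos:end])
            pos = end


tokens = list(sketch_tokenize('<!-- c --><a href="x" id=y>hi</a>'))
```

Because the cursor only moves forward and every character is visited a bounded number of times, a scanner of this shape runs in time linear in the input length.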

Performance:

~45µs per 100-line file.

Thread-Safety:

Uses only local variables in tokenize().
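
As a rough illustration of why local-only state is enough for thread safety, the sketch below shares one instance of a stand-in lexer across a thread pool. `WordLexer` is a hypothetical placeholder, not the real class. The point is only that a `tokenize()` method keeping its cursor in local variables gives each call its own independent state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator


class WordLexer:
    """Hypothetical stand-in: tokenize() never touches instance attributes."""

    def tokenize(self, code: str) -> Iterator[str]:
        pos = 0  # local to this call, so each thread has its own cursor
        while pos < len(code):
            end = code.find(" ", pos)
            end = len(code) if end == -1 else end
            if end > pos:
                yield code[pos:end]
            pos = end + 1


shared = WordLexer()  # one instance used by every worker thread
inputs = ["<p> a </p>", "<div> b c </div>", "<span> d </span>"] * 50

with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(lambda s: list(shared.tokenize(s)), inputs))

sequential = [list(shared.tokenize(s)) for s in inputs]
```

If the cursor were stored on `self` instead, concurrent calls could overwrite each other's position and the parallel results would no longer be guaranteed to match the sequential ones.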

See Also:

  • rosettes.lexers.xml_sm: XML lexer (stricter syntax)
  • rosettes.lexers.css_sm: CSS lexer (for style content)

Classes

HtmlStateMachineLexer

HTML lexer with tag, attribute, and comment parsing.

Handles HTML5 syntax including comments, doctype, and tag attributes.

Methods

tokenize
def tokenize(self, code: str, config: LexerConfig | None = None) -> Iterator[Token]
Parameters

  Name    Type                 Description
  code    str
  config  LexerConfig | None   Default: None

Returns

  Iterator[Token]