Module

rendering.parsers.native_html

Native HTML parser for build-time validation and health checks.

This parser is used during bengal build for:

  • Health check validation (detecting unrendered directives, Jinja templates)
  • Text extraction from rendered HTML (excluding code blocks)
  • Performance-optimized alternative to BeautifulSoup4

Design:

  • Uses Python's stdlib html.parser (fast, zero dependencies)
  • Tracks state for code/script/style blocks to exclude from text extraction
  • Optimized for build-time validation, not complex DOM manipulation
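The state-tracking design listed above can be sketched roughly as follows. This is a minimal illustration built on the stdlib html.parser, not the module's actual source: the class name SketchTextExtractor, the _skip_depth counter, and the text() method are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class SketchTextExtractor(HTMLParser):
    """Hypothetical stand-in, not the real NativeHTMLParser."""

    # Tags whose contents are left out of the extracted text.
    SKIPPED_TAGS = frozenset({"code", "script", "style"})

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when no skipped tag is currently open.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse whitespace runs into single spaces and trim the ends.
        return " ".join("".join(self._chunks).split())


extractor = SketchTextExtractor()
extractor.feed("<p>Hello <code>world</code></p><script>var x = 1;</script>")
extractor.close()
extracted = extractor.text()
```

A depth counter rather than a boolean flag is used here so that nested skipped tags (for example a code element inside another code element) do not re-enable extraction too early.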

Performance:

  • ~5-10x faster than BeautifulSoup4 for text extraction
  • Suitable for high-volume build-time validation

Classes

NativeHTMLParser

Fast HTML parser for build-time validation and text extraction.

This parser is the production parser used during bengal build for health checks and validation. It's optimized for speed over features, using Python's stdlib html.parser without external dependencies.

Primary use cases:

  • Health check validation (unrendered directives, Jinja templates)
  • Text extraction for search indexing
  • Link validation and content analysis
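As an illustration of the health check use case, the sketch below scans extracted prose for leftover Jinja delimiters while ignoring code samples. It is an assumption about how such a check could look, not the project's implementation: ProseCollector, find_unrendered_jinja, and the JINJA_RESIDUE pattern are hypothetical names invented for this example.

```python
import re
from html.parser import HTMLParser

# Assumed pattern for leftover Jinja syntax: {{ ... }} or {% ... %}.
JINJA_RESIDUE = re.compile(r"\{\{.*?\}\}|\{%.*?%\}")


class ProseCollector(HTMLParser):
    """Hypothetical helper: collects text outside code/script/style."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("code", "script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("code", "script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def find_unrendered_jinja(html: str) -> list[str]:
    """Return Jinja-looking fragments found in the page's prose."""
    collector = ProseCollector()
    collector.feed(html)
    collector.close()
    return JINJA_RESIDUE.findall("".join(collector.parts))


page = "<p>Welcome {{ page.title }}</p><pre><code>{{ example }}</code></pre>"
leftovers = find_unrendered_jinja(page)
```

Excluding code blocks matters for this kind of check: documentation pages often show template syntax on purpose inside code samples, and flagging those would produce false positives.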

Performance:

  • ~5-10x faster than BeautifulSoup4 for text extraction
  • Zero external dependencies (uses stdlib only)

Example:

    >>> parser = NativeHTMLParser()
    >>> result = parser.feed("<p>Hello <code>world</code></p>")
    >>> result.get_text()
    'Hello'  # Code block excluded

Inherits from HTMLParser

Methods 6

handle_starttag
def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None

Handle opening tags.

Parameters 2
tag str
attrs list[tuple[str, str | None]]
handle_endtag
def handle_endtag(self, tag: str) -> None

Handle closing tags.

Parameters 1
tag str
handle_data
def handle_data(self, data: str) -> None

Handle text data.

Parameters 1
data str
feed
def feed(self, data: str) -> NativeHTMLParser

Parse HTML content and return self for chaining.

Parameters 1
data str
Returns

NativeHTMLParser

self, so that calls can be chained, as in parser.feed(html).get_text()
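The stdlib HTMLParser.feed returns None, so a subclass has to override it to make chaining possible. A minimal sketch of that override, using a hypothetical ChainableParser class and a tags_seen attribute that exist only for this example:

```python
from html.parser import HTMLParser


class ChainableParser(HTMLParser):
    """Hypothetical example class, not part of the documented module."""

    def __init__(self) -> None:
        super().__init__()
        self.tags_seen: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags_seen.append(tag)

    def feed(self, data: str) -> "ChainableParser":
        # Delegate the parsing to the stdlib, then return self
        # so the caller can chain another call on the result.
        super().feed(data)
        return self


tags = ChainableParser().feed("<ul><li>a</li><li>b</li></ul>").tags_seen
```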

get_text
def get_text(self) -> str

Get extracted text content (excluding code/script/style blocks).

Returns

str

Text content with whitespace normalized
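The reference does not say how whitespace is normalized. One common approach, shown here purely as an assumption, collapses every run of whitespace into a single space and trims both ends; normalize_whitespace is a hypothetical helper name.

```python
def normalize_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace
    # and discards empty strings, so joining with " " collapses the runs.
    return " ".join(text.split())


normalized = normalize_whitespace("  Hello \n\t world  ")
```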

reset
def reset(self) -> None

Reset parser state for reuse.
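One way a reset override can support reuse is sketched below. It relies on the fact that the stdlib HTMLParser.__init__ calls self.reset(), so custom state initialized inside reset is set up both at construction and on every later reset. ReusableCollector and its parts attribute are hypothetical names; the real class may manage its state differently.

```python
from html.parser import HTMLParser


class ReusableCollector(HTMLParser):
    """Hypothetical example class, not part of the documented module."""

    def reset(self) -> None:
        # Clear the stdlib parser's internal buffers first,
        # then reinitialize this subclass's own state.
        super().reset()
        self.parts: list[str] = []

    def handle_data(self, data):
        self.parts.append(data)


collector = ReusableCollector()  # __init__ calls reset(), creating self.parts

collector.feed("<p>first</p>")
collector.close()
first = "".join(collector.parts)

collector.reset()  # drop the previous document's text before reuse
collector.feed("<p>second</p>")
collector.close()
second = "".join(collector.parts)
```

Reusing one instance this way avoids constructing a new parser per page, which can matter when many pages are validated in a single build.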

Internal Methods 1
__init__
def __init__(self) -> None