Classes
NativeHTMLParser
Fast HTML parser for build-time validation and text extraction.
This parser is the production parser used during bengal build for health
checks and validation. It's optimized for speed over features, using Python's
stdlib html.parser without external dependencies.
Primary use cases:
- Health check validation (unrendered directives, Jinja templates)
- Text extraction for search indexing
- Link validation and content analysis
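As an illustration of the health-check use case, text returned by get_text() could be scanned for template syntax that survived rendering. This is only a sketch of what such a check might look like, not Bengal's actual validator; the find_unrendered helper and its pattern are hypothetical.

```python
import re

# Hypothetical pattern: Jinja-style delimiters that should not appear in rendered output.
UNRENDERED_JINJA = re.compile(r"\{\{.*?\}\}|\{%.*?%\}")


def find_unrendered(text: str) -> list[str]:
    """Return every Jinja-like fragment still present in extracted page text."""
    return UNRENDERED_JINJA.findall(text)
```

Because the documented get_text() excludes code blocks, a check like this would not flag template syntax that is deliberately shown inside code samples.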
Performance:
- ~5-10x faster than BeautifulSoup4 for text extraction
- Zero external dependencies (uses stdlib only)
Example:
    >>> parser = NativeHTMLParser()
    >>> result = parser.feed("<p>Hello <code>world</code></p>")
    >>> result.get_text()  # code block excluded
    'Hello'
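The behavior described above could be reproduced with a small subclass of the stdlib html.parser.HTMLParser. The following is a minimal sketch, not the actual NativeHTMLParser source: the class name SketchTextParser, the skip-depth counter, and the exact set of skipped tags are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SketchTextParser(HTMLParser):
    """Hypothetical stand-in; not the actual NativeHTMLParser implementation."""

    # Assumption: these are the tags whose contents get_text() leaves out.
    SKIPPED_TAGS = frozenset({"code", "script", "style"})

    def reset(self) -> None:
        # HTMLParser.__init__ calls reset(), so state set here exists from the start.
        super().reset()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag) -> None:
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data) -> None:
        # Keep text only while outside every skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def feed(self, data):
        super().feed(data)
        return self  # allows feed(...).get_text()

    def get_text(self) -> str:
        # Collapse all whitespace runs into single spaces.
        return " ".join("".join(self._chunks).split())
```

A depth counter rather than a boolean flag keeps nested skipped elements (for example a code element inside another code element) from re-enabling text capture too early.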
Inherits from HTMLParser
Methods 6
handle_starttag
def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None
Handle opening tags.
Parameters 2
- tag (str)
- attrs (list[tuple[str, str | None]])
handle_endtag
def handle_endtag(self, tag: str) -> None
Handle closing tags.
Parameters 1
- tag (str)
handle_data
def handle_data(self, data: str) -> None
Handle text data.
Parameters 1
- data (str)
feed
def feed(self, data: str) -> NativeHTMLParser
Parse HTML content and return self for chaining.
Parameters 1
- data (str)
Returns
NativeHTMLParser: self, so calls can be chained, e.g. parser.feed(html).get_text()
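The stdlib HTMLParser.feed() returns None, so returning self requires an override. A minimal sketch of that pattern follows; ChainableParser is a hypothetical class used only to show the idea and does not reflect the real implementation.

```python
from html.parser import HTMLParser


class ChainableParser(HTMLParser):
    """Hypothetical minimal parser illustrating a feed() that returns self."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)

    def feed(self, data: str) -> "ChainableParser":
        super().feed(data)  # the stdlib method itself returns None
        return self


# Construct, parse, and read the result in one expression.
text = "".join(ChainableParser().feed("<p>one</p><p>two</p>").chunks)
```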
get_text
def get_text(self) -> str
Get extracted text content (excluding code/script/style blocks).
Returns
str: Text content with whitespace normalized
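The reference does not say how whitespace is normalized. One common approach, shown here purely as an assumption, collapses every run of whitespace into a single space and trims both ends; normalize_whitespace is a hypothetical helper name.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty strings.
    return " ".join(text.split())
```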
reset
def reset(self) -> None
Reset parser state for reuse.
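For a subclass of the stdlib HTMLParser, a reset() override generally needs to call super().reset() and also clear any state the subclass accumulates. The sketch below illustrates reuse with a hypothetical TagCounter class; what state the real parser holds is not documented here.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical parser that counts start tags, used to illustrate reset()."""

    def reset(self) -> None:
        super().reset()  # clears the stdlib parser's internal buffer and position
        self.count = 0   # clear subclass state as well

    def handle_starttag(self, tag, attrs) -> None:
        self.count += 1


counter = TagCounter()  # HTMLParser.__init__ calls reset(), so count starts at 0
counter.feed("<p><b>x</b></p>")
first = counter.count   # tags seen in the first document
counter.reset()
counter.feed("<i>y</i>")
second = counter.count  # tags seen in the second document only
```

Without the reset() call between documents, the second count would include tags from the first.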
Internal Methods 1
__init__
def __init__(self) -> None