Classes
AssetExtractorParser
HTML parser for extracting asset references from rendered content.
Inherits from HTMLParser
Methods 5
handle_starttag
def handle_starttag(self, tag: str, attrs: list[tuple[str, str | None]]) -> None
Extract asset references from opening tags.
Handles:
- <img src>, <img srcset>
- <script src>
- <link href>
- <source srcset>
- <iframe src>
- <picture> with sources
Parameters 2
| Name | Type | Default | Description |
|---|---|---|---|
| tag | str | — | |
| attrs | list[tuple[str, str \| None]] | — | |
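The tag handling listed above can be sketched with the standard library's `html.parser.HTMLParser`. This is only an illustration under stated assumptions, not the actual implementation: the `URL_ATTRS` mapping, the `StartTagSketch` class, and the naive comma split of `srcset` are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical tag -> attribute mapping mirroring the "Handles" list;
# the real parser may organise this differently.
URL_ATTRS = {
    "img": ("src", "srcset"),
    "script": ("src",),
    "link": ("href",),
    "source": ("srcset",),
    "iframe": ("src",),
}


class StartTagSketch(HTMLParser):
    """Illustrative stand-in, not the real AssetExtractorParser."""

    def __init__(self) -> None:
        super().__init__()
        self.found: set[str] = set()

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag, ())
        for name, value in attrs:
            # attrs values can be None for bare attributes such as `async`
            if name not in wanted or not value:
                continue
            if name == "srcset":
                # srcset holds comma-separated "url descriptor" candidates;
                # a plain split is naive (URLs containing commas would break it)
                for candidate in value.split(","):
                    parts = candidate.split()
                    if parts:
                        self.found.add(parts[0])
            else:
                self.found.add(value)


sketch = StartTagSketch()
sketch.feed(
    '<img src="a.png" srcset="a.png 1x, a@2x.png 2x">'
    '<script src="app.js"></script>'
)
print(sorted(sketch.found))  # ['a.png', 'a@2x.png', 'app.js']
```

`<picture>` needs no entry of its own in this sketch because its URLs live on the nested `<source>` and `<img>` tags.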
handle_endtag
def handle_endtag(self, tag: str) -> None
Handle closing tags.
Parameters 1
| Name | Type | Default | Description |
|---|---|---|---|
| tag | str | — | |
handle_data
def handle_data(self, data: str) -> None
Extract @import URLs from style tag content.
Handles:
- @import url('...')
- @import url("...")
- @import url(...) (without quotes)
Parameters 1
| Name | Type | Default | Description |
|---|---|---|---|
| data | str | — | |
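One way the three `@import` forms could be matched is a single regular expression applied only while the parser is inside a `<style>` element. The sketch below assumes that approach; `IMPORT_RE`, `StyleImportSketch`, and the `in_style` flag are hypothetical names, and the real method may track state differently.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: optional single or double quote around the URL,
# optional whitespace inside the parentheses.
IMPORT_RE = re.compile(r"""@import\s+url\(\s*['"]?([^'")]+?)['"]?\s*\)""")


class StyleImportSketch(HTMLParser):
    """Illustrative stand-in showing style-scoped @import extraction."""

    def __init__(self) -> None:
        super().__init__()
        self.in_style = False
        self.imports: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Text outside <style> is ignored, so prose that merely
        # mentions "@import" is not treated as an asset reference.
        if self.in_style:
            self.imports.update(IMPORT_RE.findall(data))


style_sketch = StyleImportSketch()
style_sketch.feed(
    "<style>@import url('a.css'); @import url(\"b.css\"); "
    "@import url(c.css);</style><p>@import url(d.css)</p>"
)
print(sorted(style_sketch.imports))  # ['a.css', 'b.css', 'c.css']
```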
feed
def feed(self, data: str) -> AssetExtractorParser
Parse HTML and return self for chaining.
Parameters 1
| Name | Type | Default | Description |
|---|---|---|---|
| data | str | — | |
Returns
AssetExtractorParser: self, so that calls can be chained, e.g. AssetExtractorParser().feed(html).get_assets()
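The standard `HTMLParser.feed` returns `None`, so chaining requires an override that delegates to the base class and then returns `self`. A minimal sketch of that pattern follows; `ChainableSketch` and its `tags` list are hypothetical and exist only to make the chaining visible.

```python
from html.parser import HTMLParser


class ChainableSketch(HTMLParser):
    """Illustrative stand-in showing a feed() that returns self."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def feed(self, data: str) -> "ChainableSketch":
        super().feed(data)  # the base implementation does the parsing
        return self         # returning self enables method chaining


# Construct, parse, and read the result in one expression.
tags = ChainableSketch().feed("<p><img src='x.png'></p>").tags
print(tags)  # ['p', 'img']
```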
get_assets
def get_assets(self) -> set[str]
Get all extracted asset URLs.
Filters out empty strings and returns a normalized set.
Returns
set[str]: Set of asset URLs/paths
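The page does not say what "normalized" covers. The sketch below assumes it means trimming surrounding whitespace and deduplicating through a set; `normalize_assets` is a hypothetical helper, and the real method may apply different or additional rules.

```python
from collections.abc import Iterable


def normalize_assets(raw: Iterable[str]) -> set[str]:
    """Hypothetical helper: trim whitespace, drop empties, deduplicate."""
    cleaned = (url.strip() for url in raw)
    # Empty strings are falsy, so the filter removes them after trimming.
    return {url for url in cleaned if url}


collected = ["/css/site.css", "", "  /img/logo.png ", "/css/site.css"]
print(sorted(normalize_assets(collected)))  # ['/css/site.css', '/img/logo.png']
```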
Internal Methods 1
__init__
def __init__(self) -> None
Initialize the asset extractor parser.
Functions
extract_assets_from_html
def extract_assets_from_html(html_content: str) -> set[str]
Extract all asset references from rendered HTML.
Parameters 1
| Name | Type | Default | Description |
|---|---|---|---|
| html_content | str | — | Rendered HTML content |
Returns
set[str]: Set of asset URLs/paths referenced in the HTML
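A convenience function of this shape is typically a thin wrapper that builds a parser, feeds it the document, and returns the collected set. The sketch below assumes that structure. `_MiniExtractor` and `extract_assets_sketch` are hypothetical stand-ins, and the stand-in parser only looks at `src` and `href` attributes rather than the full set of cases the real class handles.

```python
from html.parser import HTMLParser


class _MiniExtractor(HTMLParser):
    """Hypothetical stand-in: collects src/href values from any tag."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.add(value)


def extract_assets_sketch(html_content: str) -> set[str]:
    """One-call wrapper: create a parser, feed it, return the set."""
    parser = _MiniExtractor()
    parser.feed(html_content)
    parser.close()  # flush any buffered input
    return parser.urls


page = (
    '<link href="/css/site.css" rel="stylesheet">'
    '<img src="/img/logo.png"><a name="top"></a>'
)
assets = extract_assets_sketch(page)
print(sorted(assets))  # ['/css/site.css', '/img/logo.png']
```

Returning a set means repeated references to the same file collapse into a single entry, which suits checks such as "does every referenced asset exist on disk".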