Module

core.page.content

Page Content Mixin - AST-based content representation.

This module provides the true AST architecture for content processing, replacing the misleading parsed_ast field (which actually contains HTML).

Architecture:

  • _ast: True AST from parser (list of tokens) - Phase 3
  • html: HTML rendered from AST (or legacy parsed_ast)
  • plain_text: Plain text for search/LLM (AST walk or raw markdown)

Benefits:

  • Parse once, use many times
  • Faster post-processing (O(n) AST walks vs regex)
  • Cleaner transformations (shortcodes at AST level)
  • Better caching (cache AST separately from HTML)

Migration Plan:

Phase 1: Add html, plain_text properties (non-breaking)
Phase 2: Deprecate parsed_ast
Phase 3: Implement true AST with hybrid fallback

See: plan/active/rfc-content-ast-architecture.md

Classes

PageContentMixin

Mixin providing AST-based content properties for pages.

This mixin handles content representation across multiple formats:

  • AST (Abstract Syntax Tree) - structural representation (Phase 3)
  • HTML - rendered for display
  • Plain text - for search indexing and LLM

All properties use lazy evaluation with caching for performance.
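The lazy-evaluation-with-caching pattern can be sketched as follows. This is a minimal illustration, not the mixin's actual code: only the `_plain_text_cache` attribute name comes from the attribute list on this page, while `CachedTextMixin`, `DemoPage`, `_compute_plain_text`, and the whitespace-collapsing computation are hypothetical stand-ins.

```python
import re
from typing import Optional


class CachedTextMixin:
    """Hypothetical stand-in showing a lazily computed, cached property."""

    content: str
    _plain_text_cache: Optional[str] = None

    @property
    def plain_text(self) -> str:
        # Compute on first access only; later reads return the cached value.
        if self._plain_text_cache is None:
            self._plain_text_cache = self._compute_plain_text()
        return self._plain_text_cache

    def _compute_plain_text(self) -> str:
        # Placeholder computation: collapse whitespace runs in the raw content.
        return re.sub(r"\s+", " ", self.content).strip()


class DemoPage(CachedTextMixin):
    """Hypothetical page class that counts how often the value is computed."""

    def __init__(self, content: str) -> None:
        self.content = content
        self.compute_calls = 0

    def _compute_plain_text(self) -> str:
        self.compute_calls += 1
        return super()._compute_plain_text()


page = DemoPage("Hello\n\n  world")
first = page.plain_text
second = page.plain_text  # served from the cache, no recomputation
```

A cache like this has to be reset (set back to `None`) whenever `content` changes; how the real mixin invalidates its caches is not stated here.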

Attributes

Name               Type                          Description
content            str
parsed_ast         Any
links              list[str]
_ast_cache         list[dict[str, Any]] | None
_html_cache        str | None
_plain_text_cache  str | None

Methods 3

ast property
def ast(self) -> list[dict[str, Any]] | None

True AST - list of tokens from markdown parser.

Returns the structural representation of content as parsed by the markdown engine. This enables efficient multi-output generation:

  • HTML rendering
  • Plain text extraction
  • TOC generation
  • Link extraction

Returns

list[dict[str, Any]] | None

List of AST tokens if available, None if parser doesn't support AST.
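As one illustration of multi-output generation, the sketch below derives a table of contents from a token list. The token shape is an assumption modelled on Mistune 3.x output (a `type` key, inline `children`, text in `raw`, and the heading level in `attrs.level`); `inline_text`, `build_toc`, and `sample_ast` are hypothetical names, not part of this module.

```python
from typing import Any


def inline_text(token: dict[str, Any]) -> str:
    """Concatenate the text of a token and all of its inline children."""
    if "children" in token:
        return "".join(inline_text(child) for child in token["children"])
    return token.get("raw", "")


def build_toc(tokens: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every top-level heading token."""
    return [
        (tok["attrs"]["level"], inline_text(tok))
        for tok in tokens
        if tok.get("type") == "heading"
    ]


# Hand-written tokens in the assumed shape; no parser is invoked here.
sample_ast = [
    {"type": "heading", "attrs": {"level": 1},
     "children": [{"type": "text", "raw": "Guide"}]},
    {"type": "paragraph", "children": [{"type": "text", "raw": "Intro."}]},
    {"type": "heading", "attrs": {"level": 2},
     "children": [{"type": "text", "raw": "Install "},
                  {"type": "codespan", "raw": "pip"}]},
]

toc = build_toc(sample_ast)
```

Because the walk touches each token once, the same pass could in principle feed several outputs without re-parsing the markdown.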

html property
def html(self) -> str

HTML content rendered from AST or legacy parser.

This is the preferred way to access rendered HTML content. Use this instead of the deprecated parsed_ast field.

Returns

str

Rendered HTML string
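The hybrid fallback (render from the AST when it exists, otherwise use the HTML already stored in the legacy field) might look roughly like this. `resolve_html` and its `render` callback are hypothetical; the real property presumably also caches its result in `_html_cache`, which this sketch leaves out.

```python
from typing import Any, Callable, Optional


def resolve_html(
    ast_tokens: Optional[list[dict[str, Any]]],
    parsed_ast: Any,
    render: Callable[[list[dict[str, Any]]], str],
) -> str:
    """Prefer the true AST; fall back to the legacy HTML in parsed_ast."""
    if ast_tokens is not None:
        return render(ast_tokens)
    # Legacy path: parsed_ast is typed Any, so guard against non-string values.
    return parsed_ast if isinstance(parsed_ast, str) else ""


# A trivial stand-in renderer, used only to show which branch runs.
def fake_render(tokens: list[dict[str, Any]]) -> str:
    return "<!-- rendered from %d tokens -->" % len(tokens)


from_ast = resolve_html([{"type": "paragraph"}], "<p>old</p>", fake_render)
from_legacy = resolve_html(None, "<p>old</p>", fake_render)
```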

plain_text property
def plain_text(self) -> str

Plain text extracted from content (for search/LLM).

Strips HTML tags from rendered content to get clean text. Uses the rendered HTML (which includes directive output) for accuracy.

Returns

str

Plain text content with HTML tags removed

Internal Methods 4
_render_ast_to_html
def _render_ast_to_html(self) -> str

Render AST tokens to HTML.

Internal method used when true AST is available (Phase 3).

Returns

str

Rendered HTML string

_extract_text_from_ast
def _extract_text_from_ast(self) -> str

Extract plain text from AST tokens.

Walks the AST tree and extracts all text content, ignoring structural elements like code blocks.

Returns

str

Plain text string
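A recursive walk of this kind could be sketched as below. It assumes Mistune 3.x-style tokens (text in `raw`, nesting in `children`, fenced or indented code as `block_code`); `extract_text`, `_walk`, and `SKIPPED_TYPES` are hypothetical names, and the choice to emit one line per top-level block is an assumption of this sketch.

```python
from typing import Any, Iterator

# Token types whose content is left out of the plain text (assumed names).
SKIPPED_TYPES = {"block_code"}
# Inline break tokens carry no text of their own; treat them as a space.
BREAK_TYPES = {"softbreak", "linebreak"}


def _walk(tokens: list[dict[str, Any]]) -> Iterator[str]:
    """Yield text fragments in document order, skipping code blocks."""
    for tok in tokens:
        kind = tok.get("type")
        if kind in SKIPPED_TYPES:
            continue
        if kind in BREAK_TYPES:
            yield " "
        elif tok.get("children") is not None:
            yield from _walk(tok["children"])
        elif "raw" in tok:
            yield tok["raw"]


def extract_text(tokens: list[dict[str, Any]]) -> str:
    """Join the text of each top-level block, one block per line."""
    lines = ("".join(_walk([block])) for block in tokens)
    return "\n".join(line for line in lines if line)


sample_ast = [
    {"type": "paragraph", "children": [
        {"type": "text", "raw": "Hello "},
        {"type": "strong", "children": [{"type": "text", "raw": "world"}]},
    ]},
    {"type": "blank_line"},
    {"type": "block_code", "raw": "print(1)\n"},
    {"type": "paragraph", "children": [{"type": "text", "raw": "Bye"}]},
]

text = extract_text(sample_ast)
```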

_extract_links_from_ast
def _extract_links_from_ast(self) -> list[str]

Extract links from AST tokens.

Walks the AST tree and extracts all link URLs (Phase 3). Handles Mistune 3.x AST format where URLs are in attrs.url.

Returns

list[str]

List of link URLs
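The link walk can be sketched in a few lines. The `attrs.url` location is taken from the description above; the rest of the token shape (`type == "link"`, nesting under `children`) is an assumption about Mistune 3.x output, and `extract_links` is a hypothetical free function rather than the method itself.

```python
from typing import Any


def extract_links(tokens: list[dict[str, Any]]) -> list[str]:
    """Collect link URLs in document order, descending into children."""
    links: list[str] = []
    for tok in tokens:
        if tok.get("type") == "link":
            url = (tok.get("attrs") or {}).get("url")
            if url:
                links.append(url)
        # Links can sit inside paragraphs, list items, emphasis, and so on.
        links.extend(extract_links(tok.get("children") or []))
    return links


sample_ast = [
    {"type": "paragraph", "children": [
        {"type": "text", "raw": "See "},
        {"type": "link", "attrs": {"url": "https://example.com/docs"},
         "children": [{"type": "text", "raw": "the docs"}]},
    ]},
    {"type": "list", "children": [
        {"type": "list_item", "children": [
            {"type": "block_text", "children": [
                {"type": "link", "attrs": {"url": "/about/"},
                 "children": [{"type": "text", "raw": "About"}]},
            ]},
        ]},
    ]},
]

links = extract_links(sample_ast)
```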

_strip_html_to_text
def _strip_html_to_text(self, html: str) -> str

Strip HTML tags from content to get plain text.

Fallback method when AST is not available.

Parameters 1
html str

HTML content

Returns

str

Plain text with HTML tags removed
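How the tags are stripped is not specified on this page. One way to do it with only the standard library is `html.parser.HTMLParser`, as in the sketch below; `strip_html_to_text` and `_TextCollector` are hypothetical names, and skipping `<script>`/`<style>` content is an added assumption.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html_to_text(html: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting words that span inline tags (e.g. "<b>un</b>done").
    return re.sub(r"\s+", " ", " ".join(collector.parts)).strip()


text = strip_html_to_text(
    "<h1>Title</h1><p>Fish &amp; <em>chips</em></p><script>x()</script>"
)
```

A parser-based approach decodes entities such as `&amp;` and copes with attributes containing `>`, which a simple tag-matching regex would not.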