Module

discovery.content_discovery

Content discovery - finds and organizes pages and sections.

Robustness:

  • Symlink loop detection via inode tracking to prevent infinite recursion
  • Content collection validation (opt-in via collections.py)

Classes

ContentDiscovery

Discovers and organizes content files into pages and sections.

Notes:

  • YAML errors in front matter are logged at debug level rather than raised; discovery falls back to the raw content and synthesizes minimal metadata to keep the build progressing.
  • UTF-8 BOM is stripped at read time by bengal.utils.file_io.read_text_file to avoid confusing the YAML/front matter parser.
  • The i18n dir-prefix strategy is supported (e.g., content/en/...); hidden files/dirs are skipped, except _index.md.
  • Parsing uses a thread pool for concurrency; unchanged pages can be represented as PageProxy in lazy modes.
  • Symlink loops are detected via inode tracking to prevent infinite recursion.
  • Content collections: When collections.py is present at project root, frontmatter is validated against schemas during discovery (fail fast).
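The BOM note above can be illustrated with a small stand-in reader. This is only a sketch and not the implementation of bengal.utils.file_io.read_text_file; the helper names read_text_without_bom and demo_bom_stripping are hypothetical. It relies on Python's built-in utf-8-sig codec, which drops a leading BOM when one is present.

```python
import tempfile
from pathlib import Path


def read_text_without_bom(path: Path) -> str:
    """Hypothetical stand-in: read UTF-8 text, dropping a leading BOM if present."""
    # "utf-8-sig" consumes a leading U+FEFF; plain "utf-8" would keep it.
    return path.read_text(encoding="utf-8-sig")


def demo_bom_stripping() -> tuple[str, str]:
    """Write a BOM-prefixed file and read it back with both codecs."""
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "page.md"
        page.write_bytes(b"\xef\xbb\xbf---\ntitle: Hello\n---\nBody\n")
        raw = page.read_text(encoding="utf-8")
        clean = read_text_without_bom(page)
    return raw, clean
```

With the BOM left in place, the first line is no longer exactly `---`, which is the kind of mismatch that can confuse a front matter parser.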

Methods 1

discover
def discover(self, use_cache: bool = False, cache: Any | None = None) -> tuple[list[Section], list[Page]]

Discover all content in the content directory.

Supports optional lazy loading with PageProxy for incremental builds.

Parameters 2
use_cache bool

Whether to use PageDiscoveryCache for lazy loading

cache Any | None

PageDiscoveryCache instance (if use_cache=True)

Returns

tuple[list[Section], list[Page]]

Tuple of (sections, pages)

Internal Methods 13
__init__
def __init__(self, content_dir: Path, site: Any | None = None) -> None

Initialize content discovery.

Parameters 2
content_dir Path

Root content directory

site Any | None

Optional Site reference for configuration access

_discover_full
def _discover_full(self) -> tuple[list[Section], list[Page]]

Full discovery (current behavior) - discover all pages completely.

Returns

tuple[list[Section], list[Page]]

Tuple of (sections, pages)

_discover_with_cache
def _discover_with_cache(self, cache: Any) -> tuple[list[Section], list[Page]]

Discover content with lazy loading from cache.

Uses PageProxy for unchanged pages (metadata only) and parses changed pages.

Parameters 1
cache Any

PageDiscoveryCache instance

Returns

tuple[list[Section], list[Page]]

Tuple of (sections, pages) with mixed Page and PageProxy objects
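The mix of parsed pages and proxies can be pictured with a minimal sketch. The classes FullPage and LazyPageProxy and the function discover_mixed are hypothetical illustrations; the real PageProxy and PageDiscoveryCache interfaces are not described here and may differ. The idea shown is only that unchanged pages carry cached metadata and defer parsing until content is requested.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional, Union


@dataclass
class FullPage:
    """Hypothetical fully parsed page."""
    source_path: str
    metadata: dict[str, Any]
    content: str


@dataclass
class LazyPageProxy:
    """Hypothetical proxy: serves cached metadata and parses on first content access."""
    source_path: str
    metadata: dict[str, Any]
    loader: Callable[[str], FullPage]
    _page: Optional[FullPage] = field(default=None, repr=False)

    @property
    def content(self) -> str:
        if self._page is None:
            # First access: pay the parsing cost now and remember the result.
            self._page = self.loader(self.source_path)
        return self._page.content


def discover_mixed(
    paths: list[str],
    cached: dict[str, dict[str, Any]],
    changed: set[str],
    loader: Callable[[str], FullPage],
) -> list[Union[FullPage, LazyPageProxy]]:
    """Return proxies for unchanged cached pages and parsed pages for the rest."""
    pages: list[Union[FullPage, LazyPageProxy]] = []
    for path in paths:
        if path in cached and path not in changed:
            pages.append(LazyPageProxy(path, cached[path], loader))
        else:
            pages.append(loader(path))
    return pages
```

In this sketch the loader runs immediately for new or changed files and at most once per proxy, on first access to content.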

_cache_is_valid
def _cache_is_valid(self, page: Page, cached_metadata: Any) -> bool

Check if cached metadata is still valid for a page.

Parameters 2
page Page

Discovered page

cached_metadata Any

Cached metadata from PageDiscoveryCache

Returns

bool

True if cache is valid and can be used (unchanged page)

_walk_directory
def _walk_directory(self, directory: Path, parent_section: Section, current_lang: str | None = None) -> None

Recursively walk a directory to discover content.

Uses inode tracking to detect and skip symlink loops.

Parameters 3
directory Path

Directory to walk

parent_section Section

Parent section to add content to

current_lang str | None
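Inode tracking during a recursive walk can be sketched as follows. This is an illustrative stand-in, not the body of _walk_directory: the names walk_without_loops and demo_symlink_loop are hypothetical, and the sketch collects plain file paths instead of building Section and Page objects. Each directory is identified by its (st_dev, st_ino) pair, and a directory whose identity has already been seen is skipped.

```python
import os
import tempfile
from pathlib import Path


def walk_without_loops(root: Path) -> list[Path]:
    """Return files under root, following symlinks but never re-entering a directory."""
    visited: set[tuple[int, int]] = set()
    files: list[Path] = []

    def walk(directory: Path) -> None:
        try:
            info = directory.stat()  # follows symlinks, so a link shares its target's identity
        except OSError:
            return  # unreadable or dangling entry: skip it
        key = (info.st_dev, info.st_ino)
        if key in visited:
            return  # already walked via another path: a loop or a duplicate link
        visited.add(key)
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                walk(entry)
            elif entry.is_file():
                files.append(entry)

    walk(root)
    return files


def demo_symlink_loop() -> list[str]:
    """Build content/docs/page.md plus a symlink content/docs/loop -> content."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "content"
        (root / "docs").mkdir(parents=True)
        (root / "docs" / "page.md").write_text("hello", encoding="utf-8")
        os.symlink(root, root / "docs" / "loop", target_is_directory=True)
        return [p.name for p in walk_without_loops(root)]
```

Without the visited set, the loop link would send the walk back to the root indefinitely; with it, the demo terminates and reports page.md once.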

_is_content_file
def _is_content_file(self, file_path: Path) -> bool

Check if a file is a content file.

Parameters 1
file_path Path

Path to check

Returns

bool

True if it's a content file

_validate_against_collection
def _validate_against_collection(self, file_path: Path, metadata: dict[str, Any]) -> dict[str, Any]

Validate frontmatter against collection schema if applicable.

Parameters 2
file_path Path

Path to content file

metadata dict[str, Any]

Parsed frontmatter metadata

Returns

dict[str, Any]

Validated metadata (possibly with schema-enforced defaults)
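Schema validation with enforced defaults might look like the sketch below. Everything in it is an assumption made for illustration: the BlogPostSchema dataclass, the validate_metadata helper, and the choice to reject unknown keys are hypothetical and do not describe how collections.py or CollectionConfig actually define or apply schemas. The sketch checks only field names and required fields, not value types.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any


@dataclass
class BlogPostSchema:
    """Hypothetical schema: title is required, the other fields have defaults."""
    title: str
    draft: bool = False
    tags: list[str] = field(default_factory=list)


def validate_metadata(metadata: dict[str, Any], schema: type) -> dict[str, Any]:
    """Return metadata with schema defaults filled in, or raise ValueError."""
    known = {f.name for f in fields(schema)}
    unknown = sorted(set(metadata) - known)
    if unknown:
        # Assumption for this sketch: unknown keys are treated as errors.
        raise ValueError(f"unknown frontmatter keys: {unknown}")
    try:
        instance = schema(**metadata)
    except TypeError as exc:
        # Raised by the dataclass constructor when a required field is missing.
        raise ValueError(f"frontmatter does not match schema: {exc}") from exc
    return asdict(instance)
```

Raising at this point is one way to get the fail-fast behavior: a bad file stops discovery instead of surfacing later during rendering.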

_get_collection_for_file
def _get_collection_for_file(self, file_path: Path) -> tuple[str | None, CollectionConfig[Any] | None]

Find which collection a file belongs to based on its path.

Parameters 1
file_path Path

Path to content file

Returns

tuple[str | None, CollectionConfig[Any] | None]

Tuple of (collection_name, CollectionConfig) or (None, None)
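A path-based lookup of this kind can be sketched as below. The function find_collection and the mapping of collection names to directories are hypothetical; a plain Path stands in for CollectionConfig, and preferring the most specific (deepest) matching directory is an assumption of this sketch, not documented behavior.

```python
from pathlib import Path
from typing import Optional


def find_collection(
    file_path: Path,
    content_dir: Path,
    collections: dict[str, Path],
) -> tuple[Optional[str], Optional[Path]]:
    """Return (name, directory) of the deepest collection containing file_path."""
    try:
        relative = file_path.relative_to(content_dir)
    except ValueError:
        return None, None  # file is outside the content directory

    best: Optional[tuple[str, Path]] = None
    for name, directory in collections.items():
        # relative.parents lists every ancestor directory of the file.
        if directory in relative.parents:
            if best is None or len(directory.parts) > len(best[1].parts):
                best = (name, directory)
    return best if best is not None else (None, None)
```

Returning (None, None) for files outside every collection mirrors the documented return shape and lets the caller skip validation for those files.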

_create_page
def _create_page(self, file_path: Path, current_lang: str | None = None, section: Section | None = None) -> Page

Create a Page object from a file with robust error handling.

Handles:

  • Valid frontmatter
  • Invalid YAML in frontmatter
  • Missing frontmatter
  • File encoding issues
  • IO errors
  • Collection schema validation (when collections are defined)

Parameters 3
file_path Path

Path to content file

current_lang str | None

section Section | None

Returns

Page

Page object (always succeeds with fallback metadata)

_parse_content_file
def _parse_content_file(self, file_path: Path) -> tuple[str, dict[str, Any]]

Parse content file with robust error handling.

Caches raw content in BuildContext for later use by validators, eliminating redundant disk I/O during health checks.

Parameters 1
file_path Path

Path to content file

Returns

tuple[str, dict[str, Any]]

Tuple of (content, metadata)

_extract_content_skip_frontmatter
def _extract_content_skip_frontmatter(self, file_content: str) -> str

Extract content, skipping broken frontmatter section.

Frontmatter sits between --- delimiters at the start of the file. If parsing fails, the whole section is skipped.

Parameters 1
file_content str

Full file content

Returns

str

Content without frontmatter section
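Skipping a broken frontmatter block can be sketched with a line scan. The function skip_frontmatter is a hypothetical stand-in for _extract_content_skip_frontmatter, and two behaviors are assumptions of the sketch: text with no closing delimiter is returned unchanged, and blank lines directly after the closing delimiter are dropped.

```python
def skip_frontmatter(file_content: str) -> str:
    """Return the text after a leading ----delimited block, if one is present."""
    lines = file_content.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return file_content  # no frontmatter at the start of the file

    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            # Everything after the closing delimiter is body content.
            return "".join(lines[index + 1:]).lstrip("\n")

    # Assumption: without a closing delimiter there is nothing safe to skip.
    return file_content
```

Because the block is skipped without being parsed, invalid YAML inside it never reaches the parser.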

_sort_all_sections
def _sort_all_sections(self) -> None

Sort all sections and their children by weight.

This recursively sorts:

  • Pages within each section
  • Subsections within each section

Called after content discovery is complete.

_sort_section_recursive
def _sort_section_recursive(self, section: Section) -> None

Recursively sort a section and all its subsections.

Parameters 1
section Section

Section to sort
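The recursive weight sort can be sketched with stand-in types. DemoPage, DemoSection, and the two functions are hypothetical and much simpler than the real Page and Section classes; the default weight of 0 and the alphabetical tie-break are assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class DemoPage:
    title: str
    weight: int = 0  # assumed default for the sketch


@dataclass
class DemoSection:
    name: str
    weight: int = 0
    pages: list[DemoPage] = field(default_factory=list)
    subsections: list["DemoSection"] = field(default_factory=list)


def sort_section_recursive(section: DemoSection) -> None:
    """Sort pages and subsections in place, then recurse into each subsection."""
    # Lower weight sorts first; names break ties so the order is deterministic.
    section.pages.sort(key=lambda p: (p.weight, p.title.lower()))
    section.subsections.sort(key=lambda s: (s.weight, s.name.lower()))
    for child in section.subsections:
        sort_section_recursive(child)


def sort_all_sections(top_level: list[DemoSection]) -> None:
    """Sort the top-level list itself, then every section beneath it."""
    top_level.sort(key=lambda s: (s.weight, s.name.lower()))
    for section in top_level:
        sort_section_recursive(section)
```

Sorting once after discovery, rather than on every insertion, keeps the walk simple and matches the note that sorting runs when discovery is complete.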