Module

analysis.knowledge_graph

Knowledge Graph Analysis for Bengal SSG.

Analyzes page connectivity, identifies hubs and leaves, finds orphaned pages, and provides insights for optimization and content strategy.

Classes

GraphMetrics dataclass

Metrics about the knowledge graph structure.

Attributes

Name              Type   Description
total_pages       int    Total number of pages analyzed
total_links       int    Total number of links between pages
avg_connectivity  float  Average connectivity score per page
hub_count         int    Number of hub pages (highly connected)
leaf_count        int    Number of leaf pages (low connectivity)
orphan_count      int    Number of orphaned pages (no connections at all)

PageConnectivity dataclass

Connectivity information for a single page.

Attributes

Name                Type  Description
page                Page  The page object
incoming_refs       int   Number of incoming references
outgoing_refs       int   Number of outgoing references
connectivity_score  int   Total connectivity (incoming + outgoing)
is_hub              bool  True if page has many incoming references
is_leaf             bool  True if page has few connections
is_orphan           bool  True if page has no connections at all
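
A minimal sketch of how the three flags could follow from the two reference counts. The `Flags` container and `classify` helper are hypothetical and not part of Bengal; the default thresholds mirror the `hub_threshold` and `leaf_threshold` defaults documented for `KnowledgeGraph.__init__`, and the exact comparison operators are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    """Hypothetical stand-in for the derived fields of PageConnectivity."""

    connectivity_score: int
    is_hub: bool
    is_leaf: bool
    is_orphan: bool


def classify(incoming_refs: int, outgoing_refs: int,
             hub_threshold: int = 10, leaf_threshold: int = 2) -> Flags:
    # Total connectivity is documented as incoming + outgoing.
    score = incoming_refs + outgoing_refs
    return Flags(
        connectivity_score=score,
        # Assumption: "minimum incoming refs" means an inclusive comparison.
        is_hub=incoming_refs >= hub_threshold,
        # Assumption: "maximum connectivity" means an inclusive comparison.
        is_leaf=score <= leaf_threshold,
        is_orphan=score == 0,
    )


popular = classify(incoming_refs=12, outgoing_refs=3)
forgotten = classify(incoming_refs=0, outgoing_refs=0)
```

Under these assumptions an orphan also satisfies the leaf condition; whether Bengal treats the two flags as mutually exclusive is not stated on this page.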

KnowledgeGraph

Analyzes the connectivity structure of a Bengal site.

Builds a graph of all pages and their connections through:

  • Internal links (cross-references)
  • Taxonomies (tags, categories)
  • Related posts
  • Menu items

Provides insights for:

  • Content strategy (find orphaned pages)
  • Performance optimization (hub-first streaming)
  • Navigation design (understand structure)
  • SEO improvements (link structure)

Methods 26

build
def build(self) -> None

Build the knowledge graph by analyzing all page connections.

This analyzes:

  1. Cross-references (internal links between pages)
  2. Taxonomy references (pages grouped by tags/categories)
  3. Related posts (pre-computed relationships)
  4. Menu items (navigation references)

Call this before using any analysis methods.

get_analysis_pages
def get_analysis_pages(self) -> list[Page]

Get list of pages to analyze, excluding autodoc pages if configured.

Returns

list[Page]

List of pages to include in graph analysis

get_connectivity
def get_connectivity(self, page: Page) -> PageConnectivity

Get connectivity information for a specific page.

Parameters 1
page Page

Page to analyze

Returns

PageConnectivity

PageConnectivity with detailed metrics

get_hubs
def get_hubs(self, threshold: int | None = None) -> list[Page]

Get hub pages (highly connected pages).

Hubs are pages with many incoming references. These are typically:

  • Index pages
  • Popular articles
  • Core documentation

Parameters 1
threshold int | None

Minimum incoming refs (defaults to self.hub_threshold)

Returns

list[Page]

List of hub pages sorted by incoming references (descending)

get_leaves
def get_leaves(self, threshold: int | None = None) -> list[Page]

Get leaf pages (low connectivity pages).

Leaves are pages with few connections. These are typically:

  • One-off blog posts
  • Changelog entries
  • Niche content

Parameters 1
threshold int | None

Maximum connectivity (defaults to self.leaf_threshold)

Returns

list[Page]

List of leaf pages sorted by connectivity (ascending)

get_orphans
def get_orphans(self) -> list[Page]

Get orphaned pages (no connections at all).

Orphans are pages with no incoming or outgoing references. These might be:

  • Forgotten content
  • Draft pages
  • Pages that should be linked from navigation

Returns

list[Page]

List of orphaned pages sorted by slug

get_connectivity_report
def get_connectivity_report(self, thresholds: dict[str, float] | None = None, weights: dict[LinkType, float] | None = None) -> ConnectivityReport

Get comprehensive connectivity report with pages grouped by level.

Uses weighted scoring based on semantic link types to provide nuanced analysis beyond binary orphan detection.

Parameters 2
thresholds dict[str, float] | None

Custom thresholds for connectivity levels. Defaults to DEFAULT_THRESHOLDS.

weights dict[LinkType, float] | None

Custom weights for link types. Defaults to DEFAULT_WEIGHTS.

Returns

ConnectivityReport

ConnectivityReport with pages grouped by level and statistics.
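
To make the idea of weighted scoring concrete, here is a rough sketch. Everything in it is illustrative: the `LinkType` members, the level names, and the threshold values are assumptions, not Bengal's `DEFAULT_WEIGHTS` or `DEFAULT_THRESHOLDS`. Only the 0.5 (section hierarchy) and 0.25 (next/prev navigation) weights are taken from this page.

```python
from enum import Enum


class LinkType(Enum):
    """Hypothetical link types; Bengal's enum may use different members."""

    EXPLICIT = "explicit"
    MENU = "menu"
    TAXONOMY = "taxonomy"
    SECTION = "section"
    NAVIGATION = "navigation"


# SECTION and NAVIGATION weights are documented on this page;
# the other three values are placeholders.
WEIGHTS = {
    LinkType.EXPLICIT: 1.0,
    LinkType.MENU: 1.0,
    LinkType.TAXONOMY: 1.0,
    LinkType.SECTION: 0.5,
    LinkType.NAVIGATION: 0.25,
}

# Assumed level names and minimum scores.
THRESHOLDS = {"well_connected": 2.0, "adequately_linked": 1.0, "lightly_linked": 0.25}


def weighted_score(incoming: dict[LinkType, int],
                   weights: dict[LinkType, float] = WEIGHTS) -> float:
    """Sum the incoming link counts, each scaled by its link-type weight."""
    return sum(weights[link_type] * count for link_type, count in incoming.items())


def level(score: float, thresholds: dict[str, float] = THRESHOLDS) -> str:
    """Return the highest level whose minimum score is met."""
    for name, minimum in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if score >= minimum:
            return name
    return "isolated"


structural_only = weighted_score({LinkType.SECTION: 1, LinkType.NAVIGATION: 2})
```

The point of the weighting is visible in `structural_only`: a page reached only through its section index and two prev/next links is not an orphan in the binary sense, yet it scores lower than a page with two editorial links.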

get_page_link_metrics
def get_page_link_metrics(self, page: Page) -> LinkMetrics

Get detailed link metrics for a specific page.

Parameters 1
page Page

Page to get metrics for

Returns

LinkMetrics

LinkMetrics with breakdown by link type

get_connectivity_score
def get_connectivity_score(self, page: Page) -> int

Get total connectivity score for a page.

Connectivity = incoming_refs + outgoing_refs

Parameters 1
page Page

Page to analyze

Returns

int

Connectivity score (higher = more connected)

get_layers
def get_layers(self) -> PageLayers

Partition pages into three layers by connectivity.

Layers enable hub-first streaming builds:

  • Layer 0 (Hubs): High connectivity, process first, keep in memory
  • Layer 1 (Mid-tier): Medium connectivity, batch processing
  • Layer 2 (Leaves): Low connectivity, stream and release

Returns

PageLayers

PageLayers dataclass with hubs, mid_tier, and leaves attributes (supports tuple unpacking for backward compatibility)
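
The three-way split and the tuple unpacking can be sketched as follows. `Layers` is a hypothetical stand-in built on `NamedTuple` (the real `PageLayers` is described as a dataclass), pages are represented by plain strings, and the cut-offs reuse the constructor's default thresholds as an assumption; the actual partitioning rule is not documented here.

```python
from typing import NamedTuple


class Layers(NamedTuple):
    """Hypothetical stand-in for PageLayers; unpacks as (hubs, mid_tier, leaves)."""

    hubs: list[str]
    mid_tier: list[str]
    leaves: list[str]


def partition(scores: dict[str, int],
              hub_threshold: int = 10, leaf_threshold: int = 2) -> Layers:
    # Most connected first, ties broken by name for deterministic output.
    ordered = sorted(scores, key=lambda page: (-scores[page], page))
    hubs = [p for p in ordered if scores[p] >= hub_threshold]
    leaves = [p for p in ordered if scores[p] <= leaf_threshold]
    mid_tier = [p for p in ordered if leaf_threshold < scores[p] < hub_threshold]
    return Layers(hubs, mid_tier, leaves)


scores = {"index": 25, "guide": 6, "faq": 4, "changelog": 1}
hubs, mid_tier, leaves = partition(scores)  # tuple unpacking
layers = partition(scores)                  # attribute access
```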

get_metrics
def get_metrics(self) -> GraphMetrics

Get overall graph metrics.

Returns

GraphMetrics

GraphMetrics with summary statistics

format_stats
def format_stats(self) -> str

Format graph statistics as a human-readable string.

Returns

str

Formatted statistics string

get_actionable_recommendations
def get_actionable_recommendations(self) -> list[str]

Generate actionable recommendations for improving site structure.

Returns

list[str]

List of recommendation strings with emoji prefixes

get_seo_insights
def get_seo_insights(self) -> list[str]

Generate SEO-focused insights about site structure.

Returns

list[str]

List of SEO insight strings with emoji prefixes

get_content_gaps
def get_content_gaps(self) -> list[str]

Identify content gaps based on link structure and taxonomies.

Returns

list[str]

List of content gap descriptions

compute_pagerank
def compute_pagerank(self, damping: float = 0.85, max_iterations: int = 100, force_recompute: bool = False) -> PageRankResults

Compute PageRank scores for all pages in the graph.

PageRank assigns importance scores based on link structure. Pages that are linked to by many important pages get high scores.

Parameters 3
damping float

Probability of following a link rather than making a random jump (default 0.85)

max_iterations int

Maximum iterations before stopping (default 100)

force_recompute bool

If True, recompute even if cached

Returns

PageRankResults

PageRankResults with scores and metadata
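
For readers unfamiliar with the algorithm, the following is a generic power-iteration PageRank over a plain adjacency dict. It is a textbook sketch, not Bengal's implementation: the `pagerank` function, the `tolerance` parameter, and the handling of pages without outgoing links (their rank is spread evenly) are assumptions. Only `damping` and `max_iterations` correspond to the documented parameters.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             max_iterations: int = 100, tolerance: float = 1e-9) -> dict[str, float]:
    """Power iteration; every link target must also appear as a key in `links`."""
    pages = list(links)
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iterations):
        # Rank held by pages with no outgoing links is redistributed evenly.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tolerance:  # converged before max_iterations
            break
    return rank


links = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home"],
    "orphan": [],
}
scores = pagerank(links)
```

The scores always sum to 1, so each value can be read as the share of attention a random reader would give that page.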

compute_personalized_pagerank
def compute_personalized_pagerank(self, seed_pages: set[Page], damping: float = 0.85, max_iterations: int = 100) -> PageRankResults

Compute personalized PageRank from seed pages.

Personalized PageRank biases random jumps toward seed pages, useful for finding pages related to a specific topic or set of pages.

Parameters 3
seed_pages set[Page]

Set of pages to bias toward

damping float

Probability of following a link rather than making a random jump

max_iterations int

Maximum iterations before stopping

Returns

PageRankResults

PageRankResults with personalized scores
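
The difference from ordinary PageRank is only where the random jump lands. A minimal sketch, assuming pages are plain strings, every seed appears in the graph, and a fixed number of iterations is run; `personalized_pagerank` here is an illustrative function, not the Bengal method.

```python
def personalized_pagerank(links: dict[str, list[str]], seed_pages: set[str],
                          damping: float = 0.85,
                          max_iterations: int = 100) -> dict[str, float]:
    """Random jumps (and rank from dead ends) return to the seed pages only."""
    if not seed_pages:
        raise ValueError("seed_pages must not be empty")
    pages = list(links)
    # Teleport distribution: uniform over seeds, zero everywhere else.
    teleport = {p: (1.0 / len(seed_pages) if p in seed_pages else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(max_iterations):
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping + damping * dangling) * teleport[p] for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


links = {
    "python": ["tutorial"],
    "tutorial": ["python"],
    "rust": ["cargo"],
    "cargo": ["rust"],
}
topic_scores = personalized_pagerank(links, seed_pages={"python"})
```

Pages that cannot be reached from the seeds, such as `rust` and `cargo` in this example, end up with a score of zero, which is what makes the result useful as a "related to this topic" ranking.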

get_top_pages_by_pagerank
def get_top_pages_by_pagerank(self, limit: int = 20) -> list[tuple[Page, float]]

Get top-ranked pages by PageRank score.

Automatically computes PageRank if not already computed.

Parameters 1
limit int

Number of pages to return

Returns

list[tuple[Page, float]]

List of (page, score) tuples sorted by score descending

get_pagerank_score
def get_pagerank_score(self, page: Page) -> float

Get PageRank score for a specific page.

Automatically computes PageRank if not already computed.

Parameters 1
page Page

Page to get score for

Returns

float

PageRank score (0.0 if page not found)

detect_communities
def detect_communities(self, resolution: float = 1.0, random_seed: int | None = None, force_recompute: bool = False) -> CommunityDetectionResults

Detect topical communities using Louvain method.

Discovers natural clusters of related pages based on link structure. Communities represent topic areas or content groups.

Parameters 3
resolution float

Resolution parameter (higher = more communities, default 1.0)

random_seed int | None

Random seed for reproducibility

force_recompute bool

If True, recompute even if cached

Returns

CommunityDetectionResults

CommunityDetectionResults with discovered communities
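
The Louvain method searches for the partition that maximizes modularity. The sketch below does not implement Louvain itself; it only computes the modularity of a given partition on an undirected, unweighted graph, which shows what the `resolution` parameter scales. The `modularity` function and the example graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges: list[tuple[str, str]], community_of: dict[str, int],
               resolution: float = 1.0) -> float:
    """Q = sum over communities of  L_c / m  -  resolution * (d_c / 2m)^2."""
    m = len(edges)
    degree: dict[str, int] = defaultdict(int)
    internal: dict[int, int] = defaultdict(int)  # L_c: edges inside community c
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += 1
    total_degree: dict[int, int] = defaultdict(int)  # d_c: summed degrees in c
    for node, d in degree.items():
        total_degree[community_of[node]] += d
    return sum(
        internal[c] / m - resolution * (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge edge (c - d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the two triangles scores higher than lumping all six pages together. Because `resolution` multiplies the penalty on a community's total degree, raising it makes large communities more costly, which is consistent with the "higher = more communities" note above.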

get_community_for_page
def get_community_for_page(self, page: Page) -> int | None

Get community ID for a specific page.

Automatically detects communities if not already computed.

Parameters 1
page Page

Page to get community for

Returns

int | None

Community ID or None if page not found

analyze_paths
def analyze_paths(self, force_recompute: bool = False, k_pivots: int = 100, seed: int = 42, auto_approximate_threshold: int = 500) -> PathAnalysisResults

Analyze navigation paths and centrality metrics.

Computes:

  • Betweenness centrality: Pages that act as bridges
  • Closeness centrality: Pages that are easily accessible
  • Network diameter and average path length

For large sites (more than auto_approximate_threshold pages), uses a pivot-based approximation with O(k*N) complexity instead of O(N²), where k is the number of pivots and N the number of pages.

Parameters 4
force_recompute bool

If True, recompute even if cached

k_pivots int

Number of pivot nodes for approximation (default: 100)

seed int

Random seed for deterministic results (default: 42)

auto_approximate_threshold int

Use the exact computation when the page count is at or below this value (default: 500)

Returns

PathAnalysisResults

PathAnalysisResults with centrality metrics

get_betweenness_centrality
def get_betweenness_centrality(self, page: Page) -> float

Get betweenness centrality for a specific page.

Automatically analyzes paths if not already computed.

Parameters 1
page Page

Page to get centrality for

Returns

float

Betweenness centrality score

get_closeness_centrality
def get_closeness_centrality(self, page: Page) -> float

Get closeness centrality for a specific page.

Automatically analyzes paths if not already computed.

Parameters 1
page Page

Page to get centrality for

Returns

float

Closeness centrality score

suggest_links
def suggest_links(self, min_score: float = 0.3, max_suggestions_per_page: int = 10, force_recompute: bool = False) -> LinkSuggestionResults

Generate smart link suggestions to improve site connectivity.

Uses multiple signals:

  • Topic similarity (shared tags/categories)
  • PageRank importance
  • Betweenness centrality (bridge pages)
  • Link gaps (underlinked content)

Parameters 3
min_score float

Minimum score threshold for suggestions

max_suggestions_per_page int

Maximum suggestions per page

force_recompute bool

If True, recompute even if cached

Returns

LinkSuggestionResults

LinkSuggestionResults with all suggestions
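
As an illustration of the first signal only (topic similarity), the sketch below scores candidate links by the Jaccard overlap of two pages' tags and applies the `min_score` and `max_suggestions_per_page` cut-offs. The `suggest` function, the use of Jaccard similarity, and the string page names are assumptions; how Bengal combines this with PageRank, betweenness, and link-gap signals is not described here.

```python
def suggest(tags: dict[str, set[str]], existing: set[tuple[str, str]],
            min_score: float = 0.3,
            max_suggestions_per_page: int = 10) -> dict[str, list[tuple[str, float]]]:
    """Suggest (target, score) pairs for pages that share tags but are not yet linked."""
    suggestions: dict[str, list[tuple[str, float]]] = {}
    for source, source_tags in tags.items():
        scored = []
        for target, target_tags in tags.items():
            if target == source or (source, target) in existing:
                continue  # skip self-links and links that already exist
            union = source_tags | target_tags
            score = len(source_tags & target_tags) / len(union) if union else 0.0
            if score >= min_score:
                scored.append((target, score))
        # Best score first, ties broken by name for deterministic output.
        scored.sort(key=lambda item: (-item[1], item[0]))
        suggestions[source] = scored[:max_suggestions_per_page]
    return suggestions


tags = {
    "intro": {"python", "basics"},
    "advanced": {"python", "async"},
    "recipes": {"cooking"},
}
existing_links = {("intro", "advanced")}
result = suggest(tags, existing_links)
```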

get_suggestions_for_page
def get_suggestions_for_page(self, page: Page, limit: int = 10) -> list[tuple[Page, float, list[str]]]

Get link suggestions for a specific page.

Automatically generates suggestions if not already computed.

Parameters 2
page Page

Page to get suggestions for

limit int

Maximum number of suggestions

Returns

list[tuple[Page, float, list[str]]]

List of (target_page, score, reasons) tuples

Internal Methods 11

__init__
def __init__(self, site: Site, hub_threshold: int = 10, leaf_threshold: int = 2, exclude_autodoc: bool = True)

Initialize knowledge graph analyzer.

Parameters 4
site Site

Site instance to analyze

hub_threshold int

Minimum incoming refs to be considered a hub

leaf_threshold int

Maximum connectivity to be considered a leaf

exclude_autodoc bool

If True, exclude autodoc/API reference pages from analysis (default: True)

_ensure_links_extracted
def _ensure_links_extracted(self) -> None

Extract links from all pages if not already extracted.

Links are normally extracted during rendering, but graph analysis needs them before rendering happens. This ensures links are available.

_analyze_cross_references
def _analyze_cross_references(self) -> None

Analyze cross-references (internal links between pages).

Uses the site's xref_index to find all internal links. Only analyzes links from/to pages included in analysis (excludes autodoc).

_resolve_link
def _resolve_link(self, link: str) -> Page | None

Resolve a link string to a target page.

Parameters 1
link str

Link string (path, slug, or ID)

Returns

Page | None

Target page or None if not found

_analyze_taxonomies
def _analyze_taxonomies(self) -> None

Analyze taxonomy references (pages grouped by tags/categories).

Pages in the same taxonomy group reference each other implicitly. Only includes pages in analysis (excludes autodoc).
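
One possible reading of "reference each other implicitly" is that every pair of pages sharing a term gets an edge in both directions. The sketch below models that reading; it is an assumption, as are the `taxonomy_edges` helper and the use of plain strings for pages. Bengal may instead record these references through the taxonomy term page.

```python
from itertools import permutations


def taxonomy_edges(groups: dict[str, list[str]],
                   included: set[str]) -> set[tuple[str, str]]:
    """Build directed (source, target) edges between pages that share a term."""
    edges: set[tuple[str, str]] = set()
    for members in groups.values():
        # Drop pages excluded from analysis (for example autodoc pages).
        kept = [page for page in members if page in included]
        edges.update(permutations(kept, 2))
    return edges


groups = {
    "python": ["guide", "tutorial", "api/reference"],
    "rust": ["tutorial", "porting"],
}
analysis_pages = {"guide", "tutorial", "porting"}
implicit = taxonomy_edges(groups, analysis_pages)
```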

_analyze_related_posts
def _analyze_related_posts(self) -> None

Analyze related posts (pre-computed relationships).

Related posts are pages that share tags or other criteria. Only includes pages in analysis (excludes autodoc).

_analyze_menus
def _analyze_menus(self) -> None

Analyze menu items (navigation references).

Pages in menus get a significant boost in importance. Only includes pages in analysis (excludes autodoc).

_analyze_section_hierarchy
def _analyze_section_hierarchy(self) -> None

Analyze implicit section links (parent _index.md → children).

Section index pages implicitly link to all child pages in their directory. This represents topical containment—the parent page defines the topic, children belong to that topic.

Weight: 0.5 (structural but semantically meaningful)

_analyze_navigation_links
def _analyze_navigation_links(self) -> None

Analyze next/prev sequential relationships.

Pages in a section often have prev/next relationships representing a reading order or logical sequence (e.g., tutorial steps, changelogs).

Weight: 0.25 (pure navigation, lowest editorial intent)

_build_link_metrics
def _build_link_metrics(self) -> None

Build detailed link metrics for each page.

Aggregates links by type into LinkMetrics objects for weighted connectivity scoring.

_compute_metrics
def _compute_metrics(self) -> GraphMetrics

Compute overall graph metrics.

Returns

GraphMetrics

GraphMetrics with summary statistics