Module

analysis.path_analysis

Path Analysis for Bengal SSG.

Implements algorithms for understanding navigation patterns and page accessibility:

  • Shortest paths between pages (BFS-based)
  • Betweenness centrality (identifies bridge pages)
  • Closeness centrality (measures accessibility)

These metrics help optimize navigation structure and identify critical pages.

For large sites (>500 pages), pivot-based approximation is used automatically to maintain O(k*N) complexity instead of O(N²).

References:

  • Brandes, U. (2001). A faster algorithm for betweenness centrality. Journal of Mathematical Sociology.
  • Bader, D. A., et al. (2007). Approximating betweenness centrality. Algorithms and Models for the Web-Graph.

Classes

PathAnalysisResults dataclass
Results from path analysis and centrality computations. Contains centrality metrics that identify …
4

Results from path analysis and centrality computations.

Contains centrality metrics that identify important pages in the site's link structure. High betweenness indicates bridge pages, high closeness indicates easily accessible pages.

Attributes

Name Type Description
betweenness_centrality dict[Page, float]

Map of pages to betweenness scores (0.0-1.0)

closeness_centrality dict[Page, float]

Map of pages to closeness scores (0.0-1.0)

avg_path_length float

Average shortest path length between all page pairs

diameter int

Network diameter (longest shortest path)

is_approximate bool

True if approximation was used (for large sites)

pivots_used int

Number of pivot nodes used (if approximate)

Methods 4

get_top_bridges
Get pages with highest betweenness centrality (bridge pages).
1 list[tuple[Page, float]]
def get_top_bridges(self, limit: int = 20) -> list[tuple[Page, float]]

Get pages with highest betweenness centrality (bridge pages).

Parameters 1
limit int

Number of pages to return

Returns

list[tuple[Page, float]]

List of (page, centrality) tuples sorted by centrality descending

get_most_accessible
Get most accessible pages (highest closeness centrality).
1 list[tuple[Page, float]]
def get_most_accessible(self, limit: int = 20) -> list[tuple[Page, float]]

Get most accessible pages (highest closeness centrality).

Parameters 1
limit int

Number of pages to return

Returns

list[tuple[Page, float]]

List of (page, centrality) tuples sorted by centrality descending

get_betweenness
Get betweenness centrality for specific page.
1 float
def get_betweenness(self, page: Page) -> float

Get betweenness centrality for specific page.

Parameters 1
page Page
Returns

float

get_closeness
Get closeness centrality for specific page.
1 float
def get_closeness(self, page: Page) -> float

Get closeness centrality for specific page.

Parameters 1
page Page
Returns

float

PathSearchResult dataclass
Result from find_all_paths including metadata about the search.
0

Result from find_all_paths including metadata about the search.

Attributes

Name Type Description
paths list[list[Page]]

List of paths found (each path is a list of pages)

complete bool

True if search completed, False if terminated early

termination_reason str | None

Reason for early termination (if any)

PathAnalyzer
Analyze navigation paths and page accessibility. Computes centrality metrics to identify: - Bridge…
8

Analyze navigation paths and page accessibility.

Computes centrality metrics to identify:

  • Bridge pages (high betweenness): Pages that connect different parts of the site
  • Accessible pages (high closeness): Pages that are easy to reach from anywhere
  • Navigation bottlenecks: Critical pages for site navigation

For large sites (>500 pages by default), uses pivot-based approximation to achieve O(k*N) complexity instead of O(N²). This provides ~100x speedup for 10k page sites while maintaining accurate relative rankings.

Methods 4

find_shortest_path
Find shortest path between two pages using BFS.
2 list[Page] | None
def find_shortest_path(self, source: Page, target: Page) -> list[Page] | None

Find shortest path between two pages using BFS.

Parameters 2
source Page

Starting page

target Page

Destination page

Returns

list[Page] | None

List of pages representing the path, or None if no path exists

analyze
Compute path-based centrality metrics. Computes: - Betweenness centrality: How…
1 PathAnalysisResults
def analyze(self, progress_callback: ProgressCallback | None = None) -> PathAnalysisResults

Compute path-based centrality metrics.

Computes:

  • Betweenness centrality: How often a page appears on shortest paths
  • Closeness centrality: How close a page is to all other pages
  • Network diameter: Longest shortest path
  • Average path length

For large sites, automatically uses pivot-based approximation for O(k*N) complexity instead of O(N²).

Parameters 1
progress_callback ProgressCallback | None

Optional callback for progress updates. Called as callback(current, total, phase) where phase is "betweenness" or "closeness".

Returns

PathAnalysisResults

PathAnalysisResults with all metrics

find_all_paths
Find all simple paths between two pages with safety limits. This operation can…
5 PathSearchResult
def find_all_paths(self, source: Page, target: Page, max_length: int = 10, max_paths: int = 1000, timeout_seconds: float | None = 30.0) -> PathSearchResult

Find all simple paths between two pages with safety limits.

This operation can be expensive for highly connected graphs.

Safety limits prevent runaway computation:

  • max_length: Maximum path length to consider
  • max_paths: Stop after finding this many paths
  • timeout_seconds: Stop after this many seconds
Parameters 5
source Page

Starting page

target Page

Destination page

max_length int

Maximum path length to search (default: 10)

max_paths int

Maximum number of paths to find (default: 1000)

timeout_seconds float | None

Maximum time in seconds (default: 30.0, None for no limit)

Returns

PathSearchResult

PathSearchResult with paths found and completion status

find_all_paths_simple
Find all simple paths between two pages (legacy API). This is the original API…
3 list[list[Page]]
def find_all_paths_simple(self, source: Page, target: Page, max_length: int = 10) -> list[list[Page]]

Find all simple paths between two pages (legacy API).

This is the original API preserved for backward compatibility. For new code, prefer find_all_paths() which includes safety limits.

Parameters 3
source Page

Starting page

target Page

Destination page

max_length int

Maximum path length to search

Returns

list[list[Page]]

List of paths (each path is a list of pages)

Internal Methods 4
__init__
Initialize path analyzer.
4 None
def __init__(self, graph: KnowledgeGraph, k_pivots: int = DEFAULT_K_PIVOTS, seed: int = DEFAULT_SEED, auto_approximate_threshold: int = DEFAULT_AUTO_THRESHOLD)

Initialize path analyzer.

Parameters 4
graph KnowledgeGraph

KnowledgeGraph with page connections

k_pivots int

Number of pivot nodes to use for approximation (default: 100)

seed int

Random seed for deterministic pivot selection (default: 42)

auto_approximate_threshold int

Use exact algorithm if page count <= this (default: 500)

_compute_betweenness_centrality
Compute betweenness centrality using Brandes' algorithm. Betweenness measures …
3 dict[Page, float]
def _compute_betweenness_centrality(self, pages: list[Page], use_approximate: bool, progress_callback: ProgressCallback | None = None) -> dict[Page, float]

Compute betweenness centrality using Brandes' algorithm.

Betweenness measures how often a page appears on shortest paths between other pages. High betweenness indicates a "bridge" page.

For large sites, uses pivot-based approximation: only compute from k randomly selected source nodes and scale results.

Parameters 3
pages list[Page]

List of pages to analyze

use_approximate bool

If True, use pivot-based approximation

progress_callback ProgressCallback | None

Optional progress callback

Returns

dict[Page, float]

Dictionary mapping pages to betweenness centrality scores

_compute_closeness_centrality
Compute closeness centrality and network metrics. Closeness measures how close…
3 tuple[dict[Page, fl…
def _compute_closeness_centrality(self, pages: list[Page], use_approximate: bool, progress_callback: ProgressCallback | None = None) -> tuple[dict[Page, float], float, int]

Compute closeness centrality and network metrics.

Closeness measures how close a page is to all other pages. Higher closeness = more accessible.

For large sites, uses pivot-based approximation: only compute distances from k randomly selected source nodes.

Parameters 3
pages list[Page]

List of pages to analyze

use_approximate bool

If True, use pivot-based approximation

progress_callback ProgressCallback | None

Optional progress callback

Returns

tuple[dict[Page, float], float, int]

Tuple of (closeness_dict, avg_path_length, diameter)

_bfs_distances
Compute shortest path distances from source to all other pages.
2 dict[Page, int]
def _bfs_distances(self, source: Page, pages: list[Page]) -> dict[Page, int]

Compute shortest path distances from source to all other pages.

Parameters 2
source Page
pages list[Page]
Returns

dict[Page, int]

Functions

analyze_paths
Convenience function for path analysis.
5 PathAnalysisResults
def analyze_paths(graph: KnowledgeGraph, k_pivots: int = PathAnalyzer.DEFAULT_K_PIVOTS, seed: int = PathAnalyzer.DEFAULT_SEED, auto_approximate_threshold: int = PathAnalyzer.DEFAULT_AUTO_THRESHOLD, progress_callback: ProgressCallback | None = None) -> PathAnalysisResults

Convenience function for path analysis.

Parameters 5

Name Type Default Description
graph KnowledgeGraph

KnowledgeGraph with page connections

k_pivots int PathAnalyzer.DEFAULT_K_PIVOTS

Number of pivot nodes for approximation (default: 100)

seed int PathAnalyzer.DEFAULT_SEED

Random seed for deterministic results (default: 42)

auto_approximate_threshold int PathAnalyzer.DEFAULT_AUTO_THRESHOLD

Use exact if pages <= this (default: 500)

progress_callback ProgressCallback | None None

Optional progress callback

Returns

PathAnalysisResults

PathAnalysisResults with centrality metrics