Content Sources

Fetch content from external sources


Fetch content from GitHub, Notion, REST APIs, and more.

Do I Need This?

No. By default, Bengal reads content from local files. That works for most sites.

Use remote sources when:

  • Your docs live in multiple GitHub repos
  • Content lives in a CMS (Notion, Contentful, etc.)
  • You want to pull API docs from a separate service
  • You need to aggregate content from different teams

Quick Start

Install the loader you need:

pip install bengal[github]   # GitHub repositories
pip install bengal[notion]   # Notion databases
pip install bengal[rest]     # REST APIs
pip install bengal[all-sources]  # Everything

Update your collections.py:

from bengal.collections import define_collection, DocPage
from bengal.content.sources import github_loader

collections = {
    # Local content (default)
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
    ),

    # Remote content from GitHub
    "api-docs": define_collection(
        schema=DocPage,
        loader=github_loader(
            repo="myorg/api-docs",
            path="docs/",
        ),
    ),
}

Build as normal. Remote content is fetched, cached, and validated like local content.

Available Loaders

GitHub

Fetch markdown from any GitHub repository:

from bengal.content.sources import github_loader

loader = github_loader(
    repo="owner/repo",       # Required: "owner/repo" format
    branch="main",           # Default: "main"
    path="docs/",            # Default: "" (root)
    token=None,              # Default: uses GITHUB_TOKEN env var
    glob="*.md",             # Default: "*.md" (file pattern to match)
)

For private repos, set the GITHUB_TOKEN environment variable or pass token directly.

Notion

Fetch pages from a Notion database:

from bengal.content.sources import notion_loader

loader = notion_loader(
    database_id="abc123...",  # Required: database ID from URL
    token=None,               # Default: uses NOTION_TOKEN env var
    property_mapping={        # Map Notion properties to frontmatter
        "title": "Name",
        "date": "Published",
        "tags": "Tags",
    },
)

Setup:

  1. Create integration at notion.so/my-integrations
  2. Share your database with the integration
  3. Set the NOTION_TOKEN environment variable

REST API

Fetch from any JSON API:

from bengal.content.sources import rest_loader

loader = rest_loader(
    url="https://api.example.com/posts",
    headers={"Authorization": "Bearer ${API_TOKEN}"},  # Env vars expanded
    content_field="body",           # JSON path to content
    id_field="id",                  # JSON path to ID
    frontmatter_fields={            # Map API fields to frontmatter
        "title": "title",
        "date": "published_at",
        "tags": "categories",
    },
)
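To illustrate the mapping (a sketch, not Bengal's internal code): each frontmatter key is filled from the named API field, so a response item like the one below produces the frontmatter shown.

```python
def map_frontmatter(item: dict, frontmatter_fields: dict[str, str]) -> dict:
    """Copy named API fields into frontmatter keys (illustrative sketch)."""
    return {key: item[field] for key, field in frontmatter_fields.items()}

post = {
    "id": 42,
    "title": "Release notes",
    "published_at": "2024-06-01",
    "categories": ["changelog"],
    "body": "...",
}
map_frontmatter(post, {"title": "title", "date": "published_at", "tags": "categories"})
# → {"title": "Release notes", "date": "2024-06-01", "tags": ["changelog"]}
```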

Local (Explicit)

For consistency, you can also use an explicit local loader:

from bengal.content.sources import local_loader

loader = local_loader(
    directory="content/docs",
    glob="**/*.md",
    exclude=["_drafts/*"],
)

Caching

Remote content is cached locally to avoid repeated API calls:

# Check cache status
bengal sources status

# Force refresh from remote
bengal sources fetch --force

# Clear all cached content
bengal sources clear

Cache behavior:

  • Default TTL: 1 hour
  • Cache directory: .bengal/content_cache/
  • Automatic invalidation when config changes
  • Falls back to cache if remote unavailable

CLI Commands

# List configured content sources
bengal sources list

# Show cache status (age, size, validity)
bengal sources status

# Fetch/refresh from remote sources
bengal sources fetch
bengal sources fetch --source api-docs  # Specific source
bengal sources fetch --force            # Ignore cache

# Clear cached content
bengal sources clear
bengal sources clear --source api-docs

Environment Variables

Variable Used By Description
GITHUB_TOKEN GitHub loader Personal access token for private repos
NOTION_TOKEN Notion loader Integration token
Custom REST loader Any ${VAR} in headers is expanded
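For local builds, export the tokens before running Bengal (placeholder values shown; substitute your own):

```shell
export GITHUB_TOKEN="<your-personal-access-token>"
export NOTION_TOKEN="<your-integration-token>"
export API_TOKEN="<your-rest-api-token>"   # expanded into ${API_TOKEN} headers
```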

Multi-Repo Documentation

A common pattern for large organizations:

from bengal.collections import define_collection, DocPage
from bengal.content.sources import github_loader, local_loader

collections = {
    # Main docs (local)
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
    ),

    # API reference (from API team's repo)
    "api": define_collection(
        schema=DocPage,
        loader=github_loader(repo="myorg/api-service", path="docs/"),
    ),

    # SDK docs (from SDK repo)
    "sdk": define_collection(
        schema=DocPage,
        loader=github_loader(repo="myorg/sdk", path="docs/"),
    ),
}

Custom Loaders

Implement ContentSource for any content origin:

from collections.abc import AsyncIterator
from bengal.content.sources import ContentSource, ContentEntry

class MyCustomSource(ContentSource):
    source_type = "my-api"

    async def fetch_all(self) -> AsyncIterator[ContentEntry]:
        items = await self._get_items()
        for item in items:
            yield ContentEntry(
                id=item["id"],
                slug=item["slug"],
                content=item["body"],
                frontmatter={"title": item["title"]},
                source_type=self.source_type,
                source_name=self.name,
            )

    async def fetch_one(self, id: str) -> ContentEntry | None:
        item = await self._get_item(id)
        if not item:
            return None
        return ContentEntry(
            id=item["id"],
            slug=item["slug"],
            content=item["body"],
            frontmatter={"title": item["title"]},
            source_type=self.source_type,
            source_name=self.name,
        )
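The async-iterator contract can be exercised without Bengal installed. The sketch below uses stand-in types (a plain dataclass in place of ContentEntry, no real ContentSource base class) purely to show the pattern:

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass

@dataclass
class ContentEntry:
    """Stand-in for bengal's ContentEntry (fields trimmed for the demo)."""
    id: str
    slug: str
    content: str
    frontmatter: dict

class MyCustomSource:
    """Stand-in loader; a real one would subclass ContentSource."""
    source_type = "my-api"

    async def _get_items(self) -> list[dict]:
        # A real loader would call your API here; stubbed for the demo.
        return [{"id": "1", "slug": "hello", "body": "Hi", "title": "Hello"}]

    async def fetch_all(self) -> AsyncIterator[ContentEntry]:
        for item in await self._get_items():
            yield ContentEntry(
                id=item["id"],
                slug=item["slug"],
                content=item["body"],
                frontmatter={"title": item["title"]},
            )

async def main() -> list[ContentEntry]:
    # Collect every entry the source yields.
    return [entry async for entry in MyCustomSource().fetch_all()]

entries = asyncio.run(main())
# entries[0].slug == "hello"
```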

Zero-Cost Design

If you don't use remote sources:

  • No extra dependencies installed
  • No network calls
  • No import overhead
  • No configuration needed

Remote loaders are lazy-loaded only when you import them.

Other Content Sources

  • Autodoc — Generate API docs from Python, CLI commands, and OpenAPI specs
