Content Collections

Content collections let you define typed schemas for your content's frontmatter. Bengal validates content against these schemas during discovery, catching errors early and providing IDE autocompletion.

Quick Start

Create a collections.py file in your project root:

from dataclasses import dataclass, field
from datetime import datetime
from bengal.collections import define_collection

@dataclass
class BlogPost:
    title: str
    date: datetime
    author: str = "Anonymous"
    tags: list[str] = field(default_factory=list)
    draft: bool = False

collections = {
    "blog": define_collection(
        schema=BlogPost,
        directory="content/blog",
    ),
}

Now any file in content/blog/ must have valid frontmatter:

---
title: My First Post
date: 2025-01-15
author: Jane Doe
tags: [python, tutorial]
---

Schema Definition

Using Dataclasses

Define your schema as a Python dataclass:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DocPage:
    title: str                              # Required
    weight: int = 0                         # Optional with default
    description: str | None = None          # Optional, nullable
    tags: list[str] = field(default_factory=list)  # Mutable default

Field Types

Bengal automatically coerces frontmatter values to these types:

Type	YAML Example	Notes
`str`	`title: "Hello"`	Basic string
`int`	`weight: 10`	Integer
`float`	`rating: 4.5`	Float
`bool`	`draft: true`	Boolean
`datetime`	`date: 2025-01-15`	ISO date string
`date`	`published: 2025-01-15`	Date only
`list[str]`	`tags: [a, b, c]`	List of strings
`T | None`	`author: null`	Optional/nullable

Nested Schemas

Schemas can contain nested dataclasses:

from dataclasses import dataclass

@dataclass
class Author:
    name: str
    email: str | None = None

@dataclass
class BlogPost:
    title: str
    author: Author  # Nested schema

Frontmatter:

---
title: My Post
author:
  name: Jane Doe
  email: jane@example.com
---

Collection Configuration

define_collection Options

define_collection(
    schema=BlogPost,           # Required: dataclass or Pydantic model
    directory="content/blog",  # Directory containing content
    glob="**/*.md",            # File matching pattern (default: all .md)
    strict=True,               # Reject unknown fields (default: True)
    allow_extra=False,         # Store extra fields in _extra (default: False)
    loader=None,               # Custom content source
)

Strict vs Lenient Mode

Strict mode (default) rejects content with unknown frontmatter fields:

# Strict: unknown fields cause validation errors
collections = {
    "docs": define_collection(schema=DocPage, directory="content/docs", strict=True),
}

Lenient mode allows extra fields:

# Lenient: extra fields are ignored or stored
collections = {
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
        strict=False,
        allow_extra=True,  # Store extras in _extra dict
    ),
}
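The difference between the two modes comes down to what happens to keys that the schema does not declare. The sketch below shows one plausible reading, in which `strict=True` takes precedence and `allow_extra` only matters in lenient mode. That precedence, and the `split_fields` helper itself, are assumptions for illustration rather than documented Bengal behavior.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class DocPage:  # simplified stand-in schema for this example
    title: str
    weight: int = 0


def split_fields(
    schema: type,
    frontmatter: dict[str, Any],
    *,
    strict: bool = True,
    allow_extra: bool = False,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (known, extra); raise on unknown keys when strict."""
    names = {f.name for f in fields(schema)}
    known = {k: v for k, v in frontmatter.items() if k in names}
    unknown = {k: v for k, v in frontmatter.items() if k not in names}
    if unknown and strict:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    # Lenient mode: keep the extras only if asked to, otherwise drop them.
    extra = unknown if allow_extra else {}
    return known, extra


frontmatter = {"title": "Intro", "weight": 5, "banner": "hero.png"}
known, extra = split_fields(DocPage, frontmatter, strict=False, allow_extra=True)
```

Here `banner` ends up in `extra`; with the defaults the same call would raise instead.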

Built-in Schemas

Bengal provides ready-to-use schemas for common content types:

from bengal.collections import define_collection
from bengal.collections.schemas import (
    BlogPost,      # title, date, author, tags, draft, description, image, excerpt
    DocPage,       # title, weight, category, tags, toc, description, deprecated, since
    APIReference,  # title, endpoint, method, version, auth_required, rate_limit
    Tutorial,      # title, difficulty, duration, prerequisites, series, order
    Changelog,     # title, date, version, breaking, draft, summary
)

collections = {
    "blog": define_collection(schema=BlogPost, directory="content/blog"),
    "docs": define_collection(schema=DocPage, directory="content/docs"),
}

Extending Built-in Schemas

Add custom fields by subclassing:

from dataclasses import dataclass
from bengal.collections.schemas import BlogPost

@dataclass
class MyBlogPost(BlogPost):
    """Extended blog post with custom fields."""
    series: str | None = None
    reading_time: int | None = None
    featured: bool = False

Validation Errors

When content fails validation, Bengal reports detailed errors:

Content validation failed: content/blog/my-post.md (collection: blog)
  └─ title: Required field 'title' is missing
  └─ date: Cannot parse 'January 15' as datetime
  └─ author.email: Invalid value for type 'str'

Validation Result

The validator returns a ValidationResult:

from bengal.collections import SchemaValidator, ValidationResult

validator = SchemaValidator(BlogPost)
result: ValidationResult = validator.validate(frontmatter_dict)

if result.valid:
    post: BlogPost = result.data  # Typed instance
else:
    for error in result.errors:
        print(f"{error.field}: {error.message}")
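For readers who want to see how such a result could be produced, here is a small self-contained sketch. The classes `FieldError` and `Result` are hypothetical stand-ins that mirror only the attributes used above (`valid`, `data`, `errors`, `field`, `message`); they are not Bengal's classes, and the check covers missing required fields only.

```python
from dataclasses import MISSING, dataclass, field, fields
from typing import Any


@dataclass
class FieldError:  # hypothetical stand-in for a validation error
    field: str
    message: str


@dataclass
class Result:  # hypothetical stand-in for a validation result
    valid: bool
    data: Any = None
    errors: list[FieldError] = field(default_factory=list)


@dataclass
class Post:
    title: str
    author: str = "Anonymous"


def check_required(schema: type, frontmatter: dict[str, Any]) -> Result:
    """Report every required field (no default) missing from frontmatter."""
    errors = [
        FieldError(f.name, f"Required field '{f.name}' is missing")
        for f in fields(schema)
        if f.default is MISSING
        and f.default_factory is MISSING
        and f.name not in frontmatter
    ]
    if errors:
        return Result(valid=False, errors=errors)
    # Assumes frontmatter holds only keys the schema declares.
    return Result(valid=True, data=schema(**frontmatter))


def format_report(path: str, collection: str, result: Result) -> str:
    """Render errors in a tree layout similar to the report shown earlier."""
    lines = [f"Content validation failed: {path} (collection: {collection})"]
    lines += [f"  └─ {e.field}: {e.message}" for e in result.errors]
    return "\n".join(lines)


bad = check_required(Post, {"author": "Jane Doe"})
report = format_report("content/blog/my-post.md", "blog", bad)
```

Collecting all errors before returning, rather than stopping at the first one, is what lets a single build report every problem in a file at once.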

Using with Remote Sources

Collections work with remote content sources:

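from bengal.collections import define_collection
from bengal.collections.schemas import APIReference, DocPage
from bengal.content.sources import github_loader, notion_loader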

collections = {
    # Local content
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
    ),

    # GitHub repository
    "api-docs": define_collection(
        schema=APIReference,
        loader=github_loader(repo="myorg/api-docs", path="docs/"),
    ),

    # Notion database
    "wiki": define_collection(
        schema=WikiPage,  # Your own schema; not one of the built-ins listed earlier
        loader=notion_loader(database_id="abc123"),
    ),
}

IDE Support

With typed collections, your IDE provides:

Autocompletion for frontmatter fields
Type checking for field values
Go to definition for schema classes
Inline documentation from docstrings

Pydantic Support

Bengal also supports Pydantic models for advanced validation:

from pydantic import BaseModel, EmailStr, HttpUrl

class Author(BaseModel):
    name: str
    email: EmailStr
    website: HttpUrl | None = None

class BlogPost(BaseModel):
    title: str
    author: Author

    # Strict mode. In Pydantic v2, prefer: model_config = ConfigDict(extra="forbid")
    class Config:
        extra = "forbid"

Best Practices

1. Start with Built-in Schemas

Use Bengal's built-in schemas and extend as needed:

from dataclasses import dataclass
from bengal.collections.schemas import DocPage

@dataclass
class MyDocPage(DocPage):
    custom_field: str | None = None

2. Use Strict Mode in Production

Catch frontmatter errors early:

define_collection(schema=DocPage, directory="content/docs", strict=True)

3. Document Your Schemas

Add docstrings for IDE support:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class BlogPost:
    """
    Blog post content schema.

    Attributes:
        title: Post title displayed in listings and page header
        date: Publication date (ISO format: YYYY-MM-DD)
        author: Author name for byline
        tags: List of topic tags for categorization
    """
    title: str
    date: datetime
    author: str = "Anonymous"
    tags: list[str] = field(default_factory=list)

See Also

Custom Content Sources for remote content
Cheatsheet for frontmatter field quick reference