Content Collections

Define typed schemas for frontmatter validation and IDE support


Content collections let you define typed schemas for your content's frontmatter. Bengal validates content against these schemas during discovery, catching errors early and providing IDE autocompletion.

Quick Start

Create a collections.py file in your project root:

from dataclasses import dataclass, field
from datetime import datetime
from bengal.collections import define_collection

@dataclass
class BlogPost:
    title: str
    date: datetime
    author: str = "Anonymous"
    tags: list[str] = field(default_factory=list)
    draft: bool = False

collections = {
    "blog": define_collection(
        schema=BlogPost,
        directory="content/blog",
    ),
}

Now any file in content/blog/ must have valid frontmatter:

---
title: My First Post
date: 2025-01-15
author: Jane Doe
tags: [python, tutorial]
---

Schema Definition

Using Dataclasses

Define your schema as a Python dataclass:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DocPage:
    title: str                              # Required
    weight: int = 0                         # Optional with default
    description: str | None = None          # Optional, nullable
    tags: list[str] = field(default_factory=list)  # Mutable default

Field Types

Bengal automatically coerces frontmatter values to these types:

Type        YAML Example            Notes
str         title: "Hello"          Basic string
int         weight: 10              Integer
float       rating: 4.5             Float
bool        draft: true             Boolean
datetime    date: 2025-01-15        ISO date string
date        published: 2025-01-15   Date only
list[str]   tags: [a, b, c]         List of strings
T | None    author: null            Optional/nullable

Nested Schemas

Schemas can contain nested dataclasses:

from dataclasses import dataclass

@dataclass
class Author:
    name: str
    email: str | None = None

@dataclass
class BlogPost:
    title: str
    author: Author  # Nested schema

Frontmatter:

---
title: My Post
author:
  name: Jane Doe
  email: jane@example.com
---

Collection Configuration

define_collection Options

define_collection(
    schema=BlogPost,           # Required: dataclass or Pydantic model
    directory="content/blog",  # Directory containing content
    glob="**/*.md",            # File matching pattern (default: all .md)
    strict=True,               # Reject unknown fields (default: True)
    allow_extra=False,         # Store extra fields in _extra (default: False)
    loader=None,               # Custom content source
)
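The glob option decides which files under directory belong to the collection. The sketch below shows what the default pattern would select, assuming the pattern is resolved relative to directory with the same semantics as pathlib's Path.glob (an assumption about Bengal, not something confirmed here). The file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "content" / "blog"
    (root / "2025").mkdir(parents=True)

    # Two markdown files at different depths and one non-markdown file.
    (root / "first-post.md").write_text("---\ntitle: First\n---\n")
    (root / "2025" / "second-post.md").write_text("---\ntitle: Second\n---\n")
    (root / "notes.txt").write_text("not part of the collection")

    # "**/*.md" matches .md files in the directory itself and in any subdirectory.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
```

Under that assumption, a narrower pattern such as "*.md" would pick up only the top-level files and skip subdirectories.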

Strict vs Lenient Mode

Strict mode (default) rejects content with unknown frontmatter fields:

# Strict: unknown fields cause validation errors
collections = {
    "docs": define_collection(schema=DocPage, directory="content/docs", strict=True),
}

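Lenient mode (strict=False) accepts unknown fields instead of rejecting them. With allow_extra=True they are stored in an _extra dict; otherwise they are ignored: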

# Lenient: extra fields are ignored or stored
collections = {
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
        strict=False,
        allow_extra=True,  # Store extras in _extra dict
    ),
}
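The difference between the two modes can be sketched with a small helper that separates schema fields from unknown ones. This is a simplified model under stated assumptions: split_fields is a hypothetical function, the error wording is invented, and the way Bengal combines strict and allow_extra internally may differ.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class DocPage:
    title: str
    weight: int = 0


def split_fields(
    schema: type,
    frontmatter: dict[str, Any],
    strict: bool = True,
    allow_extra: bool = False,
) -> tuple[dict[str, Any], dict[str, Any], list[str]]:
    """Return (known, extra, errors) for a frontmatter dict."""
    known_names = {f.name for f in fields(schema)}
    known = {k: v for k, v in frontmatter.items() if k in known_names}
    unknown = {k: v for k, v in frontmatter.items() if k not in known_names}

    extra: dict[str, Any] = {}
    errors: list[str] = []
    if strict:
        # Strict: every unknown field is reported as an error.
        errors = [f"{name}: unknown field" for name in sorted(unknown)]
    elif allow_extra:
        # Lenient with allow_extra: keep unknown fields for later use.
        extra = unknown
    # Lenient without allow_extra: unknown fields are silently dropped.
    return known, extra, errors


frontmatter = {"title": "Install", "weight": 5, "layout": "wide"}
strict_result = split_fields(DocPage, frontmatter, strict=True)
lenient_result = split_fields(DocPage, frontmatter, strict=False, allow_extra=True)
```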

Built-in Schemas

Bengal provides ready-to-use schemas for common content types:

from bengal.collections import define_collection
from bengal.collections.schemas import (
    BlogPost,      # title, date, author, tags, draft, description, image, excerpt
    DocPage,       # title, weight, category, tags, toc, description, deprecated, since
    APIReference,  # title, endpoint, method, version, auth_required, rate_limit
    Tutorial,      # title, difficulty, duration, prerequisites, series, order
    Changelog,     # title, date, version, breaking, draft, summary
)

collections = {
    "blog": define_collection(schema=BlogPost, directory="content/blog"),
    "docs": define_collection(schema=DocPage, directory="content/docs"),
}

Extending Built-in Schemas

Add custom fields by subclassing:

from dataclasses import dataclass
from bengal.collections.schemas import BlogPost

@dataclass
class MyBlogPost(BlogPost):
    """Extended blog post with custom fields."""
    series: str | None = None
    reading_time: int | None = None
    featured: bool = False

Validation Errors

When content fails validation, Bengal reports detailed errors:

Content validation failed: content/blog/my-post.md (collection: blog)
  └─ title: Required field 'title' is missing
  └─ date: Cannot parse 'January 15' as datetime
  └─ author.email: Invalid value for type 'str'

Validation Result

The validator returns a ValidationResult:

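from bengal.collections import SchemaValidator, ValidationResult
from bengal.collections.schemas import BlogPost

# Example input: frontmatter already parsed into a dict
frontmatter_dict = {"title": "My First Post", "date": "2025-01-15"}

validator = SchemaValidator(BlogPost)
result: ValidationResult = validator.validate(frontmatter_dict)

if result.valid:
    post: BlogPost = result.data  # Typed instance
else:
    # Each error carries the field path and a human-readable message
    for error in result.errors:
        print(f"{error.field}: {error.message}")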
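For readers who want to see the shape of such a result without installing Bengal, here is a standard-library stand-in. FieldError, Result, and validate are hypothetical names; only the valid, data, errors, field, and message attributes mirror the usage above. The sketch checks required fields only and leaves type coercion out.

```python
from dataclasses import MISSING, dataclass, fields
from typing import Any


@dataclass
class FieldError:
    """Stand-in for one validation error (hypothetical name)."""
    field: str
    message: str


@dataclass
class Result:
    """Stand-in exposing the attributes used above: valid, data, errors."""
    data: Any
    errors: list[FieldError]

    @property
    def valid(self) -> bool:
        return not self.errors


@dataclass
class BlogPost:
    title: str
    author: str = "Anonymous"


def validate(schema: type, frontmatter: dict[str, Any]) -> Result:
    """Report missing required fields; build the instance if none are missing."""
    errors = []
    for f in fields(schema):
        # A field is required when it has neither a default nor a factory.
        required = f.default is MISSING and f.default_factory is MISSING
        if required and f.name not in frontmatter:
            errors.append(
                FieldError(f.name, f"Required field '{f.name}' is missing")
            )
    if errors:
        return Result(data=None, errors=errors)
    return Result(data=schema(**frontmatter), errors=[])


bad = validate(BlogPost, {"author": "Jane Doe"})
good = validate(BlogPost, {"title": "My First Post"})
```

Returning a result object instead of raising lets the caller collect every problem in a file before reporting, which matches the multi-line error output shown earlier.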

Using with Remote Sources

Collections work with remote content sources:

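from bengal.collections import define_collection
from bengal.collections.schemas import APIReference, DocPage
from bengal.content_layer import github_loader, notion_loader

# WikiPage is assumed to be a schema you define yourself (not shown here)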

collections = {
    # Local content
    "docs": define_collection(
        schema=DocPage,
        directory="content/docs",
    ),

    # GitHub repository
    "api-docs": define_collection(
        schema=APIReference,
        loader=github_loader(repo="myorg/api-docs", path="docs/"),
    ),

    # Notion database
    "wiki": define_collection(
        schema=WikiPage,
        loader=notion_loader(database_id="abc123"),
    ),
}

IDE Support

With typed collections, your IDE provides:

  • Autocompletion for frontmatter fields
  • Type checking for field values
  • Go to definition for schema classes
  • Inline documentation from docstrings

Pydantic Support

Bengal also supports Pydantic models for advanced validation:

from pydantic import BaseModel, EmailStr, HttpUrl

class Author(BaseModel):
    name: str
    email: EmailStr
    website: HttpUrl | None = None

class BlogPost(BaseModel):
    title: str
    author: Author

    class Config:
        extra = "forbid"  # Strict mode

Best Practices

1. Start with Built-in Schemas

Use Bengal's built-in schemas and extend as needed:

from dataclasses import dataclass
from bengal.collections.schemas import DocPage

@dataclass
class MyDocPage(DocPage):
    custom_field: str | None = None

2. Use Strict Mode in Production

Catch frontmatter errors early:

define_collection(schema=DocPage, directory="content/docs", strict=True)

3. Document Your Schemas

Add docstrings for IDE support:

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class BlogPost:
    """
    Blog post content schema.

    Attributes:
        title: Post title displayed in listings and page header
        date: Publication date (ISO format: YYYY-MM-DD)
        author: Author name for byline
        tags: List of topic tags for categorization
    """
    title: str
    date: datetime
    author: str = "Anonymous"
    tags: list[str] = field(default_factory=list)