When an AI agent visits your documentation site, it faces the same problem a human does: figuring out what exists, where to start, and what it's allowed to use. Bengal solves this by generating a complete set of machine-readable outputs alongside your HTML — not as an afterthought, but as a default part of every build.
What Bengal Generates
Everybengal buildproduces these files automatically:
| File | Purpose | Audience |
|---|---|---|
llms.txt |
Curated site overview with navigation links | AI agents deciding where to look |
llm-full.txt |
Complete plain-text corpus | RAG pipelines ingesting full content |
agent.json |
Hierarchical site map with content hashes | Programmatic agent navigation |
{page}/index.json |
Per-page metadata, navigation, optional chunks | Fine-grained RAG retrieval |
{page}/index.txt |
Per-page plain text | Single-page LLM consumption |
index.json |
Searchable site index with facets | Client-side search, indexers |
robots.txt |
Content Signal directives | Crawlers respecting your policies |
.well-known/content-signals.json |
Machine-readable policy manifest | Automated compliance checks |
All formats respect your Content Signals policies. Denied pages simply don't get machine-readable files generated.
Content Signals: Who Can Use What
Bengal implements the Content Signals specification — a three-way policy that lets you control how automated systems interact with your content:
| Signal | Default | Controls |
|---|---|---|
search |
true |
Search engine indexing |
ai_input |
true |
RAG, grounding, AI-generated answers |
ai_train |
false |
Model training and fine-tuning |
The default posture is privacy-first: your docs are discoverable and citable by AI systems, but not available for training unless you opt in.
The Cascade
Policies cascade through three levels (highest priority first):
- Page frontmatter — overrides everything for that page
- Section
_index.md— inherited by all pages in the section - Site config — the default for all pages
# docs/internal/_index.md — hide an entire section from AI
---
cascade:
visibility:
ai_input: false
ai_train: false
---
# A single page opting in to training
---
visibility:
ai_train: true
---
Enforcement
Content Signals are not advisory. Bengal enforces them at the file level:
ai_input: falsepages get noindex.jsonorindex.txtai_train: falsepages are excluded fromllm-full.txtsearch: falsepages are excluded fromindex.jsonsite index- Draft pages are excluded from all machine-readable outputs
The files simply don't exist on disk for denied pages.
For RAG Pipelines
If you're feeding Bengal docs into a retrieval pipeline, the per-page JSON includes everything you need:
content_hash(SHA-256) — skip re-indexing unchanged pageslast_modified(ISO 8601) — freshness filteringchunks— heading-level content splits with individual hashes (enable withinclude_chunks: true)navigation.related— related pages for context expansion
Thechangelog.jsonfile tracks per-build diffs (added, modified, removed pages), so incremental indexing pipelines can process only what changed.
Configuration
With default features enabled, all AI-native formats are generated automatically. To customize:
# config/_default/outputs.yaml
output_formats:
per_page: ["json", "llm_txt"]
site_wide: ["index_json", "llm_full", "llms_txt", "agent_manifest", "changelog"]
options:
include_chunks: true # Heading-level RAG chunks
excerpt_length: 200 # Search index excerpt length
exclude_sections: ["internal"] # Hide sections from machine output
# config/_default/site.yaml
content_signals:
search: true
ai_input: true
ai_train: false
See Output Formats for the full configuration reference and SEO & Discovery for the broader discovery strategy.