AI-Native Output

How Bengal makes your documentation discoverable, navigable, and policy-compliant for AI agents and RAG pipelines

When an AI agent visits your documentation site, it faces the same problem a human does: figuring out what exists, where to start, and what it's allowed to use. Bengal solves this by generating a complete set of machine-readable outputs alongside your HTML — not as an afterthought, but as a default part of every build.

What Bengal Generates

Every `bengal build` produces these files automatically:

File	Purpose	Audience
`llms.txt`	Curated site overview with navigation links	AI agents deciding where to look
`llm-full.txt`	Complete plain-text corpus	RAG pipelines ingesting full content
`agent.json`	Hierarchical site map with content hashes	Programmatic agent navigation
`{page}/index.json`	Per-page metadata, navigation, optional chunks	Fine-grained RAG retrieval
`{page}/index.txt`	Per-page plain text	Single-page LLM consumption
`{page}/index.md`	Per-page Markdown mirror with an `llms.txt` directive	Coding agents and documentation checkers
`index.json`	Searchable site index with facets	Client-side search, indexers
`robots.txt`	Content Signal directives	Crawlers respecting your policies
`.well-known/content-signals.json`	Machine-readable policy manifest	Automated compliance checks

All formats respect your Content Signals policies. Denied pages simply don't get machine-readable files generated.
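One way to sanity-check a build is to confirm that the site-wide files from the table are present. The sketch below is a minimal illustration, not part of Bengal: the helper name `missing_site_outputs` is hypothetical, and the build directory is whatever path your build writes to.

```python
from pathlib import Path

# Site-wide files named in the table above; per-page files are not checked here.
SITE_WIDE_OUTPUTS = [
    "llms.txt",
    "llm-full.txt",
    "agent.json",
    "index.json",
    "robots.txt",
    ".well-known/content-signals.json",
]


def missing_site_outputs(build_dir):
    """Return the site-wide machine-readable files absent from build_dir.

    Hypothetical helper for illustration; it only tests for file existence.
    """
    root = Path(build_dir)
    return [name for name in SITE_WIDE_OUTPUTS if not (root / name).is_file()]
```

An empty list means every site-wide file in the table was found. A missing file is not necessarily an error, since output formats can be customized.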

Content Signals: Who Can Use What

Bengal implements the Content Signals specification — a three-way policy that lets you control how automated systems interact with your content:

Signal	Default	Controls
`search`	`true`	Search engine indexing
`ai_input`	`true`	RAG, grounding, AI-generated answers
`ai_train`	`false`	Model training and fine-tuning

The default posture is privacy-first: your docs are discoverable and citable by AI systems, but not available for training unless you opt in.

The Cascade

Policies cascade through three levels (highest priority first):

1. Page frontmatter: overrides everything for that page
2. Section `_index.md`: inherited by all pages in the section
3. Site config: the default for all pages

# docs/internal/_index.md — hide an entire section from AI
---
cascade:
  visibility:
    ai_input: false
    ai_train: false
---

# A single page opting in to training
---
visibility:
  ai_train: true
---
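The cascade can be pictured as a layered dictionary merge in which later layers win. This is a sketch of the precedence rule only, assuming each level contributes a flat mapping of signal names to booleans; `resolve_signals` is a hypothetical name, not Bengal's internal API.

```python
# Site-wide defaults, matching the defaults listed for the three signals.
SITE_DEFAULTS = {"search": True, "ai_input": True, "ai_train": False}


def resolve_signals(site, section=None, page=None):
    """Merge policies so page overrides section, which overrides site.

    Hypothetical helper: each argument is a dict of signal name -> bool,
    and a missing layer (None) contributes nothing.
    """
    effective = dict(site)
    for layer in (section, page):  # lowest to highest priority
        if layer:
            effective.update(layer)
    return effective


# A section that hides its pages from AI, and one page opting in to training.
section_cascade = {"ai_input": False, "ai_train": False}
page_visibility = {"ai_train": True}

print(resolve_signals(SITE_DEFAULTS, section_cascade, page_visibility))
# {'search': True, 'ai_input': False, 'ai_train': True}
```

In this example the page keeps the section's `ai_input: false` because it only overrides `ai_train`; signals a level does not mention fall through to the next level down.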

Enforcement

Content Signals are not advisory. Bengal enforces them at the file level:

ai_input: false pages get no index.json or index.txt
ai_input: false pages get no index.md Markdown mirror
ai_train: false pages are excluded from llm-full.txt
search: false pages are excluded from the index.json site index
Draft pages are excluded from all machine-readable outputs

The files simply don't exist on disk for denied pages.
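The five rules above can be read as a function from a page's effective signals to the outputs it may appear in. The sketch below encodes only those rules as listed; the output labels (`page_json`, `llm_full`, and so on) and the function name are hypothetical, and any interaction between signals that the list does not state is not modeled.

```python
def allowed_outputs(signals, draft=False):
    """Return labels of the machine-readable outputs a page may appear in.

    Hypothetical sketch: `signals` is a dict with the keys search, ai_input
    and ai_train. Missing keys fall back to the documented defaults.
    """
    if draft:
        return set()  # drafts are excluded from every machine-readable output

    outputs = set()
    if signals.get("ai_input", True):
        # Per-page index.json, index.txt and the index.md Markdown mirror.
        outputs |= {"page_json", "page_txt", "page_md"}
    if signals.get("ai_train", False):
        outputs.add("llm_full")  # inclusion in llm-full.txt
    if signals.get("search", True):
        outputs.add("site_index")  # inclusion in the site-wide index.json
    return outputs
```

For example, a page with `ai_input: false` and `search: true` would appear only in the site index under this reading.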

For RAG Pipelines

If you're feeding Bengal docs into a retrieval pipeline, the per-page JSON includes everything you need:

content_hash (SHA-256) — skip re-indexing unchanged pages
last_modified (ISO 8601) — freshness filtering
chunks — heading-level content splits with individual hashes (enable with include_chunks: true)
navigation.related — related pages for context expansion

The changelog.json file tracks per-build diffs (added, modified, and removed pages), so incremental indexing pipelines can process only what changed.
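As an illustration of hash-based skipping, the sketch below walks a build directory and yields only pages whose `content_hash` differs from the one recorded on a previous run. It assumes `content_hash` is a top-level field of each per-page `index.json`; the function name and the shape of `seen_hashes` are hypothetical. It does not read `changelog.json`, whose exact schema is not described here.

```python
import json
from pathlib import Path


def pages_to_reindex(build_dir, seen_hashes):
    """Yield (path, page) for per-page index.json files whose hash changed.

    `seen_hashes` maps a JSON path relative to build_dir to the hash stored
    on the previous run. It is updated in place as pages are yielded.
    """
    root = Path(build_dir)
    for json_path in sorted(root.rglob("index.json")):
        page = json.loads(json_path.read_text(encoding="utf-8"))
        # Skip JSON files without the assumed top-level content_hash field.
        if not isinstance(page, dict) or "content_hash" not in page:
            continue
        key = json_path.relative_to(root).as_posix()
        if seen_hashes.get(key) == page["content_hash"]:
            continue  # unchanged since the last run
        seen_hashes[key] = page["content_hash"]
        yield json_path, page
```

Persisting `seen_hashes` between runs (for instance as a small JSON file) lets a pipeline re-embed only the pages that actually changed.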

Configuration

With default features enabled, all AI-native formats are generated automatically. To customize:

# config/_default/outputs.yaml
output_formats:
  per_page: ["json", "llm_txt"]
  site_wide: ["index_json", "llm_full", "llms_txt", "agent_manifest", "changelog"]
  options:
    include_chunks: true              # Heading-level RAG chunks
    excerpt_length: 200               # Search index excerpt length
    exclude_sections: ["internal"]    # Hide sections from machine output

# config/_default/site.yaml
content_signals:
  search: true
  ai_input: true
  ai_train: false

See Output Formats for the full configuration reference and SEO & Discovery for the broader discovery strategy.