AI-Native Output

How Bengal makes your documentation discoverable, navigable, and policy-compliant for AI agents and RAG pipelines

3 min read 570 words

When an AI agent visits your documentation site, it faces the same problem a human does: figuring out what exists, where to start, and what it's allowed to use. Bengal solves this by generating a complete set of machine-readable outputs alongside your HTML — not as an afterthought, but as a default part of every build.

What Bengal Generates

Everybengal buildproduces these files automatically:

File Purpose Audience
llms.txt Curated site overview with navigation links AI agents deciding where to look
llm-full.txt Complete plain-text corpus RAG pipelines ingesting full content
agent.json Hierarchical site map with content hashes Programmatic agent navigation
{page}/index.json Per-page metadata, navigation, optional chunks Fine-grained RAG retrieval
{page}/index.txt Per-page plain text Single-page LLM consumption
index.json Searchable site index with facets Client-side search, indexers
robots.txt Content Signal directives Crawlers respecting your policies
.well-known/content-signals.json Machine-readable policy manifest Automated compliance checks

All formats respect your Content Signals policies. Denied pages simply don't get machine-readable files generated.

Content Signals: Who Can Use What

Bengal implements the Content Signals specification — a three-way policy that lets you control how automated systems interact with your content:

Signal Default Controls
search true Search engine indexing
ai_input true RAG, grounding, AI-generated answers
ai_train false Model training and fine-tuning

The default posture is privacy-first: your docs are discoverable and citable by AI systems, but not available for training unless you opt in.

The Cascade

Policies cascade through three levels (highest priority first):

  1. Page frontmatter — overrides everything for that page
  2. Section_index.md — inherited by all pages in the section
  3. Site config — the default for all pages
# docs/internal/_index.md — hide an entire section from AI
---
cascade:
  visibility:
    ai_input: false
    ai_train: false
---
# A single page opting in to training
---
visibility:
  ai_train: true
---

Enforcement

Content Signals are not advisory. Bengal enforces them at the file level:

  • ai_input: false pages get no index.json or index.txt
  • ai_train: false pages are excluded from llm-full.txt
  • search: false pages are excluded from index.jsonsite index
  • Draft pages are excluded from all machine-readable outputs

The files simply don't exist on disk for denied pages.

For RAG Pipelines

If you're feeding Bengal docs into a retrieval pipeline, the per-page JSON includes everything you need:

  • content_hash (SHA-256) — skip re-indexing unchanged pages
  • last_modified (ISO 8601) — freshness filtering
  • chunks — heading-level content splits with individual hashes (enable with include_chunks: true)
  • navigation.related — related pages for context expansion

Thechangelog.jsonfile tracks per-build diffs (added, modified, removed pages), so incremental indexing pipelines can process only what changed.

Configuration

With default features enabled, all AI-native formats are generated automatically. To customize:

# config/_default/outputs.yaml
output_formats:
  per_page: ["json", "llm_txt"]
  site_wide: ["index_json", "llm_full", "llms_txt", "agent_manifest", "changelog"]
  options:
    include_chunks: true              # Heading-level RAG chunks
    excerpt_length: 200               # Search index excerpt length
    exclude_sections: ["internal"]    # Hide sections from machine output

# config/_default/site.yaml
content_signals:
  search: true
  ai_input: true
  ai_train: false

See Output Formats for the full configuration reference and SEO & Discovery for the broader discovery strategy.