When sending Markdown to an LLM (RAG retrieval, user-provided docs, context windows), you should sanitize untrusted content and render to a safe, structured format. Patitas provides a parse → sanitize → render pipeline for this.
Pipeline
from patitas import parse, sanitize, render_llm
from patitas.sanitize import llm_safe
# 1. Parse
doc = parse(user_content)
# 2. Sanitize (strip HTML, dangerous URLs, zero-width chars)
clean = sanitize(doc, policy=llm_safe)
# 3. Render to LLM-friendly plain text
safe_text = render_llm(clean, source=user_content)
What Gets Removed
Thellm_safepolicy strips:
- HTML — All
HtmlBlockandHtmlInlinenodes (prevents script injection, hidden content) - Dangerous URLs — Links and images with
javascript:,data:,vbscript:schemes - Zero-width / bidi characters — Trojan Source mitigation (invisible Unicode that can change display order or hide malicious content)
Pre-built Policies
| Policy | Strips | Use case |
|---|---|---|
llm_safe |
HTML, dangerous URLs, zero-width | LLM context (RAG, retrieval) |
web_safe |
Same as llm_safe (alias) | Web display of untrusted content |
strict |
llm_safe + images (→ alt text) + raw code blocks | Maximum reduction |
from patitas.sanitize import llm_safe, web_safe, strict
clean = sanitize(doc, policy=llm_safe) # RAG, retrieval
clean = sanitize(doc, policy=web_safe) # User content on web (same as llm_safe)
clean = sanitize(doc, policy=strict) # Minimal text only
Custom Policies
Compose policies with the|operator:
from patitas.sanitize import strip_html, strip_dangerous_urls, normalize_unicode, allow_url_schemes
# Only https and mailto
custom = strip_html | strip_dangerous_urls | normalize_unicode | allow_url_schemes("https", "mailto")
clean = sanitize(doc, policy=custom)
Available building blocks: strip_html, strip_html_comments, strip_dangerous_urls,
normalize_unicode, strip_images, strip_raw_code, allow_url_schemes(*schemes).
render_llm Output
render_llm()produces structured plain text with explicit labels:
- Code blocks:
[code:python]\n...\n[/code] - Math:
[math] ... [/math] - Images:
[image: alt text] - HtmlBlock / HtmlInline: skipped (not rendered)
No raw HTML is emitted. The format is predictable and parseable for downstream tools.
Example
Run the full pipeline:
python examples/llm_safety/llm_safe_context.py
This demonstrates parse → sanitize → render_llm on sample content with HTML, dangerous links, and zero-width characters.
See Also
- API Reference —
render_llm(),sanitize(),Policy - Examples — Runnable examples in the repo