Knowledge Curator
Not everything that enters the system deserves to be in the knowledge graph. Web pages contain navigation menus, cookie banners, and ads. Workflow results include debug output and status messages. Chat transcripts have small talk alongside substance.
The Knowledge Curator is the gatekeeper. It evaluates incoming content, extracts what's substantive, discards what's noise, and routes the remainder to the right place in your project's document library and knowledge graph.
Three-Path Enrichment Model
When content enters the system — from document uploads, workflow results, or web captures — the enrichment queue routes it through one of three paths:
| Path | When | What happens |
|---|---|---|
| No KG | Project has Long-Term Memory disabled | Content is saved to the document library. No classification. |
| Direct | Content is clean and structured (e.g., file uploads, structured data) | Skips the Curator. Goes straight to classification. |
| Curator | Content is messy or unstructured (web scrapes, tool output, chat) | Curator evaluates, extracts, routes, then classification runs on the cleaned content. |
What the Curator Does
The Curator performs four functions in a single LLM evaluation:
1. Filter
Is this content worth keeping? The Curator rejects:
- Navigation menus, footers, sidebars, cookie notices
- Ads and promotional content
- Content shorter than 200 characters after cleaning
- Duplicate content (matched by content hash)
2. Extract
For content that passes the filter, the Curator strips boilerplate and extracts the substantive prose. It preserves full paragraphs — no summarization, no paraphrasing. The goal is to keep the author's original words while removing the web chrome around them.
3. Route
The Curator suggests where the content should live in the project's document library:
- Matches against existing folder structure
- Proposes new folders when the content doesn't fit existing categories
- Respects the project description for topic guidance
4. Classify Domain
The Curator assigns a semantic domain name (1-3 words) that becomes the concept container in the knowledge graph:
- For URLs: the brand or product name (e.g.,
crewai.com/docs/agents→ "CrewAI") - For files: a meaningful topic derived from the filename
- For tool output: the tool name or subject area
Upload Flow
When a user uploads a document to a project with Long-Term Memory enabled:
The Curator's folder suggestion appears in a modal within seconds. Classification takes longer (1-4 minutes) and runs in the background with real-time progress via WebSocket.
Document Parsing
Before the Curator or classifier can work on a document, it needs to be in markdown. Plain text and markdown files pass through directly, but binary formats are converted by a dedicated extraction sidecar service.
PDF Extraction
PDFs go through a hybrid pipeline:
- Image extraction — PyMuPDF scans each page for embedded images. Tiny decorative images (under 10KB) are skipped. Identical images across pages are deduplicated.
- Page rendering — Each page is rendered as a high-resolution PNG snapshot (200 DPI).
- Vision OCR — Each snapshot is sent to a vision model (Gemini Flash by default) that produces faithful markdown preserving headings, tables, lists, bold/italic formatting, and LaTeX formulas (
$inline$and$$block$$). - Section splitting — Documents longer than 30 pages are automatically split into semantic sections on top-level headings.
The vision model is configurable via environment variables (VISION_BASE_URL, VISION_API_KEY, VISION_MODEL), so you can use any OpenAI-compatible vision endpoint.
Office Document Extraction
DOCX, ODT, EPUB, and HTML files are converted using Pandoc:
- Output format is GitHub-Flavored Markdown
- Embedded images are extracted to a media directory
- Image references in the markdown are rewritten to relative paths
Images as Knowledge
Images and illustrations are not discarded during parsing — they are treated as knowledge carriers:
- Each extracted image is classified with a media type: Diagram, Chart, Table, Formula, or Image.
- The vision model generates a detailed alt-text description (2-4 sentences) explaining what the image shows, what concepts it conveys, and any visible text or labels.
- Image references are placed at their exact position in the markdown output, so a diagram on page 12 appears inline where it belongs — not in a disconnected image dump.
- The alt-text descriptions flow into the classification pipeline alongside text content, so images produce their own knowledge units in the graph, marked as media.
This means a chart showing quarterly revenue trends or an architecture diagram with component labels becomes searchable, classifiable knowledge — not a binary blob that agents can't reason about.
Workflow Enrichment
When workflows produce results, the enrichment queue decides what to capture:
- Tool results from agents are evaluated — did the tool return substantive content?
- Web content captured by browsing tools goes through the full Curator path (filter + extract + route).
- Structured output (JSON, data tables) takes the Direct path — straight to classification.
- Status messages and debug output are filtered out automatically.
The enrichment queue runs as a background daemon — fire-and-forget, no impact on workflow execution speed.
Real-Time Feedback
The Curator and classification pipeline emit WebSocket events so the project page updates in real-time:
| Event | Meaning |
|---|---|
knowledge_enrichment_skipped | Content was filtered out (not worth keeping) |
knowledge_curator_discard | Curator explicitly rejected the content |
knowledge_folder_proposed | Curator suggested a folder path |
knowledge_enriched | Content was classified and added to the graph |
Configuration
The Curator uses a lightweight, fast LLM (Gemini Flash by default) to keep evaluation cost low and latency under 3 seconds. The classifier uses a more capable model for accurate ontological classification.
| Component | Default Model | Purpose |
|---|---|---|
| Curator | Gemini Flash | Fast content evaluation and routing |
| Classifier | Claude Sonnet | Accurate ontological classification |
| Embeddings | text-embedding-3-small | Vector representations for similarity and relation discovery |