Knowledge Curator
Not everything that enters the system deserves to be in the knowledge graph. Web pages contain navigation menus, cookie banners, and ads. Workflow results include debug output and status messages. Chat transcripts have small talk alongside substance.
The Knowledge Curator is the gatekeeper. It evaluates incoming content, extracts what's substantive, discards the noise, and routes what it keeps to the right place in your project's document library and knowledge graph.
Three-Path Enrichment Model
When content enters the system — from document uploads, workflow results, or web captures — the enrichment queue routes it through one of three paths:
| Path | When | What happens |
|---|---|---|
| No KG | Project has Long-Term Memory disabled | Content is saved to the document library. No classification. |
| Direct | Content is clean and structured (e.g., file uploads, structured data) | Skips the Curator. Goes straight to classification. |
| Curator | Content is messy or unstructured (web scrapes, tool output, chat) | Curator evaluates, extracts, routes, then classification runs on the cleaned content. |
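The table above amounts to a small decision function. The following is a minimal sketch of that decision, not the actual implementation: the `Path` enum, the `choose_path` function, and the source labels are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Path(Enum):
    NO_KG = "no_kg"
    DIRECT = "direct"
    CURATOR = "curator"


# Hypothetical source labels; the real system may categorize content differently.
CLEAN_SOURCES = {"file_upload", "structured_data"}


def choose_path(long_term_memory_enabled: bool, source: str) -> Path:
    """Pick an enrichment path following the three rows of the table."""
    if not long_term_memory_enabled:
        # No knowledge graph: the content is only saved to the document library.
        return Path.NO_KG
    if source in CLEAN_SOURCES:
        # Clean, structured content skips the Curator.
        return Path.DIRECT
    # Everything else (web scrapes, tool output, chat) is curated first.
    return Path.CURATOR
```

Checking Long-Term Memory first mirrors the table: when it is disabled, the content type no longer matters.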
What the Curator Does
The Curator performs four functions in a single LLM evaluation:
1. Filter
Is this content worth keeping? The Curator rejects:
- Navigation menus, footers, sidebars, cookie notices
- Ads and promotional content
- Content shorter than 200 characters after cleaning
- Duplicate content (matched by content hash)
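The last two rules are mechanical and can be sketched without an LLM. The example below is an illustration under stated assumptions: the 200-character threshold comes from the list above, while the choice of SHA-256, the whitespace normalization, and the function names are hypothetical.

```python
import hashlib

MIN_LENGTH = 200  # characters after cleaning, per the list above


def content_hash(text: str) -> str:
    # Assumption: SHA-256 over whitespace-normalized text.
    # The hash function the Curator actually uses is not specified here.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def passes_filter(cleaned: str, seen_hashes: set[str]) -> bool:
    """Return True if the content is long enough and not a duplicate.

    Accepted content has its hash recorded in seen_hashes.
    """
    if len(cleaned.strip()) < MIN_LENGTH:
        return False
    digest = content_hash(cleaned)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

Detecting menus, footers, and ads is left to the LLM evaluation and is not covered by this sketch.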
2. Extract
For content that passes the filter, the Curator strips boilerplate and extracts the substantive prose. It preserves full paragraphs — no summarization, no paraphrasing. The goal is to keep the author's original words while removing the web chrome around them.
3. Route
The Curator suggests where the content should live in the project's document library:
- Matches against existing folder structure
- Proposes new folders when the content doesn't fit existing categories
- Respects the project description for topic guidance
4. Classify Domain
The Curator assigns a semantic domain name (1-3 words) that becomes the concept container in the knowledge graph:
- For URLs: the brand or product name (e.g., crewai.com/docs/agents → "CrewAI")
- For files: a meaningful topic derived from the filename
- For tool output: the tool name or subject area
Upload Flow
When a user uploads a document to a project with Long-Term Memory enabled:
User drops a file
↓
File is saved to Spaces storage
↓
Curator evaluates a 3,000-character preview
↓
Curator suggests: folder path + domain name
↓
User sees a modal with the suggestion → confirms or adjusts
↓
Document is saved with the chosen folder
↓
Classification pipeline runs asynchronously (3-pass)
↓
Knowledge units appear in the graph
The Curator's folder suggestion appears in a modal within seconds. Classification takes longer (1-4 minutes) and runs in the background with real-time progress via WebSocket.
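The split between a fast suggestion and a slow background classification can be sketched with `asyncio`. This is a simplified illustration, not the product's code: `suggest_folder`, `classify`, and `handle_upload` are hypothetical names, the returned suggestion is a fixed placeholder, and the progress list stands in for WebSocket progress events.

```python
import asyncio

PREVIEW_CHARS = 3000  # preview size from the flow above


async def suggest_folder(preview: str) -> dict:
    # Stand-in for the Curator's LLM call; returns a fixed, hypothetical suggestion.
    return {"folder": "Research/Agents", "domain": "CrewAI"}


async def classify(document: str, progress: list[str]) -> None:
    # Stand-in for the 3-pass classification pipeline.
    for n in (1, 2, 3):
        await asyncio.sleep(0)  # yield control, as a real pass would while awaiting I/O
        progress.append(f"pass {n} done")


async def handle_upload(document: str, progress: list[str]) -> tuple[dict, asyncio.Task]:
    # Only the preview is sent to the Curator, which keeps the suggestion fast.
    suggestion = await suggest_folder(document[:PREVIEW_CHARS])
    # The user would confirm or adjust the suggestion at this point.
    # Classification is scheduled as a background task and is not awaited here.
    task = asyncio.create_task(classify(document, progress))
    return suggestion, task
```

The caller gets the suggestion back immediately and can await the returned task later, or ignore it and rely on progress events.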
Workflow Enrichment
When workflows produce results, the enrichment queue decides what to capture:
- Tool results from agents are evaluated — did the tool return substantive content?
- Web content captured by browsing tools goes through the full Curator path (filter + extract + route).
- Structured output (JSON, data tables) takes the Direct path — straight to classification.
- Status messages and debug output are filtered out automatically.
The enrichment queue runs as a background daemon — fire-and-forget, no impact on workflow execution speed.
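A fire-and-forget queue of this kind can be sketched with Python's standard `queue` and `threading` modules. This is one way such a daemon might look, assuming an in-process worker thread; the names `enrichment_queue`, `worker`, and `submit` are hypothetical, and the worker body is a placeholder for the real enrichment step.

```python
import queue
import threading

enrichment_queue: "queue.Queue[str]" = queue.Queue()
processed: list[str] = []  # placeholder sink so the effect is observable


def worker() -> None:
    # Runs forever in the background, handling one item at a time.
    while True:
        item = enrichment_queue.get()
        try:
            processed.append(item)  # placeholder for the real enrichment work
        finally:
            enrichment_queue.task_done()


# A daemon thread does not block interpreter shutdown.
threading.Thread(target=worker, daemon=True).start()


def submit(result: str) -> None:
    """Enqueue a result and return immediately; the workflow never waits."""
    enrichment_queue.put_nowait(result)
```

A workflow would call `submit(...)` and continue. Calling `enrichment_queue.join()` is only needed where the caller wants to wait for the queue to drain, for example in tests.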
Real-Time Feedback
The Curator and classification pipeline emit WebSocket events so the project page updates in real time:
| Event | Meaning |
|---|---|
| knowledge_enrichment_skipped | Content was filtered out (not worth keeping) |
| knowledge_curator_discard | Curator explicitly rejected the content |
| knowledge_folder_proposed | Curator suggested a folder path |
| knowledge_enriched | Content was classified and added to the graph |
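A client might map these event names to user-facing messages as sketched below. The event names come from the table above; the message shape is an assumption (a JSON object with an `event` field), since the payload format is not documented here, and `describe_event` is a hypothetical helper.

```python
import json

EVENT_MEANINGS = {
    "knowledge_enrichment_skipped": "Content was filtered out",
    "knowledge_curator_discard": "Curator rejected the content",
    "knowledge_folder_proposed": "Curator suggested a folder path",
    "knowledge_enriched": "Content was added to the graph",
}


def describe_event(raw_message: str) -> str:
    # Assumption: each WebSocket message is a JSON object with an "event" field.
    payload = json.loads(raw_message)
    name = payload.get("event", "")
    # Unknown names are reported rather than raising, so new events do not break the client.
    return EVENT_MEANINGS.get(name, f"Unknown event: {name}")
```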
Configuration
The Curator uses a lightweight, fast LLM (Gemini Flash by default) to keep evaluation cost low and latency under 3 seconds. The classifier uses a more capable model for accurate ontological classification.
| Component | Default Model | Purpose |
|---|---|---|
| Curator | Gemini Flash | Fast content evaluation and routing |
| Classifier | Claude Sonnet | Accurate Web-Didaktik classification |
| Embeddings | text-embedding-3-small | Vector representations for similarity and relation discovery |
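The defaults in the table could be represented as a small configuration object. This is a sketch only: the `ModelConfig` class and its field names are hypothetical, and the values are the display names from the table, not necessarily the exact model identifiers a provider API expects.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    # Defaults taken from the table above; field names are illustrative.
    curator: str = "Gemini Flash"
    classifier: str = "Claude Sonnet"
    embeddings: str = "text-embedding-3-small"


defaults = ModelConfig()
# Overriding one component leaves the other defaults untouched.
custom = replace(defaults, curator="another-fast-model")
```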