Knowledge Curator

Not everything that enters the system deserves to be in the knowledge graph. Web pages contain navigation menus, cookie banners, and ads. Workflow results include debug output and status messages. Chat transcripts have small talk alongside substance.

The Knowledge Curator is the gatekeeper. It evaluates incoming content, extracts what's substantive, discards the noise, and routes what it keeps to the right place in your project's document library and knowledge graph.

Figure: Knowledge Curator pipeline

Three-Path Enrichment Model

When content enters the system — from document uploads, workflow results, or web captures — the enrichment queue routes it through one of three paths:

| Path | When | What happens |
| --- | --- | --- |
| No KG | Project has Long-Term Memory disabled | Content is saved to the document library. No classification. |
| Direct | Content is clean and structured (e.g., file uploads, structured data) | Skips the Curator. Goes straight to classification. |
| Curator | Content is messy or unstructured (web scrapes, tool output, chat) | Curator evaluates, extracts, routes, then classification runs on the cleaned content. |
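
The routing decision in the table above can be sketched as a small function. This is only an illustration: the `Path` enum, the `choose_path` name, and the source labels are hypothetical, and the real queue may use other signals to decide whether content counts as clean.

```python
from enum import Enum


class Path(Enum):
    NO_KG = "no_kg"
    DIRECT = "direct"
    CURATOR = "curator"


# Hypothetical source labels; the real system's identifiers may differ.
CLEAN_SOURCES = {"file_upload", "structured_data"}


def choose_path(long_term_memory_enabled: bool, source: str) -> Path:
    """Pick an enrichment path following the three-path table."""
    if not long_term_memory_enabled:
        return Path.NO_KG  # saved to the document library, never classified
    if source in CLEAN_SOURCES:
        return Path.DIRECT  # skips the Curator, goes straight to classification
    return Path.CURATOR  # web scrapes, tool output, chat
```

Checking Long-Term Memory first means a project without a knowledge graph never pays for a Curator or classifier call.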

What the Curator Does

The Curator performs four functions in a single LLM evaluation:

1. Filter

Is this content worth keeping? The Curator rejects:

  • Navigation menus, footers, sidebars, cookie notices
  • Ads and promotional content
  • Content shorter than 200 characters after cleaning
  • Duplicate content (matched by content hash)
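
Two of these rules are mechanical and can be sketched without an LLM: the 200-character minimum and the duplicate check. The function names below are hypothetical, and SHA-256 is an assumption, since the page only says "content hash". Spotting menus, footers, and ads still needs the Curator's judgment and is not shown.

```python
import hashlib

MIN_CHARS = 200  # threshold stated in the list above


def content_hash(text: str) -> str:
    # SHA-256 is an assumption; the docs only say "content hash".
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def passes_cheap_checks(cleaned: str, seen_hashes: set[str]) -> bool:
    """Return True if the cleaned text is long enough and not a duplicate."""
    text = cleaned.strip()
    if len(text) < MIN_CHARS:
        return False
    digest = content_hash(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)  # remember it so a later copy is rejected
    return True
```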

2. Extract

For content that passes the filter, the Curator strips boilerplate and extracts the substantive prose. It preserves full paragraphs — no summarization, no paraphrasing. The goal is to keep the author's original words while removing the web chrome around them.

3. Route

The Curator suggests where the content should live in the project's document library:

  • Matches against existing folder structure
  • Proposes new folders when the content doesn't fit existing categories
  • Respects the project description for topic guidance

4. Classify Domain

The Curator assigns a semantic domain name (1-3 words) that becomes the concept container in the knowledge graph:

  • For URLs: the brand or product name (e.g., crewai.com/docs/agents → "CrewAI")
  • For files: a meaningful topic derived from the filename
  • For tool output: the tool name or subject area
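
Because all four functions come out of one evaluation, it can help to picture the result as a single record. The sketch below is an assumption about shape only: `CuratorResult` and its field names are hypothetical, not the system's actual schema. It enforces the 1-3 word limit on the domain name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CuratorResult:
    keep: bool                     # 1. Filter
    extracted_text: Optional[str]  # 2. Extract
    folder_path: Optional[str]     # 3. Route
    domain: Optional[str]          # 4. Classify Domain

    def __post_init__(self) -> None:
        if not self.keep:
            return  # rejected content carries no further fields
        if not self.extracted_text or not self.folder_path or not self.domain:
            raise ValueError("kept content needs text, folder, and domain")
        if not 1 <= len(self.domain.split()) <= 3:
            raise ValueError("domain must be 1-3 words")
```

For example, `CuratorResult(True, "…prose…", "Frameworks/CrewAI", "CrewAI")` is valid (the folder path here is made up), while a four-word domain raises `ValueError`.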

Upload Flow

When a user uploads a document to a project with Long-Term Memory enabled:

  1. User drops a file
  2. File is saved to Spaces storage
  3. Curator evaluates a 3,000-character preview
  4. Curator suggests: folder path + domain name
  5. User sees a modal with the suggestion → confirms or adjusts
  6. Document is saved with the chosen folder
  7. Classification pipeline runs asynchronously (3-pass)
  8. Knowledge units appear in the graph

The Curator's folder suggestion appears in a modal within seconds. Classification takes longer (1-4 minutes) and runs in the background with real-time progress via WebSocket.
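
The split between a fast suggestion and slow background classification can be sketched with `asyncio`. Everything here is a stand-in: `suggest_placement` and `classify` are hypothetical placeholders for the Curator call and the 3-pass pipeline, and the returned folder and domain are dummy values.

```python
import asyncio

PREVIEW_CHARS = 3000  # preview size stated in the upload flow


async def suggest_placement(preview: str) -> dict:
    # Stand-in for the Curator's LLM call; returns a fixed dummy suggestion.
    return {"folder": "Inbox", "domain": "General"}


async def classify(document_id: str, progress: list) -> None:
    # Stand-in for the 3-pass classification pipeline.
    for pass_number in (1, 2, 3):
        await asyncio.sleep(0)  # yield control, as real work would
        progress.append((document_id, pass_number))


async def handle_upload(document_id: str, text: str, progress: list):
    # Only the preview is sent for the folder and domain suggestion.
    suggestion = await suggest_placement(text[:PREVIEW_CHARS])
    # The user confirms or adjusts the suggestion here (modal step omitted).
    task = asyncio.create_task(classify(document_id, progress))
    return suggestion, task  # caller gets the suggestion without waiting
```

The caller receives the suggestion immediately and can await the returned task later, or simply let progress events report on it.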

Figure: Document upload with Curator suggestion

Workflow Enrichment

When workflows produce results, the enrichment queue decides what to capture:

  1. Tool results from agents are evaluated — did the tool return substantive content?
  2. Web content captured by browsing tools goes through the full Curator path (filter + extract + route).
  3. Structured output (JSON, data tables) takes the Direct path — straight to classification.
  4. Status messages and debug output are filtered out automatically.

The enrichment queue runs as a background daemon — fire-and-forget, no impact on workflow execution speed.
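
A fire-and-forget queue of this kind can be sketched with the standard library. This is a minimal illustration of the pattern, not the system's implementation: the names are hypothetical, and the worker only records items where the real daemon would curate and classify them.

```python
import queue
import threading

enrichment_queue: "queue.Queue[dict]" = queue.Queue()
processed: list[dict] = []


def _worker() -> None:
    while True:
        item = enrichment_queue.get()  # blocks until work arrives
        try:
            processed.append(item)  # placeholder for curate + classify
        finally:
            enrichment_queue.task_done()


# daemon=True: the thread never blocks interpreter shutdown.
threading.Thread(target=_worker, daemon=True).start()


def enqueue(item: dict) -> None:
    """Return immediately; the workflow never waits on enrichment."""
    enrichment_queue.put_nowait(item)
```

A workflow step calls `enqueue(...)` and moves on. `enrichment_queue.join()` is only needed in tests or at shutdown, when you want to wait for the backlog to drain.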

Real-Time Feedback

The Curator and classification pipeline emit WebSocket events so the project page updates in real time:

| Event | Meaning |
| --- | --- |
| knowledge_enrichment_skipped | Content was filtered out (not worth keeping) |
| knowledge_curator_discard | Curator explicitly rejected the content |
| knowledge_folder_proposed | Curator suggested a folder path |
| knowledge_enriched | Content was classified and added to the graph |
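
A client might map these events to status messages as sketched below. The event names come from the table above, but the payload shape is an assumption: this example supposes each message is JSON with an `event` field, which the page does not specify.

```python
import json

# Event names from the table above; the message wording is illustrative.
EVENT_MESSAGES = {
    "knowledge_enrichment_skipped": "Skipped: content was filtered out",
    "knowledge_curator_discard": "Discarded: Curator rejected the content",
    "knowledge_folder_proposed": "Curator suggested a folder path",
    "knowledge_enriched": "Content classified and added to the graph",
}


def describe_event(raw: str) -> str:
    """Turn a raw WebSocket message into a status line."""
    payload = json.loads(raw)
    name = payload.get("event")  # assumed field name
    return EVENT_MESSAGES.get(name, f"Unhandled event: {name}")
```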

Configuration

The Curator uses a lightweight, fast LLM (Gemini Flash by default) to keep evaluation cost low and latency under 3 seconds. The classifier uses a more capable model for accurate ontological classification.

| Component | Default Model | Purpose |
| --- | --- | --- |
| Curator | Gemini Flash | Fast content evaluation and routing |
| Classifier | Claude Sonnet | Accurate Web-Didaktik classification |
| Embeddings | text-embedding-3-small | Vector representations for similarity and relation discovery |
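
The defaults above could be represented in code roughly as follows. The structure, key names, and `with_override` helper are hypothetical, and the page does not document how the defaults are actually changed. Only the model names and purposes come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model: str
    purpose: str


# Keys and field names are illustrative; model names are the documented defaults.
DEFAULT_MODELS = {
    "curator": ModelConfig("Gemini Flash", "Fast content evaluation and routing"),
    "classifier": ModelConfig("Claude Sonnet", "Accurate Web-Didaktik classification"),
    "embeddings": ModelConfig(
        "text-embedding-3-small",
        "Vector representations for similarity and relation discovery",
    ),
}


def with_override(models: dict, component: str, model: str) -> dict:
    """Return a copy of the config with one component's model swapped."""
    updated = dict(models)
    updated[component] = ModelConfig(model, models[component].purpose)
    return updated
```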