Knowledge Curator

Not everything that enters the system deserves to be in the knowledge graph. Web pages contain navigation menus, cookie banners, and ads. Workflow results include debug output and status messages. Chat transcripts have small talk alongside substance.

The Knowledge Curator is the gatekeeper. It evaluates incoming content, discards the noise, extracts what's substantive, and routes the extracted content to the right place in your project's document library and knowledge graph.

Three-Path Enrichment Model

When content enters the system — from document uploads, workflow results, or web captures — the enrichment queue routes it through one of three paths:

| Path | When | What happens |
| --- | --- | --- |
| No KG | Project has Long-Term Memory disabled | Content is saved to the document library. No classification. |
| Direct | Content is clean and structured (e.g., file uploads, structured data) | Skips the Curator. Goes straight to classification. |
| Curator | Content is messy or unstructured (web scrapes, tool output, chat) | Curator evaluates, extracts, routes, then classification runs on the cleaned content. |
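
The routing decision in the table can be sketched as a small function. This is a minimal illustration, not the product's implementation: the `Path` enum, the `choose_path` function, and the source labels such as `file_upload` are hypothetical names chosen for the example.

```python
from enum import Enum


class Path(Enum):
    NO_KG = "No KG"
    DIRECT = "Direct"
    CURATOR = "Curator"


# Hypothetical source labels; the real system may categorize content differently.
CLEAN_SOURCES = {"file_upload", "structured_data"}


def choose_path(long_term_memory_enabled: bool, source: str) -> Path:
    """Pick an enrichment path following the three rows of the table."""
    if not long_term_memory_enabled:
        return Path.NO_KG    # saved to the document library, no classification
    if source in CLEAN_SOURCES:
        return Path.DIRECT   # skips the Curator, goes straight to classification
    return Path.CURATOR      # web scrapes, tool output, chat
```

Checking the Long-Term Memory flag first mirrors the table: when the knowledge graph is off, the kind of content no longer matters.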

What the Curator Does

The Curator performs four functions in a single LLM evaluation:

1. Filter

Is this content worth keeping? The Curator rejects:

  • Navigation menus, footers, sidebars, cookie notices
  • Ads and promotional content
  • Content shorter than 200 characters after cleaning
  • Duplicate content (matched by content hash)
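
The last two rejection rules are mechanical and can be illustrated without an LLM. The sketch below is an assumption-laden example: the text above only says "content hash", so the choice of SHA-256 and the names `content_hash` and `passes_mechanical_filter` are hypothetical.

```python
import hashlib

MIN_CHARS = 200  # length threshold from the list above


def content_hash(text: str) -> str:
    # SHA-256 is an assumption; the source does not name the hash function.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def passes_mechanical_filter(cleaned: str, seen_hashes: set) -> bool:
    """Reject content that is too short after cleaning or already seen."""
    text = cleaned.strip()
    if len(text) < MIN_CHARS:
        return False
    digest = content_hash(text)
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)  # remember it so later copies are rejected
    return True
```

Rejecting menus, footers, and ads is a judgment call that the Curator makes in its LLM evaluation, so it is not modeled here.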

2. Extract

For content that passes the filter, the Curator strips boilerplate and extracts the substantive prose. It preserves full paragraphs — no summarization, no paraphrasing. The goal is to keep the author's original words while removing the web chrome around them.

3. Route

The Curator suggests where the content should live in the project's document library:

  • Matches against existing folder structure
  • Proposes new folders when the content doesn't fit existing categories
  • Respects the project description for topic guidance

4. Classify Domain

The Curator assigns a semantic domain name (1-3 words) that becomes the concept container in the knowledge graph:

  • For URLs: the brand or product name (e.g., crewai.com/docs/agents → "CrewAI")
  • For files: a meaningful topic derived from the filename
  • For tool output: the tool name or subject area

Upload Flow

When a user uploads a document to a project with Long-Term Memory enabled, the Curator's folder suggestion appears in a modal within seconds. Classification takes longer (1-4 minutes) and runs in the background, reporting real-time progress via WebSocket.

Document Parsing

Before the Curator or classifier can work on a document, it needs to be in markdown. Plain text and markdown files pass through directly, but binary formats are converted by a dedicated extraction sidecar service.

PDF Extraction

PDFs go through a hybrid pipeline:

  1. Image extraction: PyMuPDF scans each page for embedded images. Tiny decorative images (under 10 KB) are skipped. Identical images across pages are deduplicated.
  2. Page rendering: Each page is rendered as a high-resolution PNG snapshot (200 DPI).
  3. Vision OCR: Each snapshot is sent to a vision model (Gemini Flash by default) that produces faithful markdown preserving headings, tables, lists, bold/italic formatting, and LaTeX formulas ($inline$ and $$block$$).
  4. Section splitting: Documents longer than 30 pages are automatically split into semantic sections on top-level headings.
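
Steps 1 and 4 can be sketched in plain Python. This is a simplified illustration under stated assumptions: it works on raw image bytes rather than calling PyMuPDF, it treats 10 KB as 10 × 1024 bytes, it uses SHA-256 to detect identical images, and it treats a line starting with `# ` as a top-level heading. The function names are hypothetical.

```python
import hashlib
import re

MIN_IMAGE_BYTES = 10 * 1024   # assumes 1 KB = 1024 bytes
SPLIT_PAGE_THRESHOLD = 30     # documents longer than this are split


def keep_images(images: list) -> list:
    """Drop tiny images and exact duplicates, preserving first-seen order."""
    seen = set()
    kept = []
    for data in images:
        if len(data) < MIN_IMAGE_BYTES:
            continue  # tiny decorative image
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # identical image already kept from another page
        seen.add(digest)
        kept.append(data)
    return kept


def split_sections(markdown: str, page_count: int) -> list:
    """Split on top-level headings when the document exceeds the threshold."""
    if page_count <= SPLIT_PAGE_THRESHOLD:
        return [markdown]
    # Zero-width split just before every line that starts with "# ".
    parts = re.split(r"(?m)^(?=# )", markdown)
    return [part for part in parts if part.strip()]
```

The heading regex is deliberately naive: it would also split on a `# ` line inside a fenced code block, which a production splitter would need to handle.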

The vision model is configurable via environment variables (VISION_BASE_URL, VISION_API_KEY, VISION_MODEL), so you can use any OpenAI-compatible vision endpoint.
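
As a rough illustration of how a service might read these three variables, the sketch below collects them into one object. The `VisionConfig` class and `load_vision_config` function are hypothetical, and treating an unset variable as "use the built-in default" is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VisionConfig:
    base_url: Optional[str]  # None means: fall back to the service default
    api_key: Optional[str]
    model: Optional[str]


def load_vision_config(env: Optional[Mapping[str, str]] = None) -> VisionConfig:
    """Read the three VISION_* variables; empty values count as unset."""
    env = os.environ if env is None else env
    return VisionConfig(
        base_url=env.get("VISION_BASE_URL") or None,
        api_key=env.get("VISION_API_KEY") or None,
        model=env.get("VISION_MODEL") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.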

Office Document Extraction

DOCX, ODT, EPUB, and HTML files are converted using Pandoc:

  • Output format is GitHub-Flavored Markdown
  • Embedded images are extracted to a media directory
  • Image references in the markdown are rewritten to relative paths
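
A conversion along these lines could be invoked from Python as shown below. The exact command the sidecar runs is not documented here, so this is only a sketch: the function names and file paths are hypothetical, while `--to=gfm`, `--extract-media`, and `--output` are standard Pandoc options.

```python
import subprocess
from pathlib import Path


def pandoc_command(source: Path, output_md: Path, media_dir: Path) -> list:
    """Build a Pandoc invocation that emits GitHub-Flavored Markdown."""
    return [
        "pandoc",
        str(source),                       # input format is inferred from the extension
        "--to=gfm",                        # GitHub-Flavored Markdown output
        f"--extract-media={media_dir}",    # write embedded images to this directory
        f"--output={output_md}",
    ]


def convert(source: Path, output_md: Path, media_dir: Path) -> None:
    # Requires the pandoc binary on PATH; raises CalledProcessError on failure.
    subprocess.run(pandoc_command(source, output_md, media_dir), check=True)
```

For example, `convert(Path("report.docx"), Path("report.md"), Path("media"))` would write the markdown next to a `media` directory holding the extracted images. Any further rewriting of image references is not shown.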

Images as Knowledge

Images and illustrations are not discarded during parsing — they are treated as knowledge carriers:

  • Each extracted image is classified with a media type: Diagram, Chart, Table, Formula, or Image.
  • The vision model generates a detailed alt-text description (2-4 sentences) explaining what the image shows, what concepts it conveys, and any visible text or labels.
  • Image references are placed at their exact position in the markdown output, so a diagram on page 12 appears inline where it belongs — not in a disconnected image dump.
  • The alt-text descriptions flow into the classification pipeline alongside text content, so images produce their own knowledge units in the graph, marked as media.
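
One way to picture an extracted image is as a small record that renders to an inline markdown reference. The `ExtractedImage` class below is a hypothetical shape for illustration; only the five media types come from the list above.

```python
from dataclasses import dataclass

MEDIA_TYPES = {"Diagram", "Chart", "Table", "Formula", "Image"}


@dataclass
class ExtractedImage:
    path: str        # relative path to the extracted image file
    media_type: str  # one of MEDIA_TYPES
    alt_text: str    # description produced by the vision model

    def __post_init__(self) -> None:
        if self.media_type not in MEDIA_TYPES:
            raise ValueError(f"unknown media type: {self.media_type}")

    def to_markdown(self) -> str:
        # Collapse whitespace and escape brackets so the alt text stays
        # on one line and does not break the image syntax.
        alt = " ".join(self.alt_text.split())
        alt = alt.replace("[", "\\[").replace("]", "\\]")
        return f"![{alt}]({self.path})"
```

Because the description lives in the alt text, it travels with the image reference and is available to any step that reads the markdown.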

This means a chart showing quarterly revenue trends or an architecture diagram with component labels becomes searchable, classifiable knowledge — not a binary blob that agents can't reason about.

Workflow Enrichment

When workflows produce results, the enrichment queue decides what to capture:

  1. Tool results from agents are evaluated — did the tool return substantive content?
  2. Web content captured by browsing tools goes through the full Curator path (filter + extract + route).
  3. Structured output (JSON, data tables) takes the Direct path — straight to classification.
  4. Status messages and debug output are filtered out automatically.
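
The four rules above amount to a triage step. The sketch below is one possible reading, not the queue's actual logic: the `triage` function and its flags are hypothetical, "structured" is approximated as "parses as a JSON object or array", and sending other tool output to the Curator is an assumption based on the three-path table.

```python
import json


def looks_structured(output: str) -> bool:
    """True when the output parses as a JSON object or array."""
    try:
        parsed = json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, (dict, list))


def triage(output: str, from_browsing_tool: bool, is_status_or_debug: bool) -> str:
    """Return 'skip', 'curator', or 'direct' for a workflow result."""
    if is_status_or_debug:
        return "skip"      # status messages and debug output are dropped
    if from_browsing_tool:
        return "curator"   # web content takes the full Curator path
    if looks_structured(output):
        return "direct"    # structured output goes straight to classification
    return "curator"       # assumption: remaining tool output is evaluated by the Curator
```

How the real queue recognizes status or debug output is not described, so the example takes it as an input flag.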

The enrichment queue runs as a background daemon — fire-and-forget, no impact on workflow execution speed.

Real-Time Feedback

The Curator and classification pipeline emit WebSocket events so the project page updates in real time:

| Event | Meaning |
| --- | --- |
| knowledge_enrichment_skipped | Content was filtered out (not worth keeping) |
| knowledge_curator_discard | Curator explicitly rejected the content |
| knowledge_folder_proposed | Curator suggested a folder path |
| knowledge_enriched | Content was classified and added to the graph |
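
A client could map incoming events to user-facing messages as sketched below. The message shape is an assumption: the example supposes each WebSocket message is a JSON object with an `event` field, which the documentation above does not specify. Only the four event names come from the table.

```python
import json

EVENT_MEANINGS = {
    "knowledge_enrichment_skipped": "Content was filtered out (not worth keeping)",
    "knowledge_curator_discard": "Curator explicitly rejected the content",
    "knowledge_folder_proposed": "Curator suggested a folder path",
    "knowledge_enriched": "Content was classified and added to the graph",
}


def describe_event(raw_message: str) -> str:
    """Turn a raw message into a short status line, ignoring unknown events."""
    message = json.loads(raw_message)
    name = message.get("event")  # hypothetical field name
    meaning = EVENT_MEANINGS.get(name)
    if meaning is None:
        return f"ignored: {name}"
    return f"{name}: {meaning}"
```

Ignoring unknown event names keeps the client tolerant of events added later.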

Configuration

The Curator uses a lightweight, fast LLM (Gemini Flash by default) to keep evaluation cost low and latency under 3 seconds. The classifier uses a more capable model for accurate ontological classification.

| Component | Default Model | Purpose |
| --- | --- | --- |
| Curator | Gemini Flash | Fast content evaluation and routing |
| Classifier | Claude Sonnet | Accurate ontological classification |
| Embeddings | text-embedding-3-small | Vector representations for similarity and relation discovery |