# Classification Pipeline
The Classification Pipeline is the engine that transforms unstructured text into typed, connected knowledge graph nodes. It runs a three-pass LLM pipeline that classifies content according to the Web-Didaktik ontology, then extracts semantic relations between the resulting knowledge units.
## Pipeline Overview
```
Source content (document, article, workflow result)
              ↓
┌─────────────────────────────┐
│ Pass 1: Base Classification │ → Orientation | Explanation | Action | Reference
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ Pass 2: Sub-Classification  │ → Full type path (e.g., Explanation:What:Definition:Term)
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ Flush: Embed & Store        │ → Knowledge units + concept containers in graph
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ Pass 3: Relation Extraction │ → Typed edges between units (CauseOf, Specializes, ...)
└─────────────────────────────┘
```
## Pass 1: Base Classification
The first pass reads the entire content and splits it into classifiable paragraphs — text blocks of at least 40 characters (images, whitespace, and trivially short fragments are skipped).
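The splitting step can be sketched as a small filter over blank-line-separated blocks. This is a minimal sketch: the function name and the markdown-image check are assumptions; only the 40-character threshold comes from the pipeline description.

```python
def split_classifiable(text: str, min_len: int = 40) -> list[str]:
    """Split raw text into classifiable paragraphs.

    Blocks shorter than min_len characters are skipped, per the
    pipeline's 40-character threshold; image references and pure
    whitespace are also dropped.
    """
    paragraphs = []
    for block in text.split("\n\n"):
        block = block.strip()
        if len(block) < min_len:
            continue  # whitespace or trivially short fragment
        if block.startswith("!["):
            continue  # markdown image, not classifiable text
        paragraphs.append(block)
    return paragraphs
```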
Each paragraph is assigned to one of the four base classes:
- Orientation — if it helps navigate the domain without yet explaining or instructing
- Explanation — if it provides reasons, definitions, or understanding
- Action — if it describes procedures, rules, or how to do something
- Reference — if it points to external sources or archives
The classifier uses a step-by-step exclusion approach: first test for Reference (the most distinctive), then Action, then distinguish Orientation from Explanation. This hierarchy minimizes misclassification.
**Parallelization:** Content is batched by character count and classified in parallel across multiple LLM workers. Context overlap ensures each batch includes the previous and next paragraph for coherence.
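A minimal sketch of that batching, assuming a hypothetical character budget (`max_chars`, not a documented value) and attaching one neighbouring paragraph on each side as read-only context:

```python
def batch_with_overlap(paragraphs: list[str], max_chars: int = 4000) -> list[dict]:
    """Group paragraphs into batches by character count, attaching the
    neighbouring paragraph on each side as context for coherence."""
    batches, current, size = [], [], 0
    for i, para in enumerate(paragraphs):
        if current and size + len(para) > max_chars:
            batches.append(current)       # budget exceeded: start a new batch
            current, size = [], 0
        current.append(i)
        size += len(para)
    if current:
        batches.append(current)

    result = []
    for idx in batches:
        first, last = idx[0], idx[-1]
        result.append({
            "context_before": paragraphs[first - 1] if first > 0 else None,
            "items": [paragraphs[i] for i in idx],
            "context_after": paragraphs[last + 1] if last + 1 < len(paragraphs) else None,
        })
    return result
```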
## Pass 2: Sub-Classification
Once base classes are assigned, each group is refined into the full Web-Didaktik type path. The pipeline uses specialized prompts per base class — a different sub-classification prompt for Orientation, Explanation, Action, and Reference.
For example, an Explanation unit might be refined to:
- `Explanation:Why:Causal` — if it describes cause-effect
- `Explanation:What:Definition:Term` — if it defines a concept
- `Explanation:Case:Counterexample` — if it refutes by contrary case
Sub-classification runs in parallel per base class — all Orientation units classify simultaneously with all Explanation units, and so on.
**Output:** Each paragraph now has a full knowledge type path, a title, and its original content.
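The per-base-class fan-out could look like this. This is a sketch: `sub_classify_all`, the unit dict shape, and the thread-pool approach are illustrative, not the pipeline's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_classify_all(units: list[dict], classify_group) -> dict:
    """Run sub-classification in parallel per base class.

    `units` are dicts with a "base" key; `classify_group` is the
    caller-supplied function (hypothetical here) that refines one
    group with its class-specific prompt.
    """
    groups: dict[str, list[dict]] = {}
    for u in units:
        groups.setdefault(u["base"], []).append(u)

    # One worker per base class: all Orientation units classify
    # simultaneously with all Explanation units, and so on.
    with ThreadPoolExecutor(max_workers=len(groups) or 1) as pool:
        futures = {base: pool.submit(classify_group, base, g)
                   for base, g in groups.items()}
        return {base: f.result() for base, f in futures.items()}
```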
## Flush: Embed and Store
Before extracting relations, all classified units are stored in the knowledge graph:
- Batch embedding — all unit contents are embedded in a single API call (1536-dimensional vectors via `text-embedding-3-small`).
- Graph storage — each unit is created as a `KnowledgeUnit` node in Apache AGE with its full metadata (type, class, title, content, source, confidence, content hash).
- Concept container — a semantic domain container is found or created (e.g., "PostgreSQL", "Machine Learning"), and all units are linked to it via `BelongsTo` relations.
- Deduplication — content hashes (SHA-256) prevent duplicate units from being stored.
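The hash-based deduplication step might be sketched as follows. The names are illustrative; only SHA-256 hashing of unit content comes from the pipeline description, and node creation in Apache AGE is out of scope here.

```python
import hashlib

def flush_units(units: list[dict], existing_hashes: set[str]) -> list[dict]:
    """Prepare classified units for graph storage, skipping exact duplicates.

    Each stored unit carries a SHA-256 content hash, matching the
    pipeline's deduplication key.
    """
    stored = []
    for u in units:
        digest = hashlib.sha256(u["content"].encode("utf-8")).hexdigest()
        if digest in existing_hashes:
            continue  # exact duplicate, already in the graph
        existing_hashes.add(digest)
        stored.append({**u, "content_hash": digest})
    return stored
```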
## Pass 3: Relation Extraction
The final pass discovers typed semantic relations between the knowledge units:
- Similarity matrix — cosine similarity is computed across all unit embeddings.
- Candidate pairs — the top-K most similar pairs above a threshold (default: 0.5) are selected.
- LLM relation typing — candidate pairs are sent to the LLM with the full relation taxonomy. The LLM determines which specific relation type connects the pair (e.g., `CauseOf`, `Specializes`, `BasisFor`), or determines that no meaningful relation exists.
- Relation normalization — LLM output (which may use natural language aliases like "EXAMPLE_OF" or "EXPLAINS") is mapped to canonical Meder relation types (e.g., `Specializes`, `BasisFor`).
- Edge creation — typed edges are created in the graph with a confidence score and reason.
**Embedding reuse:** Pass 3 reuses the embeddings computed during the Flush phase — no re-embedding is needed.
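Candidate-pair selection from the reused embeddings can be sketched with NumPy. A minimal illustration: the 0.5 threshold is the pipeline's documented default, while the `top_k` default and function name are assumptions.

```python
import numpy as np

def candidate_pairs(embeddings: np.ndarray, top_k: int = 50,
                    threshold: float = 0.5) -> list[tuple[int, int, float]]:
    """Select the top-K most similar unit pairs above a similarity threshold.

    embeddings: (n, d) array of unit vectors, reused from the Flush phase.
    Returns (i, j, similarity) tuples, most similar first.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # full cosine-similarity matrix
    pairs = []
    n = sim.shape[0]
    for i in range(n):
        for j in range(i + 1, n):            # upper triangle: each pair once
            if sim[i, j] >= threshold:
                pairs.append((i, j, float(sim[i, j])))
    pairs.sort(key=lambda p: -p[2])          # most similar first
    return pairs[:top_k]
```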
## Relation Type Normalization
LLMs don't always produce exact ontology terms. The pipeline maintains an alias mapping of 80+ natural language expressions to canonical Meder relation types:
| LLM Output | Canonical Type |
|---|---|
| `EXAMPLE_OF`, `ILLUSTRATES`, `INSTANCE_OF` | `Specializes` |
| `EXPLAINS`, `JUSTIFIES`, `SUPPORTS` | `BasisFor` |
| `CAUSED_BY`, `LEADS_TO`, `RESULTS_IN` | `CauseOf` |
| `USED_FOR`, `ENABLES`, `FACILITATES` | `PurposeOf` |
| `PART_OF`, `CONTAINED_IN`, `INCLUDED_IN` | `PartOf` |
| `CONTRASTS_WITH`, `DIFFERS_FROM` | `Opposite` |
| `SIMILAR_TO`, `RESEMBLES`, `COMPARABLE_TO` | `Similar` |
| `DEPENDS_ON`, `REQUIRES`, `NEEDS` | `DeterminedBy` |
This normalization ensures consistent graph edges regardless of which LLM model is used or how it phrases the relationship.
## Performance
Typical throughput for a 100-paragraph document:
| Phase | Duration |
|---|---|
| Pass 1 (Base Classification) | 10–30 seconds |
| Pass 2 (Sub-Classification) | 20–60 seconds |
| Flush (Embed + Store) | 5–15 seconds |
| Pass 3 (Relation Extraction) | 30–120 seconds |
| Total | 1–4 minutes |
Performance scales with parallelization — the pipeline runs up to 10 concurrent LLM workers by default.
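Bounding concurrency at 10 workers can be sketched with an `asyncio.Semaphore`. This is illustrative only; the pipeline's actual worker mechanism is not specified here, and `classify_one` is a hypothetical caller-supplied coroutine.

```python
import asyncio

async def classify_concurrently(batches: list, classify_one,
                                max_workers: int = 10) -> list:
    """Run LLM classification calls with at most max_workers in flight,
    mirroring the pipeline's default of 10 concurrent workers."""
    sem = asyncio.Semaphore(max_workers)

    async def bounded(batch):
        async with sem:                 # blocks while 10 calls are in flight
            return await classify_one(batch)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(b) for b in batches))
```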
## Deduplication
The pipeline prevents duplicate knowledge at multiple levels:
- Content hash — SHA-256 of content prevents exact duplicate units.
- Source-level purge — re-classifying a document first removes all units from the previous classification.
- Edge deduplication — duplicate relations (same source, target, and type) are collapsed.
- Bulk deduplication — a maintenance operation can scan and merge near-duplicates across the entire graph.
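The edge-level step above can be sketched as a key-based collapse over (source, target, type). The function is illustrative and assumes edges are pre-sorted so the preferred copy (e.g., highest confidence) comes first:

```python
def dedupe_edges(edges: list[dict]) -> list[dict]:
    """Collapse duplicate relations: same source, target, and type.

    Keeps the first occurrence of each key (highest-confidence copy,
    if the input is pre-sorted by confidence descending).
    """
    seen, unique = set(), []
    for e in edges:
        key = (e["source"], e["target"], e["type"])
        if key in seen:
            continue  # duplicate relation, drop it
        seen.add(key)
        unique.append(e)
    return unique
```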
## Triggering Classification
Classification runs automatically when:
- A document is uploaded to a project with Long-Term Memory enabled.
- A workflow result is flagged as knowledge-worthy by the enrichment queue.
- An agent uses the `StoreKnowledge` tool explicitly.
- A user clicks **Classify** on a document in the document library.
The pipeline reports progress in real time via WebSocket events, so you can watch classification happen on the project page.