Classification Pipeline

The Classification Pipeline is the engine that transforms unstructured text into typed, connected knowledge graph nodes. It runs a three-pass LLM pipeline that classifies content according to the Web-Didaktik ontology, then extracts semantic relations between the resulting knowledge units.

Pipeline Overview

Source content (document, article, workflow result)
                ↓
┌─────────────────────────────┐
│ Pass 1: Base Classification │ → Orientation | Explanation | Action | Reference
└─────────────────────────────┘
                ↓
┌─────────────────────────────┐
│ Pass 2: Sub-Classification  │ → Full type path (e.g., Explanation:What:Definition:Term)
└─────────────────────────────┘
                ↓
┌─────────────────────────────┐
│ Flush: Embed & Store        │ → Knowledge units + concept containers in graph
└─────────────────────────────┘
                ↓
┌─────────────────────────────┐
│ Pass 3: Relation Extraction │ → Typed edges between units (CauseOf, Specializes, ...)
└─────────────────────────────┘

Pass 1: Base Classification

The first pass reads the entire content and splits it into classifiable paragraphs — text blocks of at least 40 characters (images, whitespace, and trivially short fragments are skipped).
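The splitting step can be sketched as a simple filter. This is an illustrative implementation, not the pipeline's actual code: the 40-character threshold comes from the text above, while the blank-line split and the Markdown image pattern are assumptions.

```python
import re

MIN_PARAGRAPH_CHARS = 40  # documented threshold: shorter blocks are skipped

def split_classifiable_paragraphs(text: str) -> list[str]:
    """Split raw content into blocks worth classifying.

    Drops whitespace-only blocks, image-only blocks (assumed to use
    Markdown syntax), and trivially short fragments.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text)]
    paragraphs = []
    for block in blocks:
        if not block:
            continue  # whitespace-only
        if re.fullmatch(r"!\[[^\]]*\]\([^)]*\)", block):
            continue  # image-only block (assumed Markdown syntax)
        if len(block) < MIN_PARAGRAPH_CHARS:
            continue  # trivially short fragment
        paragraphs.append(block)
    return paragraphs
```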

Each paragraph is assigned to one of the four base classes:

  • Orientation — if it helps navigate the domain without yet explaining or instructing
  • Explanation — if it provides reasons, definitions, or understanding
  • Action — if it describes procedures, rules, or how to do something
  • Reference — if it points to external sources or archives

The classifier uses a step-by-step exclusion approach: first test for Reference (the most distinctive), then Action, then distinguish Orientation from Explanation. This hierarchy minimizes misclassification.
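The exclusion hierarchy can be sketched as a cascade of yes/no checks. Here `ask_llm` is a hypothetical helper wrapping a boolean LLM call, and the question strings are illustrative, not the pipeline's real prompts:

```python
def classify_base(paragraph: str, ask_llm) -> str:
    """Step-by-step exclusion: test the most distinctive class first.

    `ask_llm(question, paragraph) -> bool` is a hypothetical helper
    that wraps a yes/no LLM call.
    """
    if ask_llm("Does this mainly point to external sources or archives?", paragraph):
        return "Reference"
    if ask_llm("Does this describe a procedure, rule, or how to do something?", paragraph):
        return "Action"
    if ask_llm("Does this provide reasons, definitions, or understanding?", paragraph):
        return "Explanation"
    return "Orientation"  # navigational content is the remaining case
```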

Parallelization: Content is batched by character count and classified in parallel across multiple LLM workers. Context overlap ensures each batch includes the previous and next paragraph for coherence.
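A minimal sketch of this batching scheme, assuming a character budget per batch (the 200-character budget below is illustrative, not the pipeline's actual setting):

```python
def batch_with_overlap(paragraphs: list[str], max_chars: int = 200) -> list[list[str]]:
    """Group paragraphs into batches bounded by character count, then
    extend each batch with its neighbouring paragraphs for context."""
    batches, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(p)
        size += len(p)
    if current:
        batches.append(current)

    # Context overlap: prepend the last paragraph of the previous batch
    # and append the first paragraph of the next one.
    overlapped = []
    for i, batch in enumerate(batches):
        ctx = list(batch)
        if i > 0:
            ctx.insert(0, batches[i - 1][-1])
        if i < len(batches) - 1:
            ctx.append(batches[i + 1][0])
        overlapped.append(ctx)
    return overlapped
```

The overlap paragraphs are passed along only as context; each worker classifies just its own batch's paragraphs.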

Base classification examples

Pass 2: Sub-Classification

Once base classes are assigned, each group is refined into the full Web-Didaktik type path. The pipeline uses specialized prompts per base class — a different sub-classification prompt for Orientation, Explanation, Action, and Reference.

For example, an Explanation unit might be refined to:

  • Explanation:Why:Causal — if it describes cause-effect
  • Explanation:What:Definition:Term — if it defines a concept
  • Explanation:Case:Counterexample — if it refutes by contrary case

Sub-classification runs in parallel across base classes — all Orientation units are refined at the same time as all Explanation units, and so on.

Output: Each paragraph now has a full knowledge type path, a title, and its original content.
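The per-class dispatch might look like the following sketch. The dict shape of a unit, the template names, and the template texts are all assumptions for illustration:

```python
from collections import defaultdict

# Hypothetical per-class prompt templates; texts are illustrative only.
SUB_CLASSIFICATION_PROMPTS = {
    "Orientation": "Refine this Orientation unit into its Web-Didaktik sub-type: {content}",
    "Explanation": "Refine this Explanation unit (Why/What/Case, ...): {content}",
    "Action": "Refine this Action unit into its sub-type: {content}",
    "Reference": "Refine this Reference unit into its sub-type: {content}",
}

def group_by_base_class(units: list[dict]) -> dict[str, list[dict]]:
    """Group classified units so each base class can be refined in
    parallel with its own specialized prompt."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit["base_class"]].append(unit)
    return dict(groups)

def sub_classification_prompt(unit: dict) -> str:
    """Select and fill the specialized prompt for a unit's base class."""
    return SUB_CLASSIFICATION_PROMPTS[unit["base_class"]].format(content=unit["content"])
```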

Flush: Embed and Store

Before extracting relations, all classified units are stored in the knowledge graph:

  1. Batch embedding — all unit contents are embedded in a single API call (1536-dimensional vectors via text-embedding-3-small).
  2. Graph storage — each unit is created as a KnowledgeUnit node in Apache AGE with its full metadata (type, class, title, content, source, confidence, content hash).
  3. Concept container — a semantic domain container is found or created (e.g., "PostgreSQL", "Machine Learning"), and all units are linked to it via BelongsTo relations.
  4. Deduplication — content hashes (SHA-256) prevent duplicate units from being stored.
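The hash-based deduplication in the flush step can be sketched as follows. `graph.create_knowledge_unit` is a hypothetical storage call standing in for the Apache AGE write:

```python
import hashlib

def content_hash(content: str) -> str:
    """SHA-256 of the unit content, used to skip exact duplicates."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def flush_units(units: list[dict], graph, existing_hashes: set[str]) -> list[dict]:
    """Store classified units, skipping content-hash duplicates.

    `graph.create_knowledge_unit` is a hypothetical stand-in for the
    real pipeline's KnowledgeUnit write to Apache AGE.
    """
    stored = []
    for unit in units:
        h = content_hash(unit["content"])
        if h in existing_hashes:
            continue  # exact duplicate, already in the graph
        existing_hashes.add(h)
        graph.create_knowledge_unit({**unit, "content_hash": h})
        stored.append(unit)
    return stored
```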

Pass 3: Relation Extraction

The final pass discovers typed semantic relations between the knowledge units:

  1. Similarity matrix — cosine similarity is computed across all unit embeddings.
  2. Candidate pairs — the top-K most similar pairs above a threshold (default: 0.5) are selected.
  3. LLM relation typing — candidate pairs are sent to the LLM with the full relation taxonomy. The LLM determines which specific relation type connects the pair (e.g., CauseOf, Specializes, BasisFor), or determines that no meaningful relation exists.
  4. Relation normalization — LLM output (which may use natural language aliases like "EXAMPLE_OF" or "EXPLAINS") is mapped to canonical Meder relation types (e.g., Specializes, BasisFor).
  5. Edge creation — typed edges are created in the graph with a confidence score and reason.
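Steps 1 and 2 amount to a thresholded top-K search over pairwise cosine similarities. A pure-Python sketch (the real pipeline may well vectorize this; the `top_k` default of 50 is illustrative, only the 0.5 threshold comes from the text above):

```python
import math

def top_candidate_pairs(embeddings: list[list[float]],
                        threshold: float = 0.5,
                        top_k: int = 50) -> list[tuple[int, int, float]]:
    """Select the top-K most similar unit pairs above the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    pairs.sort(key=lambda p: p[2], reverse=True)  # most similar first
    return pairs[:top_k]
```

Only the surviving pairs are sent to the LLM for relation typing, which keeps the number of LLM calls far below the quadratic number of possible pairs.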

Embedding reuse: Pass 3 reuses the embeddings computed during the Flush phase — no re-embedding is needed.

Relation extraction diagram

Relation Type Normalization

LLMs don't always produce exact ontology terms. The pipeline maintains an alias mapping of 80+ natural language expressions to canonical Meder relation types:

  LLM Output                              Canonical Type
  EXAMPLE_OF, ILLUSTRATES, INSTANCE_OF    Specializes
  EXPLAINS, JUSTIFIES, SUPPORTS           BasisFor
  CAUSED_BY, LEADS_TO, RESULTS_IN         CauseOf
  USED_FOR, ENABLES, FACILITATES          PurposeOf
  PART_OF, CONTAINED_IN, INCLUDED_IN      PartOf
  CONTRASTS_WITH, DIFFERS_FROM            Opposite
  SIMILAR_TO, RESEMBLES, COMPARABLE_TO    Similar
  DEPENDS_ON, REQUIRES, NEEDS             DeterminedBy
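The normalization is essentially a lookup over an alias table. The sketch below includes only a few rows of the documented mapping (the real one has 80+ entries), and the return-None-on-no-match policy is an assumption:

```python
# A few rows of the alias table; the real mapping has 80+ entries.
RELATION_ALIASES = {
    "EXAMPLE_OF": "Specializes", "ILLUSTRATES": "Specializes", "INSTANCE_OF": "Specializes",
    "EXPLAINS": "BasisFor", "JUSTIFIES": "BasisFor", "SUPPORTS": "BasisFor",
    "CAUSED_BY": "CauseOf", "LEADS_TO": "CauseOf", "RESULTS_IN": "CauseOf",
    "DEPENDS_ON": "DeterminedBy", "REQUIRES": "DeterminedBy", "NEEDS": "DeterminedBy",
}

CANONICAL_TYPES = {"Specializes", "BasisFor", "CauseOf", "PurposeOf",
                   "PartOf", "Opposite", "Similar", "DeterminedBy"}

def normalize_relation(llm_output: str):
    """Map a raw LLM relation name to a canonical Meder type.

    Returns None when nothing matches, so the caller can drop the
    candidate edge (an assumed policy, not confirmed by the docs).
    """
    key = llm_output.strip().upper().replace(" ", "_")
    if key in RELATION_ALIASES:
        return RELATION_ALIASES[key]
    # The LLM may already emit a canonical type; compare case-insensitively.
    for canonical in CANONICAL_TYPES:
        if key == canonical.upper():
            return canonical
    return None
```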

This normalization ensures consistent graph edges regardless of which LLM model is used or how it phrases the relationship.

Performance

Typical throughput for a 100-paragraph document:

  Phase                          Duration
  Pass 1 (Base Classification)   10–30 seconds
  Pass 2 (Sub-Classification)    20–60 seconds
  Flush (Embed + Store)          5–15 seconds
  Pass 3 (Relation Extraction)   30–120 seconds
  Total                          1–4 minutes

Performance scales with parallelization — the pipeline runs up to 10 concurrent LLM workers by default.
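Capping concurrent LLM calls can be sketched with a semaphore. The 10-worker default is from the text above; `classify_one` is a hypothetical async LLM call:

```python
import asyncio

async def classify_all(paragraphs, classify_one, max_workers: int = 10):
    """Run classification calls concurrently, capped at `max_workers`
    in-flight LLM requests. `classify_one` is a hypothetical async
    function wrapping one LLM call."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(p):
        async with semaphore:  # at most max_workers run at once
            return await classify_one(p)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(p) for p in paragraphs))
```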

Deduplication

The pipeline prevents duplicate knowledge at multiple levels:

  • Content hash — SHA-256 of content prevents exact duplicate units.
  • Source-level purge — re-classifying a document first removes all units from the previous classification.
  • Edge deduplication — duplicate relations (same source, target, and type) are collapsed.
  • Bulk deduplication — a maintenance operation can scan and merge near-duplicates across the entire graph.
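Edge deduplication reduces to keeping one edge per (source, target, type) key. A sketch, with an assumed dict shape for edges:

```python
def dedupe_edges(edges: list[dict]) -> list[dict]:
    """Collapse duplicate relations: same source, target, and type.

    Keeps the first occurrence — e.g. the highest-confidence edge if
    the input is pre-sorted (an assumed convention)."""
    seen, unique = set(), []
    for edge in edges:
        key = (edge["source"], edge["target"], edge["type"])
        if key in seen:
            continue  # duplicate relation, drop it
        seen.add(key)
        unique.append(edge)
    return unique
```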

Triggering Classification

Classification runs automatically when:

  • A document is uploaded to a project with Long-Term Memory enabled.
  • A workflow result is flagged as knowledge-worthy by the enrichment queue.
  • An agent uses the StoreKnowledge tool explicitly.
  • A user clicks Classify on a document in the document library.

The pipeline reports progress in real time via WebSocket events, so you can watch classification happen on the project page.