The full pipeline: from document upload to intelligent answers, knowledge graphs, and interactive wikis.
What you need for a smooth experience at different scales.
Every document goes through a 6-stage pipeline. Each stage transforms raw content into structured, queryable intelligence.
graph LR
A["📤 Upload<br/>(PDF / MP4 / YT / EPUB)"]
B["✂️ Chunker<br/>(LangChain + Whisper)"]
C["🧠 Embedder<br/>(pgvector 1024d)"]
D["📊 PostgreSQL 16 + pgvector"]
E1["🔍 RAG Query<br/>(hybrid search)"]
E2["📖 Wiki<br/>(concept articles)"]
E3["🧩 Reasoning<br/>(BFS traversal)"]
A --> B
B --> C
C --> D
D --> E1
D --> E2
D --> E3
style A fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
style B fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
style C fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
style D fill:#14141f,stroke:#4ade80,color:#e0e0f0
style E1 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
style E2 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
style E3 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
A file or YouTube URL is received, saved, and queued for background processing via Celery + Redis.
The Flask endpoint saves the file with a UUID-prefixed name (no collisions), creates a Document row with status='pending', and enqueues a Celery task that runs asynchronously.
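The upload step can be sketched with the standard library alone. This is a minimal illustration, not the actual MNEMOS endpoint: `queue.Queue` stands in for the Celery + Redis queue, a dict stands in for the Document table, and the names `store_upload`, `DOCUMENTS`, and `TASK_QUEUE` are hypothetical.

```python
import os
import queue
import uuid

DOCUMENTS = {}              # hypothetical stand-in for the Document table
TASK_QUEUE = queue.Queue()  # hypothetical stand-in for the Celery + Redis queue


def store_upload(original_name: str, data: bytes, upload_dir: str) -> str:
    """Save a file under a UUID-prefixed name and queue it for processing."""
    safe_name = os.path.basename(original_name)      # drop any client-supplied path
    stored_name = f"{uuid.uuid4().hex}_{safe_name}"  # unique prefix avoids collisions
    os.makedirs(upload_dir, exist_ok=True)
    with open(os.path.join(upload_dir, stored_name), "wb") as fh:
        fh.write(data)
    DOCUMENTS[stored_name] = {"filename": safe_name, "status": "pending"}
    TASK_QUEUE.put(stored_name)                      # a worker would pick this up later
    return stored_name
```

Because the prefix is random, uploading the same file name twice yields two distinct stored files, and the caller gets a response as soon as the task is queued.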
PDF / EPUB
Audio / Video / YouTube
langdetect samples the first chunks to detect the document language. This sets the PostgreSQL full-text search config (english, spanish, etc.), enabling proper stemming per language.
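One way to turn a detected language into a full-text search config is a small lookup table. The sketch below assumes the detector returns ISO 639-1 codes such as `en` or `es`; the table contents and the helper name `fts_config_for` are illustrative, and unknown languages fall back to PostgreSQL's `simple` config, which does no stemming.

```python
# ISO 639-1 language codes mapped to PostgreSQL text search configurations.
# The selection of languages here is an illustrative assumption.
FTS_CONFIGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "it": "italian",
    "pt": "portuguese",
}


def fts_config_for(lang_code: str) -> str:
    """Return the text search config for a language code, or 'simple' if unknown."""
    return FTS_CONFIGS.get(lang_code.lower(), "simple")
```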
Why chunk? LLMs have context limits. Breaking documents into smaller pieces makes retrieval precise, efficient, and scalable.
Tool: RecursiveCharacterTextSplitter (LangChain)
Tool: Custom merge logic
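The idea behind recursive splitting can be shown in a few lines. This is a simplified illustration, not LangChain's implementation (which also supports chunk overlap, among other options): try the coarsest separator first, pack pieces up to a size limit, and fall back to finer separators only for pieces that are still too long. The function name and default separators are assumptions.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate            # keep packing into the current chunk
                continue
            if current:
                chunks.append(current)         # current chunk is full, flush it
            if len(part) > max_chars:
                # Piece is still too long: retry with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: cut at fixed positions as a last resort.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

For example, `recursive_split("aaa bbb ccc\n\nddd eee", 7)` keeps the paragraph break as a chunk boundary and only splits on spaces inside the first paragraph, because that paragraph alone exceeds the limit.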
Chunks become vectors that capture meaning. Similar chunks have similar vectors.
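"Similar vectors" is usually measured with cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors purely for illustration; real embeddings in this pipeline would have 1024 dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Toy vectors (made up): two chunks about cats, one about tax law.
cat_care = [0.9, 0.1, 0.0]
cat_food = [0.8, 0.2, 0.1]
tax_law = [0.0, 0.1, 0.9]
```

With these toy values, `cosine_similarity(cat_care, cat_food)` is much higher than `cosine_similarity(cat_care, tax_law)`, which is the property retrieval relies on.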
PostgreSQL 16 with pgvector stores vectors directly in the DB. HNSW indexes enable fast approximate nearest-neighbor search — finding the most similar chunks among millions in milliseconds.
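For orientation, the statements below show what storage and nearest-neighbor lookup typically look like with pgvector. The table name `chunks`, its columns, and the index name are hypothetical; only the `vector(1024)` type, the `hnsw` index method, the `vector_cosine_ops` operator class, and the `<=>` cosine-distance operator are pgvector features. The SQL is held in Python strings so it can be passed to whichever database driver is in use.

```python
# Hypothetical schema: one row per chunk with a 1024-dimensional embedding.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(1024)
);
"""

# HNSW index for approximate nearest-neighbor search by cosine distance.
CREATE_INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops);"
)

# <=> is pgvector's cosine distance operator; smaller means more similar.
NEAREST_SQL = (
    "SELECT id, content FROM chunks "
    "ORDER BY embedding <=> %s::vector LIMIT %s;"
)


def to_vector_literal(values):
    """Format a list of floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"
```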
When you ask a question, two searches run in parallel: a vector similarity search and a keyword (full-text) search.
Results merge via Reciprocal Rank Fusion (RRF), then MMR re-ranks for diversity. Neighbor-window expansion adds adjacent chunks for context continuity.
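The three post-processing steps can be sketched as small pure functions. This is an illustration under assumptions, not the MNEMOS code: `k=60` is a commonly used RRF constant, `lam` is the usual MMR trade-off weight, and the relevance scores and similarity function are supplied by the caller.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def mmr(candidates, relevance, similarity, top_n, lam=0.7):
    """Maximal Marginal Relevance: greedily pick relevant but non-redundant items."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < top_n:
        def score(c):
            # Penalize similarity to anything already selected.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def expand_neighbors(indices, total, window=1):
    """Add the chunks adjacent to each hit, clipped to the document bounds."""
    out = set()
    for i in indices:
        out.update(range(max(0, i - window), min(total, i + window + 1)))
    return sorted(out)
```

An item that ranks well in both lists beats one that tops only a single list, which is why RRF needs no score normalization between the two searches.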
Before the graph, the system builds a structured summary index — chapter-like sections with executive summaries.
Chunks are grouped into batches of 5. The batches are sent to the LLM in parallel, each with a request for a title, a summary, and key concepts with relevance scores.
All batch results are stitched: consecutive batches with the same title merge. Concepts are aggregated by frequency (top 20). A final LLM call writes an Executive Summary.
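The stitching step might look like the sketch below. The dictionary keys (`title`, `summary`, `concepts`) are assumed shapes for the per-batch LLM output, and the final Executive Summary call is left out because it depends on the LLM client.

```python
from collections import Counter


def stitch(batch_results, top_k=20):
    """Merge consecutive batches that share a title and rank concepts by frequency."""
    sections = []
    counts = Counter()
    for result in batch_results:
        counts.update(result["concepts"])
        if sections and sections[-1]["title"] == result["title"]:
            # Same title as the previous batch: extend that section.
            sections[-1]["summary"] += " " + result["summary"]
        else:
            sections.append({"title": result["title"], "summary": result["summary"]})
    top_concepts = [name for name, _ in counts.most_common(top_k)]
    return sections, top_concepts
```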
The LLM reads every chunk and extracts a structured knowledge graph: concepts, definitions, and relationships between them.
Each batch of 4 chunks goes to the LLM with a JSON schema. The LLM returns events (source → relation → target) and definitions. All batches run in parallel.
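A defensive parser for the LLM's reply could look like this. The field names `events`, `definitions`, `source`, `relation`, and `target` are assumptions based on the description above, not the actual schema; malformed events are skipped rather than raising.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    relation: str
    target: str


def parse_extraction(raw: str):
    """Parse one batch reply into validated events and a definitions mapping."""
    data = json.loads(raw)
    required = ("source", "relation", "target")
    events = [
        Event(e["source"], e["relation"], e["target"])
        for e in data.get("events", [])
        # Keep only events where all three fields are non-empty strings.
        if all(isinstance(e.get(k), str) and e[k].strip() for k in required)
    ]
    definitions = {
        name: text
        for name, text in data.get("definitions", {}).items()
        if isinstance(text, str)
    }
    return events, definitions
```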
Concepts are deduplicated and fuzzy-matched. Missing ones get embedded and inserted. HyperEdges (multi-way relationships) are created linking concepts to documents.
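Deduplication with fuzzy matching can be illustrated with the standard library's `difflib`. This is a stand-in technique: string similarity is used here in place of whatever matching MNEMOS actually performs, and the `0.9` cutoff and function names are assumptions.

```python
from difflib import get_close_matches


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def resolve_concepts(incoming, existing, cutoff=0.9):
    """Map each incoming name to a known concept, or register it as new."""
    known = {normalize(name): name for name in existing}
    resolved, new = {}, []
    for name in incoming:
        key = normalize(name)
        if key in known:                       # exact match after normalization
            resolved[name] = known[key]
            continue
        match = get_close_matches(key, list(known), n=1, cutoff=cutoff)
        if match:                              # near-duplicate of a known concept
            resolved[name] = known[match[0]]
        else:                                  # new concept: would be embedded and inserted
            known[key] = name
            new.append(name)
            resolved[name] = name
    return resolved, new
```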
Two systems consume the hypergraph: the Wiki (browsable articles) and the Reasoning Engine (graph traversal).
Each concept becomes a wiki article with a description, all relationships (with peer concepts), source document chunks, and related concepts. The wiki supports prefix search and vector-based fuzzy search.
The Reasoning Engine runs a breadth-first search (BFS) over HyperEdges from a start concept to a goal concept, applies an intersection filter, and can optionally make a Semantic Leap via vector similarity. It returns a narrative explanation plus Cytoscape graph JSON.
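The core traversal can be sketched as a plain breadth-first search in which two concepts are neighbors when they share a hyperedge. The data shape (edge id mapped to a set of concept names) is an assumption, and the intersection filter, Semantic Leap, and narrative generation are omitted.

```python
from collections import deque


def bfs_path(hyperedges, start, goal):
    """Shortest concept-to-concept path where each hop shares a hyperedge."""
    membership = {}  # concept -> ids of the hyperedges it belongs to
    for edge_id, members in hyperedges.items():
        for concept in members:
            membership.setdefault(concept, []).append(edge_id)
    if start not in membership or goal not in membership:
        return None
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        if current == goal:
            path = []
            while current is not None:       # walk parents back to the start
                path.append(current)
                current = parent[current]
            return path[::-1]
        for edge_id in membership[current]:
            for neighbor in sorted(hyperedges[edge_id]):
                if neighbor not in parent:   # not visited yet
                    parent[neighbor] = current
                    frontier.append(neighbor)
    return None                              # goal is not reachable
```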
When you ask a question, everything comes together.
graph TD
Q["❓ User Question"]
Q --> H["🔀 Hybrid Search<br/>(vector + keyword → RRF → MMR)"]
Q --> G["🧠 Graph-RAG<br/>(optional)"]
Q --> W["🌐 Web Search<br/>(optional)"]
H --> R["📄 Ranked Chunks<br/>+ source citations"]
G --> R
W --> R
R --> C["📦 Build Context"]
C --> M1["📑 Document + Section hierarchy"]
C --> M2["💬 Conversation history"]
C --> M3["🧠 User memories"]
C --> M4["🌍 Web results"]
M1 & M2 & M3 & M4 --> P["🤖 LLM<br/>(System Prompt + Context + Question + Images)"]
P --> S["✅ Response<br/>+ cited sources + saved conversation"]
style Q fill:#1a1a2a,stroke:#7c6af0,color:#fff,font-weight:bold
style H fill:#1a1a2a,stroke:#4ade80,color:#e0e0f0
style G fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
style W fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
style R fill:#14141f,stroke:#8888aa,color:#e0e0f0
style C fill:#14141f,stroke:#7c6af0,color:#fff,font-weight:bold
style P fill:#1a1a2a,stroke:#f0a06a,color:#fff,font-weight:bold
style S fill:#1a1a2a,stroke:#4ade80,color:#fff,font-weight:bold
Before the context is sent to the LLM, its token count is checked against the model's context window. If it exceeds the window, the lowest-ranked chunks are dropped until it fits. This prevents silent truncation.
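The trimming rule is simple enough to show directly. In this sketch the chunks are assumed to arrive best-first, and the default token counter is a crude whitespace word count standing in for the model's real tokenizer.

```python
def fit_to_budget(ranked_chunks, budget, count_tokens=lambda text: len(text.split())):
    """Drop lowest-ranked chunks until the total token count fits the budget."""
    kept = list(ranked_chunks)  # best-ranked first, so the worst chunk is last
    while kept and sum(count_tokens(c) for c in kept) > budget:
        kept.pop()              # discard the current lowest-ranked chunk
    return kept
```

Dropping whole chunks from the bottom of the ranking keeps every remaining chunk intact, instead of letting the model cut off the end of the prompt.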
Concrete scenarios you can solve with MNEMOS.
Each design decision solves a real problem.
Document processing takes seconds to minutes, so a blocking HTTP request would time out. Celery queues the work and frees the API. Redis doubles as broker and result backend.
Your data stays in PostgreSQL, so there is one less system to run: no ETL, no sync lag, no extra cost. pgvector + HNSW gets 95%+ of dedicated vector DB performance for datasets under 10M vectors.
Plain RAG is reactive: it finds chunks similar to your question. The hypergraph is proactive: it extracts knowledge structure before you ask. This enables browsing, traversal between distant concepts, and narrative synthesis.
MNEMOS runs entirely on your hardware with llama.cpp for inference and Sentence-Transformers for embeddings. No data leaves your network. But you can plug in GPT-4 or Claude when needed.
LLMs have context windows. A 300-page PDF doesn't fit. Map-Reduce processes small chunks in parallel, then aggregates. 100 chunks process in ~10 seconds with 5 workers.
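A minimal Map-Reduce skeleton with a thread pool is sketched below, along with the arithmetic behind a throughput estimate. The batch size of 5 and the 2.5 seconds per LLM call are assumptions chosen to show how a figure like "~10 seconds for 100 chunks with 5 workers" could arise; `map_fn` and `reduce_fn` are placeholders for the LLM calls.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def map_reduce(chunks, map_fn, reduce_fn, batch_size=5, workers=5):
    """Process batches in parallel (map), then combine the partial results (reduce)."""
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_fn, batches))  # results keep batch order
    return reduce_fn(partials)


def estimated_seconds(n_chunks, batch_size, workers, seconds_per_call):
    """Rough wall-clock estimate: calls run in waves of `workers` at a time."""
    calls = math.ceil(n_chunks / batch_size)
    waves = math.ceil(calls / workers)
    return waves * seconds_per_call
```

Under those assumptions, 100 chunks become 20 calls, which 5 workers finish in 4 waves, so `estimated_seconds(100, 5, 5, 2.5)` gives 10.0 seconds.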