How MNEMOS Works

The full pipeline: from document upload to intelligent answers, knowledge graphs, and interactive wikis.

💻 Hardware Recommendations

What you need for a smooth experience at different scales.

🟢 Minimum

  • CPU: 4 cores (AVX support)
  • RAM: 8 GB
  • Storage: 10 GB free
  • GPU: none (CPU mode)
  • Great for: testing, small docs, CPU-only

🟡 Recommended

  • CPU: 8+ cores
  • RAM: 16-32 GB
  • Storage: 50 GB SSD
  • GPU: NVIDIA 8GB+ VRAM (CUDA)
  • Great for: local LLMs, hypergraph, daily use

🟣 Pro / Heavy

  • CPU: 16+ cores
  • RAM: 64 GB
  • Storage: 100+ GB NVMe
  • GPU: NVIDIA 24GB+ VRAM (CUDA)
  • Great for: large models (70B+), heavy batch processing

🔵 Cloud LLM

  • No GPU needed
  • RAM: 8 GB minimum
  • Storage: 5 GB free
  • API keys: OpenAI / Anthropic / Groq
  • Great for: low-resource hardware, maximum quality

1. Pipeline Overview

Every document goes through a 6-stage pipeline. Each stage transforms raw content into structured, queryable intelligence.

1. Upload
2. Extract & Chunk (LLM)
3. Embed & Save (pgvector)
4. Summarize (Map-Reduce)
5. Hypergraph (LLM)
6. Wiki / RAG (Ready)
graph LR
    A["📤 Upload<br/>(PDF / MP4 / YT / EPUB)"]
    B["✂️ Chunker<br/>(LangChain + Whisper)"]
    C["🧠 Embedder<br/>(pgvector 1024d)"]
    D["📊 PostgreSQL 16 + pgvector"]
    E1["🔍 RAG Query<br/>(hybrid search)"]
    E2["📖 Wiki<br/>(concept articles)"]
    E3["🧩 Reasoning<br/>(BFS traversal)"]
    A --> B
    B --> C
    C --> D
    D --> E1
    D --> E2
    D --> E3
    style A fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style B fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
    style C fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
    style D fill:#14141f,stroke:#4ade80,color:#e0e0f0
    style E1 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style E2 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style E3 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0

2. Upload & Extract

A file or YouTube URL is received, saved, and queued for background processing via Celery + Redis.

📤 Upload API: POST /api/documents/upload

The Flask endpoint saves the file with a UUID-prefixed name (no collisions), creates a Document row with status='pending', and enqueues a Celery task that runs asynchronously.
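
As a rough illustration of the naming step described above, the following sketch shows one way to build a UUID-prefixed filename and a pending row. The function names and the dictionary shape are hypothetical, not taken from the MNEMOS codebase.

```python
import uuid
from pathlib import Path


def uuid_prefixed_name(original_name: str) -> str:
    """Return '<uuid4 hex>_<basename>' so two uploads of the same file never collide."""
    # Keep only the final path component so a name such as
    # '../../etc/passwd' cannot escape the upload directory.
    base = Path(original_name).name
    return f"{uuid.uuid4().hex}_{base}"


def new_document_row(original_name: str) -> dict:
    """Hypothetical stand-in for the Document row created at upload time."""
    return {
        "id": str(uuid.uuid4()),
        "filename": uuid_prefixed_name(original_name),
        "status": "pending",  # initial status, as described in the text
    }
```

In a real deployment the row would be written to PostgreSQL and its id handed to the Celery task; this sketch only covers the naming and initial status.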

🔄 Extraction by File Type

PDF / EPUB

  • PyMuPDF extracts text page by page
  • Metadata (author, title) saved to Document
  • Each page is chunked via RecursiveCharacterTextSplitter

Audio / Video / YouTube

  • YouTube: yt-dlp downloads audio, saves metadata
  • Whisper transcribes speech into timed segments
  • Chunker merges small segments into larger chunks

🌍 Language Detection

langdetect samples the first chunks to detect the document language. This sets the PostgreSQL full-text search config (english, spanish, etc.), enabling proper stemming per language.
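
The mapping from a detected language code to a PostgreSQL text search configuration could look like the sketch below. The table contents and the fallback to `simple` (which applies no stemming) are assumptions for illustration; the source does not say which languages MNEMOS maps or what it falls back to.

```python
# ISO 639-1 codes, as typically returned by language detectors,
# mapped to built-in PostgreSQL text search configurations.
FTS_CONFIGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "it": "italian",
    "pt": "portuguese",
}


def fts_config_for(lang_code: str, default: str = "simple") -> str:
    """Pick a text search config; unknown languages use the assumed default."""
    return FTS_CONFIGS.get(lang_code.lower(), default)
```

The returned name would then be passed as the first argument to `to_tsvector` when the `search_vector` column is populated.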

3. Chunking

Why chunk? LLMs have context limits. Breaking documents into smaller pieces makes retrieval precise, efficient, and scalable.

✂️ Text Chunking

Tool: RecursiveCharacterTextSplitter (LangChain)

  • Default size: 512 characters, overlap: 50
  • Separators: "\n\n" → "\n" → " " → ""
  • User-customizable via Preferences UI
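
To make the separator cascade concrete, here is a simplified, dependency-free sketch of recursive splitting. It is not LangChain's implementation: it ignores overlap and differs in edge cases, and the function name is made up for this example.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def split_recursive(text, chunk_size=512, separators=DEFAULT_SEPARATORS):
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at the size limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits: keep accumulating
            continue
        if current:
            chunks.append(current)       # close the chunk built so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator.
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried first, so a chunk boundary falls inside a sentence or word only when no coarser separator produces a piece that fits.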

🎤 Transcript Chunking

Tool: Custom merge logic

  • Whisper returns segments of ~1-5 seconds
  • Segments accumulated until target size
  • Timestamps (start/end) preserved per chunk
  • Enables time-accurate citations in answers
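
The merge logic above can be sketched as follows. The `Segment` type, the function names, and the character-based target are assumptions for illustration; the actual MNEMOS merge code is not shown in this document.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def _flush(buffer):
    """Collapse buffered segments into one chunk spanning first start to last end."""
    text = " ".join(seg.text.strip() for seg in buffer)
    return Segment(buffer[0].start, buffer[-1].end, text)


def merge_segments(segments, target_chars=512):
    """Accumulate short transcript segments until the target size is reached."""
    chunks, buffer, size = [], [], 0
    for seg in segments:
        buffer.append(seg)
        size += len(seg.text) + 1  # +1 for the joining space
        if size >= target_chars:
            chunks.append(_flush(buffer))
            buffer, size = [], 0
    if buffer:  # leftover segments form a final, smaller chunk
        chunks.append(_flush(buffer))
    return chunks
```

Because each merged chunk keeps the start of its first segment and the end of its last, an answer can cite the exact time range it came from.
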
-- Each chunk: a searchable unit with vector + full-text
chunks (
    id            UUID PRIMARY KEY,
    document_id   UUID → documents.id,
    content       TEXT,
    chunk_index   INTEGER,
    page_number   INTEGER,
    start_time    FLOAT,
    end_time      FLOAT,
    embedding     VECTOR(1024),
    search_vector TSVECTOR
);

4. Embedding & Vector Search

Chunks become vectors that capture meaning. Similar chunks have similar vectors.

🧠 Embedder Service (Local or Remote)

  • Local: Sentence-Transformers (bge-m3) — CPU or GPU (CUDA), FP16
  • Remote: OpenAI, LM Studio, or Ollama embedding API
  • Auto-batching based on VRAM
  • Query embeddings cached via LRU (512 entries)
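
A 512-entry LRU cache for query embeddings can be sketched with the standard library. The embedder below is a deterministic stand-in, not a real model, and the names are hypothetical; only the cache size comes from the text above.

```python
import hashlib
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the (stand-in) model is actually invoked


def _embed_uncached(text: str) -> tuple:
    """Stand-in embedder: hashes the text into 8 floats. NOT a real model."""
    CALLS["count"] += 1
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(byte / 255.0 for byte in digest[:8])


@lru_cache(maxsize=512)
def embed_query(text: str) -> tuple:
    """Cached wrapper: repeated queries skip the model call entirely."""
    # A tuple is returned so callers cannot mutate the cached value.
    return _embed_uncached(text)
```

Calling `embed_query` twice with the same string invokes the stand-in model once; `embed_query.cache_info()` reports the hit and miss counts.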

📐 pgvector + HNSW Index

PostgreSQL 16 with pgvector stores vectors directly in the DB. HNSW indexes enable fast approximate nearest-neighbor search — finding the most similar chunks among millions in milliseconds.

-- Vector index
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
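
For intuition, the sketch below computes cosine distance (the metric behind `vector_cosine_ops` and pgvector's `<=>` operator) and performs an exact top-k scan. HNSW approximates this ranking without visiting every row. The function names and row layout are illustrative only.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, rows, k=5):
    """Exact brute-force top-k over (id, vector) pairs, closest first."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))[:k]
```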

🔀 Hybrid Search (RRF + MMR)

When you ask a question, two searches run in parallel:

  • Vector search — finds chunks with similar meaning
  • Full-text search — finds chunks with matching keywords

Results merge via Reciprocal Rank Fusion (RRF), then MMR re-ranks for diversity. Neighbor-window expansion adds adjacent chunks for context continuity.
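
The fusion and re-ranking steps can be sketched as below. The constant k=60 is the value commonly used for RRF, and the MMR weight `lam` plus the score normalization are choices made for this example; the source does not state the parameters MNEMOS uses.

```python
import math


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(scores, vectors, top_n=5, lam=0.5):
    """Maximal Marginal Relevance: trade relevance against similarity to picks so far."""
    top = max(scores.values(), default=0.0) or 1.0  # normalize relevance to [0, 1]
    remaining = sorted(scores, key=scores.get, reverse=True)
    selected = []
    while remaining and len(selected) < top_n:
        def value(candidate):
            redundancy = max(
                (cosine(vectors[candidate], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * scores[candidate] / top - (1 - lam) * redundancy
        best = max(remaining, key=value)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=0.5`, a chunk that nearly duplicates an already selected one is pushed below a less relevant but different chunk, which is the diversity effect described above.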

5. Summary Indexing (Map-Reduce)

Before the graph, the system builds a structured summary index — chapter-like sections with executive summaries.

🗺️ Map Phase (Parallel)

Chunks are grouped into batches of 5. The batches are sent to the LLM in parallel, each with a request for a title, a summary, and key concepts with relevance scores.

🔗 Reduce Phase

All batch results are stitched: consecutive batches with the same title merge. Concepts are aggregated by frequency (top 20). A final LLM call writes an Executive Summary.
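
A minimal sketch of the map and reduce steps follows, under the assumption that each batch result is a dict with `title`, `summary`, and `concepts` keys. The `summarize` callable stands in for the LLM call, and the final Executive Summary call is omitted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_batches(chunks, size=5):
    return [chunks[i:i + size] for i in range(0, len(chunks), size)]


def map_phase(batches, summarize, workers=5):
    """Run the (hypothetical) per-batch LLM call in parallel."""
    # pool.map returns results in input order, which the reduce step relies on.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, batches))


def reduce_phase(batch_results, top_k=20):
    """Merge consecutive batches with the same title; rank concepts by frequency."""
    sections, concept_counts = [], Counter()
    for result in batch_results:
        concept_counts.update(result["concepts"])
        if sections and sections[-1]["title"] == result["title"]:
            sections[-1]["summary"] += " " + result["summary"]
        else:
            sections.append({"title": result["title"], "summary": result["summary"]})
    top_concepts = [name for name, _ in concept_counts.most_common(top_k)]
    return sections, top_concepts
```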

-- Structured sections enable chapter-level retrieval
document_sections (
    id          UUID PRIMARY KEY,
    document_id UUID → documents.id,
    title       VARCHAR(255),
    content     TEXT,
    start_page  INTEGER,
    end_page    INTEGER,
    metadata_   JSONB
);

6. Hypergraph Extraction

The LLM reads every chunk and extracts a structured knowledge graph: concepts, definitions, and relationships between them.

🔬 Two-Pass Architecture

1. LLM Extract (Parallel)
2. Dedup + Embed + Save (Single Thread)

📡 Pass 1: Extract

Each batch of 4 chunks goes to the LLM with a JSON schema. The LLM returns events (source → relation → target) and definitions. All batches run in parallel.

💾 Pass 2: Save

Concepts are deduplicated and fuzzy-matched. Missing ones get embedded and inserted. HyperEdges (multi-way relationships) are created linking concepts to documents.
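
One plausible way to deduplicate and fuzzy-match concept names is sketched below with the standard library's `difflib`. The matching method, the 0.85 threshold, and the function names are all assumptions; the source does not say how MNEMOS performs the match.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def resolve_concept(name, known, threshold=0.85):
    """Return the closest existing concept name, or None if nothing is close enough."""
    key = normalize(name)
    best, best_ratio = None, 0.0
    for existing in known:
        ratio = SequenceMatcher(None, key, normalize(existing)).ratio()
        if ratio > best_ratio:
            best, best_ratio = existing, ratio
    return best if best_ratio >= threshold else None


def dedupe(names, known=None):
    """Map each extracted name to a canonical concept, adding new ones as needed."""
    known = list(known or [])
    mapping = {}
    for name in names:
        match = resolve_concept(name, known)
        if match is None:
            known.append(name)  # a new concept: this is where embedding + insert would happen
            match = name
        mapping[name] = match
    return known, mapping
```

Running this step in a single thread, as the text describes, avoids two workers inserting the same concept at once.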

-- Hypergraph: three tables forming a labeled property graph
concepts (                          -- Nodes
    id          UUID,
    name        VARCHAR(255) UNIQUE,
    description TEXT,
    embedding   VECTOR(1024)
);
hyper_edges (                       -- Relationships
    id           UUID,
    description  TEXT,
    source_doc   UUID → documents,
    source_chunk UUID → chunks
);
hyper_edge_members (                -- Concept participation
    hyper_edge_id UUID → hyper_edges,
    concept_id    UUID → concepts,
    role          VARCHAR(50)
);

7. Wiki & Reasoning Engine

Two systems consume the hypergraph: the Wiki (browsable articles) and the Reasoning Engine (graph traversal).

📖 Wiki: GET /api/wiki/article/{name}

Each concept becomes a wiki article with: description, all relationships (with peer concepts), source document chunks, and related concepts. Supports prefix + vector fuzzy search.

🧠 Reasoning Engine: BFS Traversal

The engine runs a breadth-first search (BFS) over HyperEdges from a start concept to a goal concept. It applies an intersection filter and can optionally take a Semantic Leap via vector similarity. It returns a narrative explanation plus Cytoscape graph JSON.
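
The core traversal can be sketched as a breadth-first search in which two concepts are neighbors when they share a hyperedge. The in-memory data layout and function name are assumptions, and the intersection filter and Semantic Leap are left out.

```python
from collections import deque


def bfs_path(edges, start, goal):
    """Shortest chain of hyperedges from start to goal.

    edges maps a hyperedge id to the set of concept names it connects.
    Returns a list of (from_concept, edge_id, to_concept) hops, or None.
    """
    # Index: concept -> ids of the hyperedges it participates in.
    by_concept = {}
    for edge_id, members in edges.items():
        for concept in members:
            by_concept.setdefault(concept, []).append(edge_id)

    parent = {start: None}  # concept -> (previous concept, edge used)
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            break
        for edge_id in by_concept.get(current, []):
            for neighbor in sorted(edges[edge_id]):  # sorted for determinism
                if neighbor not in parent:
                    parent[neighbor] = (current, edge_id)
                    queue.append(neighbor)

    if goal not in parent:
        return None
    path, node = [], goal
    while parent[node] is not None:
        previous, edge_id = parent[node]
        path.append((previous, edge_id, node))
        node = previous
    return list(reversed(path))
```

Each hop carries the hyperedge id, so a caller could look up the edge description and source chunk to build a narrative explanation.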

8. The RAG Query

When you ask a question, everything comes together.

graph TD
    Q["❓ User Question"]

    Q --> H["🔀 Hybrid Search<br/>(vector + keyword → RRF → MMR)"]
    Q --> G["🧠 Graph-RAG<br/>(optional)"]
    Q --> W["🌐 Web Search<br/>(optional)"]
    H --> R["📄 Ranked Chunks<br/>+ source citations"]
    G --> R
    W --> R
    R --> C["📦 Build Context"]
    C --> M1["📑 Document + Section hierarchy"]
    C --> M2["💬 Conversation history"]
    C --> M3["🧠 User memories"]
    C --> M4["🌍 Web results"]
    M1 & M2 & M3 & M4 --> P["🤖 LLM<br/>(System Prompt + Context + Question + Images)"]
    P --> S["✅ Response<br/>+ cited sources + saved conversation"]
    style Q fill:#1a1a2a,stroke:#7c6af0,color:#fff,font-weight:bold
    style H fill:#1a1a2a,stroke:#4ade80,color:#e0e0f0
    style G fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
    style W fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
    style R fill:#14141f,stroke:#8888aa,color:#e0e0f0
    style C fill:#14141f,stroke:#7c6af0,color:#fff,font-weight:bold
    style P fill:#1a1a2a,stroke:#f0a06a,color:#fff,font-weight:bold
    style S fill:#1a1a2a,stroke:#4ade80,color:#fff,font-weight:bold

🎯 Token Budget Guard

Before the prompt is sent to the LLM, its token count is checked against the model's context window. If it exceeds the window, the lowest-ranked chunks are dropped until it fits. This prevents silent truncation.
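
A budget guard of this kind might look like the sketch below. The whitespace token counter is a crude stand-in for a real tokenizer, and the parameter names (including the answer reserve) are assumptions for illustration.

```python
def count_tokens(text):
    """Crude stand-in: a real system would use the model's own tokenizer."""
    return len(text.split())


def fit_to_budget(ranked_chunks, prompt_overhead, context_window, reserve_for_answer=0):
    """Drop lowest-ranked chunks (the end of the list) until the prompt fits."""
    budget = context_window - prompt_overhead - reserve_for_answer
    kept = list(ranked_chunks)  # best first; the input list is not modified
    total = sum(count_tokens(chunk) for chunk in kept)
    while kept and total > budget:
        total -= count_tokens(kept.pop())  # remove the current lowest-ranked chunk
    return kept
```

Dropping whole chunks from the bottom of the ranking keeps every remaining citation intact, whereas truncation at the model boundary could cut a chunk mid-sentence.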

9. Real-World Use Cases

Concrete scenarios you can solve with MNEMOS.

10. Why This Architecture?

Each design decision solves a real problem.

⚡ Why Celery + Redis?

Document processing takes seconds to minutes, so handling it inside the HTTP request would risk timeouts. Celery queues the work and frees the API to respond immediately. Redis doubles as broker and result backend.

🗄️ Why pgvector instead of Pinecone/Qdrant?

Your data stays in PostgreSQL — one less system. No ETL, no sync lag, no extra cost. pgvector + HNSW gets 95%+ of dedicated vector DB performance for datasets under 10M vectors.

🧩 Why Hypergraph + Wiki?

Plain RAG is reactive: it finds chunks similar to your question. The hypergraph is proactive: it extracts knowledge structure before you ask. This enables browsing, traversal between distant concepts, and narrative synthesis.

💻 Why Local-First + Optional Cloud?

MNEMOS runs entirely on your hardware with llama.cpp for inference and Sentence-Transformers for embeddings. No data leaves your network. But you can plug in GPT-4 or Claude when needed.

📐 Why Map-Reduce for summaries?

LLMs have finite context windows, and a 300-page PDF does not fit in one. Map-Reduce processes small batches of chunks in parallel, then aggregates the results. 100 chunks process in ~10 seconds with 5 workers.


Flask + Angular 21 + PostgreSQL 16 + pgvector + llama.cpp.