MNEMOS — How It Works

💻 Hardware Recommendations

What you need for a smooth experience at different scales.

🟢 Minimum

CPU: 4 cores (AVX support)
RAM: 8 GB
Storage: 10 GB free
GPU: none (CPU mode)
Great for: testing, small docs, CPU-only

🟡 Recommended

CPU: 8+ cores
RAM: 16-32 GB
Storage: 50 GB SSD
GPU: NVIDIA 8GB+ VRAM (CUDA)
Great for: local LLMs, hypergraph, daily use

🟣 Pro / Heavy

CPU: 16+ cores
RAM: 64 GB
Storage: 100+ GB NVMe
GPU: NVIDIA 24GB+ VRAM (CUDA)
Great for: large models (70B+), heavy batch processing

🔵 Cloud LLM

No GPU needed
RAM: 8 GB minimum
Storage: 5 GB free
API keys: OpenAI / Anthropic / Groq
Great for: low-resource hardware, maximum quality

1. Pipeline Overview

Every document goes through a 6-stage pipeline. Each stage transforms raw content into structured, queryable intelligence.

1 Upload → 2 Extract & Chunk (LLM) → 3 Embed & Save (pgvector) → 4 Summarize (Map-Reduce) → 5 Hypergraph (LLM) → 6 Wiki / RAG Ready

graph LR
    A["📤 Upload<br/>(PDF / MP4 / YT / EPUB)"]
    B["✂️ Chunker<br/>(LangChain + Whisper)"]
    C["🧠 Embedder<br/>(pgvector 1024d)"]
    D["📊 PostgreSQL 16 + pgvector"]
    E1["🔍 RAG Query<br/>(hybrid search)"]
    E2["📖 Wiki<br/>(concept articles)"]
    E3["🧩 Reasoning<br/>(BFS traversal)"]

    A --> B
    B --> C
    C --> D
    D --> E1
    D --> E2
    D --> E3

    style A fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style B fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
    style C fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
    style D fill:#14141f,stroke:#4ade80,color:#e0e0f0
    style E1 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style E2 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0
    style E3 fill:#1a1a2a,stroke:#7c6af0,color:#e0e0f0

2. Upload & Extract

A file or YouTube URL is received, saved, and queued for background processing via Celery + Redis.

📤 Upload API: POST /api/documents/upload

The Flask endpoint saves the file with a UUID-prefixed name (no collisions), creates a Document row with status='pending', and enqueues a Celery task that runs asynchronously.
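The upload step above can be sketched in plain Python. This is a minimal illustration, not the MNEMOS endpoint itself: the `Document` dataclass, `stored_name`, `handle_upload`, and the list used as a queue are hypothetical stand-ins for the real database row and Celery task.

```python
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for the Document row described above."""
    filename: str
    stored_path: str
    status: str = "pending"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def stored_name(original_filename: str) -> str:
    """Prefix the original name with a random UUID so two uploads do not collide."""
    safe = Path(original_filename).name  # drop any client-supplied directories
    return f"{uuid.uuid4().hex}_{safe}"


def handle_upload(original_filename: str, upload_dir: str, queue: list) -> Document:
    """Record the upload as pending and enqueue it for background processing."""
    path = str(Path(upload_dir) / stored_name(original_filename))
    doc = Document(filename=original_filename, stored_path=path)
    queue.append(doc.id)  # assumption: stands in for enqueuing a Celery task
    return doc
```

Uploading the same file twice yields two distinct stored paths, which is the point of the UUID prefix.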

🔄 Extraction by File Type

PDF / EPUB

PyMuPDF extracts text page by page
Metadata (author, title) saved to Document
Each page is chunked via RecursiveCharacterTextSplitter

Audio / Video / YouTube

YouTube: yt-dlp downloads audio, saves metadata
Whisper transcribes speech into timed segments
Chunker merges small segments into larger chunks

🌍 Language Detection

langdetect samples the first chunks to detect the document language. This sets the PostgreSQL full-text search config (english, spanish, etc.), enabling proper stemming per language.
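As a rough sketch of that mapping: langdetect returns ISO 639-1 style codes such as `en` or `es`, which have to be translated into a PostgreSQL text search configuration name. The table below, the majority vote over sampled chunks, and the fallback to `simple` are assumptions for illustration, not the confirmed MNEMOS behavior.

```python
from collections import Counter

# Assumed mapping from language codes to built-in PostgreSQL text search configs.
FTS_CONFIGS = {
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "it": "italian",
    "pt": "portuguese",
}


def majority_language(sample_codes: list) -> str:
    """Pick the most frequent language code among the sampled chunks."""
    return Counter(sample_codes).most_common(1)[0][0]


def fts_config_for(lang_code: str) -> str:
    """Map a code such as 'en' or 'zh-cn' to a config name, defaulting to 'simple'."""
    base = lang_code.lower().split("-")[0]
    return FTS_CONFIGS.get(base, "simple")
```

The `simple` configuration performs no stemming, so it is a safe fallback when no language-specific config is available.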

3. Chunking

Why chunk? LLMs have context limits. Breaking documents into smaller pieces makes retrieval precise, efficient, and scalable.

✂️ Text Chunking

Tool: RecursiveCharacterTextSplitter (LangChain)

Default size: 512 characters, overlap: 50
Separators: "\n\n" → "\n" → " " → ""
User-customizable via Preferences UI
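To make the separator hierarchy concrete, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: `split_recursive` and `chunk_text` are hypothetical names, and overlap is approximated by prepending the last `overlap` characters of the previous piece.

```python
SEPARATORS = ("\n\n", "\n", " ", "")


def split_recursive(text, max_len, separators=SEPARATORS):
    """Split text into pieces of at most max_len, trying coarse separators first."""
    if len(text) <= max_len:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at fixed width.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    if sep not in text:
        return split_recursive(text, max_len, rest)
    parts = text.split(sep)
    # Re-attach the separator to every part but the last so no characters are lost.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces, current = [], ""
    for part in parts:
        if len(current) + len(part) <= max_len:
            current += part
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # Part is still too long: fall back to the finer separators.
            pieces.extend(split_recursive(part, max_len, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text, chunk_size=512, overlap=50):
    """Build chunks of at most chunk_size characters with a character overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    body = split_recursive(text, chunk_size - overlap)
    chunks = []
    for i, piece in enumerate(body):
        prefix = body[i - 1][-overlap:] if i > 0 and overlap else ""
        chunks.append(prefix + piece)
    return chunks
```

Paragraph breaks are preferred as split points, then line breaks, then spaces; a hard character cut happens only when a single word exceeds the limit.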

🎤 Transcript Chunking

Tool: Custom merge logic

Whisper returns segments of ~1-5 seconds
Segments accumulated until target size
Timestamps (start/end) preserved per chunk
Enables time-accurate citations in answers
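The merge logic can be sketched as follows. It assumes each segment is a dict with `start`, `end`, and `text` keys (the shape Whisper uses for its segments); `merge_segments`, the `target_chars` parameter, and the output keys are illustrative names chosen to mirror the chunk columns, not the actual MNEMOS code.

```python
def merge_segments(segments, target_chars=512):
    """Accumulate short transcript segments into chunks, keeping start/end times."""
    chunks = []
    texts, start, end, size = [], None, None, 0

    def flush():
        chunks.append({
            "content": " ".join(texts),
            "start_time": start,
            "end_time": end,
        })

    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip silent or empty segments
        if start is None:
            start = seg["start"]  # chunk starts with its first segment
        size += len(text) + (1 if texts else 0)  # +1 for the joining space
        texts.append(text)
        end = seg["end"]  # chunk ends with its latest segment
        if size >= target_chars:
            flush()
            texts, start, end, size = [], None, None, 0
    if texts:
        flush()  # emit the trailing partial chunk
    return chunks
```

Because each chunk carries the start of its first segment and the end of its last, an answer can cite the exact time range it came from.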

-- Each chunk: a searchable unit with vector + full-text
chunks (
    id            UUID PRIMARY KEY,
    document_id   UUID → documents.id,
    content       TEXT,
    chunk_index   INTEGER,
    page_number   INTEGER,
    start_time    FLOAT,
    end_time      FLOAT,
    embedding     VECTOR(1024),
    search_vector TSVECTOR
);

4. Embedding & Vector Search

Chunks become vectors that capture meaning. Similar chunks have similar vectors.

🧠 Embedder Service (Local or Remote)

Local: Sentence-Transformers (bge-m3) — CPU or GPU (CUDA), FP16
Remote: OpenAI, LM Studio, or Ollama embedding API
Auto-batching based on VRAM
Query embeddings cached via LRU (512 entries)
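The query cache in the last bullet can be approximated with the standard library. In this sketch `_embed_uncached` is a hypothetical placeholder that returns a fake 4-dimensional vector; the real service would call the embedding model there.

```python
import hashlib
from functools import lru_cache


def _embed_uncached(text: str) -> tuple:
    """Hypothetical stand-in for the model call; returns a fake 4-d vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:4])


@lru_cache(maxsize=512)
def embed_query(text: str) -> tuple:
    """Return the embedding for a query, reusing the cached value on repeats."""
    return _embed_uncached(text)
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot corrupt the vector another caller receives.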

📐 pgvector + HNSW Index

PostgreSQL 16 with pgvector stores vectors directly in the DB. HNSW indexes enable fast approximate nearest-neighbor search — finding the most similar chunks among millions in milliseconds.

-- Vector index
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

🔀 Hybrid Search (RRF + MMR)

When you ask a question, two searches run in parallel:

Vector search — finds chunks with similar meaning
Full-text search — finds chunks with matching keywords

Results merge via Reciprocal Rank Fusion (RRF), then MMR re-ranks for diversity. Neighbor-window expansion adds adjacent chunks for context continuity.
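Both steps fit in a short sketch. The constant `k=60` is the value commonly used for RRF, and `lam` is the usual MMR trade-off between relevance and diversity; neither value is confirmed for MNEMOS, and the function names are illustrative.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


def mmr(candidates, relevance, similarity, lam=0.7, top_n=3):
    """Maximal Marginal Relevance: greedily pick relevant but non-redundant items."""
    selected, remaining = [], list(candidates)

    def score(candidate):
        # Penalize similarity to the closest item already selected.
        redundancy = max((similarity(candidate, s) for s in selected), default=0.0)
        return lam * relevance[candidate] - (1 - lam) * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

A chunk that appears in both result lists accumulates two reciprocal-rank terms, which is why agreement between vector and keyword search pushes it to the top.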

5. Summary Indexing (Map-Reduce)

Before the graph, the system builds a structured summary index — chapter-like sections with executive summaries.

🗺️ Map Phase (Parallel)

Chunks are grouped into batches of 5. Each batch goes to the LLM in parallel asking for: a title, a summary, and key concepts with relevance scores.

🔗 Reduce Phase

All batch results are stitched: consecutive batches with the same title merge. Concepts are aggregated by frequency (top 20). A final LLM call writes an Executive Summary.
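The two phases can be sketched with a thread pool and a counter. Here `stub_summarize` is a hypothetical stand-in for the LLM call, the dictionary keys are assumed, and the final Executive Summary call is omitted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def stub_summarize(batch):
    """Hypothetical stand-in for the LLM: returns a title, summary, and concepts."""
    title = batch[0].split(":")[0]
    concepts = [w.lower() for chunk in batch for w in chunk.split(":")[-1].split()]
    return {"title": title, "summary": f"{len(batch)} chunks about {title}",
            "concepts": concepts}


def map_phase(chunks, summarize, batch_size=5, workers=5):
    """Summarize batches in parallel; pool.map keeps results in document order."""
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, batches))


def reduce_phase(results, top_k=20):
    """Merge consecutive batches with the same title and rank concepts by frequency."""
    sections, counts = [], Counter()
    for result in results:
        if sections and sections[-1]["title"] == result["title"]:
            sections[-1]["summary"] += " " + result["summary"]
        else:
            sections.append({"title": result["title"], "summary": result["summary"]})
        counts.update(result["concepts"])
    top_concepts = [name for name, _ in counts.most_common(top_k)]
    return sections, top_concepts
```

Only consecutive batches are merged, so a title that recurs later in the document still starts a new section.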

-- Structured sections enable chapter-level retrieval
document_sections (
    id          UUID PRIMARY KEY,
    document_id UUID → documents.id,
    title       VARCHAR(255),
    content     TEXT,
    start_page  INTEGER,
    end_page    INTEGER,
    metadata_   JSONB
);

6. Hypergraph Extraction

The LLM reads every chunk and extracts a structured knowledge graph: concepts, definitions, and relationships between them.

🔬 Two-Pass Architecture

1 LLM Extract (Parallel) → 2 Dedup + Embed + Save (Single Thread)

📡 Pass 1: Extract

Each batch of 4 chunks goes to the LLM with a JSON schema. Returns events (source → relation → target) and definitions. All batches run in parallel.

💾 Pass 2: Save

Concepts are deduplicated and fuzzy-matched. Missing ones get embedded and inserted. HyperEdges (multi-way relationships) are created linking concepts to documents.
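A minimal sketch of the save pass, using the standard library's `difflib` for fuzzy matching. The 0.9 threshold, the `source`/`target` role names, and the event keys are assumptions; the real pass also embeds new concepts and writes to the database, which is left out here.

```python
from difflib import SequenceMatcher


def resolve_concept(name, known, threshold=0.9):
    """Return the existing concept that fuzzily matches name, or register a new one."""
    key = " ".join(name.lower().split())  # normalize case and whitespace
    for existing in known:
        if SequenceMatcher(None, key, existing).ratio() >= threshold:
            return existing
    known.append(key)  # in the real pipeline: embed and insert here
    return key


def build_hyperedges(events, known):
    """Turn extracted events into hyperedges whose members are deduplicated concepts."""
    edges = []
    for event in events:
        members = [
            (resolve_concept(event["source"], known), "source"),
            (resolve_concept(event["target"], known), "target"),
        ]
        edges.append({"description": event["relation"], "members": members})
    return edges
```

Running this pass in a single thread avoids two workers inserting the same concept at once, which fits the unique constraint on concept names.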

-- Hypergraph: three tables forming a labeled property graph
concepts (                  -- Nodes
    id          UUID,
    name        VARCHAR(255) UNIQUE,
    description TEXT,
    embedding   VECTOR(1024)
);
hyper_edges (               -- Relationships
    id           UUID,
    description  TEXT,
    source_doc   UUID → documents,
    source_chunk UUID → chunks
);
hyper_edge_members (        -- Concept participation
    hyper_edge_id UUID → hyper_edges,
    concept_id    UUID → concepts,
    role          VARCHAR(50)
);

7. Wiki & Reasoning Engine

Two systems consume the hypergraph: the Wiki (browsable articles) and the Reasoning Engine (graph traversal).

📖 Wiki: GET /api/wiki/article/{name}

Each concept becomes a wiki article with: description, all relationships (with peer concepts), source document chunks, and related concepts. Supports prefix + vector fuzzy search.

🧠 Reasoning Engine (BFS Traversal)

The engine runs a breadth-first search (BFS) over HyperEdges from a start concept to a goal concept, applying an intersection filter and an optional Semantic Leap via vector similarity. It returns a narrative explanation plus Cytoscape graph JSON.
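The core traversal can be sketched as a plain breadth-first search in which two concepts are neighbors whenever they share a hyperedge. The intersection filter and Semantic Leap are left out, and representing hyperedges as sets of concept names is an assumption for illustration.

```python
from collections import deque


def find_path(hyperedges, start, goal):
    """Return the shortest concept path from start to goal, or None if unreachable."""
    # Index: concept -> ids of the hyperedges it participates in.
    by_concept = {}
    for edge_id, members in enumerate(hyperedges):
        for concept in members:
            by_concept.setdefault(concept, []).append(edge_id)

    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:  # walk parents back to the start
                path.append(current)
                current = parent[current]
            return path[::-1]
        for edge_id in by_concept.get(current, []):
            for neighbor in sorted(hyperedges[edge_id]):
                if neighbor not in parent:
                    parent[neighbor] = current
                    queue.append(neighbor)
    return None
```

Because one hyperedge can link more than two concepts, a single hop can reach every other member of that edge.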

8. The RAG Query

When you ask a question, everything comes together.

graph TD
    Q["❓ User Question"]
    
    Q --> H["🔀 Hybrid Search<br/>(vector + keyword → RRF → MMR)"]
    Q --> G["🧠 Graph-RAG<br/>(optional)"]
    Q --> W["🌐 Web Search<br/>(optional)"]

    H --> R["📄 Ranked Chunks<br/>+ source citations"]
    G --> R
    W --> R

    R --> C["📦 Build Context"]
    C --> M1["📑 Document + Section hierarchy"]
    C --> M2["💬 Conversation history"]
    C --> M3["🧠 User memories"]
    C --> M4["🌍 Web results"]

    M1 & M2 & M3 & M4 --> P["🤖 LLM<br/>(System Prompt + Context + Question + Images)"]
    P --> S["✅ Response<br/>+ cited sources + saved conversation"]

    style Q fill:#1a1a2a,stroke:#7c6af0,color:#fff,font-weight:bold
    style H fill:#1a1a2a,stroke:#4ade80,color:#e0e0f0
    style G fill:#1a1a2a,stroke:#60a5fa,color:#e0e0f0
    style W fill:#1a1a2a,stroke:#f0a06a,color:#e0e0f0
    style R fill:#14141f,stroke:#8888aa,color:#e0e0f0
    style C fill:#14141f,stroke:#7c6af0,color:#fff,font-weight:bold
    style P fill:#1a1a2a,stroke:#f0a06a,color:#fff,font-weight:bold
    style S fill:#1a1a2a,stroke:#4ade80,color:#fff,font-weight:bold

🎯 Token Budget Guard

Before the prompt is sent to the LLM, its token count is checked against the model's context window. If it exceeds the window, the lowest-ranked chunks are dropped until it fits. This prevents silent truncation.
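A minimal sketch of such a guard, assuming chunks arrive sorted best-first. The `estimate_tokens` heuristic of roughly four characters per token is a placeholder for a real tokenizer, and `fixed_tokens` and `reserve_for_answer` are hypothetical parameters.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(ranked_chunks, fixed_tokens, context_window, reserve_for_answer=0):
    """Drop the lowest-ranked chunks until the prompt fits the context window.

    ranked_chunks is ordered best-first; fixed_tokens covers the system prompt,
    question, and history, which are never dropped in this sketch.
    """
    budget = context_window - fixed_tokens - reserve_for_answer
    kept = list(ranked_chunks)
    total = sum(estimate_tokens(chunk) for chunk in kept)
    while kept and total > budget:
        total -= estimate_tokens(kept.pop())  # remove the lowest-ranked chunk
    return kept
```

Dropping whole chunks from the bottom of the ranking keeps every remaining chunk intact, rather than letting the model cut the prompt off mid-sentence.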

9. Why This Architecture?

Each design decision solves a real problem.

⚡ Why Celery + Redis?

Document processing takes seconds to minutes, so handling it inside a blocking HTTP request would time out. Celery queues the work and frees the API to respond immediately. Redis doubles as broker and result backend.

🗄️ Why pgvector instead of Pinecone/Qdrant?

Your data stays in PostgreSQL — one less system. No ETL, no sync lag, no extra cost. pgvector + HNSW gets 95%+ of dedicated vector DB performance for datasets under 10M vectors.

🧩 Why Hypergraph + Wiki?

Plain RAG is reactive: it finds chunks similar to your question. The hypergraph is proactive: it extracts knowledge structure before you ask. This enables browsing, traversal between distant concepts, and narrative synthesis.

💻 Why Local-First + Optional Cloud?

MNEMOS runs entirely on your hardware with llama.cpp for inference and Sentence-Transformers for embeddings. No data leaves your network. But you can plug in GPT-4 or Claude when needed.

📐 Why Map-Reduce for summaries?

LLMs have context windows. A 300-page PDF doesn't fit. Map-Reduce processes small chunks in parallel, then aggregates. 100 chunks process in ~10 seconds with 5 workers.
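The timing figure can be unpacked with a little arithmetic. The sketch below assumes the batch size of 5 from the Map phase, LLM calls of equal duration, and no time spent in the Reduce phase; the per-call latency it derives is implied by the figures above rather than measured.

```python
import math


def implied_call_latency(chunks=100, batch_size=5, workers=5, total_seconds=10):
    """Work backwards from the total time to the latency of one LLM call."""
    batches = math.ceil(chunks / batch_size)  # 100 chunks -> 20 batches
    waves = math.ceil(batches / workers)      # 20 batches on 5 workers -> 4 waves
    return batches, waves, total_seconds / waves


batches, waves, seconds_per_call = implied_call_latency()
print(batches, waves, seconds_per_call)  # 20 4 2.5
```

Under these assumptions, each LLM call takes about 2.5 seconds, and doubling the worker count would roughly halve the map time.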
