Active

Mnemon: Local RAG Server

A self-hosted AI inference and RAG server running entirely on local hardware. No API costs, no data leaving the network, no rate limits.

Ollama · AnythingLLM · Docker · RAG · Ubuntu · NVIDIA CUDA

The Idea

I've spent years building and documenting things, but finding that information later always felt harder than it should be. Searching through old notes, digging through config files, trying to remember which doc had the answer. It adds up.

The goal was simple: take everything I care about (documents, notes, configs, articles) and make it searchable through natural language. Ask a question, get an answer with a source. No API costs, no rate limits, no data leaving the network.

I called it Mnemon, a nod to memory and recall. It runs completely on local hardware, and everything stays on-premises.

The Stack

Core Stack

  • Ollama: local LLM inference, GPU-accelerated systemd service
  • AnythingLLM: RAG pipeline, document ingestion, chat UI
  • Docker: container for AnythingLLM with GPU passthrough
  • LanceDB: built-in vector database via AnythingLLM
  • Ubuntu 22.04 LTS: host OS

Hardware

  • RTX 3070 (8GB VRAM): primary inference GPU
  • 92GB ECC RAM: memory for embeddings and model context
  • 447GB SSD: OS drive
  • 745GB SSD: dedicated AI data drive (/mnt/aidata)

How I Set It Up

The setup came together in a few stages. Getting each piece working independently before connecting them made the whole process much smoother. Rough command sketches for the main steps follow the list.

  • 1. OS and storage. Installed Ubuntu 22.04 and mounted the dedicated AI data drive at /mnt/aidata to keep model weights and all persistent data off the OS drive.
  • 2. Ollama. Installed Ollama as a systemd service, set the OLLAMA_MODELS environment variable to point at the data drive before pulling any models, then pulled the chat and embedding models.
  • 3. GPU passthrough. Installed the NVIDIA Container Toolkit so Docker containers could access the GPU for accelerated inference.
  • 4. AnythingLLM. Deployed AnythingLLM in Docker and configured it to connect to Ollama using the server's LAN IP; inside a Linux container, localhost points at the container itself, and host.docker.internal isn't defined unless you add it explicitly.
  • 5. First documents. Ingested a few PDFs and notes, ran test queries, confirmed the pipeline was returning grounded answers with source citations.
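
Here's a rough sketch of steps 1 and 2, assuming an ext4 data drive. The mount point and model tags match the rest of this post; the UUID placeholder, the /mnt/aidata/ollama/models subdirectory, and the OLLAMA_HOST line (added so the API is reachable from containers over the LAN IP) are my own conventions, so adjust to taste.

  # Step 1: mount the dedicated AI data drive (UUID is a placeholder)
  sudo mkdir -p /mnt/aidata
  echo 'UUID=<drive-uuid>  /mnt/aidata  ext4  defaults  0  2' | sudo tee -a /etc/fstab
  sudo mount -a

  # Step 2: install Ollama (sets up a systemd service running as the "ollama" user)
  curl -fsSL https://ollama.com/install.sh | sh

  # Point model storage at the data drive BEFORE pulling anything
  sudo mkdir -p /mnt/aidata/ollama/models
  sudo chown -R ollama:ollama /mnt/aidata/ollama
  sudo systemctl edit ollama    # add the lines below to the override
  #   [Service]
  #   Environment="OLLAMA_MODELS=/mnt/aidata/ollama/models"
  #   Environment="OLLAMA_HOST=0.0.0.0"
  sudo systemctl restart ollama

  # Pull the chat, fallback, vision, and embedding models
  ollama pull gemma3n:e4b
  ollama pull mistral:7b-instruct-q4_K_M
  ollama pull llava
  ollama pull nomic-embed-text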
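For step 3, the toolkit install and Docker wiring were roughly this; the apt repository setup is omitted, so follow NVIDIA's current install docs for that part.

  # Step 3: NVIDIA Container Toolkit (after adding NVIDIA's apt repo)
  sudo apt-get install -y nvidia-container-toolkit
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker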
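And step 4, the AnythingLLM container. The image name, port, and STORAGE_DIR variable come from AnythingLLM's Docker instructions (check those for the full recommended flags); 192.168.1.50 is a stand-in for the server's real LAN IP, and the storage path just follows the /mnt/aidata convention used here.

  # Step 4: run AnythingLLM with GPU access and persistent storage on the data drive
  mkdir -p /mnt/aidata/anythingllm
  docker run -d --name anythingllm \
    --gpus all \
    -p 3001:3001 \
    -v /mnt/aidata/anythingllm:/app/server/storage \
    -e STORAGE_DIR="/app/server/storage" \
    mintplexlabs/anythingllm

  # In the AnythingLLM setup UI, point the Ollama provider at the host's LAN IP:
  #   http://192.168.1.50:11434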

What It's Running

Ollama handles all inference and exposes a local API on the network. The primary chat model is gemma3n:e4b for general Q&A, with mistral:7b-instruct-q4_K_M as a fallback when I want to compare outputs or need a second pass. For vision work, llava handles image analysis for Frigate and Home Assistant. nomic-embed-text handles all embeddings for RAG retrieval.
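A quick sanity check against the Ollama API looks something like this; 11434 is Ollama's default port, and the prompts are just examples.

  # Generate with the primary chat model
  curl http://localhost:11434/api/generate -d '{
    "model": "gemma3n:e4b",
    "prompt": "In one sentence, what does a RAG pipeline do?",
    "stream": false
  }'

  # Embed text with the embedding model used for RAG retrieval
  curl http://localhost:11434/api/embeddings -d '{
    "model": "nomic-embed-text",
    "prompt": "test embedding"
  }'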

All model weights live on the dedicated AI drive. The GPU is passed through to Docker via the NVIDIA Container Toolkit so AnythingLLM can run inference without the overhead of CPU-only processing.
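Two quick checks that the passthrough actually works; the CUDA image tag here is just an example, any recent base tag will do.

  # GPU visible inside a container?
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

  # Is Ollama offloading to the GPU? (look at the PROCESSOR column)
  ollama ps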

What It Actually Does

Day to day, the most useful thing is document Q&A. I upload PDFs, notes, or exported articles and query them through chat. AnythingLLM cites its sources, so I can see exactly which chunk it pulled the answer from.

It also runs as the local AI backend for Home Assistant, replacing any cloud dependency for voice assist. Frigate uses llava to generate descriptions of surveillance events locally, so nothing leaves the network. And when I want to experiment with a new model or pipeline idea, I don't have to worry about API costs or rate limits.

What I Learned

The data pipeline matters more than the model. I spent time trying different LLMs before realizing that well-structured, clean source documents made a bigger difference than which model was doing the answering. Good context beats a bigger model with bad data every time.

A few setup details that aren't obvious. Ollama runs as its own system user, so the AI data drive permissions need to match that user, not your login user. The OLLAMA_MODELS environment variable has to be set before pulling any models or they land on the OS drive by default. And inside Docker on Linux, you need the server's actual LAN IP, not a hostname alias.
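For the permissions gotcha specifically, the fix is just making the ollama system user own its model directory (the path matches the sketch earlier in the post):

  # Ollama's systemd service runs as the "ollama" user, so it must own the model dir
  sudo chown -R ollama:ollama /mnt/aidata/ollama
  sudo systemctl restart ollama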

What's Next

  • Add Synology NAS as a data source
  • Configure Google Drive connector
  • Set up NGINX reverse proxy with SSL for secure remote access
  • Evaluate Qdrant as an external vector DB for larger workloads