Local AI Inference Machine
Build a Debian local AI inference host with GPU drivers, model storage, Ollama, llama.cpp, containers, resource monitoring, and LAN access boundaries.
The Local AI Inference Machine scenario is for running large language models, embedding models, speech transcription, or image generation services on a Debian host. The goal is to plan drivers, model files, inference services, resource monitoring, and access boundaries together.
Who It Is For
- Individuals or teams running models locally to reduce external API dependency
- Developers who need GPU-accelerated inference, batch embeddings, or offline experiments
- Users with a workstation or server who want a clean model and service layout
Recommended Hardware
| Component | Recommendation |
|---|---|
| CPU | 6+ cores for concurrency and preprocessing |
| Memory | 16 GB minimum, 32+ GB for larger models |
| VRAM | 8 GB for lightweight models, 16+ GB for medium-sized models |
| Disk | 1 TB SSD recommended; model directories grow quickly |
| Network | Wired LAN for local APIs; public access needs authentication and rate limits |
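The CPU and memory rows above can be checked on an existing host before installing anything. The following is a minimal sketch, assuming a Linux system with `nproc` and `/proc/meminfo`; the function names `check_min` and `report_host` are illustrative, and the thresholds are simply the table's minimums.

```shell
# Compare one measured value against a minimum and print a verdict line.
# Arguments: label, measured value, minimum (both whole numbers).
check_min() {
    if [ "$2" -ge "$3" ]; then
        printf 'ok   %s: %s (minimum %s)\n' "$1" "$2" "$3"
    else
        printf 'low  %s: %s (minimum %s)\n' "$1" "$2" "$3"
        return 1
    fi
}

# Report CPU cores and memory against the table's minimums (6 cores, 16 GB).
# MemTotal is rounded to the nearest GiB because the kernel reserves part of
# the installed RAM, so a 16 GB machine usually reports slightly less.
report_host() {
    cores=$(nproc)
    mem_gib=$(awk '/^MemTotal:/ { printf "%d", $2 / 1048576 + 0.5 }' /proc/meminfo)
    status=0
    check_min 'CPU cores' "$cores" 6 || status=1
    check_min 'Memory (GiB)' "$mem_gib" 16 || status=1
    return "$status"
}
```

Calling `report_host` prints one line per check and returns non-zero if either value is below the minimum. Machines that reserve a large share of RAM for an integrated GPU may be reported one step lower than their installed memory.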
Installation Path
- Install Debian stable and verify the kernel, firmware, and GPU are detected.
- Use Hardware & Drivers for graphics, firmware, and laptop thermal issues.
- Plan separate directories for model files, caches, and service data.
- Choose a runtime: native binary, Python virtual environment, container, or dedicated inference service.
- Expose APIs only on the LAN by default; add reverse proxying, authentication, and logs before public access.
Base Packages
sudo apt update
sudo apt install git git-lfs build-essential cmake pkg-config python3 python3-venv \
python3-pip ffmpeg jq htop nvtop

Without an NVIDIA GPU, nvtop can still support some AMD / Intel devices depending on driver and kernel support.
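After the install, it can be useful to confirm that the expected commands are actually on `PATH`. This is a small sketch; `missing_tools` and `BASE_TOOLS` are illustrative names, and the command list is an assumption about which binaries the packages above provide.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands expected from the packages above. build-essential and python3-venv
# ship no command of their own name, so gcc and make stand in for the former.
BASE_TOOLS='git git-lfs gcc make cmake pkg-config python3 pip3 ffmpeg jq htop nvtop'
```

Running `missing_tools $BASE_TOOLS` (unquoted, so the list splits into words) prints nothing when everything is present.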
GPU And Drivers
Identify hardware first:
lspci -nn | grep -Ei 'vga|3d|display'

NVIDIA users should start with NVIDIA Drivers. Avoid mixing drivers, CUDA packages, and kernel modules from multiple sources. Laptop users should also read Laptop Compatibility for thermals, power mode, and external display paths.
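The `lspci -nn` output includes a `[vendor:device]` ID pair for each device, which makes it possible to decide which driver path applies. The sketch below maps the well-known PCI vendor IDs (10de for NVIDIA, 1002 for AMD, 8086 for Intel) to short names; `gpu_vendors` is an illustrative helper, not part of any package.

```shell
# Read `lspci -nn` output on stdin and print the vendor of each display
# device, one name per line, without duplicates.
gpu_vendors() {
    grep -Ei 'vga|3d|display' | while IFS= read -r line; do
        case $line in
            *'[10de:'*) echo nvidia ;;
            *'[1002:'*) echo amd ;;
            *'[8086:'*) echo intel ;;
            *)          echo other ;;
        esac
    done | sort -u
}
```

Typical use is `lspci -nn | gpu_vendors`. A laptop with hybrid graphics would be expected to print two lines, for example `intel` and `nvidia`.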
Model Directory
Keep models and service data outside user home directories:
sudo install -d -m 2775 -o root -g users /srv/ai
sudo install -d -m 2775 -o root -g users /srv/ai/models
sudo install -d -m 2775 -o root -g users /srv/ai/cache

Suggested layout:

/srv/ai/
  models/
  cache/
  services/
  logs/

Model files are large. For backups, prioritize manifests, download sources, and custom configuration; back up model files only when they cannot be recreated.
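The commands above create `models/` and `cache/`, while the suggested layout also lists `services/` and `logs/`. One way to create all four in a single step is sketched below; `make_ai_layout` is an illustrative name, and any arguments after the root path are passed straight to `install`, so the owner and group flags remain optional.

```shell
# Create the suggested layout under the root directory given as $1.
# Mode 2775 sets the setgid bit so new files inherit the directory's group.
# Any further arguments (for example -o root -g users) go to install as-is.
make_ai_layout() {
    root=$1
    shift
    for dir in "" models cache services logs; do
        install -d -m 2775 "$@" "$root/$dir" || return 1
    done
}
```

On a real host this would be called from a root shell or a script run with sudo, for example `make_ai_layout /srv/ai -o root -g users`. Without the owner flags it can be tried safely against a scratch directory first.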
Runtime Choices
| Runtime | Good fit | Notes |
|---|---|---|
| Ollama | Quickly running common LLMs and a local API | Confirm package or container source and update policy first |
| llama.cpp | Lightweight, controllable GGUF model runtime | Match build options and GPU backend to your hardware |
| Python virtual environment | Custom scripts, embeddings, experiments | Use one venv per project; do not pollute system Python |
| Containers | Isolating services and dependencies | Verify GPU pass-through, volumes, and bind addresses |
Do not pipe remote installer scripts directly into a shell. Download, verify, and review scripts before deciding whether to install them.
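The download, verify, and review sequence could look like the following sketch. `verify_sha256` is an illustrative helper; the URL and digest in the comments are placeholders, and the approach assumes the vendor publishes a SHA-256 digest for the script.

```shell
# Succeed only if the SHA-256 digest of file $1 equals the expected
# lowercase hex digest $2.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Typical flow (URL and digest are placeholders, not real values):
#   curl -fsSLo installer.sh https://example.com/install.sh
#   verify_sha256 installer.sh "<digest published by the vendor>" || exit 1
#   less installer.sh    # read the script before running anything
#   sh installer.sh
```

A matching digest only shows that the file is the one the vendor published; reading the script is still the step that tells you what it will do to the system.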
Service Boundaries
Inference APIs should bind to local or LAN addresses by default:
127.0.0.1     local machine only
192.168.x.x   trusted LAN clients
0.0.0.0       only after reverse proxying and authentication are ready

For LAN access, open firewall rules only to trusted subnets and the specific service port.
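A start script can enforce these tiers before launching a service. The sketch below classifies an IPv4 bind address using the private ranges 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16, and refuses wider binds unless an opt-in variable is set. The names `classify_bind`, `guard_bind`, `lan_rule`, and `PROXY_READY` are hypothetical, and `lan_rule` assumes `ufw` is the firewall frontend in use; it only prints the rule and does not apply it.

```shell
# Classify an IPv4 bind address into the tiers described above.
classify_bind() {
    case $1 in
        127.*)                                  echo local ;;
        0.0.0.0)                                echo all-interfaces ;;
        10.*|192.168.*)                         echo lan ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*)  echo lan ;;
        *)                                      echo public ;;
    esac
}

# Refuse a bind address that reaches beyond the LAN unless the operator
# sets PROXY_READY=1 (a hypothetical opt-in flag for this sketch).
guard_bind() {
    tier=$(classify_bind "$1")
    if [ "$tier" = all-interfaces ] || [ "$tier" = public ]; then
        [ "${PROXY_READY:-0}" = 1 ] && return 0
        echo "refusing $1 ($tier): set up reverse proxy and authentication first" >&2
        return 1
    fi
}

# Print (not run) a ufw rule that opens one TCP port to one subnet.
# Arguments: subnet in CIDR form, port number.
lan_rule() {
    printf 'ufw allow from %s to any port %s proto tcp\n' "$1" "$2"
}
```

For example, `guard_bind 192.168.1.10` succeeds silently, while `guard_bind 0.0.0.0` fails with a message. `lan_rule 192.168.1.0/24 11434` prints a rule for Ollama's default port that can be reviewed and then run with sudo. `ss -ltn` shows which addresses services are actually listening on.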
Resource Monitoring
Useful checks:
free -h
df -h /srv/ai
systemctl --failed
journalctl -p warning -n 100

For NVIDIA:

nvidia-smi

If nvidia-smi is unavailable, return to driver and kernel-module diagnostics before stacking inference services on top.
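For periodic checks, the disk and VRAM figures can be reduced to percentages. This is a rough sketch with illustrative function names; it relies on `df -P` for a stable column layout and on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`, and it falls back to a reminder when `nvidia-smi` is absent.

```shell
# Percentage of the filesystem holding path $1 that is in use (no % sign).
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Read "used, total" MiB pairs on stdin (one GPU per line) and print the
# percentage of VRAM in use for each, rounded down.
vram_use_percent() {
    awk -F', *' '$2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Print a short snapshot for the data root (default /srv/ai).
snapshot() {
    root=${1:-/srv/ai}
    echo "disk $root: $(disk_use_percent "$root")% used"
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=memory.used,memory.total \
            --format=csv,noheader,nounits | vram_use_percent
    else
        echo "nvidia-smi not found; check driver and kernel module first"
    fi
}
```

`snapshot` could be run by hand or from a timer; sustained VRAM usage near 100% is one hint that a model or context length is too large for the card.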
Backup Strategy
Prioritize:
- Inference service configuration
- Custom prompts, workflows, and model manifests
- Business data, vector stores, or embedding indexes
- Model files that cannot be downloaded again
Usually skip:
- Public models that can be downloaded again
- Build caches
- Temporary conversion files
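A model manifest can be as simple as a checksum list, which is small enough to back up even when the models themselves are skipped. The sketch below is one possible shape; `model_manifest` is an illustrative name, and the manifest does not record download sources, which still need to be noted separately.

```shell
# Print "digest  path" lines for every file under directory $1, sorted by
# path. The function body runs in a subshell, so the caller's working
# directory is unchanged.
model_manifest() (
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | sort -k 2
)
```

For example, `model_manifest /srv/ai/models > /srv/ai/models.sha256` writes the manifest outside the model directory. After re-downloading models, `cd /srv/ai/models && sha256sum -c ../models.sha256` checks that the files match what was recorded.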
Common Issues
| Issue | Check first |
|---|---|
| GPU is not detected | lspci, driver source, kernel module, Secure Boot |
| Inference is slow | CPU fallback, quantization, VRAM usage, context length |
| Service is unreachable | Bind address, firewall, reverse proxy, service logs |
| Disk fills quickly | Duplicate model downloads, cache directory, logs, conversion files |
| Update breaks services | Mixed driver/CUDA sources, polluted Python env, drifting container tags |
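For the "Disk fills quickly" row, duplicate model downloads can be found by comparing checksums. This is a simple sketch with an illustrative function name; it hashes every file, so it can take a while on a large model directory.

```shell
# Print "digest  path" lines for files under directory $1 whose content is
# identical to at least one other file there. Unique files are not printed.
find_duplicate_files() (
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | sort | awk '
        $1 == prev_hash { if (!shown) print prev_line; print; shown = 1; next }
        { prev_hash = $1; prev_line = $0; shown = 0 }'
)
```

Running `find_duplicate_files /srv/ai` groups identical files together by digest, which makes it easy to see when a runtime's cache and a manually downloaded copy hold the same model.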