llama.cpp
llama.cpp is a high-performance C++ library for local large language model inference. It provides both a CLI chat tool and a production-grade HTTP server with an OpenAI-compatible API. llama.cpp is the underlying engine behind many popular tools like Ollama and LM Studio, but using it directly gives you maximum control over model loading, quantization, and inference parameters.
Prerequisites
Before building llama.cpp, make sure you have the following:
- Debian 12 (Bookworm) or later -- 64-bit x86_64 or ARM64
- Build tools -- GCC, CMake, Make, and Git
- 4 GB RAM minimum -- More recommended for larger models
- GPU (optional) -- NVIDIA GPU with CUDA toolkit for GPU acceleration
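You can quickly confirm the toolchain and available resources before building; the commands below are standard system tools and not specific to llama.cpp:
# Check build tools, CPU core count, and memory
gcc --version
cmake --version
git --version
nproc
free -h
# If you plan to build with CUDA, confirm the NVIDIA driver is visible
nvidia-smi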
Installation
Step 1: Install Build Dependencies
# Install essential build tools
sudo apt update
sudo apt install -y build-essential cmake git pkg-config
Step 2: Clone the Repository
# Clone the llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Step 3: Build (CPU Only)
# Create a build directory and compile
cmake -B build
cmake --build build --config Release -j $(nproc)
# Verify the build
./build/bin/llama-cli --version
./build/bin/llama-server --version
Step 4: Build with CUDA Support (Optional)
If you have an NVIDIA GPU, you can build with CUDA acceleration for significantly faster inference.
# Install NVIDIA CUDA toolkit
sudo apt install -y nvidia-cuda-toolkit
# Build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
# Verify CUDA is enabled
./build/bin/llama-cli --version
TIP
For the latest CUDA toolkit, you can also install it directly from NVIDIA's repository. See NVIDIA CUDA Installation Guide for Debian packages.
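As a rough sketch of that route on Debian 12 (the repository path and keyring package version below are examples and may have changed; follow the linked guide for current values):
# Add NVIDIA's CUDA apt repository via the cuda-keyring package (version may differ)
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# Install the toolkit metapackage from NVIDIA's repository
sudo apt install -y cuda-toolkit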
Optional: Install System-Wide
# Install the built binaries to /usr/local
sudo cmake --install build
# Verify system-wide installation
llama-cli --version
llama-server --version
Configuration
Downloading GGUF Models
llama.cpp uses the GGUF model format. You can download models from Hugging Face.
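Recent llama.cpp builds can also fetch a GGUF directly from Hugging Face at startup with the -hf/--hf-file options (check ./build/bin/llama-cli --help to confirm your build has them); the manual wget approach below works with any build:
# Download and run a model straight from Hugging Face (cached locally on first use)
./build/bin/llama-cli \
-hf bartowski/Llama-3.2-1B-Instruct-GGUF \
--hf-file Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--conversation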
# Create a directory for models
mkdir -p ~/models
# Download a model using wget (example: Llama 3.2 1B Q4_K_M)
wget -O ~/models/llama-3.2-1b-q4_k_m.gguf \
"https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf"
# Download a larger model (example: Mistral 7B Q4_K_M)
wget -O ~/models/mistral-7b-q4_k_m.gguf \
"https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf"TIP
When choosing a quantization level:
- Q4_K_M -- Best balance of quality and size for most users
- Q5_K_M -- Higher quality, ~15% larger than Q4
- Q8_0 -- Near-original quality, significantly larger
- Q3_K_S -- Smallest size, noticeable quality reduction
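If you have a higher-precision GGUF on disk (for example an F16 conversion), the llama-quantize tool built alongside llama-cli can produce these levels yourself; a minimal sketch with example filenames:
# Requantize a full-precision GGUF to Q4_K_M (input/output names are examples)
./build/bin/llama-quantize \
~/models/my-model-f16.gguf \
~/models/my-model-q4_k_m.gguf \
Q4_K_M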
Model File Locations
| Source | URL pattern / Notes |
|---|---|
| Hugging Face | https://huggingface.co/{user}/{repo}/resolve/main/{filename}.gguf |
| TheBloke's Collection | https://huggingface.co/TheBloke -- Large collection of pre-quantized models |
| bartowski's Collection | https://huggingface.co/bartowski -- Up-to-date GGUF conversions |
Usage
CLI Chat (llama-cli)
Use llama-cli for interactive chat sessions in the terminal.
# Start an interactive chat session
./build/bin/llama-cli \
-m ~/models/llama-3.2-1b-q4_k_m.gguf \
--conversation \
-ngl 99
# Run with a specific prompt
./build/bin/llama-cli \
-m ~/models/llama-3.2-1b-q4_k_m.gguf \
-p "Explain the Debian package management system." \
-n 256
# Run with custom parameters
./build/bin/llama-cli \
-m ~/models/llama-3.2-1b-q4_k_m.gguf \
--conversation \
-ngl 99 \
--temp 0.7 \
--top-p 0.9 \
-c 4096
Key parameters:
| Flag | Description |
|---|---|
| -m | Path to the GGUF model file |
| -ngl | Number of layers to offload to the GPU (use 99 for all layers) |
| -c | Context size in tokens (default: 2048) |
| --temp | Temperature for sampling (0.0 = deterministic, 1.0 = creative) |
| --top-p | Top-p (nucleus) sampling threshold |
| -n | Maximum number of tokens to generate |
| -t | Number of CPU threads to use |
| --conversation | Enable interactive conversation mode |
HTTP Server (llama-server)
llama-server provides a production-grade HTTP server with an OpenAI-compatible API and a built-in web UI.
# Start the server with a model
./build/bin/llama-server \
-m ~/models/llama-3.2-1b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 4096
# Start with multiple slots for concurrent requests
./build/bin/llama-server \
-m ~/models/llama-3.2-1b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 4096 \
--parallel 4
With --parallel, the configured context is shared across slots (here 4096 / 4 = 1024 tokens per slot), so increase -c if concurrent requests need longer contexts.
After starting the server, open http://localhost:8080 in your browser for the built-in web UI.
# Test the server with curl (OpenAI-compatible endpoint)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"messages": [
{"role": "user", "content": "What is Debian?"}
]
}'
# Generate a text completion
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"prompt": "Debian is a",
"max_tokens": 100
}'
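# Stream the response as it is generated (server-sent events; -N disables curl buffering).
# "stream": true follows the OpenAI-compatible API; this is an illustrative sketch.
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-1b",
"messages": [
{"role": "user", "content": "What is Debian?"}
],
"stream": true
}'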
# Check server health
curl http://localhost:8080/health
Running as a Systemd Service
Create a systemd service to run llama-server in the background.
# Create a systemd service file
sudo tee /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp HTTP Server
After=network.target
[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/llama-server \
-m /home/user/models/llama-3.2-1b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
-c 4096
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
# Check the service status
sudo systemctl status llama-server
Update
# Navigate to the llama.cpp directory
cd llama.cpp
# Pull the latest changes
git pull origin master
# Rebuild
cmake -B build
cmake --build build --config Release -j $(nproc)
# If installed system-wide, reinstall
sudo cmake --install build
Troubleshooting
Build fails with missing dependencies
# Install all required build dependencies
sudo apt install -y build-essential cmake git pkg-config
# If CUDA build fails, verify CUDA toolkit installation
nvcc --version
dpkg -l | grep nvidia-cuda-toolkit
CUDA not found during build
# Install CUDA toolkit
sudo apt install -y nvidia-cuda-toolkit
# Set CUDA path if cmake cannot find it
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc
cmake --build build --config Release -j $(nproc)
Model loading fails
# Verify the model file exists and is not corrupted
ls -lh ~/models/llama-3.2-1b-q4_k_m.gguf
# Check file integrity (compare the checksum with the value shown on the model's Hugging Face file page; re-download if it differs)
sha256sum ~/models/llama-3.2-1b-q4_k_m.gguf
# Try loading with verbose output
./build/bin/llama-cli -m ~/models/llama-3.2-1b-q4_k_m.gguf --verbose-prompt
Out of memory when loading large models
# Reduce the context size
./build/bin/llama-cli -m ~/models/large-model.gguf -c 2048 -ngl 99
# Use partial GPU offloading (offload fewer layers)
./build/bin/llama-cli -m ~/models/large-model.gguf -ngl 20
# Use a more aggressively quantized model (Q3 or Q4 instead of Q8)
Slow inference on CPU
# Specify the number of threads to match your CPU cores
./build/bin/llama-cli -m ~/models/model.gguf -t $(nproc)
# Lock the model in RAM to avoid swapping (memory-mapped loading is already the default)
./build/bin/llama-cli -m ~/models/model.gguf --mlock
Related Resources
- AI Tools Overview -- Overview of all AI tools on Debian
- Ollama -- User-friendly wrapper around llama.cpp
- llamafile -- Single-file executable LLM based on llama.cpp
- llama.cpp GitHub -- Official repository and documentation