Ollama Setup Guide
This guide provides comprehensive instructions for setting up Ollama for local LLM inference with pr-resolve.
See Also:
LLM Configuration Guide for advanced configuration options and presets
Privacy Architecture for privacy benefits and compliance
Local LLM Operation Guide for local LLM setup
Privacy FAQ for common privacy questions
Why Ollama?
Ollama provides several advantages for local LLM inference:
Privacy & Local LLM Processing 🔒
🔒 Reduced Exposure: Eliminates LLM vendor (OpenAI/Anthropic) from access chain
🌐 GitHub API Required: Internet needed to fetch PR data (not offline/air-gapped)
✅ Simpler Compliance: One fewer data processor for GDPR, HIPAA, SOC2
⚠️ Reality Check: Code is on GitHub, CodeRabbit has access (required)
🔍 Verifiable: Localhost-only LLM operation can be proven with network monitoring
Performance & Cost
💰 Free: No API costs - runs entirely on your hardware (zero ongoing fees)
⚡ Fast: Local inference with GPU acceleration (NVIDIA, AMD, Apple Silicon)
📦 Simple: Easy installation and model management
Recommended For
Reducing third-party LLM vendor exposure (eliminate OpenAI/Anthropic)
Regulated industries (simpler compliance with one fewer data processor)
Organizations with policies against cloud LLM services
Cost-conscious usage (no per-request LLM fees)
Development and testing
Trade-offs
Requires local compute resources (8-16GB RAM, 10-20GB disk)
Slower than cloud APIs on CPU-only systems (fast with GPU)
Model quality varies (improving rapidly, generally lower than GPT-4/Claude)
Learn More About Privacy
For detailed information about Ollama's privacy benefits:
Privacy Architecture - Comprehensive privacy analysis
Local LLM Operation Guide - Local LLM setup procedures
Privacy FAQ - Common questions about privacy and local LLM operation
Privacy Verification - Verify localhost-only LLM operation
Quick Start
The fastest way to get started with Ollama:
# 1. Install and setup Ollama
./scripts/setup_ollama.sh
# 2. Download recommended model
./scripts/download_ollama_models.sh
# 3. Use with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local
That's it! The scripts handle everything automatically.
Installation
Automated Installation (Recommended)
Use the provided setup script for automatic installation:
./scripts/setup_ollama.sh
This script:
Detects your operating system (Linux, macOS, Windows/WSL)
Checks for existing Ollama installation
Downloads and installs Ollama using the official installer
Starts the Ollama service
Verifies the installation with health checks
Options:
./scripts/setup_ollama.sh --help
--skip-install: Skip installation if Ollama is already present
--skip-start: Skip starting the Ollama service
Manual Installation
Linux
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Start service
ollama serve
macOS
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Or use Homebrew
brew install ollama
# Start service
ollama serve
Windows (WSL)
# In WSL terminal
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
Verifying Installation
Check that Ollama is running:
# Check version
ollama --version
# List models (should work even if empty)
ollama list
# Test API health
curl http://localhost:11434/api/tags
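If you prefer to script this check (for example in CI), the same verification can be done against Ollama's REST API. The following is a minimal sketch using only the requests package and the documented /api/version and /api/tags endpoints; the check_ollama helper name is illustrative, not part of pr-resolve:
import requests

OLLAMA_URL = "http://localhost:11434"

def check_ollama(base_url: str = OLLAMA_URL) -> None:
    # /api/version confirms the daemon is reachable
    version = requests.get(f"{base_url}/api/version", timeout=5).json()
    print(f"Ollama version: {version.get('version', 'unknown')}")
    # /api/tags lists locally available models (an empty list is fine)
    tags = requests.get(f"{base_url}/api/tags", timeout=5).json()
    models = [m["name"] for m in tags.get("models", [])]
    print(f"Installed models: {models or 'none yet'}")

if __name__ == "__main__":
    check_ollama()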
Model Selection
Interactive Model Download
Use the interactive script to download models with recommendations:
./scripts/download_ollama_models.sh
Features:
Interactive menu with recommendations
Model size and quality information
Disk space checking
Shows already downloaded models
Direct Model Download
Download a specific model directly:
# Using script
./scripts/download_ollama_models.sh qwen2.5-coder:7b
# Using ollama CLI
ollama pull qwen2.5-coder:7b
Recommended Models
For code conflict resolution, we recommend:
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| qwen2.5-coder:7b ⭐ | ~4GB | Fast | Good | Default choice - Best balance |
| qwen2.5-coder:14b | ~8GB | Medium | Better | Higher quality, more RAM |
| qwen2.5-coder:32b | ~18GB | Slow | Best | Maximum quality, powerful hardware |
| codellama:7b | ~4GB | Fast | Good | Alternative code-focused model |
| codellama:13b | ~7GB | Medium | Better | Larger CodeLlama variant |
| deepseek-coder:6.7b | ~4GB | Fast | Good | Code specialist |
| mistral:7b | ~4GB | Fast | Good | General-purpose alternative |
⭐ Default preset: qwen2.5-coder:7b - Excellent for code tasks with minimal resource usage.
Model Comparison
qwen2.5-coder:7b vs codellama:7b:
Qwen 2.5 Coder: Better at code understanding and multi-language support
CodeLlama: Strong at Python and code generation
Recommendation: Start with qwen2.5-coder:7b
7B vs 14B vs 32B:
7B: Fast, suitable for most conflicts, 8-16GB RAM
14B: Better quality, complex conflicts, 16-32GB RAM
32B: Best quality, very complex conflicts, 32GB+ RAM
Hardware Requirements
| Model Size | RAM | Disk Space | Speed (Inference) |
|---|---|---|---|
| 7B | 8-16GB | ~5GB | ~1-3 tokens/sec (CPU) |
| 14B | 16-32GB | ~10GB | ~0.5-1 tokens/sec (CPU) |
| 32B | 32GB+ | ~20GB | ~0.2-0.5 tokens/sec (CPU) |
With GPU (NVIDIA):
7B: 6GB+ VRAM → 50-100 tokens/sec
14B: 12GB+ VRAM → 30-60 tokens/sec
32B: 24GB+ VRAM → 20-40 tokens/sec
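To sanity-check a machine against these figures before downloading anything, free RAM and disk space can be compared programmatically. This is a rough sketch, not part of pr-resolve: it assumes psutil is installed, and the thresholds are the approximate values from the tables above rather than hard limits:
import shutil
import psutil

# Approximate requirements (GiB), taken from the tables above
REQUIREMENTS = {
    "7b": {"ram": 8, "disk": 5},
    "14b": {"ram": 16, "disk": 10},
    "32b": {"ram": 32, "disk": 20},
}

def can_run(size: str, model_dir: str = "/") -> bool:
    req = REQUIREMENTS[size]
    free_ram = psutil.virtual_memory().available / 1024**3
    free_disk = shutil.disk_usage(model_dir).free / 1024**3
    ok = free_ram >= req["ram"] and free_disk >= req["disk"]
    print(f"{size}: {free_ram:.1f} GiB RAM free (need {req['ram']}), "
          f"{free_disk:.1f} GiB disk free (need {req['disk']}) -> "
          f"{'OK' if ok else 'insufficient'}")
    return ok

for size in REQUIREMENTS:
    can_run(size)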
Configuration Options
Using Ollama with pr-resolve
1. Preset (Easiest)
pr-resolve apply 123 --llm-preset ollama-local
Uses default settings:
Model: qwen2.5-coder:7b
Base URL: http://localhost:11434
Auto-download: Disabled
2. Custom Model
pr-resolve apply 123 \
--llm-preset ollama-local \
--llm-model codellama:13b
3. Configuration File
Create config.yaml:
llm:
  enabled: true
  provider: ollama
  model: qwen2.5-coder:7b
  ollama_base_url: http://localhost:11434
  max_tokens: 2000
  cache_enabled: true
  fallback_to_regex: true
Use with:
pr-resolve apply 123 --config config.yaml
4. Environment Variables
# Set Ollama configuration
export CR_LLM_PROVIDER=ollama
export CR_LLM_MODEL=qwen2.5-coder:7b
export OLLAMA_BASE_URL=http://localhost:11434
# Run pr-resolve
pr-resolve apply 123 --llm-enabled
Remote Ollama Server
If Ollama is running on a different machine:
# Set base URL
export OLLAMA_BASE_URL=http://ollama-server:11434
# Or use config file
pr-resolve apply 123 --config config.yaml
config.yaml:
llm:
  enabled: true
  provider: ollama
  model: qwen2.5-coder:7b
  ollama_base_url: http://ollama-server:11434
Auto-Download Feature
The auto-download feature automatically downloads models when they're not available locally.
Enabling Auto-Download
Via Python API:
from review_bot_automator.llm.providers.ollama import OllamaProvider
# Auto-download enabled
provider = OllamaProvider(
model="qwen2.5-coder:7b",
auto_download=True # Downloads model if not available
)
Behavior:
When auto_download=True: Missing models are downloaded automatically (may take several minutes)
When auto_download=False (default): Raises an error with installation instructions
Use Cases:
Automated CI/CD pipelines
First-time setup automation
Switching between models frequently
Note: Auto-download is not currently exposed via CLI flags. Use the interactive script or manual ollama pull for CLI usage.
Model Information
Get information about a model:
provider = OllamaProvider(model="qwen2.5-coder:7b")
# Get model info
info = provider._get_model_info("qwen2.5-coder:7b")
print(info) # Dict with size, parameters, etc.
# Get recommended models
models = OllamaProvider.list_recommended_models()
for model in models:
print(f"{model['name']}: {model['description']}")
Troubleshooting
Ollama Not Running
Error:
LLMAPIError: Ollama is not running or not reachable. Start Ollama with: ollama serve
Solution:
# Start Ollama service
ollama serve
# Or use setup script
./scripts/setup_ollama.sh --skip-install
Model Not Found
Error:
LLMConfigurationError: Model 'qwen2.5-coder:7b' not found in Ollama.
Install it with: ollama pull qwen2.5-coder:7b
Solution:
# Download model
./scripts/download_ollama_models.sh qwen2.5-coder:7b
# Or use ollama CLI
ollama pull qwen2.5-coder:7b
# Or enable auto-download (Python API only)
provider = OllamaProvider(model="qwen2.5-coder:7b", auto_download=True)
Slow Performance
Symptoms: Generation takes a very long time (>30 seconds per request).
Solutions:
Use GPU acceleration (NVIDIA):
# Check GPU is detected
ollama ps
# Should show GPU info in output
Use smaller model:
# Switch from 14B to 7B
pr-resolve apply 123 \
  --llm-preset ollama-local \
  --llm-model qwen2.5-coder:7b
Close other applications to free up RAM
Check CPU usage: Ensure Ollama has CPU resources
Out of Memory
Error:
Ollama model loading failed: not enough memory
Solutions:
Use smaller model:
ollama pull qwen2.5-coder:7b # Instead of 14b or 32b
Close other applications to free up RAM
Use quantized model (if available):
ollama pull qwen2.5-coder:7b-q4_0 # 4-bit quantization
Connection Pool Exhausted
Error:
LLMAPIError: Connection pool exhausted - too many concurrent requests
Cause: More than 10 concurrent requests to Ollama.
Solutions:
Reduce concurrency: Process fewer requests simultaneously
Increase pool size (Python API):
# Not currently configurable - requires code change
# Pool size is hardcoded to 10 in HTTPAdapter
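If you drive the provider from the Python API yourself, the practical fix is the first solution: cap the number of in-flight requests on your side. A minimal sketch, assuming one provider call per conflict; the resolve_one helper and the conflicts list are hypothetical placeholders for your own code:
from concurrent.futures import ThreadPoolExecutor

from review_bot_automator.llm.providers.ollama import OllamaProvider

provider = OllamaProvider(model="qwen2.5-coder:7b")

def resolve_one(conflict):
    # Placeholder: call whichever provider method you use per conflict
    ...

conflicts = []  # your list of conflicts

# Keep at most 4 requests in flight, well under the pool limit of 10
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(resolve_one, conflicts))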
Port Already in Use
Error:
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Solutions:
Check existing Ollama process:
ps aux | grep ollama
killall ollama   # Stop existing instance
ollama serve     # Start new instance
Use different port:
OLLAMA_HOST=0.0.0.0:11435 ollama serve
# Update configuration
export OLLAMA_BASE_URL=http://localhost:11435
Model Download Failed
Error:
Failed to download model: connection timeout
Solutions:
Check internet connection
Retry with manual pull:
ollama pull qwen2.5-coder:7b
Check disk space:
df -h # Ensure at least 10GB free
Advanced Usage
Custom Ollama Configuration
Change default model directory:
# Set model storage location
export OLLAMA_MODELS=/path/to/models
# Start Ollama
ollama serve
Enable debug logging:
# Enable verbose output
export OLLAMA_DEBUG=1
ollama serve
Multiple Models
Use different models for different tasks:
# Download multiple models
ollama pull qwen2.5-coder:7b
ollama pull codellama:13b
ollama pull mistral:7b
# Use specific model
pr-resolve apply 123 --llm-preset ollama-local --llm-model codellama:13b
Model Management
# List downloaded models
ollama list
# Show model info
ollama show qwen2.5-coder:7b
# Remove model
ollama rm mistral:7b
# Copy model with custom name
ollama cp qwen2.5-coder:7b my-custom-model
Running as System Service
Linux (systemd):
# Create service file
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network.target
[Service]
Type=simple
User=$USER
ExecStart=/usr/local/bin/ollama serve
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl enable ollama
sudo systemctl start ollama
# Check status
sudo systemctl status ollama
macOS (launchd):
# Ollama includes launchd service by default
# Check if running
launchctl list | grep ollama
# Start service
launchctl start com.ollama.ollama
GPU Acceleration
GPU acceleration provides 10-60x speedup compared to CPU-only inference. The pr-resolve tool automatically detects and displays GPU information when using Ollama.
Automatic GPU Detection
Starting with version 0.3.0, pr-resolve automatically detects GPU availability when initializing Ollama:
# Run conflict resolution
pr-resolve apply 123 --llm-preset ollama-local
# GPU info displayed in metrics (if detected)
# ╭─ LLM Metrics ──────────────────────────╮
# │ Provider: ollama (qwen2.5-coder:7b)    │
# │ Hardware: NVIDIA RTX 4090 (24GB)       │
# │ Changes Parsed: 5                      │
# │ ...                                    │
# ╰────────────────────────────────────────╯
Detection supports multiple platforms:
NVIDIA GPUs: CUDA 11.0+ (automatically detected via nvidia-smi)
AMD GPUs: ROCm 5.0+ (automatically detected via rocm-smi)
Apple Silicon: M1/M2/M3/M4 with Metal (automatically detected on macOS)
CPU Fallback: Gracefully falls back if no GPU detected
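For reference, the detection is conceptually similar to probing the same platform tools used for manual verification later in this section (nvidia-smi, rocm-smi, sysctl). The standard-library sketch below is an illustrative approximation, not the actual pr-resolve implementation:
import platform
import subprocess

def detect_gpu() -> str:
    def run(cmd):
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=5).stdout.strip()
        except (OSError, subprocess.SubprocessError):
            return ""

    # NVIDIA: nvidia-smi can report the GPU name directly
    out = run(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"])
    if out:
        return f"NVIDIA {out.splitlines()[0]}"

    # AMD: rocm-smi reports the product name
    out = run(["rocm-smi", "--showproductname"])
    if out:
        return f"AMD GPU ({out.splitlines()[-1]})"

    # Apple Silicon: the CPU brand string identifies M-series chips
    if platform.system() == "Darwin":
        out = run(["sysctl", "-n", "machdep.cpu.brand_string"])
        if "Apple" in out:
            return f"{out} (Metal)"

    return "CPU (no GPU detected)"

print(detect_gpu())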
NVIDIA GPU Setup (CUDA)
Prerequisites:
# 1. Verify NVIDIA driver
nvidia-smi
# Should show driver version and GPU info
# Recommended: Driver 525+ (CUDA 12+)
Installation (if nvidia-smi not found):
Ubuntu/Debian:
# Install NVIDIA drivers
sudo ubuntu-drivers autoinstall
# Reboot required
sudo reboot
# Verify
nvidia-smi
Fedora/RHEL:
# Install NVIDIA drivers
sudo dnf install akmod-nvidia
# Reboot required
sudo reboot
# Verify
nvidia-smi
Verification:
# Check Ollama GPU detection
ollama ps
# Should show
# NAME ID SIZE PROCESSOR
# qwen2.5-coder:7b abc123... 4.7 GB 100% GPU
# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local
# Check metrics output for GPU info
Performance Expectations:
RTX 3060 (12GB): ~50-70 tokens/sec with 7B models
RTX 3090 (24GB): ~70-100 tokens/sec with 7B models, ~40-60 tokens/sec with 14B
RTX 4090 (24GB): ~100-150 tokens/sec with 7B models, ~60-90 tokens/sec with 14B
AMD GPU Setup (ROCm)
Prerequisites:
AMD GPU with ROCm support (RX 6000/7000 series, MI series)
ROCm 5.0 or newer
Installation:
# Follow AMD ROCm installation guide
# https://github.com/ollama/ollama/blob/main/docs/gpu.md
# Verify
rocm-smi --showproductname
# Should display AMD GPU info
Verification:
# Check Ollama GPU detection
ollama ps
# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local
Apple Silicon Setup (Metal)
Automatic Detection: No setup required - Ollama automatically uses Metal acceleration on Apple Silicon Macs.
Supported Chips:
M1, M1 Pro, M1 Max, M1 Ultra
M2, M2 Pro, M2 Max, M2 Ultra
M3, M3 Pro, M3 Max
M4, M4 Pro, M4 Max
Verification:
# Check chip
sysctl -n machdep.cpu.brand_string
# Should show "Apple M1/M2/M3/M4"
# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local
# Metrics will show
# Hardware: Apple M3 Max (Metal)
Performance Notes:
M1/M2 8GB: Good for 7B models
M1/M2 Pro/Max 16GB+: Excellent for 7B-14B models
M1/M2 Ultra 64GB+: Handles 32B models well
Unified memory shared between CPU and GPU
Troubleshooting GPU Detection
GPU Not Detected (Shows "Hardware: CPU"):
Verify GPU is available:
# NVIDIA
nvidia-smi
# AMD
rocm-smi --showproductname
# Apple Silicon
sysctl -n machdep.cpu.brand_string
Check Ollama GPU usage:
ollama ps
# The PROCESSOR column should show "GPU", not "CPU"
# If it shows CPU, Ollama isn't using the GPU
Restart Ollama to detect GPU:
# Stop Ollama
killall ollama
# Start Ollama (GPU detection happens on startup)
ollama serve
# Reload model to use GPU
ollama pull qwen2.5-coder:7b --force
Check CUDA/ROCm installation:
# NVIDIA: Check CUDA
nvcc --version
# AMD: Check ROCm
rocminfo
GPU Detected but Slow Performance:
Check GPU memory:
# NVIDIA
nvidia-smi
# Look for "Memory-Usage" - should have enough free VRAM
# 7B models need ~6GB, 14B need ~12GB
Close competing GPU processes:
# NVIDIA: List GPU processes
nvidia-smi
# AMD: List processes
rocm-smi --showpids
Use smaller model if out of VRAM:
# 7B instead of 14B
pr-resolve apply 123 \
  --llm-preset ollama-local \
  --llm-model qwen2.5-coder:7b
Mixed CPU/GPU Usage:
If model is too large for GPU VRAM, Ollama may split between GPU and CPU (slower):
# Check split in ollama ps
ollama ps
# May show: "50% GPU" instead of "100% GPU"
# Solution: Use smaller model
GPU Performance Monitoring
During Resolution:
# Terminal 1: Run pr-resolve
pr-resolve apply 123 --llm-preset ollama-local
# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi # NVIDIA
# OR
watch -n 1 rocm-smi # AMD
Check pr-resolve Metrics:
# After resolution completes
# Look for metrics panel in output
╭─ LLM Metrics ──────────────────────────╮
│ Provider: ollama (qwen2.5-coder:7b)    │
│ Hardware: NVIDIA RTX 4090 (24GB)       │  ← GPU detected
│ Changes Parsed: 5                      │
│ Avg Confidence: 0.92                   │
│ Cache Hit Rate: 0%                     │
│ Total Cost: $0.00                      │
│ API Calls: 5                           │
│ Total Tokens: 12,450                   │
╰────────────────────────────────────────╯
No GPU Info Displayed:
If GPU info is not shown in metrics, it means:
No GPU detected (CPU-only system)
GPU detection failed (non-fatal, falls back to CPU)
Using cloud LLM provider (GPU info only for Ollama)
GPU Acceleration Benefits
Performance Comparison (qwen2.5-coder:7b):
| Hardware | Tokens/sec | Time for 1000 tokens |
|---|---|---|
| CPU (i7-12700K) | 1-3 | 5-15 minutes |
| RTX 3060 (12GB) | 50-70 | 15-20 seconds |
| RTX 4090 (24GB) | 100-150 | 7-10 seconds |
| M2 Max (96GB) | 40-60 | 15-25 seconds |
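To turn throughput into wall-clock expectations for a full run, a quick calculation helps. The sketch below assumes roughly 2,500 generated tokens per PR (an illustrative figure, not a measurement) and uses midpoints of the throughput ranges above:
# Rough time-per-PR estimate; the token count is an assumption, not a benchmark
THROUGHPUT = {  # tokens/sec, midpoints of the table above
    "CPU (i7-12700K)": 2,
    "RTX 3060": 60,
    "RTX 4090": 125,
    "M2 Max": 50,
}
TOKENS_PER_PR = 2500  # assumed total generated tokens per resolution run

for hw, tps in THROUGHPUT.items():
    minutes = TOKENS_PER_PR / tps / 60
    print(f"{hw}: ~{minutes:.1f} min per PR")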
Cost Savings:
GPU: Free (local hardware)
API (Claude/GPT-4): ~$0.01-0.05 per resolution
Recommendation: For frequent usage, a $300-500 GPU pays for itself in API savings within months.
Performance Tuning
Adjust context size:
# config.yaml
llm:
  max_tokens: 4000  # Increase for larger conflicts
Adjust timeout:
provider = OllamaProvider(
model="qwen2.5-coder:7b",
timeout=300 # 5 minutes for slow systems
)
See Also
LLM Configuration Guide - Advanced configuration options
Configuration Guide - General configuration documentation
Getting Started Guide - Quick start guide
Ollama Documentation - Official Ollama docs