Ollama Setup Guide

This guide provides comprehensive instructions for setting up Ollama for local LLM inference with pr-resolve.

Why Ollama?

Ollama provides several advantages for local LLM inference:

Privacy & Local LLM Processing 🔒

  • 🔒 Reduced Exposure: Removes the LLM vendor (OpenAI/Anthropic) from the data access chain

  • 🌐 GitHub API Required: Internet access is still needed to fetch PR data (this is not an offline/air-gapped setup)

  • ✅ Simpler Compliance: One fewer data processor to account for under GDPR, HIPAA, or SOC2

  • ⚠️ Reality Check: Your code is still on GitHub and CodeRabbit still has access (both are required)

  • 🔍 Verifiable: Localhost-only LLM operation can be confirmed with network monitoring (see the check below)
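
One way to spot-check this on a Linux host (a minimal sketch, assuming the ss tool from iproute2 is installed):

# The Ollama API should only be listening on the loopback interface
ss -tlnp | grep 11434
# Expect a LISTEN socket bound to 127.0.0.1:11434, not 0.0.0.0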

Performance & Cost

  • 💰 Free: No API costs - runs entirely on your hardware (zero ongoing fees)

  • ⚡ Fast: Local inference with GPU acceleration (NVIDIA, AMD, Apple Silicon)

  • 📦 Simple: Easy installation and model management

Trade-offs

  • Requires local compute resources (8-16GB RAM, 10-20GB disk)

  • Slower than cloud APIs on CPU-only systems (fast with GPU)

  • Model quality varies (improving rapidly, generally lower than GPT-4/Claude)

Learn More About Privacy

For detailed information about Ollama's privacy benefits:

Quick Start

The fastest way to get started with Ollama:

# 1. Install and setup Ollama
./scripts/setup_ollama.sh

# 2. Download recommended model
./scripts/download_ollama_models.sh

# 3. Use with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local

That's it! The scripts handle everything automatically.

Installation

Manual Installation

Linux

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start service
ollama serve

macOS

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Or use Homebrew
brew install ollama

# Start service
ollama serve

Windows (WSL)

# In WSL terminal
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve

Verifying Installation

Check that Ollama is running:

# Check version
ollama --version

# List models (should work even if empty)
ollama list

# Test API health
curl http://localhost:11434/api/tags
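
If the server is up, the /api/tags endpoint returns a JSON object describing the installed models. A minimal scripted health check (a sketch, assuming curl and python3 are available) could look like:

# Print how many models are installed; fails noisily if the API is unreachable
curl -sf http://localhost:11434/api/tags \
  | python3 -c "import json, sys; print(len(json.load(sys.stdin)['models']), 'models installed')"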

Model Selection

Interactive Model Download

Use the interactive script to download models with recommendations:

./scripts/download_ollama_models.sh

Features:

  • Interactive menu with recommendations

  • Model size and quality information

  • Disk space checking

  • Shows already downloaded models

Direct Model Download

Download a specific model directly:

# Using script
./scripts/download_ollama_models.sh qwen2.5-coder:7b

# Using ollama CLI
ollama pull qwen2.5-coder:7b

Model Comparison

qwen2.5-coder:7b vs codellama:7b:

  • Qwen 2.5 Coder: Better at code understanding and multi-language support

  • CodeLlama: Strong at Python and code generation

  • Recommendation: Start with qwen2.5-coder:7b

7B vs 14B vs 32B:

  • 7B: Fast, suitable for most conflicts, 8-16GB RAM

  • 14B: Better quality, complex conflicts, 16-32GB RAM

  • 32B: Best quality, very complex conflicts, 32GB+ RAM
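
Before picking a size, it is worth checking available memory and disk space. A quick sketch for Linux (this assumes models live in the default ~/.ollama directory when Ollama runs as your user; see OLLAMA_MODELS below for overrides):

# Free memory right now
free -h

# Disk space where models are stored (falls back to home if ~/.ollama doesn't exist yet)
df -h ~/.ollama 2>/dev/null || df -h ~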

Hardware Requirements

Model Size   RAM       Disk Space   Speed (Inference)
7B           8-16GB    ~5GB         ~1-3 tokens/sec (CPU)
14B          16-32GB   ~10GB        ~0.5-1 tokens/sec (CPU)
32B          32GB+     ~20GB        ~0.2-0.5 tokens/sec (CPU)

With GPU (NVIDIA):

  • 7B: 6GB+ VRAM → 50-100 tokens/sec

  • 14B: 12GB+ VRAM → 30-60 tokens/sec

  • 32B: 24GB+ VRAM → 20-40 tokens/sec
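
To see how much VRAM is actually available before choosing a size (NVIDIA only; assumes nvidia-smi is installed):

# Total and currently free VRAM per GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv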

Configuration Options

Using Ollama with pr-resolve

1. Preset (Easiest)

pr-resolve apply 123 --llm-preset ollama-local

Uses default settings:

  • Model: qwen2.5-coder:7b

  • Base URL: http://localhost:11434

  • Auto-download: Disabled

2. Custom Model

pr-resolve apply 123 \
  --llm-preset ollama-local \
  --llm-model codellama:13b

3. Configuration File

Create config.yaml:

llm:
  enabled: true
  provider: ollama
  model: qwen2.5-coder:7b
  ollama_base_url: http://localhost:11434
  max_tokens: 2000
  cache_enabled: true
  fallback_to_regex: true

Use with:

pr-resolve apply 123 --config config.yaml

4. Environment Variables

# Set Ollama configuration
export CR_LLM_PROVIDER=ollama
export CR_LLM_MODEL=qwen2.5-coder:7b
export OLLAMA_BASE_URL=http://localhost:11434

# Run pr-resolve
pr-resolve apply 123 --llm-enabled

Remote Ollama Server

If Ollama is running on a different machine:

# Set base URL
export OLLAMA_BASE_URL=http://ollama-server:11434

# Or use config file
pr-resolve apply 123 --config config.yaml

config.yaml:

llm:
  enabled: true
  provider: ollama
  model: qwen2.5-coder:7b
  ollama_base_url: http://ollama-server:11434
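
Note that Ollama only listens on localhost by default. On the server machine it must be started bound to a reachable interface (a sketch; this exposes the API to your network, so restrict access with a firewall):

# On the Ollama server machine
OLLAMA_HOST=0.0.0.0:11434 ollama serve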

Auto-Download Feature

The auto-download feature automatically downloads models when they're not available locally.

Enabling Auto-Download

Via Python API:

from review_bot_automator.llm.providers.ollama import OllamaProvider

# Auto-download enabled
provider = OllamaProvider(
    model="qwen2.5-coder:7b",
    auto_download=True  # Downloads model if not available
)

Behavior:

  • When auto_download=True: Missing models are downloaded automatically (may take several minutes)

  • When auto_download=False (default): Raises error with installation instructions

Use Cases:

  • Automated CI/CD pipelines

  • First-time setup automation

  • Switching between models frequently

Note: Auto-download is not currently exposed via CLI flags. Use the interactive script or manual ollama pull for CLI usage.
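
For CI/CD pipelines, the equivalent is simply to pre-pull the model in a pipeline step before invoking pr-resolve. A minimal sketch, assuming Ollama is already installed on the runner:

# Start the server in the background, give it a moment, then pre-pull the model
ollama serve &
sleep 5
ollama pull qwen2.5-coder:7b

# Run the resolution as usual
pr-resolve apply 123 --llm-preset ollama-local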

Model Information

Get information about a model:

from review_bot_automator.llm.providers.ollama import OllamaProvider

provider = OllamaProvider(model="qwen2.5-coder:7b")

# Get model info
info = provider._get_model_info("qwen2.5-coder:7b")
print(info)  # Dict with size, parameters, etc.

# Get recommended models
models = OllamaProvider.list_recommended_models()
for model in models:
    print(f"{model['name']}: {model['description']}")

Troubleshooting

Ollama Not Running

Error:

LLMAPIError: Ollama is not running or not reachable. Start Ollama with: ollama serve

Solution:

# Start Ollama service
ollama serve

# Or use setup script
./scripts/setup_ollama.sh --skip-install

Model Not Found

Error:

LLMConfigurationError: Model 'qwen2.5-coder:7b' not found in Ollama.
Install it with: ollama pull qwen2.5-coder:7b

Solution:

# Download model
./scripts/download_ollama_models.sh qwen2.5-coder:7b

# Or use ollama CLI
ollama pull qwen2.5-coder:7b

# Or enable auto-download (Python API only)
provider = OllamaProvider(model="qwen2.5-coder:7b", auto_download=True)

Slow Performance

Symptoms: Generation takes a very long time (>30 seconds per request).

Solutions:

  1. Use GPU acceleration (NVIDIA):

    # Check GPU is detected
    ollama ps
    
    # Should show GPU info in output
    
  2. Use smaller model:

    # Switch from 14B to 7B
    pr-resolve apply 123 \
      --llm-preset ollama-local \
      --llm-model qwen2.5-coder:7b
    
  3. Close other applications to free up RAM

  4. Check CPU usage: Make sure the Ollama process isn't starved for CPU; one rough check (assuming ps and pgrep are available):
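
    # Rough check that the ollama process is actually getting CPU time
    ps -o pid,%cpu,%mem,command -p "$(pgrep -d, ollama)"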

Out of Memory

Error:

Ollama model loading failed: not enough memory

Solutions:

  1. Use smaller model:

    ollama pull qwen2.5-coder:7b  # Instead of 14b or 32b
    
  2. Close other applications to free up RAM

  3. Use quantized model (if available):

    ollama pull qwen2.5-coder:7b-q4_0  # 4-bit quantization
    

Connection Pool Exhausted

Error:

LLMAPIError: Connection pool exhausted - too many concurrent requests

Cause: More than 10 concurrent requests to Ollama.

Solutions:

  1. Reduce concurrency: Process fewer requests simultaneously

  2. Increase pool size (Python API):

    # Not currently configurable - requires code change
    # Pool size is hardcoded to 10 in HTTPAdapter
    
    

Port Already in Use

Error:

Error: listen tcp 127.0.0.1:11434: bind: address already in use

Solutions:

  1. Check existing Ollama process:

    ps aux | grep ollama
    killall ollama  # Stop existing instance
    ollama serve    # Start new instance
    
  2. Use different port:

    OLLAMA_HOST=127.0.0.1:11435 ollama serve  # bind to localhost on an alternate port
    
    # Update configuration
    export OLLAMA_BASE_URL=http://localhost:11435
    

Model Download Failed

Error:

Failed to download model: connection timeout

Solutions:

  1. Check internet connection

  2. Retry with manual pull:

    ollama pull qwen2.5-coder:7b
    
  3. Check disk space:

    df -h  # Ensure at least 10GB free
    

Advanced Usage

Custom Ollama Configuration

Change default model directory:

# Set model storage location
export OLLAMA_MODELS=/path/to/models

# Start Ollama
ollama serve

Enable debug logging:

# Enable verbose output
export OLLAMA_DEBUG=1
ollama serve

Multiple Models

Use different models for different tasks:

# Download multiple models
ollama pull qwen2.5-coder:7b
ollama pull codellama:13b
ollama pull mistral:7b

# Use specific model
pr-resolve apply 123 --llm-preset ollama-local --llm-model codellama:13b

Model Management

# List downloaded models
ollama list

# Show model info
ollama show qwen2.5-coder:7b

# Remove model
ollama rm mistral:7b

# Copy model with custom name
ollama cp qwen2.5-coder:7b my-custom-model

Running as System Service

Linux (systemd):

# Create service file
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network.target

[Service]
Type=simple
User=$USER
ExecStart=/usr/local/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
sudo systemctl enable ollama
sudo systemctl start ollama

# Check status
sudo systemctl status ollama

macOS (launchd):

# Ollama includes launchd service by default
# Check if running
launchctl list | grep ollama

# Start service
launchctl start com.ollama.ollama
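
If Ollama was installed with Homebrew (see Installation above), it can also be run as a Homebrew-managed background service:

# Start (and register at login) or stop the managed service
brew services start ollama
brew services stop ollama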

GPU Acceleration

GPU acceleration provides 10-60x speedup compared to CPU-only inference. The pr-resolve tool automatically detects and displays GPU information when using Ollama.

Automatic GPU Detection

Starting with version 0.3.0, pr-resolve automatically detects GPU availability when initializing Ollama:

# Run conflict resolution
pr-resolve apply 123 --llm-preset ollama-local

# GPU info displayed in metrics (if detected)
# ╭─ LLM Metrics ─────────────────────────╮
# │ Provider: ollama (qwen2.5-coder:7b)   │
# │ Hardware: NVIDIA RTX 4090 (24GB)      │
# │ Changes Parsed: 5                     │
# │ ...                                   │
# ╰───────────────────────────────────────╯

Detection supports multiple platforms:

  • NVIDIA GPUs: CUDA 11.0+ (automatically detected via nvidia-smi)

  • AMD GPUs: ROCm 5.0+ (automatically detected via rocm-smi)

  • Apple Silicon: M1/M2/M3/M4 with Metal (automatically detected on macOS)

  • CPU Fallback: Gracefully falls back if no GPU detected

NVIDIA GPU Setup (CUDA)

Prerequisites:

# 1. Verify NVIDIA driver
nvidia-smi

# Should show driver version and GPU info
# Recommended: Driver 525+ (CUDA 12+)

Installation (if nvidia-smi not found):

Ubuntu/Debian:

# Install NVIDIA drivers
sudo ubuntu-drivers autoinstall

# Reboot required
sudo reboot

# Verify
nvidia-smi

Fedora/RHEL:

# Install NVIDIA drivers
sudo dnf install akmod-nvidia

# Reboot required
sudo reboot

# Verify
nvidia-smi

Verification:

# Check Ollama GPU detection
ollama ps

# Should show
# NAME                   ID              SIZE     PROCESSOR
# qwen2.5-coder:7b      abc123...       4.7 GB   100% GPU

# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local

# Check metrics output for GPU info

Performance Expectations:

  • RTX 3060 (12GB): ~50-70 tokens/sec with 7B models

  • RTX 3090 (24GB): ~70-100 tokens/sec with 7B models, ~40-60 tokens/sec with 14B

  • RTX 4090 (24GB): ~100-150 tokens/sec with 7B models, ~60-90 tokens/sec with 14B

AMD GPU Setup (ROCm)

Prerequisites:

  • AMD GPU with ROCm support (RX 6000/7000 series, MI series)

  • ROCm 5.0 or newer

Installation:

# Follow AMD ROCm installation guide
# https://github.com/ollama/ollama/blob/main/docs/gpu.md

# Verify
rocm-smi --showproductname

# Should display AMD GPU info

Verification:

# Check Ollama GPU detection
ollama ps

# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local

Apple Silicon Setup (Metal)

Automatic Detection: No setup required - Ollama automatically uses Metal acceleration on Apple Silicon Macs.

Supported Chips:

  • M1, M1 Pro, M1 Max, M1 Ultra

  • M2, M2 Pro, M2 Max, M2 Ultra

  • M3, M3 Pro, M3 Max

  • M4, M4 Pro, M4 Max

Verification:

# Check chip
sysctl -n machdep.cpu.brand_string

# Should show "Apple M1/M2/M3/M4"

# Test with pr-resolve
pr-resolve apply 123 --llm-preset ollama-local

# Metrics will show
# Hardware: Apple M3 Max (Metal)

Performance Notes:

  • M1/M2 8GB: Good for 7B models

  • M1/M2 Pro/Max 16GB+: Excellent for 7B-14B models

  • M1/M2 Ultra 64GB+: Handles 32B models well

  • Unified memory shared between CPU and GPU

Troubleshooting GPU Detection

GPU Not Detected (Shows "Hardware: CPU"):

  1. Verify GPU is available:

    # NVIDIA
    nvidia-smi
    
    # AMD
    rocm-smi --showproductname
    
    # Apple Silicon
    sysctl -n machdep.cpu.brand_string
    
  2. Check Ollama GPU usage:

    ollama ps
    
    # PROCESSOR column should show "GPU" not "CPU"
    # If it shows CPU, Ollama isn't using the GPU
    
  3. Restart Ollama to detect GPU:

    # Stop Ollama
    killall ollama
    
    # Start Ollama (GPU detection happens on startup)
    ollama serve
    
    # Reload the model so it is placed on the GPU (a one-shot prompt forces a load)
    ollama run qwen2.5-coder:7b "hello"
    
  4. Check CUDA/ROCm installation:

    # NVIDIA: Check CUDA
    nvcc --version
    
    # AMD: Check ROCm
    rocminfo
    

GPU Detected but Slow Performance:

  1. Check GPU memory:

    # NVIDIA
    nvidia-smi
    
    # Look for "Memory-Usage" - should have enough free VRAM
    # 7B models need ~6GB, 14B need ~12GB
    
  2. Close competing GPU processes:

    # NVIDIA: List GPU processes
    nvidia-smi
    
    # AMD: List processes
    rocm-smi --showpids
    
  3. Use smaller model if out of VRAM:

    # 7B instead of 14B
    pr-resolve apply 123 \
      --llm-preset ollama-local \
      --llm-model qwen2.5-coder:7b
    

Mixed CPU/GPU Usage:

If model is too large for GPU VRAM, Ollama may split between GPU and CPU (slower):

# Check split in ollama ps
ollama ps

# May show: "50% GPU" instead of "100% GPU"
# Solution: Use smaller model

GPU Performance Monitoring

During Resolution:

# Terminal 1: Run pr-resolve
pr-resolve apply 123 --llm-preset ollama-local

# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi  # NVIDIA
# OR
watch -n 1 rocm-smi    # AMD

Check pr-resolve Metrics:

# After resolution completes
# Look for metrics panel in output
╭─ LLM Metrics ─────────────────────────╮
│ Provider: ollama (qwen2.5-coder:7b)   │
│ Hardware: NVIDIA RTX 4090 (24GB)      │  ← GPU detected
│ Changes Parsed: 5                     │
│ Avg Confidence: 0.92                  │
│ Cache Hit Rate: 0%                    │
│ Total Cost: $0.00                     │
│ API Calls: 5                          │
│ Total Tokens: 12,450                  │
╰───────────────────────────────────────╯

No GPU Info Displayed:

  • If GPU info is not shown in metrics, it means:

    • No GPU detected (CPU-only system)

    • GPU detection failed (non-fatal, falls back to CPU)

    • Using cloud LLM provider (GPU info only for Ollama)

GPU Acceleration Benefits

Performance Comparison (qwen2.5-coder:7b):

Hardware          Tokens/sec   Time for 1000 tokens
CPU (i7-12700K)   1-3          5-15 minutes
RTX 3060 (12GB)   50-70        15-20 seconds
RTX 4090 (24GB)   100-150      7-10 seconds
M2 Max (96GB)     40-60        15-25 seconds

Cost Savings:

  • GPU: Free (local hardware)

  • API (Claude/GPT-4): ~$0.01-0.05 per resolution

Recommendation: For frequent usage, a $300-500 GPU pays for itself in API savings within months.

Performance Tuning

Adjust context size:

# config.yaml
llm:
  max_tokens: 4000  # Increase for larger conflicts

Adjust timeout:

from review_bot_automator.llm.providers.ollama import OllamaProvider

provider = OllamaProvider(
    model="qwen2.5-coder:7b",
    timeout=300  # 5 minutes for slow systems
)

See Also