LLM Provider Performance Benchmarks

Comprehensive performance comparison of LLM providers for code review comment resolution.

Overview

This benchmarking infrastructure allows you to systematically compare the performance of all supported LLM providers (Anthropic, OpenAI, Ollama, Claude CLI, Codex CLI) across multiple dimensions:

  • Latency: Response time metrics (mean, median, P95, P99)

  • Throughput: Requests per second

  • Accuracy: Parsing success rate against ground truth

  • Cost: Per-request and monthly estimates

  • GPU Performance: Hardware utilization for local models

Quick Start

Prerequisites

  • Python 3.12+ with virtual environment activated

  • API keys configured for cloud providers (Anthropic, OpenAI)

  • Ollama installed for local model testing (optional)

Basic Usage

# Benchmark all providers with default settings (100 iterations)
python scripts/benchmark_llm.py --iterations 100

# Benchmark specific providers
python scripts/benchmark_llm.py --providers anthropic openai --iterations 50

# Use custom test dataset
python scripts/benchmark_llm.py --dataset my_comments.json --iterations 100

# Save report to custom location
python scripts/benchmark_llm.py --output reports/benchmark-2025-11-17.md

Command-Line Options

python scripts/benchmark_llm.py --help

Options:
  --providers PROVIDERS [PROVIDERS ...]
                        LLM providers to benchmark (default: all)
                        Choices: anthropic, openai, ollama, claude-cli, codex-cli

  --iterations N        Number of iterations per provider (default: 100)
                        Recommended: 100+ for statistical significance

  --dataset PATH        Path to test dataset JSON file
                        (default: tests/benchmarks/sample_comments.json)

  --output PATH         Output markdown report path
                        (default: docs/performance-benchmarks.md)

  --warmup N           Number of warmup iterations (default: 5)
                        Warmup runs are not included in metrics

Metrics Explained

Latency Metrics

Mean Latency

  • Average response time across all requests

  • Good indicator of typical performance

  • Affected by outliers

Median Latency (P50)

  • Middle value when sorted by response time

  • More robust to outliers than mean

  • Better represents “typical” user experience

P95 Latency

  • 95% of requests complete faster than this time

  • Indicates worst-case performance for most users

  • Acceptance Criteria: < 5 seconds for all providers

P99 Latency

  • 99% of requests complete faster than this time

  • Captures tail latency and outliers

  • Important for SLA guarantees

Throughput

Requests Per Second

  • How many requests the provider can handle

  • Calculated as: 1 / mean_latency

  • Higher is better for high-volume deployments
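
For reference, all of these latency and throughput figures can be derived from a list of per-request timings with the standard library alone. A minimal sketch (the latencies list stands in for whatever the benchmark script actually records):

import statistics

def latency_summary(latencies: list[float]) -> dict[str, float]:
    """Summarize per-request latencies (in seconds) into the metrics above."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points: index 94 = P95, index 98 = P99
    mean = statistics.mean(latencies)
    return {
        "mean": mean,
        "median": statistics.median(latencies),
        "p95": cuts[94],
        "p99": cuts[98],
        "throughput_rps": 1.0 / mean,  # requests per second, as defined above
    }

# Example: 100 synthetic timings spread between 1.5s and 4.0s
print(latency_summary([1.5 + 0.025 * i for i in range(100)]))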

Accuracy Metrics

Success Rate

  • Percentage of requests that returned valid responses

  • Calculated as: successful_parses / total_requests

  • Target: > 95% for production use

Average Confidence

  • Mean confidence score from parsed responses

  • Range: 0.0 - 1.0 (higher is better)

  • Indicates model certainty in suggestions
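
A short sketch of how these two figures combine, assuming each per-request result records whether parsing succeeded and, if so, a confidence score (the field names here are illustrative, not the script's actual schema):

def accuracy_summary(results: list[dict]) -> dict[str, float]:
    """Compute success rate and average confidence from per-request results."""
    successes = [r for r in results if r.get("parsed_ok")]
    success_rate = len(successes) / len(results) if results else 0.0
    avg_confidence = (
        sum(r["confidence"] for r in successes) / len(successes) if successes else 0.0
    )
    return {"success_rate": success_rate, "avg_confidence": avg_confidence}

# Example: 19 of 20 requests parsed at 0.85 confidence -> 95% success rate
sample = [{"parsed_ok": True, "confidence": 0.85}] * 19 + [{"parsed_ok": False}]
print(accuracy_summary(sample))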

Cost Analysis

Total Cost

  • Sum of all API costs for the benchmark run

  • Free for local models (Ollama, Claude CLI, Codex CLI)

Cost Per Request

  • Average cost per API call

  • Important for budgeting at scale

Monthly Estimates

  • Projected costs at 1K and 10K requests/month

  • Helps plan production deployment budgets
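
The monthly figures are straight linear projections from the measured per-request cost; a minimal sketch, assuming costs are tracked in USD:

def cost_projection(total_cost_usd: float, total_requests: int) -> dict[str, float]:
    """Project monthly spend from a benchmark run's total cost."""
    per_request = total_cost_usd / total_requests if total_requests else 0.0
    return {
        "cost_per_request": per_request,
        "monthly_1k": per_request * 1_000,
        "monthly_10k": per_request * 10_000,
    }

# Example: a $0.45 run over 100 requests -> $4.50/month at 1K requests
print(cost_projection(0.45, 100))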

GPU Information (Local Models Only)

For Ollama and local models, the benchmark captures:

  • GPU name and model

  • Total memory available

  • Driver version

  • CUDA version (NVIDIA GPUs)
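
One way to collect this on NVIDIA hardware is to shell out to nvidia-smi; a hedged sketch, assuming the tool is on PATH (the benchmark script may gather the same details differently):

import subprocess

def nvidia_gpu_info() -> dict[str, str] | None:
    """Query basic details for the first NVIDIA GPU; returns None if none is found."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # First line only; multi-GPU hosts report one line per device
    name, memory, driver = [field.strip() for field in out.splitlines()[0].split(",")]
    return {"gpu": name, "memory_total": memory, "driver_version": driver}

print(nvidia_gpu_info())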

Test Dataset

The default benchmark dataset (tests/benchmarks/sample_comments.json) contains 30 realistic CodeRabbit-style review comments across three complexity levels:

Simple Comments (10)

  • Basic code suggestions

  • Single-line fixes

  • Simple formatting changes

  • Expected latency: < 2s

Medium Comments (10)

  • Multi-line code changes

  • Diff blocks with context

  • Moderate refactoring suggestions

  • Expected latency: 2-4s

Complex Comments (10)

  • Security vulnerability fixes

  • Architecture refactoring

  • Multi-file changes

  • Multi-option recommendations

  • Expected latency: 4-6s

Ground Truth Annotations

Each comment includes ground truth data for accuracy validation:

{
  "body": "```suggestion\ndef calculate_total(items):\n    return sum(item.price for item in items)\n```",
  "path": "src/cart.py",
  "line": 45,
  "ground_truth": {
    "changes": 1,
    "start_line": 45,
    "end_line": 46,
    "change_type": "modification",
    "confidence_threshold": 0.8
  }
}
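
A hedged sketch of how a parsed suggestion could be scored against such an annotation (the parsed-result fields here are illustrative rather than the resolver's actual output schema):

def matches_ground_truth(parsed: dict, ground_truth: dict) -> bool:
    """Treat a parse as correct if it matches the annotated change count and line range."""
    if parsed.get("changes") != ground_truth.get("changes"):
        return False
    if "start_line" in ground_truth and parsed.get("start_line") != ground_truth["start_line"]:
        return False
    if "end_line" in ground_truth and parsed.get("end_line") != ground_truth["end_line"]:
        return False
    return parsed.get("confidence", 0.0) >= ground_truth.get("confidence_threshold", 0.0)

# Example against the annotation above
parsed = {"changes": 1, "start_line": 45, "end_line": 46, "confidence": 0.92}
truth = {"changes": 1, "start_line": 45, "end_line": 46, "confidence_threshold": 0.8}
print(matches_ground_truth(parsed, truth))  # True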

Creating Custom Datasets

To create your own benchmark dataset:

  1. Structure: JSON file with three keys: simple, medium, complex

  2. Format: Each category contains a list of comment objects

  3. Required fields: body, path, line, ground_truth

Example custom dataset:

{
  "simple": [
    {
      "body": "Fix typo: 'recieve' → 'receive'",
      "path": "src/utils.py",
      "line": 10,
      "ground_truth": {
        "changes": 1,
        "confidence_threshold": 0.9
      }
    }
  ],
  "medium": [...],
  "complex": [...]
}
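
Before committing to a long run, it is worth checking the dataset shape up front; a minimal validation sketch based on the structure described above:

import json

REQUIRED_FIELDS = {"body", "path", "line", "ground_truth"}
CATEGORIES = ("simple", "medium", "complex")

def validate_dataset(path: str) -> list[str]:
    """Return a list of problems found in a custom benchmark dataset; empty means OK."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    for category in CATEGORIES:
        comments = data.get(category)
        if not isinstance(comments, list):
            problems.append(f"missing or non-list category: {category}")
            continue
        for index, comment in enumerate(comments):
            if not isinstance(comment, dict):
                problems.append(f"{category}[{index}] is not an object")
                continue
            missing = REQUIRED_FIELDS - comment.keys()
            if missing:
                problems.append(f"{category}[{index}] missing fields: {sorted(missing)}")
    return problems

print(validate_dataset("my_comments.json") or "dataset looks valid")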

Provider Comparison

Anthropic (Claude)

Strengths

  • Excellent accuracy on complex code understanding

  • Strong security vulnerability detection

  • Prompt caching reduces costs by 50-90%

Considerations

  • Slightly higher per-request cost than OpenAI

  • API latency depends on region

Best For

  • Production deployments with repeated prompts

  • Security-critical code reviews

  • Complex architectural refactoring
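
The prompt-caching saving mentioned above comes from marking the large, reused portion of the prompt (for example, the resolution instructions) as cacheable. A minimal sketch with the anthropic Python SDK; the model name and prompt text are placeholders, not this project's actual configuration:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use your configured model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You resolve code review comments...",  # long, reused instructions
            "cache_control": {"type": "ephemeral"},          # cache this block between calls
        }
    ],
    messages=[{"role": "user", "content": "Resolve this CodeRabbit comment: ..."}],
)
print(response.content[0].text)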

OpenAI (GPT-4o, GPT-4o-mini)

Strengths

  • Fast response times (1-3s typical)

  • GPT-4o-mini offers excellent cost/performance ratio

  • Wide model selection

Considerations

  • No prompt caching (yet)

  • Higher costs for GPT-4o at scale

Best For

  • Speed-critical applications

  • Cost-sensitive deployments (with mini model)

  • High-volume production systems

Ollama (Local Models)

Strengths

  • Zero per-request cost

  • 100% privacy (no data leaves your infrastructure)

  • GPU acceleration support

Considerations

  • Requires local hardware (GPU recommended)

  • Higher latency than cloud APIs

  • Model quality varies (qwen2.5-coder:7b recommended)

Best For

  • Privacy-first requirements (HIPAA, GDPR, confidential code)

  • Cost-sensitive high-volume deployments

  • Air-gapped environments
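
For context, a local Ollama request is just an HTTP call to the daemon's default port; a minimal sketch, assuming Ollama is running and qwen2.5-coder:7b has already been pulled:

import json
import urllib.request

def ollama_generate(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """Send a single non-streaming generation request to a local Ollama daemon."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]

print(ollama_generate("Suggest a fix for: def add(a, b): return a - b"))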

Claude CLI

Strengths

  • Free for development/testing

  • Uses latest Claude models

  • No API key management

Considerations

  • Not suitable for production automation

  • Rate limits apply

  • Requires Claude desktop app

Best For

  • Local development and testing

  • Prototyping before API integration

Codex CLI

Strengths

  • Free for development/testing

  • Direct integration with OpenAI Codex

Considerations

  • Not suitable for production automation

  • Limited to Codex model family

Best For

  • Local development and testing

  • Codex-specific workflows

Interpreting Results

Sample Benchmark Report

## Latency Comparison

| Provider   | Model          | Mean   | Median | P95   | P99   | Throughput |
|------------|----------------|--------|--------|-------|-------|------------|
| anthropic  | claude-3-5     | 2.1s   | 2.0s   | 3.5s  | 4.2s  | 0.48 req/s |
| openai     | gpt-4o-mini    | 1.8s   | 1.7s   | 2.9s  | 3.5s  | 0.56 req/s |
| ollama     | qwen2.5:7b     | 4.5s   | 4.2s   | 7.1s  | 8.5s  | 0.22 req/s |

What to Look For

Production Readiness

  • ✅ P95 latency < 5s (all providers meet acceptance criteria)

  • ✅ Success rate > 95%

  • ✅ Average confidence > 0.7

Cost Optimization

  • Compare monthly estimates at expected volume

  • Consider Anthropic with prompt caching for repeated prompts

  • Evaluate Ollama for high-volume scenarios

Performance vs Cost Trade-offs

  • OpenAI gpt-4o-mini: Best cost/performance for cloud

  • Anthropic: Best accuracy, cost-effective with caching

  • Ollama: Best for privacy and zero ongoing costs

Advanced Usage

Benchmarking Specific Scenarios

# Benchmark only simple comments
python scripts/benchmark_llm.py --complexity simple --iterations 200

# Benchmark with custom warmup
python scripts/benchmark_llm.py --warmup 10 --iterations 100

# Benchmark with verbose output
python scripts/benchmark_llm.py --verbose

Continuous Benchmarking

Integrate benchmarking into your CI/CD pipeline:

# .github/workflows/benchmark.yml
name: LLM Benchmark
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/benchmark_llm.py --iterations 100
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add docs/performance-benchmarks.md
          git commit -m "chore: update weekly benchmarks"
          git push

Regression Detection

Monitor key metrics over time to detect performance regressions:

# Save benchmark results with timestamp
python scripts/benchmark_llm.py --output "reports/benchmark-$(date +%Y-%m-%d).md"

# Compare with previous results
diff reports/benchmark-2025-11-10.md reports/benchmark-2025-11-17.md
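
Plain diff works, but it is often easier to compare just the key numbers. A hedged sketch that pulls P95 latency out of two saved reports, assuming they use the latency table layout shown earlier:

import re

def p95_by_provider(report_path: str) -> dict[str, float]:
    """Pull P95 latency (seconds) per provider out of a saved markdown report."""
    results = {}
    row = re.compile(r"^\|\s*(\w[\w-]*)\s*\|[^|]*\|[^|]*\|[^|]*\|\s*([\d.]+)s\s*\|")
    with open(report_path, encoding="utf-8") as handle:
        for line in handle:
            match = row.match(line)
            if match:
                results[match.group(1)] = float(match.group(2))
    return results

# Flag any provider whose P95 grew by more than 20% week over week
old = p95_by_provider("reports/benchmark-2025-11-10.md")
new = p95_by_provider("reports/benchmark-2025-11-17.md")
for provider, p95 in new.items():
    if provider in old and p95 > old[provider] * 1.2:
        print(f"regression: {provider} P95 {old[provider]}s -> {p95}s")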

Troubleshooting

Low Success Rate (< 95%)

Possible Causes

  • Invalid API keys

  • Network connectivity issues

  • Model timeout (increase timeout in config)

  • Malformed test comments

Solutions

  1. Check API key validity: echo $ANTHROPIC_API_KEY

  2. Test network: curl https://api.anthropic.com

  3. Increase timeout: --timeout 60

  4. Validate test dataset JSON schema

High P99 Latency (> 10s)

Possible Causes

  • Network congestion

  • Provider rate limiting

  • Cold start delays (first request)

  • Complex comments exceeding context limits

Solutions

  1. Increase warmup iterations: --warmup 10

  2. Reduce concurrent requests

  3. Split complex comments into simpler chunks

  4. Check provider status pages
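
If rate limiting turns out to be the cause, wrapping the provider call in a simple exponential backoff usually smooths out the tail; a minimal sketch (call_provider stands in for whichever client function the benchmark uses):

import random
import time

def call_with_backoff(call_provider, *, attempts: int = 5, base_delay: float = 1.0):
    """Retry a provider call with exponential backoff and a little jitter."""
    for attempt in range(attempts):
        try:
            return call_provider()
        except Exception as error:  # narrow this to the provider's rate-limit error in practice
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay:.1f}s")
            time.sleep(delay)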

GPU Not Detected (Ollama)

Possible Causes

  • CUDA/ROCm drivers not installed

  • GPU not accessible to Docker (if running in container)

  • Ollama not configured for GPU

Solutions

  1. Verify GPU: nvidia-smi or rocm-smi

  2. Check Ollama config: ollama ps

  3. Reinstall with GPU support: See Ollama Setup Guide

Out of Memory (Local Models)

Possible Causes

  • Model too large for available VRAM

  • Batch size too high

  • Memory leak in long-running benchmarks

Solutions

  1. Use smaller model: qwen2.5-coder:3b instead of 7b

  2. Reduce iterations: --iterations 50

  3. Restart Ollama between runs

Contributing

Found a performance issue or want to add benchmark scenarios?

  1. Create a new test comment in tests/benchmarks/sample_comments.json

  2. Add ground truth annotations

  3. Run the benchmark: python scripts/benchmark_llm.py

  4. Submit a PR with your findings

For questions or issues, see the GitHub Issues page.