LLM Provider Performance Benchmarks

Comprehensive performance comparison of LLM providers for code review comment resolution.

Overview

This benchmarking infrastructure allows you to systematically compare the performance of all supported LLM providers (Anthropic, OpenAI, Ollama, Claude CLI, Codex CLI) across multiple dimensions:

  • Latency: Response time metrics (mean, median, P95, P99)

  • Throughput: Requests per second

  • Accuracy: Parsing success rate against ground truth

  • Cost: Per-request and monthly estimates

  • GPU Performance: Hardware utilization for local models

Quick Start

Prerequisites

  • Python 3.12+ with virtual environment activated

  • API keys configured for cloud providers (Anthropic, OpenAI)

  • Ollama installed for local model testing (optional)

Basic Usage

# Benchmark all providers with default settings (100 iterations)
python scripts/benchmark_llm.py --iterations 100

# Benchmark specific providers
python scripts/benchmark_llm.py --providers anthropic openai --iterations 50

# Use custom test dataset
python scripts/benchmark_llm.py --dataset my_comments.json --iterations 100

# Save report to custom location
python scripts/benchmark_llm.py --output reports/benchmark-2025-11-17.md

Command-Line Options

python scripts/benchmark_llm.py --help

Options:
  --providers PROVIDERS [PROVIDERS ...]
                        LLM providers to benchmark (default: all)
                        Choices: anthropic, openai, ollama, claude-cli, codex-cli

  --iterations N        Number of iterations per provider (default: 100)
                        Recommended: 100+ for statistical significance

  --dataset PATH        Path to test dataset JSON file
                        (default: tests/benchmarks/sample_comments.json)

  --output PATH         Output markdown report path
                        (default: docs/performance-benchmarks.md)

  --warmup N           Number of warmup iterations (default: 5)
                        Warmup runs are not included in metrics

Metrics Explained

Latency Metrics

Mean Latency

  • Average response time across all requests

  • Good indicator of typical performance

  • Affected by outliers

Median Latency (P50)

  • Middle value when sorted by response time

  • More robust to outliers than mean

  • Better represents “typical” user experience

P95 Latency

  • 95% of requests complete faster than this time

  • Indicates worst-case performance for most users

  • Acceptance Criteria: < 5 seconds for all providers

P99 Latency

  • 99% of requests complete faster than this time

  • Captures tail latency and outliers

  • Important for SLA guarantees

Throughput

Requests Per Second

  • How many requests the provider can handle

  • Calculated as: 1 / mean_latency

  • Higher is better for high-volume deployments
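
For reference, all of these latency and throughput figures can be derived from a list of per-request timings with the standard library alone. A minimal sketch (the latencies list stands in for whatever the benchmark script actually records):

import statistics

def latency_summary(latencies: list[float]) -> dict[str, float]:
    """Summarize per-request latencies (in seconds) into the metrics above."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points: index 94 = P95, index 98 = P99
    mean = statistics.mean(latencies)
    return {
        "mean": mean,
        "median": statistics.median(latencies),
        "p95": cuts[94],
        "p99": cuts[98],
        "throughput_rps": 1.0 / mean,  # requests per second, as defined above
    }

# Example: 100 synthetic timings spread between 1.5s and 4.0s
print(latency_summary([1.5 + 0.025 * i for i in range(100)]))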

Accuracy Metrics

Success Rate

  • Percentage of requests that returned valid responses

  • Calculated as: successful_parses / total_requests

  • Target: > 95% for production use

Average Confidence

  • Mean confidence score from parsed responses

  • Range: 0.0 - 1.0 (higher is better)

  • Indicates model certainty in suggestions
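
A short sketch of how these two figures combine, assuming each per-request result records whether parsing succeeded and, if so, a confidence score (the field names here are illustrative, not the script's actual schema):

def accuracy_summary(results: list[dict]) -> dict[str, float]:
    """Compute success rate and average confidence from per-request results."""
    successes = [r for r in results if r.get("parsed_ok")]
    success_rate = len(successes) / len(results) if results else 0.0
    avg_confidence = (
        sum(r["confidence"] for r in successes) / len(successes) if successes else 0.0
    )
    return {"success_rate": success_rate, "avg_confidence": avg_confidence}

# Example: 19 of 20 requests parsed at 0.85 confidence -> 95% success rate
sample = [{"parsed_ok": True, "confidence": 0.85}] * 19 + [{"parsed_ok": False}]
print(accuracy_summary(sample))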

Cost Analysis

Total Cost

  • Sum of all API costs for the benchmark run

  • Free for local models (Ollama, Claude CLI, Codex CLI)

Cost Per Request

  • Average cost per API call

  • Important for budgeting at scale

Monthly Estimates

  • Projected costs at 1K and 10K requests/month

  • Helps plan production deployment budgets
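
The monthly figures are straight linear projections from the measured per-request cost; a minimal sketch, assuming costs are tracked in USD:

def cost_projection(total_cost_usd: float, total_requests: int) -> dict[str, float]:
    """Project monthly spend from a benchmark run's total cost."""
    per_request = total_cost_usd / total_requests if total_requests else 0.0
    return {
        "cost_per_request": per_request,
        "monthly_1k": per_request * 1_000,
        "monthly_10k": per_request * 10_000,
    }

# Example: a $0.45 run over 100 requests -> $4.50/month at 1K requests
print(cost_projection(0.45, 100))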

GPU Information (Local Models Only)

For Ollama and local models, the benchmark captures:

  • GPU name and model

  • Total memory available

  • Driver version

  • CUDA version (NVIDIA GPUs)
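
One way to collect this on NVIDIA hardware is to shell out to nvidia-smi; a hedged sketch, assuming the tool is on PATH (the benchmark script may gather the same details differently):

import subprocess

def nvidia_gpu_info() -> dict[str, str] | None:
    """Query basic details for the first NVIDIA GPU; returns None if none is found."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # First line only; multi-GPU hosts report one line per device
    name, memory, driver = [field.strip() for field in out.splitlines()[0].split(",")]
    return {"gpu": name, "memory_total": memory, "driver_version": driver}

print(nvidia_gpu_info())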

Test Dataset

The default benchmark dataset (tests/benchmarks/sample_comments.json) contains 30 realistic CodeRabbit-style review comments across three complexity levels:

Simple Comments (10)

  • Basic code suggestions

  • Single-line fixes

  • Simple formatting changes

  • Expected latency: < 2s

Medium Comments (10)

  • Multi-line code changes

  • Diff blocks with context

  • Moderate refactoring suggestions

  • Expected latency: 2-4s

Complex Comments (10)

  • Security vulnerability fixes

  • Architecture refactoring

  • Multi-file changes

  • Multi-option recommendations

  • Expected latency: 4-6s

Ground Truth Annotations

Each comment includes ground truth data for accuracy validation:

{
  "body": "```suggestion\ndef calculate_total(items):\n    return sum(item.price for item in items)\n```",
  "path": "src/cart.py",
  "line": 45,
  "ground_truth": {
    "changes": 1,
    "start_line": 45,
    "end_line": 46,
    "change_type": "modification",
    "confidence_threshold": 0.8
  }
}
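
A hedged sketch of how a parsed suggestion could be scored against such an annotation (the parsed-result fields here are illustrative rather than the resolver's actual output schema):

def matches_ground_truth(parsed: dict, ground_truth: dict) -> bool:
    """Treat a parse as correct if it matches the annotated change count and line range."""
    if parsed.get("changes") != ground_truth.get("changes"):
        return False
    if "start_line" in ground_truth and parsed.get("start_line") != ground_truth["start_line"]:
        return False
    if "end_line" in ground_truth and parsed.get("end_line") != ground_truth["end_line"]:
        return False
    return parsed.get("confidence", 0.0) >= ground_truth.get("confidence_threshold", 0.0)

# Example against the annotation above
parsed = {"changes": 1, "start_line": 45, "end_line": 46, "confidence": 0.92}
truth = {"changes": 1, "start_line": 45, "end_line": 46, "confidence_threshold": 0.8}
print(matches_ground_truth(parsed, truth))  # True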

Creating Custom Datasets

To create your own benchmark dataset:

  1. Structure: JSON file with three keys: simple, medium, complex

  2. Format: Each category contains a list of comment objects

  3. Required fields: body, path, line, ground_truth

Example custom dataset:

{
  "simple": [
    {
      "body": "Fix typo: 'recieve' → 'receive'",
      "path": "src/utils.py",
      "line": 10,
      "ground_truth": {
        "changes": 1,
        "confidence_threshold": 0.9
      }
    }
  ],
  "medium": [...],
  "complex": [...]
}
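
Before committing to a long run, it is worth checking the dataset shape up front; a minimal validation sketch based on the structure described above:

import json

REQUIRED_FIELDS = {"body", "path", "line", "ground_truth"}
CATEGORIES = ("simple", "medium", "complex")

def validate_dataset(path: str) -> list[str]:
    """Return a list of problems found in a custom benchmark dataset; empty means OK."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    for category in CATEGORIES:
        comments = data.get(category)
        if not isinstance(comments, list):
            problems.append(f"missing or non-list category: {category}")
            continue
        for index, comment in enumerate(comments):
            if not isinstance(comment, dict):
                problems.append(f"{category}[{index}] is not an object")
                continue
            missing = REQUIRED_FIELDS - comment.keys()
            if missing:
                problems.append(f"{category}[{index}] missing fields: {sorted(missing)}")
    return problems

print(validate_dataset("my_comments.json") or "dataset looks valid")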

Provider Comparison

Anthropic (Claude)

Strengths

  • Excellent accuracy on complex code understanding

  • Strong security vulnerability detection

  • Prompt caching reduces costs by 50-90%

Considerations

  • Slightly higher per-request cost than OpenAI

  • API latency depends on region

Best For

  • Production deployments with repeated prompts

  • Security-critical code reviews

  • Complex architectural refactoring
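
The prompt-caching saving mentioned above comes from marking the large, reused portion of the prompt (for example, the resolution instructions) as cacheable. A minimal sketch with the anthropic Python SDK; the model name and prompt text are placeholders, not this project's actual configuration:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use your configured model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You resolve code review comments...",  # long, reused instructions
            "cache_control": {"type": "ephemeral"},          # cache this block between calls
        }
    ],
    messages=[{"role": "user", "content": "Resolve this CodeRabbit comment: ..."}],
)
print(response.content[0].text)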

OpenAI (GPT-4o, GPT-4o-mini)

Strengths

  • Fast response times (1-3s typical)

  • GPT-4o-mini offers excellent cost/performance ratio

  • Wide model selection

Considerations

  • No prompt caching (yet)

  • Higher costs for GPT-4o at scale

Best For

  • Speed-critical applications

  • Cost-sensitive deployments (with mini model)

  • High-volume production systems

Ollama (Local Models)

Strengths

  • Zero per-request cost

  • 100% privacy (no data leaves your infrastructure)

  • GPU acceleration support

Considerations

  • Requires local hardware (GPU recommended)

  • Higher latency than cloud APIs

  • Model quality varies (qwen2.5-coder:7b recommended)

Best For

  • Privacy-first requirements (HIPAA, GDPR, confidential code)

  • Cost-sensitive high-volume deployments

  • Air-gapped environments
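
For context, a local Ollama request is just an HTTP call to the daemon's default port; a minimal sketch, assuming Ollama is running and qwen2.5-coder:7b has already been pulled:

import json
import urllib.request

def ollama_generate(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """Send a single non-streaming generation request to a local Ollama daemon."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]

print(ollama_generate("Suggest a fix for: def add(a, b): return a - b"))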

Claude CLI

Strengths

  • Free for development/testing

  • Uses latest Claude models

  • No API key management

Considerations

  • Not suitable for production automation

  • Rate limits apply

  • Requires Claude desktop app

Best For

  • Local development and testing

  • Prototyping before API integration

Codex CLI

Strengths

  • Free for development/testing

  • Direct integration with OpenAI Codex

Considerations

  • Not suitable for production automation

  • Limited to Codex model family

Best For

  • Local development and testing

  • Codex-specific workflows

Interpreting Results

Sample Benchmark Report

## Latency Comparison

| Provider   | Model          | Mean   | Median | P95   | P99   | Throughput |
|------------|----------------|--------|--------|-------|-------|------------|
| anthropic  | claude-3-5     | 2.1s   | 2.0s   | 3.5s  | 4.2s  | 0.48 req/s |
| openai     | gpt-4o-mini    | 1.8s   | 1.7s   | 2.9s  | 3.5s  | 0.56 req/s |
| ollama     | qwen2.5:7b     | 4.5s   | 4.2s   | 7.1s  | 8.5s  | 0.22 req/s |

What to Look For

Production Readiness

  • ✅ P95 latency < 5s (all providers meet acceptance criteria)

  • ✅ Success rate > 95%

  • ✅ Average confidence > 0.7

Cost Optimization

  • Compare monthly estimates at expected volume

  • Consider Anthropic with prompt caching for repeated prompts

  • Evaluate Ollama for high-volume scenarios

Performance vs Cost Trade-offs

  • OpenAI gpt-4o-mini: Best cost/performance for cloud

  • Anthropic: Best accuracy, cost-effective with caching

  • Ollama: Best for privacy and zero ongoing costs

Advanced Usage

Benchmarking Specific Scenarios

# Benchmark only simple comments
python scripts/benchmark_llm.py --complexity simple --iterations 200

# Benchmark with custom warmup
python scripts/benchmark_llm.py --warmup 10 --iterations 100

# Benchmark with verbose output
python scripts/benchmark_llm.py --verbose

Continuous Benchmarking

Integrate benchmarking into your CI/CD pipeline:

# .github/workflows/benchmark.yml
name: LLM Benchmark
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/benchmark_llm.py --iterations 100
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add docs/performance-benchmarks.md
          git commit -m "chore: update weekly benchmarks"
          git push

Regression Detection

Monitor key metrics over time to detect performance regressions:

# Save benchmark results with timestamp
python scripts/benchmark_llm.py --output "reports/benchmark-$(date +%Y-%m-%d).md"

# Compare with previous results
diff reports/benchmark-2025-11-10.md reports/benchmark-2025-11-17.md
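
Plain diff works, but it is often easier to compare just the key numbers. A hedged sketch that pulls P95 latency out of two saved reports, assuming they use the latency table layout shown earlier:

import re

def p95_by_provider(report_path: str) -> dict[str, float]:
    """Pull P95 latency (seconds) per provider out of a saved markdown report."""
    results = {}
    row = re.compile(r"^\|\s*(\w[\w-]*)\s*\|[^|]*\|[^|]*\|[^|]*\|\s*([\d.]+)s\s*\|")
    with open(report_path, encoding="utf-8") as handle:
        for line in handle:
            match = row.match(line)
            if match:
                results[match.group(1)] = float(match.group(2))
    return results

# Flag any provider whose P95 grew by more than 20% week over week
old = p95_by_provider("reports/benchmark-2025-11-10.md")
new = p95_by_provider("reports/benchmark-2025-11-17.md")
for provider, p95 in new.items():
    if provider in old and p95 > old[provider] * 1.2:
        print(f"regression: {provider} P95 {old[provider]}s -> {p95}s")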

Troubleshooting

Low Success Rate (< 95%)

Possible Causes

  • Invalid API keys

  • Network connectivity issues

  • Model timeout (increase timeout in config)

  • Malformed test comments

Solutions

  1. Check API key validity: echo $ANTHROPIC_API_KEY

  2. Test network: curl https://api.anthropic.com

  3. Increase timeout: --timeout 60

  4. Validate test dataset JSON schema

High P99 Latency (> 10s)

Possible Causes

  • Network congestion

  • Provider rate limiting

  • Cold start delays (first request)

  • Complex comments exceeding context limits

Solutions

  1. Increase warmup iterations: --warmup 10

  2. Reduce concurrent requests

  3. Split complex comments into simpler chunks

  4. Check provider status pages
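
If rate limiting turns out to be the cause, wrapping the provider call in a simple exponential backoff usually smooths out the tail; a minimal sketch (call_provider stands in for whichever client function the benchmark uses):

import random
import time

def call_with_backoff(call_provider, *, attempts: int = 5, base_delay: float = 1.0):
    """Retry a provider call with exponential backoff and a little jitter."""
    for attempt in range(attempts):
        try:
            return call_provider()
        except Exception as error:  # narrow this to the provider's rate-limit error in practice
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay:.1f}s")
            time.sleep(delay)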

GPU Not Detected (Ollama)

Possible Causes

  • CUDA/ROCm drivers not installed

  • GPU not accessible to Docker (if running in container)

  • Ollama not configured for GPU

Solutions

  1. Verify GPU: nvidia-smi or rocm-smi

  2. Check Ollama config: ollama ps

  3. Reinstall with GPU support: See Ollama Setup Guide

Out of Memory (Local Models)

Possible Causes

  • Model too large for available VRAM

  • Batch size too high

  • Memory leak in long-running benchmarks

Solutions

  1. Use smaller model: qwen2.5-coder:3b instead of 7b

  2. Reduce iterations: --iterations 50

  3. Restart Ollama between runs

Contributing

Found a performance issue or want to add benchmark scenarios?

  1. Create a new test comment in tests/benchmarks/sample_comments.json

  2. Add ground truth annotations

  3. Run the benchmark: python scripts/benchmark_llm.py

  4. Submit a PR with your findings

For questions or issues, see the GitHub Issues page.