# LLM Provider Performance Benchmarks

Comprehensive performance comparison of LLM providers for code review comment resolution.

## Overview

This benchmarking infrastructure allows you to systematically compare the performance of all supported LLM providers (Anthropic, OpenAI, Ollama, Claude CLI, Codex CLI) across multiple dimensions:

* **Latency**: Response time metrics (mean, median, P95, P99)
* **Throughput**: Requests per second
* **Accuracy**: Parsing success rate against ground truth
* **Cost**: Per-request and monthly estimates
* **GPU Performance**: Hardware utilization for local models

## Quick Start

### Prerequisites

* Python 3.12+ with virtual environment activated
* API keys configured for cloud providers (Anthropic, OpenAI)
* Ollama installed for local model testing (optional)

### Basic Usage

```bash
# Benchmark all providers with default settings (100 iterations)
python scripts/benchmark_llm.py --iterations 100

# Benchmark specific providers
python scripts/benchmark_llm.py --providers anthropic openai --iterations 50

# Use custom test dataset
python scripts/benchmark_llm.py --dataset my_comments.json --iterations 100

# Save report to custom location
python scripts/benchmark_llm.py --output reports/benchmark-2025-11-17.md
```

### Command-Line Options

```bash
python scripts/benchmark_llm.py --help

Options:
  --providers PROVIDERS [PROVIDERS ...]
                      LLM providers to benchmark (default: all)
                      Choices: anthropic, openai, ollama, claude-cli, codex-cli
  --iterations N      Number of iterations per provider (default: 100)
                      Recommended: 100+ for statistical significance
  --dataset PATH      Path to test dataset JSON file
                      (default: tests/benchmarks/sample_comments.json)
  --output PATH       Output markdown report path
                      (default: docs/performance-benchmarks.md)
  --warmup N          Number of warmup iterations (default: 5)
                      Warmup runs are not included in metrics
```

## Metrics Explained

### Latency Metrics

#### Mean Latency

* Average response time across all requests
* Good indicator of typical performance
* Affected by outliers

#### Median Latency (P50)

* Middle value when requests are sorted by response time
* More robust to outliers than the mean
* Better represents the "typical" user experience

#### P95 Latency

* 95% of requests complete faster than this time
* Indicates worst-case performance for most users
* **Acceptance Criteria**: < 5 seconds for all providers

#### P99 Latency

* 99% of requests complete faster than this time
* Captures tail latency and outliers
* Important for SLA guarantees

### Throughput

#### Requests Per Second

* How many requests the provider can handle
* Calculated as: `1 / mean_latency`
* Higher is better for high-volume deployments

### Accuracy Metrics

#### Success Rate

* Percentage of requests that returned valid responses
* Calculated as: `successful_parses / total_requests`
* Target: > 95% for production use

#### Average Confidence

* Mean confidence score from parsed responses
* Range: 0.0 - 1.0 (higher is better)
* Indicates model certainty in suggestions

### Cost Analysis

#### Total Cost

* Sum of all API costs for the benchmark run
* Free for local models (Ollama, Claude CLI, Codex CLI)

#### Cost Per Request

* Average cost per API call
* Important for budgeting at scale

#### Monthly Estimates

* Projected costs at 1K and 10K requests/month
* Helps plan production deployment budgets

### GPU Information (Local Models Only)

For Ollama and local models, the benchmark captures:

* GPU name and model
* Total memory available
* Driver version
* CUDA version (NVIDIA GPUs)
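The latency, throughput, and cost figures above are all derived from the raw per-request timings and API costs collected during a run. A minimal sketch of that aggregation, assuming you already have the raw numbers in hand (the function names are illustrative and not part of `scripts/benchmark_llm.py`):

```python
import statistics

def summarize_latencies(latencies_s: list[float]) -> dict[str, float]:
    """Aggregate raw per-request latencies (in seconds) into the metrics above."""
    ordered = sorted(latencies_s)
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    percentiles = statistics.quantiles(ordered, n=100)
    mean = statistics.mean(ordered)
    return {
        "mean_s": mean,
        "median_s": statistics.median(ordered),
        "p95_s": percentiles[94],
        "p99_s": percentiles[98],
        "throughput_rps": 1.0 / mean,  # requests per second = 1 / mean_latency
    }

def estimate_costs(total_cost_usd: float, requests: int) -> dict[str, float]:
    """Derive cost per request and monthly projections from one benchmark run."""
    per_request = total_cost_usd / requests if requests else 0.0
    return {
        "cost_per_request_usd": per_request,
        "monthly_1k_usd": per_request * 1_000,
        "monthly_10k_usd": per_request * 10_000,
    }
```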
## Test Dataset

The default benchmark dataset (`tests/benchmarks/sample_comments.json`) contains 30 realistic CodeRabbit-style review comments across three complexity levels:

### Simple Comments (10)

* Basic code suggestions
* Single-line fixes
* Simple formatting changes
* **Expected latency**: < 2s

### Medium Comments (10)

* Multi-line code changes
* Diff blocks with context
* Moderate refactoring suggestions
* **Expected latency**: 2-4s

### Complex Comments (10)

* Security vulnerability fixes
* Architecture refactoring
* Multi-file changes
* Multi-option recommendations
* **Expected latency**: 4-6s

### Ground Truth Annotations

Each comment includes ground truth data for accuracy validation:

```json
{
  "body": "```suggestion\ndef calculate_total(items):\n    return sum(item.price for item in items)\n```",
  "path": "src/cart.py",
  "line": 45,
  "ground_truth": {
    "changes": 1,
    "start_line": 45,
    "end_line": 46,
    "change_type": "modification",
    "confidence_threshold": 0.8
  }
}
```

### Creating Custom Datasets

To create your own benchmark dataset:

1. **Structure**: JSON file with three keys: `simple`, `medium`, `complex`
2. **Format**: Each category contains a list of comment objects
3. **Required fields**: `body`, `path`, `line`, `ground_truth`

Example custom dataset:

```json
{
  "simple": [
    {
      "body": "Fix typo: 'recieve' → 'receive'",
      "path": "src/utils.py",
      "line": 10,
      "ground_truth": {
        "changes": 1,
        "confidence_threshold": 0.9
      }
    }
  ],
  "medium": [...],
  "complex": [...]
}
```
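Before benchmarking against a custom dataset, it can help to check that every comment carries the required fields listed above. A small validation sketch (illustrative only; this helper is not part of the benchmark script):

```python
import json

REQUIRED_FIELDS = {"body", "path", "line", "ground_truth"}
CATEGORIES = ("simple", "medium", "complex")

def validate_dataset(path: str) -> list[str]:
    """Return a list of structural problems found in a custom benchmark dataset."""
    problems: list[str] = []
    with open(path) as f:
        data = json.load(f)
    for category in CATEGORIES:
        comments = data.get(category)
        if not isinstance(comments, list):
            problems.append(f"missing or non-list category: {category!r}")
            continue
        for i, comment in enumerate(comments):
            if not isinstance(comment, dict):
                problems.append(f"{category}[{i}] is not a JSON object")
                continue
            missing = REQUIRED_FIELDS - comment.keys()
            if missing:
                problems.append(f"{category}[{i}] is missing fields: {sorted(missing)}")
    return problems

if __name__ == "__main__":
    issues = validate_dataset("my_comments.json")
    print("\n".join(issues) or "Dataset looks structurally valid.")
```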
## Provider Comparison

### Anthropic (Claude)

#### Strengths

* Excellent accuracy on complex code understanding
* Strong security vulnerability detection
* Prompt caching reduces costs by 50-90%

#### Considerations

* Slightly higher per-request cost than OpenAI
* API latency depends on region

#### Best For

* Production deployments with repeated prompts
* Security-critical code reviews
* Complex architectural refactoring

### OpenAI (GPT-4o, GPT-4o-mini)

#### Strengths

* Fast response times (1-3s typical)
* GPT-4o-mini offers an excellent cost/performance ratio
* Wide model selection

#### Considerations

* No prompt caching (yet)
* Higher costs for GPT-4o at scale

#### Best For

* Speed-critical applications
* Cost-sensitive deployments (with the mini model)
* High-volume production systems

### Ollama (Local Models)

#### Strengths

* Zero per-request cost
* 100% privacy (no data leaves your infrastructure)
* GPU acceleration support

#### Considerations

* Requires local hardware (GPU recommended)
* Higher latency than cloud APIs
* Model quality varies (qwen2.5-coder:7b recommended)

#### Best For

* Privacy-first requirements (HIPAA, GDPR, confidential code)
* Cost-sensitive high-volume deployments
* Air-gapped environments

### Claude CLI

#### Strengths

* Free for development/testing
* Uses latest Claude models
* No API key management

#### Considerations

* Not suitable for production automation
* Rate limits apply
* Requires Claude desktop app

#### Best For

* Local development and testing
* Prototyping before API integration

### Codex CLI

#### Strengths

* Free for development/testing
* Direct integration with OpenAI Codex

#### Considerations

* Not suitable for production automation
* Limited to Codex model family

#### Best For

* Local development and testing
* Codex-specific workflows

## Interpreting Results

### Sample Benchmark Report

```markdown
## Latency Comparison

| Provider  | Model       | Mean | Median | P95  | P99  | Throughput |
|-----------|-------------|------|--------|------|------|------------|
| anthropic | claude-3-5  | 2.1s | 2.0s   | 3.5s | 4.2s | 0.48 req/s |
| openai    | gpt-4o-mini | 1.8s | 1.7s   | 2.9s | 3.5s | 0.56 req/s |
| ollama    | qwen2.5:7b  | 4.5s | 4.2s   | 7.1s | 8.5s | 0.22 req/s |
```

### What to Look For

#### Production Readiness

* ✅ P95 latency < 5s (all providers meet acceptance criteria)
* ✅ Success rate > 95%
* ✅ Average confidence > 0.7

#### Cost Optimization

* Compare monthly estimates at expected volume
* Consider Anthropic with prompt caching for repeated prompts
* Evaluate Ollama for high-volume scenarios

#### Performance vs Cost Trade-offs

* OpenAI gpt-4o-mini: Best cost/performance for cloud
* Anthropic: Best accuracy, cost-effective with caching
* Ollama: Best for privacy and zero ongoing costs

## Advanced Usage

### Benchmarking Specific Scenarios

```bash
# Benchmark only simple comments
python scripts/benchmark_llm.py --complexity simple --iterations 200

# Benchmark with custom warmup
python scripts/benchmark_llm.py --warmup 10 --iterations 100

# Benchmark with verbose output
python scripts/benchmark_llm.py --verbose
```

### Continuous Benchmarking

Integrate benchmarking into your CI/CD pipeline:

```yaml
# .github/workflows/benchmark.yml
name: LLM Benchmark

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/benchmark_llm.py --iterations 100
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add docs/performance-benchmarks.md
          git commit -m "chore: update weekly benchmarks"
          git push
```

### Regression Detection

Monitor key metrics over time to detect performance regressions:

```bash
# Save benchmark results with timestamp
python scripts/benchmark_llm.py --output "reports/benchmark-$(date +%Y-%m-%d).md"

# Compare with previous results
diff reports/benchmark-2025-11-10.md reports/benchmark-2025-11-17.md
```
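If you want an automated check rather than an eyeball `diff`, the P95 column can be extracted from two saved reports and compared against a tolerance. A rough sketch, assuming the reports keep the `Latency Comparison` table layout shown above (paths, threshold, and helper names are illustrative):

```python
from pathlib import Path

def p95_by_provider(report_path: str) -> dict[str, float]:
    """Pull P95 latency (seconds) per provider out of a saved benchmark report."""
    results: dict[str, float] = {}
    for line in Path(report_path).read_text().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows look like: provider | model | mean | median | p95 | p99 | throughput
        if len(cells) == 7 and cells[4].endswith("s"):
            try:
                results[cells[0]] = float(cells[4].rstrip("s"))
            except ValueError:
                continue  # fifth cell is not a numeric latency; skip the row
    return results

def find_regressions(old: dict[str, float], new: dict[str, float], tolerance: float = 0.20):
    """Flag providers whose P95 latency grew by more than `tolerance` (20% by default)."""
    return {
        provider: (old[provider], new[provider])
        for provider in old.keys() & new.keys()
        if new[provider] > old[provider] * (1 + tolerance)
    }

old = p95_by_provider("reports/benchmark-2025-11-10.md")
new = p95_by_provider("reports/benchmark-2025-11-17.md")
for provider, (before, after) in find_regressions(old, new).items():
    print(f"Regression: {provider} P95 {before:.1f}s -> {after:.1f}s")
```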
## Troubleshooting

### Low Success Rate (< 95%)

#### Possible Causes

* Invalid API keys
* Network connectivity issues
* Model timeout (increase timeout in config)
* Malformed test comments

#### Solutions

1. Check API key validity: `echo $ANTHROPIC_API_KEY`
2. Test network: `curl https://api.anthropic.com`
3. Increase timeout: `--timeout 60`
4. Validate the test dataset against the JSON schema

### High P99 Latency (> 10s)

#### Possible Causes

* Network congestion
* Provider rate limiting
* Cold start delays (first request)
* Complex comments exceeding context limits

#### Solutions

1. Increase warmup iterations: `--warmup 10`
2. Reduce concurrent requests
3. Split complex comments into simpler chunks
4. Check provider status pages

### GPU Not Detected (Ollama)

#### Possible Causes

* CUDA/ROCm drivers not installed
* GPU not accessible to Docker (if running in a container)
* Ollama not configured for GPU

#### Solutions

1. Verify the GPU: `nvidia-smi` or `rocm-smi`
2. Check the Ollama config: `ollama ps`
3. Reinstall with GPU support: see the [Ollama Setup Guide](ollama-setup.md)

### Out of Memory (Local Models)

#### Possible Causes

* Model too large for available VRAM
* Batch size too high
* Memory leak in long-running benchmarks

#### Solutions

1. Use a smaller model: `qwen2.5-coder:3b` instead of `7b`
2. Reduce iterations: `--iterations 50`
3. Restart Ollama between runs

## See Also

* [LLM Configuration Guide](llm-configuration.md) - Provider setup and configuration
* [Ollama Setup Guide](ollama-setup.md) - Local model installation
* [Main Configuration Guide](configuration.md) - General tool configuration
* [API Reference](api-reference.md) - Python API documentation
* [Getting Started Guide](getting-started.md) - Quick start tutorial

## Contributing

Found a performance issue or want to add benchmark scenarios?

1. Create a new test comment in `tests/benchmarks/sample_comments.json`
2. Add ground truth annotations
3. Run the benchmark: `python scripts/benchmark_llm.py`
4. Submit a PR with your findings

For questions or issues, see the [GitHub Issues](https://github.com/VirtualAgentics/review-bot-automator/issues) page.