LLM Provider Performance Benchmarks
Comprehensive performance comparison of LLM providers for code review comment resolution.
Overview
This benchmarking infrastructure allows you to systematically compare the performance of all supported LLM providers (Anthropic, OpenAI, Ollama, Claude CLI, Codex CLI) across multiple dimensions:
Latency: Response time metrics (mean, median, P95, P99)
Throughput: Requests per second
Accuracy: Parsing success rate against ground truth
Cost: Per-request and monthly estimates
GPU Performance: Hardware utilization for local models
Quick Start
Prerequisites
Python 3.12+ with virtual environment activated
API keys configured for cloud providers (Anthropic, OpenAI)
Ollama installed for local model testing (optional)
Basic Usage
# Benchmark all providers with default settings (100 iterations)
python scripts/benchmark_llm.py --iterations 100
# Benchmark specific providers
python scripts/benchmark_llm.py --providers anthropic openai --iterations 50
# Use custom test dataset
python scripts/benchmark_llm.py --dataset my_comments.json --iterations 100
# Save report to custom location
python scripts/benchmark_llm.py --output reports/benchmark-2025-11-17.md
Command-Line Options
python scripts/benchmark_llm.py --help
Options:
--providers PROVIDERS [PROVIDERS ...]
LLM providers to benchmark (default: all)
Choices: anthropic, openai, ollama, claude-cli, codex-cli
--iterations N Number of iterations per provider (default: 100)
Recommended: 100+ for statistical significance
--dataset PATH Path to test dataset JSON file
(default: tests/benchmarks/sample_comments.json)
--output PATH Output markdown report path
(default: docs/performance-benchmarks.md)
--warmup N Number of warmup iterations (default: 5)
Warmup runs are not included in metrics
Metrics Explained
Latency Metrics
Mean Latency
Average response time across all requests
Good indicator of typical performance
Affected by outliers
Median Latency (P50)
Middle value when sorted by response time
More robust to outliers than mean
Better represents “typical” user experience
P95 Latency
95% of requests complete faster than this time
Indicates worst-case performance for most users
Acceptance Criteria: < 5 seconds for all providers
P99 Latency
99% of requests complete faster than this time
Captures tail latency and outliers
Important for SLA guarantees
Throughput
Requests Per Second
How many requests the provider can handle
Calculated as `1 / mean_latency`
Higher is better for high-volume deployments
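As a rough illustration of how the latency and throughput figures above can be derived from raw per-request timings, here is a minimal sketch using Python's standard `statistics` module. The function name and sample timings are illustrative, not part of the benchmark script.

```python
import statistics

def summarize_latencies(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-request latencies (seconds) into the metrics described above."""
    # quantiles(n=100) returns 99 cut points; index 94 is P95, index 98 is P99.
    cuts = statistics.quantiles(latencies_s, n=100)
    mean = statistics.mean(latencies_s)
    return {
        "mean_s": mean,
        "median_s": statistics.median(latencies_s),
        "p95_s": cuts[94],
        "p99_s": cuts[98],
        "throughput_rps": 1.0 / mean,  # requests per second = 1 / mean latency
    }

# Illustrative timings only
print(summarize_latencies([1.8, 2.1, 2.0, 2.4, 3.9, 1.9, 2.2, 2.0, 2.3, 5.1]))
```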
Accuracy Metrics
Success Rate
Percentage of requests that returned valid responses
Calculated as `successful_parses / total_requests`
Target: > 95% for production use
Average Confidence
Mean confidence score from parsed responses
Range: 0.0 - 1.0 (higher is better)
Indicates model certainty in suggestions
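A similar sketch for aggregating the accuracy metrics, assuming each iteration records whether the response parsed successfully and, if so, a confidence score. The `parsed_ok` and `confidence` field names are assumptions for illustration, not the script's actual result format.

```python
def summarize_accuracy(results: list[dict]) -> dict[str, float]:
    """Aggregate per-request outcomes into success rate and average confidence."""
    successes = [r for r in results if r.get("parsed_ok")]
    success_rate = len(successes) / len(results)
    avg_confidence = (
        sum(r["confidence"] for r in successes) / len(successes) if successes else 0.0
    )
    return {"success_rate": success_rate, "avg_confidence": avg_confidence}

print(summarize_accuracy([
    {"parsed_ok": True, "confidence": 0.86},
    {"parsed_ok": True, "confidence": 0.91},
    {"parsed_ok": False},
]))  # success_rate ≈ 0.67, avg_confidence ≈ 0.885
```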
Cost Analysis
Total Cost
Sum of all API costs for the benchmark run
Zero for local and CLI-based providers (Ollama, Claude CLI, Codex CLI)
Cost Per Request
Average cost per API call
Important for budgeting at scale
Monthly Estimates
Projected costs at 1K and 10K requests/month
Helps plan production deployment budgets
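The monthly projections are simple multiplications of cost per request by expected volume. The sketch below is illustrative, and the $0.004 figure is a made-up example rather than a quoted provider price.

```python
def monthly_estimates(cost_per_request_usd: float,
                      volumes: tuple[int, ...] = (1_000, 10_000)) -> dict[int, float]:
    """Project monthly spend at the given request volumes."""
    return {volume: round(volume * cost_per_request_usd, 2) for volume in volumes}

# $0.004 per request is an illustrative number, not a quoted price.
print(monthly_estimates(0.004))  # {1000: 4.0, 10000: 40.0}
```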
GPU Information (Local Models Only)
For Ollama and other local models, the benchmark captures:
GPU name and model
Total memory available
Driver version
CUDA version (NVIDIA GPUs)
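On NVIDIA hardware, one way to gather this information is to query `nvidia-smi` directly. The query fields below are standard `nvidia-smi` options; the wrapper function is an illustrative sketch, not the benchmark script's actual implementation, and CUDA version reporting is omitted.

```python
import subprocess

def gpu_info() -> dict[str, str] | None:
    """Return basic NVIDIA GPU details via nvidia-smi, or None if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=name,memory.total,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    name, memory_total, driver = (field.strip() for field in out.splitlines()[0].split(","))
    return {"name": name, "memory_total": memory_total, "driver_version": driver}

print(gpu_info())
```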
Test Dataset
The default benchmark dataset (tests/benchmarks/sample_comments.json) contains 30 realistic CodeRabbit-style review comments across three complexity levels:
Simple Comments (10)
Basic code suggestions
Single-line fixes
Simple formatting changes
Expected latency: < 2s
Medium Comments (10)
Multi-line code changes
Diff blocks with context
Moderate refactoring suggestions
Expected latency: 2-4s
Complex Comments (10)
Security vulnerability fixes
Architecture refactoring
Multi-file changes
Multi-option recommendations
Expected latency: 4-6s
Ground Truth Annotations
Each comment includes ground truth data for accuracy validation:
{
"body": "```suggestion\ndef calculate_total(items):\n return sum(item.price for item in items)\n```",
"path": "src/cart.py",
"line": 45,
"ground_truth": {
"changes": 1,
"start_line": 45,
"end_line": 46,
"change_type": "modification",
"confidence_threshold": 0.8
}
}
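Conceptually, the accuracy pass compares each parsed suggestion against these annotations. The sketch below shows the general idea with hypothetical field names mirroring the JSON above; it is not the benchmark's literal scoring code.

```python
def matches_ground_truth(parsed: dict, ground_truth: dict) -> bool:
    """Illustrative check: does a parsed suggestion agree with its annotation?"""
    # Below-threshold confidence counts as a miss.
    if parsed.get("confidence", 0.0) < ground_truth.get("confidence_threshold", 0.0):
        return False
    # Any annotated field that is present must match exactly.
    for key in ("changes", "start_line", "end_line", "change_type"):
        if key in ground_truth and parsed.get(key) != ground_truth[key]:
            return False
    return True

parsed = {"changes": 1, "start_line": 45, "end_line": 46,
          "change_type": "modification", "confidence": 0.85}
annotation = {"changes": 1, "start_line": 45, "end_line": 46,
              "change_type": "modification", "confidence_threshold": 0.8}
print(matches_ground_truth(parsed, annotation))  # True
```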
Creating Custom Datasets
To create your own benchmark dataset:
Structure: JSON file with three top-level keys: `simple`, `medium`, `complex`
Format: Each category contains a list of comment objects
Required fields: `body`, `path`, `line`, `ground_truth`
Example custom dataset:
{
"simple": [
{
"body": "Fix typo: 'recieve' → 'receive'",
"path": "src/utils.py",
"line": 10,
"ground_truth": {
"changes": 1,
"confidence_threshold": 0.9
}
}
],
"medium": [...],
"complex": [...]
}
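Before running a long benchmark against a hand-written dataset, it can help to sanity-check the file shape. Below is a small validator sketch based on the structure described above; the helper script name is hypothetical, and the benchmark itself may perform different or stricter validation.

```python
import json
import sys

REQUIRED_CATEGORIES = ("simple", "medium", "complex")
REQUIRED_FIELDS = ("body", "path", "line", "ground_truth")

def validate_dataset(path: str) -> list[str]:
    """Return a list of problems found in a custom benchmark dataset."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    problems = []
    for category in REQUIRED_CATEGORIES:
        comments = data.get(category)
        if not isinstance(comments, list):
            problems.append(f"missing or non-list category: {category}")
            continue
        for index, comment in enumerate(comments):
            for field in REQUIRED_FIELDS:
                if field not in comment:
                    problems.append(f"{category}[{index}] is missing field: {field}")
    return problems

if __name__ == "__main__":
    issues = validate_dataset(sys.argv[1])
    print("\n".join(issues) or "dataset looks valid")
```

Run it as `python validate_dataset.py my_comments.json` before passing the file to `--dataset`.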
Provider Comparison
Anthropic (Claude)
Strengths
Excellent accuracy on complex code understanding
Strong security vulnerability detection
Prompt caching reduces costs by 50-90%
Considerations
Slightly higher per-request cost than OpenAI
API latency depends on region
Best For
Production deployments with repeated prompts
Security-critical code reviews
Complex architectural refactoring
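The prompt-caching saving mentioned above comes from marking the large, repeated portion of the prompt (for example, fixed resolution instructions) as cacheable. Here is a rough sketch using the Anthropic Python SDK's `cache_control` marker; the model name and prompt text are placeholders, and actual savings depend on how much of the prompt repeats across requests.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use the model you benchmark
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You resolve code review comments...",  # long, repeated instructions
            "cache_control": {"type": "ephemeral"},  # mark the shared prefix as cacheable
        }
    ],
    messages=[
        {"role": "user", "content": "Fix typo: 'recieve' -> 'receive' in src/utils.py"}
    ],
)
print(response.content[0].text)
```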
OpenAI (GPT-4o, GPT-4o-mini)
Strengths
Fast response times (1-3s typical)
GPT-4o-mini offers excellent cost/performance ratio
Wide model selection
Considerations
No prompt caching (yet)
Higher costs for GPT-4o at scale
Best For
Speed-critical applications
Cost-sensitive deployments (with mini model)
High-volume production systems
Ollama (Local Models)
Strengths
Zero per-request cost
100% privacy (no data leaves your infrastructure)
GPU acceleration support
Considerations
Requires local hardware (GPU recommended)
Higher latency than cloud APIs
Model quality varies (qwen2.5-coder:7b recommended)
Best For
Privacy-first requirements (HIPAA, GDPR, confidential code)
Cost-sensitive high-volume deployments
Air-gapped environments
Claude CLI
Strengths
Free for development/testing
Uses latest Claude models
No API key management
Considerations
Not suitable for production automation
Rate limits apply
Requires Claude desktop app
Best For
Local development and testing
Prototyping before API integration
Codex CLI
Strengths
Free for development/testing
Direct integration with OpenAI Codex
Considerations
Not suitable for production automation
Limited to Codex model family
Best For
Local development and testing
Codex-specific workflows
Interpreting Results
Sample Benchmark Report
## Latency Comparison
| Provider | Model | Mean | Median | P95 | P99 | Throughput |
|------------|----------------|--------|--------|-------|-------|------------|
| anthropic | claude-3-5 | 2.1s | 2.0s | 3.5s | 4.2s | 0.48 req/s |
| openai | gpt-4o-mini | 1.8s | 1.7s | 2.9s | 3.5s | 0.56 req/s |
| ollama | qwen2.5:7b | 4.5s | 4.2s | 7.1s | 8.5s | 0.22 req/s |
What to Look For
Production Readiness
✅ P95 latency < 5s (the acceptance criterion; in the sample above, the cloud providers meet it while the local Ollama run does not)
✅ Success rate > 95%
✅ Average confidence > 0.7
Cost Optimization
Compare monthly estimates at expected volume
Consider Anthropic with prompt caching for repeated prompts
Evaluate Ollama for high-volume scenarios
Performance vs Cost Trade-offs
OpenAI gpt-4o-mini: Best cost/performance for cloud
Anthropic: Best accuracy, cost-effective with caching
Ollama: Best for privacy and zero ongoing costs
Advanced Usage
Benchmarking Specific Scenarios
# Benchmark only simple comments
python scripts/benchmark_llm.py --complexity simple --iterations 200
# Benchmark with custom warmup
python scripts/benchmark_llm.py --warmup 10 --iterations 100
# Benchmark with verbose output
python scripts/benchmark_llm.py --verbose
Continuous Benchmarking
Integrate benchmarking into your CI/CD pipeline:
# .github/workflows/benchmark.yml
name: LLM Benchmark
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run benchmarks
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python scripts/benchmark_llm.py --iterations 100
          git add docs/performance-benchmarks.md
          git commit -m "chore: update weekly benchmarks"
          git push
Regression Detection
Monitor key metrics over time to detect performance regressions:
# Save benchmark results with timestamp
python scripts/benchmark_llm.py --output "reports/benchmark-$(date +%Y-%m-%d).md"
# Compare with previous results
diff reports/benchmark-2025-11-10.md reports/benchmark-2025-11-17.md
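For a check that is more targeted than a plain `diff`, you can track a few headline numbers per run and flag meaningful movement. The sketch below assumes you extract those numbers yourself (the report is markdown, so the dictionaries are filled in by hand or by your own parser); the metric names and the 10% threshold are arbitrary examples.

```python
def detect_regressions(previous: dict[str, float], current: dict[str, float],
                       threshold: float = 0.10) -> list[str]:
    """Flag latency-style metrics that grew by more than `threshold` (10% by default)."""
    flagged = []
    for metric, old_value in previous.items():
        new_value = current.get(metric)
        if new_value is None or old_value == 0:
            continue
        if (new_value - old_value) / old_value > threshold:
            flagged.append(f"{metric}: {old_value:.2f}s -> {new_value:.2f}s")
    return flagged

# Values transcribed by hand from two weekly reports (illustrative numbers).
last_week = {"anthropic_p95": 3.5, "openai_p95": 2.9, "ollama_p95": 7.1}
this_week = {"anthropic_p95": 4.3, "openai_p95": 2.8, "ollama_p95": 7.0}
print(detect_regressions(last_week, this_week))  # ['anthropic_p95: 3.50s -> 4.30s']
```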
Troubleshooting
Low Success Rate (< 95%)
Possible Causes
Invalid API keys
Network connectivity issues
Model timeout (increase timeout in config)
Malformed test comments
Solutions
Check API key validity: `echo $ANTHROPIC_API_KEY`
Test network connectivity: `curl https://api.anthropic.com`
Increase the timeout: `--timeout 60`
Validate the test dataset JSON schema
High P99 Latency (> 10s)
Possible Causes
Network congestion
Provider rate limiting
Cold start delays (first request)
Complex comments exceeding context limits
Solutions
Increase warmup iterations: `--warmup 10`
Reduce concurrent requests
Split complex comments into simpler chunks
Check provider status pages
GPU Not Detected (Ollama)
Possible Causes
CUDA/ROCm drivers not installed
GPU not accessible to Docker (if running in container)
Ollama not configured for GPU
Solutions
Verify the GPU is visible: `nvidia-smi` or `rocm-smi`
Check the Ollama configuration: `ollama ps`
Reinstall with GPU support: see the Ollama Setup Guide
Out of Memory (Local Models)
Possible Causes
Model too large for available VRAM
Batch size too high
Memory leak in long-running benchmarks
Solutions
Use a smaller model: `qwen2.5-coder:3b` instead of `7b`
Reduce iterations: `--iterations 50`
Restart Ollama between runs
See Also
LLM Configuration Guide - Provider setup and configuration
Ollama Setup Guide - Local model installation
Main Configuration Guide - General tool configuration
API Reference - Python API documentation
Getting Started Guide - Quick start tutorial
Contributing
Found a performance issue or want to add benchmark scenarios?
Create a new test comment in `tests/benchmarks/sample_comments.json`
Add ground truth annotations
Run the benchmark: `python scripts/benchmark_llm.py`
Submit a PR with your findings
For questions or issues, see the GitHub Issues page.