# Performance Tuning Guide
This guide covers optimization strategies for maximizing throughput and minimizing latency when using the Review Bot Automator.
## Overview

Performance depends on several factors:

- **Parallel processing**: Worker count and rate limiting
- **LLM provider**: Latency and throughput characteristics
- **Caching**: Hit rate and warm-up strategies
- **Network**: Connection pooling and timeouts
## Parallel Processing Optimization

### Worker Count Recommendations
| PR Size (Comments) | Recommended Workers | Rationale |
|---|---|---|
| 1-10 | 2-4 | Low overhead, minimal benefit from more workers |
| 10-50 | 4-8 | Good parallelization benefit |
| 50-100 | 8-12 | Higher throughput, watch for rate limits |
| 100+ | 12-16 | Maximum throughput, requires careful rate limiting |
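As a rough illustration, the thresholds above can be turned into a helper that suggests a starting `--max-workers` value from the comment count. This is a sketch, not part of the tool; the cutoffs simply mirror the table.

```python
def suggest_max_workers(comment_count: int) -> int:
    """Suggest a starting --max-workers value from PR comment count.

    Thresholds mirror the recommendations table above; tune downward
    if you start hitting provider rate limits.
    """
    if comment_count <= 10:
        return 4
    if comment_count <= 50:
        return 8
    if comment_count <= 100:
        return 12
    return 16


print(suggest_max_workers(35))  # -> 8
```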
### Configuration

```yaml
parallel:
  enabled: true
  max_workers: 8
```

Or via CLI:

```bash
pr-resolve apply 123 --parallel --max-workers 8
```
### Monitoring Parallel Performance

```bash
pr-resolve apply 123 --parallel --max-workers 8 --show-metrics
```

Check the metrics output:

- **Latency p95 vs p50**: A large gap indicates bottlenecks
- **Success rate**: Should be > 99%
- **Cache hit rate**: Higher is better for parallel workloads
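To make the p95 vs p50 comparison concrete, here is a minimal sketch that computes both percentiles from a list of per-request latencies; the numbers are made up for illustration.

```python
import statistics

# Hypothetical per-request latencies in seconds (e.g. collected from --show-metrics runs)
latencies = [0.4, 0.5, 0.6, 0.5, 0.7, 0.6, 2.8, 0.5, 0.6, 3.1]

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"p50={p50:.2f}s  p95={p95:.2f}s")
# A p95 several times larger than p50 points to a long tail:
# rate-limit retries, cold cache entries, or provider slowdowns.
```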
## LLM Provider Optimization

### Provider Latency Comparison
| Provider | Typical p50 | Typical p95 | Best For |
|---|---|---|---|
| Ollama (local) | 0.5-2.0s | 2.0-5.0s | Privacy, no network latency |
| Claude CLI | 0.3-1.0s | 1.0-3.0s | Quality + speed balance |
| Anthropic API | 0.2-0.8s | 0.8-2.0s | Lowest latency |
| OpenAI API | 0.3-1.0s | 1.0-3.0s | Good balance |
### Model Selection for Speed

Fastest models by provider:

```yaml
# Anthropic - Use Haiku for speed
llm:
  provider: anthropic
  model: claude-haiku-4-20250514
```

```yaml
# OpenAI - Use mini for speed
llm:
  provider: openai
  model: gpt-4o-mini
```

```yaml
# Ollama - Use a smaller model
llm:
  provider: ollama
  model: qwen2.5-coder:7b  # Faster than llama3.3:70b
```
## Cache Optimization

### Cache Hit Rate Targets

- **< 20%**: Consider warming the cache
- **20-40%**: Normal for varied PRs
- **40-60%**: Good cache effectiveness
- **> 60%**: Excellent (common patterns)
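Hit rate here is simply hits divided by total lookups. A tiny sketch, with made-up numbers:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction of total lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# 240 hits out of 400 lookups -> 60%, i.e. the "excellent" band above
print(f"{hit_rate(240, 160):.0%}")
```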
### Cache Warming

Pre-populate the cache to avoid cold starts:

```python
import json
from pathlib import Path

from review_bot_automator.llm.cache.prompt_cache import PromptCache

cache = PromptCache()

# Load entries from a previous export
entries = json.loads(Path("cache_export.json").read_text())
loaded, skipped = cache.warm_cache(entries)
print(f"Loaded {loaded} entries, skipped {skipped} duplicates")
```
### Cache Configuration

```yaml
llm:
  cache_enabled: true
  # Cache automatically manages size with LRU eviction
```
## Rate Limit Handling

### Retry Configuration

```yaml
llm:
  retry_on_rate_limit: true
  retry_max_attempts: 5
  retry_base_delay: 2.0  # Exponential backoff
```
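Conceptually, these settings drive an exponential backoff loop: each retry waits roughly `retry_base_delay * 2^attempt`. The sketch below only illustrates that interaction; the tool's actual retry logic is internal, and the exception type here is a stand-in.

```python
import random
import time


def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry `call` with exponential backoff, mirroring the config above."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for a rate-limit error type
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)  # jitter
            time.sleep(delay)  # ~2s, 4s, 8s, 16s ... plus jitter
```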
### Rate Limit Best Practices

1. **Reduce workers when hitting limits:**

   ```bash
   export CR_MAX_WORKERS=4  # Down from 8
   ```

2. **Use the circuit breaker** (see the sketch after this list):

   ```yaml
   llm:
     circuit_breaker_enabled: true
     circuit_breaker_threshold: 5
   ```

3. **Monitor rate limit errors:**

   ```bash
   pr-resolve apply 123 --show-metrics --log-level INFO
   # Watch for "Rate limit exceeded" in logs
   ```
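A circuit breaker of this kind typically opens after a threshold of consecutive failures and skips further calls until a cooldown elapses. The following is a conceptual sketch, not the tool's implementation; the `cooldown` parameter is assumed for illustration.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; conceptual sketch only."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open; skipping call")
            self.opened_at = None  # cooldown elapsed, allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```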
## Network Optimization

### Connection Pooling

Connection pooling is enabled by default for all HTTP-based providers. This reduces connection overhead when making multiple requests.
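As a general illustration of why pooling helps (not the tool's internals), reusing a single `requests.Session` keeps TCP/TLS connections alive across calls to the same host:

```python
import requests

# One Session reuses the underlying connection pool, avoiding a fresh
# TCP + TLS handshake for every request to the same host.
session = requests.Session()
for _ in range(3):
    resp = session.get("https://api.example.com/health", timeout=10)  # placeholder URL
    print(resp.status_code)
```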
### Timeout Configuration

Timeouts are not currently user-configurable. Default timeouts:

- Connect timeout: 30 seconds
- Read timeout: 120 seconds

If you experience timeouts:

1. Check network connectivity
2. Try a different provider
3. Reduce parallel workers
## GPU Acceleration (Ollama)

### Enabling GPU

GPU acceleration is automatic when:

- CUDA/ROCm/Metal drivers are installed
- Ollama detects the GPU
### Verifying GPU Usage

```bash
# NVIDIA
nvidia-smi  # Watch for the ollama process

# AMD
rocm-smi

# Apple Silicon
# GPU usage is automatic
```
### GPU vs CPU Performance

| Model | CPU (tokens/s) | GPU (tokens/s) | Speedup |
|---|---|---|---|
| llama3.2:3b | ~20 | ~100 | 5x |
| llama3.1:8b | ~10 | ~80 | 8x |
| llama3.3:70b | ~2 | ~30 | 15x |
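You can measure your own tokens/s figures against a local Ollama server. The sketch below assumes Ollama is running on its default port with the model already pulled, and uses the `eval_count` and `eval_duration` fields returned by Ollama's generate API.

```python
import requests

# Assumes a local Ollama server on the default port with llama3.2:3b pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
data = resp.json()
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)  # eval_duration is in ns
print(f"{tokens_per_sec:.1f} tokens/s")
```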
## Memory Optimization

### Reducing Memory Usage

1. **Use a smaller model:**

   ```bash
   export CR_LLM_MODEL=llama3.2:3b  # Instead of 70b
   ```

2. **Reduce workers:**

   ```bash
   export CR_MAX_WORKERS=2
   ```

3. **Process sequentially:**

   ```bash
   export CR_PARALLEL=false
   ```
### Memory Requirements by Model

| Model | Minimum RAM | Recommended RAM |
|---|---|---|
| llama3.2:3b | 4 GB | 8 GB |
| llama3.1:8b | 8 GB | 16 GB |
| qwen2.5-coder:7b | 8 GB | 16 GB |
| codestral:22b | 24 GB | 32 GB |
| llama3.3:70b | 48 GB | 64 GB |
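To sanity-check a model choice against the machine's memory, something like the following could compare the table's minimum RAM figures with what is actually installed; `psutil` is an extra dependency and the mapping is copied from the table above.

```python
import psutil

# Minimum RAM (GB) per model, copied from the table above.
MIN_RAM_GB = {
    "llama3.2:3b": 4,
    "llama3.1:8b": 8,
    "qwen2.5-coder:7b": 8,
    "codestral:22b": 24,
    "llama3.3:70b": 48,
}

total_gb = psutil.virtual_memory().total / 1024**3
for model, required in MIN_RAM_GB.items():
    status = "ok" if total_gb >= required else "too large for this machine"
    print(f"{model}: needs {required} GB, have {total_gb:.0f} GB -> {status}")
```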
## Benchmarking

### Running Benchmarks

```bash
# Quick benchmark
python scripts/benchmark_llm.py --iterations 10

# Comprehensive benchmark
python scripts/benchmark_llm.py --iterations 100 --providers all
```
### Key Metrics to Track

- **Tokens per second**: Model inference speed
- **Time to first token**: Perceived latency
- **Request latency p95**: Worst-case performance
- **Success rate**: Reliability
## Performance Profiles

### Development (Fast Iteration)

```yaml
llm:
  provider: ollama
  model: llama3.2:3b  # Fastest local
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4
```

### CI/CD (Balanced)

```yaml
llm:
  provider: anthropic
  model: claude-haiku-4-20250514  # Fast + accurate
  cache_enabled: true
  cost_budget: 2.0
parallel:
  enabled: true
  max_workers: 8
```

### Production (Maximum Quality)

```yaml
llm:
  provider: anthropic
  model: claude-sonnet-4-5  # Highest quality
  confidence_threshold: 0.7
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4  # Careful rate limiting
```
## Troubleshooting Performance

### Slow Processing

1. Check metrics with `--show-metrics`
2. Review p95 latency; if it is high, check the provider
3. Check the cache hit rate; if it is low, consider warming the cache
4. Reduce workers if hitting rate limits
### High Memory Usage

- Use a smaller model
- Reduce the worker count
- Process sequentially for very large PRs

### Rate Limit Errors

- Reduce the worker count
- Increase the retry delay
- Consider using a different provider
## See Also

- Parallel Processing - Detailed parallel configuration
- LLM Configuration - Full LLM setup guide
- Cost Estimation - Managing API costs
- Troubleshooting - Common issues and solutions