# Performance Tuning Guide
This guide covers optimization strategies for maximizing throughput and minimizing latency when using the Review Bot Automator.
## Overview

Performance depends on several factors:

- **Parallel processing**: Worker count and rate limiting
- **LLM provider**: Latency and throughput characteristics
- **Caching**: Hit rate and warm-up strategies
- **Network**: Connection pooling and timeouts
## Parallel Processing Optimization

### Worker Count Recommendations
| PR Size (Comments) | Recommended Workers | Rationale |
|---|---|---|
| 1-10 | 2-4 | Low overhead, minimal benefit from more workers |
| 10-50 | 4-8 | Good parallelization benefit |
| 50-100 | 8-12 | Higher throughput, watch for rate limits |
| 100+ | 12-16 | Maximum throughput, requires careful rate limiting |
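As a rough illustration, the thresholds above can be turned into a helper that suggests a starting `--max-workers` value from the comment count. This is a sketch, not part of the tool; the cutoffs simply mirror the table.

```python
def suggest_max_workers(comment_count: int) -> int:
    """Suggest a starting --max-workers value from PR comment count.

    Thresholds mirror the recommendations table above; tune downward
    if you start hitting provider rate limits.
    """
    if comment_count <= 10:
        return 4
    if comment_count <= 50:
        return 8
    if comment_count <= 100:
        return 12
    return 16


print(suggest_max_workers(35))  # -> 8
```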
### Configuration

```yaml
parallel:
  enabled: true
  max_workers: 8
```

Or via CLI:

```bash
pr-resolve apply 123 --parallel --max-workers 8
```
### Monitoring Parallel Performance

```bash
pr-resolve apply 123 --parallel --max-workers 8 --show-metrics
```

Check the metrics output:

- **Latency p95 vs p50**: A large gap indicates bottlenecks
- **Success rate**: Should be > 99%
- **Cache hit rate**: Higher is better for parallel workloads
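To make the p95 vs p50 comparison concrete, here is a minimal sketch that computes both percentiles from a list of per-request latencies; the numbers are made up for illustration.

```python
import statistics

# Hypothetical per-request latencies in seconds (e.g. collected from --show-metrics runs)
latencies = [0.4, 0.5, 0.6, 0.5, 0.7, 0.6, 2.8, 0.5, 0.6, 3.1]

cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]
print(f"p50={p50:.2f}s  p95={p95:.2f}s")
# A p95 several times larger than p50 points to a long tail:
# rate-limit retries, cold cache entries, or provider slowdowns.
```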
## LLM Provider Optimization

### Provider Latency Comparison
| Provider | Typical p50 | Typical p95 | Best For |
|---|---|---|---|
| Ollama (local) | 0.5-2.0s | 2.0-5.0s | Privacy, no network latency |
| Claude CLI | 0.3-1.0s | 1.0-3.0s | Quality + speed balance |
| Anthropic API | 0.2-0.8s | 0.8-2.0s | Lowest latency |
| OpenAI API | 0.3-1.0s | 1.0-3.0s | Good balance |
### Model Selection for Speed

Fastest models by provider:

```yaml
# Anthropic - Use Haiku for speed
llm:
  provider: anthropic
  model: claude-haiku-4-20250514
```

```yaml
# OpenAI - Use mini for speed
llm:
  provider: openai
  model: gpt-4o-mini
```

```yaml
# Ollama - Use a smaller model
llm:
  provider: ollama
  model: qwen2.5-coder:7b  # Faster than llama3.3:70b
```
## Cache Optimization

### Cache Hit Rate Targets

- **< 20%**: Consider warming the cache
- **20-40%**: Normal for varied PRs
- **40-60%**: Good cache effectiveness
- **> 60%**: Excellent (common patterns)
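Hit rate here is simply hits divided by total lookups. A tiny sketch, with made-up numbers:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction of total lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# 240 hits out of 400 lookups -> 60%, i.e. the "excellent" band above
print(f"{hit_rate(240, 160):.0%}")
```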
### Cache Warming

Pre-populate the cache to avoid cold starts:

```python
import json
from pathlib import Path

from review_bot_automator.llm.cache.prompt_cache import PromptCache

cache = PromptCache()

# Load entries from a previous export
entries = json.loads(Path("cache_export.json").read_text())
loaded, skipped = cache.warm_cache(entries)
print(f"Loaded {loaded} entries, skipped {skipped} duplicates")
```
### Cache Configuration

```yaml
llm:
  cache_enabled: true
  # Cache automatically manages size with LRU eviction
```
## Rate Limit Handling

### Retry Configuration

```yaml
llm:
  retry_on_rate_limit: true
  retry_max_attempts: 5
  retry_base_delay: 2.0  # Exponential backoff
```
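Conceptually, these settings drive an exponential backoff loop: each retry waits roughly `retry_base_delay * 2^attempt`. The sketch below only illustrates that interaction; the tool's actual retry logic is internal, and the exception type here is a stand-in.

```python
import random
import time


def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 2.0):
    """Retry `call` with exponential backoff, mirroring the config above."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:  # stand-in for a rate-limit error type
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)  # jitter
            time.sleep(delay)  # ~2s, 4s, 8s, 16s ... plus jitter
```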
### Rate Limit Best Practices

1. **Reduce workers when hitting limits:**

   ```bash
   export CR_MAX_WORKERS=4  # Down from 8
   ```

2. **Use the circuit breaker** (see the sketch after this list):

   ```yaml
   llm:
     circuit_breaker_enabled: true
     circuit_breaker_threshold: 5
   ```

3. **Monitor rate limit errors:**

   ```bash
   pr-resolve apply 123 --show-metrics --log-level INFO
   # Watch for "Rate limit exceeded" in logs
   ```
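A circuit breaker of this kind typically opens after a threshold of consecutive failures and skips further calls until a cooldown elapses. The following is a conceptual sketch, not the tool's implementation; the `cooldown` parameter is assumed for illustration.

```python
import time


class CircuitBreaker:
    """Open after `threshold` consecutive failures; conceptual sketch only."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open; skipping call")
            self.opened_at = None  # cooldown elapsed, allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```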
## Network Optimization

### Connection Pooling

Connection pooling is enabled by default for all HTTP-based providers. This reduces connection overhead when making multiple requests.
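As a general illustration of why pooling helps (not the tool's internals), reusing a single `requests.Session` keeps TCP/TLS connections alive across calls to the same host:

```python
import requests

# One Session reuses the underlying connection pool, avoiding a fresh
# TCP + TLS handshake for every request to the same host.
session = requests.Session()
for _ in range(3):
    resp = session.get("https://api.example.com/health", timeout=10)  # placeholder URL
    print(resp.status_code)
```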
### Timeout Configuration

Timeouts are not currently user-configurable. Default timeouts:

- Connect timeout: 30 seconds
- Read timeout: 120 seconds

If you experience timeouts:

1. Check network connectivity
2. Try a different provider
3. Reduce parallel workers
## GPU Acceleration (Ollama)

### Enabling GPU

GPU acceleration is automatic when:

- CUDA/ROCm/Metal drivers are installed
- Ollama detects the GPU
### Verifying GPU Usage

```bash
# NVIDIA
nvidia-smi  # Watch for the ollama process

# AMD
rocm-smi

# Apple Silicon
# GPU usage is automatic
```
### GPU vs CPU Performance

| Model | CPU (tokens/s) | GPU (tokens/s) | Speedup |
|---|---|---|---|
| llama3.2:3b | ~20 | ~100 | 5x |
| llama3.1:8b | ~10 | ~80 | 8x |
| llama3.3:70b | ~2 | ~30 | 15x |
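You can measure your own tokens/s figures against a local Ollama server. The sketch below assumes Ollama is running on its default port with the model already pulled, and uses the `eval_count` and `eval_duration` fields returned by Ollama's generate API.

```python
import requests

# Assumes a local Ollama server on the default port with llama3.2:3b pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:3b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
data = resp.json()
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)  # eval_duration is in ns
print(f"{tokens_per_sec:.1f} tokens/s")
```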
## Memory Optimization

### Reducing Memory Usage

1. **Use a smaller model:**

   ```bash
   export CR_LLM_MODEL=llama3.2:3b  # Instead of 70b
   ```

2. **Reduce workers:**

   ```bash
   export CR_MAX_WORKERS=2
   ```

3. **Process sequentially:**

   ```bash
   export CR_PARALLEL=false
   ```
### Memory Requirements by Model

| Model | Minimum RAM | Recommended RAM |
|---|---|---|
| llama3.2:3b | 4 GB | 8 GB |
| llama3.1:8b | 8 GB | 16 GB |
| qwen2.5-coder:7b | 8 GB | 16 GB |
| codestral:22b | 24 GB | 32 GB |
| llama3.3:70b | 48 GB | 64 GB |
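To sanity-check a model choice against the machine's memory, something like the following could compare the table's minimum RAM figures with what is actually installed; `psutil` is an extra dependency and the mapping is copied from the table above.

```python
import psutil

# Minimum RAM (GB) per model, copied from the table above.
MIN_RAM_GB = {
    "llama3.2:3b": 4,
    "llama3.1:8b": 8,
    "qwen2.5-coder:7b": 8,
    "codestral:22b": 24,
    "llama3.3:70b": 48,
}

total_gb = psutil.virtual_memory().total / 1024**3
for model, required in MIN_RAM_GB.items():
    status = "ok" if total_gb >= required else "too large for this machine"
    print(f"{model}: needs {required} GB, have {total_gb:.0f} GB -> {status}")
```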
## Benchmarking

### Running Benchmarks

```bash
# Quick benchmark
python scripts/benchmark_llm.py --iterations 10

# Comprehensive benchmark
python scripts/benchmark_llm.py --iterations 100 --providers all
```
### Key Metrics to Track

- **Tokens per second**: Model inference speed
- **Time to first token**: Perceived latency
- **Request latency p95**: Worst-case performance
- **Success rate**: Reliability
## Performance Profiles

### Development (Fast Iteration)

```yaml
llm:
  provider: ollama
  model: llama3.2:3b  # Fastest local
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4
```

### CI/CD (Balanced)

```yaml
llm:
  provider: anthropic
  model: claude-haiku-4-20250514  # Fast + accurate
  cache_enabled: true
  cost_budget: 2.0
parallel:
  enabled: true
  max_workers: 8
```

### Production (Maximum Quality)

```yaml
llm:
  provider: anthropic
  model: claude-sonnet-4-5  # Highest quality
  confidence_threshold: 0.7
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4  # Careful rate limiting
```
## Troubleshooting Performance

### Slow Processing

1. Check metrics with `--show-metrics`
2. Review p95 latency; if it is high, check the provider
3. Check the cache hit rate; if it is low, consider warming the cache
4. Reduce workers if hitting rate limits
### High Memory Usage

- Use a smaller model
- Reduce the worker count
- Process sequentially for very large PRs

### Rate Limit Errors

- Reduce the worker count
- Increase the retry delay
- Consider using a different provider
## See Also

- Parallel Processing - Detailed parallel configuration
- LLM Configuration - Full LLM setup guide
- Cost Estimation - Managing API costs
- Troubleshooting - Common issues and solutions