# Performance Tuning Guide

This guide covers optimization strategies for maximizing throughput and minimizing latency when using the Review Bot Automator.

## Overview

Performance depends on several factors:

* **Parallel processing**: Worker count and rate limiting
* **LLM provider**: Latency and throughput characteristics
* **Caching**: Hit rate and warm-up strategies
* **Network**: Connection pooling and timeouts

## Parallel Processing Optimization

### Worker Count Recommendations

| PR Size (Comments) | Recommended Workers | Rationale |
|--------------------|---------------------|-----------|
| 1-10 | 2-4 | Low overhead, minimal benefit from more workers |
| 10-50 | 4-8 | Good parallelization benefit |
| 50-100 | 8-12 | Higher throughput, watch for rate limits |
| 100+ | 12-16 | Maximum throughput, requires careful rate limiting |

### Configuration

```yaml
parallel:
  enabled: true
  max_workers: 8
```

Or via CLI:

```bash
pr-resolve apply 123 --parallel --max-workers 8
```

### Monitoring Parallel Performance

```bash
pr-resolve apply 123 --parallel --max-workers 8 --show-metrics
```

Check the metrics output:

* **Latency p95 vs p50**: A large gap indicates bottlenecks
* **Success rate**: Should be > 99%
* **Cache hit rate**: Higher is better for parallel workloads

## LLM Provider Optimization

### Provider Latency Comparison

| Provider | Typical p50 | Typical p95 | Best For |
|----------|-------------|-------------|----------|
| Ollama (local) | 0.5-2.0s | 2.0-5.0s | Privacy, no network latency |
| Claude CLI | 0.3-1.0s | 1.0-3.0s | Quality + speed balance |
| Anthropic API | 0.2-0.8s | 0.8-2.0s | Lowest latency |
| OpenAI API | 0.3-1.0s | 1.0-3.0s | Good balance |

### Model Selection for Speed

**Fastest models by provider:**

```yaml
# Anthropic - Use Haiku for speed
llm:
  provider: anthropic
  model: claude-haiku-4-20250514

# OpenAI - Use mini for speed
llm:
  provider: openai
  model: gpt-4o-mini

# Ollama - Use a smaller model
llm:
  provider: ollama
  model: qwen2.5-coder:7b  # Faster than llama3.3:70b
```

## Cache Optimization

### Cache Hit Rate Targets

* **< 20%**: Consider warming the cache
* **20-40%**: Normal for varied PRs
* **40-60%**: Good cache effectiveness
* **> 60%**: Excellent (common patterns)

### Cache Warming

Pre-populate the cache to avoid cold-start latency:

```python
import json
from pathlib import Path

from review_bot_automator.llm.cache.prompt_cache import PromptCache

cache = PromptCache()

# Load entries from a previous export
entries = json.loads(Path("cache_export.json").read_text())
loaded, skipped = cache.warm_cache(entries)
print(f"Loaded {loaded} entries, skipped {skipped} duplicates")
```

### Cache Configuration

```yaml
llm:
  cache_enabled: true
  # Cache automatically manages size with LRU eviction
```

## Rate Limit Handling

### Retry Configuration

```yaml
llm:
  retry_on_rate_limit: true
  retry_max_attempts: 5
  retry_base_delay: 2.0  # Exponential backoff
```
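To make these settings concrete, the sketch below prints the wait times they produce, assuming the common `delay = retry_base_delay * 2 ** (attempt - 1)` doubling scheme; the automator's exact backoff formula and any jitter it applies may differ.

```python
# Illustrative sketch only: assumes plain doubling backoff with no jitter.
retry_max_attempts = 5
retry_base_delay = 2.0  # seconds

for attempt in range(1, retry_max_attempts + 1):
    delay = retry_base_delay * 2 ** (attempt - 1)
    print(f"Attempt {attempt}: wait {delay:.1f}s before retrying")

# Prints 2.0s, 4.0s, 8.0s, 16.0s, 32.0s: roughly 62s of backoff in the worst case.
```

Under these assumptions, a single rate-limited request can spend about a minute in backoff, which is why reducing the worker count (see the best practices below) is often the more effective first step.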
### Rate Limit Best Practices

1. **Reduce workers when hitting limits:**

   ```bash
   export CR_MAX_WORKERS=4  # Down from 8
   ```

2. **Use the circuit breaker:**

   ```yaml
   llm:
     circuit_breaker_enabled: true
     circuit_breaker_threshold: 5
   ```

3. **Monitor rate limit errors:**

   ```bash
   pr-resolve apply 123 --show-metrics --log-level INFO
   # Watch for "Rate limit exceeded" in logs
   ```

## Network Optimization

### Connection Pooling

Connection pooling is enabled by default for all HTTP-based providers. This reduces connection overhead when making multiple requests.

### Timeout Configuration

Timeouts are not currently user-configurable. Default timeouts:

* **Connect timeout**: 30 seconds
* **Read timeout**: 120 seconds

If experiencing timeouts:

1. Check network connectivity
2. Try a different provider
3. Reduce parallel workers

## GPU Acceleration (Ollama)

### Enabling GPU

GPU acceleration is automatic when:

1. CUDA/ROCm/Metal drivers are installed
2. Ollama detects the GPU

### Verifying GPU Usage

```bash
# NVIDIA
nvidia-smi  # Watch for ollama process

# AMD
rocm-smi

# Apple Silicon
# GPU usage is automatic
```

### GPU vs CPU Performance

| Model | CPU (tokens/s) | GPU (tokens/s) | Speedup |
|-------|----------------|----------------|---------|
| llama3.2:3b | ~20 | ~100 | 5x |
| llama3.1:8b | ~10 | ~80 | 8x |
| llama3.3:70b | ~2 | ~30 | 15x |

## Memory Optimization

### Reducing Memory Usage

1. **Use smaller models:**

   ```bash
   export CR_LLM_MODEL=llama3.2:3b  # Instead of 70b
   ```

2. **Reduce workers:**

   ```bash
   export CR_MAX_WORKERS=2
   ```

3. **Process sequentially:**

   ```bash
   export CR_PARALLEL=false
   ```

### Memory Requirements by Model

| Model | Minimum RAM | Recommended RAM |
|-------|-------------|-----------------|
| llama3.2:3b | 4 GB | 8 GB |
| llama3.1:8b | 8 GB | 16 GB |
| qwen2.5-coder:7b | 8 GB | 16 GB |
| codestral:22b | 24 GB | 32 GB |
| llama3.3:70b | 48 GB | 64 GB |

## Benchmarking

### Running Benchmarks

```bash
# Quick benchmark
python scripts/benchmark_llm.py --iterations 10

# Comprehensive benchmark
python scripts/benchmark_llm.py --iterations 100 --providers all
```

### Key Metrics to Track

* **Tokens per second**: Model inference speed
* **Time to first token**: Perceived latency
* **Request latency p95**: Worst-case performance
* **Success rate**: Reliability

## Performance Profiles

### Development (Fast Iteration)

```yaml
llm:
  provider: ollama
  model: llama3.2:3b  # Fastest local
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4
```

### CI/CD (Balanced)

```yaml
llm:
  provider: anthropic
  model: claude-haiku-4-20250514  # Fast + accurate
  cache_enabled: true
  cost_budget: 2.0
parallel:
  enabled: true
  max_workers: 8
```

### Production (Maximum Quality)

```yaml
llm:
  provider: anthropic
  model: claude-sonnet-4-5  # Highest quality
  confidence_threshold: 0.7
  cache_enabled: true
parallel:
  enabled: true
  max_workers: 4  # Careful rate limiting
```

## Troubleshooting Performance

### Slow Processing

1. Check metrics: `--show-metrics`
2. Review p95 latency: if it is high, the provider is the likely bottleneck
3. Check the cache hit rate: if it is low, consider warming the cache
4. Reduce workers if hitting rate limits

### High Memory Usage

1. Use a smaller model
2. Reduce the worker count
3. Process sequentially for very large PRs

### Rate Limit Errors

1. Reduce the worker count
2. Increase the retry delay
3. Consider using a different provider

## See Also

* [Parallel Processing](parallel-processing.md) - Detailed parallel configuration
* [LLM Configuration](llm-configuration.md) - Full LLM setup guide
* [Cost Estimation](cost-estimation.md) - Managing API costs
* [Troubleshooting](troubleshooting.md) - Common issues and solutions