LLM Metrics Guide

This guide explains how to use the metrics system to monitor LLM performance, costs, and reliability.

Overview

The metrics system tracks:

  • Request latency (p50, p95, p99 percentiles)

  • Success/failure rates per provider

  • Cost tracking with budget enforcement

  • Cache hit rates and savings

  • Fallback rate (LLM vs regex parser)

  • Per-provider breakdown

CLI Options

Displaying Metrics

Use --show-metrics to display aggregated metrics after processing:

pr-resolve apply 123 --llm-enabled --show-metrics

Output:

=== LLM Metrics Summary ===
Total requests: 42
Success rate: 97.6%
Latency: p50=0.234s, p95=1.456s, p99=2.103s
Total cost: $0.0523
Cache hit rate: 35.7%
Fallback rate: 4.8% (2 fallbacks)

Per-provider breakdown:
  anthropic: 35 requests, p95=1.234s, $0.0412
  ollama: 7 requests, p95=0.567s, $0.0000

Exporting Metrics

Export metrics to a file for analysis:

# Export to JSON
pr-resolve apply 123 --llm-enabled --show-metrics --metrics-output metrics.json

# Export to CSV (per-request data)
pr-resolve apply 123 --llm-enabled --show-metrics --metrics-output metrics.csv

JSON Export Format

The JSON export includes:

{
  "summary": {
    "total_requests": 42,
    "successful_requests": 41,
    "failed_requests": 1,
    "success_rate": 0.976,
    "latency_p50": 0.234,
    "latency_p95": 1.456,
    "latency_p99": 2.103,
    "latency_avg": 0.567,
    "total_cost": 0.0523,
    "cost_per_comment": 0.00124,
    "cache_hit_rate": 0.357,
    "cache_savings": 0.0187,
    "fallback_count": 2,
    "fallback_rate": 0.048
  },
  "provider_stats": {
    "anthropic": {
      "provider": "anthropic",
      "model": "claude-haiku-4-20250514",
      "total_requests": 35,
      "successful_requests": 34,
      "failed_requests": 1,
      "success_rate": 0.971,
      "total_cost": 0.0412,
      "total_tokens": 15234,
      "avg_latency": 0.678,
      "latency_p50": 0.456,
      "latency_p95": 1.234,
      "latency_p99": 1.567,
      "cache_hit_rate": 0.286,
      "error_counts": {"RateLimitError": 1}
    }
  },
  "pr_info": {
    "owner": "VirtualAgentics",
    "repo": "my-repo",
    "pr_number": 123
  }
}

CSV Export Format

The CSV export contains one row per request for detailed analysis:

Column           Description
request_id       Unique request identifier
provider         Provider name (anthropic, openai, ollama)
model            Model identifier
latency_seconds  Request duration in seconds
success          True/False
tokens_input     Input tokens consumed
tokens_output    Output tokens generated
cost             Request cost in USD
cache_hit        True if served from cache
error_type       Error class name if failed
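
For ad-hoc analysis, the CSV export can be read with the Python standard library. A minimal sketch, assuming the column names above; the 2.0 s threshold is only illustrative:

import csv
from pathlib import Path

# Load per-request rows from the CSV export
with Path("metrics.csv").open(newline="") as f:
    rows = list(csv.DictReader(f))

# Flag unusually slow requests (illustrative threshold)
for row in rows:
    if float(row["latency_seconds"]) > 2.0:
        print(f"{row['request_id']}: {row['provider']}/{row['model']} "
              f"took {float(row['latency_seconds']):.2f}s")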

Key Metrics Explained

Latency Percentiles

  • p50 (median): Typical request latency

  • p95: 95% of requests complete within this time

  • p99: 99% of requests complete within this time; a proxy for worst-case latency, excluding extreme outliers

Interpretation:

  • A large gap between p50 and p95 indicates inconsistent performance

  • High p99 may indicate timeout issues or provider instability
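
To make these definitions concrete, the sketch below computes nearest-rank percentiles over a list of per-request latencies; the values are illustrative, and the metrics system's own percentile method may differ slightly:

import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (illustrative; interpolation may differ)."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

# Illustrative per-request latencies in seconds (e.g. the CSV latency_seconds column)
latencies = [0.21, 0.23, 0.24, 0.25, 0.27, 0.30, 1.40, 2.10]
for pct in (50, 95, 99):
    print(f"p{pct} = {percentile(latencies, pct):.3f}s")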

Success Rate

Percentage of requests that completed successfully.

Targets:

  • > 99%: Excellent

  • 95-99%: Good

  • < 95%: Investigate failures

Cache Hit Rate

Percentage of requests served from prompt cache.

Impact:

  • Higher hit rate = lower costs and latency

  • Typical range: 20-50% for varied PRs

Fallback Rate

Percentage of comments where LLM parsing failed and regex fallback was used.

Impact:

  • Higher fallback rate = lower parsing accuracy

  • Target: < 10%
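
A short sketch that checks an exported summary against the success-rate and fallback-rate targets above; field names follow the JSON export format shown earlier:

import json
from pathlib import Path

summary = json.loads(Path("metrics.json").read_text())["summary"]

if summary["success_rate"] < 0.95:
    print(f"Success rate {summary['success_rate']:.1%} is below 95%: investigate failures")
if summary["fallback_rate"] > 0.10:
    print(f"Fallback rate {summary['fallback_rate']:.1%} exceeds the 10% target "
          f"({summary['fallback_count']} fallbacks)")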

Cache Savings

Estimated cost saved by cache hits, calculated as:

savings = cache_hits × avg_non_cache_cost
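
For the example summary above, 0.357 × 42 ≈ 15 cache hits, and $0.0187 in savings implies an average non-cache cost of roughly $0.00125 per request.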

Programmatic Access

Use the MetricsAggregator class for custom integrations:

from review_bot_automator.llm.metrics_aggregator import MetricsAggregator
from pathlib import Path

# Create aggregator
aggregator = MetricsAggregator()
aggregator.set_pr_info("owner", "repo", 123)

# Track requests
request_id = aggregator.start_request("anthropic", "claude-haiku-4")
# ... make LLM call ...
aggregator.end_request(
    request_id,
    success=True,
    tokens_input=100,
    tokens_output=50,
    cost=0.0012
)

# Get aggregated metrics
metrics = aggregator.get_aggregated_metrics(comments_processed=10)
print(f"Total cost: ${metrics.total_cost:.4f}")
print(f"Cost per comment: ${metrics.cost_per_comment:.4f}")

# Export
aggregator.export_json(Path("metrics.json"))
aggregator.export_csv(Path("metrics.csv"))

# Human-readable summary
print(aggregator.get_summary_report())

Analyzing Metrics

Cost Analysis

import json
from pathlib import Path

data = json.loads(Path("metrics.json").read_text())

# Cost breakdown by provider
for provider, stats in data["provider_stats"].items():
    print(f"{provider}: ${stats['total_cost']:.4f} "
          f"({stats['total_requests']} requests)")

# Cost efficiency
summary = data["summary"]
print(f"Cost per comment: ${summary['cost_per_comment']:.4f}")
print(f"Cache savings: ${summary['cache_savings']:.4f}")

Performance Analysis

# Identify slow providers
for provider, stats in data["provider_stats"].items():
    if stats["latency_p95"] > 2.0:
        print(f"Warning: {provider} p95 latency is {stats['latency_p95']:.2f}s")

# Check error patterns
for provider, stats in data["provider_stats"].items():
    if stats["error_counts"]:
        print(f"{provider} errors: {stats['error_counts']}")

Best Practices

  1. Enable metrics in CI/CD: Track performance trends over time

  2. Set cost budgets: Use CR_LLM_COST_BUDGET to enforce a spending limit and avoid cost surprises

  3. Monitor fallback rate: High rates indicate parsing issues

  4. Review error counts: Identify provider-specific problems

  5. Export for analysis: Use JSON/CSV for historical tracking

See Also