Observability & Metrics

The Madeinoz Knowledge System exports Prometheus metrics for monitoring LLM API usage, token consumption, and costs. This enables integration with existing observability infrastructure.

Overview

Metrics are exported via OpenTelemetry with a Prometheus exporter. The system tracks:

  • Token usage - Input, output, and total tokens per model
  • API costs - Real-time cost tracking in USD
  • Cache statistics - Hit rates, tokens saved, cost savings (when caching is enabled)
  • Memory decay - Lifecycle states, maintenance operations, classification performance (Feature 009)

Quick Start

Accessing Metrics

The metrics endpoint is exposed at:

Environment Endpoint
Development http://localhost:9091/metrics
Production http://localhost:9090/metrics

Basic Query

# Fetch all metrics
curl http://localhost:9091/metrics

# Filter to graphiti metrics only
curl -s http://localhost:9091/metrics | grep "^graphiti_"

Configuration

Environment Variables

Add these to your ~/.claude/.env file:

# Enable/disable metrics collection (default: true)
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_METRICS_ENABLED=true

# Enable detailed per-request logging (default: false)
# Set LOG_LEVEL=DEBUG to see metrics in logs
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_LOG_REQUESTS=false

# Enable/disable prompt caching (default: false)
# Note: requires a Gemini model via OpenRouter; see the Prompt Caching section below
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED=false

Restart After Configuration

bun run server-cli stop
bun run server-cli start

Available Metrics

Token Counters

Track cumulative token usage across all requests.

Metric Labels Description
graphiti_prompt_tokens_total model Total input/prompt tokens
graphiti_completion_tokens_total model Total output/completion tokens
graphiti_total_tokens_total model Total tokens (prompt + completion)
graphiti_prompt_tokens_all_models_total - Input tokens across all models
graphiti_completion_tokens_all_models_total - Output tokens across all models
graphiti_total_tokens_all_models_total - Total tokens across all models

Cost Counters

Track cumulative API costs in USD.

Metric Labels Description
graphiti_api_cost_total model Total API cost per model
graphiti_api_input_cost_total model Input/prompt cost per model
graphiti_api_output_cost_total model Output/completion cost per model
graphiti_api_cost_all_models_total - Total cost across all models
graphiti_api_input_cost_all_models_total - Input cost across all models
graphiti_api_output_cost_all_models_total - Output cost across all models

Token Histograms

Track per-request token distributions for percentile analysis.

Metric Bucket Range Description
graphiti_prompt_tokens_per_request 10 - 200,000 Input tokens per request
graphiti_completion_tokens_per_request 10 - 200,000 Output tokens per request
graphiti_total_tokens_per_request 10 - 200,000 Total tokens per request

Token bucket boundaries:

10, 25, 50, 100, 250, 500, 1000, 2000, 3000, 5000, 10000, 25000, 50000, 100000, 200000

Cost Histograms

Track per-request cost distributions for percentile analysis.

Metric Bucket Range Description
graphiti_api_cost_per_request $0.000005 - $5.00 Total cost per request
graphiti_api_input_cost_per_request $0.000005 - $5.00 Input cost per request
graphiti_api_output_cost_per_request $0.000005 - $5.00 Output cost per request

Cost bucket boundaries:

$0.000005, $0.00001, $0.000025, $0.00005, $0.0001, $0.00025, $0.0005, $0.001,
$0.0025, $0.005, $0.01, $0.025, $0.05, $0.1, $0.25, $0.5, $1.0, $2.5, $5.0

Bucket coverage by model tier:

Range Model Examples
$0.000005 - $0.01 Gemini Flash, GPT-4o-mini
$0.01 - $0.10 GPT-4o, Claude Sonnet
$0.10 - $1.00 GPT-4, Claude Opus
$1.00 - $5.00 Large context on expensive models

Gauge Metrics

Track current state values.

Metric Values Description
graphiti_cache_enabled 0 or 1 Whether prompt caching is enabled
graphiti_cache_hit_rate 0-100 Current session cache hit rate (%)

Cache Metrics (When Enabled)

These metrics populate when MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED=true:

Metric Labels Description
graphiti_cache_hits_total model Cache hits per model
graphiti_cache_misses_total model Cache misses per model
graphiti_cache_tokens_saved_total model Tokens saved via caching
graphiti_cache_cost_saved_total model Cost savings from caching (USD)
graphiti_cache_write_tokens_total model Tokens written to cache (cache creation)

Cache Savings Histograms:

Metric Labels Description
graphiti_cache_tokens_saved_per_request model Distribution of tokens saved per cache hit
graphiti_cache_cost_saved_per_request model Distribution of cost saved per cache hit (USD)

Prompt Caching via OpenRouter

Prompt caching is available for Gemini models via OpenRouter. The system uses explicit cache_control markers (similar to Anthropic's approach) with a minimum of 1,024 tokens. To enable caching, set MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED=true. See Prompt Caching for details.

Duration Metrics

Track LLM request latency for performance monitoring.

Metric Labels Description
graphiti_llm_request_duration_seconds model Distribution of LLM request latency

Duration bucket boundaries (seconds):

0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 30.0, 60.0, 120.0, 300.0

Bucket coverage:

Range Request Type
0.05s - 1s Cached/simple requests
1s - 10s Typical LLM calls
10s - 60s Complex reasoning, large context
60s - 300s Timeout territory

Error Metrics

Track LLM API errors for reliability monitoring.

Metric Labels Description
graphiti_llm_errors_total model, error_type Error count by model and type
graphiti_llm_errors_all_models_total - Total errors across all models

Error types:

  • rate_limit - API rate limit exceeded
  • timeout - Request timeout
  • BadRequestError, APIError, etc. - Exception class names

Error Metrics Visibility

Error counters only appear in Prometheus after at least one error has been recorded. If you don't see these metrics, it means no LLM errors have occurred.

Throughput Metrics

Track episode processing volume.

Metric Labels Description
graphiti_episodes_processed_total group_id Episodes processed per group
graphiti_episodes_processed_all_groups_total - Total episodes across all groups

Throughput Metrics Integration

Episode metrics require integration into the MCP tool handler and may not be active in all deployments.

Memory Decay Metrics (Feature 009)

The memory decay system tracks lifecycle state transitions, maintenance operations, and classification performance. These metrics use the knowledge_ prefix.

Health Endpoint

A dedicated health endpoint provides decay system status:

curl http://localhost:9090/health/decay

Returns:

{
  "status": "healthy",
  "decay_enabled": true,
  "last_maintenance": "2026-01-28T12:00:00Z",
  "metrics_endpoint": "/metrics"
}
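
A minimal polling sketch in Python (stdlib only) for wiring this endpoint into an external watchdog; the port and JSON fields follow the example above, everything else is illustrative:

# check_decay_health.py - probe the decay health endpoint (sketch)
import json
import urllib.request

def decay_healthy(base_url: str = "http://localhost:9090") -> bool:
    """Return True when the decay system reports healthy and decay is enabled."""
    with urllib.request.urlopen(f"{base_url}/health/decay", timeout=5) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload.get("status") == "healthy" and payload.get("decay_enabled", False)

if __name__ == "__main__":
    print("decay OK" if decay_healthy() else "decay DEGRADED")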

Maintenance Metrics

Track scheduled maintenance operations that recalculate decay scores and transition lifecycle states.

Metric Labels Description
knowledge_decay_maintenance_runs_total status Maintenance runs by status (success/failure)
knowledge_decay_scores_updated_total - Decay scores recalculated
knowledge_maintenance_duration_seconds - Maintenance run duration (histogram)
knowledge_memories_purged_total - Soft-deleted memories permanently removed

Duration bucket boundaries (seconds):

1, 5, 30, 60, 120, 300, 600

Performance target: Complete within 10 minutes (600 seconds).

Lifecycle Metrics

Track state transitions as memories age or are accessed.

Metric Labels Description
knowledge_lifecycle_transitions_total from_state, to_state State transitions by type
knowledge_memories_by_state state Current count per lifecycle state
knowledge_memories_total - Total memory count (excluding soft-deleted)

Lifecycle states:

State Description
ACTIVE Recently accessed, full relevance
DORMANT Not accessed for 30+ days
ARCHIVED Not accessed for 90+ days
EXPIRED Marked for deletion
SOFT_DELETED Deleted but recoverable for 90 days
PERMANENT High importance + stability, never decays
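
To make the thresholds concrete, here is a hedged sketch of how they compose; the function and its inputs are hypothetical, and only the 30/90-day boundaries and the PERMANENT/SOFT_DELETED semantics come from the table above:

# lifecycle_sketch.py - derive a lifecycle state from age and flags (illustrative only)
def lifecycle_state(days_since_access: float, permanent: bool = False,
                    expired: bool = False, soft_deleted: bool = False) -> str:
    if soft_deleted:
        return "SOFT_DELETED"   # deleted but recoverable for 90 days
    if permanent:
        return "PERMANENT"      # high importance + stability, never decays
    if expired:
        return "EXPIRED"        # marked for deletion
    if days_since_access >= 90:
        return "ARCHIVED"       # not accessed for 90+ days
    if days_since_access >= 30:
        return "DORMANT"        # not accessed for 30+ days
    return "ACTIVE"             # recently accessed, full relevance

print(lifecycle_state(45))      # -> DORMANT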

Classification Metrics

Track LLM-based importance/stability classification.

Metric Labels Description
knowledge_classification_requests_total status Classification attempts (success/failure/fallback)
knowledge_classification_latency_seconds - LLM response time (histogram)

Latency bucket boundaries (seconds):

0.1, 0.5, 1, 2, 5

Classification statuses:

Status Description
success LLM classified successfully
failure LLM call failed, used defaults
fallback LLM unavailable, used defaults
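
A short sketch of how the three statuses could be assigned around an LLM call; classify_fn and the MEDIUM defaults are placeholders, not the system's actual code:

# classification_status_sketch.py - label each classification attempt (illustrative)
DEFAULTS = {"importance": 3, "stability": 3}   # assumed MEDIUM defaults

def classify_with_fallback(classify_fn, text: str, llm_available: bool = True):
    """Return (scores, status) where status is success, failure, or fallback."""
    if not llm_available:
        return DEFAULTS, "fallback"    # LLM unavailable, used defaults
    try:
        return classify_fn(text), "success"
    except Exception:
        return DEFAULTS, "failure"     # LLM call failed, used defaults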

Aggregate Metrics

Track average scores across the knowledge graph.

Metric Description
knowledge_decay_score_avg Average decay score (0.0-1.0)
knowledge_importance_avg Average importance (1-5)
knowledge_stability_avg Average stability (1-5)

Search Metrics

Track weighted search operations that boost by relevance.

Metric Labels Description
knowledge_weighted_searches_total - Weighted search operations
knowledge_search_weighted_latency_seconds - Scoring overhead (histogram)

Memory Access Pattern Metrics (Feature 015)

Track memory access patterns during search operations to validate decay scoring effectiveness.

Metric Labels Description
knowledge_access_by_importance_total level Cumulative accesses by importance level (LOW/MEDIUM/HIGH/CRITICAL)
knowledge_access_by_state_total state Cumulative accesses by lifecycle state (ACTIVE/DORMANT/ARCHIVED/PERMANENT)
knowledge_days_since_last_access - Histogram of days since memory was last accessed
knowledge_reactivations_total from_state Memories reactivated from DORMANT/ARCHIVED to ACTIVE

Importance level mapping:

Score Label Description
1-2 LOW Lower priority memories
3 MEDIUM Standard importance (default)
4 HIGH Important memories
5 CRITICAL Core/foundational memories

Days histogram bucket boundaries:

1, 7, 30, 90, 180, 365, 730, 1095

Bucket Description
1 1 day ago
7 1 week ago
30 1 month ago
90 3 months ago
180 6 months (half-life threshold)
365 1 year ago
730 2 years ago
1095 3+ years ago

Metric Recording Behavior

Access pattern metrics are recorded during search_memory_nodes and search_memory_facts operations. The histogram only records when nodes have a last_accessed_at attribute set.
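
A hedged sketch of the label mapping these counters rely on; increment_counter stands in for the real metrics exporter and the node shape is illustrative:

# access_labels_sketch.py - map importance scores (1-5) to counter labels (illustrative)
def importance_level(score: int) -> str:
    if score <= 2:
        return "LOW"
    if score == 3:
        return "MEDIUM"
    if score == 4:
        return "HIGH"
    return "CRITICAL"   # score 5: core/foundational memories

def record_access(node: dict, increment_counter) -> None:
    """Increment the by-importance and by-state counters for one search hit."""
    increment_counter("knowledge_access_by_importance_total",
                      {"level": importance_level(node.get("importance", 3))})
    increment_counter("knowledge_access_by_state_total",
                      {"state": node.get("lifecycle_state", "ACTIVE")})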

Access Pattern PromQL Queries:

# Access rate by importance (per second)
sum(rate(knowledge_access_by_importance_total[5m])) by (level)

# Access distribution by state (current values)
knowledge_access_by_state_total

# Reactivation rate (last hour)
increase(knowledge_reactivations_total[1h])

# Age distribution heatmap
sum(rate(knowledge_days_since_last_access_bucket[5m])) by (le)

# Access vs decay correlation (dual-axis)
# Left axis: rate(knowledge_access_by_importance_total[5m])
# Right axis: knowledge_decay_score_avg

Example PromQL Queries

Maintenance success rate (last 24 hours):

sum(increase(knowledge_decay_maintenance_runs_total{status="success"}[24h]))
/
sum(increase(knowledge_decay_maintenance_runs_total[24h]))

State distribution:

knowledge_memories_by_state

Classification fallback rate:

sum(rate(knowledge_classification_requests_total{status="fallback"}[5m]))
/
sum(rate(knowledge_classification_requests_total[5m]))

Lifecycle transitions per hour:

sum by (from_state, to_state) (increase(knowledge_lifecycle_transitions_total[1h]))

P95 classification latency:

histogram_quantile(0.95, rate(knowledge_classification_latency_seconds_bucket[5m]))

Alert Rules

Alert rules are defined in config/monitoring/prometheus/alerts/knowledge.yml:

Alert Condition Severity
MaintenanceTimeout Duration > 10 minutes warning
MaintenanceFailed Any failure in last hour critical
ClassificationDegraded Fallback rate > 20% warning
ExcessiveExpiration > 100 expired/hour warning
SoftDeleteBacklog > 1000 awaiting purge warning

Prometheus Integration

Metrics Naming Conventions

The system follows OpenTelemetry Semantic Conventions for metric naming:

Convention Implementation
Units in metadata Units specified via unit field in Grafana, not in metric names
No unit suffixes Metrics use _total for counters, not _cost_total_usd or _tokens_total_count
Descriptive base Metric names describe what is measured (e.g., api_cost, total_tokens)
Counter suffix All cumulative counters use _total suffix per OpenTelemetry convention

Examples of correct naming:

Metric Correct Incorrect
API cost graphiti_api_cost_total graphiti_api_cost_USD_total
Cache hit rate graphiti_cache_hit_rate graphiti_cache_hit_rate_percent
Tokens saved graphiti_cache_tokens_saved_total graphiti_cache_tokens_saved_count

Dashboard unit configuration:

Instead of embedding units in metric names, Grafana dashboards use the unit field to display appropriate units:

  • currencyUSD - Cost metrics display in USD
  • short - Count metrics display as plain numbers
  • percent - Rate metrics display as percentages
  • seconds - Duration metrics display in seconds
  • locale - Token count display with locale formatting

Handling Service Restarts

Counter metrics reset to zero when the service restarts, which causes rate() calculations to show brief gaps or spikes in visualizations. This is expected Prometheus behavior for counter resets.

Current dashboard behavior:

  • rate() queries will briefly show gaps during counter resets
  • Grafana automatically interpolates across short gaps
  • For longer gaps, consider widening the rate() lookback window

Note: Time-over-time functions like max_over_time() cannot wrap rate() results in PromQL. They must wrap range vector selectors directly (e.g., max_over_time(metric[1h])). For rate-based metrics, accepting brief gaps during restarts is the standard approach.

Scrape Configuration

Add to your prometheus.yml:

scrape_configs:
  - job_name: 'madeinoz-knowledge'
    static_configs:
      - targets: ['localhost:9091']  # dev port
    scrape_interval: 15s

Example PromQL Queries

Token usage in last hour:

increase(graphiti_total_tokens_all_models_total[1h])

Tokens per model:

sum by (model) (increase(graphiti_total_tokens_total[1h]))

Total cost in last 24 hours:

increase(graphiti_api_cost_all_models_total[24h])

Cost per model:

sum by (model) (increase(graphiti_api_cost_total[24h]))

P95 cost per request:

histogram_quantile(0.95, rate(graphiti_api_cost_per_request_bucket[5m]))

P99 tokens per request:

histogram_quantile(0.99, rate(graphiti_total_tokens_per_request_bucket[5m]))

Median (P50) cost per request:

histogram_quantile(0.50, rate(graphiti_api_cost_per_request_bucket[5m]))

P95 request duration:

histogram_quantile(0.95, rate(graphiti_llm_request_duration_seconds_bucket[5m]))

Average request duration:

rate(graphiti_llm_request_duration_seconds_sum[5m]) / rate(graphiti_llm_request_duration_seconds_count[5m])

Error rate by model:

sum by (model) (rate(graphiti_llm_errors_total[5m]))

Understanding Histogram Buckets

Prometheus histograms are cumulative. Each bucket shows the count of observations less than or equal to that boundary.

Example output:

graphiti_api_cost_per_request_bucket{le="0.0001"} 2.0
graphiti_api_cost_per_request_bucket{le="0.00025"} 5.0
graphiti_api_cost_per_request_bucket{le="0.0005"} 5.0

Interpretation:

  • 2 requests cost ≤ $0.0001
  • 3 more requests cost between $0.0001 and $0.00025
  • 0 requests cost more than $0.00025 (count stays at 5)
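
The same arithmetic in code - a small sketch that de-cumulates bucket counts from a scraped metrics page; it uses the development endpoint and ignores any labels other than le:

# bucket_widths_sketch.py - convert cumulative histogram buckets to per-bucket counts
import re
import urllib.request

def per_bucket_counts(metrics_text: str, metric: str) -> list:
    pattern = re.compile(rf'^{metric}_bucket{{.*?le="([^"]+)".*?}} ([0-9.eE+-]+)$', re.M)
    cumulative = [(le, float(count)) for le, count in pattern.findall(metrics_text)]
    out, previous = [], 0.0
    for le, count in cumulative:
        out.append((le, count - previous))   # observations falling in this bucket only
        previous = count
    return out

text = urllib.request.urlopen("http://localhost:9091/metrics").read().decode()
for le, n in per_bucket_counts(text, "graphiti_api_cost_per_request"):
    print(f"<= {le}: {n:.0f} requests")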

Grafana Dashboard

The system includes a pre-configured Grafana dashboard with comprehensive monitoring panels.

Quick Start (Development)

The development environment includes Prometheus and Grafana by default:

# Start dev environment with monitoring
docker compose -f src/skills/server/docker-compose-neo4j-dev.yml up -d

# Access points:
# - Grafana: http://localhost:3002 (login: admin/admin)
# - Prometheus UI: http://localhost:9092

Production Setup (Optional)

Production monitoring uses Docker Compose profiles and is disabled by default:

# Start with monitoring enabled
docker compose -f src/skills/server/docker-compose-neo4j.yml --profile monitoring up -d

# Start without monitoring (default)
docker compose -f src/skills/server/docker-compose-neo4j.yml up -d

# Access points (when enabled):
# - Grafana: http://localhost:3001 (login: admin/admin or custom password)
# - Prometheus UI: http://localhost:9092

Custom Grafana Password

Set GRAFANA_ADMIN_PASSWORD environment variable for a secure password:

export GRAFANA_ADMIN_PASSWORD=your-secure-password
docker compose -f src/skills/server/docker-compose-neo4j.yml --profile monitoring up -d

Dashboard Panels

The pre-configured dashboard includes these sections:

Overview Row:

  • Total API Cost (USD)
  • Total Tokens Used
  • Cache Status (Enabled/Disabled)
  • Cache Hit Rate (%)
  • Total Errors

Token Usage Row:

  • Token Usage Rate (by Model) - Time series
  • Prompt vs Completion Tokens - Stacked area

Cost Tracking Row:

  • Cost Rate ($/hour by Model) - Time series
  • Cost by Model - Pie chart
  • Input vs Output Cost - Donut chart

Request Duration Row:

  • Request Duration Percentiles (P50, P95, P99) - Time series
  • Average Request Duration (by Model) - Bar chart

Cache Performance Row:

  • Cache Hit Rate Over Time - Time series
  • Cache Cost Savings Rate - Time series
  • Cache Hits vs Misses - Stacked area

Errors Row:

  • Error Rate (by Model & Type) - Stacked bars
  • Errors by Type - Pie chart

Port Assignments

Environment Service Port Notes
Development Grafana 3002 Neo4j backend
Development Grafana 3003 FalkorDB backend (avoids UI conflict)
Development Prometheus UI 9092 Query interface
Production Grafana 3001 Neo4j backend
Production Grafana 3002 FalkorDB backend
Production Prometheus UI 9092 Query interface

Available Dashboards

The system includes multiple pre-configured Grafana dashboards:

Dashboard UID Purpose
Graph Health graph-health-dashboard Entity states, episodes, operation rates, error tracking
Memory Decay memory-decay-dashboard Lifecycle transitions, maintenance operations, classification metrics
Memory Access Patterns memory-access-dashboard Access distribution by importance/state, reactivation tracking, decay correlation
Knowledge System madeinoz-knowledge Token usage, cost tracking, request duration, cache performance
Prompt Cache Effectiveness prompt-cache-effectiveness Cache ROI, hit/miss patterns, write overhead, per-model comparison
Queue Processing Metrics queue-metrics Queue depth, latency, consumer health, throughput, errors

Prompt Cache Effectiveness Dashboard

Purpose: Dedicated monitoring for Gemini prompt caching performance and ROI

Access: http://localhost:3002/d/prompt-cache-effectiveness (dev)

Panels:

Panel Metric Description
Total Cost Savings graphiti_cache_cost_saved_all_models_total USD saved from caching (uses time-over-time for restart resilience)
Hit Rate graphiti_cache_hit_rate Current cache hit percentage (gauge: >50% green, 20-50% yellow, <20% red)
Tokens Saved graphiti_cache_tokens_saved_all_models_total Total tokens saved from caching
Tokens Written graphiti_cache_write_tokens_all_models_total Tokens consumed to create cache entries (overhead)
Savings Rate rate(...[1h]) * 3600 Cost savings per hour trend
Hit Rate Trend graphiti_cache_hit_rate Hit rate over time for anomaly detection
Hits vs Misses Dual time series Comparison of cache hits vs misses rate
Tokens Saved Distribution graphiti_cache_tokens_saved_per_request_bucket Heatmap showing cache hit size distribution
Per-Model Performance Table Side-by-side comparison of caching by LLM model

Key Features:

  • Time-over-time queries (max_over_time()[1h]) handle service restarts without data gaps
  • Color-coded thresholds for quick health assessment
  • 30-second auto-refresh (user-configurable)
  • Single 1080p screen layout (no scrolling required)

Troubleshooting Dashboard:

  1. No data showing: Verify cache is enabled (curl http://localhost:9091/metrics | grep cache_enabled)
  2. Gaps in charts: Check for service restarts - time-over-time functions should smooth gaps
  3. Zero hit rate: Normal for new deployments; requires repeated similar prompts to build cache

Memory Access Patterns Dashboard

Purpose: Validate decay scoring effectiveness by visualizing memory access patterns across importance levels, lifecycle states, and time periods

Access: http://localhost:3002/d/memory-access-dashboard (dev)

Panels:

Panel Metric Description
Total Access Count knowledge_memory_access_total Cumulative memory accesses (uses max_over_time for restart resilience)
Access Rate rate(...[5m]) Current memory accesses per second
Reactivations (Dormant) knowledge_reactivations_total{from_state="DORMANT"} Memories revived from dormant state (thresholds: green=0, yellow=5, red=20)
Reactivations (Archived) knowledge_reactivations_total{from_state="ARCHIVED"} Memories revived from archived state (thresholds: green=0, yellow=3, red=10)
Access by Importance knowledge_access_by_importance_total Pie chart showing access distribution by CRITICAL/HIGH/MEDIUM/LOW
Access by State knowledge_access_by_state_total Pie chart showing access distribution by ACTIVE/DORMANT/ARCHIVED/PERMANENT
Access Rate Over Time rate(knowledge_memory_access_total[5m]) Time series trend of access velocity
Age Distribution knowledge_days_since_last_access_bucket Heatmap showing when memories were last accessed
Access vs Decay Correlation Dual-axis Compares access rate (left) with average decay score (right)

Key Features:

  • Time-over-time queries (max_over_time()[1h]) handle service restarts without data gaps
  • Dual-axis correlation panel for validating decay effectiveness
  • Color-coded reactivation thresholds for quick anomaly detection
  • 30-second auto-refresh with 24-hour default time range

Common Tasks:

  1. Validate Decay Scoring: Check if CRITICAL/HIGH importance memories have proportionally more accesses
  2. Tune Decay Parameters: Use age distribution heatmap to identify if 180-day half-life is appropriate
  3. Investigate Reactivations: High reactivation counts suggest decay is too aggressive

Customizing Dashboards

Dashboard configurations are stored at:

config/monitoring/grafana/dashboards/
├── graph-health-dashboard.json
├── memory-access-dashboard.json
├── memory-decay-dashboard.json
├── madeinoz-knowledge.json
└── prompt-cache-effectiveness.json

To customize:

  1. Open Grafana and make changes via the UI
  2. Export the dashboard JSON (Share > Export > Save to file)
  3. Replace the provisioned dashboard file
  4. Restart Grafana to apply changes

Manual Panel Examples

If building a custom dashboard, use these PromQL queries:

Usage & Cost:

  1. Token Usage Rate - rate(graphiti_total_tokens_all_models_total[5m])
  2. Cost Rate ($/hour) - rate(graphiti_api_cost_all_models_total[5m]) * 3600
  3. Request Cost Distribution - Histogram panel with graphiti_api_cost_per_request_bucket
  4. Token Usage by Model - sum by (model) (rate(graphiti_total_tokens_total[5m]))

Performance:

  1. Request Duration P95 - histogram_quantile(0.95, rate(graphiti_llm_request_duration_seconds_bucket[5m]))
  2. Request Duration Heatmap - Heatmap panel with graphiti_llm_request_duration_seconds_bucket
  3. Error Rate - sum(rate(graphiti_llm_errors_total[5m]))

Caching (when enabled):

  1. Cache Hit Rate - graphiti_cache_hit_rate (gauge metric)
  2. Cost Savings Rate - rate(graphiti_cache_cost_saved_all_models_total[5m]) * 3600
  3. Tokens Saved - increase(graphiti_cache_tokens_saved_all_models_total[1h])

Troubleshooting

Metrics Not Appearing

  1. Check metrics are enabled:

     grep MADEINOZ_KNOWLEDGE_PROMPT_CACHE_METRICS_ENABLED ~/.claude/.env

  2. Verify the endpoint is accessible (or run the Python check after this list):

     curl http://localhost:9091/metrics

  3. Check container logs:

     docker logs madeinoz-knowledge-graph-mcp-dev 2>&1 | grep -i metric
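
If curl is not convenient, the same check as a short Python sketch (the port assumes the development endpoint):

# metrics_smoke_check.py - confirm the endpoint answers and exposes graphiti_ metrics
import urllib.request

text = urllib.request.urlopen("http://localhost:9091/metrics", timeout=5).read().decode()
series = [line for line in text.splitlines() if line.startswith("graphiti_")]
print(f"endpoint reachable, {len(series)} graphiti_ sample lines exported")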

Counters Not Incrementing

Counter and histogram metrics only appear after LLM API calls are made. Metrics populate when:

  • add_memory tool is used (triggers entity extraction)
  • Any operation requiring LLM inference

Search operations (search_memory_facts, search_memory_nodes) use embeddings only and do not increment LLM metrics.

Debug Logging

Enable detailed per-request logging:

# In ~/.claude/.env
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_LOG_REQUESTS=true
LOG_LEVEL=DEBUG

This shows per-request metrics in container logs:

📊 Metrics: prompt=1234, completion=567, cost=$0.000089, input_cost=$0.000062, output_cost=$0.000027

Prompt Caching (Gemini via OpenRouter)

Prompt caching reduces API costs by up to 15-20% by reusing previously processed prompt content. The system adds explicit cache_control markers to requests when enabled, allowing OpenRouter to serve cached content at reduced cost (0.25x normal price).

Note: Prompt caching is disabled by default and must be explicitly enabled via configuration.

Developer Documentation

For implementation details including architecture diagrams, code flow, and metrics internals, see the Cache Implementation Guide.

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                    First Request (Cache Miss)                    │
├─────────────────────────────────────────────────────────────────┤
│  System Prompt (800 tokens) ──► LLM processes ──► Cache stored  │
│  User Message (200 tokens)  ──► LLM processes ──► Response      │
│                                                                  │
│  Cost: Full price for 1000 tokens                               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   Second Request (Cache Hit)                     │
├─────────────────────────────────────────────────────────────────┤
│  System Prompt (800 tokens) ──► Retrieved from cache (0.25x)    │
│  User Message (200 tokens)  ──► LLM processes ──► Response      │
│                                                                  │
│  Cost: 0.25x for cached 800 + full for 200 ≈ 60% savings        │
└─────────────────────────────────────────────────────────────────┘

How Caching Works via OpenRouter

The Madeinoz Knowledge System implements explicit prompt caching via OpenRouter using cache_control markers (similar to Anthropic's approach):

Aspect Description
Implementation Explicit cache_control markers added to last message part
Format Multipart messages with content parts array
Cache lifecycle Managed by OpenRouter automatically
Minimum tokens 1,024 tokens for caching to be applied
Default state Disabled - must be explicitly enabled

Recommended Model: google/gemini-2.0-flash-001 via OpenRouter

This implementation uses the CachingLLMClient wrapper, which (see the sketch after this list):

  1. Checks if caching is enabled (environment variable)
  2. Verifies the model is Gemini via OpenRouter
  3. Converts messages to multipart format
  4. Adds cache_control marker to the last content part
  5. Extracts cache metrics from responses (cache_read_tokens, cache_write_tokens)
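
A minimal sketch of steps 3-4; the message shapes are illustrative, and the {"type": "ephemeral"} value follows the Anthropic-style cache_control convention referenced above, so treat the exact marker as an assumption:

# cache_marker_sketch.py - add an explicit cache_control marker to the last content part
def mark_for_caching(messages: list) -> list:
    """Convert string content to multipart parts and tag the last part for caching."""
    formatted = []
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):                    # step 3: multipart format
            content = [{"type": "text", "text": content}]
        formatted.append({**msg, "content": content})
    # step 4: cache_control marker on the last content part of the last message
    formatted[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return formatted

messages = [
    {"role": "system", "content": "You are an entity extraction assistant..."},
    {"role": "user", "content": "Extract entities from this episode..."},
]
print(mark_for_caching(messages)[-1]["content"][-1])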

Configuration

# Enable prompt caching (disabled by default)
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED=true

# Enable metrics collection for cache statistics (recommended)
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_METRICS_ENABLED=true

# Enable verbose caching logs for debugging (optional)
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_LOG_REQUESTS=true

# Recommended model for caching
MADEINOZ_KNOWLEDGE_MODEL_NAME=google/gemini-2.0-flash-001

Cache Pricing

Cached tokens are billed at 0.25x the normal input token price:

Model Input Price Cached Price Savings
Gemini 2.5 Flash $0.15/1M $0.0375/1M 75%
Gemini 2.5 Pro $1.25/1M $0.3125/1M 75%
Gemini 2.0 Flash $0.10/1M $0.025/1M 75%
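
A worked example of the arithmetic behind that table, using the Gemini 2.0 Flash row; the 80% cached fraction is purely illustrative:

# cache_pricing_sketch.py - effective input cost when part of the prompt is served from cache
INPUT_PRICE = 0.10 / 1_000_000    # $0.10 per 1M input tokens (Gemini 2.0 Flash)
CACHED_PRICE = 0.025 / 1_000_000  # 0.25x of the input price

def input_cost(prompt_tokens: int, cached_fraction: float) -> float:
    cached = prompt_tokens * cached_fraction
    fresh = prompt_tokens - cached
    return cached * CACHED_PRICE + fresh * INPUT_PRICE

full = input_cost(10_000, 0.0)    # no caching
warm = input_cost(10_000, 0.8)    # 80% of the prompt served from cache
print(f"${full:.6f} -> ${warm:.6f} ({1 - warm / full:.0%} saved on input)")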

Cache Metrics to Monitor

Metric Purpose
graphiti_cache_hit_rate Current session hit rate (%)
graphiti_cache_tokens_saved_total Cumulative tokens served from cache
graphiti_cache_cost_saved_total Cumulative USD saved
graphiti_cache_hits_total / graphiti_cache_misses_total Hit/miss ratio

Example PromQL Queries

Cache hit rate over time:

graphiti_cache_hit_rate

Cost savings rate ($/hour):

rate(graphiti_cache_cost_saved_all_models_total[1h]) * 3600

Tokens saved in last hour:

increase(graphiti_cache_tokens_saved_all_models_total[1h])

Cache effectiveness by model:

sum by (model) (graphiti_cache_hits_total) / (sum by (model) (graphiti_cache_hits_total) + sum by (model) (graphiti_cache_misses_total)) * 100

Troubleshooting Caching

Cache Hits Are Zero

Possible causes:

  1. Model doesn't support caching - Only Gemini models support caching
  2. Token count below threshold - Gemini 2.0 requires 4,096+ tokens (use Gemini 2.5 instead)
  3. Caching not enabled - Set MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED=true
  4. Different prompts - Cache keys are content-based; slight variations = cache miss

Debug steps:

# Check caching is enabled
curl -s http://localhost:9091/metrics | grep graphiti_cache_enabled

# Check for any cache activity
curl -s http://localhost:9091/metrics | grep graphiti_cache

# Enable verbose logging
MADEINOZ_KNOWLEDGE_PROMPT_CACHE_LOG_REQUESTS=true

Low Cache Hit Rate

Expected behavior:

  • First request for any unique prompt = cache miss
  • Subsequent identical prompts = cache hit
  • Entity extraction uses similar system prompts = good cache reuse

Typical hit rates:

Scenario Expected Hit Rate
Single add_memory call 0% (first request)
Bulk import (10+ episodes) 30-50%
Steady-state operation 40-60%

Implementation Details

The caching system consists of three components:

  1. caching_wrapper.py - Wraps OpenAI client methods
     • Adds timing for duration metrics
     • Catches errors for error metrics
     • Extracts cache statistics from responses

  2. message_formatter.py - Formats messages for caching
     • Adds cache_control markers for explicit caching
     • Detects Gemini model families

  3. metrics_exporter.py - Exports to Prometheus
     • Counters for totals
     • Histograms for distributions
     • Gauges for current state

Files modified (in docker/patches/):

docker/patches/
├── caching_wrapper.py      # Client wrapper with timing/error tracking
├── caching_llm_client.py   # LLM client routing
├── message_formatter.py    # Cache marker formatting
├── cache_metrics.py        # Metrics calculation
├── session_metrics.py      # Session-level aggregation
└── metrics_exporter.py     # Prometheus export

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     OpenRouter API                               │
│  (returns: usage, cost, cost_details, prompt_tokens_details)    │
│  (Gemini: cached_tokens in prompt_tokens_details)               │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   caching_wrapper.py                             │
│  - Wraps responses.parse() and chat.completions.create()        │
│  - Adds timing (record_request_duration)                         │
│  - Catches errors (record_error)                                 │
│  - Extracts cache metrics from response                          │
│  - Records cache hits/misses and savings                         │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                   metrics_exporter.py                            │
│  - OpenTelemetry MeterProvider with custom Views                │
│  - Prometheus exporter on port 9090/9091                        │
│  - Counters: tokens, cost, cache hits/misses, errors            │
│  - Histograms: tokens/request, cost/request, duration           │
│  - Gauges: cache_enabled, cache_hit_rate                        │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│              Prometheus / Grafana                                │
│  - Scrape /metrics endpoint                                      │
│  - Visualize with dashboards                                     │
│  - Alert on thresholds (cost, errors, latency)                  │
└─────────────────────────────────────────────────────────────────┘

Queue Metrics (Feature 017)

The queue processing metrics provide observability for message queue operations, tracking throughput, latency, consumer health, and failure patterns. These metrics use the messaging_ prefix.

Overview

Queue metrics monitor the full lifecycle of message processing:

  • Enqueue - Messages added to queue
  • Wait - Time spent in queue before processing
  • Processing - Time to process each message
  • Completion - Success or failure with error categorization
  • Consumer Health - Lag, saturation, active consumer count

Available Metrics

Throughput Counters

Track cumulative message counts.

Metric Labels Description
messaging_messages_processed_total queue_name, status Total messages processed (success/failure)
messaging_messages_failed_total queue_name, error_type Total failures by error category
messaging_retries_total queue_name Total retry attempts

Error categories (coarse-grained to prevent high cardinality):

Category Example Errors
ConnectionError ConnectionError, ConnectionRefusedError, OperationalError
ValidationError ValidationError, ValueError, PydanticException
TimeoutError TimeoutError, AsyncTimeoutError
RateLimitError RateLimitError, RateLimitExceededError
UnknownError Any uncategorized error
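
A hedged sketch of this coarse-grained categorization; the mapping mirrors the table above, but the exact class names checked may differ from the real implementation:

# error_category_sketch.py - collapse exception types into low-cardinality labels
def categorize_error(exc: BaseException) -> str:
    name = type(exc).__name__
    if "Connection" in name or name == "OperationalError":
        return "ConnectionError"
    if "Validation" in name or isinstance(exc, ValueError):
        return "ValidationError"
    if "Timeout" in name or isinstance(exc, TimeoutError):
        return "TimeoutError"
    if "RateLimit" in name:
        return "RateLimitError"
    return "UnknownError"       # anything uncategorized stays in one bucket

print(categorize_error(ConnectionRefusedError()))   # -> ConnectionError
print(categorize_error(RuntimeError("boom")))       # -> UnknownError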

Queue Depth Gauge

Track current queue size (messages waiting).

Metric Labels Description
messaging_queue_depth queue_name, priority Current number of messages waiting

Consumer Health Gauges

Track consumer pool state and utilization.

Metric Labels Description
messaging_active_consumers queue_name Number of active consumers
messaging_consumer_saturation queue_name Consumer utilization (0-1, 1=fully saturated)
messaging_consumer_lag_seconds queue_name Time to catch up (seconds)
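
A hedged sketch of how these gauge values could be derived before being set; the saturation formula is an assumption, while the lag formula matches the drain-time query shown later in this section:

# consumer_health_sketch.py - derive saturation and lag values for the gauges (illustrative)
def consumer_saturation(busy_consumers: int, active_consumers: int) -> float:
    """Fraction of consumers currently occupied (0-1, 1 = fully saturated)."""
    return busy_consumers / active_consumers if active_consumers else 1.0

def consumer_lag_seconds(queue_depth: int, processed_per_second: float) -> float:
    """Estimated time to drain the current backlog at the observed processing rate."""
    return queue_depth / processed_per_second if processed_per_second else float("inf")

print(consumer_saturation(3, 4))         # 0.75
print(consumer_lag_seconds(120, 2.0))    # 60.0 seconds to catch up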

Latency Histograms

Track processing time distributions for percentile analysis.

Metric Bucket Range Description
messaging_processing_duration_seconds 5ms - 10s Time to process a message
messaging_wait_time_seconds 5ms - 10s Time spent in queue before processing
messaging_end_to_end_latency_seconds 5ms - 10s Total time from enqueue to completion

Duration bucket boundaries (seconds):

0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10

Range Processing Type
5-50ms Fast processing (simple operations)
50-250ms Normal processing
250ms-1s Slow processing
1-10s Very slow processing (possible issues)

Example PromQL Queries

Queue depth trend:

messaging_queue_depth

Processing throughput (messages/second):

sum(rate(messaging_messages_processed_total{status="success"}[5m]))

Error rate (percentage):

sum(rate(messaging_messages_failed_total[5m]))
/
sum(rate(messaging_messages_processed_total[5m])) * 100

P95 processing latency:

histogram_quantile(0.95, sum(rate(messaging_processing_duration_seconds_bucket[5m])) by (le))

P95 wait time (queue delay):

histogram_quantile(0.95, sum(rate(messaging_wait_time_seconds_bucket[5m])) by (le))

P95 end-to-end latency:

histogram_quantile(0.95, sum(rate(messaging_end_to_end_latency_seconds_bucket[5m])) by (le))

Consumer saturation check:

messaging_consumer_saturation
# Alert if > 0.85 (85% utilization)

Time to drain queue (at current rate):

messaging_queue_depth / sum(rate(messaging_messages_processed_total{status="success"}[5m]))

Retry rate (retries per message):

sum(rate(messaging_retries_total[5m])) / sum(rate(messaging_messages_processed_total[5m]))

Queue Metrics Dashboard

Access: http://localhost:3002/d/queue-metrics (dev)

A 12-panel Grafana dashboard provides comprehensive queue monitoring:

Overview Row (4 panels):

Panel Metric Thresholds
Queue Depth messaging_queue_depth green=0, yellow=10, red=50
Consumer Saturation messaging_consumer_saturation green=0, yellow=0.5, red=0.85
Consumer Lag messaging_consumer_lag_seconds green=0, yellow=30s, red=300s
Active Consumers messaging_active_consumers green=1+, yellow=1, red=0

Time Series Rows:

  • Queue Depth Over Time - Trend analysis
  • Processing Latency (P50/P95/P99) - Percentile analysis
  • Wait Time (P50/P95) - Queue delay analysis
  • End-to-End Latency (P50/P95) - Full journey latency
  • Throughput (Success/Failure Rate) - Ops/second
  • Error Rate (%) - Gauge panel
  • Failures by Error Type - Pie chart
  • Retry Rate - Retries/second trend

Troubleshooting Queue Issues

Growing Queue Backlog

Symptoms:

  • messaging_queue_depth increasing over time
  • messaging_consumer_lag_seconds increasing
  • messaging_consumer_saturation near 1.0

Diagnosis:

# Check if production rate exceeds consumption rate
sum(rate(messaging_messages_processed_total[5m])) < sum(rate(messages_enqueued[5m]))

# Check processing latency trend
histogram_quantile(0.95, sum(rate(messaging_processing_duration_seconds_bucket[5m])) by (le))

Solutions:

  1. Scale consumers (increase messaging_active_consumers)
  2. Optimize processing (reduce latency)
  3. Implement priority queueing
  4. Add rate limiting at enqueue

High Consumer Lag

Symptoms:

  • messaging_consumer_lag_seconds > 300 (5 minutes)
  • Queue depth stable but lag increasing

Diagnosis:

# Time to catch up at current rate
messaging_queue_depth / sum(rate(messaging_messages_processed_total{status="success"}[5m]))

Solutions:

  1. Increase consumer count
  2. Reduce processing time per message
  3. Implement batch processing
  4. Scale horizontally (multiple queue instances)

Consumer Saturation

Symptoms:

  • messaging_consumer_saturation > 0.85
  • Wait times increasing

Diagnosis:

# Check wait time trend
histogram_quantile(0.95, sum(rate(messaging_wait_time_seconds_bucket[5m])) by (le))

Solutions:

  1. Add more consumers
  2. Increase consumer parallelism
  3. Implement async processing

High Error Rate

Symptoms:

  • messaging_messages_failed_total increasing
  • Error rate gauge > 5%

Diagnosis:

# Error breakdown by type
sum by (error_type) (messaging_messages_failed_total)

Solutions:

  1. Check error types in failures panel
  2. Fix common error patterns
  3. Implement circuit breaker for failing services
  4. Add retry with exponential backoff

High Retry Rate

Symptoms:

  • messaging_retries_total increasing rapidly
  • Retry rate > 0.1 retries/message

Diagnosis:

# Retries per successful message
sum(rate(messaging_retries_total[5m])) / sum(rate(messaging_messages_processed_total{status="success"}[5m]))

Solutions:

  1. Identify root cause of failures
  2. Implement dead letter queue
  3. Add backoff strategy
  4. Limit max retry attempts

Implementation

The queue metrics are implemented in docker/patches/metrics_exporter.py:

class QueueMetricsExporter:
    """Manages queue processing metrics."""

    def record_enqueue(self, queue_name, priority): ...
    def record_dequeue(self, queue_name): ...
    def record_processing_complete(self, queue_name, duration, success, error_type): ...
    def record_retry(self, queue_name): ...
    def update_queue_depth(self, queue_name, depth, priority): ...
    def update_consumer_metrics(self, queue_name, active, saturation, lag_seconds): ...

Thread safety: All state modifications use locks.

Graceful degradation: Methods do nothing if metrics are disabled.
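
A minimal sketch of both behaviors; it is simplified, and the real exporter records OpenTelemetry instruments rather than a plain dict:

# graceful_recorder_sketch.py - no-op when disabled, lock-protected when enabled
import threading
from collections import defaultdict

class QueueMetricsRecorder:
    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self._lock = threading.Lock()
        self._counters = defaultdict(float)

    def record_retry(self, queue_name: str) -> None:
        if not self.enabled:      # graceful degradation: do nothing when disabled
            return
        with self._lock:          # thread safety: state changes happen under a lock
            self._counters[("messaging_retries_total", queue_name)] += 1

recorder = QueueMetricsRecorder(enabled=True)
recorder.record_retry("episode_queue")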