
Cache Implementation Guide

This document describes how the Gemini prompt caching system works internally. For configuration and usage, see Observability & Metrics.

Architecture Overview

Visual Architecture Diagrams

Request Flow Diagram

[Diagram: Cache Request Flow]

LLM requests flow through the CachingLLMClient wrapper, which checks three conditions before applying cache_control markers and forwarding to OpenRouter.

Component Architecture

[Diagram: Cache Component Architecture]

The core modules work together: environment configuration controls behavior, the caching wrapper wraps the LLM client, and metrics are exported to Prometheus.

Decision Tree: When is Caching Applied?

[Diagram: Caching Decision Tree]

Three checks must pass: caching enabled, Gemini model, and sufficient token count. If any check fails, the request bypasses caching.
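A minimal, self-contained sketch of that gate is shown below. The helper names mirror the Processing Functions table later in this document, and the character-based token estimate is only an illustrative stand-in for the real heuristic in message_formatter.py.

import os

def should_apply_caching(messages: list[dict], model: str) -> bool:
    """Illustrative sketch of the three-check gate (not the actual implementation)."""
    # 1. Caching must be enabled via the environment flag
    if os.getenv("MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED", "false").lower() != "true":
        return False
    # 2. Only Gemini models are eligible
    if "google/gemini" not in model.lower():
        return False
    # 3. The prompt must be large enough to be worth caching (~1024 tokens,
    #    approximated here as roughly 4 characters per token)
    total_chars = sum(
        len(m["content"]) for m in messages if isinstance(m.get("content"), str)
    )
    return total_chars >= 1024 * 4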

Metrics Flow

[Diagram: Cache Metrics Flow]

Cache metrics are parsed from OpenRouter responses; savings and hit rates are calculated, recorded in SessionMetrics, and exported to Prometheus.

Complete Request Flow

[Diagram: Complete Cache Request Flow]

The complete flow from LLM request through client routing, caching wrapper, preprocessing, API call, and postprocessing with metrics extraction.

File Architecture

All cache implementation files live in docker/patches/ following Constitution Principle VII (Language Separation): Python code in docker/, TypeScript code in src/.

| File | Purpose | Key Exports |
|------|---------|-------------|
| factories.py | Client routing logic | LLMClientFactory.create() |
| caching_wrapper.py | Patches OpenAI client | wrap_openai_client_for_caching() |
| message_formatter.py | Adds cache_control markers | format_messages_for_caching() |
| cache_metrics.py | Per-request metrics | CacheMetrics, PricingTier |
| session_metrics.py | Session aggregation | SessionMetrics |
| metrics_exporter.py | Prometheus export | CacheMetricsExporter |
| caching_llm_client.py | Wrapper class | CachingLLMClient |

Client Routing

The critical design decision in this implementation: Gemini models on OpenRouter must use OpenAIGenericClient to route requests through /chat/completions instead of /responses.

Location: docker/patches/factories.py (lines 305-343)

# Simplified routing logic
if _is_gemini_model(config.model) and 'openrouter' in provider_name.lower():
    # Use OpenAIGenericClient (routes to /chat/completions)
    client = OpenAIGenericClient(config=llm_config)
    if _caching_wrapper_available:
        client = wrap_openai_client_for_caching(client, config.model)
    return client

Why This Matters

| Endpoint | Multipart Format | Caching Works | Structured Output |
|----------|------------------|---------------|-------------------|
| /responses | Rejected | No | Yes |
| /chat/completions | Accepted | Yes | Yes (json_schema) |

The /responses endpoint returns an "expected string, received array" error when it receives multipart messages with cache_control markers. The /chat/completions endpoint accepts both the multipart format and json_schema for structured outputs.

Message Formatting

Location: docker/patches/message_formatter.py

The formatter converts messages to multipart format and adds cache_control markers to the last message only.

Transformation Example

Input (standard format):

{"role": "system", "content": "You are an AI assistant..."}

Output (multipart with cache_control):

{
    "role": "system",
    "content": [
        {
            "type": "text",
            "text": "You are an AI assistant...",
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }
    ]
}

Why Mark Only the Last Message?

The cache_control marker tells Gemini "everything up to and including this point should be cached." By marking the last message:

  1. System prompts and context (earlier messages) become part of the cached prefix
  2. First request creates the cache (pays full price)
  3. Subsequent requests with the same prefix get 90% discount on cached tokens
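As a purely illustrative example of that behavior (message contents are hypothetical), two consecutive requests that share the same system prompt differ only in the final user turn, so the shared prefix is what gets reused from cache:

# Request 1: no matching prefix exists yet, so the cache is created at full price
request_1 = [
    {"role": "system", "content": "You are an AI assistant..."},       # shared prefix
    {"role": "user", "content": "Extract entities from document A."},  # marked with cache_control
]

# Request 2: the system prompt matches the cached prefix and is billed at the
# discounted cached-token rate; only the new user turn is charged in full
request_2 = [
    {"role": "system", "content": "You are an AI assistant..."},
    {"role": "user", "content": "Extract entities from document B."},
]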

Processing Functions

| Function | Purpose |
|----------|---------|
| is_caching_enabled() | Checks MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED |
| is_gemini_model(model) | Pattern matches "google/gemini" |
| is_cacheable_request(messages) | Heuristic: minimum ~1024 tokens |
| convert_to_multipart(message) | Converts string content to array format |
| add_cache_control_marker(message) | Adds {"type": "ephemeral", "ttl": "..."} |
| format_messages_for_caching(messages, model) | Main entry point |
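Roughly, the entry point composes the helpers above; the following is a simplified sketch rather than the exact code in message_formatter.py:

def format_messages_for_caching(messages, model):
    # Bypass caching entirely unless all three checks pass
    if not (is_caching_enabled()
            and is_gemini_model(model)
            and is_cacheable_request(messages)):
        return messages

    # Convert every message to multipart, then mark only the final one
    formatted = [convert_to_multipart(m) for m in messages]
    formatted[-1] = add_cache_control_marker(formatted[-1])
    return formatted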

Metrics Data Models

CacheMetrics (Per-Request)

Location: docker/patches/cache_metrics.py

@dataclass
class CacheMetrics:
    cache_hit: bool           # Was cache used?
    cached_tokens: int        # Tokens served from cache
    prompt_tokens: int        # Total input tokens
    completion_tokens: int    # Output tokens
    tokens_saved: int         # Alias for cached_tokens
    cost_without_cache: float # Hypothetical cost (USD)
    actual_cost: float        # Real cost after discount
    cost_saved: float         # Dollar savings
    savings_percent: float    # Reduction percentage (0-100)
    model: str                # Model identifier

Cost Calculation

# Cost without cache (all tokens at full rate)
cost_without_cache = (prompt_tokens * input_rate) + (completion_tokens * output_rate)

# Actual cost (cached tokens at 90% discount)
uncached_tokens = prompt_tokens - cached_tokens
actual_cost = (uncached_tokens * input_rate) + \
              (cached_tokens * cached_input_rate) + \
              (completion_tokens * output_rate)

# Savings
cost_saved = cost_without_cache - actual_cost

Pricing Tiers

| Model | Input (per 1M) | Cached (90% off) | Output (per 1M) |
|-------|----------------|------------------|-----------------|
| gemini-2.0-flash-001 | $0.10 | $0.01 | $0.40 |
| gemini-2.5-flash | $0.30 | $0.03 | $2.50 |
| gemini-2.5-pro | $1.25 | $0.125 | $10.00 |
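As a worked example using the gemini-2.0-flash-001 rates above (the token counts are hypothetical):

# 10,000 prompt tokens, of which 8,000 are served from cache, plus 500 output tokens
input_rate = 0.10 / 1_000_000
cached_input_rate = 0.01 / 1_000_000
output_rate = 0.40 / 1_000_000

prompt_tokens, cached_tokens, completion_tokens = 10_000, 8_000, 500

cost_without_cache = prompt_tokens * input_rate + completion_tokens * output_rate
# 0.001000 + 0.000200 = $0.001200

actual_cost = ((prompt_tokens - cached_tokens) * input_rate
               + cached_tokens * cached_input_rate
               + completion_tokens * output_rate)
# 0.000200 + 0.000080 + 0.000200 = $0.000480

cost_saved = cost_without_cache - actual_cost            # $0.000720
savings_percent = 100 * cost_saved / cost_without_cache  # 60.0% of the bill avoided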

SessionMetrics (Aggregated)

Location: docker/patches/session_metrics.py

Aggregates metrics across all requests in a session:

| Property | Type | Description |
|----------|------|-------------|
| total_requests | int | Total API calls |
| cache_hits | int | Requests with cache_hit=true |
| cache_misses | int | Requests with cache_hit=false |
| total_cached_tokens | int | Sum of all cached_tokens |
| total_cost_without_cache | float | Cumulative hypothetical cost |
| total_actual_cost | float | Cumulative real cost |
| total_cost_saved | float | Total savings |
| cache_hit_rate | float | Calculated percentage (0-100) |
| overall_savings_percent | float | Overall cost reduction % |
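A minimal sketch of how that aggregation can work is shown below; the property names follow the table above, but the real class in session_metrics.py is the authoritative implementation:

from dataclasses import dataclass

@dataclass
class SessionMetrics:  # illustrative sketch, not the actual session_metrics.SessionMetrics
    total_requests: int = 0
    cache_hits: int = 0
    cache_misses: int = 0
    total_cached_tokens: int = 0
    total_cost_without_cache: float = 0.0
    total_actual_cost: float = 0.0

    def record_request(self, m: "CacheMetrics") -> None:
        # Fold one request's CacheMetrics into the running totals
        self.total_requests += 1
        self.cache_hits += int(m.cache_hit)
        self.cache_misses += int(not m.cache_hit)
        self.total_cached_tokens += m.cached_tokens
        self.total_cost_without_cache += m.cost_without_cache
        self.total_actual_cost += m.actual_cost

    @property
    def total_cost_saved(self) -> float:
        return self.total_cost_without_cache - self.total_actual_cost

    @property
    def cache_hit_rate(self) -> float:
        return 100.0 * self.cache_hits / self.total_requests if self.total_requests else 0.0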

The Caching Wrapper

Location: docker/patches/caching_wrapper.py

The wrapper monkey-patches chat.completions.create() to intercept requests and responses.

Request Processing

async def create_with_caching(*args, **kwargs):
    # 1. PREPROCESSING: Format messages with cache_control
    if 'messages' in kwargs:
        kwargs['messages'] = format_messages_for_caching(
            kwargs['messages'], model
        )

    # 2. TIMING: Measure request duration
    start_time = time.monotonic()
    response = await original_create(*args, **kwargs)
    duration = time.monotonic() - start_time

    # 3. POSTPROCESSING: Extract metrics
    extract_and_record_metrics(response)

    return response

Response Metric Extraction

The wrapper handles multiple response formats:

| Format | Source | Fields |
|--------|--------|--------|
| OpenRouter (root-level) | response.native_tokens_cached | native_tokens_cached, tokens_prompt, tokens_completion |
| OpenAI (usage object) | response.usage.cached_tokens | cached_tokens, prompt_tokens, completion_tokens |
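A hedged sketch of handling both shapes (attribute names as listed above; the exact access pattern in caching_wrapper.py may differ):

def extract_cache_usage(response):
    """Return (cached_tokens, prompt_tokens, completion_tokens) from either response shape."""
    # OpenRouter shape: token counts live at the response root
    if getattr(response, "native_tokens_cached", None) is not None:
        return (response.native_tokens_cached,
                response.tokens_prompt,
                response.tokens_completion)
    # OpenAI shape: counts nested under response.usage
    usage = getattr(response, "usage", None)
    if usage is not None:
        return (getattr(usage, "cached_tokens", 0) or 0,
                usage.prompt_tokens,
                usage.completion_tokens)
    # Unknown shape: treat as an uncached request with no usable counts
    return (0, 0, 0)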

Known Limitation: responses.parse()

The responses.parse() endpoint does not support multipart format:

Error: expected string, received array

Solution: Caching is automatically skipped for responses.parse() calls. Only chat.completions.create() supports caching.

Prometheus Integration

Location: docker/patches/metrics_exporter.py

Initialization

The metrics exporter is initialized in graphiti_mcp_server.py:

if _metrics_exporter_available:
    metrics_port = int(os.getenv("MADEINOZ_KNOWLEDGE_METRICS_PORT", "9090"))
    # Parse the flag to a real boolean; the raw environment value is a string
    metrics_enabled = os.getenv("MADEINOZ_KNOWLEDGE_PROMPT_CACHE_METRICS_ENABLED", "true").lower() == "true"
    initialize_metrics_exporter(enabled=metrics_enabled, port=metrics_port)
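Internally the exporter is presumably built on the standard prometheus_client primitives; a minimal sketch of what initialization could look like (only the metric names are taken from this document, everything else is illustrative):

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative sketch; CacheMetricsExporter in metrics_exporter.py wraps primitives like these
cache_hits = Counter("graphiti_cache_hits_total", "Cache hit count", ["model"])
cache_hit_rate = Gauge("graphiti_cache_hit_rate", "Current cache hit rate (%)", ["model"])
cache_enabled = Gauge("graphiti_cache_enabled", "1=enabled, 0=disabled")

def initialize(enabled: bool, port: int) -> None:
    cache_enabled.set(1 if enabled else 0)
    if enabled:
        # Serves plaintext metrics at http://<host>:<port>/metrics
        start_http_server(port)

# Recording a hit for one request:
#   cache_hits.labels(model="google/gemini-2.0-flash-001").inc()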

Exposed Metrics

Counters (Cumulative)

| Metric | Labels | Description |
|--------|--------|-------------|
| graphiti_cache_hits_total | model | Cache hit count |
| graphiti_cache_misses_total | model | Cache miss count |
| graphiti_cache_tokens_saved_total | model | Total cached tokens |
| graphiti_cache_cost_saved_total | model | Total $ saved |
| graphiti_prompt_tokens_total | model | Input tokens used |
| graphiti_completion_tokens_total | model | Output tokens used |
| graphiti_api_cost_total | model | Total API cost |

Gauges (Current Value)

| Metric | Labels | Description |
|--------|--------|-------------|
| graphiti_cache_hit_rate | model | Current hit rate % |
| graphiti_cache_enabled | (none) | 1=enabled, 0=disabled |

Histograms (Distributions)

| Metric | Description |
|--------|-------------|
| graphiti_prompt_tokens_per_request | Input token distribution |
| graphiti_api_cost_per_request | Cost per request |
| graphiti_llm_request_duration_seconds | Response latency |

Example Output

# Cache hit rate gauge
graphiti_cache_hit_rate{model="google/gemini-2.0-flash-001"} 75.77

# Cache hits counter
graphiti_cache_hits_total{model="google/gemini-2.0-flash-001"} 1523

# Cost savings counter
graphiti_cache_cost_saved_total{model="google/gemini-2.0-flash-001"} 0.7584
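Assuming the default port and the standard prometheus_client /metrics path, the endpoint can be checked directly:

import urllib.request

# Fetch the metrics endpoint and print only the cache-related series
with urllib.request.urlopen("http://localhost:9090/metrics") as resp:
    body = resp.read().decode("utf-8")

print("\n".join(line for line in body.splitlines() if line.startswith("graphiti_cache")))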

Configuration Reference

| Variable | Default | Description |
|----------|---------|-------------|
| MADEINOZ_KNOWLEDGE_PROMPT_CACHE_ENABLED | false | Enable/disable caching |
| MADEINOZ_KNOWLEDGE_PROMPT_CACHE_TTL | 1h | Cache TTL (5m or 1h) |
| MADEINOZ_KNOWLEDGE_METRICS_PORT | 9090 | Prometheus metrics port |
| MADEINOZ_KNOWLEDGE_PROMPT_CACHE_METRICS_ENABLED | true | Enable metrics export |
| MADEINOZ_KNOWLEDGE_PROMPT_CACHE_LOG_REQUESTS | false | Debug logging |

Testing

Test files follow Constitution Principle VII (Language Separation):

| Test File | Location | Purpose |
|-----------|----------|---------|
| test_cache_metrics.py | docker/tests/unit/ | Cache metrics extraction |
| test_session_metrics.py | docker/tests/unit/ | Session aggregation |
| test_caching_e2e.py | docker/tests/integration/ | End-to-end workflow |
| test_metrics_endpoint.py | docker/tests/integration/ | Prometheus endpoint |
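As an illustration of the kind of consistency check test_cache_metrics.py can perform (a hypothetical test; the import path and the direct construction of CacheMetrics are assumptions):

import pytest

from cache_metrics import CacheMetrics  # import path depends on the test environment

def test_savings_fields_are_consistent():
    # Hypothetical values for a single partially cached request
    m = CacheMetrics(
        cache_hit=True,
        cached_tokens=8_000,
        prompt_tokens=10_000,
        completion_tokens=500,
        tokens_saved=8_000,
        cost_without_cache=0.0012,
        actual_cost=0.00048,
        cost_saved=0.00072,
        savings_percent=60.0,
        model="google/gemini-2.0-flash-001",
    )
    assert m.cost_saved == pytest.approx(m.cost_without_cache - m.actual_cost)
    assert m.tokens_saved == m.cached_tokens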

Success Criteria

| Metric | Target |
|--------|--------|
| Token reduction from cache | 40%+ |
| Cache hit rate (after warmup) | 60%+ |
| Response time improvement | 25%+ faster |

Quick Reference

Dependency Graph

factories.py
    → wrap_openai_client_for_caching() [caching_wrapper.py]
        → format_messages_for_caching() [message_formatter.py]
        → extract_and_record_metrics()
            → CacheMetrics.from_openrouter_response() [cache_metrics.py]
            → SessionMetrics.record_request() [session_metrics.py]
            → metrics_exporter.record_*() [metrics_exporter.py]

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| OpenAIGenericClient routing | /responses rejects multipart; /chat/completions supports both |
| Mark only last message | System prompts become cached prefix |
| Session-level metrics | Track cumulative effectiveness |
| Prometheus + OpenTelemetry | Standard observability format |
| 90% cached token discount | Reflects Gemini's actual pricing |
| 1-hour default TTL | Balance hit rate vs freshness |