Phase C Implementation Complete
Date: 2026-02-17
Branch: feature/sonnet-impl-20260217-155229
Status: ✅ Implemented (Testing Deferred)
Summary
Phase C implements caching and tier-based cost control:
- Rerank Result Caching: 15-minute TTL, prevents duplicate GPT-mini calls
- Chunk Text Caching: 1-hour TTL LRU cache, reduces Postgres queries
- Tier-Based Limits: Free/Standard/Premium tiers with different top-K and rerank-N
- Cost Estimation: Blocks requests whose estimated cost exceeds the tier's budget
What Was Implemented
S10: Rerank Result Caching
New: app/services/cache.py
- LRUCache: In-memory LRU cache with TTL support
- CacheService: Dual-cache service (rerank + chunk text)
- Rerank cache key: sha256(query + namespace + tier)
- TTL: 15 minutes (configurable)
- Max size: 500 entries
- Invalidation: On re-ingestion of namespace
Cache Flow:
```python
# Check the rerank cache before calling the model
cached = cache_service.get_rerank_result(query, namespace, tier)
if cached:
    return cached  # Cache hit - skip the GPT-mini call

# Cache miss - call GPT-mini for reranking
reranked = gpt_mini_service.rerank(query, candidates)

# Store the result for subsequent identical queries
cache_service.set_rerank_result(query, namespace, tier, reranked)
```
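For reference, here is a minimal sketch of what an LRU-with-TTL cache like the one in app/services/cache.py could look like. The class internals and the `rerank_cache_key` helper name are assumptions; only the sha256(query + namespace + tier) key recipe comes from the notes above.
```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """In-memory LRU cache with a per-entry TTL."""

    def __init__(self, max_size: int = 500, ttl_seconds: int = 900):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store: "OrderedDict[str, tuple]" = OrderedDict()  # key -> (expiry, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]          # Expired - drop it
            return None
        self._store.move_to_end(key)      # Mark as recently used
        return value

    def set(self, key: str, value: Any) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (time.monotonic() + self.ttl, value)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # Evict least-recently-used

    def delete(self, key: str) -> None:
        self._store.pop(key, None)

    def clear(self) -> None:
        self._store.clear()

def rerank_cache_key(query: str, namespace: str, tier: str) -> str:
    """sha256(query + namespace + tier), as described above."""
    return hashlib.sha256(f"{query}|{namespace}|{tier}".encode()).hexdigest()
```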
S11: Chunk Text Caching
In: app/services/cache.py
- Separate LRU cache for chunk text
- TTL: 1 hour (configurable)
- Max size: 5,000 entries
- Invalidation: On re-ingestion of file
Retrieval Flow:
```python
# Get chunk IDs from Pinecone
chunk_ids = [match["id"] for match in pinecone_results]

# Fetch chunk text, consulting the cache first
chunk_texts = []
for chunk_id in chunk_ids:
    text = cache_service.get_chunk_text(chunk_id)
    if text is None:
        # Cache miss - fetch from Postgres and extract the content column
        response = (
            supabase.table("chunks")
            .select("content")
            .eq("chunk_id", chunk_id)
            .execute()
        )
        text = response.data[0]["content"]
        cache_service.set_chunk_text(chunk_id, text)
    chunk_texts.append(text)
```
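On top of that primitive, the dual-cache CacheService could be as thin as two LRUCache instances behind a keyed API. This is again a sketch reusing the hypothetical class above, not the actual implementation:
```python
class CacheService:
    """Dual cache: rerank results (15 min TTL) + chunk text (1 h TTL)."""

    def __init__(self):
        self.rerank_cache = LRUCache(max_size=500, ttl_seconds=900)
        self.chunk_cache = LRUCache(max_size=5000, ttl_seconds=3600)

    def get_rerank_result(self, query: str, namespace: str, tier: str):
        return self.rerank_cache.get(rerank_cache_key(query, namespace, tier))

    def set_rerank_result(self, query: str, namespace: str, tier: str, result) -> None:
        self.rerank_cache.set(rerank_cache_key(query, namespace, tier), result)

    def get_chunk_text(self, chunk_id: str):
        return self.chunk_cache.get(chunk_id)

    def set_chunk_text(self, chunk_id: str, text: str) -> None:
        self.chunk_cache.set(chunk_id, text)
```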
S12: Tier-Based Retrieval Limits
New: app/services/tier_config.py
- TierLimits: Pydantic model for tier configuration
- TierConfig: Tier limits manager
Tier Limits:
| Tier | Top-K | Rerank-N | Max Tokens | Daily Limit | Rate Limit |
|---|---|---|---|---|---|
| Free | 10 | 3 | 500 | 50/day | 10/min |
| Standard | 20 | 5 | 2,000 | 500/day | 30/min |
| Premium | 30 | 8 | 4,000 | Unlimited | 100/min |
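A plausible shape for tier_config.py mirroring this table; the field names and class-method accessors are assumptions beyond what the notes state:
```python
from typing import Optional
from pydantic import BaseModel

class TierLimits(BaseModel):
    top_k: int
    rerank_n: int
    max_tokens: int
    daily_limit: Optional[int]  # None = unlimited
    rate_limit_per_min: int

TIER_LIMITS = {
    "free":     TierLimits(top_k=10, rerank_n=3, max_tokens=500,
                           daily_limit=50, rate_limit_per_min=10),
    "standard": TierLimits(top_k=20, rerank_n=5, max_tokens=2000,
                           daily_limit=500, rate_limit_per_min=30),
    "premium":  TierLimits(top_k=30, rerank_n=8, max_tokens=4000,
                           daily_limit=None, rate_limit_per_min=100),
}

class TierConfig:
    @classmethod
    def get_top_k(cls, tier: str) -> int:
        return TIER_LIMITS[tier].top_k

    @classmethod
    def get_rerank_n(cls, tier: str) -> int:
        return TIER_LIMITS[tier].rerank_n
```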
Usage in Retrieval:
```python
from app.services.tier_config import TierConfig

# Get user tier from wallet
user_tier = wallet["subscription_tier"]  # 'free', 'standard', or 'premium'

# Enforce tier limits
top_k = TierConfig.get_top_k(user_tier)
rerank_n = TierConfig.get_rerank_n(user_tier)

# Query Pinecone with the tier-specific top-K
results = pinecone_adapter.query(query_vector, top_k=top_k, ...)

# Rerank with the tier-specific rerank-N
reranked = gpt_mini_service.rerank(query, results, top_n=rerank_n)
```
Cost Estimation:
```python
# Estimate cost before reserving funds
estimated = TierConfig.calculate_estimated_cost(
    tier=user_tier,
    query_length=len(query),
    context_chunks=None,  # None falls back to the tier default
)

# Reject the request if it would exceed the tier's budget
if not TierConfig.can_afford_request(user_tier, estimated):
    return {"error": "request_exceeds_tier_limit"}
```
Cache Statistics
The cache service exposes a stats snapshot for monitoring:
```python
stats = cache_service.get_stats()
# {
#     "rerank_cache": {"size": 123, "max_size": 500, "ttl": 900},
#     "chunk_cache": {"size": 2456, "max_size": 5000, "ttl": 3600}
# }
```
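A get_stats consistent with that payload only needs to read each cache's counters; the sketch below relies on the hypothetical LRUCache internals from earlier:
```python
class CacheService:  # stats portion of the sketch
    def get_stats(self) -> dict:
        """Report occupancy and configured limits for both caches."""
        return {
            "rerank_cache": {"size": len(self.rerank_cache._store),
                             "max_size": self.rerank_cache.max_size,
                             "ttl": self.rerank_cache.ttl},
            "chunk_cache": {"size": len(self.chunk_cache._store),
                            "max_size": self.chunk_cache.max_size,
                            "ttl": self.chunk_cache.ttl},
        }
```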
Cache Invalidation Strategy
On Re-Ingestion
When a file is re-ingested:
```python
# Invalidate rerank cache entries for the namespace
cache_service.invalidate_rerank_namespace(namespace)

# Invalidate chunk text cache entries for the file
cache_service.invalidate_chunks_by_file(file_id)
```
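Because entries are keyed by opaque hashes, namespace- and file-level invalidation needs a secondary index. One plausible approach (an assumption, the notes don't show the actual mechanism) is to track which keys belong to which namespace or file:
```python
from collections import defaultdict

class CacheService:  # invalidation portion of the sketch
    def __init__(self):
        self.rerank_cache = LRUCache(max_size=500, ttl_seconds=900)
        self.chunk_cache = LRUCache(max_size=5000, ttl_seconds=3600)
        self._namespace_keys = defaultdict(set)  # namespace -> rerank cache keys
        self._file_chunks = defaultdict(set)     # file_id -> chunk ids

    def set_rerank_result(self, query, namespace, tier, result) -> None:
        key = rerank_cache_key(query, namespace, tier)
        self._namespace_keys[namespace].add(key)
        self.rerank_cache.set(key, result)

    def invalidate_rerank_namespace(self, namespace: str) -> None:
        for key in self._namespace_keys.pop(namespace, set()):
            self.rerank_cache.delete(key)

    def invalidate_chunks_by_file(self, file_id: str) -> None:
        # Assumes _file_chunks was populated when chunks were cached
        # (e.g., via an optional file_id argument to set_chunk_text).
        for chunk_id in self._file_chunks.pop(file_id, set()):
            self.chunk_cache.delete(chunk_id)
```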
Manual Cache Clear
Admin endpoint to clear caches:
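The route itself isn't included in these notes; assuming a FastAPI app with an admin auth dependency (`require_admin` here is hypothetical), it might look like:
```python
from fastapi import APIRouter, Depends

router = APIRouter(prefix="/admin")

@router.post("/cache/clear")
def clear_caches(_admin=Depends(require_admin)):  # require_admin: assumed auth dep
    # Drop all entries from both in-memory caches
    cache_service.rerank_cache.clear()
    cache_service.chunk_cache.clear()
    return {"status": "cleared"}
```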
Production Considerations
In-Memory Caching Limitations
The current implementation uses in-memory LRU caches, which:
- ✅ Work for single-instance deployments
- ❌ Do NOT work for multi-instance deployments (no shared cache)
Migrating to Redis (Production)
For multi-instance deployments, migrate to Redis:
```python
from typing import Optional

import redis

class RedisCache:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)

    def get(self, key: str) -> Optional[str]:
        value = self.client.get(key)
        return value.decode("utf-8") if value else None

    def set(self, key: str, value: str, ttl: int) -> None:
        # SETEX stores the value with an expiry in seconds
        self.client.setex(key, ttl, value)
```
Configuration:
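The notes stop at this heading; presumably Redis would be wired in via environment variables along these lines (the variable names and values are illustrative, not confirmed):
```bash
# Hypothetical - the actual variable names aren't in these notes
REDIS_URL=redis://localhost:6379/0
CACHE_BACKEND=redis   # vs. "memory" for the current in-process caches
```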
Testing Checklist
Rerank Caching (S10)
- [ ] First query executes GPT-mini rerank
- [ ] Second identical query within 15 min returns cached result (no GPT-mini call)
- [ ] Query after 15 min TTL executes GPT-mini again
- [ ] Re-ingestion of namespace invalidates rerank cache
Chunk Caching (S11)
- [ ] First retrieval fetches chunk text from Postgres
- [ ] Second retrieval for same chunk returns cached text (no Postgres query)
- [ ] Cache evicts least-recently-used entries when max size reached
- [ ] Re-ingestion of file invalidates chunk cache
Tier Limits (S12)
- [ ] Free tier retrieves top-10 chunks
- [ ] Standard tier retrieves top-20 chunks
- [ ] Premium tier retrieves top-30 chunks
- [ ] Request exceeding tier token limit returns error
- [ ] Cost estimation calculates correctly for each tier
Performance Metrics
Expected improvements:
| Metric | Before (No Cache) | After (With Cache) | Improvement |
|---|---|---|---|
| Rerank latency (cache hit) | 2-5 seconds | < 10 ms | 99% faster |
| Chunk fetch latency (cache hit) | 50-200 ms | < 1 ms | 99% faster |
| Postgres queries per request | 10-30 | 0-30 (depends on cache hits) | Up to 100% reduction |
| GPT-mini API calls (duplicate queries) | 100% | 10-20% (depends on cache hit rate) | 80-90% reduction |
Configuration
Environment Variables
Add to .env:
```bash
# Cache TTLs
CACHE_RERANK_TTL_SECONDS=900    # 15 minutes
CACHE_CHUNK_TTL_SECONDS=3600    # 1 hour

# Tier limits (defaults already live in tier_config.py)
TIER_FREE_TOP_K=10
TIER_FREE_RERANK_N=3
TIER_STANDARD_TOP_K=20
TIER_STANDARD_RERANK_N=5
TIER_PREMIUM_TOP_K=30
TIER_PREMIUM_RERANK_N=8
```
Next Phase: Phase D - Retrieval Pipeline
Once Phase C tests pass:
- S20: GPT-mini service (reranker, language detection, validation)
- Retrieval → Rerank → Reasoning pipeline integration
Files Changed
New Files
- `app/services/cache.py` - LRU cache with TTL support (S10-S11)
- `app/services/tier_config.py` - Tier limits and cost estimation (S12)
Status: ✅ Phase C Complete - Ready for Testing
See SONNET_RUN.md for the full implementation log.