
Phase C Implementation Complete

Date: 2026-02-17 Branch: feature/sonnet-impl-20260217-155229 Status: ✅ Implemented (Testing Deferred)


Summary

Phase C implements caching and tier-based cost control:

  • Rerank Result Caching: 15-minute TTL, prevents duplicate GPT-mini calls
  • Chunk Text Caching: 1-hour TTL LRU cache, reduces Postgres queries
  • Tier-Based Limits: Free/Standard/Premium tiers with different top-K and rerank-N
  • Cost Estimation: Prevents over-spending per tier

What Was Implemented

S10: Rerank Result Caching

New: app/services/cache.py

  • LRUCache: in-memory LRU cache with TTL support
  • CacheService: dual-cache service (rerank + chunk text)
  • Rerank cache key: sha256(query + namespace + tier)
  • TTL: 15 minutes (configurable)
  • Max size: 500 entries
  • Invalidation: on re-ingestion of namespace

Cache Flow:

# Check cache
cached = cache_service.get_rerank_result(query, namespace, tier)
if cached:
    return cached  # Cache hit

# Call GPT-mini for reranking
reranked = gpt_mini_service.rerank(query, candidates)

# Store in cache
cache_service.set_rerank_result(query, namespace, tier, reranked)
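A minimal sketch of an LRU cache of the shape described above — OrderedDict-based eviction with a per-entry TTL, plus the sha256 key scheme for rerank results. Method names mirror those used in this document; the internals are illustrative, not the actual implementation in app/services/cache.py:

```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """In-memory LRU cache with per-entry TTL (illustrative sketch)."""

    def __init__(self, max_size: int = 500, ttl: int = 900):
        self.max_size = max_size
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.time() > expires_at:
            del self._store[key]  # expired entry
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (time.time() + self.ttl, value)
        self._store.move_to_end(key)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used

def rerank_cache_key(query: str, namespace: str, tier: str) -> str:
    """sha256(query + namespace + tier), per the key scheme above."""
    return hashlib.sha256(f"{query}|{namespace}|{tier}".encode()).hexdigest()
```

The `while` loop (rather than a single `popitem`) keeps the invariant even if `max_size` is lowered at runtime.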

S11: Chunk Text Caching

In: app/services/cache.py

  • Separate LRU cache for chunk text
  • TTL: 1 hour (configurable)
  • Max size: 5,000 entries
  • Invalidation: on re-ingestion of file

Retrieval Flow:

# Get chunk IDs from Pinecone
chunk_ids = [match["id"] for match in pinecone_results]

# Fetch chunk text (with cache)
chunk_texts = []
for chunk_id in chunk_ids:
    text = cache_service.get_chunk_text(chunk_id)
    if text is None:
        # Cache miss - fetch from Postgres and extract the content field
        result = supabase.table("chunks").select("content").eq("chunk_id", chunk_id).execute()
        text = result.data[0]["content"] if result.data else ""
        cache_service.set_chunk_text(chunk_id, text)
    chunk_texts.append(text)

S12: Tier-Based Retrieval Limits

New: app/services/tier_config.py

  • TierLimits: Pydantic model for tier configuration
  • TierConfig: tier limits manager

Tier Limits:

Tier      Top-K  Rerank-N  Max Tokens  Daily Limit  Rate Limit
Free      10     3         500         50/day       10/min
Standard  20     5         2,000       500/day      30/min
Premium   30     8         4,000       Unlimited    100/min
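A sketch of how tier_config.py might encode the table above. The actual file uses a Pydantic model; a frozen dataclass stands in here to keep the sketch dependency-free, and the field names and class layout are assumptions — only the limit values come from the table:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TierLimits:
    top_k: int
    rerank_n: int
    max_tokens: int
    daily_limit: Optional[int]  # None = unlimited (Premium)
    rate_limit_per_min: int

# Values taken from the tier table above
TIERS = {
    "free": TierLimits(top_k=10, rerank_n=3, max_tokens=500,
                       daily_limit=50, rate_limit_per_min=10),
    "standard": TierLimits(top_k=20, rerank_n=5, max_tokens=2000,
                           daily_limit=500, rate_limit_per_min=30),
    "premium": TierLimits(top_k=30, rerank_n=8, max_tokens=4000,
                          daily_limit=None, rate_limit_per_min=100),
}

class TierConfig:
    """Static lookup over the tier table."""

    @staticmethod
    def get_top_k(tier: str) -> int:
        return TIERS[tier].top_k

    @staticmethod
    def get_rerank_n(tier: str) -> int:
        return TIERS[tier].rerank_n
```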

Usage in Retrieval:

from app.services.tier_config import TierConfig

# Get user tier from wallet
user_tier = wallet["subscription_tier"]  # 'free', 'standard', 'premium'

# Enforce tier limits
top_k = TierConfig.get_top_k(user_tier)
rerank_n = TierConfig.get_rerank_n(user_tier)

# Query Pinecone with tier-specific top-K
results = pinecone_adapter.query(query_vector, top_k=top_k, ...)

# Rerank with tier-specific rerank-N
reranked = gpt_mini_service.rerank(query, results, top_n=rerank_n)

Cost Estimation:

# Estimate cost before reservation
estimated = TierConfig.calculate_estimated_cost(
    tier=user_tier,
    query_length=len(query),
    context_chunks=None  # Uses tier default
)

# Check if tier can afford request
can_afford = TierConfig.can_afford_request(user_tier, estimated)
if not can_afford:
    return {"error": "request_exceeds_tier_limit"}
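One plausible shape for the cost-estimation pair used above, assuming a simple per-token input pricing model. Every constant here (token rate, chars-per-token heuristic, per-tier spend ceilings, average chunk length) is illustrative, not the real pricing:

```python
from typing import Optional

# Illustrative pricing assumptions -- NOT the real model rates
COST_PER_INPUT_TOKEN = 0.00000015   # hypothetical GPT-mini-class input rate (USD)
CHARS_PER_TOKEN = 4                 # rough heuristic for token estimation

# Hypothetical per-tier spend ceiling for a single request (USD)
MAX_COST_PER_REQUEST = {"free": 0.001, "standard": 0.01, "premium": 0.05}

TIER_DEFAULT_CHUNKS = {"free": 3, "standard": 5, "premium": 8}  # = rerank-N
AVG_CHUNK_CHARS = 1200  # assumed average chunk length

def calculate_estimated_cost(tier: str, query_length: int,
                             context_chunks: Optional[int] = None) -> float:
    """Estimate input cost: query tokens plus context-chunk tokens."""
    chunks = context_chunks if context_chunks is not None else TIER_DEFAULT_CHUNKS[tier]
    est_chars = query_length + chunks * AVG_CHUNK_CHARS
    est_tokens = est_chars / CHARS_PER_TOKEN
    return est_tokens * COST_PER_INPUT_TOKEN

def can_afford_request(tier: str, estimated_cost: float) -> bool:
    """Reject requests whose estimate exceeds the tier's per-request ceiling."""
    return estimated_cost <= MAX_COST_PER_REQUEST[tier]
```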


Cache Statistics

The cache service provides a stats endpoint:

stats = cache_service.get_stats()
# {
#   "rerank_cache": {"size": 123, "max_size": 500, "ttl": 900},
#   "chunk_cache": {"size": 2456, "max_size": 5000, "ttl": 3600}
# }

Cache Invalidation Strategy

On Re-Ingestion

When a file is re-ingested:

# Invalidate rerank cache for the namespace
cache_service.invalidate_rerank_namespace(namespace)

# Invalidate chunk text cache for the file
cache_service.invalidate_chunks_by_file(file_id)
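Because entries are keyed by hash or chunk ID, namespace- and file-level invalidation need a reverse index from scope to keys. A sketch of one way that bookkeeping could work — the index structure and helper names are assumptions, with a plain dict standing in for the cache:

```python
from collections import defaultdict

class InvalidationIndex:
    """Reverse index: remembers which cache keys were written under which scope."""

    def __init__(self):
        self.rerank_keys_by_namespace = defaultdict(set)
        self.chunk_keys_by_file = defaultdict(set)

    def track_rerank(self, namespace: str, key: str) -> None:
        self.rerank_keys_by_namespace[namespace].add(key)

    def track_chunk(self, file_id: str, chunk_id: str) -> None:
        self.chunk_keys_by_file[file_id].add(chunk_id)

def invalidate_rerank_namespace(cache: dict, index: InvalidationIndex,
                                namespace: str) -> int:
    """Drop every rerank entry written under this namespace; return count removed."""
    keys = index.rerank_keys_by_namespace.pop(namespace, set())
    removed = 0
    for key in keys:
        if cache.pop(key, None) is not None:
            removed += 1
    return removed
```

File-level chunk invalidation would follow the same pattern over `chunk_keys_by_file`.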

Manual Cache Clear

Admin endpoint to clear caches:

# Clear all caches
cache_service.rerank_cache.clear()
cache_service.chunk_cache.clear()

Production Considerations

In-Memory Caching Limitations

The current implementation uses in-memory LRU caches, which:

  • ✅ Work for single-instance deployments
  • ❌ Do NOT work for multi-instance deployments (no shared cache)

Migrating to Redis (Production)

For multi-instance deployments, migrate to Redis:

from typing import Optional

import redis

class RedisCache:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)

    def get(self, key: str) -> Optional[str]:
        value = self.client.get(key)
        return value.decode("utf-8") if value else None

    def set(self, key: str, value: str, ttl: int):
        self.client.setex(key, ttl, value)

Configuration:

# Add to .env
REDIS_URL=redis://localhost:6379/0
CACHE_BACKEND=redis  # or 'memory'
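The CACHE_BACKEND switch could be wired up with a small factory. This is a hedged sketch: the selection logic is an assumption, and trivial stand-in classes replace the real RedisCache and in-memory LRUCache so the example is self-contained:

```python
import os
from typing import Optional

class _MemoryCache:   # stand-in for the existing in-memory LRUCache
    backend = "memory"

class _RedisCache:    # stand-in for the RedisCache shown above
    backend = "redis"
    def __init__(self, url: str):
        self.url = url

def make_cache(backend: Optional[str] = None):
    """Select a cache backend from CACHE_BACKEND, defaulting to in-memory."""
    backend = backend or os.getenv("CACHE_BACKEND", "memory")
    if backend == "redis":
        url = os.getenv("REDIS_URL", "redis://localhost:6379/0")
        return _RedisCache(url)
    return _MemoryCache()
```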


Testing Checklist

Rerank Caching (S10)

  • [ ] First query executes GPT-mini rerank
  • [ ] Second identical query within 15 min returns cached result (no GPT-mini call)
  • [ ] Query after 15 min TTL executes GPT-mini again
  • [ ] Re-ingestion of namespace invalidates rerank cache

Chunk Caching (S11)

  • [ ] First retrieval fetches chunk text from Postgres
  • [ ] Second retrieval for same chunk returns cached text (no Postgres query)
  • [ ] Cache evicts least-recently-used entries when max size reached
  • [ ] Re-ingestion of file invalidates chunk cache

Tier Limits (S12)

  • [ ] Free tier retrieves top-10 chunks
  • [ ] Standard tier retrieves top-20 chunks
  • [ ] Premium tier retrieves top-30 chunks
  • [ ] Request exceeding tier token limit returns error
  • [ ] Cost estimation calculates correctly for each tier

Performance Metrics

Expected improvements:

Metric                                  Before (No Cache)  After (With Cache)              Improvement
Rerank latency (cache hit)              2-5 seconds        < 10 ms                         ~99% faster
Chunk fetch latency (cache hit)         50-200 ms          < 1 ms                          ~99% faster
Postgres queries per request            10-30              0-30 (depends on cache hits)    Up to 100% reduction
GPT-mini API calls (duplicate queries)  100%               10-20% (depends on hit rate)    80-90% reduction

Configuration

Environment Variables

Add to .env:

# Cache TTLs
CACHE_RERANK_TTL_SECONDS=900  # 15 minutes
CACHE_CHUNK_TTL_SECONDS=3600  # 1 hour

# Tier limits (already configured in tier_config.py)
TIER_FREE_TOP_K=10
TIER_FREE_RERANK_N=3
TIER_STANDARD_TOP_K=20
TIER_STANDARD_RERANK_N=5
TIER_PREMIUM_TOP_K=30
TIER_PREMIUM_RERANK_N=8

Next Phase: Phase D - Retrieval Pipeline

Once Phase C tests pass:

  • S20: GPT-mini service (reranker, language detection, validation)
  • Retrieval → Rerank → Reasoning pipeline integration

Files Changed

New Files

  • app/services/cache.py - LRU cache with TTL support (S10-S11)
  • app/services/tier_config.py - Tier limits and cost estimation (S12)

Status: ✅ Phase C Complete - Ready for Testing


See SONNET_RUN.md for full implementation log