Phase C Implementation Complete
Date: 2026-02-17
Branch: feature/sonnet-impl-20260217-155229
Status: ✅ Implemented (Testing Deferred)
Summary
Phase C implements caching and tier-based cost control:
- Rerank Result Caching: 15-minute TTL, prevents duplicate GPT-mini calls
- Chunk Text Caching: 1-hour TTL LRU cache, reduces Postgres queries
- Tier-Based Limits: Free/Standard/Premium tiers with different top-K and rerank-N
- Cost Estimation: Blocks requests whose estimated cost exceeds the tier's budget
What Was Implemented
S10: Rerank Result Caching
New: app/services/cache.py
- LRUCache: In-memory LRU cache with TTL support
- CacheService: Dual-cache service (rerank + chunk text)
- Rerank cache key: sha256(query + namespace + tier)
- TTL: 15 minutes (configurable)
- Max size: 500 entries
- Invalidation: On re-ingestion of namespace
Cache Flow:
```python
# Check the rerank cache before calling the model
cached = cache_service.get_rerank_result(query, namespace, tier)
if cached:
    return cached  # Cache hit - skip the GPT-mini call

# Cache miss - call GPT-mini for reranking
reranked = gpt_mini_service.rerank(query, candidates)

# Store the result for subsequent identical queries
cache_service.set_rerank_result(query, namespace, tier, reranked)
```
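For reference, here is a minimal sketch of what an LRU-with-TTL cache like the one in app/services/cache.py could look like. The class internals and the `rerank_cache_key` helper name are assumptions; only the sha256(query + namespace + tier) key recipe comes from the notes above.
```python
import hashlib
import time
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """In-memory LRU cache with a per-entry TTL."""

    def __init__(self, max_size: int = 500, ttl_seconds: int = 900):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store: "OrderedDict[str, tuple]" = OrderedDict()  # key -> (expiry, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]          # Expired - drop it
            return None
        self._store.move_to_end(key)      # Mark as recently used
        return value

    def set(self, key: str, value: Any) -> None:
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (time.monotonic() + self.ttl, value)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # Evict least-recently-used

    def delete(self, key: str) -> None:
        self._store.pop(key, None)

    def clear(self) -> None:
        self._store.clear()

def rerank_cache_key(query: str, namespace: str, tier: str) -> str:
    """sha256(query + namespace + tier), as described above."""
    return hashlib.sha256(f"{query}|{namespace}|{tier}".encode()).hexdigest()
```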
S11: Chunk Text Caching
In: app/services/cache.py
- Separate LRU cache for chunk text
- TTL: 1 hour (configurable)
- Max size: 5,000 entries
- Invalidation: On re-ingestion of file
Retrieval Flow:
```python
# Get chunk IDs from Pinecone
chunk_ids = [match["id"] for match in pinecone_results]

# Fetch chunk text, consulting the cache first
chunk_texts = []
for chunk_id in chunk_ids:
    text = cache_service.get_chunk_text(chunk_id)
    if text is None:
        # Cache miss - fetch from Postgres and extract the content column
        response = (
            supabase.table("chunks")
            .select("content")
            .eq("chunk_id", chunk_id)
            .execute()
        )
        text = response.data[0]["content"]
        cache_service.set_chunk_text(chunk_id, text)
    chunk_texts.append(text)
```
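On top of that primitive, the dual-cache CacheService could be as thin as two LRUCache instances behind a keyed API. This is again a sketch reusing the hypothetical class above, not the actual implementation:
```python
class CacheService:
    """Dual cache: rerank results (15 min TTL) + chunk text (1 h TTL)."""

    def __init__(self):
        self.rerank_cache = LRUCache(max_size=500, ttl_seconds=900)
        self.chunk_cache = LRUCache(max_size=5000, ttl_seconds=3600)

    def get_rerank_result(self, query: str, namespace: str, tier: str):
        return self.rerank_cache.get(rerank_cache_key(query, namespace, tier))

    def set_rerank_result(self, query: str, namespace: str, tier: str, result) -> None:
        self.rerank_cache.set(rerank_cache_key(query, namespace, tier), result)

    def get_chunk_text(self, chunk_id: str):
        return self.chunk_cache.get(chunk_id)

    def set_chunk_text(self, chunk_id: str, text: str) -> None:
        self.chunk_cache.set(chunk_id, text)
```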
S12: Tier-Based Retrieval Limits
New: app/services/tier_config.py
- TierLimits: Pydantic model for tier configuration
- TierConfig: Tier limits manager
Tier Limits:
| Tier | Top-K | Rerank-N | Max Tokens | Daily Limit | Rate Limit |
|---|---|---|---|---|---|
| Free | 10 | 3 | 500 | 50/day | 10/min |
| Standard | 20 | 5 | 2,000 | 500/day | 30/min |
| Premium | 30 | 8 | 4,000 | Unlimited | 100/min |
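A plausible shape for tier_config.py mirroring this table; the field names and class-method accessors are assumptions beyond what the notes state:
```python
from typing import Optional
from pydantic import BaseModel

class TierLimits(BaseModel):
    top_k: int
    rerank_n: int
    max_tokens: int
    daily_limit: Optional[int]  # None = unlimited
    rate_limit_per_min: int

TIER_LIMITS = {
    "free":     TierLimits(top_k=10, rerank_n=3, max_tokens=500,
                           daily_limit=50, rate_limit_per_min=10),
    "standard": TierLimits(top_k=20, rerank_n=5, max_tokens=2000,
                           daily_limit=500, rate_limit_per_min=30),
    "premium":  TierLimits(top_k=30, rerank_n=8, max_tokens=4000,
                           daily_limit=None, rate_limit_per_min=100),
}

class TierConfig:
    @classmethod
    def get_top_k(cls, tier: str) -> int:
        return TIER_LIMITS[tier].top_k

    @classmethod
    def get_rerank_n(cls, tier: str) -> int:
        return TIER_LIMITS[tier].rerank_n
```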
Usage in Retrieval:
```python
from app.services.tier_config import TierConfig

# Get user tier from wallet
user_tier = wallet["subscription_tier"]  # 'free', 'standard', or 'premium'

# Enforce tier limits
top_k = TierConfig.get_top_k(user_tier)
rerank_n = TierConfig.get_rerank_n(user_tier)

# Query Pinecone with the tier-specific top-K
results = pinecone_adapter.query(query_vector, top_k=top_k, ...)

# Rerank with the tier-specific rerank-N
reranked = gpt_mini_service.rerank(query, results, top_n=rerank_n)
```
Cost Estimation:
```python
# Estimate cost before reserving funds
estimated = TierConfig.calculate_estimated_cost(
    tier=user_tier,
    query_length=len(query),
    context_chunks=None,  # None falls back to the tier default
)

# Reject the request if it would exceed the tier's budget
if not TierConfig.can_afford_request(user_tier, estimated):
    return {"error": "request_exceeds_tier_limit"}
```
Cache Statistics
The cache service exposes a stats snapshot for monitoring:
```python
stats = cache_service.get_stats()
# {
#     "rerank_cache": {"size": 123, "max_size": 500, "ttl": 900},
#     "chunk_cache": {"size": 2456, "max_size": 5000, "ttl": 3600}
# }
```
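A get_stats consistent with that payload only needs to read each cache's counters; the sketch below relies on the hypothetical LRUCache internals from earlier:
```python
class CacheService:  # stats portion of the sketch
    def get_stats(self) -> dict:
        """Report occupancy and configured limits for both caches."""
        return {
            "rerank_cache": {"size": len(self.rerank_cache._store),
                             "max_size": self.rerank_cache.max_size,
                             "ttl": self.rerank_cache.ttl},
            "chunk_cache": {"size": len(self.chunk_cache._store),
                            "max_size": self.chunk_cache.max_size,
                            "ttl": self.chunk_cache.ttl},
        }
```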
Cache Invalidation Strategy
On Re-Ingestion
When a file is re-ingested:
```python
# Invalidate rerank cache entries for the namespace
cache_service.invalidate_rerank_namespace(namespace)

# Invalidate chunk text cache entries for the file
cache_service.invalidate_chunks_by_file(file_id)
```
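Because entries are keyed by opaque hashes, namespace- and file-level invalidation needs a secondary index. One plausible approach (an assumption, the notes don't show the actual mechanism) is to track which keys belong to which namespace or file:
```python
from collections import defaultdict

class CacheService:  # invalidation portion of the sketch
    def __init__(self):
        self.rerank_cache = LRUCache(max_size=500, ttl_seconds=900)
        self.chunk_cache = LRUCache(max_size=5000, ttl_seconds=3600)
        self._namespace_keys = defaultdict(set)  # namespace -> rerank cache keys
        self._file_chunks = defaultdict(set)     # file_id -> chunk ids

    def set_rerank_result(self, query, namespace, tier, result) -> None:
        key = rerank_cache_key(query, namespace, tier)
        self._namespace_keys[namespace].add(key)
        self.rerank_cache.set(key, result)

    def invalidate_rerank_namespace(self, namespace: str) -> None:
        for key in self._namespace_keys.pop(namespace, set()):
            self.rerank_cache.delete(key)

    def invalidate_chunks_by_file(self, file_id: str) -> None:
        # Assumes _file_chunks was populated when chunks were cached
        # (e.g., via an optional file_id argument to set_chunk_text).
        for chunk_id in self._file_chunks.pop(file_id, set()):
            self.chunk_cache.delete(chunk_id)
```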
Manual Cache Clear
Admin endpoint to clear caches:
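The route itself isn't included in these notes; assuming a FastAPI app with an admin auth dependency (`require_admin` here is hypothetical), it might look like:
```python
from fastapi import APIRouter, Depends

router = APIRouter(prefix="/admin")

@router.post("/cache/clear")
def clear_caches(_admin=Depends(require_admin)):  # require_admin: assumed auth dep
    # Drop all entries from both in-memory caches
    cache_service.rerank_cache.clear()
    cache_service.chunk_cache.clear()
    return {"status": "cleared"}
```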
Production Considerations
In-Memory Caching Limitations
The current implementation uses in-memory LRU caches, which:
- ✅ Work for single-instance deployments
- ❌ Do NOT work for multi-instance deployments (no shared cache)
Migrating to Redis (Production)
For multi-instance deployments, migrate to Redis:
```python
from typing import Optional

import redis

class RedisCache:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)

    def get(self, key: str) -> Optional[str]:
        value = self.client.get(key)
        return value.decode("utf-8") if value else None

    def set(self, key: str, value: str, ttl: int) -> None:
        # SETEX stores the value with an expiry in seconds
        self.client.setex(key, ttl, value)
```
Configuration:
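The notes stop at this heading; presumably Redis would be wired in via environment variables along these lines (the variable names and values are illustrative, not confirmed):
```bash
# Hypothetical - the actual variable names aren't in these notes
REDIS_URL=redis://localhost:6379/0
CACHE_BACKEND=redis   # vs. "memory" for the current in-process caches
```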
Testing Checklist
Rerank Caching (S10)
- [ ] First query executes GPT-mini rerank
- [ ] Second identical query within 15 min returns cached result (no GPT-mini call)
- [ ] Query after 15 min TTL executes GPT-mini again
- [ ] Re-ingestion of namespace invalidates rerank cache
Chunk Caching (S11)
- [ ] First retrieval fetches chunk text from Postgres
- [ ] Second retrieval for same chunk returns cached text (no Postgres query)
- [ ] Cache evicts least-recently-used entries when max size reached
- [ ] Re-ingestion of file invalidates chunk cache
Tier Limits (S12)
- [ ] Free tier retrieves top-10 chunks
- [ ] Standard tier retrieves top-20 chunks
- [ ] Premium tier retrieves top-30 chunks
- [ ] Request exceeding tier token limit returns error
- [ ] Cost estimation calculates correctly for each tier
Performance Metrics
Expected improvements:
| Metric | Before (No Cache) | After (With Cache) | Improvement |
|---|---|---|---|
| Rerank latency (cache hit) | 2-5 seconds | < 10 ms | 99% faster |
| Chunk fetch latency (cache hit) | 50-200 ms | < 1 ms | 99% faster |
| Postgres queries per request | 10-30 | 0-30 (depends on cache hits) | Up to 100% reduction |
| GPT-mini API calls (duplicate queries) | 100% | 10-20% (depends on cache hit rate) | 80-90% reduction |
Configuration
Environment Variables
Add to .env:
```bash
# Cache TTLs
CACHE_RERANK_TTL_SECONDS=900    # 15 minutes
CACHE_CHUNK_TTL_SECONDS=3600    # 1 hour

# Tier limits (defaults already live in tier_config.py)
TIER_FREE_TOP_K=10
TIER_FREE_RERANK_N=3
TIER_STANDARD_TOP_K=20
TIER_STANDARD_RERANK_N=5
TIER_PREMIUM_TOP_K=30
TIER_PREMIUM_RERANK_N=8
```
Next Phase: Phase D - Retrieval Pipeline
Once Phase C tests pass:
- S20: GPT-mini service (reranker, language detection, validation)
- Retrieval → Rerank → Reasoning pipeline integration
Files Changed
New Files
- `app/services/cache.py` - LRU cache with TTL support (S10-S11)
- `app/services/tier_config.py` - Tier limits and cost estimation (S12)
Status: ✅ Phase C Complete - Ready for Testing
See SONNET_RUN.md for the full implementation log.