BacMR Backend Architecture Plan
Critical Architectural Issues Found (Summary)
- **No idempotent ingestion** — `ingestion_vector_embedding.md` uses character-based chunking (1000 chars) instead of token-based, has no deterministic chunk IDs, and stores full raw text in Pinecone metadata (violating the canonical chunk-store pattern and risking Pinecone's 40 KB metadata limit).
- **No reservation billing** — `wallet_billing.md` describes a simple deduct-after-response model with no reserve-then-finalize pattern; a crash between LLM invocation and deduction loses revenue, and there is no atomicity guarantee.
- **RLS plan is incomplete** — `rls_plan.md` covers only Phase A/B but omits new tables (`ingestion_jobs`, `reservations`, `embedding_refs`, `ingestion_audit`), does not specify admin-role extraction from JWT custom claims, and has no migration rollback procedure.
- **Auth still relies on deprecated `x-admin-key`** — `auth_roles_admin.md` says "Transition Phase" but provides no concrete migration checklist, no service-token strategy, and no plan for JWT custom claims via Postgres hooks.
- **Scraping has no dedupe, canonicalization, or content-quality gates** — `scraping_platform.md` trusts whatever the scraper finds: no SimHash/fingerprinting, no Arabic-script normalization, no OCR confidence thresholds, no provenance chain.
- **Chat agent doc has no cost control, no caching, no circuit breaker** — `chat_agent.md` describes only the happy path: no cached-rerank TTL, no fallback when OpenAI is down, no observability hooks, no Hassaniya support.
- **README.md is stale** — still shows deprecated `x-admin-key` curl examples and makes no mention of the GPT-mini validator, reservation billing, or observability; needs updating to reflect the new architecture.
- **No request-ID propagation** — no correlation key across logs, wallet, LLM calls, Pinecone queries, and audit tables. Debugging an LLM failure that spans multiple subsystems is effectively impossible without this.
- **No API-level rate limiting** — students can spam refresh, open multiple tabs, and trigger parallel queries. Without per-user rate limits, a single user can exhaust the platform's OpenAI quota and degrade service for everyone.
Change Log (Diff-Style)
File 1: docs/30_design/ingestion_vector_embedding.md
Change 1.1 — Switch from character-based to token-based chunking
--- original (lines 13-15)
2. **Chunking:**
- **Size:** 1000 characters.
- **Overlap:** 150 characters (ensures context isn't lost at boundaries).
+++ proposed
2. **Chunking (Token-Based):**
- **Strategy:** Token-based chunking using `tiktoken` (model `cl100k_base`).
- **Size:** 512 tokens (≈2,000 characters for Latin scripts; fewer for Arabic due to tokenizer expansion).
- **Overlap:** 64 tokens.
- **Language-specific rules:**
- **French:** Standard 512-token chunks.
- **Arabic / Hassaniya:** 384-token chunks (Arabic tokenizes at ~1.5× expansion); overlap 48 tokens.
- **Deterministic Chunk ID:** `sha256(file_id + ":" + page_number + ":" + chunk_index)` — ensures idempotent re-ingestion.
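The deterministic-ID and windowing rules above can be sketched in Python. This is a minimal illustration, not the production splitter: it assumes the token list has already been produced by tiktoken's `cl100k_base` encoder, which is not imported here.

```python
import hashlib

def chunk_id(file_id: str, page: int, chunk_index: int) -> str:
    # Deterministic ID: the same (file, page, chunk) always yields the same
    # vector ID, so re-ingestion becomes an idempotent Pinecone upsert.
    return hashlib.sha256(f"{file_id}:{page}:{chunk_index}".encode()).hexdigest()

def window_chunks(tokens: list, size: int = 512, overlap: int = 64) -> list:
    # Fixed-size sliding window with overlap over an already-tokenized sequence
    # (production code would obtain `tokens` from tiktoken's cl100k_base encoder).
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

For Arabic/Hassaniya the same function is called with `size=384, overlap=48`.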
Change 1.2 — Do not store raw text in Pinecone metadata
--- original (line 24)
- Stores the raw text in the vector metadata for fast RAG retrieval (avoids a second DB hop).
+++ proposed
- **Canonical chunk store:** Full chunk text is stored in `chunks` table in Postgres (keyed by `chunk_id`). Pinecone metadata stores only lightweight filter fields (`chunk_id`, `file_id`, `language`, `grade`, `subject`, `source_url`, `page_number`, `ingestion_ts`). At retrieval time the backend fetches chunk text from Postgres (with Redis/in-memory cache) using the `chunk_id` returned by Pinecone.
- **Rationale:** Avoids Pinecone's 40 KB metadata limit; keeps Postgres as source of truth; enables full-text search fallback.
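A minimal sketch of the cache-through lookup this implies, with an in-memory dict standing in for the Postgres `chunks` table (the real implementation would run a SELECT, and the cache layer would add the TTL described in the chat-agent changes; `lru_cache` here has no TTL):

```python
from functools import lru_cache

# Stand-in for the Postgres `chunks` table (chunk_id -> full text).
CHUNK_STORE: dict = {}

@lru_cache(maxsize=10_000)
def chunk_text(chunk_id: str) -> str:
    # Cache-through read: Pinecone returns only chunk_ids; text comes from Postgres.
    return CHUNK_STORE[chunk_id]
```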
Change 1.3 — Add ingestion job lifecycle
--- original (lines 27-35, "Reference-Driven Ingestion" section)
[entire section]
+++ proposed
## Ingestion Job Lifecycle
Ingestion is managed via the `ingestion_jobs` table with deterministic state transitions:
| State | Meaning |
|:---|:---|
| `queued` | Job created, awaiting parser worker |
| `parsing` | PDF downloaded, text extraction in progress |
| `tokenizing` | Chunks being created with token-based splitter |
| `embedding_request_sent` | OpenAI embedding API call dispatched |
| `embedding_upserted` | Vectors upserted to Pinecone |
| `ready` | Fully searchable; `chunks` rows committed |
| `failed` | Terminal failure after max retries (3) |
### Retry & Idempotency
- Each ingestion job uses deterministic vector IDs (`sha256(file_id:page:chunk_index)`), so re-running an ingestion for the same file is a safe upsert (no duplicates).
- On transient failure the job returns to the previous state and retries (max 3 attempts with exponential backoff).
- `ingestion_audit` table records every state transition with timestamp and error message.
### Reference-Driven Trigger
1. **Discovery**: Scrapers populate the `references` table (status: `discovered`).
2. **Trigger**: Admin calls `POST /ingestion/jobs` with `reference_id`.
3. **Execution**: Job moves through states above.
4. **Sync-back**: On `ready`, the `references.status` is set to `ready` and vector count synced.
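The lifecycle table above amounts to a small state machine. A sketch of the forward transitions (retry steps back to the previous state are not modeled here):

```python
# Allowed forward transitions for ingestion_jobs, from the lifecycle table.
TRANSITIONS = {
    "queued": {"parsing", "failed"},
    "parsing": {"tokenizing", "failed"},
    "tokenizing": {"embedding_request_sent", "failed"},
    "embedding_request_sent": {"embedding_upserted", "failed"},
    "embedding_upserted": {"ready", "failed"},
    "ready": set(),    # terminal
    "failed": set(),   # terminal after max retries
}

def advance(current: str, nxt: str) -> str:
    # Reject illegal transitions; the caller writes the ingestion_audit row.
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```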
Change 1.4 — Update Key Constants table
--- original (lines 37-43)
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 1000 chars |
| Chunk Overlap | 150 chars |
| Model | `text-embedding-3-small` |
| Index | `curriculum-1536` |
+++ proposed
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 512 tokens (French), 384 tokens (Arabic/Hassaniya) |
| Chunk Overlap | 64 tokens (French), 48 tokens (Arabic/Hassaniya) |
| Tokenizer | `tiktoken` / `cl100k_base` |
| Embedding Model | `text-embedding-3-small` (1536-dim) |
| Pinecone Index | `curriculum-1536` |
| Chunk ID Format | `sha256(file_id + ":" + page + ":" + chunk_index)` |
| Max retries | 3 (exponential backoff) |
File 2: docs/30_design/wallet_billing.md
Change 2.1 — Add reservation-based billing
--- original (lines 24-26)
## Implementation Details (`wallet.py`)
- **Balance Pre-check:** Done in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.
+++ proposed
## Reservation-Based Billing
### Why Reservations?
A simple "check-then-deduct" approach is not atomic — if the LLM call succeeds but the deduction fails (crash, timeout), the platform loses revenue. Conversely, if we deduct first and the LLM fails, the user loses tokens unfairly.
### Flow
1. **Reserve**: `POST /wallet/reserve` — atomically creates a `reservations` row and decrements `wallet.token_balance` by the *estimated* cost. Returns `reservation_id`.
2. **LLM Call**: The agent performs retrieval + reasoning.
3. **Finalize**: `POST /wallet/finalize` — compares estimated vs actual usage. If actual < estimated, refunds the delta; if actual > estimated (rare, capped at 2× estimate), deducts the overage. Marks reservation `finalized`.
4. **Expiry / Rollback**: A background job releases un-finalized reservations older than 5 minutes (refunds the reserved amount).
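The finalize arithmetic in steps 3–4 can be expressed as a pure function (a sketch; `finalize_delta` is a name chosen here, not an existing symbol in `wallet.py`):

```python
def finalize_delta(estimated: int, actual: int) -> int:
    # Signed wallet adjustment at finalize time: positive = refund the user,
    # negative = deduct overage. Overage is capped at 2x the estimate (step 3).
    actual = min(actual, 2 * estimated)
    return estimated - actual
```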
### DB Transaction (Pseudocode)
```sql
BEGIN;
-- Step 1: Reserve
UPDATE wallet SET token_balance = token_balance - :estimated
WHERE user_id = :uid AND token_balance >= :estimated;
-- application must verify exactly 1 row was updated; 0 rows means
-- insufficient balance -> ROLLBACK (no partial deductions)
INSERT INTO reservations (user_id, estimated, status, created_at)
VALUES (:uid, :estimated, 'reserved', now())
RETURNING id;
COMMIT;
-- After LLM call completes:
BEGIN;
UPDATE reservations SET actual = :actual, status = 'finalized', finalized_at = now()
WHERE id = :reservation_id AND status = 'reserved';
-- refund delta if actual < estimated
UPDATE wallet SET token_balance = token_balance + (:estimated - :actual)
WHERE user_id = :uid AND :actual < :estimated;
INSERT INTO wallet_ledger (user_id, delta, reason, request_id)
VALUES (:uid, -:actual, 'agent_chat', :request_id);
COMMIT;
```
### Implementation Details (`wallet.py`)
- **Balance Pre-check & Reserve:** Combined in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.
- **Reconciliation:** A nightly job compares `SUM(wallet_ledger.delta)` with `wallet.token_balance` per user and flags discrepancies.
File 3: rls_plan.md
Change 3.1 — Complete Phase B and add JWT custom-claims migration
--- original (lines 21-32, entire Phase B + Timeline)
Phase B: Admin Tables (Priority 2)
These tables should be restricted to users with the admin role.
references, scrape_runs
- Policy: `service_role` can do everything. `admin` can read/write.
- Check: `(auth.jwt() ->> 'role') = 'admin'`
Implementation Timeline
- Drafting: Feb 20, 2026.
- Testing: Feb 22, 2026.
- Enforcement: Feb 25, 2026.
+++ proposed
Phase B: Admin & System Tables (Priority 2)
references, scrape_runs
- Policy: `service_role` full access. Authenticated users with `admin` role can SELECT/INSERT/UPDATE.
- Check: `(auth.jwt() -> 'app_metadata' ->> 'role') = 'admin'`
ingestion_jobs, ingestion_audit, embedding_refs
- Policy: Service-role only. No direct user access.
- SQL Template:
reservations
- Policy: Users can SELECT their own reservations. INSERT/UPDATE via service_role only.
- SQL:
Phase C: JWT Custom Claims Migration
Goal
Move role from user_metadata to JWT custom claims via a Supabase Postgres hook, eliminating DB lookups for role checks.
Migration Checklist
1. Create Postgres function `custom_access_token_hook(event jsonb)` that injects `app_metadata.role` into the JWT claims.
2. Register the hook in Supabase Dashboard → Authentication → Hooks → Customize Access Token.
3. Update all RLS policies to read from `auth.jwt() -> 'app_metadata' ->> 'role'` instead of `auth.jwt() ->> 'role'`.
4. Update FastAPI `get_current_admin` to read `app_metadata.role` from the decoded JWT.
5. Deploy and test with a canary user.
6. Remove `x-admin-key` header support from all endpoints.
- Rollback: If the claims hook fails, revert the hook registration in the Dashboard; existing `user_metadata`-based checks remain functional.
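The claim-reading half of step 4 can be sketched as plain functions (names other than `get_current_admin` are illustrative; the real dependency would raise `HTTPException(403)` rather than `PermissionError`):

```python
def role_from_claims(claims: dict):
    # Read the role injected into app_metadata by the custom access-token hook.
    return (claims.get("app_metadata") or {}).get("role")

def require_admin(claims: dict) -> None:
    # Core check behind the FastAPI get_current_admin dependency.
    if role_from_claims(claims) != "admin":
        raise PermissionError("admin role required")
```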
Implementation Timeline
- Phase B enforcement: Feb 20, 2026.
- Phase C (JWT claims migration): Feb 25, 2026.
- `x-admin-key` removal: Mar 1, 2026.
File 4: auth_roles_admin.md
Change 4.1 — Replace deprecated X-Admin-Key with JWT custom claims and service tokens
--- original (lines 8-19)
Admin Authorization (Updated 2026-02-16)
We are moving toward JWT + role as the primary source of truth for admin authorization. The static X-Admin-Key is now a secondary bootstrap mechanism and will be phased out.
Primary: Supabase JWT Role
- Implementation: `get_current_admin` dependency.
- Check: Validates Supabase JWT and ensures `user_metadata.role == 'admin'`.
- Status: Production default.
Secondary: X-Admin-Key
- Purpose: Internal scripts or bootstrap if Supabase Auth is unavailable.
- Header: `x-admin-key`.
- Status: Deprecated (Transition Phase).
+++ proposed
Admin Authorization (Updated 2026-02-17)
Primary: Supabase JWT with Custom Claims
- Implementation: `get_current_admin` FastAPI dependency.
- Check: Validates Supabase JWT → reads `app_metadata.role` from JWT claims (injected by the Postgres hook). Requires `role == 'admin'`.
- Status: Production default.
Service Tokens (for backend-to-backend)
- Purpose: Backend workers (ingestion, scraper cron, reconciliation) that run without a user session.
- Mechanism: Use `SUPABASE_SERVICE_KEY` stored in a secret manager (Vault / GCP Secret Manager / AWS Secrets Manager). Never stored in a plain `.env` in production.
- Status: Active.
Deprecated: X-Admin-Key
- Removal date: Mar 1, 2026.
- Migration: All scripts using `x-admin-key` must switch to either an admin JWT or the service-role key.
Change 4.2 — Roles model with custom-claims injection
--- original (lines 21-25)
Roles Model
Roles are stored in the user's `app_metadata` or `user_metadata` in Supabase.
- student: Default role. Can chat and view their own wallet.
- teacher: (Future) Can view analytics for their classes.
- admin: Can trigger scrapers, upload curriculum, manage the vector index, and manage users.
+++ proposed
Roles Model
Roles are stored in `profiles.role` (Postgres, source of truth) and injected into JWT `app_metadata.role` via a Postgres access-token hook.
| Role | Permissions |
|---|---|
| student | Chat, view own wallet/usage, view own profile |
| teacher | (Future) View class analytics, create quizzes |
| admin | Trigger scrapers, upload curriculum, manage vector index, manage users, view all wallets |
File 5: chat_agent.md
Change 5.1 — Multilingual support including Hassaniya
--- original (lines 22-27)
## Multilingual Support (Arabic/French)
To support students asking in Arabic when the curriculum is in French:
1. Translation Step: If the input language is detected as Arabic, the query is translated to French using the LLM before vector search.
2. Cross-Lingual Retrieval: Vector search is performed using the French query against the French corpus.
3. Cross-Lingual Reranking: The reranker is instructed to match the original question (and its translation) against snippets, accounting for language differences.
4. Arabic Response: The final answer is generated in Arabic using the French context.
+++ proposed
Multilingual Support (Arabic / French / Hassaniya)
Language Matrix
| Input Language | Corpus Language | Translation? | Response Language |
|---|---|---|---|
| French | French | No | French |
| Arabic (MSA) | French | Yes (Arabic→French before retrieval) | Arabic |
| Hassaniya | French | Yes (Hassaniya→French; treat as Arabic-script with cultural localization rules) | Hassaniya/Arabic |
| Arabic (MSA) | Arabic | No | Arabic |
Steps
- Language Detection: Detect input language (ISO 639-3). Hassaniya is classified as Arabic-script; a secondary classifier (GPT-mini validator) distinguishes MSA from Hassaniya dialect.
- Translation Step: If input language differs from corpus language, translate via LLM before vector search.
- Cross-Lingual Retrieval: Vector search uses the translated query.
- Reranking: Reranker sees both original and translated query.
- Response Generation: Answer is generated in the detected input language (Hassaniya users receive Hassaniya-flavored Arabic).
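The language matrix and steps above reduce to a small routing function. A sketch, assuming ISO 639-3 codes (`"mey"` for Hassaniya is an assumption of this sketch, as is the function name):

```python
def route_query(input_lang: str, corpus_lang: str) -> dict:
    # Translate only when input and corpus languages differ (per the matrix).
    translate = input_lang != corpus_lang
    return {
        "translate": translate,              # LLM translation before retrieval
        "search_lang": corpus_lang,          # retrieval always runs in corpus language
        "response_lang": input_lang,         # answer in the user's language
        "dialect_hint": input_lang == "mey", # Hassaniya-flavored Arabic generation
    }
```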
OCR Considerations
- Arabic/Hassaniya PDFs may require OCR (Tesseract with the `ara` model, or Google Vision API).
- OCR confidence threshold: ≥ 0.70 for Arabic, ≥ 0.80 for French. Below threshold → flag for manual review.
- Arabic text normalization: unify alef variants (أ إ آ → ا), remove tatweel (ـ), normalize taa marbuta.
Change 5.2 — Retrieval, rerank, caching, and circuit breaker
--- original (lines 17-20)
Retrieval Logic
- Stage 1 (Dense Search): Get top 20 matches from Pinecone.
- Stage 2 (Reranking): Use `gpt-4o-mini` to select the top 5 most relevant snippets from the 20 candidates. This significantly improves accuracy for complex pedagogical questions.
+++ proposed
Retrieval & Rerank Pipeline
Stage 1: Dense Retrieval
- Query Pinecone with top-K = `TIER_CAPS[tier].top_k` (Free: 10, Standard: 20, Premium: 30).
- Apply metadata prefilters: `language`, `grade`, `subject`.
Stage 2: Optional Lexical Prefilter
- For Arabic queries, apply a BM25-style keyword prefilter on the retrieved chunks to remove OCR noise.
Stage 3: Reranking
- Use `gpt-4o-mini` (the GPT-mini validator/reranker) to select the top `TIER_CAPS[tier].rerank_n` (Free: 3, Standard: 5, Premium: 8).
- GPT-mini service: Hosted via the OpenAI API (same key as the main models). SLA: same as the OpenAI API. Fallback: skip reranking and use dense-retrieval order if GPT-mini returns an error or latency exceeds 5 s.
Caching
- Rerank cache: Cache reranked results keyed by `sha256(query + namespace + tier)` with TTL = 15 min.
- Chunk text cache: Redis/in-memory LRU cache for `chunk_id → text` lookups (TTL = 1 hour).
- Both caches are invalidated on re-ingestion of the same file.
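A sketch of the tier caps and the rerank cache key (the dict shape of `TIER_CAPS` is an assumption; only its name and fields appear in the text above):

```python
import hashlib

# Assumed shape of the TIER_CAPS mapping referenced above.
TIER_CAPS = {
    "free":     {"top_k": 10, "rerank_n": 3},
    "standard": {"top_k": 20, "rerank_n": 5},
    "premium":  {"top_k": 30, "rerank_n": 8},
}

def rerank_cache_key(query: str, namespace: str, tier: str) -> str:
    # Mirrors the sha256(query + namespace + tier) key from the caching rules.
    return hashlib.sha256(f"{query}{namespace}{tier}".encode()).hexdigest()
```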
Circuit Breaker
- If OpenAI API returns 5xx or times out 3 times in 60 seconds:
- Embedding: Queue the job and retry later (ingestion).
- Reranking: Skip rerank, return dense-retrieval order.
- Reasoning (chat model): Return user-facing error "Service temporarily unavailable, please retry."
- Circuit breaker resets after 120 seconds of no failures.
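A minimal sketch of that trip/reset policy (3 failures in 60 s opens the breaker, a 120 s quiet period resets it); the class name and injected `now` timestamps are illustrative, not existing code:

```python
class CircuitBreaker:
    """Trip after `max_failures` failures within `window` seconds; reset after `reset_after`."""
    def __init__(self, max_failures=3, window=60.0, reset_after=120.0):
        self.max_failures, self.window, self.reset_after = max_failures, window, reset_after
        self.failures = []     # timestamps of recent failures
        self.opened_at = None

    def record_failure(self, now):
        # Keep only failures inside the sliding window, then check the trip condition.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.opened_at = now            # open: callers take the fallback path

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at < self.reset_after:
            return False                    # still open: skip rerank / queue the job
        self.opened_at, self.failures = None, []   # quiet period elapsed: reset
        return True
```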
File 6: scraping_platform.md
Change 6.1 — Dedupe, canonicalization, provenance, and quality gates
--- original (lines 30-35)
Admin Workflow
- Call `POST /scraping/koutoubi/sync`.
- The system fetches the sitemap.
- It creates/updates entries in the `references` table.
- Admin reviews `GET /scraping/koutoubi/references` to see what is ready for vectorization.
+++ proposed
Content Processing Pipeline (Automated)
Deduplication
- Method: SimHash (64-bit) computed on normalized text of each discovered PDF.
- Storage: `references.content_fingerprint` (BIGINT).
- Rule: If a new PDF has a SimHash Hamming distance ≤ 3 from an existing reference, mark it as `duplicate` and link to the canonical reference via `references.canonical_id`.
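The fingerprint-and-compare step can be sketched as follows. This is an unweighted SimHash (production would weight tokens by frequency), and the function names are illustrative:

```python
import hashlib

def simhash64(tokens):
    # Classic 64-bit SimHash: sum per-bit votes from each token's hash.
    v = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)

def is_duplicate(fp, known_fps, max_distance=3):
    # Rule above: Hamming distance <= 3 between fingerprints marks a duplicate.
    return any(bin(fp ^ other).count("1") <= max_distance for other in known_fps)
```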
Canonicalization
- Normalize whitespace (collapse multiple spaces/newlines).
- Arabic script normalization: unify alef forms, remove tatweel, normalize hamza.
- Remove boilerplate headers/footers (regex patterns per source, configurable in `scraper_config`).
- Strip non-content pages (TOC, blank pages) based on a character-count threshold (< 50 chars after normalization → skip).
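A sketch of the Arabic normalization rules listed above (mapping taa marbuta to haa is one common convention and an assumption here, since the target form is not specified):

```python
import re

def normalize_arabic(text: str) -> str:
    # Unify alef variants (hamza above/below, madda, wasla) to bare alef.
    text = re.sub("[\u0623\u0625\u0622\u0671]", "\u0627", text)
    # Remove tatweel (kashida).
    text = text.replace("\u0640", "")
    # Normalize taa marbuta to haa (lossy convention -- an assumption).
    text = text.replace("\u0629", "\u0647")
    # Collapse whitespace (canonicalization step above).
    return re.sub(r"\s+", " ", text).strip()
```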
Provenance Metadata
Each `references` row stores:
- source_url (canonical URL after redirect resolution)
- discovered_at (timestamp of first discovery)
- last_checked_at (last time scraper verified the URL is still live)
- content_fingerprint (SimHash)
- scrape_run_id (FK to scrape_runs)
- canonical_id (self-FK for deduplication)
Content Quality Heuristics
| Check | Threshold | Action |
|---|---|---|
| Minimum text length | 200 chars (after normalization) | Skip page/chunk |
| OCR confidence (Arabic) | ≥ 0.70 | Flag for review if below |
| OCR confidence (French) | ≥ 0.80 | Flag for review if below |
| Encoding detection | Must be valid UTF-8 | Reject and log |
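The table above can be applied in one pass; a sketch taking raw bytes plus an OCR confidence score (the function name and return labels are illustrative):

```python
def quality_gate(raw: bytes, ocr_conf: float, lang: str) -> str:
    """Apply the heuristics table; returns 'reject' | 'skip' | 'review' | 'ok'.
    `lang` is an ISO 639-3 code ('ara' or 'fra' here)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "reject"                 # invalid encoding: reject and log
    if len(text.strip()) < 200:
        return "skip"                   # below minimum text length
    min_conf = 0.70 if lang == "ara" else 0.80
    if ocr_conf < min_conf:
        return "review"                 # OCR confidence below threshold
    return "ok"
```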
Admin Workflow
- Call `POST /scraping/{source}/sync`.
- System fetches the sitemap and applies dedupe and canonicalization automatically.
- Quality-failed items are logged in `ingestion_audit` with a reason.
- Admin reviews `GET /scraping/{source}/references` — no manual approval needed for source permission (all sources pre-approved).
File 7: README.md
Change 7.1 — Replace deprecated x-admin-key curl example
--- original (lines 56-58)
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "x-admin-key: YOUR_ADMIN_API_KEY"
+++ proposed
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "Authorization: Bearer YOUR_ADMIN_JWT"
Change 7.2 — Add Architecture Highlights section
--- original (lines 17-18, after Tech Stack section)
+++ proposed
Architecture Highlights
- Reservation Billing: Token costs are reserved before LLM calls and finalized after, ensuring atomicity.
- Idempotent Ingestion: Deterministic chunk IDs (`sha256(file_id:page:chunk_index)`) make re-ingestion safe.
- GPT-mini Validator: A lightweight `gpt-4o-mini` microservice handles reranking, language detection, and input validation. Hosted on the OpenAI API, with a fallback to skip reranking on timeout.
- Canonical Chunk Store: Full text lives in the Postgres `chunks` table; Pinecone stores only vectors + lightweight metadata.
- Observability: Structured logging, circuit breakers on all LLM/Pinecone calls, and wallet reconciliation jobs.