BacMR Backend Architecture Plan
Critical Architectural Issues Found (Summary)
- **No idempotent ingestion** — `ingestion_vector_embedding.md` uses character-based chunking (1000 chars) instead of token-based, has no deterministic chunk IDs, and stores full raw text in Pinecone metadata (violating the canonical chunk-store pattern and risking Pinecone's 40 KB metadata limit).
- **No reservation billing** — `wallet_billing.md` describes a simple deduct-after-response model with no reserve-then-finalize pattern; a crash between LLM invocation and deduction loses revenue, and there is no atomicity guarantee.
- **RLS plan is incomplete** — `rls_plan.md` covers only Phase A/B but omits new tables (`ingestion_jobs`, `reservations`, `embedding_refs`, `ingestion_audit`), does not specify admin-role extraction from JWT custom claims, and has no migration rollback procedure.
- **Auth still relies on deprecated `x-admin-key`** — `auth_roles_admin.md` says "Transition Phase" but provides no concrete migration checklist, no service-token strategy, and no plan for JWT custom claims via Postgres hooks.
- **Scraping has no dedupe, canonicalization, or content-quality gates** — `scraping_platform.md` trusts whatever the scraper finds: no SimHash/fingerprinting, no Arabic-script normalization, no OCR confidence thresholds, no provenance chain.
- **Chat agent doc has no cost control, no caching, no circuit breaker** — `chat_agent.md` describes only the happy path: no cached-rerank TTL, no fallback when OpenAI is down, no observability hooks, no Hassaniya support.
- **README.md is stale** — still shows deprecated `x-admin-key` curl examples and makes no mention of the GPT-mini validator, reservation billing, or observability; needs updating to reflect the new architecture.
- **No request-ID propagation** — no correlation key across logs, wallet, LLM calls, Pinecone queries, and audit tables. Debugging an LLM failure that spans multiple subsystems is effectively impossible without this.
- **No API-level rate limiting** — students can spam refresh, open multiple tabs, and trigger parallel queries. Without per-user rate limits, a single user can exhaust the platform's OpenAI quota and degrade service for everyone.
Change Log (Diff-Style)
File 1: docs/30_design/ingestion_vector_embedding.md
Change 1.1 — Switch from character-based to token-based chunking
--- original (lines 13-15)
2. **Chunking:**
- **Size:** 1000 characters.
- **Overlap:** 150 characters (ensures context isn't lost at boundaries).
+++ proposed
2. **Chunking (Token-Based):**
- **Strategy:** Token-based chunking using `tiktoken` (model `cl100k_base`).
- **Size:** 512 tokens (≈2,000 characters for Latin scripts; fewer for Arabic due to tokenizer expansion).
- **Overlap:** 64 tokens.
- **Language-specific rules:**
- **French:** Standard 512-token chunks.
- **Arabic / Hassaniya:** 384-token chunks (Arabic tokenizes at ~1.5× expansion); overlap 48 tokens.
- **Deterministic Chunk ID:** `sha256(file_id + ":" + page_number + ":" + chunk_index)` — ensures idempotent re-ingestion.
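The deterministic-ID and windowing rules above can be sketched in Python. This is a minimal illustration, not the production splitter: it assumes the token list has already been produced by tiktoken's `cl100k_base` encoder, which is not imported here.

```python
import hashlib

def chunk_id(file_id: str, page: int, chunk_index: int) -> str:
    # Deterministic ID: the same (file, page, chunk) always yields the same
    # vector ID, so re-ingestion becomes an idempotent Pinecone upsert.
    return hashlib.sha256(f"{file_id}:{page}:{chunk_index}".encode()).hexdigest()

def window_chunks(tokens: list, size: int = 512, overlap: int = 64) -> list:
    # Fixed-size sliding window with overlap over an already-tokenized sequence
    # (production code would obtain `tokens` from tiktoken's cl100k_base encoder).
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

For Arabic/Hassaniya the same function is called with `size=384, overlap=48`.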
Change 1.2 — Do not store raw text in Pinecone metadata
--- original (line 24)
- Stores the raw text in the vector metadata for fast RAG retrieval (avoids a second DB hop).
+++ proposed
- **Canonical chunk store:** Full chunk text is stored in `chunks` table in Postgres (keyed by `chunk_id`). Pinecone metadata stores only lightweight filter fields (`chunk_id`, `file_id`, `language`, `grade`, `subject`, `source_url`, `page_number`, `ingestion_ts`). At retrieval time the backend fetches chunk text from Postgres (with Redis/in-memory cache) using the `chunk_id` returned by Pinecone.
- **Rationale:** Avoids Pinecone's 40 KB metadata limit; keeps Postgres as source of truth; enables full-text search fallback.
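A minimal sketch of the cache-through lookup this implies, with an in-memory dict standing in for the Postgres `chunks` table (the real implementation would run a SELECT, and the cache layer would add the TTL described in the chat-agent changes; `lru_cache` here has no TTL):

```python
from functools import lru_cache

# Stand-in for the Postgres `chunks` table (chunk_id -> full text).
CHUNK_STORE: dict = {}

@lru_cache(maxsize=10_000)
def chunk_text(chunk_id: str) -> str:
    # Cache-through read: Pinecone returns only chunk_ids; text comes from Postgres.
    return CHUNK_STORE[chunk_id]
```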
Change 1.3 — Add ingestion job lifecycle
--- original (lines 27-35, "Reference-Driven Ingestion" section)
[entire section]
+++ proposed
## Ingestion Job Lifecycle
Ingestion is managed via the `ingestion_jobs` table with deterministic state transitions:
| State | Meaning |
|:---|:---|
| `queued` | Job created, awaiting parser worker |
| `parsing` | PDF downloaded, text extraction in progress |
| `tokenizing` | Chunks being created with token-based splitter |
| `embedding_request_sent` | OpenAI embedding API call dispatched |
| `embedding_upserted` | Vectors upserted to Pinecone |
| `ready` | Fully searchable; `chunks` rows committed |
| `failed` | Terminal failure after max retries (3) |
### Retry & Idempotency
- Each ingestion job uses deterministic vector IDs (`sha256(file_id:page:chunk_index)`), so re-running an ingestion for the same file is a safe upsert (no duplicates).
- On transient failure the job returns to the previous state and retries (max 3 attempts with exponential backoff).
- `ingestion_audit` table records every state transition with timestamp and error message.
### Reference-Driven Trigger
1. **Discovery**: Scrapers populate the `references` table (status: `discovered`).
2. **Trigger**: Admin calls `POST /ingestion/jobs` with `reference_id`.
3. **Execution**: Job moves through states above.
4. **Sync-back**: On `ready`, the `references.status` is set to `ready` and vector count synced.
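The lifecycle table above amounts to a small state machine. A sketch of the forward transitions (retry steps back to the previous state are not modeled here):

```python
# Allowed forward transitions for ingestion_jobs, from the lifecycle table.
TRANSITIONS = {
    "queued": {"parsing", "failed"},
    "parsing": {"tokenizing", "failed"},
    "tokenizing": {"embedding_request_sent", "failed"},
    "embedding_request_sent": {"embedding_upserted", "failed"},
    "embedding_upserted": {"ready", "failed"},
    "ready": set(),    # terminal
    "failed": set(),   # terminal after max retries
}

def advance(current: str, nxt: str) -> str:
    # Reject illegal transitions; the caller writes the ingestion_audit row.
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```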
Change 1.4 — Update Key Constants table
--- original (lines 37-43)
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 1000 chars |
| Chunk Overlap | 150 chars |
| Model | `text-embedding-3-small` |
| Index | `curriculum-1536` |
+++ proposed
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 512 tokens (French), 384 tokens (Arabic/Hassaniya) |
| Chunk Overlap | 64 tokens (French), 48 tokens (Arabic/Hassaniya) |
| Tokenizer | `tiktoken` / `cl100k_base` |
| Embedding Model | `text-embedding-3-small` (1536-dim) |
| Pinecone Index | `curriculum-1536` |
| Chunk ID Format | `sha256(file_id + ":" + page + ":" + chunk_index)` |
| Max retries | 3 (exponential backoff) |
File 2: docs/30_design/wallet_billing.md
Change 2.1 — Add reservation-based billing
--- original (lines 24-26)
## Implementation Details (`wallet.py`)
- **Balance Pre-check:** Done in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.
+++ proposed
## Reservation-Based Billing
### Why Reservations?
A simple "check-then-deduct" approach is not atomic — if the LLM call succeeds but the deduction fails (crash, timeout), the platform loses revenue. Conversely, if we deduct first and the LLM fails, the user loses tokens unfairly.
### Flow
1. **Reserve**: `POST /wallet/reserve` — atomically creates a `reservations` row and decrements `wallet.token_balance` by the *estimated* cost. Returns `reservation_id`.
2. **LLM Call**: The agent performs retrieval + reasoning.
3. **Finalize**: `POST /wallet/finalize` — compares estimated vs actual usage. If actual < estimated, refunds the delta; if actual > estimated (rare, capped at 2× estimate), deducts the overage. Marks reservation `finalized`.
4. **Expiry / Rollback**: A background job releases un-finalized reservations older than 5 minutes (refunds the reserved amount).
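The finalize arithmetic in steps 3–4 can be expressed as a pure function (a sketch; `finalize_delta` is a name chosen here, not an existing symbol in `wallet.py`):

```python
def finalize_delta(estimated: int, actual: int) -> int:
    # Signed wallet adjustment at finalize time: positive = refund the user,
    # negative = deduct overage. Overage is capped at 2x the estimate (step 3).
    actual = min(actual, 2 * estimated)
    return estimated - actual
```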
### DB Transaction (Pseudocode)
```sql
BEGIN;
-- Step 1: Reserve
UPDATE wallet SET token_balance = token_balance - :estimated
WHERE user_id = :uid AND token_balance >= :estimated;
-- application must verify exactly 1 row was updated; 0 rows means
-- insufficient balance -> ROLLBACK (no partial deductions)
INSERT INTO reservations (user_id, estimated, status, created_at)
VALUES (:uid, :estimated, 'reserved', now())
RETURNING id;
COMMIT;
-- After LLM call completes:
BEGIN;
UPDATE reservations SET actual = :actual, status = 'finalized', finalized_at = now()
WHERE id = :reservation_id AND status = 'reserved';
-- refund delta if actual < estimated
UPDATE wallet SET token_balance = token_balance + (:estimated - :actual)
WHERE user_id = :uid AND :actual < :estimated;
INSERT INTO wallet_ledger (user_id, delta, reason, request_id)
VALUES (:uid, -:actual, 'agent_chat', :request_id);
COMMIT;
```
### Implementation Details (`wallet.py`)
- **Balance Pre-check & Reserve:** Combined in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.
- **Reconciliation:** A nightly job compares `SUM(wallet_ledger.delta)` with `wallet.token_balance` per user and flags discrepancies.
File 3: rls_plan.md
Change 3.1 — Complete Phase B and add JWT custom-claims migration
--- original (lines 21-32, entire Phase B + Timeline)
Phase B: Admin Tables (Priority 2)
These tables should be restricted to users with the admin role.
references, scrape_runs
- Policy: `service_role` can do everything. `admin` can read/write.
- Check: `(auth.jwt() ->> 'role') = 'admin'`
Implementation Timeline
- Drafting: Feb 20, 2026.
- Testing: Feb 22, 2026.
- Enforcement: Feb 25, 2026.
+++ proposed
Phase B: Admin & System Tables (Priority 2)
references, scrape_runs
- Policy: `service_role` full access. Authenticated users with `admin` role can SELECT/INSERT/UPDATE.
- Check: `(auth.jwt() -> 'app_metadata' ->> 'role') = 'admin'`
ingestion_jobs, ingestion_audit, embedding_refs
- Policy: Service-role only. No direct user access.
- SQL Template:
reservations
- Policy: Users can SELECT their own reservations. INSERT/UPDATE via service_role only.
- SQL:
Phase C: JWT Custom Claims Migration
Goal
Move role from user_metadata to JWT custom claims via a Supabase Postgres hook, eliminating DB lookups for role checks.
Migration Checklist
1. Create Postgres function `custom_access_token_hook(event jsonb)` that injects `app_metadata.role` into the JWT claims.
2. Register the hook in Supabase Dashboard → Authentication → Hooks → Customize Access Token.
3. Update all RLS policies to read from `auth.jwt() -> 'app_metadata' ->> 'role'` instead of `auth.jwt() ->> 'role'`.
4. Update FastAPI `get_current_admin` to read `app_metadata.role` from the decoded JWT.
5. Deploy and test with a canary user.
6. Remove `x-admin-key` header support from all endpoints.
- Rollback: If the claims hook fails, revert the hook registration in the Dashboard; existing `user_metadata`-based checks remain functional.
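The claim-reading half of step 4 can be sketched as plain functions (names other than `get_current_admin` are illustrative; the real dependency would raise `HTTPException(403)` rather than `PermissionError`):

```python
def role_from_claims(claims: dict):
    # Read the role injected into app_metadata by the custom access-token hook.
    return (claims.get("app_metadata") or {}).get("role")

def require_admin(claims: dict) -> None:
    # Core check behind the FastAPI get_current_admin dependency.
    if role_from_claims(claims) != "admin":
        raise PermissionError("admin role required")
```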
Implementation Timeline
- Phase B enforcement: Feb 20, 2026.
- Phase C (JWT claims migration): Feb 25, 2026.
- `x-admin-key` removal: Mar 1, 2026.
File 4: auth_roles_admin.md
Change 4.1 — Replace deprecated X-Admin-Key with JWT custom claims and service tokens
--- original (lines 8-19)
Admin Authorization (Updated 2026-02-16)
We are moving toward JWT + role as the primary source of truth for admin authorization. The static X-Admin-Key is now a secondary bootstrap mechanism and will be phased out.
Primary: Supabase JWT Role
- Implementation: `get_current_admin` dependency.
- Check: Validates Supabase JWT and ensures `user_metadata.role == 'admin'`.
- Status: Production default.
Secondary: X-Admin-Key
- Purpose: Internal scripts or bootstrap if Supabase Auth is unavailable.
- Header: `x-admin-key`.
- Status: Deprecated (Transition Phase).
+++ proposed
Admin Authorization (Updated 2026-02-17)
Primary: Supabase JWT with Custom Claims
- Implementation: `get_current_admin` FastAPI dependency.
- Check: Validates Supabase JWT → reads `app_metadata.role` from JWT claims (injected by the Postgres hook). Requires `role == 'admin'`.
- Status: Production default.
Service Tokens (for backend-to-backend)
- Purpose: Backend workers (ingestion, scraper cron, reconciliation) that run without a user session.
- Mechanism: Use `SUPABASE_SERVICE_KEY` stored in a secret manager (Vault / GCP Secret Manager / AWS Secrets Manager). Never stored in a plain `.env` in production.
- Status: Active.
Deprecated: X-Admin-Key
- Removal date: Mar 1, 2026.
- Migration: All scripts using `x-admin-key` must switch to either an admin JWT or the service-role key.
Change 4.2 — Roles model with custom-claims injection
--- original (lines 21-25)
Roles Model
Roles are stored in the user's `app_metadata` or `user_metadata` in Supabase.
- student: Default role. Can chat and view their own wallet.
- teacher: (Future) Can view analytics for their classes.
- admin: Can trigger scrapers, upload curriculum, manage the vector index, and manage users.
+++ proposed
Roles Model
Roles are stored in `profiles.role` (Postgres, source of truth) and injected into JWT `app_metadata.role` via a Postgres access-token hook.
| Role | Permissions |
|---|---|
| student | Chat, view own wallet/usage, view own profile |
| teacher | (Future) View class analytics, create quizzes |
| admin | Trigger scrapers, upload curriculum, manage vector index, manage users, view all wallets |
File 5: chat_agent.md
Change 5.1 — Multilingual support including Hassaniya
--- original (lines 22-27)
## Multilingual Support (Arabic/French)
To support students asking in Arabic when the curriculum is in French:
1. Translation Step: If the input language is detected as Arabic, the query is translated to French using the LLM before vector search.
2. Cross-Lingual Retrieval: Vector search is performed using the French query against the French corpus.
3. Cross-Lingual Reranking: The reranker is instructed to match the original question (and its translation) against snippets, accounting for language differences.
4. Arabic Response: The final answer is generated in Arabic using the French context.
+++ proposed
Multilingual Support (Arabic / French / Hassaniya)
Language Matrix
| Input Language | Corpus Language | Translation? | Response Language |
|---|---|---|---|
| French | French | No | French |
| Arabic (MSA) | French | Yes (Arabic→French before retrieval) | Arabic |
| Hassaniya | French | Yes (Hassaniya→French; treat as Arabic-script with cultural localization rules) | Hassaniya/Arabic |
| Arabic (MSA) | Arabic | No | Arabic |
Steps
- Language Detection: Detect input language (ISO 639-3). Hassaniya is classified as Arabic-script; a secondary classifier (GPT-mini validator) distinguishes MSA from Hassaniya dialect.
- Translation Step: If input language differs from corpus language, translate via LLM before vector search.
- Cross-Lingual Retrieval: Vector search uses the translated query.
- Reranking: Reranker sees both original and translated query.
- Response Generation: Answer is generated in the detected input language (Hassaniya users receive Hassaniya-flavored Arabic).
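The language matrix and steps above reduce to a small routing function. A sketch, assuming ISO 639-3 codes (`"mey"` for Hassaniya is an assumption of this sketch, as is the function name):

```python
def route_query(input_lang: str, corpus_lang: str) -> dict:
    # Translate only when input and corpus languages differ (per the matrix).
    translate = input_lang != corpus_lang
    return {
        "translate": translate,              # LLM translation before retrieval
        "search_lang": corpus_lang,          # retrieval always runs in corpus language
        "response_lang": input_lang,         # answer in the user's language
        "dialect_hint": input_lang == "mey", # Hassaniya-flavored Arabic generation
    }
```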
OCR Considerations
- Arabic/Hassaniya PDFs may require OCR (Tesseract with the `ara` model, or Google Vision API).
- OCR confidence threshold: ≥ 0.70 for Arabic, ≥ 0.80 for French. Below threshold → flag for manual review.
- Arabic text normalization: unify alef variants (أ إ آ → ا), remove tatweel (ـ), normalize taa marbuta.
Change 5.2 — Retrieval, rerank, caching, and circuit breaker
--- original (lines 17-20)
Retrieval Logic
- Stage 1 (Dense Search): Get top 20 matches from Pinecone.
- Stage 2 (Reranking): Use `gpt-4o-mini` to select the top 5 most relevant snippets from the 20 candidates. This significantly improves accuracy for complex pedagogical questions.
+++ proposed
Retrieval & Rerank Pipeline
Stage 1: Dense Retrieval
- Query Pinecone with top-K = `TIER_CAPS[tier].top_k` (Free: 10, Standard: 20, Premium: 30).
- Apply metadata prefilters: `language`, `grade`, `subject`.
Stage 2: Optional Lexical Prefilter
- For Arabic queries, apply a BM25-style keyword prefilter on the retrieved chunks to remove OCR noise.
Stage 3: Reranking
- Use `gpt-4o-mini` (the GPT-mini validator/reranker) to select the top `TIER_CAPS[tier].rerank_n` (Free: 3, Standard: 5, Premium: 8).
- GPT-mini service: Hosted via the OpenAI API (same key as the main models). SLA: same as the OpenAI API. Fallback: skip reranking and use dense-retrieval order if GPT-mini returns an error or latency exceeds 5 s.
Caching
- Rerank cache: Cache reranked results keyed by `sha256(query + namespace + tier)` with TTL = 15 min.
- Chunk text cache: Redis/in-memory LRU cache for `chunk_id → text` lookups (TTL = 1 hour).
- Both caches are invalidated on re-ingestion of the same file.
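A sketch of the tier caps and the rerank cache key (the dict shape of `TIER_CAPS` is an assumption; only its name and fields appear in the text above):

```python
import hashlib

# Assumed shape of the TIER_CAPS mapping referenced above.
TIER_CAPS = {
    "free":     {"top_k": 10, "rerank_n": 3},
    "standard": {"top_k": 20, "rerank_n": 5},
    "premium":  {"top_k": 30, "rerank_n": 8},
}

def rerank_cache_key(query: str, namespace: str, tier: str) -> str:
    # Mirrors the sha256(query + namespace + tier) key from the caching rules.
    return hashlib.sha256(f"{query}{namespace}{tier}".encode()).hexdigest()
```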
Circuit Breaker
- If OpenAI API returns 5xx or times out 3 times in 60 seconds:
- Embedding: Queue the job and retry later (ingestion).
- Reranking: Skip rerank, return dense-retrieval order.
- Reasoning (chat model): Return user-facing error "Service temporarily unavailable, please retry."
- Circuit breaker resets after 120 seconds of no failures.
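A minimal sketch of that trip/reset policy (3 failures in 60 s opens the breaker, a 120 s quiet period resets it); the class name and injected `now` timestamps are illustrative, not existing code:

```python
class CircuitBreaker:
    """Trip after `max_failures` failures within `window` seconds; reset after `reset_after`."""
    def __init__(self, max_failures=3, window=60.0, reset_after=120.0):
        self.max_failures, self.window, self.reset_after = max_failures, window, reset_after
        self.failures = []     # timestamps of recent failures
        self.opened_at = None

    def record_failure(self, now):
        # Keep only failures inside the sliding window, then check the trip condition.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.opened_at = now            # open: callers take the fallback path

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at < self.reset_after:
            return False                    # still open: skip rerank / queue the job
        self.opened_at, self.failures = None, []   # quiet period elapsed: reset
        return True
```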
File 6: scraping_platform.md
Change 6.1 — Dedupe, canonicalization, provenance, and quality gates
--- original (lines 30-35)
Admin Workflow
- Call `POST /scraping/koutoubi/sync`.
- The system fetches the sitemap.
- It creates/updates entries in the `references` table.
- Admin reviews `GET /scraping/koutoubi/references` to see what is ready for vectorization.
+++ proposed
Content Processing Pipeline (Automated)
Deduplication
- Method: SimHash (64-bit) computed on normalized text of each discovered PDF.
- Storage: `references.content_fingerprint` (BIGINT).
- Rule: If a new PDF has a SimHash Hamming distance ≤ 3 from an existing reference, mark it as `duplicate` and link to the canonical reference via `references.canonical_id`.
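The fingerprint-and-compare step can be sketched as follows. This is an unweighted SimHash (production would weight tokens by frequency), and the function names are illustrative:

```python
import hashlib

def simhash64(tokens):
    # Classic 64-bit SimHash: sum per-bit votes from each token's hash.
    v = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)

def is_duplicate(fp, known_fps, max_distance=3):
    # Rule above: Hamming distance <= 3 between fingerprints marks a duplicate.
    return any(bin(fp ^ other).count("1") <= max_distance for other in known_fps)
```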
Canonicalization
- Normalize whitespace (collapse multiple spaces/newlines).
- Arabic script normalization: unify alef forms, remove tatweel, normalize hamza.
- Remove boilerplate headers/footers (regex patterns per source, configurable in `scraper_config`).
- Strip non-content pages (TOC, blank pages) based on a character-count threshold (< 50 chars after normalization → skip).
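A sketch of the Arabic normalization rules listed above (mapping taa marbuta to haa is one common convention and an assumption here, since the target form is not specified):

```python
import re

def normalize_arabic(text: str) -> str:
    # Unify alef variants (hamza above/below, madda, wasla) to bare alef.
    text = re.sub("[\u0623\u0625\u0622\u0671]", "\u0627", text)
    # Remove tatweel (kashida).
    text = text.replace("\u0640", "")
    # Normalize taa marbuta to haa (lossy convention -- an assumption).
    text = text.replace("\u0629", "\u0647")
    # Collapse whitespace (canonicalization step above).
    return re.sub(r"\s+", " ", text).strip()
```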
Provenance Metadata
Each `references` row stores:
- source_url (canonical URL after redirect resolution)
- discovered_at (timestamp of first discovery)
- last_checked_at (last time scraper verified the URL is still live)
- content_fingerprint (SimHash)
- scrape_run_id (FK to scrape_runs)
- canonical_id (self-FK for deduplication)
Content Quality Heuristics
| Check | Threshold | Action |
|---|---|---|
| Minimum text length | 200 chars (after normalization) | Skip page/chunk |
| OCR confidence (Arabic) | ≥ 0.70 | Flag for review if below |
| OCR confidence (French) | ≥ 0.80 | Flag for review if below |
| Encoding detection | Must be valid UTF-8 | Reject and log |
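The table above can be applied in one pass; a sketch taking raw bytes plus an OCR confidence score (the function name and return labels are illustrative):

```python
def quality_gate(raw: bytes, ocr_conf: float, lang: str) -> str:
    """Apply the heuristics table; returns 'reject' | 'skip' | 'review' | 'ok'.
    `lang` is an ISO 639-3 code ('ara' or 'fra' here)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "reject"                 # invalid encoding: reject and log
    if len(text.strip()) < 200:
        return "skip"                   # below minimum text length
    min_conf = 0.70 if lang == "ara" else 0.80
    if ocr_conf < min_conf:
        return "review"                 # OCR confidence below threshold
    return "ok"
```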
Admin Workflow
- Call `POST /scraping/{source}/sync`.
- System fetches the sitemap and applies dedupe and canonicalization automatically.
- Quality-failed items are logged in `ingestion_audit` with a reason.
- Admin reviews `GET /scraping/{source}/references` — no manual approval needed for source permission (all sources pre-approved).
File 7: README.md
Change 7.1 — Replace deprecated x-admin-key curl example
--- original (lines 56-58)
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "x-admin-key: YOUR_ADMIN_API_KEY"
+++ proposed
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "Authorization: Bearer YOUR_ADMIN_JWT"
Change 7.2 — Add Architecture Highlights section
--- original (lines 17-18, after Tech Stack section)
+++ proposed
Architecture Highlights
- Reservation Billing: Token costs are reserved before LLM calls and finalized after, ensuring atomicity.
- Idempotent Ingestion: Deterministic chunk IDs (`sha256(file_id:page:chunk_index)`) make re-ingestion safe.
- GPT-mini Validator: A lightweight `gpt-4o-mini` microservice handles reranking, language detection, and input validation. Hosted on the OpenAI API, with a fallback to skip reranking on timeout.
- Canonical Chunk Store: Full text lives in the Postgres `chunks` table; Pinecone stores only vectors + lightweight metadata.
- Observability: Structured logging, circuit breakers on all LLM/Pinecone calls, and wallet reconciliation jobs.