BacMR Backend Architecture Plan

Critical Architectural Issues Found (Summary)

  1. No idempotent ingestion — ingestion_vector_embedding.md uses character-based chunking (1000 chars) instead of token-based, has no deterministic chunk IDs, and stores full raw text in Pinecone metadata (violates the canonical chunk-store pattern and risks hitting Pinecone's 40 KB metadata limit).
  2. No reservation billing — wallet_billing.md describes a simple deduct-after-response model with no reserve-then-finalize pattern; a crash between LLM invocation and deduction loses revenue; there is no atomicity guarantee.
  3. RLS plan is incomplete — rls_plan.md covers only Phase A/B but omits new tables (ingestion_jobs, reservations, embedding_refs, ingestion_audit), does not specify admin-role extraction from JWT custom claims, and has no migration rollback procedure.
  4. Auth still relies on deprecated x-admin-key — auth_roles_admin.md says "Transition Phase" but provides no concrete migration checklist, no service-token strategy, and no plan for JWT custom claims via Postgres hooks.
  5. Scraping has no dedupe, canonicalization, or content-quality gates — scraping_platform.md trusts whatever the scraper finds; no SimHash/fingerprinting, no Arabic-script normalization, no OCR confidence thresholds, no provenance chain.
  6. Chat agent doc has no cost control, no caching, no circuit breaker — chat_agent.md describes the happy path only; no cached-rerank TTL, no fallback when OpenAI is down, no observability hooks, no Hassaniya support.
  7. README is stale — README.md still shows deprecated x-admin-key curl examples and has no mention of the GPT-mini validator, reservation billing, or observability; it needs updating to reflect the new architecture.
  8. No request-ID propagation — No correlation key across logs, wallet, LLM calls, Pinecone queries, and audit tables. Debugging an LLM failure that spans multiple subsystems is effectively impossible without this.
  9. No API-level rate limiting — Students can spam refresh, open multiple tabs, and trigger parallel queries. Without per-user rate limits, a single user can exhaust the platform's OpenAI quota and degrade service for everyone.

Change Log (Diff-Style)

File 1: docs/30_design/ingestion_vector_embedding.md

Change 1.1 — Switch from character-based to token-based chunking

--- original (lines 13-15)
2. **Chunking:**
   - **Size:** 1000 characters.
   - **Overlap:** 150 characters (ensures context isn't lost at boundaries).

+++ proposed
2. **Chunking (Token-Based):**
   - **Strategy:** Token-based chunking using `tiktoken` (model `cl100k_base`).
   - **Size:** 512 tokens (≈2,000 characters for Latin scripts; fewer for Arabic due to tokenizer expansion).
   - **Overlap:** 64 tokens.
   - **Language-specific rules:**
     - **French:** Standard 512-token chunks.
     - **Arabic / Hassaniya:** 384-token chunks (Arabic tokenizes at ~1.5× expansion); overlap 48 tokens.
   - **Deterministic Chunk ID:** `sha256(file_id + ":" + page_number + ":" + chunk_index)` — ensures idempotent re-ingestion.
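
A minimal sketch of this chunker, assuming `tiktoken` is installed; function and constant names are illustrative, not from the codebase:

```python
import hashlib

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

# Per-language (size, overlap) in tokens, mirroring the rules above.
CHUNK_PARAMS = {"fr": (512, 64), "ar": (384, 48)}

def chunk_page(text: str, file_id: str, page_number: int, language: str = "fr") -> list[dict]:
    """Split one page into overlapping token windows with deterministic IDs."""
    size, overlap = CHUNK_PARAMS.get(language, CHUNK_PARAMS["fr"])
    tokens = ENC.encode(text)
    chunks = []
    for chunk_index, start in enumerate(range(0, max(len(tokens), 1), size - overlap)):
        window = tokens[start:start + size]
        if not window:
            break
        # Same ID formula on every run -> re-ingestion upserts, never duplicates.
        chunk_id = hashlib.sha256(f"{file_id}:{page_number}:{chunk_index}".encode()).hexdigest()
        chunks.append({"chunk_id": chunk_id, "text": ENC.decode(window)})
        if start + size >= len(tokens):
            break
    return chunks
```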

Change 1.2 — Do not store raw text in Pinecone metadata

--- original (line 24)
   - Stores the raw text in the vector metadata for fast RAG retrieval (avoids a second DB hop).

+++ proposed
   - **Canonical chunk store:** Full chunk text is stored in `chunks` table in Postgres (keyed by `chunk_id`). Pinecone metadata stores only lightweight filter fields (`chunk_id`, `file_id`, `language`, `grade`, `subject`, `source_url`, `page_number`, `ingestion_ts`). At retrieval time the backend fetches chunk text from Postgres (with Redis/in-memory cache) using the `chunk_id` returned by Pinecone.
   - **Rationale:** Avoids Pinecone's 40 KB metadata limit; keeps Postgres as source of truth; enables full-text search fallback.
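
A retrieval-time hydration sketch under these assumptions; `pg` (a Postgres client) and `index` (a Pinecone index handle) are assumed to exist in scope, and `fetch_chunk_text` is a hypothetical helper:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def fetch_chunk_text(chunk_id: str) -> str:
    """Hydrate chunk text from the canonical Postgres store (in-memory cached)."""
    row = pg.fetchone("SELECT text FROM chunks WHERE chunk_id = %s", (chunk_id,))
    return row[0] if row else ""

def retrieve(query_vector: list[float], namespace: str, top_k: int = 20) -> list[dict]:
    res = index.query(vector=query_vector, top_k=top_k,
                      namespace=namespace, include_metadata=True)
    # Pinecone returns only lightweight metadata; full text comes from Postgres.
    return [{"chunk_id": m.metadata["chunk_id"], "score": m.score,
             "text": fetch_chunk_text(m.metadata["chunk_id"])}
            for m in res.matches]
```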

Change 1.3 — Add ingestion job lifecycle

--- original (lines 27-35, "Reference-Driven Ingestion" section)
[entire section]

+++ proposed
## Ingestion Job Lifecycle
Ingestion is managed via the `ingestion_jobs` table with deterministic state transitions:

| State | Meaning |
|:---|:---|
| `queued` | Job created, awaiting parser worker |
| `parsing` | PDF downloaded, text extraction in progress |
| `tokenizing` | Chunks being created with token-based splitter |
| `embedding_request_sent` | OpenAI embedding API call dispatched |
| `embedding_upserted` | Vectors upserted to Pinecone |
| `ready` | Fully searchable; `chunks` rows committed |
| `failed` | Terminal failure after max retries (3) |

### Retry & Idempotency
- Each ingestion job uses deterministic vector IDs (`sha256(file_id:page:chunk_index)`), so re-running an ingestion for the same file is a safe upsert (no duplicates).
- On transient failure the job returns to the previous state and retries (max 3 attempts with exponential backoff).
- `ingestion_audit` table records every state transition with timestamp and error message.
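
A small sketch of how a worker could guard these transitions; names are illustrative, and `record_transition` is a hypothetical helper that writes one `ingestion_audit` row per transition:

```python
# Allowed state transitions; a back-edge returns the job to the previous
# state for a retry, per the retry rules above.
TRANSITIONS = {
    "queued": {"parsing", "failed"},
    "parsing": {"tokenizing", "queued", "failed"},
    "tokenizing": {"embedding_request_sent", "parsing", "failed"},
    "embedding_request_sent": {"embedding_upserted", "tokenizing", "failed"},
    "embedding_upserted": {"ready", "embedding_request_sent", "failed"},
}

def advance(job_id: str, current: str, target: str) -> None:
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    record_transition(job_id, current, target)  # audit row: timestamp + error, if any
```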

### Reference-Driven Trigger
1. **Discovery**: Scrapers populate the `references` table (status: `discovered`).
2. **Trigger**: Admin calls `POST /ingestion/jobs` with `reference_id`.
3. **Execution**: Job moves through states above.
4. **Sync-back**: On `ready`, the `references.status` is set to `ready` and vector count synced.

Change 1.4 — Update Key Constants table

--- original (lines 37-43)
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 1000 chars |
| Chunk Overlap | 150 chars |
| Model | `text-embedding-3-small` |
| Index | `curriculum-1536` |

+++ proposed
| Parameter | Value |
| :--- | :--- |
| Chunk Size | 512 tokens (French), 384 tokens (Arabic/Hassaniya) |
| Chunk Overlap | 64 tokens (French), 48 tokens (Arabic/Hassaniya) |
| Tokenizer | `tiktoken` / `cl100k_base` |
| Embedding Model | `text-embedding-3-small` (1536-dim) |
| Pinecone Index | `curriculum-1536` |
| Chunk ID Format | `sha256(file_id + ":" + page + ":" + chunk_index)` |
| Max retries | 3 (exponential backoff) |

File 2: docs/30_design/wallet_billing.md

Change 2.1 — Add reservation-based billing

--- original (lines 24-26)
## Implementation Details (`wallet.py`)
- **Balance Pre-check:** Done in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.

+++ proposed
## Reservation-Based Billing

### Why Reservations?
A simple "check-then-deduct" approach is not atomic — if the LLM call succeeds but the deduction fails (crash, timeout), the platform loses revenue. Conversely, if we deduct first and the LLM fails, the user loses tokens unfairly.

### Flow
1. **Reserve**: `POST /wallet/reserve` — atomically creates a `reservations` row and decrements `wallet.token_balance` by the *estimated* cost. Returns `reservation_id`.
2. **LLM Call**: The agent performs retrieval + reasoning.
3. **Finalize**: `POST /wallet/finalize` — compares estimated vs actual usage. If actual < estimated, refunds the delta; if actual > estimated (rare, capped at 2× estimate), deducts the overage. Marks reservation `finalized`.
4. **Expiry / Rollback**: A background job releases un-finalized reservations older than 5 minutes (refunds the reserved amount).

### DB Transaction (Pseudocode)
```sql
BEGIN;
  -- Step 1: Reserve
  UPDATE wallet SET token_balance = token_balance - :estimated
    WHERE user_id = :uid AND token_balance >= :estimated;
  -- if no row is updated, the balance is insufficient: abort/ROLLBACK (no partial deductions)
  INSERT INTO reservations (user_id, estimated, status, created_at)
    VALUES (:uid, :estimated, 'reserved', now())
    RETURNING id;
COMMIT;

-- After LLM call completes:
BEGIN;
  UPDATE reservations SET actual = :actual, status = 'finalized', finalized_at = now()
    WHERE id = :reservation_id AND status = 'reserved';
  -- refund delta if actual < estimated
  UPDATE wallet SET token_balance = token_balance + (:estimated - :actual)
    WHERE user_id = :uid AND :actual < :estimated;
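  -- overage case (:actual > :estimated, capped at 2x estimate per the flow above)
  -- would deduct the difference here; omitted for brevity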
  INSERT INTO wallet_ledger (user_id, delta, reason, request_id)
    VALUES (:uid, -:actual, 'agent_chat', :request_id);
COMMIT;
```

## Implementation Details (`wallet.py`)

- **Balance Pre-check & Reserve:** Combined in the `check_wallet` node of the LangGraph.
- **Welcome Bonus:** New users receive 50 tokens automatically upon their first wallet fetch.
- **Reconciliation:** A nightly job compares `SUM(wallet_ledger.delta)` with `wallet.token_balance` per user and flags discrepancies (see the sketch below).
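
A sketch of the reconciliation query; `pg` is a hypothetical DB client, and the query assumes every balance change (including the welcome bonus) writes a `wallet_ledger` row:

```python
def find_wallet_discrepancies(pg) -> list:
    """Flag users whose ledger sum disagrees with their wallet balance."""
    return pg.fetchall("""
        SELECT w.user_id, w.token_balance, COALESCE(SUM(l.delta), 0) AS ledger_sum
        FROM wallet w
        LEFT JOIN wallet_ledger l ON l.user_id = w.user_id
        GROUP BY w.user_id, w.token_balance
        HAVING w.token_balance <> COALESCE(SUM(l.delta), 0)
    """)
```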

---

File 3: docs/30_design/rls_plan.md

Change 3.1 — Add new tables and migration checklist

--- original (lines 21-32, entire Phase B + Timeline)

Phase B: Admin Tables (Priority 2)

These tables should be restricted to users with the admin role.

references, scrape_runs

  • Policy: service_role can do everything. admin can read/write.
  • Check: (auth.jwt() ->> 'role') = 'admin'

Implementation Timeline

  • Drafting: Feb 20, 2026.
  • Testing: Feb 22, 2026.
  • Enforcement: Feb 25, 2026.

+++ proposed

Phase B: Admin & System Tables (Priority 2)

references, scrape_runs

  • Policy: service_role full access. Authenticated users with admin role can SELECT/INSERT/UPDATE.
  • Check: (auth.jwt() -> 'app_metadata' ->> 'role') = 'admin'

ingestion_jobs, ingestion_audit, embedding_refs

  • Policy: Service-role only. No direct user access.
  • SQL Template:
    ALTER TABLE ingestion_jobs ENABLE ROW LEVEL SECURITY;
    CREATE POLICY "service_role_all" ON ingestion_jobs
      FOR ALL USING (auth.role() = 'service_role');
    

reservations

  • Policy: Users can SELECT their own reservations. INSERT/UPDATE via service_role only.
  • SQL:
    ALTER TABLE reservations ENABLE ROW LEVEL SECURITY;
    CREATE POLICY "user_select_own" ON reservations
      FOR SELECT USING (auth.uid() = user_id);
    CREATE POLICY "service_role_all" ON reservations
      FOR ALL USING (auth.role() = 'service_role');
    

Phase C: JWT Custom Claims Migration

Goal

Move role from user_metadata to JWT custom claims via a Supabase Postgres hook, eliminating DB lookups for role checks.

Migration Checklist

  1. Create Postgres function custom_access_token_hook(event jsonb) that injects app_metadata.role into the JWT claims.
  2. Register the hook in Supabase Dashboard → Authentication → Hooks → Customize Access Token.
  3. Update all RLS policies to read from auth.jwt() -> 'app_metadata' ->> 'role' instead of auth.jwt() ->> 'role'.
  4. Update FastAPI get_current_admin to read app_metadata.role from the decoded JWT.
  5. Deploy and test with a canary user.
  6. Remove x-admin-key header support from all endpoints.
  7. Rollback: If claims hook fails, revert the hook registration in Dashboard; existing user_metadata-based checks remain functional.

Implementation Timeline

  • Phase B enforcement: Feb 20, 2026.
  • Phase C (JWT claims migration): Feb 25, 2026.
  • x-admin-key removal: Mar 1, 2026.

---

File 4: docs/30_design/auth_roles_admin.md

Change 4.1 — Add service-token strategy and remove ambiguity

--- original (lines 8-19)

Admin Authorization (Updated 2026-02-16)

We are moving toward JWT + role as the primary source of truth for admin authorization. The static X-Admin-Key is now a secondary bootstrap mechanism and will be phased out.

Primary: Supabase JWT Role

  • Implementation: get_current_admin dependency.
  • Check: Validates Supabase JWT and ensures user_metadata.role == 'admin'.
  • Status: Production default.

Secondary: X-Admin-Key

  • Purpose: Internal scripts or bootstrap if Supabase Auth is unavailable.
  • Header: x-admin-key.
  • Status: Deprecated (Transition Phase).

+++ proposed

Admin Authorization (Updated 2026-02-17)

Primary: Supabase JWT with Custom Claims

  • Implementation: get_current_admin FastAPI dependency.
  • Check: Validates Supabase JWT → reads app_metadata.role from JWT claims (injected by Postgres hook). Requires role == 'admin'.
  • Status: Production default.
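
A sketch of this dependency, assuming `python-jose` for verification and Supabase's HS256 signing with the project JWT secret (error handling simplified):

```python
import os

from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from jose import JWTError, jwt

bearer = HTTPBearer()

def get_current_admin(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    """Validate the Supabase JWT and require app_metadata.role == 'admin'."""
    try:
        claims = jwt.decode(creds.credentials, os.environ["SUPABASE_JWT_SECRET"],
                            algorithms=["HS256"], audience="authenticated")
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")
    if claims.get("app_metadata", {}).get("role") != "admin":
        raise HTTPException(status_code=403, detail="Admin role required")
    return claims
```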

Service Tokens (for backend-to-backend)

  • Purpose: Backend workers (ingestion, scraper cron, reconciliation) that run without a user session.
  • Mechanism: Use SUPABASE_SERVICE_KEY stored in a secret manager (Vault / GCP Secret Manager / AWS Secrets Manager). Never stored in plain .env in production.
  • Status: Active.

Deprecated: X-Admin-Key

  • Removal date: Mar 1, 2026.
  • Migration: All scripts using x-admin-key must switch to either admin JWT or service-role key.

Change 4.2 — Add roles to custom claims

--- original (lines 21-25)

Roles Model

Roles are stored in the user's app_metadata or user_metadata in Supabase.

- student: Default role. Can chat and view their own wallet.
- teacher: (Future) Can view analytics for their classes.
- admin: Can trigger scrapers, upload curriculum, manage the vector index, and manage users.

+++ proposed

Roles Model

Roles are stored in profiles.role (Postgres, source of truth) and injected into JWT app_metadata.role via a Postgres access-token hook.

| Role | Permissions |
| :--- | :--- |
| student | Chat, view own wallet/usage, view own profile |
| teacher | (Future) View class analytics, create quizzes |
| admin | Trigger scrapers, upload curriculum, manage vector index, manage users, view all wallets |
---

File 5: docs/30_design/chat_agent.md

Change 5.1 — Add Hassaniya support
--- original (lines 22-27)
## Multilingual Support (Arabic/French)
To support students asking in Arabic when the curriculum is in French:
1. Translation Step: If the input language is detected as Arabic, the query is translated to French using the LLM before vector search.
2. Cross-Lingual Retrieval: Vector search is performed using the French query against the French corpus.
3. Cross-Lingual Reranking: The reranker is instructed to match the original question (and its translation) against snippets, accounting for language differences.
4. Arabic Response: The final answer is generated in Arabic using the French context.

+++ proposed

Multilingual Support (Arabic / French / Hassaniya)

Language Matrix

| Input Language | Corpus Language | Translation? | Response Language |
| :--- | :--- | :--- | :--- |
| French | French | No | French |
| Arabic (MSA) | French | Yes (Arabic→French before retrieval) | Arabic |
| Hassaniya | French | Yes (Hassaniya→French; treat as Arabic-script with cultural localization rules) | Hassaniya/Arabic |
| Arabic (MSA) | Arabic | No | Arabic |

Steps

  1. Language Detection: Detect input language (ISO 639-3). Hassaniya is classified as Arabic-script; a secondary classifier (GPT-mini validator) distinguishes MSA from Hassaniya dialect.
  2. Translation Step: If input language differs from corpus language, translate via LLM before vector search.
  3. Cross-Lingual Retrieval: Vector search uses the translated query.
  4. Reranking: Reranker sees both original and translated query.
  5. Response Generation: Answer is generated in the detected input language (Hassaniya users receive Hassaniya-flavored Arabic).
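
A minimal routing sketch for the matrix above; `detect_language`, `classify_dialect`, and `translate` are hypothetical LLM-backed helpers:

```python
def route_query(query: str, corpus_lang: str = "fra") -> dict:
    """Decide translation and answer language per the matrix above."""
    lang = detect_language(query)        # ISO 639-3, e.g. "fra", "arb"
    if lang.startswith("ar"):            # Arabic script: refine with GPT-mini
        lang = classify_dialect(query)   # "arb" (MSA) vs "mey" (Hassaniya)
    needs_translation = lang != corpus_lang
    search_query = translate(query, to=corpus_lang) if needs_translation else query
    return {"search_query": search_query, "answer_lang": lang}
```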

OCR Considerations

  • Arabic/Hassaniya PDFs may require OCR (Tesseract with ara model or Google Vision API).
  • OCR confidence threshold: ≥ 0.70 for Arabic, ≥ 0.80 for French. Below threshold → flag for manual review.
  • Arabic text normalization: unify alef variants (أ إ آ → ا), remove tatweel (ـ), normalize taa marbuta.

Change 5.2 — Add caching, circuit breaker, cost control

--- original (lines 17-20)

Retrieval Logic

  1. Stage 1 (Dense Search): Get top 20 matches from Pinecone.
  2. Stage 2 (Reranking): Use gpt-4o-mini to select the top 5 most relevant snippets from the 20 candidates. This significantly improves accuracy for complex pedagogical questions.

+++ proposed

Retrieval & Rerank Pipeline

Stage 1: Dense Retrieval

  • Query Pinecone with top-K = TIER_CAPS[tier].top_k (Free: 10, Standard: 20, Premium: 30).
  • Apply metadata prefilters: language, grade, subject.

Stage 2: Optional Lexical Prefilter

  • For Arabic queries, apply BM25-style keyword prefilter on the retrieved chunks to remove OCR noise.

Stage 3: Reranking

  • Use gpt-4o-mini (GPT-mini validator/reranker) to select top TIER_CAPS[tier].rerank_n (Free: 3, Standard: 5, Premium: 8).
  • GPT-mini service: Hosted via OpenAI API (same key as main models). SLA: same as OpenAI API. Fallback: skip reranking and use dense-retrieval order if GPT-mini returns error or latency > 5 s.
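
For reference, the tier caps used in Stages 1 and 3 as a config sketch (values from this doc; the dict shape is illustrative):

```python
TIER_CAPS = {
    "free":     {"top_k": 10, "rerank_n": 3},
    "standard": {"top_k": 20, "rerank_n": 5},
    "premium":  {"top_k": 30, "rerank_n": 8},
}
```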

Caching

  • Rerank cache: Cache reranked results keyed by sha256(query + namespace + tier) with TTL = 15 min.
  • Chunk text cache: Redis/in-memory LRU cache for chunk_id → text lookups (TTL = 1 hour).
  • Cache is invalidated on re-ingestion of the same file.
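
An in-memory sketch of the rerank cache (production would use Redis, per the note above):

```python
import hashlib
import time

_rerank_cache: dict[str, tuple[float, list]] = {}  # key -> (expires_at, results)
RERANK_TTL = 15 * 60  # seconds

def cached_rerank(query: str, namespace: str, tier: str, rerank_fn):
    key = hashlib.sha256(f"{query}:{namespace}:{tier}".encode()).hexdigest()
    hit = _rerank_cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                    # fresh cache hit
    results = rerank_fn(query)           # cache miss: call GPT-mini reranker
    _rerank_cache[key] = (time.time() + RERANK_TTL, results)
    return results
```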

Circuit Breaker

  • Trip condition: the OpenAI API returns 5xx or times out 3 times within 60 seconds. While the breaker is open:
    • Embedding: queue the job and retry later (ingestion path).
    • Reranking: skip rerank and return dense-retrieval order.
    • Reasoning (chat model): return the user-facing error "Service temporarily unavailable, please retry."
  • The circuit breaker resets after 120 seconds with no failures (see the sketch below).
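
A minimal in-process sketch of this policy; the 120-second reset is approximated with a cooldown timer from the moment the breaker trips:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` failures within `window` seconds."""
    def __init__(self, max_failures: int = 3, window: float = 60, cooldown: float = 120):
        self.max_failures, self.window, self.cooldown = max_failures, window, cooldown
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is not None and time.time() - self.opened_at < self.cooldown:
            return False               # open: reject the call
        self.opened_at = None          # cooldown elapsed: close again
        return True

    def record_failure(self) -> None:
        now = time.time()
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        if len(self.failures) >= self.max_failures:
            self.opened_at = now       # trip the breaker
```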

---

File 6: docs/30_design/scraping_platform.md

Change 6.1 — Add dedupe, canonicalization, provenance, quality gates

--- original (lines 30-35)

Admin Workflow

  1. Call POST /scraping/koutoubi/sync.
  2. The system fetches the sitemap.
  3. It creates/updates entries in the references table.
  4. Admin reviews GET /scraping/koutoubi/references to see what is ready for vectorization.

+++ proposed

Content Processing Pipeline (Automated)

Deduplication

  • Method: SimHash (64-bit) computed on normalized text of each discovered PDF.
  • Storage: references.content_fingerprint (BIGINT).
  • Rule: If a new PDF has a SimHash Hamming distance ≤ 3 from an existing reference, mark it as duplicate and link to the canonical reference via references.canonical_id.
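
A compact SimHash sketch consistent with the rule above (production code would likely hash word shingles rather than single tokens):

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit SimHash over whitespace tokens of normalized text."""
    weights = [0] * 64
    for token in text.split():
        h = int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def is_duplicate(new_fp: int, existing_fp: int, max_distance: int = 3) -> bool:
    return (new_fp ^ existing_fp).bit_count() <= max_distance  # Hamming distance
```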

Canonicalization

  1. Normalize whitespace (collapse multiple spaces/newlines).
  2. Arabic script normalization: unify alef forms, remove tatweel, normalize hamza.
  3. Remove boilerplate headers/footers (regex patterns per source, configurable in scraper_config).
  4. Strip non-content pages (TOC, blank pages) based on character-count threshold (< 50 chars after normalization → skip).
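
A sketch of the Arabic normalization steps (alef unification also covers the hamza-on-alef forms; the taa marbuta target is one common choice, labeled as an assumption):

```python
import re

ALEF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})
TATWEEL = "\u0640"

def normalize_arabic(text: str) -> str:
    text = text.translate(ALEF_VARIANTS)      # unify alef variants
    text = text.replace(TATWEEL, "")          # strip elongation marks
    text = text.replace("ة", "ه")             # taa marbuta -> ha (assumed convention)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```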

Provenance Metadata

Each references row stores:

  • source_url (canonical URL after redirect resolution)
  • discovered_at (timestamp of first discovery)
  • last_checked_at (last time the scraper verified the URL is still live)
  • content_fingerprint (SimHash)
  • scrape_run_id (FK to scrape_runs)
  • canonical_id (self-FK for deduplication)

Content Quality Heuristics

| Check | Threshold | Action |
| :--- | :--- | :--- |
| Minimum text length | 200 chars (after normalization) | Skip page/chunk |
| OCR confidence (Arabic) | ≥ 0.70 | Flag for review if below |
| OCR confidence (French) | ≥ 0.80 | Flag for review if below |
| Encoding detection | Must be valid UTF-8 | Reject and log |

Admin Workflow

  1. Call POST /scraping/{source}/sync.
  2. System fetches sitemap, applies dedupe and canonicalization automatically.
  3. Quality-failed items are logged in ingestion_audit with reason.
  4. Admin reviews GET /scraping/{source}/references — no manual approval needed for source permission (all sources pre-approved).

---

File 7: README.md (project root)

Change 7.1 — Update deprecated x-admin-key examples

--- original (lines 56-58)
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "x-admin-key: YOUR_ADMIN_API_KEY"

+++ proposed
curl -X POST "http://localhost:8000/scraping/koutoubi/sync" \
  -H "Authorization: Bearer YOUR_ADMIN_JWT"

Change 7.2 — Add architecture summary section

--- original (lines 17-18, after Tech Stack section)


+++ proposed

Architecture Highlights

  • Reservation Billing: Token costs are reserved before LLM calls and finalized after, ensuring atomicity.
  • Idempotent Ingestion: Deterministic chunk IDs (sha256(file_id:page:chunk_index)) make re-ingestion safe.
  • GPT-mini Validator: A lightweight gpt-4o-mini microservice handles reranking, language detection, and input validation. Hosted on OpenAI API with fallback to skip-rerank on timeout.
  • Canonical Chunk Store: Full text lives in Postgres chunks table; Pinecone stores only vectors + lightweight metadata.
  • Observability: Structured logging, circuit breakers on all LLM/Pinecone calls, and wallet reconciliation jobs.


