Design: Scraping Platform
Overview
The Scraping Platform is designed to discover and catalog Mauritanian curriculum resources from external portals. It acts as the "Top of the Funnel" for the ingestion pipeline.
Data Model
scrape_runs
Tracks the history and status of every sync attempt.
- status: queued, running, success, failed.
- counts: Per-run totals for found, new, updated, and error.
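A minimal sketch of how a scrape_runs row could be modeled in Python. The four status values and the four counters come from the list above; the `source`, `started_at`, and `finished_at` fields and the `start`/`finish` helpers are assumptions added for illustration, not part of the documented schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class ScrapeRun:
    source: str                              # assumed field: which scraper ran
    status: RunStatus = RunStatus.QUEUED
    found: int = 0                           # resources seen during the run
    new: int = 0                             # resources inserted for the first time
    updated: int = 0                         # existing resources whose data changed
    error: int = 0                           # resources that could not be processed
    started_at: Optional[datetime] = None    # assumed field
    finished_at: Optional[datetime] = None   # assumed field

    def start(self) -> None:
        """Move the run from queued to running."""
        self.status = RunStatus.RUNNING
        self.started_at = datetime.now(timezone.utc)

    def finish(self, ok: bool) -> None:
        """Close the run as success or failed, as decided by the caller."""
        self.status = RunStatus.SUCCESS if ok else RunStatus.FAILED
        self.finished_at = datetime.now(timezone.utc)
```

The caller decides what counts as a failed run (for example, an unreachable sitemap); the sketch deliberately does not derive that from the error counter.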
references
A catalog of discovered PDF resources.
- source: The origin (e.g., koutoubi).
- pdf_source: The unique URL of the file.
- status:
  - discovered: Newly found by scraper.
  - ready: Successfully ingested into Pinecone.
  - failed: Ingestion attempt failed.
- metadata: grade, subject, language, weight (coefficient).
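One way to picture a references row, as a hedged sketch. The field names and status values follow the list above; the Python types, the choice to make metadata fields optional, and the `mark_ingested` helper are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReferenceStatus(str, Enum):
    DISCOVERED = "discovered"  # newly found by the scraper
    READY = "ready"            # successfully ingested into Pinecone
    FAILED = "failed"          # ingestion attempt failed


@dataclass
class Reference:
    source: str                       # origin, e.g. "koutoubi"
    pdf_source: str                   # unique URL of the file
    status: ReferenceStatus = ReferenceStatus.DISCOVERED
    # Metadata is optional here because inference may not find a value.
    grade: Optional[str] = None
    subject: Optional[str] = None
    language: Optional[str] = None
    weight: Optional[int] = None      # subject coefficient


def mark_ingested(ref: Reference, ok: bool) -> Reference:
    """Hypothetical helper: record the outcome of an ingestion attempt."""
    ref.status = ReferenceStatus.READY if ok else ReferenceStatus.FAILED
    return ref
```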
Supported Sources
Koutoubi Scraper (KoutoubiScraper)
- Logic: Parses the sitemap.xml and specific HTML tables on koutoubi.mr.
- Inference: Uses regex and "Year Decoders" to guess the grade and subject from the URL and title.
- Weighting: Maps subjects to official Mauritanian coefficients (e.g., Math in 7C has a weight of 9).
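The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the KoutoubiScraper code: the grade regex is a stand-in for the real "Year Decoders", the subject keywords and URL shapes are invented, and the coefficient table contains only the one pair given above (Math in 7C, weight 9).

```python
import re
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Standard namespace used by sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumption: only the ("7C", "math") -> 9 entry comes from this document.
COEFFICIENTS: Dict[Tuple[str, str], int] = {("7C", "math"): 9}

# Hypothetical pattern: a digit 1-7 with an optional stream letter, e.g. "7C".
GRADE_RE = re.compile(r"(?<![0-9A-Za-z])([1-7][A-D]?)(?![0-9A-Za-z])", re.IGNORECASE)

# Hypothetical keyword table mapping slug fragments to a canonical subject.
SUBJECT_KEYWORDS = {"math": "math"}


def pdf_urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> entry in the sitemap that points at a PDF."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if url.lower().endswith(".pdf")]


def infer_metadata(url: str) -> Dict[str, Optional[object]]:
    """Guess grade, subject, and weight from the last path segment of the URL."""
    slug = url.rsplit("/", 1)[-1].lower()
    match = GRADE_RE.search(slug)
    grade = match.group(1).upper() if match else None
    subject = next((name for kw, name in SUBJECT_KEYWORDS.items() if kw in slug), None)
    weight = COEFFICIENTS.get((grade, subject)) if grade and subject else None
    return {"grade": grade, "subject": subject, "weight": weight}
```

Returning `None` for anything that cannot be inferred keeps a guess from being mistaken for a confirmed value when an admin later reviews the catalog.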
Admin Workflow
- Call POST /scraping/koutoubi/sync.
- The system fetches the sitemap.
- It creates/updates entries in the references table.
- Admin reviews GET /scraping/koutoubi/references to see what is ready for vectorization.
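The create/update step of a sync could look something like the sketch below, which uses an in-memory dict keyed by pdf_source in place of the real references table. The function name, the dict-based storage, and the rule that an unchanged row counts as neither new nor updated are all assumptions; the returned counters mirror the counts recorded on scrape_runs.

```python
from typing import Dict, Iterable


def sync_references(
    table: Dict[str, dict],
    discovered: Iterable[dict],
    source: str = "koutoubi",
) -> Dict[str, int]:
    """Upsert scraped items into `table` and return the per-run counters."""
    counts = {"found": 0, "new": 0, "updated": 0, "error": 0}
    for item in discovered:
        counts["found"] += 1
        url = item.get("pdf_source")
        if not url:
            # Without a URL there is no unique key to store the row under.
            counts["error"] += 1
            continue
        metadata = {k: v for k, v in item.items() if k != "pdf_source"}
        existing = table.get(url)
        if existing is None:
            # First sighting: the row starts in the "discovered" state.
            table[url] = {"source": source, "pdf_source": url,
                          "status": "discovered", **metadata}
            counts["new"] += 1
        elif any(existing.get(k) != v for k, v in metadata.items()):
            # Known URL with changed metadata: refresh it, keep its status.
            existing.update(metadata)
            counts["updated"] += 1
    return counts
```

Keying on pdf_source makes repeated syncs idempotent: running the same sitemap twice yields no new rows the second time.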