
Design: Scraping Platform

Overview

The Scraping Platform is designed to discover and catalog Mauritanian curriculum resources from external portals. It acts as the "Top of the Funnel" for the ingestion pipeline.

Data Model

scrape_runs

Tracks the history and status of every sync attempt.

  • status: queued, running, success, failed.
  • counts: Tracks found, new, updated, and error.
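As an illustration of the shape described here, a scrape_runs record could be modeled roughly as follows. This is a minimal sketch, not the platform's actual schema: the class names and the `source` field are assumptions, and only the status values and the four counters come from this document.

```python
from dataclasses import dataclass, field
from enum import Enum


class RunStatus(str, Enum):
    """The four run states listed for scrape_runs."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class ScrapeCounts:
    """Per-run counters named in the design: found, new, updated, error."""
    found: int = 0
    new: int = 0
    updated: int = 0
    error: int = 0


@dataclass
class ScrapeRun:
    # Hypothetical field: which scraper produced this run (e.g. "koutoubi").
    source: str
    # A run starts in the queued state in this sketch.
    status: RunStatus = RunStatus.QUEUED
    # default_factory gives every run its own counter object.
    counts: ScrapeCounts = field(default_factory=ScrapeCounts)


run = ScrapeRun(source="koutoubi")
run.status = RunStatus.RUNNING
run.counts.found += 1
```

Subclassing `str` in the enum keeps the stored value identical to the plain strings used above, which is convenient if the status is persisted as text.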

references

A catalog of discovered PDF resources.

  • source: The origin (e.g., koutoubi).
  • pdf_source: The unique URL of the file.
  • status:
      • discovered: Newly found by scraper.
      • ready: Successfully ingested into Pinecone.
      • failed: Ingestion attempt failed.
  • metadata: grade, subject, language, weight (coefficient).
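The lifecycle of a references entry can be sketched in the same way. The field names below follow this document; the class names, the `mark_ingested` helper, and the example URL are hypothetical and only illustrate how a record might move from discovered to ready or failed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReferenceStatus(str, Enum):
    DISCOVERED = "discovered"  # newly found by the scraper
    READY = "ready"            # successfully ingested into Pinecone
    FAILED = "failed"          # ingestion attempt failed


@dataclass
class Reference:
    source: str                 # origin, e.g. "koutoubi"
    pdf_source: str             # unique URL of the file
    status: ReferenceStatus = ReferenceStatus.DISCOVERED
    # Metadata is optional here because inference may not find a value.
    grade: Optional[str] = None
    subject: Optional[str] = None
    language: Optional[str] = None
    weight: Optional[int] = None  # coefficient


def mark_ingested(ref: Reference, ok: bool) -> Reference:
    """Hypothetical helper: record the outcome of an ingestion attempt."""
    ref.status = ReferenceStatus.READY if ok else ReferenceStatus.FAILED
    return ref


# Hypothetical URL, used only to show the default state of a new entry.
ref = Reference(source="koutoubi", pdf_source="https://koutoubi.mr/example.pdf")
```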

Supported Sources

Koutoubi Scraper (KoutoubiScraper)

  • Logic: Parses the sitemap.xml and specific HTML tables on koutoubi.mr.
  • Inference: Uses regular expressions and "Year Decoders" to infer the grade and subject from each resource's URL and title.
  • Weighting: Maps subjects to official Mauritanian coefficients (e.g., Math in 7C has a weight of 9).
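The inference and weighting steps above might look something like the sketch below. It is not the actual KoutoubiScraper logic: the grade pattern, the keyword table, the function name `infer_metadata`, and the example URL are all assumptions. The only value taken from this document is the coefficient of 9 for Math in 7C.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed pattern: a grade code is one digit plus one letter (e.g. "7C")
# standing alone between non-alphanumeric characters.
GRADE_RE = re.compile(r"(?<![A-Za-z0-9])(\d[A-Z])(?![A-Za-z0-9])")

# Assumed keyword table; a real scraper would need many more entries.
SUBJECT_KEYWORDS: Dict[str, str] = {"math": "Math"}

# Coefficient lookup keyed by (grade, subject). Only this entry is
# stated in the design document.
COEFFICIENTS: Dict[Tuple[str, str], int] = {("7C", "Math"): 9}


def infer_metadata(url: str, title: str) -> Dict[str, Optional[object]]:
    """Infer grade, subject, and weight from the URL and title."""
    text = f"{url} {title}"

    grade_match = GRADE_RE.search(text.upper())
    grade = grade_match.group(1) if grade_match else None

    lowered = text.lower()
    subject = next(
        (name for kw, name in SUBJECT_KEYWORDS.items() if kw in lowered),
        None,
    )

    # Unknown (grade, subject) pairs get no weight rather than a default.
    weight = COEFFICIENTS.get((grade, subject)) if grade and subject else None
    return {"grade": grade, "subject": subject, "weight": weight}


# Hypothetical URL and title for illustration.
meta = infer_metadata("https://koutoubi.mr/7c/math-serie-1.pdf", "Serie 1")
```

Returning None for anything that cannot be inferred keeps uncertain values out of the catalog instead of storing a wrong grade or coefficient.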

Admin Workflow

  1. Call POST /scraping/koutoubi/sync.
  2. The system fetches the sitemap.
  3. It creates/updates entries in the references table.
  4. The admin reviews GET /scraping/koutoubi/references to see what is ready for vectorization.
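Steps 2 and 3 can be sketched as a sitemap parse followed by an upsert. This is an illustration under stated assumptions: the function names are hypothetical, an in-memory dict stands in for the references table, the HTML-table parsing mentioned earlier is left out, and counting every re-seen URL as "updated" is an assumption about what that counter means.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_pdf_urls(sitemap_xml: str) -> List[str]:
    """Return every <loc> URL in the sitemap that points at a PDF."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [u for u in urls if u.lower().endswith(".pdf")]


def sync_from_sitemap(sitemap_xml: str, references: Dict[str, dict]) -> Dict[str, int]:
    """Upsert discovered PDFs into `references`, keyed by pdf_source."""
    counts = {"found": 0, "new": 0, "updated": 0, "error": 0}
    for url in extract_pdf_urls(sitemap_xml):
        counts["found"] += 1
        if url in references:
            # Assumption: a re-seen URL counts as updated; its status
            # is left untouched so ingested entries are not reset.
            counts["updated"] += 1
        else:
            references[url] = {
                "source": "koutoubi",
                "pdf_source": url,
                "status": "discovered",
            }
            counts["new"] += 1
    return counts
```

Keying the table by pdf_source makes a repeated sync idempotent: running it twice on the same sitemap creates no duplicate entries.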
