
Design: Ingestion & Vector Embedding

Overview

The ingestion pipeline converts raw PDF curriculum documents into searchable vector embeddings.

Pipeline Steps

```mermaid
graph LR
    A[PDF File] --> B[Extract Pages<br>PyPDF]
    B --> C[Chunk Text<br>1000 chars / 150 overlap]
    C --> D[Generate Embeddings<br>text-embedding-3-small]
    D --> E[Upsert to Pinecone<br>grade-X-subject namespace]
    D --> F[Store chunks<br>in Postgres]

    style A fill:#ffebee
    style E fill:#e8f5e9
    style F fill:#e3f2fd
```
1. Extraction (pdf_processor.py):
   - Uses PyPDF to extract text and page numbers.
   - Cleans whitespace and normalizes the text.
2. Chunking:
   - Size: 1000 characters.
   - Overlap: 150 characters (ensures context isn't lost at boundaries).
   - Each chunk keeps track of its page_number.
3. Embedding (embedding_service.py):
   - Uses OpenAI text-embedding-3-small.
   - Generates 1536-dimensional vectors.
   - Singleton: embedding_service in app/core/dependencies.py.
4. Storage:
   - Upserts vectors into Pinecone via pinecone_adapter (singleton from dependencies).
   - Uses the grade-{grade}-{subject} namespace.
   - For manual uploads: text is stored in vector metadata for backward compatibility.
   - For the new architecture: text stays in the Postgres chunks table, keeping Pinecone metadata lightweight.
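The chunking step above can be sketched as a small pure function. This is a minimal illustration, not the actual pdf_processor.py API: the function name, the (page_number, text) input shape, and the dict output are assumptions; only the 1000-character size, 150-character overlap, and page_number tracking come from the design.

```python
def chunk_pages(pages, size=1000, overlap=150):
    """Split extracted pages into overlapping chunks.

    `pages` is a list of (page_number, text) tuples. Each chunk records
    the page it came from so citations can point back to the PDF.
    (Illustrative sketch; names and shapes are assumptions.)
    """
    chunks = []
    for page_number, text in pages:
        start = 0
        while start < len(text):
            chunks.append({
                "text": text[start:start + size],
                "page_number": page_number,
            })
            if start + size >= len(text):
                break
            # Step forward by size - overlap so adjacent chunks share
            # 150 characters of context across the boundary.
            start += size - overlap
    return chunks
```

Note that overlapping by 150 characters means a sentence cut at a chunk boundary still appears whole in the neighboring chunk.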

Ingestion Methods

1. Reference-Driven Ingestion (Preferred)

```mermaid
stateDiagram-v2
    [*] --> discovered: Scraper finds PDF
    discovered --> queued: POST /admin/ingest
    queued --> parsing: dispatch picks job
    parsing --> tokenizing: Pages extracted
    tokenizing --> embedding_request_sent: Chunks created
    embedding_request_sent --> embedding_upserted: Vectors generated
    embedding_upserted --> ready: Upserted to Pinecone
    queued --> failed: Error
    parsing --> failed: Error
    tokenizing --> failed: Error
    embedding_request_sent --> failed: Error
    failed --> queued: POST /admin/jobs/requeue
```

Triggered by discovered references from the scraping platform:

1. Discovery: Scrapers find PDFs and store them in the references table (status: discovered).
2. Trigger: Admin calls POST /admin/ingest/{reference_id}.
3. Execution:
   - Status changes to processing.
   - PDF is downloaded from the pdf_source URL (no local upload needed).
   - Text is extracted, chunked, and embedded.
   - Vectors are upserted to Pinecone.
   - Status changes to ready.
4. Metadata: Pinecone index, namespace, and vector counts are synced back to the references table.
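The state diagram above defines which status transitions are legal, including the requeue path out of failed. A guard like the following could reject out-of-order updates; the transition table mirrors the diagram, but the function and constant names are assumptions, not the actual implementation:

```python
# Allowed job status transitions, taken from the ingestion state diagram.
# (Sketch only; the real service may enforce this differently.)
TRANSITIONS = {
    "discovered": {"queued"},
    "queued": {"parsing", "failed"},
    "parsing": {"tokenizing", "failed"},
    "tokenizing": {"embedding_request_sent", "failed"},
    "embedding_request_sent": {"embedding_upserted", "failed"},
    "embedding_upserted": {"ready"},
    "failed": {"queued"},  # POST /admin/jobs/requeue
    "ready": set(),        # terminal state
}

def can_transition(current: str, target: str) -> bool:
    """Return True if moving from `current` to `target` is allowed."""
    return target in TRANSITIONS.get(current, set())
```

Encoding the diagram as data keeps the guard in one place, so adding a state means editing one dict rather than scattered if-checks.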

2. Manual Upload (Legacy)

Admin uploads a PDF directly via POST /admin/upload-curriculum:

- File uploaded as multipart form data.
- Grade, subject, and language specified in form fields.
- Useful for custom content not discovered by scrapers.
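The grade and subject form fields feed the grade-{grade}-{subject} Pinecone namespace described in the pipeline steps. A minimal sketch of that mapping follows; the helper name and the normalization rules (lower-casing, hyphenating spaces) are assumptions, as the design only specifies the template:

```python
def build_namespace(grade: int, subject: str) -> str:
    """Build the Pinecone namespace `grade-{grade}-{subject}`.

    Lower-casing and space-to-hyphen normalization are assumptions;
    the real service may normalize subjects differently.
    """
    return f"grade-{grade}-{subject.strip().lower().replace(' ', '-')}"
```

Normalizing here matters because namespaces are exact-match: "grade-5-Math" and "grade-5-math" would silently become two separate namespaces.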

3. Legacy Endpoint

POST /admin/vector-embedding accepts {"reference_id": "...", "force": false} and redirects internally to the reference-driven ingestion flow.

Service Architecture

All services are injected from app/core/dependencies.py:

- embedding_service: Generates embeddings via OpenAI.
- pinecone_adapter: Upserts/queries vectors in Pinecone.
- supabase_service: Reads/writes reference status and metadata.
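One common way to get the singleton behavior described above is a cached factory in the dependencies module. This is a sketch under that assumption; the class, its constructor arguments, and the factory name are illustrative, not the actual contents of app/core/dependencies.py:

```python
from functools import lru_cache

class EmbeddingService:
    """Placeholder for the OpenAI-backed embedding service (assumed shape)."""
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model

@lru_cache(maxsize=1)
def get_embedding_service() -> EmbeddingService:
    # lru_cache makes every caller share one process-wide instance,
    # so the OpenAI client (and its connection pool) is created once.
    return EmbeddingService()
```

With FastAPI-style injection, route handlers would depend on the factory rather than constructing the service themselves, which keeps tests free to swap in a fake.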

Key Constants

| Parameter     | Value                  |
|---------------|------------------------|
| Chunk Size    | 1000 chars             |
| Chunk Overlap | 150 chars              |
| Model         | text-embedding-3-small |
| Dimensions    | 1536                   |
| Index         | curriculum-1536        |
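These constants could live in a single config module so the chunker, embedding service, and Pinecone adapter stay in sync; the module layout and constant names below are assumptions, while the values come from the table:

```python
# Ingestion constants mirroring the table above (names/location assumed).
CHUNK_SIZE = 1000          # characters per chunk
CHUNK_OVERLAP = 150        # characters shared between adjacent chunks
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536  # must match the Pinecone index dimension
PINECONE_INDEX = "curriculum-1536"
```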
