# Design: Ingestion & Vector Embedding

## Overview

The ingestion pipeline converts raw PDF curriculum documents into searchable vector embeddings.

## Pipeline Steps
```mermaid
graph LR
    A[PDF File] --> B[Extract Pages<br>PyPDF]
    B --> C[Chunk Text<br>1000 chars / 150 overlap]
    C --> D[Generate Embeddings<br>text-embedding-3-small]
    D --> E[Upsert to Pinecone<br>grade-X-subject namespace]
    D --> F[Store chunks<br>in Postgres]
    style A fill:#ffebee
    style E fill:#e8f5e9
    style F fill:#e3f2fd
```
- Extraction (`pdf_processor.py`):
  - Uses `PyPDF` to extract text and page numbers.
  - Cleans and normalizes whitespace.
- Chunking:
  - Size: 1000 characters.
  - Overlap: 150 characters (ensures context isn't lost at boundaries).
  - Each chunk keeps track of its `page_number`.
- Embedding (`embedding_service.py`):
  - Uses OpenAI `text-embedding-3-small`.
  - Generates 1536-dimensional vectors.
  - Singleton: `embedding_service` in `app/core/dependencies.py`.
- Storage:
  - Upserts vectors into Pinecone via `pinecone_adapter` (singleton from dependencies).
  - Uses the `grade-{grade}-{subject}` namespace.
  - For manual uploads: text is stored in vector metadata for backward compatibility.
  - For the new architecture: text stays in the Postgres `chunks` table (lightweight Pinecone metadata).
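The chunking step can be sketched as follows. This is a minimal illustration, not the actual `pdf_processor.py` code; the function name and the `(page_number, text)` input shape are assumptions.

```python
def chunk_pages(pages, size=1000, overlap=150):
    """Split per-page text into overlapping character chunks.

    `pages` is a list of (page_number, text) tuples (hypothetical shape).
    Each chunk records the page it came from, and adjacent chunks within
    a page share `overlap` characters so context isn't lost at boundaries.
    """
    chunks = []
    step = size - overlap  # advance 850 chars per chunk with the defaults
    for page_number, text in pages:
        start = 0
        while start < len(text):
            chunks.append({"page_number": page_number,
                           "text": text[start:start + size]})
            if start + size >= len(text):
                break  # last chunk of this page
            start += step
    return chunks
```

With the default parameters, a 2000-character page yields three chunks (0–1000, 850–1850, 1700–2000), each tagged with its page number.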
## Ingestion Methods

### 1. Reference-Driven Ingestion (Preferred)
```mermaid
stateDiagram-v2
    [*] --> discovered: Scraper finds PDF
    discovered --> queued: POST /admin/ingest
    queued --> parsing: dispatch picks job
    parsing --> tokenizing: Pages extracted
    tokenizing --> embedding_request_sent: Chunks created
    embedding_request_sent --> embedding_upserted: Vectors generated
    embedding_upserted --> ready: Upserted to Pinecone
    queued --> failed: Error
    parsing --> failed: Error
    tokenizing --> failed: Error
    embedding_request_sent --> failed: Error
    failed --> queued: POST /admin/jobs/requeue
```
Triggered by discovered references from the scraping platform:
1. Discovery: Scrapers find PDFs and store them in the `references` table (status: `discovered`).
2. Trigger: Admin calls `POST /admin/ingest/{reference_id}`.
3. Execution:
   - Status changes to `processing`.
   - PDF downloaded from the `pdf_source` URL (no local upload needed).
   - Text extracted, chunked, and embedded.
   - Vectors upserted to Pinecone.
   - Status changes to `ready`.
4. Metadata: Pinecone index, namespace, and vector counts are synced back to the `references` table.
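The execution step amounts to a status-transition wrapper around the pipeline. The sketch below is hypothetical: `supabase_service` and `pipeline` stand in for the real injected services, and the method names are assumptions, not the actual API.

```python
def ingest_reference(reference_id: str, supabase_service, pipeline) -> None:
    """Run the reference-driven ingestion flow for one reference.

    Hypothetical sketch: marks the reference `processing`, runs the
    download/extract/chunk/embed/upsert pipeline, then marks it `ready`
    (or `failed` on any error, so it can be requeued).
    """
    supabase_service.update_status(reference_id, "processing")
    try:
        ref = supabase_service.get_reference(reference_id)
        result = pipeline.run(ref["pdf_source"])  # download, extract, chunk, embed, upsert
        supabase_service.update_status(reference_id, "ready", metadata=result)
    except Exception:
        supabase_service.update_status(reference_id, "failed")
        raise
```

The `failed` terminal state matters because `POST /admin/jobs/requeue` can push a failed reference back into the queue.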
### 2. Manual Upload (Legacy)

Admin uploads a PDF directly via `POST /admin/upload-curriculum`:
- File uploaded as multipart form data.
- Grade, subject, and language specified in form fields.
- Useful for custom content not discovered by scrapers.
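A client-side upload can be built with only the standard library. The sketch below constructs the multipart request by hand; the `grade`, `subject`, and `language` field names come from the form fields above, while the `file` field name, filename, and host URL are placeholder assumptions.

```python
import urllib.request
import uuid

def build_upload_request(pdf_bytes: bytes, grade: str, subject: str,
                         language: str) -> urllib.request.Request:
    """Build a multipart/form-data POST for /admin/upload-curriculum.

    Field names beyond grade/subject/language are assumptions.
    """
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields.
    for name, value in (("grade", grade), ("subject", subject),
                        ("language", language)):
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    # The PDF itself.
    parts.append(
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="file"; filename="curriculum.pdf"\r\n'
         "Content-Type: application/pdf\r\n\r\n").encode()
        + pdf_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        "http://localhost:8000/admin/upload-curriculum",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Sending the prepared request is then a single `urllib.request.urlopen(req)` call.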
### 3. Legacy Endpoint

`POST /admin/vector-embedding` accepts `{"reference_id": "...", "force": false}` and redirects internally to the reference-driven ingestion flow.
## Service Architecture

All services are injected from `app/core/dependencies.py`:
- `embedding_service`: Generates embeddings via OpenAI.
- `pinecone_adapter`: Upserts/queries vectors in Pinecone.
- `supabase_service`: Reads/writes reference status and metadata.
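The singleton wiring can be sketched with `functools.lru_cache`; this is an illustrative pattern, not the actual `dependencies.py`, and `EmbeddingService` here is a stand-in class.

```python
from functools import lru_cache

class EmbeddingService:
    """Stand-in for the real OpenAI-backed embedding service."""
    def __init__(self, model: str):
        self.model = model

@lru_cache(maxsize=1)
def get_embedding_service() -> EmbeddingService:
    # lru_cache(maxsize=1) means the service is constructed once;
    # every caller receives the same instance (module-level singleton).
    return EmbeddingService(model="text-embedding-3-small")
```

The same pattern would apply to `pinecone_adapter` and `supabase_service`, so all routes share one client per process.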
## Key Constants
| Parameter | Value |
|---|---|
| Chunk Size | 1000 chars |
| Chunk Overlap | 150 chars |
| Model | `text-embedding-3-small` |
| Dimensions | 1536 |
| Index | `curriculum-1536` |
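These constants, together with the namespace convention, might live in a small config module. This is a sketch; the module layout and names are illustrative.

```python
# Illustrative constants module mirroring the table above.
CHUNK_SIZE = 1000        # characters per chunk
CHUNK_OVERLAP = 150      # characters shared between adjacent chunks
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 1536     # vector dimensions for text-embedding-3-small
PINECONE_INDEX = "curriculum-1536"

def namespace_for(grade: int, subject: str) -> str:
    """Build the grade-{grade}-{subject} Pinecone namespace."""
    return f"grade-{grade}-{subject}"
```

Centralizing these values keeps the chunker, embedder, and Pinecone adapter in agreement about sizes, dimensions, and namespaces.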