
Phase E Implementation Complete

Date: 2026-02-17
Branch: feature/sonnet-impl-20260217-155229
Status: ✅ Implemented (Testing Deferred)


Summary

Phase E implements scraper hardening with automated quality control:

  • SimHash Deduplication: 64-bit fingerprints with Hamming distance ≤ 3
  • Arabic Text Canonicalization: Alef unification, tatweel removal, whitespace normalization
  • Content Quality Heuristics: Min length, OCR confidence, encoding validation
  • Automated Pipeline: Scraper → canonicalize → dedupe → quality check → insert

What Was Implemented

S13: SimHash Deduplication

New: app/services/deduplication.py

Features:

  • SimHasher: Computes 64-bit SimHash for documents
  • hamming_distance(): Calculates bit differences between hashes
  • is_duplicate(): Checks if Hamming distance ≤ 3 (default threshold)
  • DeduplicationService: Finds canonical references for duplicates

Algorithm:

  1. Tokenize document text into words
  2. For each word: compute MD5 hash
  3. Build 64-bit accumulator vector based on hash bits
  4. Final SimHash = bitwise majority of vector
  5. Compare with existing fingerprints using Hamming distance

Duplicate Detection:

  • Hamming distance ≤ 3 → considered duplicate
  • Links to canonical reference via references.canonical_id
  • First-discovered document becomes canonical

Storage:

  • references.content_fingerprint: BIGINT (64-bit SimHash)
  • Indexed for fast duplicate lookup
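
For illustration, a minimal sketch of the algorithm above; the real deduplication.py may differ in tokenization, weighting, and storage details:

import hashlib

class SimHasher:
    """Minimal sketch of the SimHash scheme described above; illustrative only."""

    def simhash(self, text: str, bits: int = 64) -> int:
        vector = [0] * bits                      # per-bit vote accumulator
        for token in text.lower().split():       # 1. tokenize into words
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)  # 2. MD5 per word
            for i in range(bits):                # 3. vote +1/-1 based on each hash bit
                vector[i] += 1 if (h >> i) & 1 else -1
        # 4. final SimHash = bitwise majority of the accumulator
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming_distance(self, a: int, b: int) -> int:
        return bin(a ^ b).count("1")             # 5. number of differing bits

    def is_duplicate(self, a: int, b: int, threshold: int = 3) -> bool:
        return self.hamming_distance(a, b) <= threshold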

S14: Arabic Text Canonicalization

New: app/services/text_normalizer.py

Features:

  1. Alef Unification:
     • أ, إ, آ, ٱ → ا (plain alef)
     • Ensures consistent matching across documents

  2. Tatweel Removal:
     • ـ (kashida/elongation) → removed
     • Reduces noise in Arabic text

  3. Taa Marbuta Normalization (optional):
     • ة → ه (context-dependent, disabled by default)

  4. Hamza Normalization (optional):
     • ؤ → و, ئ → ي

  5. Whitespace Normalization:
     • Collapse multiple spaces/tabs/newlines → single space
     • Trim leading/trailing whitespace

  6. Boilerplate Removal:
     • Regex patterns for page numbers, copyright, URLs
     • Configurable per source

Usage:

normalizer = TextNormalizer()

# Full canonicalization
canonical = normalizer.canonicalize(
    text=raw_text,
    language="ar",
    boilerplate_patterns=custom_patterns
)
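
Under the hood, the Arabic steps reduce to a handful of substitutions. A minimal sketch assuming a regex-based approach; the actual text_normalizer.py internals may differ:

import re

ALEF_VARIANTS = re.compile("[أإآٱ]")    # unify hamza/madda/wasla alefs to plain ا
TATWEEL = re.compile("ـ+")              # kashida / elongation characters
WHITESPACE = re.compile(r"\s+")

def canonicalize_ar(text: str, boilerplate_patterns: list[str] | None = None) -> str:
    text = ALEF_VARIANTS.sub("ا", text)             # Alef unification
    text = TATWEEL.sub("", text)                    # Tatweel removal
    for pattern in boilerplate_patterns or []:      # Boilerplate removal (per source)
        text = re.sub(pattern, " ", text)
    return WHITESPACE.sub(" ", text).strip()        # Whitespace normalization + trim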

S15: Content Quality Heuristics

New: app/services/quality_checker.py

Quality Checks:

Check                             | Threshold                         | Action
----------------------------------|-----------------------------------|------------------
Minimum text length               | ≥ 200 chars (after normalization) | Reject page
OCR confidence (Arabic/Hassaniya) | ≥ 0.70                            | Reject if below
OCR confidence (French)           | ≥ 0.80                            | Reject if below
Encoding                          | Valid UTF-8                       | Reject if invalid
File size                         | ≤ 100 MB                          | Reject if larger

QualityChecker Methods:

  • check_text_length(): Ensures minimum content
  • check_ocr_confidence(): Language-specific thresholds
  • check_encoding(): UTF-8 validation
  • check_file_size(): Size limits
  • check_all(): Run all checks and return combined result

Result Format:

from typing import List, Optional

class QualityCheckResult:
    passed: bool
    reason: Optional[str]  # Failure reason
    warnings: List[str]  # Non-fatal warnings
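
A minimal sketch of how check_all could combine the thresholds from the table above. The dataclass is a stand-in for the documented result type so the snippet runs standalone; the language codes and fallback threshold are assumptions, and the encoding check is omitted since a Python str is already decoded:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QualityCheckResult:                      # stand-in for the documented result type
    passed: bool
    reason: Optional[str] = None
    warnings: List[str] = field(default_factory=list)

# Thresholds from the table above ("ar" covers Arabic/Hassaniya); fallback is an assumption.
OCR_THRESHOLDS = {"ar": 0.70, "fr": 0.80}
MIN_TEXT_CHARS = 200
MAX_FILE_BYTES = 100 * 1024 * 1024

def check_all(text: str, language: str, ocr_confidence: float,
              file_size_bytes: int) -> QualityCheckResult:
    if len(text) < MIN_TEXT_CHARS:
        return QualityCheckResult(False, f"Text too short: {len(text)} chars < {MIN_TEXT_CHARS} minimum")
    threshold = OCR_THRESHOLDS.get(language, 0.70)
    if ocr_confidence < threshold:
        return QualityCheckResult(False, f"Low OCR confidence: {ocr_confidence:.2f} < {threshold:.2f} threshold")
    if file_size_bytes > MAX_FILE_BYTES:
        return QualityCheckResult(False, f"File too large: {file_size_bytes} bytes > 100 MB limit")
    return QualityCheckResult(True)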

Scraper Pipeline Integration

New: app/services/scraper_service.py

Full Pipeline:

Scraper Fetches Document
[Canonicalize Text] → Arabic normalization, whitespace, boilerplate removal
[Quality Check] → Min length, OCR confidence, encoding
    ↓ (Pass)
[Compute SimHash] → 64-bit fingerprint
[Check Duplicates] → Query existing fingerprints, Hamming ≤ 3?
[Insert Reference] → Link to canonical if duplicate, else mark as canonical
[Auto-Trigger Ingestion] → Create ingestion job for new canonical documents
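
Concretely, these stages compose along the following lines. The triage function, its parameters, and the in-memory fingerprint map are illustrative assumptions; the method names come from the services above, but the real scraper_service.py persists rows and queries the indexed content_fingerprint column instead:

from app.services.text_normalizer import TextNormalizer
from app.services.quality_checker import QualityChecker
from app.services.deduplication import SimHasher

def triage(raw_text: str, language: str, ocr_confidence: float,
           file_size_bytes: int, known_fingerprints: dict[str, int]) -> str:
    """Classify one scraped document as 'quality_failed', 'duplicate', or 'new'.
    known_fingerprints maps existing reference ids to stored 64-bit SimHashes."""
    text = TextNormalizer().canonicalize(text=raw_text, language=language)
    quality = QualityChecker().check_all(
        text=text, language=language,
        ocr_confidence=ocr_confidence, file_size_bytes=file_size_bytes,
    )
    if not quality.passed:
        return "quality_failed"              # logged to ingestion_audit, never inserted
    hasher = SimHasher()
    fingerprint = hasher.simhash(text)
    for ref_id, existing_fp in known_fingerprints.items():
        if hasher.is_duplicate(fingerprint, existing_fp, threshold=3):
            return "duplicate"               # insert with canonical_id = ref_id
    return "new"                             # insert as canonical; auto-trigger ingestion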

Endpoints:

New: app/api/routers/scraper_admin.py

  • POST /admin/scraping/{source}/sync: Trigger scraper sync
  • GET /admin/scraping/{source}/references: List references
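
As a usage sketch, a sync can be triggered from any HTTP client; the base URL is a placeholder and authentication handling is omitted here:

import requests

# Trigger a sync for the "koutoubi" source; host and auth are assumptions.
resp = requests.post("http://localhost:8000/admin/scraping/koutoubi/sync")
print(resp.json())  # sync statistics, see the response shape below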

Response:

{
  "run_id": "uuid",
  "source": "koutoubi",
  "status": "success",
  "found": 15,
  "new": 3,
  "duplicates": 2,
  "quality_failed": 0,
  "errors": 0
}


Quality Check Examples

Example 1: Page Too Short

checker = QualityChecker()

text = "Page 1"  # Only 6 chars
result = checker.check_text_length(text)

# result.passed = False
# result.reason = "Text too short: 6 chars < 200 minimum"

Example 2: Low OCR Confidence

result = checker.check_ocr_confidence(
    confidence=0.65,
    language="ar"
)

# result.passed = False
# result.reason = "Low OCR confidence: 0.65 < 0.70 threshold"

Example 3: All Checks Pass

result = checker.check_all(
    text="Long document text here..." * 50,  # 1000+ chars
    language="fr",
    ocr_confidence=0.95,
    file_size_bytes=5_000_000  # 5 MB
)

# result.passed = True

SimHash Deduplication Examples

Example 1: Identical Documents

hasher = SimHasher()

text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance = 0 (identical)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True

Example 2: Similar Documents (Minor Differences)

text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students studying algebra."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 2-3 (very similar)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True or False (depends on exact distance)

Example 3: Different Documents

text1 = "Math textbook for grade 12."
text2 = "History textbook for grade 10."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 20-30 (very different)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = False

Testing Checklist

SimHash Deduplication (S13)

  • [ ] Identical documents produce identical SimHash
  • [ ] Very similar documents have Hamming ≤ 3
  • [ ] Different documents have Hamming > 3
  • [ ] Duplicate detection finds canonical reference
  • [ ] First-discovered document marked as canonical

Arabic Canonicalization (S14)

  • [ ] Alef variants unified correctly
  • [ ] Tatweel removed from text
  • [ ] Whitespace normalized (multiple spaces → single space)
  • [ ] Boilerplate patterns removed
  • [ ] Canonicalized text is consistent across equivalent inputs

Quality Checks (S15)

  • [ ] Short pages (< 200 chars) rejected
  • [ ] Low OCR confidence (< 0.70 for Arabic) rejected
  • [ ] Invalid UTF-8 encoding rejected
  • [ ] Large files (> 100 MB) rejected
  • [ ] Quality failures logged to ingestion_audit

Scraper Integration

  • [ ] Scraper sync processes all documents
  • [ ] Duplicates linked to canonical references
  • [ ] Quality-failed documents logged (not inserted)
  • [ ] New canonical documents auto-trigger ingestion jobs
  • [ ] Sync statistics accurate (new, duplicates, quality_failed, errors)

Data Flow

New Document (No Duplicate)

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize (Arabic normalization)
Quality check → PASS
Compute SimHash = 0x1a2b3c4d...
Check duplicates → NONE FOUND
Insert into references:
  - content_fingerprint = 0x1a2b3c4d...
  - canonical_id = NULL (self is canonical)
  - status = 'discovered'
Auto-trigger ingestion job

Duplicate Document

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize
Quality check → PASS
Compute SimHash = 0x1a2b3c4e...
Check duplicates → Hamming(0x1a2b3c4e, 0x1a2b3c4d) = 2 ≤ 3
Insert into references:
  - content_fingerprint = 0x1a2b3c4e...
  - canonical_id = <existing-ref-id>  ← Points to canonical
  - status = 'duplicate'
NO ingestion job (duplicate not processed)

Quality-Failed Document

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize
Quality check → FAIL (only 50 chars)
Log to ingestion_audit:
  - reason: "Text too short: 50 chars < 200 minimum"
NOT inserted into references

Next Phase: Phase F - Observability & DR

Once Phase E tests pass:

  • S17: Structured logging & metrics (already done!)
  • S18: Wallet reconciliation job
  • S19: Reindex & disaster recovery scripts

Files Changed

New Files

  • app/services/text_normalizer.py - Arabic canonicalization (S14)
  • app/services/deduplication.py - SimHash deduplication (S13)
  • app/services/quality_checker.py - Quality heuristics (S15)
  • app/services/scraper_service.py - Integrated scraper pipeline
  • app/api/routers/scraper_admin.py - Admin scraper endpoints

Status: ✅ Phase E Complete - Ready for Testing


See SONNET_RUN.md for full implementation log