
Phase E Implementation Complete

Date: 2026-02-17
Branch: feature/sonnet-impl-20260217-155229
Status: ✅ Implemented (Testing Deferred)


Summary

Phase E implements scraper hardening with automated quality control:

  • SimHash Deduplication: 64-bit fingerprints with Hamming distance ≤ 3
  • Arabic Text Canonicalization: Alef unification, tatweel removal, whitespace normalization
  • Content Quality Heuristics: Min length, OCR confidence, encoding validation
  • Automated Pipeline: Scraper → canonicalize → dedupe → quality check → insert

What Was Implemented

S13: SimHash Deduplication

New: app/services/deduplication.py

Features:

  • SimHasher: Computes 64-bit SimHash for documents
  • hamming_distance(): Calculates bit differences between hashes
  • is_duplicate(): Checks if Hamming distance ≤ 3 (default threshold)
  • DeduplicationService: Finds canonical references for duplicates

Algorithm:

  1. Tokenize document text into words
  2. For each word: compute MD5 hash
  3. Build 64-bit accumulator vector based on hash bits
  4. Final SimHash = bitwise majority of vector
  5. Compare with existing fingerprints using Hamming distance

Duplicate Detection:

  • Hamming distance ≤ 3 → considered duplicate
  • Links to canonical reference via references.canonical_id
  • First-discovered document becomes canonical

Storage:

  • references.content_fingerprint: BIGINT (64-bit SimHash)
  • Indexed for fast duplicate lookup
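
For illustration, a minimal sketch of the algorithm above; the real deduplication.py may differ in tokenization, weighting, and storage details:

import hashlib

class SimHasher:
    """Minimal sketch of the SimHash scheme described above; illustrative only."""

    def simhash(self, text: str, bits: int = 64) -> int:
        vector = [0] * bits                      # per-bit vote accumulator
        for token in text.lower().split():       # 1. tokenize into words
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)  # 2. MD5 per word
            for i in range(bits):                # 3. vote +1/-1 based on each hash bit
                vector[i] += 1 if (h >> i) & 1 else -1
        # 4. final SimHash = bitwise majority of the accumulator
        return sum(1 << i for i in range(bits) if vector[i] > 0)

    def hamming_distance(self, a: int, b: int) -> int:
        return bin(a ^ b).count("1")             # 5. number of differing bits

    def is_duplicate(self, a: int, b: int, threshold: int = 3) -> bool:
        return self.hamming_distance(a, b) <= threshold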

S14: Arabic Text Canonicalization

New: app/services/text_normalizer.py

Features:

  1. Alef Unification:
     • أ, إ, آ, ٱ → ا (plain alef)
     • Ensures consistent matching across documents

  2. Tatweel Removal:
     • ـ (kashida/elongation) → removed
     • Reduces noise in Arabic text

  3. Taa Marbuta Normalization (optional):
     • ة → ه (context-dependent, disabled by default)

  4. Hamza Normalization (optional):
     • ؤ → و, ئ → ي

  5. Whitespace Normalization:
     • Collapse multiple spaces/tabs/newlines → single space
     • Trim leading/trailing whitespace

  6. Boilerplate Removal:
     • Regex patterns for page numbers, copyright, URLs
     • Configurable per source

Usage:

normalizer = TextNormalizer()

# Full canonicalization
canonical = normalizer.canonicalize(
    text=raw_text,
    language="ar",
    boilerplate_patterns=custom_patterns
)
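
Under the hood, the Arabic steps reduce to a handful of substitutions. A minimal sketch assuming a regex-based approach; the actual text_normalizer.py internals may differ:

import re

ALEF_VARIANTS = re.compile("[أإآٱ]")    # unify hamza/madda/wasla alefs to plain ا
TATWEEL = re.compile("ـ+")              # kashida / elongation characters
WHITESPACE = re.compile(r"\s+")

def canonicalize_ar(text: str, boilerplate_patterns: list[str] | None = None) -> str:
    text = ALEF_VARIANTS.sub("ا", text)             # Alef unification
    text = TATWEEL.sub("", text)                    # Tatweel removal
    for pattern in boilerplate_patterns or []:      # Boilerplate removal (per source)
        text = re.sub(pattern, " ", text)
    return WHITESPACE.sub(" ", text).strip()        # Whitespace normalization + trim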

S15: Content Quality Heuristics

New: app/services/quality_checker.py

Quality Checks:

Check                             | Threshold                         | Action
----------------------------------|-----------------------------------|------------------
Minimum text length               | ≥ 200 chars (after normalization) | Reject page
OCR confidence (Arabic/Hassaniya) | ≥ 0.70                            | Reject if below
OCR confidence (French)           | ≥ 0.80                            | Reject if below
Encoding                          | Valid UTF-8                       | Reject if invalid
File size                         | ≤ 100 MB                          | Reject if larger

QualityChecker Methods:

  • check_text_length(): Ensures minimum content
  • check_ocr_confidence(): Language-specific thresholds
  • check_encoding(): UTF-8 validation
  • check_file_size(): Size limits
  • check_all(): Run all checks and return combined result

Result Format:

from typing import List, Optional

class QualityCheckResult:
    passed: bool
    reason: Optional[str]  # Failure reason
    warnings: List[str]  # Non-fatal warnings
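
A minimal sketch of how check_all could combine the thresholds from the table above. The dataclass is a stand-in for the documented result type so the snippet runs standalone; the language codes and fallback threshold are assumptions, and the encoding check is omitted since a Python str is already decoded:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QualityCheckResult:                      # stand-in for the documented result type
    passed: bool
    reason: Optional[str] = None
    warnings: List[str] = field(default_factory=list)

# Thresholds from the table above ("ar" covers Arabic/Hassaniya); fallback is an assumption.
OCR_THRESHOLDS = {"ar": 0.70, "fr": 0.80}
MIN_TEXT_CHARS = 200
MAX_FILE_BYTES = 100 * 1024 * 1024

def check_all(text: str, language: str, ocr_confidence: float,
              file_size_bytes: int) -> QualityCheckResult:
    if len(text) < MIN_TEXT_CHARS:
        return QualityCheckResult(False, f"Text too short: {len(text)} chars < {MIN_TEXT_CHARS} minimum")
    threshold = OCR_THRESHOLDS.get(language, 0.70)
    if ocr_confidence < threshold:
        return QualityCheckResult(False, f"Low OCR confidence: {ocr_confidence:.2f} < {threshold:.2f} threshold")
    if file_size_bytes > MAX_FILE_BYTES:
        return QualityCheckResult(False, f"File too large: {file_size_bytes} bytes > 100 MB limit")
    return QualityCheckResult(True)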

Scraper Pipeline Integration

New: app/services/scraper_service.py

Full Pipeline:

Scraper Fetches Document
[Canonicalize Text] → Arabic normalization, whitespace, boilerplate removal
[Quality Check] → Min length, OCR confidence, encoding
    ↓ (Pass)
[Compute SimHash] → 64-bit fingerprint
[Check Duplicates] → Query existing fingerprints, Hamming ≤ 3?
[Insert Reference] → Link to canonical if duplicate, else mark as canonical
[Auto-Trigger Ingestion] → Create ingestion job for new canonical documents
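
Concretely, these stages compose along the following lines. The triage function, its parameters, and the in-memory fingerprint map are illustrative assumptions; the method names come from the services above, but the real scraper_service.py persists rows and queries the indexed content_fingerprint column instead:

from app.services.text_normalizer import TextNormalizer
from app.services.quality_checker import QualityChecker
from app.services.deduplication import SimHasher

def triage(raw_text: str, language: str, ocr_confidence: float,
           file_size_bytes: int, known_fingerprints: dict[str, int]) -> str:
    """Classify one scraped document as 'quality_failed', 'duplicate', or 'new'.
    known_fingerprints maps existing reference ids to stored 64-bit SimHashes."""
    text = TextNormalizer().canonicalize(text=raw_text, language=language)
    quality = QualityChecker().check_all(
        text=text, language=language,
        ocr_confidence=ocr_confidence, file_size_bytes=file_size_bytes,
    )
    if not quality.passed:
        return "quality_failed"              # logged to ingestion_audit, never inserted
    hasher = SimHasher()
    fingerprint = hasher.simhash(text)
    for ref_id, existing_fp in known_fingerprints.items():
        if hasher.is_duplicate(fingerprint, existing_fp, threshold=3):
            return "duplicate"               # insert with canonical_id = ref_id
    return "new"                             # insert as canonical; auto-trigger ingestion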

Endpoints:

New: app/api/routers/scraper_admin.py

  • POST /admin/scraping/{source}/sync: Trigger scraper sync
  • GET /admin/scraping/{source}/references: List references
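
As a usage sketch, a sync can be triggered from any HTTP client; the base URL is a placeholder and authentication handling is omitted here:

import requests

# Trigger a sync for the "koutoubi" source; host and auth are assumptions.
resp = requests.post("http://localhost:8000/admin/scraping/koutoubi/sync")
print(resp.json())  # sync statistics, see the response shape below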

Response:

{
  "run_id": "uuid",
  "source": "koutoubi",
  "status": "success",
  "found": 15,
  "new": 3,
  "duplicates": 2,
  "quality_failed": 0,
  "errors": 0
}


Quality Check Examples

Example 1: Page Too Short

checker = QualityChecker()

text = "Page 1"  # Only 6 chars
result = checker.check_text_length(text)

# result.passed = False
# result.reason = "Text too short: 6 chars < 200 minimum"

Example 2: Low OCR Confidence

result = checker.check_ocr_confidence(
    confidence=0.65,
    language="ar"
)

# result.passed = False
# result.reason = "Low OCR confidence: 0.65 < 0.70 threshold"

Example 3: All Checks Pass

result = checker.check_all(
    text="Long document text here..." * 50,  # 1000+ chars
    language="fr",
    ocr_confidence=0.95,
    file_size_bytes=5_000_000  # 5 MB
)

# result.passed = True

SimHash Deduplication Examples

Example 1: Identical Documents

hasher = SimHasher()

text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance = 0 (identical)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True

Example 2: Similar Documents (Minor Differences)

text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students studying algebra."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 2-3 (very similar)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True or False (depends on exact distance)

Example 3: Different Documents

text1 = "Math textbook for grade 12."
text2 = "History textbook for grade 10."

hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)

distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 20-30 (very different)

is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = False

Testing Checklist

SimHash Deduplication (S13)

  • [ ] Identical documents produce identical SimHash
  • [ ] Very similar documents have Hamming ≤ 3
  • [ ] Different documents have Hamming > 3
  • [ ] Duplicate detection finds canonical reference
  • [ ] First-discovered document marked as canonical

Arabic Canonicalization (S14)

  • [ ] Alef variants unified correctly
  • [ ] Tatweel removed from text
  • [ ] Whitespace normalized (multiple spaces → single space)
  • [ ] Boilerplate patterns removed
  • [ ] Canonicalized text is consistent across equivalent inputs

Quality Checks (S15)

  • [ ] Short pages (< 200 chars) rejected
  • [ ] Low OCR confidence (< 0.70 for Arabic) rejected
  • [ ] Invalid UTF-8 encoding rejected
  • [ ] Large files (> 100 MB) rejected
  • [ ] Quality failures logged to ingestion_audit

Scraper Integration

  • [ ] Scraper sync processes all documents
  • [ ] Duplicates linked to canonical references
  • [ ] Quality-failed documents logged (not inserted)
  • [ ] New canonical documents auto-trigger ingestion jobs
  • [ ] Sync statistics accurate (new, duplicates, quality_failed, errors)

Data Flow

New Document (No Duplicate)

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize (Arabic normalization)
Quality check → PASS
Compute SimHash = 0x1a2b3c4d...
Check duplicates → NONE FOUND
Insert into references:
  - content_fingerprint = 0x1a2b3c4d...
  - canonical_id = NULL (self is canonical)
  - status = 'discovered'
Auto-trigger ingestion job

Duplicate Document

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize
Quality check → PASS
Compute SimHash = 0x1a2b3c4e...
Check duplicates → Hamming(0x1a2b3c4e, 0x1a2b3c4d) = 2 ≤ 3
Insert into references:
  - content_fingerprint = 0x1a2b3c4e...
  - canonical_id = <existing-ref-id>  ← Points to canonical
  - status = 'duplicate'
NO ingestion job (duplicate not processed)

Quality-Failed Document

Scraper fetches PDF
Extract text (PyPDF/OCR)
Canonicalize
Quality check → FAIL (only 50 chars)
Log to ingestion_audit:
  - reason: "Text too short: 50 chars < 200 minimum"
NOT inserted into references

Next Phase: Phase F - Observability & DR

Once Phase E tests pass:

  • S17: Structured logging & metrics (already done!)
  • S18: Wallet reconciliation job
  • S19: Reindex & disaster recovery scripts

Files Changed

New Files

  • app/services/text_normalizer.py - Arabic canonicalization (S14)
  • app/services/deduplication.py - SimHash deduplication (S13)
  • app/services/quality_checker.py - Quality heuristics (S15)
  • app/services/scraper_service.py - Integrated scraper pipeline
  • app/api/routers/scraper_admin.py - Admin scraper endpoints

Status: ✅ Phase E Complete - Ready for Testing


See SONNET_RUN.md for full implementation log