Phase E Implementation Complete
Date: 2026-02-17
Branch: feature/sonnet-impl-20260217-155229
Status: ✅ Implemented (Testing Deferred)
Summary
Phase E implements scraper hardening with automated quality control:
- SimHash Deduplication: 64-bit fingerprints with Hamming distance ≤ 3
- Arabic Text Canonicalization: Alef unification, tatweel removal, whitespace normalization
- Content Quality Heuristics: Min length, OCR confidence, encoding validation
- Automated Pipeline: Scraper → canonicalize → dedupe → quality check → insert
What Was Implemented
S13: SimHash Deduplication
New: app/services/deduplication.py
Features:
- SimHasher: Computes 64-bit SimHash for documents
- hamming_distance(): Calculates bit differences between hashes
- is_duplicate(): Checks if Hamming distance ≤ 3 (default threshold)
- DeduplicationService: Finds canonical references for duplicates
Algorithm:
1. Tokenize document text into words
2. For each word, compute an MD5 hash
3. Build a 64-bit accumulator vector from the hash bits
4. Final SimHash = bitwise majority of the vector
5. Compare with existing fingerprints using Hamming distance
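A minimal sketch of these steps, for reference (the real SimHasher in app/services/deduplication.py may tokenize and weight terms differently):

import hashlib
import re

class SimHasher:
    """Illustrative 64-bit SimHash, following the steps listed above."""
    BITS = 64

    def simhash(self, text: str) -> int:
        tokens = re.findall(r"\w+", text.lower())
        acc = [0] * self.BITS
        for token in tokens:
            # 64-bit slice of the token's MD5 digest
            h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
            for bit in range(self.BITS):
                acc[bit] += 1 if (h >> bit) & 1 else -1
        # Bitwise majority vote across all tokens
        return sum(1 << bit for bit in range(self.BITS) if acc[bit] > 0)

    def hamming_distance(self, a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def is_duplicate(self, a: int, b: int, threshold: int = 3) -> bool:
        return self.hamming_distance(a, b) <= threshold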
Duplicate Detection:
- Hamming distance ≤ 3 → considered duplicate
- Links to canonical reference via references.canonical_id
- First-discovered document becomes canonical
Storage:
- references.content_fingerprint: BIGINT (64-bit SimHash)
- Indexed for fast duplicate lookup
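One storage note: BIGINT is a signed 64-bit type, so if the SimHash is produced as an unsigned 64-bit integer it has to be mapped into the signed range before insertion. A common conversion (an assumption here, not taken from the implementation) looks like:

def to_signed_64(value: int) -> int:
    # Map an unsigned 64-bit fingerprint into BIGINT's signed range.
    return value - (1 << 64) if value >= (1 << 63) else value

def to_unsigned_64(value: int) -> int:
    # Inverse mapping when reading the fingerprint back out.
    return value + (1 << 64) if value < 0 else value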
S14: Arabic Text Canonicalization
New: app/services/text_normalizer.py
Features:
- Alef Unification:
  - أ, إ, آ, ٱ → ا (plain alef)
  - Ensures consistent matching across documents
- Tatweel Removal:
  - ـ (kashida/elongation) → removed
  - Reduces noise in Arabic text
- Taa Marbuta Normalization (optional):
  - ة → ه (context-dependent, disabled by default)
- Hamza Normalization (optional):
  - ؤ → و, ئ → ي
- Whitespace Normalization:
  - Collapse multiple spaces/tabs/newlines → single space
  - Trim leading/trailing whitespace
- Boilerplate Removal:
  - Regex patterns for page numbers, copyright, URLs
  - Configurable per source
Usage:
from app.services.text_normalizer import TextNormalizer

normalizer = TextNormalizer()

# Full canonicalization
canonical = normalizer.canonicalize(
    text=raw_text,
    language="ar",
    boilerplate_patterns=custom_patterns,
)
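For reference, the alef and tatweel steps reduce to a small translation table; this is a sketch, not the exact code in text_normalizer.py:

import re

# Illustrative character-level normalization only (no boilerplate removal here).
ALEF_MAP = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا", "ٱ": "ا"})
TATWEEL = "\u0640"  # ـ (kashida)

def normalize_arabic(text: str) -> str:
    text = text.translate(ALEF_MAP)            # unify alef variants
    text = text.replace(TATWEEL, "")           # strip elongation marks
    return re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace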
S15: Content Quality Heuristics
New: app/services/quality_checker.py
Quality Checks:
| Check | Threshold | Action |
|---|---|---|
| Minimum text length | ≥ 200 chars (after normalization) | Reject page |
| OCR confidence (Arabic/Hassaniya) | ≥ 0.70 | Reject if below |
| OCR confidence (French) | ≥ 0.80 | Reject if below |
| Encoding | Valid UTF-8 | Reject if invalid |
| File size | ≤ 100 MB | Reject if larger |
QualityChecker Methods:
- check_text_length(): Ensures minimum content
- check_ocr_confidence(): Language-specific thresholds
- check_encoding(): UTF-8 validation
- check_file_size(): Size limits
- check_all(): Run all checks and return combined result
Result Format:
from typing import List, Optional

class QualityCheckResult:
    passed: bool
    reason: Optional[str]   # Failure reason
    warnings: List[str]     # Non-fatal warnings
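As an illustration of the language-specific thresholds, the OCR check could look roughly like this (the dataclass form and threshold dict are assumptions; the values come from the table above):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QualityCheckResult:  # mirrors the result shape above; dataclass form assumed
    passed: bool
    reason: Optional[str] = None
    warnings: List[str] = field(default_factory=list)

OCR_THRESHOLDS = {"ar": 0.70, "fr": 0.80}  # Arabic/Hassaniya vs. French, per the table

def check_ocr_confidence(confidence: float, language: str) -> QualityCheckResult:
    threshold = OCR_THRESHOLDS.get(language, 0.80)
    if confidence < threshold:
        return QualityCheckResult(
            passed=False,
            reason=f"Low OCR confidence: {confidence:.2f} < {threshold:.2f} threshold",
        )
    return QualityCheckResult(passed=True)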
Scraper Pipeline Integration
New: app/services/scraper_service.py
Full Pipeline:
Scraper Fetches Document
↓
[Canonicalize Text] → Arabic normalization, whitespace, boilerplate removal
↓
[Quality Check] → Min length, OCR confidence, encoding
↓ (Pass)
[Compute SimHash] → 64-bit fingerprint
↓
[Check Duplicates] → Query existing fingerprints, Hamming ≤ 3?
↓
[Insert Reference] → Link to canonical if duplicate, else mark as canonical
↓
[Auto-Trigger Ingestion] → Create ingestion job for new canonical documents
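A condensed sketch of the glue between these steps (the real scraper_service.py also persists references, writes audit rows, and enqueues ingestion jobs; the duplicate lookup below is simplified to an in-memory scan):

from app.services.deduplication import SimHasher
from app.services.quality_checker import QualityChecker
from app.services.text_normalizer import TextNormalizer

normalizer = TextNormalizer()
checker = QualityChecker()
hasher = SimHasher()

def process_document(raw_text: str, language: str, ocr_confidence: float,
                     file_size_bytes: int, known_fingerprints: dict) -> dict:
    # known_fingerprints: {simhash (int) -> reference id} for already-stored documents
    text = normalizer.canonicalize(text=raw_text, language=language)

    quality = checker.check_all(text=text, language=language,
                                ocr_confidence=ocr_confidence,
                                file_size_bytes=file_size_bytes)
    if not quality.passed:
        return {"status": "quality_failed", "reason": quality.reason}

    fingerprint = hasher.simhash(text)
    for known_hash, ref_id in known_fingerprints.items():
        if hasher.is_duplicate(fingerprint, known_hash, threshold=3):
            return {"status": "duplicate", "canonical_id": ref_id,
                    "content_fingerprint": fingerprint}

    return {"status": "new", "content_fingerprint": fingerprint}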
Endpoints:
New: app/api/routers/scraper_admin.py
- POST /admin/scraping/{source}/sync: Trigger scraper sync
- GET /admin/scraping/{source}/references: List references
Response:
{
"run_id": "uuid",
"source": "koutoubi",
"status": "success",
"found": 15,
"new": 3,
"duplicates": 2,
"quality_failed": 0,
"errors": 0
}
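For orientation, the sync endpoint could be wired roughly as below, assuming a FastAPI router (consistent with the app/api/routers/ layout); the handler body is a placeholder rather than the real pipeline call:

from uuid import uuid4

from fastapi import APIRouter

router = APIRouter(prefix="/admin/scraping", tags=["scraper-admin"])

@router.post("/{source}/sync")
async def trigger_sync(source: str) -> dict:
    # Placeholder: the real handler runs the scraper pipeline and returns
    # the run summary in the shape shown above.
    return {
        "run_id": str(uuid4()),
        "source": source,
        "status": "success",
        "found": 0,
        "new": 0,
        "duplicates": 0,
        "quality_failed": 0,
        "errors": 0,
    }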
Quality Check Examples
Example 1: Page Too Short
from app.services.quality_checker import QualityChecker

checker = QualityChecker()
text = "Page 1"  # Only 6 chars
result = checker.check_text_length(text)
# result.passed = False
# result.reason = "Text too short: 6 chars < 200 minimum"
Example 2: Low OCR Confidence
result = checker.check_ocr_confidence(
confidence=0.65,
language="ar"
)
# result.passed = False
# result.reason = "Low OCR confidence: 0.65 < 0.70 threshold"
Example 3: All Checks Pass
result = checker.check_all(
text="Long document text here..." * 50, # 1000+ chars
language="fr",
ocr_confidence=0.95,
file_size_bytes=5_000_000 # 5 MB
)
# result.passed = True
SimHash Deduplication Examples
Example 1: Identical Documents
from app.services.deduplication import SimHasher

hasher = SimHasher()
text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students."
hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)
distance = hasher.hamming_distance(hash1, hash2)
# distance = 0 (identical)
is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True
Example 2: Similar Documents (Minor Differences)
text1 = "This is a math textbook for grade 12 students."
text2 = "This is a math textbook for grade 12 students studying algebra."
hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)
distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 2-3 (very similar)
is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = True or False (depends on exact distance)
Example 3: Different Documents
text1 = "Math textbook for grade 12."
text2 = "History textbook for grade 10."
hash1 = hasher.simhash(text1)
hash2 = hasher.simhash(text2)
distance = hasher.hamming_distance(hash1, hash2)
# distance ≈ 20-30 (very different)
is_dup = hasher.is_duplicate(hash1, hash2, threshold=3)
# is_dup = False
Testing Checklist
SimHash Deduplication (S13)
- [ ] Identical documents produce identical SimHash
- [ ] Very similar documents have Hamming ≤ 3
- [ ] Different documents have Hamming > 3
- [ ] Duplicate detection finds canonical reference
- [ ] First-discovered document marked as canonical
Arabic Canonicalization (S14)
- [ ] Alef variants unified correctly
- [ ] Tatweel removed from text
- [ ] Whitespace normalized (multiple spaces → single space)
- [ ] Boilerplate patterns removed
- [ ] Canonicalized text is consistent across equivalent inputs
Quality Checks (S15)
- [ ] Short pages (< 200 chars) rejected
- [ ] Low OCR confidence (< 0.70 for Arabic) rejected
- [ ] Invalid UTF-8 encoding rejected
- [ ] Large files (> 100 MB) rejected
- [ ] Quality failures logged to ingestion_audit
Scraper Integration
- [ ] Scraper sync processes all documents
- [ ] Duplicates linked to canonical references
- [ ] Quality-failed documents logged (not inserted)
- [ ] New canonical documents auto-trigger ingestion jobs
- [ ] Sync statistics accurate (new, duplicates, quality_failed, errors)
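Since testing is deferred, the checklist could start from a couple of pytest cases like these (assertions follow the examples earlier in this document):

from app.services.deduplication import SimHasher
from app.services.quality_checker import QualityChecker

def test_identical_documents_share_simhash():
    hasher = SimHasher()
    text = "This is a math textbook for grade 12 students."
    assert hasher.simhash(text) == hasher.simhash(text)
    assert hasher.hamming_distance(hasher.simhash(text), hasher.simhash(text)) == 0

def test_short_page_is_rejected():
    checker = QualityChecker()
    result = checker.check_text_length("Page 1")  # 6 chars, below the 200-char minimum
    assert result.passed is False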
Data Flow
New Document (No Duplicate)
Scraper fetches PDF
↓
Extract text (PyPDF/OCR)
↓
Canonicalize (Arabic normalization)
↓
Quality check → PASS
↓
Compute SimHash = 0x1a2b3c4d...
↓
Check duplicates → NONE FOUND
↓
Insert into references:
- content_fingerprint = 0x1a2b3c4d...
- canonical_id = NULL (self is canonical)
- status = 'discovered'
↓
Auto-trigger ingestion job
Duplicate Document
Scraper fetches PDF
↓
Extract text (PyPDF/OCR)
↓
Canonicalize
↓
Quality check → PASS
↓
Compute SimHash = 0x1a2b3c4e...
↓
Check duplicates → Hamming(0x1a2b3c4e, 0x1a2b3c4d) = 2 ≤ 3
↓
Insert into references:
- content_fingerprint = 0x1a2b3c4e...
- canonical_id = <existing-ref-id> ← Points to canonical
- status = 'duplicate'
↓
NO ingestion job (duplicate not processed)
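A sketch of how the canonical link gets resolved in this flow (the helper name and record shape are illustrative; the fingerprints are toy values standing in for the truncated ones above):

from typing import Optional

from app.services.deduplication import SimHasher

hasher = SimHasher()

def resolve_canonical(fingerprint: int, existing: list) -> Optional[dict]:
    # existing: list of {"id": ..., "content_fingerprint": ...} rows already stored
    for ref in existing:
        if hasher.is_duplicate(fingerprint, ref["content_fingerprint"], threshold=3):
            return ref  # first match within Hamming distance 3 is the canonical
    return None

canonical = resolve_canonical(0x1A2B3C4E, [{"id": "existing-ref-id",
                                            "content_fingerprint": 0x1A2B3C4D}])
record = {
    "content_fingerprint": 0x1A2B3C4E,
    "canonical_id": canonical["id"] if canonical else None,   # points to canonical
    "status": "duplicate" if canonical else "discovered",
}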
Quality-Failed Document
Scraper fetches PDF
↓
Extract text (PyPDF/OCR)
↓
Canonicalize
↓
Quality check → FAIL (only 50 chars)
↓
Log to ingestion_audit:
- reason: "Text too short: 50 chars < 200 minimum"
↓
NOT inserted into references
Next Phase: Phase F - Observability & DR
Once Phase E tests pass:
- S17: Structured logging & metrics (already done!)
- S18: Wallet reconciliation job
- S19: Reindex & disaster recovery scripts
Files Changed
New Files
- app/services/text_normalizer.py - Arabic canonicalization (S14)
- app/services/deduplication.py - SimHash deduplication (S13)
- app/services/quality_checker.py - Quality heuristics (S15)
- app/services/scraper_service.py - Integrated scraper pipeline
- app/api/routers/scraper_admin.py - Admin scraper endpoints
Status: ✅ Phase E Complete - Ready for Testing
See SONNET_RUN.md for full implementation log