Pipeline Overview
Stage Details
1. Pre-normalization
Standardizes input text for consistent detection:
- Normalizes Unicode characters (NFKC normalization)
- Standardizes whitespace (multiple spaces → single space)
- Preserves original positions for accurate tagging
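
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the library's implementation: the function name `pre_normalize` and the offset-list return shape are assumptions. NFKC is applied one character at a time, which only approximates full-string NFKC (it misses compositions that span several code points) but keeps the position mapping simple.

```python
import unicodedata


def pre_normalize(text: str) -> tuple[str, list[int]]:
    """Return the normalized text and, for every output character,
    the index of the source character it came from (hypothetical helper)."""
    out_chars: list[str] = []
    offsets: list[int] = []
    for i, ch in enumerate(text):
        # Per-character NFKC: one input character may expand to several.
        for n in unicodedata.normalize("NFKC", ch):
            if n == " " and out_chars and out_chars[-1] == " ":
                continue  # collapse a run of spaces into a single space
            out_chars.append(n)
            offsets.append(i)
    return "".join(out_chars), offsets


normalized, offsets = pre_normalize("\ufb01ve  cats")
# A span found at normalized[a:b] maps back to the original text
# as text[offsets[a] : offsets[b - 1] + 1].
```

Keeping an explicit offset list is one way to let later stages detect on the normalized text while still tagging the original string.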
2. Regex Detection
Fast pattern matching for structured PII:
- Runs all enabled recognizers in parallel
- Validates matches (e.g., Luhn check for credit cards, IBAN checksums)
- Returns spans with 100% confidence
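
As an illustration of the match-then-validate step, the sketch below pairs a candidate pattern with a Luhn checksum. The pattern `CARD_RE`, the function names, and the span dictionary layout are assumptions made for this example; the actual recognizers may use different patterns and structures, and the parallel execution of recognizers is not shown.

```python
import re

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[dict]:
    """Return only the candidates that pass validation."""
    spans = []
    for m in CARD_RE.finditer(text):
        if luhn_valid(m.group()):
            spans.append({
                "type": "CREDIT_CARD",
                "start": m.start(),
                "end": m.end(),
                "confidence": 1.0,  # validated regex matches are treated as certain
                "source": "regex",
            })
    return spans
```

Validation is what keeps the fixed confidence defensible: a digit run that merely looks like a card number is dropped instead of being reported.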
3. NER Detection
Neural network inference for soft PII:
- Tokenizes text using WordPiece
- Runs ONNX inference
- Decodes BIO tags to spans
- Returns spans with confidence scores
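
The BIO decoding step can be sketched as below. Tokenization and ONNX inference are omitted; the function assumes it already receives one tag, one character offset pair, and one score per token. The name `decode_bio`, the input shapes, and the choice to average token scores into a span confidence are all assumptions for illustration.

```python
def decode_bio(tags, offsets, scores):
    """Group B-/I- tagged tokens into character spans (hypothetical helper).

    tags:    e.g. ["O", "B-PERSON", "I-PERSON"]
    offsets: (start, end) character offsets per token
    scores:  model probability of the predicted tag per token
    """
    spans = []
    current = None
    for tag, (start, end), score in zip(tags, offsets, scores):
        label = tag[2:] if tag[:2] in ("B-", "I-") else None
        continues = (
            tag.startswith("I-")
            and current is not None
            and current["type"] == label
        )
        if continues:
            # Extend the open span to cover this token.
            current["end"] = end
            current["scores"].append(score)
            continue
        if current is not None:
            spans.append(current)
            current = None
        if label is not None:
            # A B- tag, or a stray I- tag treated leniently as a span start.
            current = {"type": label, "start": start, "end": end, "scores": [score]}
    if current is not None:
        spans.append(current)
    return [
        {
            "type": s["type"],
            "start": s["start"],
            "end": s["end"],
            # Assumption: span confidence is the mean of its token scores.
            "confidence": sum(s["scores"]) / len(s["scores"]),
            "source": "ner",
        }
        for s in spans
    ]
```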
4. Span Resolution
Merges and prioritizes overlapping detections:
- Higher priority type wins (see PII Types)
- For same type: higher confidence wins
- For exact ties: regex detection wins over NER
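
One way to express these three rules is a single sort key followed by a greedy pass that keeps a span only if it does not overlap a stronger one. The `PRIORITY` values below are placeholders invented for the example (the real ordering is defined under PII Types), and the span dictionary layout is likewise an assumption.

```python
# Hypothetical priorities; the real values come from the PII Types table.
PRIORITY = {"CREDIT_CARD": 3, "PERSON": 2, "LOCATION": 1}


def rank(span: dict) -> tuple:
    """Sort key: type priority, then confidence, then regex before NER."""
    return (
        PRIORITY.get(span["type"], 0),
        span["confidence"],
        1 if span["source"] == "regex" else 0,
    )


def overlaps(a: dict, b: dict) -> bool:
    return a["start"] < b["end"] and b["start"] < a["end"]


def resolve(spans: list[dict]) -> list[dict]:
    """Keep the strongest span in every group of overlapping detections."""
    kept: list[dict] = []
    for span in sorted(spans, key=rank, reverse=True):
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])
```

Because candidates are visited from strongest to weakest, a span is only ever discarded in favor of one that outranks it under the stated rules.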
5. Title Extraction
When semantic enrichment is enabled, honorific titles are extracted.
6. Semantic Enrichment
Adds attributes for better translation context:
| Attribute | Type | Values | Purpose |
|---|---|---|---|
| gender | PERSON | male, female, neutral | Grammatical agreement |
| scope | LOCATION | city, country, region | Preposition selection |
7. Entity Tagging
Replaces detected spans with placeholder tags:
- Assigns unique IDs per type
- Builds PII map (ID → original value)
- Supports ID reuse for repeated values
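
The tagging step can be sketched as follows. The tag syntax `<PII type="..." id="..."/>`, the map key format `TYPE_id`, and the function name `tag_entities` are assumptions chosen for the example; the actual placeholder format may differ. Spans are assumed to be non-overlapping, as they would be after span resolution.

```python
def tag_entities(text: str, spans: list[dict]) -> tuple[str, dict[str, str]]:
    """Replace each span with a placeholder tag and build the PII map."""
    counters: dict[str, int] = {}          # next ID per PII type
    seen: dict[tuple[str, str], int] = {}  # (type, value) -> assigned ID
    pii_map: dict[str, str] = {}           # "TYPE_id" -> original value
    parts: list[str] = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        value = text[span["start"]:span["end"]]
        key = (span["type"], value)
        if key not in seen:
            # First occurrence of this value: allocate a new ID for its type.
            counters[span["type"]] = counters.get(span["type"], 0) + 1
            seen[key] = counters[span["type"]]
            pii_map[f'{span["type"]}_{seen[key]}'] = value
        parts.append(text[cursor:span["start"]])
        parts.append(f'<PII type="{span["type"]}" id="{seen[key]}"/>')
        cursor = span["end"]
    parts.append(text[cursor:])
    return "".join(parts), pii_map
```

Reusing an ID for a repeated value lets a downstream consumer see that two mentions refer to the same entity without seeing the value itself.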
8. Validation
Optional output validation:
- Leak Scan: Checks if any original PII appears in output
- Format Check: Validates tag structure
- Warnings: Logs issues without blocking
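
A minimal sketch of the two checks is shown below. It assumes a hypothetical tag syntax of the form `<PII type="PERSON" id="1"/>` and a PII map keyed by strings such as `PERSON_1`; neither is confirmed by this document. Issues are logged and returned, never raised, to mirror the non-blocking behavior described above.

```python
import logging
import re

logger = logging.getLogger("pii.validation")

# Hypothetical well-formed tag: <PII type="PERSON" id="1"/>
TAG_RE = re.compile(r'<PII type="[A-Z_]+" id="\d+"/>')
TAG_START_RE = re.compile(r"<PII\b")


def validate(output: str, pii_map: dict[str, str]) -> list[str]:
    """Run the leak scan and format check; return warnings without raising."""
    warnings: list[str] = []
    # Leak scan: no original value may survive in the anonymized output.
    for key, value in pii_map.items():
        if value and value in output:
            # Report the map key only, so the warning does not repeat the PII.
            warnings.append(f"leak: {key} appears in output")
    # Format check: every tag opening must belong to a well-formed tag.
    malformed = len(TAG_START_RE.findall(output)) - len(TAG_RE.findall(output))
    if malformed > 0:
        warnings.append(f"format: {malformed} malformed tag(s)")
    for message in warnings:
        logger.warning(message)
    return warnings
```

A plain substring search is the simplest possible leak scan; it can produce false positives for short values that also occur as ordinary words.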
9. Encryption
Secures the PII map using AES-256-GCM.
Performance Characteristics
| Stage | Time (2K chars) | Notes |
|---|---|---|
| Regex pass | ~5 ms | All recognizers |
| NER inference | ~100-150 ms | Quantized model |
| Semantic enrichment | ~1-2 ms | After data loaded |
| Total pipeline | ~150-200 ms | Full anonymization |