Rehydra detects two categories of PII: structured PII (detected via regex) and soft PII (detected via an NER model).

Structured PII (Regex Detection)

These types have well-defined patterns and are detected using optimized regular expressions with validation.
| Type | Description | Example | Validation |
|------|-------------|---------|------------|
| EMAIL | Email addresses | [email protected] | RFC 5322 format |
| PHONE | Phone numbers (international) | +49 30 123456 | E.164 patterns |
| IBAN | International Bank Account Numbers | DE89370400440532013000 | Checksum validation |
| BIC_SWIFT | Bank Identifier Codes | COBADEFFXXX | Format validation |
| CREDIT_CARD | Credit card numbers | 4111111111111111 | Luhn algorithm |
| IP_ADDRESS | IPv4 and IPv6 addresses | 192.168.1.1 | Format validation |
| URL | Web URLs | https://example.com | URI format |
| CASE_ID | Case/ticket numbers | CASE-12345 | Configurable pattern |
| CUSTOMER_ID | Customer identifiers | CUST-ABC123 | Configurable pattern |
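
To make the "Luhn algorithm" and "Checksum validation" entries concrete, here is a minimal sketch of both checks. The function names `passesLuhn` and `passesIbanChecksum` are hypothetical and not part of the Rehydra API; the sketch shows the standard algorithms, and Rehydra's own validators may differ in detail (for example, in which lengths or separators they accept).

```typescript
// Luhn check: starting from the rightmost digit, double every second digit,
// subtract 9 from any result above 9, and require the total to be divisible by 10.
function passesLuhn(cardNumber: string): boolean {
  const digits = cardNumber.replace(/[\s-]/g, '');
  if (!/^\d{12,19}$/.test(digits)) return false; // assumed length range
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48; // '0' is char code 48
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// IBAN check (ISO 13616 mod-97): move the first four characters to the end,
// map letters to numbers (A=10 ... Z=35), and require the result mod 97 to be 1.
function passesIbanChecksum(iban: string): boolean {
  const compact = iban.replace(/\s/g, '').toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(compact)) return false;
  const rearranged = compact.slice(4) + compact.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    if (code >= 65) {
      // Letter: contributes a two-digit value, so shift by 100.
      remainder = (remainder * 100 + (code - 55)) % 97;
    } else {
      // Digit: contributes a single decimal digit.
      remainder = (remainder * 10 + (code - 48)) % 97;
    }
  }
  return remainder === 1;
}
```

Both example values from the table pass these checks, which is why a pattern match alone is not treated as a detection: a 16-digit number that fails Luhn is most likely not a card number.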

Soft PII (NER Detection)

These types require contextual understanding and are detected using a trained NER (Named Entity Recognition) model.
| Type | Description | Example | Semantic Attributes |
|------|-------------|---------|---------------------|
| PERSON | Person names | John Smith, Maria | gender (male/female/neutral) |
| ORG | Organization names | Acme Corp, Google | |
| LOCATION | Places and locations | Berlin, Germany | scope (city/country/region) |
| ADDRESS | Physical addresses | 123 Main St | |
| DATE_OF_BIRTH | Dates of birth | born on March 15, 1990 | |

NER detection requires initializing with a model mode other than 'disabled'. See the NER Detection Guide for setup.

Priority Resolution

When multiple detections overlap (e.g., an email that’s also a URL), Rehydra uses priority ordering:
Lower Priority                              Higher Priority
──────────────────────────────────────────────────────────→
URL → IP → LOCATION → ORG → PERSON → PHONE → EMAIL → IBAN → CREDIT_CARD
Higher priority types take precedence when spans overlap.
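
The ordering above can be illustrated with a small overlap resolver. This is a sketch under stated assumptions, not Rehydra's actual implementation: the `Detection` shape, the `resolveOverlaps` name, the greedy strategy, and the treatment of unlisted types as lowest priority are all hypothetical. The labels are copied from the diagram as written.

```typescript
// Priority labels from the diagram, lowest first (assumed to be the full list).
const PRIORITY_ORDER: string[] = [
  'URL', 'IP', 'LOCATION', 'ORG', 'PERSON', 'PHONE', 'EMAIL', 'IBAN', 'CREDIT_CARD',
];

// Hypothetical detection shape: a half-open character span [start, end).
interface Detection {
  type: string;
  start: number;
  end: number;
}

function resolveOverlaps(detections: Detection[]): Detection[] {
  // Unlisted types get indexOf() === -1 and therefore rank lowest (an assumption).
  const rank = (d: Detection): number => PRIORITY_ORDER.indexOf(d.type);
  const byPriority = [...detections].sort((a, b) => rank(b) - rank(a));

  const kept: Detection[] = [];
  for (const candidate of byPriority) {
    // Two half-open spans overlap when each starts before the other ends.
    const overlaps = kept.some(
      (k) => candidate.start < k.end && k.start < candidate.end,
    );
    if (!overlaps) kept.push(candidate);
  }
  // Return survivors in document order.
  return kept.sort((a, b) => a.start - b.start);
}
```

For the text `jane@example.com`, an EMAIL detection covering the whole string and a URL detection covering `example.com` overlap; under this sketch only the EMAIL detection survives.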

Confidence Thresholds

NER-detected entities have confidence scores. You can configure minimum thresholds:
import { createAnonymizer } from 'rehydra';

const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    thresholds: {
      PERSON: 0.8,    // Require 80% confidence for names
      ORG: 0.7,       // 70% for organizations
      LOCATION: 0.6,  // 60% for locations
    }
  }
});
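
Conceptually, a per-type threshold acts as a filter over scored entities. The sketch below illustrates that idea only; `ScoredEntity`, `applyThresholds`, the inclusive `>=` comparison, and the `fallback` value for types without a configured threshold are assumptions, not documented Rehydra behavior.

```typescript
// Hypothetical shape of an NER result with a confidence score in [0, 1].
interface ScoredEntity {
  type: string;
  text: string;
  confidence: number;
}

function applyThresholds(
  entities: ScoredEntity[],
  thresholds: Record<string, number>,
  fallback = 0.5, // assumed default for types with no explicit threshold
): ScoredEntity[] {
  return entities.filter(
    (entity) => entity.confidence >= (thresholds[entity.type] ?? fallback),
  );
}

// Same values as the configuration above.
const exampleThresholds: Record<string, number> = {
  PERSON: 0.8,
  ORG: 0.7,
  LOCATION: 0.6,
};

const filtered = applyThresholds(
  [
    { type: 'PERSON', text: 'John Smith', confidence: 0.92 },
    { type: 'PERSON', text: 'Maria', confidence: 0.75 }, // below 0.8, dropped
    { type: 'LOCATION', text: 'Berlin', confidence: 0.65 },
  ],
  exampleThresholds,
);
```

Raising a threshold trades recall for precision: fewer false positives are masked, but low-confidence true names may slip through unmasked.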

Type-Specific Detection Control

Enable or disable specific PII types:
import { createAnonymizer, PIIType } from 'rehydra';

const anonymizer = createAnonymizer({
  defaultPolicy: {
    // Only detect these regex types
    regexEnabledTypes: new Set([
      PIIType.EMAIL, 
      PIIType.PHONE,
    ]),
    // Only detect these NER types
    nerEnabledTypes: new Set([
      PIIType.PERSON,
    ]),
  }
});

Custom ID Patterns

Add domain-specific patterns for case IDs and customer IDs:
import { createAnonymizer, createCustomIdRecognizer, PIIType } from 'rehydra';

const customRecognizer = createCustomIdRecognizer([
  {
    name: 'Order Number',
    pattern: /\bORD-[A-Z0-9]{8}\b/g,
    type: PIIType.CASE_ID,
  },
  {
    name: 'Customer ID',
    pattern: /\bCUST-[0-9]{6}\b/g,
    type: PIIType.CUSTOMER_ID,
  },
]);

const anonymizer = createAnonymizer();
anonymizer.getRegistry().register(customRecognizer);
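
Before registering a pattern, it can help to check it against sample text with plain regular expressions. The sketch below uses only standard string methods and no Rehydra APIs; `findAll` and the sample sentence are hypothetical.

```typescript
// The two patterns from the recognizer definition above.
const orderPattern = /\bORD-[A-Z0-9]{8}\b/g;
const customerPattern = /\bCUST-[0-9]{6}\b/g;

// Return every match of a global pattern, or an empty list if there is none.
function findAll(pattern: RegExp, text: string): string[] {
  return text.match(pattern) ?? [];
}

const sample =
  'Order ORD-12AB34CD for CUST-004217 shipped; CUST-ABC123 is a different format.';

const orderMatches = findAll(orderPattern, sample);       // ['ORD-12AB34CD']
const customerMatches = findAll(customerPattern, sample); // ['CUST-004217']
```

Note that `CUST-ABC123` is not matched, because this customer pattern accepts exactly six digits. If your identifiers contain letters, widen the character class accordingly (for example, `[A-Z0-9]{6}`).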

Placeholder Format

Detected PII is replaced with XML-like placeholder tags:
<!-- Basic format -->
<PII type="EMAIL" id="1"/>

<!-- With semantic attributes -->
<PII type="PERSON" gender="female" id="1"/>
<PII type="LOCATION" scope="city" id="2"/>
The placeholder format is designed to be:
  • Preserved by translation APIs — Most services treat XML-like tags as non-translatable
  • Parseable — Easy to extract for rehydration
  • Informative — Type and attributes help with contextual translation
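
To illustrate the "parseable" property, the following sketch extracts placeholder tags from a string with regular expressions. It is not Rehydra's rehydration code; the `Placeholder` shape and `extractPlaceholders` name are hypothetical, and the sketch assumes tags are self-closing with double-quoted attribute values, as in the examples above.

```typescript
// Hypothetical parsed form of one <PII .../> tag.
interface Placeholder {
  type: string;
  id: string;
  attributes: Record<string, string>; // semantic attributes such as gender or scope
  raw: string;                        // the tag exactly as it appeared in the text
}

function extractPlaceholders(text: string): Placeholder[] {
  const tagPattern = /<PII\s+([^>]*?)\s*\/>/g;
  const placeholders: Placeholder[] = [];

  let tag: RegExpExecArray | null;
  while ((tag = tagPattern.exec(text)) !== null) {
    // Collect every name="value" pair inside the tag.
    const all: Record<string, string> = {};
    const attrPattern = /(\w+)="([^"]*)"/g;
    let attr: RegExpExecArray | null;
    while ((attr = attrPattern.exec(tag[1] ?? '')) !== null) {
      all[attr[1] ?? ''] = attr[2] ?? '';
    }
    // Separate the two core attributes from the semantic ones.
    const { type = '', id = '', ...attributes } = all;
    placeholders.push({ type, id, attributes, raw: tag[0] });
  }
  return placeholders;
}

const parsed = extractPlaceholders(
  'Hello <PII type="PERSON" gender="female" id="1"/> from <PII type="LOCATION" scope="city" id="2"/>.',
);
```

Here `parsed` contains two entries: a PERSON with `gender: 'female'` and id `1`, and a LOCATION with `scope: 'city'` and id `2`. A rehydration step could use `raw` to locate each tag and the type/id pair to look up the original value.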

Next Steps