The NER (Named Entity Recognition) model detects soft PII such as person names, organizations, and locations, which regex patterns cannot capture.

Model Modes

Mode          Description               Size      Use Case
'disabled'    No NER, regex only        0         Fast processing, structured PII only
'quantized'   Smaller quantized model   ~280 MB   Recommended for most use cases
'standard'    Full-size model           ~1.1 GB   Maximum accuracy
'custom'      Your own ONNX model       Varies    Domain-specific models

Basic Setup

import { createAnonymizer } from 'rehydra';

const anonymizer = createAnonymizer({
  ner: { 
    mode: 'quantized',
    onStatus: (status) => console.log(status),
  }
});

await anonymizer.initialize();  // Downloads model on first use

const result = await anonymizer.anonymize('Hello John Smith from Acme Corp!');
// "Hello <PII type="PERSON" id="1"/> from <PII type="ORG" id="1"/>!"

Download Progress

Track model download progress:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    onStatus: (status) => console.log('Status:', status),
    onDownloadProgress: (progress) => {
      console.log(`${progress.file}: ${progress.percent}%`);
    }
  }
});
Output during first initialization:
Status: Downloading model files...
model.onnx: 15%
model.onnx: 30%
model.onnx: 100%
vocab.txt: 100%
Status: Loading NER model...
Status: NER model loaded!

Confidence Thresholds

NER entities have confidence scores (0.0-1.0). Configure minimum thresholds:
const anonymizer = createAnonymizer({
  ner: { 
    mode: 'quantized',
    thresholds: {
      PERSON: 0.8,     // 80% confidence required
      ORG: 0.7,        // 70% for organizations
      LOCATION: 0.6,   // 60% for locations
    }
  }
});
  • Lower thresholds → more detections (potentially more false positives)
  • Higher thresholds → fewer detections (may miss some entities)
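Conceptually, the threshold check is just a per-type filter over scored entities. A minimal sketch of that idea (Entity, applyThresholds, and the 0.5 default are illustrative assumptions, not rehydra internals):

```typescript
// Sketch of per-type confidence thresholds. Names here are illustrative,
// not part of the rehydra API.
interface Entity {
  type: string;
  text: string;
  confidence: number; // 0.0-1.0 score from the NER model
}

function applyThresholds(
  entities: Entity[],
  thresholds: Record<string, number>,
  defaultThreshold = 0.5, // assumed fallback for types without a threshold
): Entity[] {
  // Keep an entity only if its score meets the threshold for its type
  return entities.filter(
    (e) => e.confidence >= (thresholds[e.type] ?? defaultThreshold),
  );
}
```

With `{ PERSON: 0.8 }`, a PERSON scored at 0.75 would be dropped while an ORG at 0.75 would pass under the assumed 0.5 default.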

Case Fallback

The NER model is case-sensitive — it works best on properly capitalized text. This means lowercase names like "tom" or "sarah" can be missed. Enable caseFallback to run a second NER pass on title-cased text and merge any new detections:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    caseFallback: true,
  }
});

await anonymizer.initialize();

await anonymizer.anonymize('hey tom, can you ask sarah to call me?');
// "hey <PII type="PERSON" id="1"/>, can you ask <PII type="PERSON" id="2"/> to call me?"
Without caseFallback, neither "tom" nor "sarah" would be detected.

How it works

  1. The primary NER pass runs on the original text
  2. A second pass runs on title-cased text (e.g. "tom" → "Tom")
  3. New detections from the fallback pass that don’t overlap with primary detections are merged in
  4. Fallback detections keep the original lowercase text and character offsets
  5. A confidence penalty is applied to fallback detections to reduce false positives
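The steps above can be sketched as a merge over two detection lists (Detection, overlaps, and mergeFallbackDetections are illustrative names, not part of the rehydra API):

```typescript
// Sketch of the caseFallback merge described above. All names are
// hypothetical; this is not rehydra's actual implementation.
interface Detection {
  type: string;
  start: number;      // character offset in the ORIGINAL text
  end: number;
  text: string;       // original (possibly lowercase) surface form
  confidence: number;
}

function overlaps(a: Detection, b: Detection): boolean {
  return a.start < b.end && b.start < a.end;
}

function mergeFallbackDetections(
  primary: Detection[],
  fallback: Detection[],
  penalty = 0.85,     // corresponds to caseFallbackPenalty
  threshold = 0.5,    // assumed minimum confidence after the penalty
): Detection[] {
  const merged = [...primary];
  for (const d of fallback) {
    // Step 3: only keep fallback hits that don't overlap a primary hit
    if (primary.some((p) => overlaps(p, d))) continue;
    // Step 5: apply the confidence penalty to fallback detections
    const penalized = d.confidence * penalty;
    if (penalized >= threshold) {
      merged.push({ ...d, confidence: penalized });
    }
  }
  // Detections keep their original offsets (step 4), sorted by position
  return merged.sort((a, b) => a.start - b.start);
}
```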

Confidence penalty

Fallback detections receive a confidence penalty (multiplied by caseFallbackPenalty, default 0.85) since title-casing can introduce false positives. You can tune this:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    caseFallback: true,
    caseFallbackPenalty: 0.7,  // Stricter penalty
  }
});
Enabling caseFallback doubles NER inference time since it runs two passes. Use it when your input text contains informal or uncapitalized names (chat messages, transcripts, etc.).

Auto-Download Control

By default, models are downloaded automatically. To disable:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    autoDownload: false,  // Will throw if model not present
  }
});

Manual Model Management

Pre-download models or manage cache:
import { 
  isModelDownloaded,
  downloadModel,
  clearModelCache,
  listDownloadedModels
} from 'rehydra';

// Check if model exists
const hasModel = await isModelDownloaded('quantized');

// Pre-download with progress
await downloadModel('quantized', (progress) => {
  console.log(`${progress.file}: ${progress.percent}%`);
});

// List downloaded models
const models = await listDownloadedModels();
// ['quantized']

// Clear specific model
await clearModelCache('quantized');

// Clear all models
await clearModelCache();

Inference Server Backend

For batch processing or GPU acceleration, offload NER inference to a remote server:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'quantized',
    backend: 'inference-server',
    inferenceServerUrl: 'http://localhost:8000/predict',
    inferenceServerTimeout: 30000,  // 30 seconds (default)
  }
});
This sends tokenized text to the server for inference instead of running ONNX locally. The server must accept the same input format and return logits in the expected shape.
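To picture that contract, here is a rough client-side sketch; the request/response field names (inputIds, attentionMask, logits) are assumptions for illustration, not rehydra's documented wire format:

```typescript
// Illustrative sketch only: these field names are assumptions about the
// wire format, not rehydra's documented protocol.
interface InferenceRequest {
  inputIds: number[][];      // tokenized input, one row per sequence
  attentionMask: number[][];
}

interface InferenceResponse {
  logits: number[][][];      // [sequence][token][label], as local ONNX would produce
}

// Pick the highest-scoring label index for each token of each sequence.
function argmaxLabels(logits: number[][][]): number[][] {
  return logits.map((seq) =>
    seq.map((scores) => scores.indexOf(Math.max(...scores))),
  );
}

async function runRemoteInference(
  url: string,
  req: InferenceRequest,
  timeoutMs = 30_000,        // mirrors inferenceServerTimeout
): Promise<InferenceResponse> {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(req),
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!res.ok) throw new Error(`Inference server returned ${res.status}`);
  return (await res.json()) as InferenceResponse;
}
```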

Custom Models

Use your own ONNX model:
const anonymizer = createAnonymizer({
  ner: {
    mode: 'custom',
    modelPath: './my-model.onnx',
    vocabPath: './vocab.txt',
  }
});
Custom models must follow the same input/output format as the default models. See the model training guide for details.

Cache Locations

Models are cached locally for offline use:

Node.js

Platform   Location
macOS      ~/Library/Caches/rehydra/models/
Linux      ~/.cache/rehydra/models/
Windows    %LOCALAPPDATA%/rehydra/models/

Browser

In browsers, models are stored using:
  • Origin Private File System (OPFS) for large model files
  • IndexedDB for metadata
Data persists across page reloads and browser sessions.

NER-Detected Types

Type            Examples
PERSON          John Smith, Maria, Dr. Johnson
ORG             Acme Corp, Google, United Nations
LOCATION        Berlin, Germany, Central Park
ADDRESS         123 Main Street
DATE_OF_BIRTH   born on March 15, 1990

Disabling Specific NER Types

Detect only certain entity types:
import { createAnonymizer, PIIType } from 'rehydra';

const anonymizer = createAnonymizer({
  ner: { mode: 'quantized' },
  defaultPolicy: {
    nerEnabledTypes: new Set([
      PIIType.PERSON,  // Only detect names
    ])
  }
});

Performance Tips

Model loading is expensive. Create once and reuse:
// ✅ Good: create once
const anonymizer = createAnonymizer({ ner: { mode: 'quantized' } });
await anonymizer.initialize();

// Reuse for multiple texts
await anonymizer.anonymize(text1);
await anonymizer.anonymize(text2);

// Dispose when done
await anonymizer.dispose();
The quantized model is ~95% as accurate but 4x smaller:
Model       Size      Inference Time
Standard    ~1.1 GB   ~120ms
Quantized   ~280 MB   ~100ms
If you only need emails, phones, IBANs, etc.:
import { anonymizeRegexOnly } from 'rehydra';

const result = await anonymizeRegexOnly(text);
// ~5ms instead of ~150ms

Next Steps

Semantic Enrichment

Add gender and location attributes

Custom Recognizers

Add domain-specific detection patterns