DataMask

Mask production data on its way into dev/staging without losing referential integrity. Features: KMS-derived deterministic seeds (with a local-seed fallback), ML-based PII detection, dry-run preview, k-anonymity verification, and an Ed25519-signed audit handshake.

What it is

A workspace and engine that runs an LGPD-grade masking job from source to target. Sensitive fields are detected by a 3-signal classifier (regex + heuristics + ML), then transformed with deterministic functions keyed by an HKDF-derived per-field seed, so the same source value always maps to the same masked value. That determinism is what preserves referential integrity across collections.
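
To illustrate why a deterministic transform keeps references intact, here is a minimal sketch that masks a value with HMAC-SHA256 under a per-field key. It is not DataMask's actual transform: the function name `mask_value`, the key `CPF_KEY`, the 16-character truncation, and the sample collections are all assumptions made for the example, and it assumes that linked fields in different collections share the same key.

```python
import hashlib
import hmac

def mask_value(field_key: bytes, value: str) -> str:
    """Deterministically mask a value with HMAC-SHA256 under a per-field key."""
    digest = hmac.new(field_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (assumption, not DataMask's format)

# Hypothetical per-field key; in practice it would be the HKDF-derived seed.
CPF_KEY = b"example-key-for-the-cpf-field"

customers = [{"_id": 1, "cpf": "123.456.789-09"}]
orders = [{"_id": 10, "cpf": "123.456.789-09", "total": 42}]

masked_customers = [{**doc, "cpf": mask_value(CPF_KEY, doc["cpf"])} for doc in customers]
masked_orders = [{**doc, "cpf": mask_value(CPF_KEY, doc["cpf"])} for doc in orders]

# The same source value produced the same masked value in both collections,
# so a join on "cpf" still matches after masking.
assert masked_customers[0]["cpf"] == masked_orders[0]["cpf"]
```

Because the output depends only on the key and the input, re-running the job with the same key reproduces the same masked values, while a different key yields an unrelated mapping.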

When to use

Run it before refreshing dev / staging from production. Anyone with access to the masked target sees the same schema and statistical distribution as production, but every field flagged as PII is replaced, so the DPO, ANPD, and external dev teams can work with the data without causing an LGPD breach.

How to open

  1. In the NoSqlStudio desktop app, open the Tools menu → Migrate and Diff → Data Mask, or press Ctrl+Alt+K.
  2. A workspace tab opens with the source / target picker on top, the field policy editor in the center, and the dry-run preview on the right.
  3. Pick source and target connections, then pick the database and collections to mask.

Building a mask plan

  1. Scanner — click "Scan source" to run the PII detector on a sample. The output is a per-field classification: not_sensitive / personal_low / personal_medium / personal_high / checksum_match.
  2. Policy — for every flagged field, the engine proposes a transform (hash / shuffle / format-preserving / nullify / leave). You can override per field.
  3. Audit handshake — for tiers that require it, the engine triggers a license-server-signed handshake (Ed25519). The signed token is embedded in every audit line so an external auditor can verify the run cryptographically.
  4. Dry-run — click "Preview" to mask 100 sample documents in-memory and render before/after side by side. Nothing has been written to the target yet.
  5. k-anonymity — set a minimum k. The engine refuses to apply the mask if any combination of quasi-identifiers (zipcode + age + gender) produces a group smaller than k.
  6. Justification — for sensitive fields, the engine requires a free-text justification stored in the audit log.
  7. Apply — click "Start mask". The engine writes the masked documents to the target with progress bar, throughput, and resume-on-failure built in. The audit log records the policy fingerprint, target namespaces, and outcome.
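
The k-anonymity gate in step 5 can be pictured as a group-size count over the quasi-identifier combination. The sketch below is an assumption about how such a check could look, not the engine's implementation. The function name `groups_below_k` and the sample documents are made up for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Sequence

def groups_below_k(
    docs: Iterable[Mapping[str, Any]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> dict:
    """Return every quasi-identifier combination whose group size is below k."""
    counts = Counter(tuple(doc.get(q) for q in quasi_identifiers) for doc in docs)
    return {combo: size for combo, size in counts.items() if size < k}

docs = [
    {"zipcode": "01310", "age": 34, "gender": "F"},
    {"zipcode": "01310", "age": 34, "gender": "F"},
    {"zipcode": "01310", "age": 34, "gender": "F"},
    {"zipcode": "04538", "age": 61, "gender": "M"},
]

violations = groups_below_k(docs, ("zipcode", "age", "gender"), k=2)
if violations:
    # One person is alone in their group, so a mask with k=2 would be refused.
    print("refusing to apply; groups below k:", violations)
```

Generalizing a quasi-identifier (for example, bucketing age into ranges) is the usual way to grow small groups until they reach k.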

KMS-derived seeds

For deterministic masking, every field uses a per-field seed derived from a root secret via HKDF. The root secret can live in HashiCorp Vault, AWS KMS, Azure Key Vault, or GCP KMS, and is configured once per workstation. Without a KMS, the engine falls back to a local HKDF seed: masking is still deterministic, but rotation requires manual key replacement.
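
As a rough sketch of per-field derivation, the snippet below implements HKDF with SHA-256 as specified in RFC 5869, using only the Python standard library. The helper `field_seed`, the `datamask-field:` info prefix, and the hard-coded root secret are assumptions for illustration. How DataMask actually labels fields or fetches the root secret from a KMS is not described here.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes

def hkdf_sha256(root_secret: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a pseudorandom key, then expand it."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large for HKDF-SHA256")
    # Extract: an empty salt is replaced by HASH_LEN zero bytes, per the RFC.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, root_secret, hashlib.sha256).digest()
    # Expand: chain HMAC blocks, each bound to `info` and a one-byte counter.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def field_seed(root_secret: bytes, field_label: str) -> bytes:
    """Derive a 32-byte seed for one field (the label scheme is hypothetical)."""
    info = b"datamask-field:" + field_label.encode("utf-8")
    return hkdf_sha256(root_secret, salt=b"", info=info)

# Placeholder root secret; a real one would come from the KMS or local store.
root = b"placeholder-root-secret"
cpf_seed = field_seed(root, "cpf")
email_seed = field_seed(root, "email")
assert cpf_seed != email_seed               # each field gets its own seed
assert cpf_seed == field_seed(root, "cpf")  # derivation is repeatable
```

Since every seed is a function of the root secret, rotating that secret changes all per-field seeds at once, and therefore every masked value produced afterwards.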

Right-to-be-Forgotten executor

A separate sub-workflow handles LGPD Art. 18 erasure requests. See the RTBF Wizard doc for the full flow; the DataMask engine is the substrate that performs the actual erasure.

Limitations (v1)

  • KMS providers are optional and tier-gated. Without one, the local HKDF seed lives in userData and rotation is manual.
  • Schema-drift detection only catches changes since the last snapshot. If the source schema changed between scan and apply, the engine warns; it does not auto-evolve the policy.
  • k-anonymity check is sample-based (default 10k docs), so groups that fall outside the sample are not verified. For stronger assurance, set k high and review the per-bucket report.
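
The schema-drift limitation above can be pictured as a comparison between the field map captured at scan time and the one seen at apply time. The sketch below is only an assumption about what such a comparison might involve. The helpers `field_types` and `schema_drift` and the sample documents are hypothetical and are not part of DataMask.

```python
from typing import Any, Dict, List, Mapping

def field_types(doc: Mapping[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten a document into {dotted.path: Python type name}."""
    out: Dict[str, str] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(field_types(value, prefix=f"{path}."))
        else:
            out[path] = type(value).__name__
    return out

def schema_drift(snapshot: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """List fields that were added, removed, or changed type since the snapshot."""
    shared = snapshot.keys() & current.keys()
    return {
        "added": sorted(current.keys() - snapshot.keys()),
        "removed": sorted(snapshot.keys() - current.keys()),
        "retyped": sorted(p for p in shared if snapshot[p] != current[p]),
    }

at_scan = field_types({"name": "Ana", "address": {"zipcode": "01310"}})
at_apply = field_types({"name": "Ana", "address": {"zipcode": 1310}, "phone": "555-0100"})

drift = schema_drift(at_scan, at_apply)
if any(drift.values()):
    # A new field such as "phone" has no policy entry yet, so it needs review.
    print("schema changed since scan:", drift)
```

A field that appears after the scan has no transform assigned, which is why a warning and a manual policy review are needed before applying.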