Data Anonymisation for Testing

Standard ID: STD-DATA-062 | Version: 1.0 | Last Updated: 2026-04-03 | Owner: Data Team / Security Team | Status: Active

Purpose

Define mandatory standards for data anonymisation, synthetic data generation, and test dataset creation at Simpaisa. No real PII or PAN data may exist outside the production environment. This standard ensures that sandbox, staging, development, QA, and local environments operate exclusively on anonymised or synthetic data.

Current State

  • Production data in staging: The staging environment contains a snapshot of production data from 6 months ago, including real MSISDNs, CNICs/NIDs, names, and bank account numbers. This is a PCI DSS violation and a regulatory risk across all six markets.
  • No synthetic data generation: The sandbox environment (used by merchants for integration testing) contains manually fabricated test records. Coverage is poor — only a handful of test merchants and scenarios exist.
  • Developer local environments: Some developers have partial production data extracts on their laptops for debugging. This is uncontrolled and undocumented.
  • No anonymisation pipeline: There is no automated mechanism to create anonymised datasets from production data. Any data transfer between environments is manual and ad hoc.

Target State

  • Zero real PII/PAN outside production. All non-production environments use exclusively anonymised or synthetic data.
  • Automated anonymisation pipeline produces test datasets from production snapshots on a weekly schedule.
  • Synthetic data generator creates realistic test data for sandbox and development environments without any production data dependency.
  • Referential integrity preserved — anonymised data maintains foreign key relationships and cross-table consistency.

Anonymisation Rules

Field-Level Rules

| Data Category | Fields | Anonymisation Method | Details |
| --- | --- | --- | --- |
| Mobile number (MSISDN) | msisdn, phone, mobile | Consistent hash | SHA-256 of original + salt, mapped to valid format (e.g., 03xx-xxxxxxx). Same input always produces same output for referential integrity |
| National ID (CNIC/NID) | cnic, nid, nationalId | Consistent hash | Hashed to valid format per country (PK: 13 digits, BD: 10/17 digits) |
| Name | firstName, lastName, fullName | Faker replacement | Replaced with locale-appropriate fake names (Urdu names for PK, Bengali for BD, etc.) |
| Email | email | Consistent hash + domain | sha256(email)@test.simpaisa.com. Same email always maps to same anonymised email |
| Bank account | accountNumber, iban | Consistent hash | Hashed to valid format per bank. Check digits recalculated |
| Card number (PAN) | pan, cardNumber | BIN-preserving | First 6 digits (BIN) preserved, remaining digits randomised, Luhn check digit recalculated |
| Address | address, city, postalCode | Faker replacement | Replaced with fake addresses in the correct market/city |
| Date of birth | dob, dateOfBirth | Date shift | Shifted by a random offset (±180 days), consistent per individual |
| Transaction amount | amount | Preserved | Amounts are NOT anonymised — they are non-identifying and needed for realistic testing |
| Transaction reference | transactionId, reference | Preserved | References are system-generated and non-identifying |
| Merchant ID | merchantId | Preserved | Merchant IDs are system-generated. Merchant names and contact details ARE anonymised |
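The BIN-preserving rule for PANs can be sketched in Go. This is a minimal illustration, not the pipeline's actual implementation; the function names are illustrative. The BIN is kept, the middle digits are randomised, and the Luhn check digit is recomputed so the output is still a well-formed card number:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"strconv"
)

// luhnCheckDigit returns the digit that makes body+digit Luhn-valid.
func luhnCheckDigit(body string) int {
	sum := 0
	// Walk right to left; doubling starts at the rightmost body digit
	// because the check digit will occupy the final (undoubled) position.
	double := true
	for i := len(body) - 1; i >= 0; i-- {
		d := int(body[i] - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		double = !double
		sum += d
	}
	return (10 - sum%10) % 10
}

// anonymisePAN keeps the 6-digit BIN, randomises the middle digits,
// and recomputes the Luhn check digit, per the field-level rules.
func anonymisePAN(pan string) (string, error) {
	if len(pan) < 8 {
		return "", fmt.Errorf("PAN too short: %d digits", len(pan))
	}
	body := pan[:6]
	for i := 0; i < len(pan)-7; i++ { // len - 6 BIN digits - 1 check digit
		n, err := rand.Int(rand.Reader, big.NewInt(10))
		if err != nil {
			return "", err
		}
		body += strconv.FormatInt(n.Int64(), 10)
	}
	return body + strconv.Itoa(luhnCheckDigit(body)), nil
}

func main() {
	out, _ := anonymisePAN("4242424242424242")
	fmt.Println(out)
}
```

Unlike MSISDNs, PANs are not used as join keys (transaction references are preserved instead), so a non-deterministic randomisation is acceptable here.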

Consistent Hashing

Consistent hashing is critical for maintaining referential integrity across tables:

  • A per-environment salt is used for all hashing. The salt is stored in the secret manager and rotated quarterly.
  • The same real MSISDN always produces the same anonymised MSISDN within an environment. This means a consumer's transactions, KYC records, and wallet balance all link correctly after anonymisation.
  • Different environments use different salts, so anonymised data cannot be correlated across environments.
  • The salt is never stored alongside the anonymised data. Production salt recovery is a two-person process requiring security team approval.
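The bullets above can be sketched as a salted, deterministic mapping. This is an illustrative Go sketch, not the real implementation: it uses HMAC-SHA256 (a reasonable choice for keyed consistent hashing, though the standard only specifies SHA-256 + salt) and a hypothetical PK-style 03xx-xxxxxxx output format:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
	"math/big"
)

// anonymiseMSISDN deterministically maps a real MSISDN to a valid-format
// number (03xx-xxxxxxx). The same input and salt always yield the same
// output, so a consumer's transactions, KYC records, and wallet balance
// still link correctly. Different salts yield uncorrelatable outputs.
func anonymiseMSISDN(msisdn string, salt []byte) string {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(msisdn))
	digest := mac.Sum(nil)

	// Reduce the digest to 9 digits: 2 for the operator code and 7 for
	// the subscriber number. We need determinism and valid formatting,
	// not cryptographic uniformity, so modulo reduction is fine.
	n := new(big.Int).SetBytes(digest)
	n.Mod(n, big.NewInt(1_000_000_000))
	s := fmt.Sprintf("%09d", n)
	return fmt.Sprintf("03%s-%s", s[:2], s[2:])
}

func main() {
	// Illustrative salt only — the real salt lives in the secret manager.
	salt := []byte("staging-salt-example")
	fmt.Println(anonymiseMSISDN("03001234567", salt))
}
```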

Synthetic Data Generation

For sandbox and development environments where no production data dependency is desired:

Synthetic Data Generator

A Go service generates realistic test data:

| Data Set | Volume | Characteristics |
| --- | --- | --- |
| Merchants | 500 per market | Mix of sizes (micro, small, medium, enterprise), all four products, realistic configuration |
| Consumers | 50,000 per market | Locale-appropriate names, valid-format MSISDNs and IDs, realistic demographics |
| Transactions | 1M per market | Realistic distribution (80% Pay-In, 10% Pay-Out, 8% Remittance, 2% Cards), peak hours, seasonal patterns |
| Channels | All active channels per market | Realistic success/failure rates, response times, settlement files |
| Settlements | Daily for 90 days | Matching transaction volumes, realistic fees and netting |

Sandbox-Specific Rules

  • Sandbox uses only synthetic data. No anonymised production data.
  • Test card numbers follow industry-standard test ranges (e.g., 4242424242424242).
  • Test MSISDNs use reserved ranges per market (coordinated with mobile operators where required).
  • Synthetic channel responses simulate realistic behaviour: success rates, timeouts, specific error codes.

Anonymisation Pipeline

Architecture

Production MySQL/SurrealDB
  → Temporal Workflow (weekly, Saturday 02:00 UTC)
    → Export (read-only replica)
      → Anonymise (field-level rules applied)
        → Quality Check (no real PII in output)
          → Load to target environment
            → Verification (PII scan of target)

Pipeline Rules

| Rule | Requirement |
| --- | --- |
| Schedule | Weekly (Saturday 02:00 UTC). Manual trigger available for ad hoc needs |
| Source | Read-only replica of production. Never reads from primary |
| Scope | Configurable per target environment (staging gets full dataset; QA gets subset) |
| Idempotency | Each run produces a complete replacement. No incremental anonymisation (avoids data correlation risks) |
| Verification | Post-load PII scan checks for any unanonymised data. Pipeline fails if real PII is detected |
| Retention | Previous anonymised dataset is deleted before new dataset is loaded. No accumulation |

PII Detection Scan

After every anonymisation run, an automated scan checks the target environment for residual real PII:

| Check | Method |
| --- | --- |
| MSISDN format | Regex match against known real operator prefixes (vs test prefixes) |
| CNIC/NID format | Checksum validation against real ID algorithms |
| PAN | Luhn validation + BIN lookup against real card ranges (vs test ranges) |
| Email | Check for real domain names (gmail.com, yahoo.com vs test.simpaisa.com) |
| Name | Statistical comparison against common real name databases |

If any check fails, the pipeline is rolled back and an alert is sent to the security team.
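Two of the checks above — email domains and MSISDN operator prefixes — can be sketched as simple list lookups. The domain and prefix lists below are illustrative stand-ins for the curated per-market lists the real scan would maintain:

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative stand-ins for the curated per-market lists.
var realEmailDomains = []string{"gmail.com", "yahoo.com", "hotmail.com"}
var realOperatorPrefixes = []string{"0300", "0301", "0345"} // example PK prefixes

// suspectEmail flags addresses that still point at real consumer domains
// instead of the anonymised test.simpaisa.com domain.
func suspectEmail(email string) bool {
	at := strings.LastIndex(email, "@")
	if at < 0 {
		return false
	}
	domain := strings.ToLower(email[at+1:])
	for _, d := range realEmailDomains {
		if domain == d {
			return true
		}
	}
	return false
}

// suspectMSISDN flags numbers in real operator ranges rather than the
// reserved test ranges.
func suspectMSISDN(msisdn string) bool {
	digits := strings.ReplaceAll(msisdn, "-", "")
	for _, p := range realOperatorPrefixes {
		if strings.HasPrefix(digits, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(suspectEmail("alice@gmail.com"))       // flagged
	fmt.Println(suspectEmail("x@test.simpaisa.com"))   // clean
	fmt.Println(suspectMSISDN("0300-1234567"))         // flagged
}
```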

Environment Classification

| Environment | Data Source | Refresh Frequency | PII Allowed |
| --- | --- | --- | --- |
| Production | Real data | N/A | Yes (with PCI DSS controls) |
| Staging | Anonymised production snapshot | Weekly | No |
| QA | Anonymised production subset | Weekly | No |
| Development | Synthetic data | On-demand | No |
| Sandbox (merchant-facing) | Synthetic data | On-demand | No |
| Local (developer laptops) | Synthetic data | On-demand | No |

Compliance Impact

  • PCI DSS Requirement 6.4.3 (v3.2.1; Requirement 6.5.5 in v4.0): Live PANs must not be used for testing or development. This standard directly satisfies this requirement.
  • Data protection laws: Pakistan (PECA 2016), Bangladesh (Digital Security Act), and other market regulations restrict the use of personal data outside its original purpose. Anonymisation ensures non-production use complies.
  • Central bank regulations: Financial regulators in all six markets require controls on data used in testing environments. This standard provides auditable evidence of those controls.

Actions

  1. Immediate: Purge all real PII from the staging environment. Replace with anonymised data from the first pipeline run. Target: 2 weeks.
  2. Immediate: Audit developer laptops for production data extracts. Revoke and delete any found. Target: 1 week.
  3. Month 1: Build and deploy the anonymisation pipeline (Temporal workflow). Run first production snapshot anonymisation for staging.
  4. Month 1–2: Build the synthetic data generator for sandbox and development environments. Deploy to sandbox.
  5. Month 2: Implement the post-anonymisation PII detection scan. Integrate into the pipeline as a mandatory verification step.
  6. Month 2–3: Configure per-environment salt rotation (quarterly). Document the two-person salt recovery process.
  7. Ongoing: Weekly anonymisation pipeline runs. Quarterly salt rotation. Annual audit of all non-production environments for PII compliance.