Data Anonymisation for Testing

Standard ID: STD-DATA-062
Version: 1.0
Last Updated: 2026-04-03
Owner: Data Team / Security Team
Status: Active

Purpose

Define mandatory standards for data anonymisation, synthetic data generation, and test dataset creation at Simpaisa. No real PII or PAN data may exist outside the production environment. This standard ensures that sandbox, staging, development, QA, and local environments operate exclusively on anonymised or synthetic data.

Current State

Production data in staging: The staging environment contains a snapshot of production data from 6 months ago, including real MSISDNs, CNICs/NIDs, names, and bank account numbers. This is a PCI DSS violation and a regulatory risk across all six markets.
No synthetic data generation: The sandbox environment (used by merchants for integration testing) contains manually fabricated test records. Coverage is poor — only a handful of test merchants and scenarios exist.
Developer local environments: Some developers have partial production data extracts on their laptops for debugging. This is uncontrolled and undocumented.
No anonymisation pipeline: There is no automated mechanism to create anonymised datasets from production data. Any data transfer between environments is manual and ad hoc.

Target State

Zero real PII/PAN outside production. All non-production environments use exclusively anonymised or synthetic data.
Automated anonymisation pipeline produces test datasets from production snapshots on a weekly schedule.
Synthetic data generator creates realistic test data for sandbox and development environments without any production data dependency.
Referential integrity preserved — anonymised data maintains foreign key relationships and cross-table consistency.

Anonymisation Rules

Field-Level Rules

Data Category	Fields	Anonymisation Method	Details
Mobile number (MSISDN)	`msisdn`, `phone`, `mobile`	Consistent hash	SHA-256 of original + salt, mapped to valid format (e.g., `03xx-xxxxxxx`). Same input always produces same output for referential integrity
National ID (CNIC/NID)	`cnic`, `nid`, `nationalId`	Consistent hash	Hashed to valid format per country (PK: 13 digits, BD: 10/17 digits)
Name	`firstName`, `lastName`, `fullName`	Faker replacement	Replaced with locale-appropriate fake names (Urdu names for PK, Bengali for BD, etc.)
Email	`email`	Consistent hash + domain	`sha256(email + salt)@test.simpaisa.com`, using the per-environment salt. The same email always maps to the same anonymised email within an environment
Bank account	`accountNumber`, `iban`	Consistent hash	Hashed to valid format per bank. Check digits recalculated
Card number (PAN)	`pan`, `cardNumber`	BIN-preserving	First 6 digits (BIN) preserved, remaining digits randomised, Luhn check digit recalculated
Address	`address`, `city`, `postalCode`	Faker replacement	Replaced with fake addresses in the correct market/city
Date of birth	`dob`, `dateOfBirth`	Date shift	Shifted by a random offset (±180 days), consistent per individual
Transaction amount	`amount`	Preserved	Amounts are NOT anonymised — they are non-identifying and needed for realistic testing
Transaction reference	`transactionId`, `reference`	Preserved	References are system-generated and non-identifying
Merchant ID	`merchantId`	Preserved	Merchant IDs are system-generated. Merchant names and contact details ARE anonymised
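
As an illustration of the BIN-preserving rule for card numbers, the sketch below keeps the first six digits, redraws the middle digits, and recomputes the Luhn check digit. It is a minimal Python sketch under stated assumptions, not the pipeline's implementation: the function names `luhn_check_digit`, `luhn_valid`, and `anonymise_pan` are hypothetical, and the redraw-on-collision loop is an added safeguard that the table does not specify.

```python
import random
from typing import Optional


def luhn_check_digit(payload: str) -> str:
    """Return the Luhn check digit for a digit string that has none yet."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(pan: str) -> bool:
    """True when the last digit matches the Luhn check digit of the rest."""
    return pan.isdigit() and len(pan) > 1 and luhn_check_digit(pan[:-1]) == pan[-1]


def anonymise_pan(pan: str, rng: Optional[random.Random] = None) -> str:
    """Keep the 6-digit BIN, randomise the middle, recompute the check digit."""
    if not pan.isdigit() or not 13 <= len(pan) <= 19:
        raise ValueError("expected a 13 to 19 digit PAN")
    rng = rng or random.SystemRandom()  # OS entropy unless a seeded RNG is passed
    while True:
        middle = "".join(str(rng.randrange(10)) for _ in range(len(pan) - 7))
        body = pan[:6] + middle
        candidate = body + luhn_check_digit(body)
        if candidate != pan:  # assumption: never hand back the original PAN
            return candidate


# The input here is the test card number quoted elsewhere in this standard.
print(anonymise_pan("4242424242424242"))  # 424242 followed by ten new digits
```

Because the middle digits are drawn at random, the output is not repeatable across runs. If card numbers must stay stable across tables, deriving the middle digits from the keyed hash described under Consistent Hashing would be one option.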

Consistent Hashing

Consistent hashing is critical for maintaining referential integrity across tables:

A per-environment salt is used for all hashing. The salt is stored in the secret manager and rotated quarterly.
The same real MSISDN always produces the same anonymised MSISDN within an environment. This means a consumer's transactions, KYC records, and wallet balance all link correctly after anonymisation.
Different environments use different salts, so anonymised data cannot be correlated across environments.
The salt is never stored alongside the anonymised data. Production salt recovery is a two-person process requiring security team approval.
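
The properties above can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline's code: it uses HMAC-SHA-256 keyed with the salt as a swapped-in variant of "SHA-256 of original + salt", the function names are hypothetical, and the salt value is a placeholder (real salts come from the secret manager).

```python
import hashlib
import hmac


def _keyed_int(salt: bytes, field: str, value: str) -> int:
    """HMAC-SHA-256 of one field value, returned as a big integer."""
    mac = hmac.new(salt, f"{field}:{value}".encode("utf-8"), hashlib.sha256)
    return int.from_bytes(mac.digest(), "big")


def anonymise_msisdn_pk(msisdn: str, salt: bytes) -> str:
    """Map a Pakistani MSISDN to a stable value shaped like 03xx-xxxxxxx."""
    digits = "".join(ch for ch in msisdn if ch.isdigit())
    if len(digits) == 12 and digits.startswith("92"):
        digits = "0" + digits[2:]  # +92 300 ... and 0300 ... are the same number
    n = _keyed_int(salt, "msisdn", digits)
    return f"03{n % 100:02d}-{(n // 100) % 10_000_000:07d}"


def anonymise_email(email: str, salt: bytes) -> str:
    """Keyed hash of the normalised address under the test domain."""
    normalised = email.strip().lower()
    digest = hmac.new(salt, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{digest}@test.simpaisa.com"


def dob_shift_days(person_key: str, salt: bytes) -> int:
    """Stable date-of-birth offset in the range -180..180 days."""
    return _keyed_int(salt, "dob", person_key) % 361 - 180


staging_salt = b"example-staging-salt"  # placeholder, not a real secret
print(anonymise_msisdn_pk("0300-1234567", staging_salt))
print(anonymise_msisdn_pk("+92 300 1234567", staging_salt))  # same output as the line above
```

Squeezing a hash into nine free digits leaves only 10^9 possible outputs, so two different real numbers can map to the same anonymised number once a dataset reaches tens of thousands of records. A real implementation would need a collision strategy, which this sketch omits.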

Synthetic Data Generation

For sandbox and development environments where no production data dependency is desired:

Synthetic Data Generator

A Go service generates realistic test data:

Data Set	Volume	Characteristics
Merchants	500 per market	Mix of sizes (micro, small, medium, enterprise), all four products, realistic configuration
Consumers	50,000 per market	Locale-appropriate names, valid-format MSISDNs and IDs, realistic demographics
Transactions	1M per market	Realistic distribution (80% Pay-In, 10% Pay-Out, 8% Remittance, 2% Cards), peak hours, seasonal patterns
Channels	All active channels per market	Realistic success/failure rates, response times, settlement files
Settlements	Daily for 90 days	Matching transaction volumes, realistic fees and netting
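
To show how the transaction mix in the table could be reproduced, the following sketch draws product types with the stated weights from a seeded generator. It is written in Python for brevity even though the generator itself is specified as a Go service. `PRODUCT_MIX` and `sample_products` are hypothetical names, and peak-hour and seasonal patterns are left out because this standard does not quantify them.

```python
import random
from collections import Counter
from typing import List

# Weights taken from the Transactions row of the table above.
PRODUCT_MIX = {"Pay-In": 0.80, "Pay-Out": 0.10, "Remittance": 0.08, "Cards": 0.02}


def sample_products(count: int, seed: int) -> List[str]:
    """Draw `count` product labels with the configured weights."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated exactly
    return rng.choices(list(PRODUCT_MIX), weights=list(PRODUCT_MIX.values()), k=count)


sample = sample_products(100_000, seed=62)
shares = {product: n / len(sample) for product, n in Counter(sample).items()}
print(shares)  # each share should sit close to its configured weight
```

Seeding the generator is a design choice for this sketch: it lets the same synthetic dataset be rebuilt on demand, which makes a failing test run easier to reproduce.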

Sandbox-Specific Rules

Sandbox uses only synthetic data. No anonymised production data.
Test card numbers follow industry-standard test ranges (e.g., 4242424242424242).
Test MSISDNs use reserved ranges per market (coordinated with mobile operators where required).
Synthetic channel responses simulate realistic behaviour: success rates, timeouts, specific error codes.
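
One way to simulate channel behaviour is to draw each response from a per-channel profile of success, timeout, and error rates. The sketch below is illustrative only: `ChannelProfile`, `simulate_response`, the rates, and the error code strings are all hypothetical, since this standard does not list actual values.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ChannelProfile:
    success_rate: float            # share of calls that succeed
    timeout_rate: float            # share of calls that time out
    error_codes: Tuple[str, ...]   # codes used for the remaining failures


def simulate_response(profile: ChannelProfile, rng: random.Random) -> Dict[str, Optional[str]]:
    """Return one synthetic channel response drawn from the profile."""
    roll = rng.random()
    if roll < profile.success_rate:
        return {"status": "SUCCESS", "error_code": None}
    if roll < profile.success_rate + profile.timeout_rate:
        return {"status": "TIMEOUT", "error_code": None}
    return {"status": "FAILED", "error_code": rng.choice(profile.error_codes)}


# Hypothetical wallet channel: 92% success, 3% timeouts, 5% hard failures.
wallet = ChannelProfile(0.92, 0.03, ("INSUFFICIENT_FUNDS", "ACCOUNT_BLOCKED"))
rng = random.Random(69)
print([simulate_response(wallet, rng)["status"] for _ in range(5)])
```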

Anonymisation Pipeline

Architecture

Production MySQL/SurrealDB
  → Temporal Workflow (weekly, Saturday 02:00 UTC)
    → Export (read-only replica)
      → Anonymise (field-level rules applied)
        → Quality Check (no real PII in output)
          → Load to target environment
            → Verification (PII scan of target)

Pipeline Rules

Rule	Requirement
Schedule	Weekly (Saturday 02:00 UTC). Manual trigger available for ad hoc needs
Source	Read-only replica of production. Never reads from primary
Scope	Configurable per target environment (staging gets full dataset; QA gets subset)
Idempotency	Each run produces a complete replacement. No incremental anonymisation (avoids data correlation risks)
Verification	Post-load PII scan checks for any unanonymised data. Pipeline fails if real PII is detected
Retention	Previous anonymised dataset is deleted before new dataset is loaded. No accumulation
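
The ordering of these rules can be sketched as a single function that anonymises a snapshot, checks the output, replaces the target dataset in full, and then verifies what was loaded. This is plain Python with in-memory lists, not Temporal SDK code. `run_pipeline`, `PipelineError`, and the toy `mask_email` and `has_real_email` helpers are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


class PipelineError(RuntimeError):
    """Raised when a run must stop because real PII was detected."""


def run_pipeline(
    snapshot: List[Row],
    environments: Dict[str, List[Row]],
    target: str,
    anonymise: Callable[[Row], Row],
    has_real_pii: Callable[[Row], bool],
) -> int:
    # Export + Anonymise: work on copies so the snapshot is never mutated.
    output = [anonymise(dict(row)) for row in snapshot]

    # Quality Check: stop before the target environment is touched.
    if any(has_real_pii(row) for row in output):
        raise PipelineError("quality check failed: real PII in anonymised output")

    # Retention + Idempotency: drop the previous dataset, then load a full replacement.
    environments[target] = []
    environments[target] = output

    # Verification: scan what was actually loaded and fail closed.
    if any(has_real_pii(row) for row in environments[target]):
        environments[target] = []
        raise PipelineError("verification failed: real PII found in target")
    return len(environments[target])


def mask_email(row: Row) -> Row:
    row["email"] = f"user{row['id']}@test.simpaisa.com"
    return row


def has_real_email(row: Row) -> bool:
    return not row["email"].endswith("@test.simpaisa.com")


snapshot = [{"id": "1", "email": "someone@example.com"}]
envs = {"staging": [{"id": "0", "email": "old@test.simpaisa.com"}]}
loaded = run_pipeline(snapshot, envs, "staging", mask_email, has_real_email)
print(loaded, envs["staging"])
```

Running the quality check before the previous dataset is deleted means a bad run leaves the target environment as it was. Only a failure in the final verification step empties the target.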

PII Detection Scan

After every anonymisation run, an automated scan checks the target environment for residual real PII:

Check	Method
MSISDN format	Regex match against known real operator prefixes (vs test prefixes)
CNIC/NID format	Checksum validation against real ID algorithms
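PAN	Luhn validation + BIN lookup against real card ranges (vs test ranges). Because anonymised PANs keep their real BIN under the BIN-preserving rule, a BIN match alone does not distinguish them from real PANs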
Email	Check for real domain names (gmail.com, yahoo.com vs test.simpaisa.com)
Name	Statistical comparison against common real name databases

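If any check fails, the run is marked as failed, the newly loaded dataset is removed from the target environment, and an alert is sent to the security team. The previous dataset cannot be restored, because it is deleted before each load.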
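
Two of the checks in the table, email domain and PAN, can be sketched as a row scanner that returns a list of findings. This is a minimal Python illustration under assumptions: `scan_row`, `scan_dataset`, and the `TEST_PANS` allow-list are hypothetical, and the allow-list stands in for a real BIN range lookup, which this standard does not specify.

```python
from typing import Dict, Iterable, List

ALLOWED_EMAIL_DOMAIN = "test.simpaisa.com"
TEST_PANS = {"4242424242424242"}  # hypothetical allow-list of known test cards


def luhn_valid(number: str) -> bool:
    """Standard Luhn check over a full digit string, check digit included."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def scan_row(row: Dict[str, str]) -> List[str]:
    """Return a description of every suspected real value in one row."""
    findings = []
    email = row.get("email", "")
    if email and not email.lower().endswith("@" + ALLOWED_EMAIL_DOMAIN):
        findings.append("email: domain is not the test domain")
    pan = row.get("pan", "")
    if pan.isdigit() and 13 <= len(pan) <= 19 and luhn_valid(pan) and pan not in TEST_PANS:
        findings.append("pan: Luhn-valid number outside the test allow-list")
    return findings


def scan_dataset(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Scan every row; an empty result means the dataset passed."""
    findings = []
    for index, row in enumerate(rows):
        findings.extend(f"row {index}: {finding}" for finding in scan_row(row))
    return findings


rows = [
    {"email": "abc@test.simpaisa.com", "pan": "4242424242424242"},
    {"email": "someone@gmail.com", "pan": "4111111111111111"},
]
for finding in scan_dataset(rows):
    print(finding)
```

As written, the PAN rule suits synthetic environments, where every card number comes from a test range. BIN-preserved PANs in staging and QA are Luhn-valid by design, so they would need a different signal than the one shown here.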

Environment Classification

Environment	Data Source	Refresh Frequency	PII Allowed
Production	Real data	N/A	Yes (with PCI DSS controls)
Staging	Anonymised production snapshot	Weekly	No
QA	Anonymised production subset	Weekly	No
Development	Synthetic data	On-demand	No
Sandbox (merchant-facing)	Synthetic data	On-demand	No
Local (developer laptops)	Synthetic data	On-demand	No

Compliance Impact

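PCI DSS Requirement 6.5.5 (v4.0 numbering; 6.4.3 in v3.2.1): Live PANs must not be used in pre-production environments unless those environments are part of the cardholder data environment and protected accordingly. This standard addresses the requirement by keeping real PANs out of every non-production environment.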
Data protection laws: Pakistan (PECA 2016), Bangladesh (Digital Security Act), and other market regulations restrict the use of personal data outside its original purpose. Anonymisation ensures non-production use complies.
Central bank regulations: Financial regulators in all six markets require controls on data used in testing environments. This standard provides auditable evidence of those controls.

Actions

Immediate: Purge all real PII from the staging environment. Replace with anonymised data from the first pipeline run. Target: 2 weeks.
Immediate: Audit developer laptops for production data extracts. Revoke the access used to create them and securely delete any extracts found. Target: 1 week.
Month 1: Build and deploy the anonymisation pipeline (Temporal workflow). Run first production snapshot anonymisation for staging.
Month 1–2: Build the synthetic data generator for sandbox and development environments. Deploy to sandbox.
Month 2: Implement the post-anonymisation PII detection scan. Integrate into the pipeline as a mandatory verification step.
Month 2–3: Configure per-environment salt rotation (quarterly). Document the two-person salt recovery process.
Ongoing: Weekly anonymisation pipeline runs. Quarterly salt rotation. Annual audit of all non-production environments for PII compliance.