Data Anonymisation for Testing¶
Standard ID: STD-DATA-062 | Version: 1.0 | Last Updated: 2026-04-03 | Owner: Data Team / Security Team | Status: Active
Purpose¶
Define mandatory standards for data anonymisation, synthetic data generation, and test dataset creation at Simpaisa. No real PII or PAN data may exist outside the production environment. This standard ensures that sandbox, staging, development, QA, and local environments operate exclusively on anonymised or synthetic data.
Current State¶
- Production data in staging: The staging environment contains a snapshot of production data from 6 months ago, including real MSISDNs, CNICs/NIDs, names, and bank account numbers. This is a PCI DSS violation and a regulatory risk across all six markets.
- No synthetic data generation: The sandbox environment (used by merchants for integration testing) contains manually fabricated test records. Coverage is poor: only a handful of test merchants and scenarios exist.
- Developer local environments: Some developers have partial production data extracts on their laptops for debugging. This is uncontrolled and undocumented.
- No anonymisation pipeline: There is no automated mechanism to create anonymised datasets from production data. Any data transfer between environments is manual and ad hoc.
Target State¶
- Zero real PII/PAN outside production. All non-production environments use exclusively anonymised or synthetic data.
- Automated anonymisation pipeline produces test datasets from production snapshots on a weekly schedule.
- Synthetic data generator creates realistic test data for sandbox and development environments without any production data dependency.
- Referential integrity is preserved: anonymised data maintains foreign key relationships and cross-table consistency.
Anonymisation Rules¶
Field-Level Rules¶
| Data Category | Fields | Anonymisation Method | Details |
|---|---|---|---|
| Mobile number (MSISDN) | msisdn, phone, mobile | Consistent hash | SHA-256 of original + salt, mapped to valid format (e.g., 03xx-xxxxxxx). Same input always produces same output for referential integrity |
| National ID (CNIC/NID) | cnic, nid, nationalId | Consistent hash | Hashed to valid format per country (PK: 13 digits, BD: 10/17 digits) |
| Name | firstName, lastName, fullName | Faker replacement | Replaced with locale-appropriate fake names (Urdu names for PK, Bengali for BD, etc.) |
| Email | email | Consistent hash + domain | sha256(email)@test.simpaisa.com. Same email always maps to same anonymised email |
| Bank account | accountNumber, iban | Consistent hash | Hashed to valid format per bank. Check digits recalculated |
| Card number (PAN) | pan, cardNumber | BIN-preserving | First 6 digits (BIN) preserved, remaining digits randomised, Luhn check digit recalculated |
| Address | address, city, postalCode | Faker replacement | Replaced with fake addresses in the correct market/city |
| Date of birth | dob, dateOfBirth | Date shift | Shifted by a random offset (±180 days), consistent per individual |
| Transaction amount | amount | Preserved | Amounts are NOT anonymised: they are non-identifying and needed for realistic testing |
| Transaction reference | transactionId, reference | Preserved | References are system-generated and non-identifying |
| Merchant ID | merchantId | Preserved | Merchant IDs are system-generated. Merchant names and contact details ARE anonymised |
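
The BIN-preserving rule for card numbers can be illustrated with a minimal sketch. The function names `luhn_check_digit`, `luhn_valid`, and `anonymise_pan` are hypothetical and not part of any existing Simpaisa codebase; the sketch assumes the PAN arrives as a plain digit string and uses Python only for illustration.

```python
import secrets


def luhn_check_digit(partial: str) -> int:
    """Return the Luhn check digit for a digit string that lacks its check digit."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(pan: str) -> bool:
    """True if the last digit of `pan` is the correct Luhn check digit."""
    return luhn_check_digit(pan[:-1]) == int(pan[-1])


def anonymise_pan(pan: str) -> str:
    """Keep the first 6 digits (BIN), randomise the rest, recalculate the check digit."""
    bin_part = pan[:6]
    middle_len = len(pan) - 6 - 1  # digits between the BIN and the check digit
    middle = "".join(str(secrets.randbelow(10)) for _ in range(middle_len))
    partial = bin_part + middle
    return partial + str(luhn_check_digit(partial))
```

For example, `anonymise_pan("4242424242424242")` returns a 16-digit string that still starts with `424242` and passes `luhn_valid`. Because the middle digits are random, the output is not consistent between runs, which matches the "randomised" wording in the table rather than the consistent-hash method used for other fields.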
Consistent Hashing¶
Consistent hashing is critical for maintaining referential integrity across tables:
- A per-environment salt is used for all hashing. The salt is stored in the secret manager and rotated quarterly.
- For a given salt, the same real MSISDN always produces the same anonymised MSISDN within an environment. This means a consumer's transactions, KYC records, and wallet balance all link correctly after anonymisation.
- Different environments use different salts, so anonymised data cannot be correlated across environments.
- The salt is never stored alongside the anonymised data. Production salt recovery is a two-person process requiring security team approval.
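
The properties above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the helper names are hypothetical, the salt is passed in as bytes (in practice it would come from the secret manager), and deriving the date-of-birth offset from the hash is just one way to make a "random" shift consistent per individual. The sketch also does not restrict output MSISDNs to reserved test prefixes, which a real implementation would need to address.

```python
import hashlib
from datetime import date, timedelta


def salted_digest(value: str, salt: bytes) -> bytes:
    """SHA-256 of salt + original value, as described in the field-level rules."""
    return hashlib.sha256(salt + value.encode("utf-8")).digest()


def anonymise_msisdn_pk(msisdn: str, salt: bytes) -> str:
    """Map a real MSISDN to the 03xx-xxxxxxx format (prefix handling omitted)."""
    n = int.from_bytes(salted_digest(msisdn, salt), "big")
    digits = f"{n % 10**9:09d}"  # nine digits following the leading "03"
    return f"03{digits[:2]}-{digits[2:]}"


def anonymise_email(email: str, salt: bytes) -> str:
    """Hex digest as the local part, fixed test domain."""
    return salted_digest(email, salt).hex() + "@test.simpaisa.com"


def shift_dob(dob: date, individual_key: str, salt: bytes) -> date:
    """Shift a date of birth by an offset in [-180, +180] days, fixed per individual."""
    n = int.from_bytes(salted_digest(individual_key, salt)[:4], "big")
    return dob + timedelta(days=n % 361 - 180)
```

With the same salt, repeated calls on the same input return the same value, so rows in different tables that share an MSISDN still join after anonymisation. With a different salt (another environment, or after rotation), the mapping changes.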
Synthetic Data Generation¶
For sandbox and development environments where no production data dependency is desired:
Synthetic Data Generator¶
A Go service generates realistic test data:
| Data Set | Volume | Characteristics |
|---|---|---|
| Merchants | 500 per market | Mix of sizes (micro, small, medium, enterprise), all four products, realistic configuration |
| Consumers | 50,000 per market | Locale-appropriate names, valid-format MSISDNs and IDs, realistic demographics |
| Transactions | 1M per market | Realistic distribution (80% Pay-In, 10% Pay-Out, 8% Remittance, 2% Cards), peak hours, seasonal patterns |
| Channels | All active channels per market | Realistic success/failure rates, response times, settlement files |
| Settlements | Daily for 90 days | Matching transaction volumes, realistic fees and netting |
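
As a rough illustration of the transaction mix in the table, the sketch below samples product types with the stated weights. It is written in Python for brevity even though the generator itself is described as a Go service; `PRODUCT_MIX` and `sample_products` are hypothetical names, and the fixed seed is an assumption made so that test datasets are reproducible.

```python
import random
from collections import Counter

# Product shares taken from the Transactions row of the table above.
PRODUCT_MIX = {"Pay-In": 0.80, "Pay-Out": 0.10, "Remittance": 0.08, "Cards": 0.02}


def sample_products(n: int, seed: int) -> list[str]:
    """Draw n product labels according to PRODUCT_MIX, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.choices(list(PRODUCT_MIX), weights=list(PRODUCT_MIX.values()), k=n)


if __name__ == "__main__":
    counts = Counter(sample_products(100_000, seed=42))
    for product, count in counts.most_common():
        print(f"{product}: {count / 100_000:.3f}")
```

Peak-hour and seasonal patterns would need additional weighting on the timestamp, which this sketch leaves out.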
Sandbox-Specific Rules¶
- Sandbox uses only synthetic data. No anonymised production data.
- Test card numbers follow industry-standard test ranges (e.g., 4242424242424242).
- Test MSISDNs use reserved ranges per market (coordinated with mobile operators where required).
- Synthetic channel responses simulate realistic behaviour: success rates, timeouts, specific error codes.
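
One possible shape for the simulated channel behaviour is sketched below. The `ChannelProfile` type, the `simulate_response` function, and the error code strings are all hypothetical and chosen only for illustration; real rates and codes would come from each channel's observed behaviour.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelProfile:
    success_rate: float            # share of calls that succeed
    timeout_rate: float            # share of calls that time out
    error_codes: tuple[str, ...]   # codes used for the remaining failures


def simulate_response(profile: ChannelProfile, rng: random.Random) -> str:
    """Return SUCCESS, TIMEOUT, or one of the profile's error codes."""
    roll = rng.random()  # uniform in [0.0, 1.0)
    if roll < profile.success_rate:
        return "SUCCESS"
    if roll < profile.success_rate + profile.timeout_rate:
        return "TIMEOUT"
    return rng.choice(profile.error_codes)


if __name__ == "__main__":
    # Hypothetical profile: 92% success, 3% timeouts, 5% explicit errors.
    wallet = ChannelProfile(0.92, 0.03, ("INSUFFICIENT_FUNDS", "INVALID_ACCOUNT"))
    rng = random.Random(7)
    print([simulate_response(wallet, rng) for _ in range(5)])
```

Passing in a seeded `random.Random` keeps sandbox scenarios repeatable, which helps merchants reproduce a failing integration test.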
Anonymisation Pipeline¶
Architecture¶
Production MySQL/SurrealDB
→ Temporal Workflow (weekly, Saturday 02:00 UTC)
→ Export (read-only replica)
→ Anonymise (field-level rules applied)
→ Quality Check (no real PII in output)
→ Load to target environment
→ Verification (PII scan of target)
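
The ordering of these stages can be sketched in plain Python. This deliberately does not use the Temporal SDK; in the real pipeline each step would presumably be a workflow activity. The `run_pipeline` function and its callable parameters are hypothetical stand-ins for the export, anonymise, check, load, and verification stages.

```python
from typing import Callable, Iterable

Row = dict[str, str]


def run_pipeline(
    export: Callable[[], Iterable[Row]],       # reads from the read-only replica
    anonymise: Callable[[Row], Row],           # applies the field-level rules
    has_real_pii: Callable[[Row], bool],       # PII detector used for both checks
    load: Callable[[list[Row]], None],         # writes to the target environment
    read_target: Callable[[], Iterable[Row]],  # reads back what was loaded
) -> int:
    """Run the stages in order; raise before or after load if real PII is found."""
    rows = [anonymise(row) for row in export()]

    # Quality check: nothing is loaded if the anonymised output still has real PII.
    if any(has_real_pii(row) for row in rows):
        raise RuntimeError("quality check failed: real PII in anonymised output")

    load(rows)

    # Verification: scan the target itself, not just the in-memory output.
    if any(has_real_pii(row) for row in read_target()):
        raise RuntimeError("verification failed: real PII found in target")

    return len(rows)
```

Running the quality check before the load means a failed anonymisation step never reaches the target environment; the post-load scan then acts as an independent second check.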
Pipeline Rules¶
| Rule | Requirement |
|---|---|
| Schedule | Weekly (Saturday 02:00 UTC). Manual trigger available for ad hoc needs |
| Source | Read-only replica of production. Never reads from primary |
| Scope | Configurable per target environment (staging gets full dataset; QA gets subset) |
| Idempotency | Each run produces a complete replacement. No incremental anonymisation (avoids data correlation risks) |
| Verification | Post-load PII scan checks for any unanonymised data. Pipeline fails if real PII is detected |
| Retention | Previous anonymised dataset is deleted before new dataset is loaded. No accumulation |
PII Detection Scan¶
After every anonymisation run, an automated scan checks the target environment for residual real PII:
| Check | Method |
|---|---|
| MSISDN format | Regex match against known real operator prefixes (vs test prefixes) |
| CNIC/NID format | Checksum validation against real ID algorithms |
| PAN | Luhn validation + BIN lookup against real card ranges (vs test ranges) |
| Email | Check for real domain names (gmail.com, yahoo.com vs test.simpaisa.com) |
| Name | Statistical comparison against common real name databases |
If any check fails, the pipeline is rolled back and an alert is sent to the security team.
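
A minimal sketch of three of these checks follows. All names are hypothetical, and the prefix and BIN lists are illustrative placeholders that would need to be replaced with maintained reference data. The source does not specify how BIN-preserved anonymised PANs are told apart from real ones, so this sketch only allow-lists known test BINs.

```python
import re

TEST_EMAIL_DOMAIN = "test.simpaisa.com"


def luhn_valid(number: str) -> bool:
    """Standard Luhn validation over a full digit string."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit from the right is doubled
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def email_is_suspect(email: str) -> bool:
    """Flag any address outside the test domain."""
    return not email.lower().endswith("@" + TEST_EMAIL_DOMAIN)


def pan_is_suspect(pan: str, test_bins: set[str]) -> bool:
    """Flag Luhn-valid numbers whose BIN is not an allow-listed test BIN."""
    return pan.isdigit() and luhn_valid(pan) and pan[:6] not in test_bins


def msisdn_is_suspect(msisdn: str, real_prefixes: tuple[str, ...]) -> bool:
    """Flag numbers that start with a known real operator prefix."""
    digits = re.sub(r"\D", "", msisdn)  # drop dashes, spaces, and plus signs
    return digits.startswith(real_prefixes)
```

A scan would apply these functions to every relevant column in the target and fail the run on the first positive result.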
Environment Classification¶
| Environment | Data Source | Refresh Frequency | PII Allowed |
|---|---|---|---|
| Production | Real data | N/A | Yes (with PCI DSS controls) |
| Staging | Anonymised production snapshot | Weekly | No |
| QA | Anonymised production subset | Weekly | No |
| Development | Synthetic data | On-demand | No |
| Sandbox (merchant-facing) | Synthetic data | On-demand | No |
| Local (developer laptops) | Synthetic data | On-demand | No |
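
The classification table lends itself to a machine-readable policy that tooling can enforce before any load. The sketch below is one way to express it; the dictionary keys, the data-kind labels (`real`, `anonymised`, `synthetic`), and the `assert_load_allowed` function are hypothetical names introduced for illustration.

```python
# Permitted data source per environment, mirroring the table above.
ENVIRONMENT_POLICY = {
    "production": "real",
    "staging": "anonymised",
    "qa": "anonymised",
    "development": "synthetic",
    "sandbox": "synthetic",
    "local": "synthetic",
}


def assert_load_allowed(environment: str, data_kind: str) -> None:
    """Raise if `data_kind` is not the permitted source for `environment`."""
    try:
        allowed = ENVIRONMENT_POLICY[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment}") from None
    if data_kind != allowed:
        raise PermissionError(
            f"{environment} accepts only {allowed} data, got {data_kind}"
        )
```

Rejecting unknown environments by default means a newly created environment cannot receive any data until it has been classified.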
Compliance Impact¶
- PCI DSS Requirement 6.5.5 (v4.0; numbered 6.4.3 in v3.2.1): Live PANs must not be used in pre-production environments. This standard directly addresses this requirement.
- Data protection laws: Pakistan (PECA 2016), Bangladesh (Digital Security Act), and other market regulations restrict the use of personal data outside its original purpose. Anonymisation ensures non-production use complies.
- Central bank regulations: Financial regulators in all six markets require controls on data used in testing environments. This standard provides auditable evidence of those controls.
Actions¶
- Immediate: Purge all real PII from the staging environment. Replace with anonymised data from the first pipeline run. Target: 2 weeks.
- Immediate: Audit developer laptops for production data extracts. Revoke and delete any found. Target: 1 week.
- Month 1: Build and deploy the anonymisation pipeline (Temporal workflow). Run first production snapshot anonymisation for staging.
- Month 1–2: Build the synthetic data generator for sandbox and development environments. Deploy to sandbox.
- Month 2: Implement the post-anonymisation PII detection scan. Integrate into the pipeline as a mandatory verification step.
- Month 2–3: Configure per-environment salt rotation (quarterly). Document the two-person salt recovery process.
- Ongoing: Weekly anonymisation pipeline runs. Quarterly salt rotation. Annual audit of all non-production environments for PII compliance.