Simpaisa Infrastructure Standards
Version: 1.0.0
Date: 2026-04-03
Owner: CDO (Daniel O'Reilly)
Classification: Internal — Architecture & Engineering Leadership
Status: Living Document — Prototype / AI SDLC Showcase
Table of Contents
- Executive Summary
- Infrastructure Principles
- Environment Strategy
- Compute Standards
- Networking Standards
- Edge & CDN (Cloudflare)
- API Gateway (KrakenD)
- Observability Stack
- Identity & Access (ControlPlane.com)
- Data Infrastructure
- Secret Management
- Disaster Recovery & Business Continuity
- Compliance Infrastructure Requirements
- Infrastructure as Code
- CI/CD Pipeline Standards
- Cost Management
- Migration Roadmap
- Appendix: Infrastructure Controls Checklist
1. Executive Summary
This document defines the infrastructure standards for Simpaisa's payment gateway platform, which processes 270M+ transactions worth $1B+ across Pakistan, Bangladesh, Nepal, Iraq, and Egypt. It covers four product lines: Pay-Ins, Pay-Outs, Remittances, and Cards.
Context
This is a prototype and showcase of AI SDLC capabilities. The organisation is adopting an agentic AI SDLC-first approach — the team structure will be reorganised as required to support this model.
Current State Summary
Simpaisa runs on AWS with sound foundational infrastructure (Multi-AZ, WAF, ALB, ASG, RDS, ElastiCache) but has significant gaps in observability, API gateway, disaster recovery documentation, and distributed tracing. The platform uses Spring Boot / Java services on EC2.
Target State Summary
The target architecture moves towards a cloud-native, multi-provider model:
| Layer | Current | Target |
| --- | --- | --- |
| Edge/CDN | AWS WAF only | Cloudflare (CDN, WAF, DDoS, Workers, Pages, R2, DNS) |
| API Gateway | None | KrakenD |
| Compute | EC2 + ASG (Spring Boot/Java) | Containers (Go services) + Unikraft unikernels (assess) |
| Reverse Proxy | ALB direct | Caddy (per-service, mTLS) behind ALB |
| Identity | Custom auth | ControlPlane.com |
| Observability | CloudWatch | OpenTelemetry → Grafana / Jaeger / OpenSearch |
| Analytics | None | PostHog |
| Database | RDS MySQL (shared) | SurrealDB (new services) + MySQL (existing) |
| Messaging | Kafka | NSQ |
| Search | None | Meilisearch (merchant-facing) + OpenSearch (logs) |
| Workflow | None | Temporal |
| Hosting | AWS only | Cloudflare preferred + AWS for existing |
Critical Gaps
| Gap | Priority | Impact |
| --- | --- | --- |
| No API Gateway | CRITICAL | No centralised rate limiting, auth verification, or request validation |
| Single shared RDS | CRITICAL | Single point of failure, no service isolation |
| No DR documentation | CRITICAL | Unknown recovery posture |
| No distributed tracing | HIGH | Cannot trace transactions end-to-end across services |
| No CDN | HIGH | Latency for merchant-facing assets, no edge caching |
| Single ElastiCache cluster | HIGH | Cache failure impacts all services |
| No blue/green or canary | MEDIUM | Risky deployments with potential downtime |
| No IaC documented | MEDIUM | Infrastructure drift, no reproducibility |
2. Infrastructure Principles
2.1 Cloud-Native
All new services MUST be designed as cloud-native, containerised workloads. Infrastructure MUST be provisioned through APIs, not manual console operations.
2.2 Infrastructure as Code
All infrastructure MUST be defined in version-controlled code. No manual provisioning or configuration changes in any environment. Drift detection MUST run on every deployment.
2.3 Immutable Deployments
Infrastructure and application artefacts MUST be immutable. No in-place updates to running instances. Every deployment creates new artefacts; rollback means deploying the previous artefact.
2.4 Observability-First
Every service MUST emit structured logs, metrics, and traces from day one. Observability is not optional — it is a deployment prerequisite. OpenTelemetry is the mandatory instrumentation standard.
2.5 Security by Default
All network traffic MUST be encrypted in transit (TLS 1.2 minimum, TLS 1.3 preferred). All data at rest MUST be encrypted. Zero-trust networking: no implicit trust between services. mTLS for all service-to-service communication.
2.6 Multi-Jurisdiction Compliance
Infrastructure MUST satisfy regulatory requirements across all operating jurisdictions (Pakistan, Bangladesh, Nepal, Iraq, Egypt). Data residency requirements MUST be met per jurisdiction. Compliance controls MUST be auditable and evidenced.
2.7 Least Privilege
All access — human and machine — MUST follow the principle of least privilege. Service accounts MUST have only the permissions required for their function. Permissions MUST be reviewed quarterly.
2.8 Automation Over Process
Automate everything that can be automated. Manual processes are a source of error and a barrier to scale. If a runbook step is repeated more than twice, it MUST be automated.
3. Environment Strategy
3.1 Environment Definitions
| Environment | Purpose | URL Pattern | Access | Data |
| --- | --- | --- | --- | --- |
| Sandbox | Merchant-facing testing and integration | sandbox.simpaisa.com | Merchants + internal | Synthetic test data only |
| Dev | Internal development and experimentation | dev.internal.simpaisa.com | Engineering only | Synthetic / anonymised |
| Test | Automated testing, QA, UAT | test.internal.simpaisa.com | Engineering + QA | Synthetic / anonymised |
| Prod | Live production traffic | api.simpaisa.com | Controlled access | Real customer/merchant data |
3.2 Environment Parity Requirements
| Aspect | Requirement |
| --- | --- |
| Architecture | All environments MUST use the same architectural patterns (ALB, ASG, VPC layout) |
| Configuration | Same configuration structure, different values per environment |
| Infrastructure | Dev/Test may use smaller instance sizes; architecture MUST match Prod |
| Networking | Same VPC/subnet design; security groups MUST be equivalent |
| Secrets | Each environment has its own secrets; NEVER share across environments |
| Databases | Same engine and version across all environments |
| Monitoring | All environments MUST have observability; alerting thresholds differ |
3.3 Data Segregation
- Production data MUST NEVER be copied to lower environments without anonymisation
- Each environment MUST have its own database instances, cache clusters, and message queues
- Payment channel credentials MUST be environment-specific (sandbox credentials for Sandbox, live for Prod)
- PII MUST NOT exist in Dev or Test environments
- Sandbox MUST simulate realistic payment channel responses (success, failure, timeout, partial)
3.4 Environment Promotion

```
Dev → Test → Prod
 │      │      │
 │      │      └── Requires: all quality gates passed, change approval, deployment window
 │      └───────── Requires: all automated tests pass, security scan clean
 └──────────────── Requires: code review, unit tests pass, lint clean
```
| Gate | Dev → Test | Test → Prod |
| --- | --- | --- |
| Code review | Required | N/A (already done) |
| Unit tests | Pass | Pass |
| Integration tests | Run | Pass (mandatory) |
| Security scan | Run | Pass (mandatory, zero critical/high) |
| Performance test | Optional | Required for payment-path changes |
| Change approval | Not required | Required (CDO or delegate) |
| Deployment window | Any time | Scheduled (avoid peak transaction hours) |
| Rollback plan | Documented | Documented and tested |
3.5 Sandbox-Specific Requirements
The Sandbox environment is merchant-facing and MUST:
- Be available 99.5% of the time (separate SLA from Prod)
- Provide realistic response times (within 2x of Prod P95)
- Support all payment channels with simulated responses
- Provide test credentials and documentation
- Allow merchants to trigger specific scenarios (success, decline, timeout, insufficient funds)
- Have its own KrakenD instance with the same rate limiting configuration as Prod
- Log all requests for merchant support and debugging
4. Compute Standards
4.1 Current State
| Aspect | Detail |
| --- | --- |
| Platform | AWS EC2 instances |
| Scaling | Auto Scaling Groups (ASG) |
| Runtime | Spring Boot / Java |
| Load Balancing | Application Load Balancer (ALB) in public subnets |
| Availability | Multi-AZ deployment |
| Deployment | Rolling updates via ASG |
4.2 Target State
| Aspect | Detail |
| --- | --- |
| New services | Go services in containers |
| Security-critical | Unikraft unikernels (assess phase — evaluate for payment processing core) |
| Reverse proxy | Caddy per-service (behind ALB, providing mTLS termination) |
| Orchestration | TBC — evaluate ECS Fargate, Kubernetes (EKS), or ControlPlane.com |
| Existing services | Spring Boot / Java on EC2 (maintained until rewritten) |
4.3 Sizing Guidelines
| Service Tier | Description | Min Instances | Instance Type (Current) | Auto-Scale Trigger |
| --- | --- | --- | --- | --- |
| Tier 1 — Payment Critical | Pay-In initiation, Pay-Out execution, Remittance processing, Card auth | 3 (Multi-AZ) | m5.xlarge or equivalent | CPU > 60%, request latency P99 > 500ms |
| Tier 2 — Merchant Facing | API Gateway, Sandbox, Developer Portal, Merchant Dashboard | 2 (Multi-AZ) | m5.large or equivalent | CPU > 70%, request latency P99 > 1s |
| Tier 3 — Internal | Reporting, reconciliation, back-office | 2 | m5.large or equivalent | CPU > 75% |
| Tier 4 — Infrastructure | Observability, logging, search indexing | 2 | r5.large or equivalent | Disk > 80%, memory > 85% |
4.4 Auto-Scaling Policies for Payment Workloads
Payment services MUST scale based on:
- Request rate (transactions per second)
- Response latency (P95 and P99)
- CPU utilisation
- Queue depth (for async processing)
Scale-out: aggressive (1-minute evaluation, 2-minute cooldown)
Scale-in: conservative (5-minute evaluation, 10-minute cooldown)
Payment services MUST NOT scale to zero.
Minimum capacity MUST handle 2x average traffic without scaling.
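As a sketch, the Tier 1 triggers and the capacity-floor rule above can be expressed directly. The type and function names, and the per-instance TPS figure, are illustrative assumptions, not part of the standard:

```go
package main

import (
	"fmt"
	"math"
)

// ScaleSignal carries the inputs payment services scale on (§4.4).
type ScaleSignal struct {
	CPUPercent   float64
	P99LatencyMs float64
	TPS          float64
	QueueDepth   int
}

// shouldScaleOut applies the Tier 1 triggers from §4.3:
// CPU > 60% or request latency P99 > 500ms.
// Scale-in (not shown) would use the conservative 5-minute window.
func shouldScaleOut(s ScaleSignal) bool {
	return s.CPUPercent > 60 || s.P99LatencyMs > 500
}

// minCapacity sizes the fleet floor so 2x average traffic fits without
// scaling, and never drops below the Tier 1 minimum of 3 instances.
func minCapacity(avgTPS, perInstanceTPS float64) int {
	n := int(math.Ceil(2 * avgTPS / perInstanceTPS))
	if n < 3 {
		n = 3
	}
	return n
}

func main() {
	fmt.Println(shouldScaleOut(ScaleSignal{CPUPercent: 72, P99LatencyMs: 300})) // true
	fmt.Println(minCapacity(400, 150))                                          // 6
}
```

In practice these thresholds live in the scaling policy configuration, not in service code; the sketch only makes the arithmetic behind the floor rule explicit.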
4.5 Deployment Strategies
| Strategy | Current State | Target State |
| --- | --- | --- |
| Rolling update | Yes (ASG) | Maintained for non-critical services |
| Blue/green | No | Required for Tier 1 (payment-critical) services |
| Canary | No | Required for API Gateway and payment initiation |
Blue/Green Requirements:
- Two identical environments (blue and green)
- Traffic switch at ALB level (weighted target groups)
- Automated health checks before full cutover
- Instant rollback capability (switch back to previous colour)
- Both environments kept warm for minimum 30 minutes post-deployment
Canary Requirements:
- Initial canary: 5% of traffic
- Automated metric comparison (error rate, latency, success rate)
- Automatic rollback if error rate increases by > 0.1%
- Progressive rollout: 5% → 25% → 50% → 100%
- Minimum 10 minutes at each stage
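A minimal Go sketch of the canary decision rules above, assuming error rates are expressed as fractions (0.001 = 0.1%); all names are hypothetical:

```go
package main

import "fmt"

// canaryStages mirrors the progressive rollout: 5% → 25% → 50% → 100%.
var canaryStages = []int{5, 25, 50, 100}

// shouldRollback implements the automated comparison rule: abort the canary
// if its error rate exceeds the baseline's by more than 0.1 percentage points.
func shouldRollback(baselineErrRate, canaryErrRate float64) bool {
	return canaryErrRate-baselineErrRate > 0.001 // 0.1%
}

// nextStage returns the next traffic percentage after a healthy
// 10-minute stage, or -1 once rollout is complete.
func nextStage(current int) int {
	for i, s := range canaryStages {
		if s == current && i+1 < len(canaryStages) {
			return canaryStages[i+1]
		}
	}
	return -1
}

func main() {
	fmt.Println(shouldRollback(0.004, 0.006)) // +0.2% over baseline → true
	fmt.Println(nextStage(25))                // 50
}
```

A real controller would also compare latency and success rate, per the metric-comparison requirement, before promoting a stage.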
5. Networking Standards
5.1 VPC Design
Each environment MUST have its own VPC. VPCs MUST NOT be shared across environments.
| Environment | VPC CIDR | Region |
| --- | --- | --- |
| Prod | 10.0.0.0/16 | TBC (primary) |
| Test | 10.1.0.0/16 | TBC (same region as Prod) |
| Dev | 10.2.0.0/16 | TBC (same region as Prod) |
| Sandbox | 10.3.0.0/16 | TBC (same region as Prod) |
Note: CIDR ranges are illustrative. Final allocation requires network planning exercise including payment channel VPN requirements.
5.2 Subnet Strategy
Each VPC MUST have three subnet tiers across a minimum of two Availability Zones:
| Subnet Tier | Purpose | Internet Access | Examples |
| --- | --- | --- | --- |
| Public | Edge / ingress | Direct (IGW) | ALB, NAT Gateway, Bastion (if required) |
| Private | Application workloads | Outbound only (NAT) | EC2 instances, containers, KrakenD, Caddy |
| Isolated | Data stores | None | RDS, ElastiCache, SurrealDB, OpenSearch |
5.3 Security Groups and NACLs
Security Group Rules:
| Component | Inbound | Outbound |
| --- | --- | --- |
| ALB | 443 (HTTPS) from Cloudflare IPs only | Application ports to private subnets |
| Application instances | Application port from ALB SG only | 443 to NAT GW (external APIs), DB ports to isolated subnet |
| KrakenD | 8080 from ALB SG | Application ports to private subnets |
| RDS MySQL | 3306 from application SG only | None (stateful return traffic) |
| ElastiCache Redis | 6379 from application SG only | None |
| OpenSearch | 9200 from observability SG only | None |
NACL Rules:
- NACLs provide defence in depth at the subnet level
- Deny all by default, explicitly allow required traffic
- NACLs MUST mirror security group intent but at the subnet level
5.4 NAT Gateway Configuration
- One NAT Gateway per Availability Zone for high availability
- All private subnet outbound traffic routes through NAT Gateway
- NAT Gateway MUST be in the public subnet
- Elastic IP allocated per NAT Gateway
5.5 DNS: Cloudflare DNS
| Aspect | Standard |
| --- | --- |
| Primary DNS | Cloudflare (authoritative) |
| Internal DNS | Route 53 Private Hosted Zones (for VPC-internal resolution) |
| TTL | 300s for API endpoints, 3600s for static assets |
| DNSSEC | Enabled on all public zones |
| Records | A/AAAA records proxied through Cloudflare (orange cloud) |
5.6 DDoS Protection
| Layer | Current | Target |
| --- | --- | --- |
| Layer 7 | AWS WAF | Cloudflare WAF (primary) + AWS WAF (transitional) |
| Layer 3/4 | AWS Shield Standard | Cloudflare DDoS protection |
| Rate limiting | None centralised | Cloudflare rate limiting + KrakenD per-merchant limits |
| Bot management | None | Cloudflare Bot Management |
6. Edge & CDN (Cloudflare)
6.1 Cloudflare as Primary Edge
Cloudflare MUST be the primary edge for all Simpaisa public-facing services. All traffic MUST pass through Cloudflare before reaching AWS infrastructure.
| Service | Cloudflare Product | Purpose |
| --- | --- | --- |
| CDN | Cloudflare CDN | Cache static assets, reduce origin load |
| WAF | Cloudflare WAF | Application-layer attack protection |
| DDoS | Cloudflare DDoS Protection | Volumetric and protocol attack mitigation |
| DNS | Cloudflare DNS | Authoritative DNS with global anycast |
| Workers | Cloudflare Workers | Edge logic (rate limiting, validation, geo-routing) |
| Pages | Cloudflare Pages | Static site hosting (corporate site, developer portal) |
| R2 | Cloudflare R2 | Object storage (reports, receipts, merchant documents) |
| Bot Management | Cloudflare Bot Management | Distinguish legitimate traffic from bots |
6.2 Cloudflare Workers Use Cases
| Use Case | Description | Priority |
| --- | --- | --- |
| Geo-routing | Route requests to appropriate regional backend based on merchant jurisdiction | HIGH |
| Request validation | Validate request structure before forwarding to origin | HIGH |
| Rate limiting | First-pass rate limiting at the edge (before KrakenD) | HIGH |
| A/B testing | Route percentage of traffic to canary deployments | MEDIUM |
| IP allowlisting | Enforce merchant IP allowlists at the edge | MEDIUM |
| Response caching | Cache merchant configuration, channel status responses | MEDIUM |
| Header injection | Add tracing headers (X-Request-ID, X-Trace-ID) at the edge | HIGH |
6.3 Cloudflare Pages
| Site | Repository | Domain |
| --- | --- | --- |
| Corporate website | simpaisa.com repo | www.simpaisa.com |
| Developer portal | developer-portal repo | developer.simpaisa.com |
| Status page | status repo | status.simpaisa.com |
6.4 Cloudflare R2
| Bucket | Purpose | Retention | Access |
| --- | --- | --- | --- |
| `merchant-reports` | Generated merchant reports (CSV, PDF) | 90 days | Merchant portal (signed URLs) |
| `transaction-receipts` | Payment receipts | 7 years (compliance) | Internal + merchant API |
| `merchant-documents` | KYC/KYB documents | 10 years (compliance) | Internal only |
| `static-assets` | Images, fonts, scripts | Indefinite | Public (CDN) |
6.5 WAF Rules for Payment API Protection
| Rule | Action | Description |
| --- | --- | --- |
| Block non-HTTPS | Block | All payment API traffic MUST be HTTPS |
| Block non-JSON | Block | Payment APIs accept JSON only; block other content types |
| Block oversized requests | Block | Maximum 1MB request body for payment APIs |
| Rate limit by merchant | Challenge/Block | Per-merchant TPS limits enforced at edge |
| Block known bad IPs | Block | Threat intelligence feed integration |
| SQL injection detection | Block | OWASP CRS rules for SQLi patterns |
| Geographic restrictions | Block | Block traffic from sanctioned jurisdictions |
| Bot score filtering | Challenge | Challenge requests with bot score < 30 |
6.6 Cloudflare-to-Origin Security
- Authenticated Origin Pulls: Cloudflare presents a client certificate to the ALB; the ALB validates it
- Origin CA: Use Cloudflare Origin CA certificates on ALB
- Strict SSL mode: Full (Strict) — Cloudflare validates origin certificate
- IP allowlisting: ALB security group MUST only allow Cloudflare IP ranges (published at cloudflare.com/ips)
7. API Gateway (KrakenD)
7.1 Deployment Architecture
Cloudflare Edge → ALB → KrakenD Cluster → Caddy (mTLS) → Backend Services
| Aspect | Standard |
| --- | --- |
| Deployment | Containerised, minimum 3 instances across AZs |
| Configuration | Declarative JSON, version-controlled |
| Health check | /health endpoint, 10-second interval |
| Scaling | Horizontal, based on request rate and latency |
| State | Stateless — no persistent storage required |
7.2 Configuration Management
- KrakenD configuration MUST be declarative JSON stored in Git
- Configuration changes MUST go through the standard promotion workflow (Dev → Test → Prod)
- Configuration MUST be validated (`krakend check`) before deployment
- Flexible Configuration (FC) MUST be used to template environment-specific values
- Configuration MUST be generated from OpenAPI specifications where possible
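For illustration, a declarative KrakenD endpoint definition under this standard might look like the fragment below. Paths, hosts, and limit values are placeholders, and the exact extension options should be validated with `krakend check` against the deployed version; `qos/ratelimit/router` is KrakenD's router-level rate-limit namespace:

```json
{
  "$schema": "https://www.krakend.io/schema/v3.json",
  "version": 3,
  "endpoints": [
    {
      "endpoint": "/v1/pay-ins/transactions",
      "method": "POST",
      "extra_config": {
        "qos/ratelimit/router": {
          "max_rate": 50,
          "client_max_rate": 50,
          "strategy": "header",
          "key": "X-Merchant-Id"
        }
      },
      "backend": [
        {
          "host": ["http://pay-in-service:8080"],
          "url_pattern": "/transactions"
        }
      ]
    }
  ]
}
```

Environment-specific values (hosts, limits) would be templated via Flexible Configuration rather than hard-coded as above.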
7.3 Auth Verification at Gateway
| Auth Method | Product | KrakenD Handling |
| --- | --- | --- |
| JWT validation | All (target) | Validate JWT signature, expiry, issuer, audience |
| API key | Sandbox | Validate against key store, inject merchant context |
| RSA signature | Pay-Outs, Remittances (current) | Pass through to backend (gateway validates timestamp freshness) |
| mTLS | Cards | Terminate at Caddy, KrakenD receives forwarded client cert info |
7.4 Rate Limiting Tiers
| Tier | Scope | Default Limit | Burst | Notes |
| --- | --- | --- | --- | --- |
| Global | All merchants | 10,000 req/s | 15,000 | Platform-wide safety limit |
| Per-merchant | Individual merchant | 100 req/s | 200 | Configurable per merchant agreement |
| Per-product | Product line | 5,000 req/s | 7,500 | Pay-Ins, Pay-Outs, Remittances, Cards |
| Per-endpoint | Specific endpoint | Varies | Varies | e.g., payment initiation: 50 req/s per merchant |
| Sandbox | Sandbox environment | 20 req/s per merchant | 30 | Lower limits for testing |
Rate limit responses MUST include:
- X-RateLimit-Limit — maximum requests allowed
- X-RateLimit-Remaining — requests remaining in window
- X-RateLimit-Reset — seconds until window resets
- HTTP 429 status code with a JSON error body
7.5 OpenAPI Validation
- All API endpoints MUST have an OpenAPI 3.1 specification
- KrakenD MUST validate incoming requests against the OpenAPI schema
- Invalid requests MUST be rejected at the gateway (400 Bad Request)
- Request body validation: required fields, types, format constraints
- Query parameter validation: allowed values, types
7.6 Error Response Standardisation
All error responses from KrakenD MUST follow RFC 9457 (Problem Details for HTTP APIs):
```json
{
  "type": "https://api.simpaisa.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "detail": "Merchant has exceeded 100 requests per second",
  "instance": "/v1/pay-ins/transactions",
  "traceId": "abc123-def456-ghi789"
}
```
7.7 High Availability
| Requirement | Standard |
| --- | --- |
| Minimum instances | 3 (one per AZ) |
| Health check | HTTP 200 on /health within 5 seconds |
| Graceful shutdown | Drain connections for 30 seconds before termination |
| Configuration reload | Zero-downtime reload on configuration change |
| Failover | ALB removes unhealthy instances within 30 seconds |
| Availability target | 99.99% (gateway MUST NOT be the bottleneck) |
8. Observability Stack
CloudWatch will NOT be used. The observability stack is built on open standards (OpenTelemetry) with open-source tooling.
8.1 Architecture Overview
```
Services (OTel SDK) → OTel Collector → ┬→ Prometheus (metrics) → Grafana
                                       ├→ Jaeger / Tempo (traces) → Grafana
                                       └→ OpenSearch (logs) → Grafana / OpenSearch Dashboards

PostHog ← (product events from frontend + backend)
```
8.2 OpenTelemetry Collector
The OpenTelemetry Collector is the unified telemetry pipeline. All services MUST send telemetry to the OTel Collector — never directly to backends.
| Aspect | Standard |
| --- | --- |
| Deployment | Agent mode (sidecar or daemonset) + Gateway mode (central) |
| Receivers | OTLP (gRPC and HTTP), Prometheus scrape, Fluent Forward |
| Processors | Batch, memory limiter, attribute enrichment, tail sampling |
| Exporters | Prometheus Remote Write, Jaeger/Tempo OTLP, OpenSearch |
| Configuration | Version-controlled YAML, per-environment |
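A minimal, illustrative Collector configuration matching the table above. Endpoint addresses and limits are assumptions, and exporter names should be checked against the deployed Collector (the OpenSearch exporter ships in the contrib distribution):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
  batch:

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlp/jaeger:
    endpoint: jaeger-collector:4317
  opensearch:
    http:
      endpoint: https://opensearch:9200

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [opensearch]
```

Attribute enrichment and tail sampling would be added as further processors in the gateway-mode Collector; they are omitted here for brevity.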
8.3 Traces: Jaeger (or Grafana Tempo)
| Aspect | Standard |
| --- | --- |
| Tool | Jaeger (evaluate Grafana Tempo as alternative) |
| Storage | OpenSearch (Jaeger backend) or S3 (Tempo) |
| Retention | 30 days hot, 90 days cold |
| Sampling | Head-based: 100% for errors, 10% for success (adjust per traffic) |
| Context propagation | W3C Trace Context (mandatory), B3 (for legacy compatibility) |
Mandatory Trace Spans:
Every payment transaction MUST include the following spans:
| Span | Service | Description |
| --- | --- | --- |
| `gateway.receive` | KrakenD | Request received at gateway |
| `auth.verify` | KrakenD / Auth service | Authentication/authorisation check |
| `payment.initiate` | Payment service | Payment initiation logic |
| `channel.request` | Channel adapter | Request sent to payment channel (Easypaisa, JazzCash, etc.) |
| `channel.response` | Channel adapter | Response received from channel |
| `payment.complete` | Payment service | Transaction finalisation |
| `callback.dispatch` | Callback service | Webhook sent to merchant |
8.4 Metrics: Prometheus + Grafana
| Aspect | Standard |
| --- | --- |
| Collection | Prometheus (via OTel Collector remote write) |
| Visualisation | Grafana |
| Retention | 15 days high-resolution, 1 year downsampled |
| Naming convention | `simpaisa_<product>_<metric>_<unit>` |
Mandatory Metrics:
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `simpaisa_transaction_total` | Counter | product, channel, status, merchant | Total transactions |
| `simpaisa_transaction_duration_seconds` | Histogram | product, channel, merchant | Transaction processing time |
| `simpaisa_transaction_amount_total` | Counter | product, channel, currency | Total transaction value |
| `simpaisa_channel_request_duration_seconds` | Histogram | channel, operation | Time to get response from payment channel |
| `simpaisa_channel_availability` | Gauge | channel | Channel health (1 = up, 0 = down) |
| `simpaisa_gateway_request_total` | Counter | method, path, status | API gateway requests |
| `simpaisa_gateway_latency_seconds` | Histogram | method, path | API gateway response time |
| `simpaisa_error_total` | Counter | product, error_type, severity | Errors by type |
8.5 Logs: OpenSearch with Structured Logging
| Aspect | Standard |
| --- | --- |
| Format | JSON structured logging (mandatory) |
| Transport | OTel Collector → OpenSearch |
| Retention | 90 days hot, 1 year warm, 7 years cold (compliance) |
| Index pattern | `simpaisa-<service>-<environment>-YYYY.MM.DD` |
| ISM Policy | Hot → Warm at 7 days, Warm → Cold at 90 days, Delete at 7 years |
Mandatory Log Fields:
```json
{
  "timestamp": "2026-04-03T10:30:00.000Z",
  "level": "INFO",
  "service": "pay-in-service",
  "traceId": "abc123",
  "spanId": "def456",
  "merchantId": "M12345",
  "transactionId": "TXN-789",
  "channel": "easypaisa",
  "message": "Transaction initiated",
  "environment": "prod"
}
```
Sensitive Data Rules:
- NEVER log card numbers, CVV, PINs, or full account numbers
- Mask mobile numbers: 03XX-XXXX-1234 (show last 4 digits only)
- Mask CNICs: XXXXX-XXXXXXX-3 (show last digit only)
- Log transaction IDs, merchant IDs, channel references — these are required for tracing
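The masking rules can be captured in small helpers so they are applied uniformly at the logging boundary. A Go sketch; the function names are illustrative, and production code should validate input formats before masking:

```go
package main

import "fmt"

// MaskMobile keeps only the last 4 digits of a mobile number,
// producing the 03XX-XXXX-1234 format required above.
func MaskMobile(msisdn string) string {
	if len(msisdn) < 4 {
		return "03XX-XXXX-XXXX"
	}
	return "03XX-XXXX-" + msisdn[len(msisdn)-4:]
}

// MaskCNIC keeps only the final check digit of a CNIC,
// producing the XXXXX-XXXXXXX-3 format required above.
func MaskCNIC(cnic string) string {
	if cnic == "" {
		return "XXXXX-XXXXXXX-X"
	}
	return "XXXXX-XXXXXXX-" + cnic[len(cnic)-1:]
}

func main() {
	fmt.Println(MaskMobile("03001234567"))   // 03XX-XXXX-4567
	fmt.Println(MaskCNIC("12345-1234567-3")) // XXXXX-XXXXXXX-3
}
```

Card numbers, CVVs, and PINs get no masking helper by design: they must never reach the logging layer at all.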
8.6 Alerting
| Aspect | Standard |
| --- | --- |
| Tool | Grafana Alerting (evaluate PagerDuty/OpsGenie for escalation) |
| Channels | Slack (info/warning), SMS/call (critical), email (summary) |
| Escalation | P1: immediate call → CDO + on-call engineer; P2: Slack + 15min response; P3: next business day |
Alert Definitions:
| Alert | Severity | Condition | Action |
| --- | --- | --- | --- |
| Transaction success rate drop | P1 | Success rate < 95% for any channel over 5 minutes | Immediate investigation |
| Payment channel down | P1 | Channel health check fails for 3 consecutive checks | Failover / merchant notification |
| API latency spike | P2 | P99 latency > 2s for 5 minutes | Scale out / investigate |
| Error rate increase | P2 | Error rate > 5% over 5 minutes | Investigate |
| Disk space critical | P2 | Any data store > 85% disk usage | Expand / clean up |
| Certificate expiry | P3 | Any certificate expiring within 14 days | Renew |
| Deployment failed | P2 | Deployment health check fails | Automatic rollback |
8.7 Dashboards
| Dashboard | Audience | Key Metrics |
| --- | --- | --- |
| Executive Overview | CDO, leadership | Total transactions, value, success rate, revenue by product |
| Per-Product | Product owners | Transaction volume, success/failure rates, channel mix, latency |
| Per-Channel | Operations | Channel availability, response times, error rates, queue depth |
| Per-Merchant | Support, account managers | Merchant transaction volume, errors, rate limit hits |
| Infrastructure | Engineering | CPU, memory, disk, network, scaling events |
| Security | Security team | WAF blocks, auth failures, suspicious patterns, rate limit events |
| SLA Monitoring | Operations, leadership | P95/P99 latency per endpoint, uptime percentages |
8.8 Transaction Tracing
End-to-end transaction tracing is the highest priority observability feature. Every merchant request MUST be traceable from Cloudflare edge → KrakenD → service → payment channel → callback.
| Requirement | Standard |
| --- | --- |
| Trace ID | Generated at Cloudflare edge (Worker), propagated through all services |
| Correlation | Trace ID MUST appear in logs, metrics labels, and traces |
| Merchant visibility | Trace ID returned in API response headers (X-Trace-Id) |
| Support lookup | Support team can search by trace ID, transaction ID, or merchant reference |
| Channel correlation | Map Simpaisa trace ID to channel reference number |
8.9 PostHog for Product Analytics
| Aspect | Standard |
| --- | --- |
| Deployment | Self-hosted (data residency compliance) or cloud (evaluate) |
| Events | Merchant portal interactions, developer portal usage, API adoption |
| Feature flags | PostHog feature flags for gradual rollout |
| Session replay | Enabled for merchant portal (with PII redaction) |
| Funnels | Merchant onboarding, first transaction, product adoption |
9. Identity & Access (ControlPlane.com)
9.1 Overview
ControlPlane.com provides Universal Cloud Identity, enabling workloads to consume cloud resources from multiple providers without storing credentials. It employs a zero-trust architecture where every access request is fully authenticated and authorised.
9.2 Centralised Identity Management
| Aspect | Current State | Target State |
| --- | --- | --- |
| Human access | AWS IAM users + console | ControlPlane.com SSO → cloud provider roles |
| Service identity | AWS IAM roles (per-service) | ControlPlane.com workload identity |
| Merchant identity | Custom auth (JSESSIONID / RSA) | ControlPlane.com + KrakenD JWT validation |
| Audit trail | CloudTrail (AWS only) | ControlPlane.com tamper-proof audit trail + CloudTrail |
9.3 Service-to-Service Authentication
| Requirement | Standard |
| --- | --- |
| Protocol | mTLS (mutual TLS) via Caddy |
| Certificate management | ControlPlane.com or automated CA (evaluate) |
| Rotation | Automatic, maximum 24-hour certificate lifetime |
| Verification | Both client and server certificates validated |
| No shared secrets | Services MUST NOT use shared API keys for inter-service communication |
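A hedged per-service Caddyfile sketch for the mTLS requirement above. Hostnames, ports, and certificate paths are assumptions, and the exact `client_auth` options should be confirmed against the deployed Caddy version:

```
# Per-service Caddy instance: terminates mTLS, proxies to the local service.
payin.internal.simpaisa.com {
	tls /etc/caddy/certs/server.pem /etc/caddy/certs/server.key {
		client_auth {
			mode require_and_verify
			trusted_ca_cert_file /etc/caddy/certs/internal-ca.pem
		}
	}
	reverse_proxy 127.0.0.1:8080
}
```

With 24-hour certificate lifetimes, the certificate and key files referenced here must be rotated by automation and Caddy reloaded (or configured via its admin API) without downtime.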
9.4 Merchant Identity and RBAC
| Role | Permissions | Description |
| --- | --- | --- |
| Merchant Admin | Full access to merchant's resources | Account owner, manages users and settings |
| Merchant Operator | Initiate transactions, view reports | Day-to-day operational access |
| Merchant Viewer | Read-only access | Reporting and audit access |
| Merchant Developer | Sandbox access, API key management | Integration and testing |
9.5 Integration with KrakenD
```
Merchant Request → Cloudflare → KrakenD → ControlPlane.com (token validation)
                                  ↓
                     Valid JWT with claims:
                       - merchant_id
                       - roles[]
                       - products[]
                       - rate_limit_tier
                                  ↓
                  Backend Service (receives validated claims as headers)
```
9.6 Policy-as-Code
- Access policies MUST be defined as code and version-controlled
- Policy changes MUST go through the same review process as code changes
- Policies MUST be testable (unit tests for policy logic)
- ControlPlane.com policies define: who can access what resources, from which networks, at which times
10. Data Infrastructure
10.1 Overview
| Technology | Role | Current State | Target State |
| --- | --- | --- | --- |
| RDS MySQL | Primary transactional database | Shared single instance, Multi-AZ | Per-service instances, read replicas, automated backups |
| SurrealDB | New service database | Not deployed | Clustered deployment for new Go services |
| ElastiCache Redis | Caching and session store | Single shared cluster | Cluster mode enabled, per-service namespacing |
| NSQ | Message queue | Not deployed (Kafka currently) | Replace Kafka for inter-service messaging |
| Meilisearch | Merchant-facing search | Not deployed | Merchant/transaction search in portal |
| OpenSearch | Log storage and search | Not deployed | Log aggregation, Jaeger trace storage |
10.2 RDS MySQL (Existing)
| Aspect | Current | Target | Priority |
| --- | --- | --- | --- |
| Instances | 1 shared instance | Per-service instances (minimum: separate Pay-Ins, Pay-Outs, Remittances, Cards) | CRITICAL |
| Multi-AZ | Yes | Yes (maintained) | — |
| Read replicas | None | 1 per service instance (reporting queries) | HIGH |
| Backups | TBC | Automated daily, 35-day retention, point-in-time recovery | CRITICAL |
| Encryption at rest | TBC | AES-256 (AWS KMS managed key) | CRITICAL |
| Encryption in transit | TBC | TLS mandatory for all connections | CRITICAL |
| Version | TBC | MySQL 8.0+ (latest stable) | MEDIUM |
| Monitoring | CloudWatch | Prometheus exporter → Grafana | HIGH |
| Slow query log | TBC | Enabled, threshold 1s, exported to OpenSearch | HIGH |
10.3 SurrealDB (New Services)
| Aspect | Standard |
| --- | --- |
| Deployment | Clustered (minimum 3 nodes for Prod) |
| Storage backend | TiKV (distributed) or RocksDB (single-node for Dev/Test) |
| Backup | Automated daily export, stored in R2 |
| Access | Namespace and database per service, scoped authentication |
| Schema | Schemaful tables for payment data, schemafree for flexible data |
| Monitoring | Prometheus metrics endpoint → Grafana |
10.4 Redis (ElastiCache)
| Aspect | Current | Target | Priority |
| --- | --- | --- | --- |
| Mode | Single cluster, no cluster mode | Cluster mode enabled | HIGH |
| Failover | Multi-AZ with automatic failover | Maintained | — |
| Namespacing | None (shared keyspace) | Prefix per service: `payin:`, `payout:`, `remit:`, `cards:` | HIGH |
| Encryption | TBC | In-transit (TLS) and at-rest encryption | HIGH |
| Eviction | TBC | `allkeys-lru` for caches, `noeviction` for session stores | MEDIUM |
| Monitoring | CloudWatch | Prometheus exporter → Grafana | HIGH |
| Backup | TBC | Daily snapshots, 7-day retention | MEDIUM |
10.5 NSQ (Messaging)
| Aspect | Standard |
| --- | --- |
| Deployment | nsqlookupd (3 instances) + nsqd (per application host) |
| Topics | One topic per event type: payment.initiated, payment.completed, payment.failed, callback.pending, etc. |
| Channels | One channel per consumer group (e.g., payment.completed#notification, payment.completed#reconciliation) |
| Message retention | In-memory with disk overflow; messages purged after successful consumption |
| Dead letter | Failed messages after 5 retries → dead letter topic for manual investigation |
| Monitoring | nsqadmin + Prometheus exporter → Grafana |
| Ordering | Ordering not guaranteed (NSQ has no Kafka-style partitions); use idempotency keys for exactly-once processing semantics |
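Because NSQ delivers at-least-once with no ordering, consumers dedupe on an idempotency key carried in each message. A minimal in-memory Go sketch of that handler pattern; a real deployment would back the seen-set with Redis and a TTL rather than a map:

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentHandler skips messages whose idempotency key has already been
// processed, making redeliveries harmless.
type IdempotentHandler struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotentHandler() *IdempotentHandler {
	return &IdempotentHandler{seen: make(map[string]bool)}
}

// Handle runs process once per key. It returns true if the message was
// processed and false if it was a duplicate redelivery.
func (h *IdempotentHandler) Handle(idempotencyKey string, process func()) bool {
	h.mu.Lock()
	if h.seen[idempotencyKey] {
		h.mu.Unlock()
		return false
	}
	h.seen[idempotencyKey] = true
	h.mu.Unlock()
	process()
	return true
}

func main() {
	h := NewIdempotentHandler()
	count := 0
	h.Handle("payment.completed:TXN-789", func() { count++ })
	h.Handle("payment.completed:TXN-789", func() { count++ }) // redelivery: skipped
	fmt.Println(count) // 1
}
```

The event name and key format above are illustrative; the important property is that the key is stable across redeliveries of the same logical event.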
10.6 Meilisearch (Merchant-Facing Search)
| Aspect |
Standard |
| Purpose |
Fast search in merchant portal (transactions, customers, reports) |
| Deployment |
Single instance per environment (evaluate clustering for Prod) |
| Indices |
transactions, merchants, customers, reports |
| Refresh strategy |
Near-real-time: primary write to MySQL/SurrealDB, async index update via NSQ |
| Security |
API key per merchant, tenant isolation via filterable attributes |
| Monitoring |
Health check endpoint + Prometheus metrics |
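Tenant isolation via filterable attributes could look like the following; the attribute names are assumptions, and the endpoints follow the Meilisearch REST API but should be verified against the current docs.

```
# Mark the tenant attribute as filterable (illustrative attribute names):
PATCH /indexes/transactions/settings
{ "filterableAttributes": ["merchant_id", "status", "created_at"] }

# Every merchant-portal query then carries a tenant filter, enforced
# server-side via a scoped API key or tenant token:
POST /indexes/transactions/search
{ "q": "refund", "filter": "merchant_id = m_123" }
```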
10.7 OpenSearch (Logs and Traces)
| Aspect |
Standard |
| Deployment |
3 master nodes + 3 data nodes (Prod minimum) |
| Indices |
simpaisa-logs-*, simpaisa-jaeger-*, simpaisa-audit-* |
| ISM Policies |
Hot (7 days, SSD) → Warm (90 days, HDD) → Cold (7 years, S3/R2) → Delete |
| Retention |
Logs: 7 years (compliance), Traces: 90 days, Audit: 10 years |
| Security |
OpenSearch Security plugin, RBAC per index, TLS |
| Backup |
Snapshot to S3/R2, daily |
| Monitoring |
Built-in performance analyser + Prometheus exporter |
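The ISM lifecycle above might be expressed roughly as follows (a sketch only; the action names and the snapshot repository name `archive-repo` are assumptions to verify against the OpenSearch ISM documentation):

```json
{
  "policy": {
    "description": "Lifecycle sketch for simpaisa-logs-* mirroring the table above",
    "default_state": "hot",
    "states": [
      { "name": "hot",
        "actions": [],
        "transitions": [{ "state_name": "warm", "conditions": { "min_index_age": "7d" } }] },
      { "name": "warm",
        "actions": [{ "read_only": {} }, { "replica_count": { "number_of_replicas": 1 } }],
        "transitions": [{ "state_name": "cold", "conditions": { "min_index_age": "90d" } }] },
      { "name": "cold",
        "actions": [{ "snapshot": { "repository": "archive-repo", "snapshot": "logs" } }],
        "transitions": [{ "state_name": "delete", "conditions": { "min_index_age": "2555d" } }] },
      { "name": "delete",
        "actions": [{ "delete": {} }] }
    ],
    "ism_template": [{ "index_patterns": ["simpaisa-logs-*"], "priority": 100 }]
  }
}
```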
11. Secret Management
11.1 Current State
| Aspect |
Detail |
| Tool |
AWS Systems Manager Parameter Store (SecureString) |
| Encryption |
AWS KMS managed keys |
| Access |
IAM role-based |
| Rotation |
Manual |
| Audit |
CloudTrail |
11.2 Target State
| Aspect |
Detail |
Priority |
| Tool |
Evaluate: ControlPlane.com secrets, HashiCorp Vault, AWS Secrets Manager |
HIGH |
| Rotation |
Automated rotation for all secrets; maximum 90-day lifetime |
HIGH |
| Access |
Workload identity (no static credentials); secrets injected at runtime |
HIGH |
| Audit |
All secret access logged and alerted on anomalous patterns |
HIGH |
11.3 Secret Policies
| Policy |
Requirement |
| No secrets in code |
NEVER commit secrets, tokens, keys, or passwords to source control |
| No secrets in config files |
Configuration files MUST reference secret paths, not values |
| No secrets in environment variables |
Prefer mounted secrets or secret injection; env vars are visible in process listings |
| No secrets in container images |
Build-time secrets MUST use multi-stage builds with secret mounts |
| Secret scanning in CI |
Every commit MUST be scanned for secret patterns (pre-commit hook + CI step) |
| Rotation on compromise |
If a secret is suspected compromised, rotate immediately (< 1 hour) |
| Shared secrets |
NEVER share secrets between environments; each environment has its own |
11.4 Secret Categories and Rotation
| Category |
Examples |
Max Lifetime |
Rotation Method |
| Database credentials |
MySQL, SurrealDB, Redis passwords |
90 days |
Automated (dual-user pattern) |
| API keys |
Payment channel API keys, merchant API keys |
365 days |
Merchant-initiated or scheduled |
| TLS certificates |
Service certificates, mTLS certs |
90 days (target: 24 hours via ControlPlane) |
Automated |
| Signing keys |
RSA keys for Pay-Outs/Remittances |
365 days |
Coordinated rotation with merchants |
| OAuth tokens |
Service-to-service tokens |
1 hour |
Automatic refresh |
| Encryption keys |
KMS keys, data encryption keys |
Annual rotation |
AWS KMS automatic rotation |
12. Disaster Recovery & Business Continuity
12.1 Service Tier Classification
| Tier |
Services |
RPO |
RTO |
Description |
| Tier 1 — Payment Critical |
Pay-In processing, Pay-Out execution, Card auth, Remittance processing |
0 (zero data loss) |
< 5 minutes |
Direct revenue impact; customer-facing payment flows |
| Tier 2 — Merchant Facing |
API Gateway, Merchant Portal, Sandbox |
< 5 minutes |
< 15 minutes |
Merchant experience; no direct payment loss |
| Tier 3 — Operational |
Reporting, reconciliation, settlement, back-office |
< 1 hour |
< 4 hours |
Internal operations; deferred processing acceptable |
| Tier 4 — Supporting |
Developer portal, corporate website, analytics |
< 24 hours |
< 24 hours |
No operational impact |
12.2 Backup Strategy
| Resource |
Backup Method |
Frequency |
Retention |
Testing |
| RDS MySQL |
Automated snapshots + binlog replication |
Continuous (point-in-time) |
35 days |
Monthly restore test |
| SurrealDB |
Export + snapshot |
Daily |
35 days |
Monthly restore test |
| Redis |
AOF + RDB snapshots |
Hourly (RDB), continuous (AOF) |
7 days |
Weekly restore test |
| OpenSearch |
Snapshot to S3/R2 |
Daily |
90 days (snapshots) |
Quarterly restore test |
| KrakenD config |
Git repository |
Every change |
Indefinite (Git history) |
On every deployment |
| IaC state |
Remote state backend + versioning |
Every change |
Indefinite |
On every deployment |
| Secrets |
AWS backup + encrypted export |
Daily |
35 days |
Quarterly |
| R2/S3 objects |
Cross-region replication |
Continuous |
Per retention policy |
Quarterly |
12.3 Current: Multi-AZ
| Component |
Multi-AZ Status |
Failover |
| EC2/ASG |
Yes (instances spread across AZs) |
Automatic (ASG replaces failed instances) |
| ALB |
Yes (cross-AZ load balancing) |
Automatic |
| RDS |
Yes (standby in different AZ) |
Automatic failover (< 2 minutes) |
| ElastiCache |
Yes (replica in different AZ) |
Automatic failover |
| NAT Gateway |
One per AZ |
Route table failover needed |
12.4 Target: Multi-Region
| Phase |
Scope |
Timeline |
| Phase 1 |
Document current DR posture, define RPO/RTO, create runbooks |
Q2 2026 |
| Phase 2 |
Cross-region backup replication (S3/R2), read replicas in secondary region |
Q3 2026 |
| Phase 3 |
Active-passive multi-region for Tier 1 services |
Q4 2026 |
| Phase 4 |
Active-active multi-region (evaluate need based on jurisdiction requirements) |
2027 |
12.5 Failover Procedures
| Scenario |
Detection |
Response |
Recovery |
| Single instance failure |
ASG health check (30s) |
ASG launches replacement |
Automatic (< 5 min) |
| AZ failure |
ALB health checks + CloudWatch |
Traffic shifts to healthy AZs |
Automatic (< 5 min) |
| RDS primary failure |
RDS event + monitoring alert |
Automatic failover to standby |
Automatic (< 2 min) |
| Redis primary failure |
ElastiCache failover |
Automatic promotion of replica |
Automatic (< 1 min) |
| Payment channel outage |
Health check failure (3 consecutive) |
Disable channel, notify merchants |
Manual channel re-enable after verification |
| Region failure |
Multi-region health check |
DNS failover to secondary region |
Manual (Phase 1) → Automatic (Phase 3) |
| Cloudflare incident |
External monitoring |
Evaluate: bypass to ALB direct (emergency only) |
Manual |
12.6 DR Testing Cadence
| Test Type |
Frequency |
Scope |
Owner |
| Backup restore |
Monthly |
Restore latest backup to Test environment |
Engineering |
| AZ failover |
Quarterly |
Simulate AZ failure, verify continued operation |
Engineering + Operations |
| Full DR exercise |
Twice per year |
Full failover simulation, measure actual RTO/RPO |
CDO + Engineering |
| Tabletop exercise |
Quarterly |
Walk through failure scenarios with all stakeholders |
CDO |
| Chaos engineering |
Monthly (target) |
Controlled failure injection in Test/Prod |
Engineering |
12.7 Runbooks
The following runbooks MUST be created, tested, and maintained:
| Runbook |
Status |
| RDS failover procedure |
TO CREATE |
| Redis cluster failover |
TO CREATE |
| Payment channel outage response |
TO CREATE |
| Full region failover |
TO CREATE |
| KrakenD configuration rollback |
TO CREATE |
| Cloudflare bypass (emergency) |
TO CREATE |
| Data corruption recovery |
TO CREATE |
| DDoS attack response |
TO CREATE |
| Certificate emergency rotation |
TO CREATE |
| Merchant communication during outage |
TO CREATE |
13. Compliance Infrastructure Requirements
This section documents the infrastructure controls required by regulators in each jurisdiction where Simpaisa operates. Compliance is not optional — failure to meet these requirements risks licence revocation.
Note: Regulatory requirements are subject to change. This section MUST be reviewed quarterly and updated when new circulars or regulations are issued.
13.1 Pakistan — State Bank of Pakistan (SBP)
Governing Legislation:
- Payment Systems and Electronic Fund Transfers Act, 2007 (PSEFT Act)
- Rules for Payment System Operators and Payment Service Providers, 2014 (PSO/PSP Rules)
- Electronic Fund Transfer Regulations
- Personal Data Protection Bill, 2023 (pending enactment — draft approved by Federal Cabinet)
Infrastructure Requirements:
| Requirement |
Regulation Source |
Infrastructure Control |
Current Status |
Gap |
Priority |
| Data localisation |
PSO/PSP Rules 2014, PDPB 2023 (draft) |
Processing systems MUST be located within Pakistan; critical personal data stored on servers in Pakistan |
ASSESS — Verify all processing on Pakistan-based AWS region or local DC |
TBC |
CRITICAL |
| Technology platform approval |
PSO/PSP Rules 2014 |
Prior SBP approval required for changes to technology platforms |
ASSESS — Determine if current changes require approval |
TBC |
CRITICAL |
| Transaction record retention |
PSO/PSP Rules 2014 |
All transaction records retained for minimum 5 years (10 years recommended) |
ASSESS |
Log retention policy needed |
HIGH |
| Information security |
PSO/PSP Rules 2014 |
Appropriate measures for security, integrity, and confidentiality of financial transactions |
PARTIAL — AWS infrastructure sound, but gaps in observability and access control |
Strengthen controls |
HIGH |
| Risk management |
PSO/PSP Rules 2014 |
Documented risk management framework for payment operations |
ASSESS |
Documentation needed |
HIGH |
| Audit trail |
PSO/PSP Rules 2014 |
Complete audit trail of all transactions and system changes |
PARTIAL — Transaction logs exist but no centralised audit system |
Implement centralised audit logging |
HIGH |
| Business continuity |
PSO/PSP Rules 2014 |
Documented BCP/DR plan, tested regularly |
GAP — No DR documentation |
Create and test DR plan |
CRITICAL |
| Incident reporting |
SBP circulars |
Timely reporting of security incidents and system outages to SBP |
ASSESS |
Formalise incident reporting procedure |
HIGH |
| AML/CFT systems |
PSEFT Act, FATF requirements |
Transaction monitoring, sanctions screening, STR filing |
ASSESS |
Verify integration with FMU reporting |
HIGH |
Pakistan-Specific Notes:
- AWS does not have a region in Pakistan. Simpaisa MUST verify with SBP whether AWS ap-south-1 (Mumbai) is acceptable, or whether co-location in a Pakistan-based data centre is required for certain data categories
- The Personal Data Protection Bill 2023 introduces strict data localisation once enacted — "critical personal data shall only be processed in servers within Pakistan"
- SBP requires prior approval for changes to technology platforms — the migration to ControlPlane.com, KrakenD, and other new technologies may require SBP notification/approval
13.2 Bangladesh — Bangladesh Bank
Governing Legislation:
- Payment and Settlement Systems Act, 2024
- Mobile Financial Services Regulations, 2022
- Bangladesh Bank Payment Systems Department circulars
- Bangladesh Financial Intelligence Unit (BFIU) guidelines
Infrastructure Requirements:
| Requirement |
Regulation Source |
Infrastructure Control |
Current Status |
Gap |
Priority |
| Data localisation (mandatory) |
MFS Regulations 2022, PSS Act 2024 |
IT infrastructure and data centres MUST be located within Bangladesh; data localisation is mandatory |
ASSESS — Verify hosting for Bangladesh operations |
If not locally hosted, establish local DC or partner |
CRITICAL |
| On-site inspection readiness |
MFS Regulations 2022 |
Bangladesh Bank conducts on-site inspections of IT infrastructure after setup |
ASSESS |
Ensure infrastructure meets inspection standards |
CRITICAL |
| Biometric e-KYC |
BFIU guidelines |
Electronic KYC with biometric verification required |
ASSESS |
Integration with national ID system needed |
HIGH |
| AML/CFT compliance |
BFIU guidelines |
Suspicious Transaction Report filing, transaction monitoring |
ASSESS |
Verify STR filing integration |
HIGH |
| Two-phase licensing |
MFS Regulations 2022 |
Phase 1: NOC to set up infrastructure; Phase 2: licence to operate |
ASSESS — Verify current licence status |
Follow licensing process |
HIGH |
| Transaction reporting |
Bangladesh Bank circulars |
Regular transaction reports to Bangladesh Bank PSD |
ASSESS |
Automated reporting needed |
HIGH |
| Capital adequacy |
MFS Regulations 2022 |
Minimum paid-up capital BDT 450 million for MFS (bank-led model) |
ASSESS |
Verify capital structure |
MEDIUM |
Bangladesh-Specific Notes:
- Data localisation is non-negotiable in Bangladesh — on-site infrastructure inspection is conducted by Bangladesh Bank
- Two-phase licensing means infrastructure MUST be built before operational licence is granted
- BFIU compliance is separate from Bangladesh Bank payment licensing and adds additional infrastructure requirements for transaction monitoring
13.3 Nepal — Nepal Rastra Bank (NRB)
Governing Legislation:
- Payment and Settlement Act, 2019 (2075 BS)
- NRB PSO/PSP licensing directives
- Data Center and Cloud Services (Operation and Management) Directive, 2081 (2024)
- NRB Cyber Resilience Guidelines
- NRB IT Guidelines
Infrastructure Requirements:
| Requirement |
Regulation Source |
Infrastructure Control |
Current Status |
Gap |
Priority |
| Data centre approval |
Data Center Directive 2081 |
Data MUST be stored in centres approved by Nepal's IT Department; centres MUST comply with the Directive |
ASSESS |
Identify approved data centres in Nepal |
CRITICAL |
| PCI DSS compliance |
NRB IT Guidelines |
Licensed institutions MUST adhere to PCI DSS standards |
ASSESS |
PCI DSS certification required |
CRITICAL |
| ISO 27000 certification |
NRB IT Guidelines |
Financial institutions involved in payment processing require ISO 27001 certification |
ASSESS |
ISO 27001 audit and certification needed |
HIGH |
| Cyber resilience |
NRB Cyber Resilience Guidelines |
Governance, cyber risk culture, training, resilience testing, recovery planning |
ASSESS |
Formalise cyber resilience programme |
HIGH |
| EMV compliance |
NRB IT Guidelines |
EMV and EMV Contactless standards compliance for card processing |
ASSESS |
Verify EMV compliance for Cards product |
HIGH |
| Licensing requirements |
Payment and Settlement Act 2019 |
Prior NRB approval/licence for PSO/PSP operations; 12-18 month process |
ASSESS |
Verify licence status |
CRITICAL |
| Capital requirements |
NRB directives |
NPR 150M (domestic PSP) / NPR 250M (foreign investment PSP) |
ASSESS |
Verify capital compliance |
MEDIUM |
| Technical assessment |
NRB licensing |
NRB assesses system security, reliability, and technical standards compliance |
ASSESS |
Prepare for technical assessment |
HIGH |
Nepal-Specific Notes:
- Nepal has explicit data centre approval requirements — data MUST reside in government-approved centres within Nepal
- PCI DSS and ISO 27001 are explicitly mandated (not merely recommended) for payment processors
- The 12-18 month licensing timeline means infrastructure investment precedes revenue
13.4 Iraq — Central Bank of Iraq (CBI)
Governing Legislation:
- Electronic Payment Services Regulation, 2024 (replaced 2014 framework)
- Central Bank of Iraq circulars on digital banking and payment systems
- AML/CFT regulations (aligned with FATF recommendations)
Infrastructure Requirements:
| Requirement |
Regulation Source |
Infrastructure Control |
Current Status |
Gap |
Priority |
| CBI licensing |
Electronic Payment Services Regulation 2024 |
Licence required from CBI for electronic payment services; 10-year licence validity |
ASSESS |
Verify licence status |
CRITICAL |
| Minimum capital |
Electronic Payment Services Regulation 2024 |
Minimum IQD 10 billion company capital |
ASSESS |
Verify capital compliance |
HIGH |
| Feasibility study |
Electronic Payment Services Regulation 2024 |
3-year feasibility study required covering: economic projections, technical infrastructure, information security, AML systems, dispute resolution |
ASSESS |
Prepare or update feasibility study |
HIGH |
| 5-year record retention |
Electronic Payment Services Regulation 2024 |
All electronic payment transactions and related data retained for minimum 5 years |
ASSESS |
Implement 5-year retention policy |
HIGH |
| Cybersecurity infrastructure |
Electronic Payment Services Regulation 2024, CBI circulars |
Advanced cybersecurity measures to safeguard banking systems; compliance with international standards |
ASSESS |
Cybersecurity posture assessment needed |
HIGH |
| AML/CFT systems |
Electronic Payment Services Regulation 2024 |
Sanctions list screening, transaction monitoring, daily transaction reporting |
ASSESS |
Verify AML system integration |
CRITICAL |
| Business continuity |
CBI circulars |
Business continuity during crises; DR planning |
ASSESS |
DR plan required |
HIGH |
| ISO 20022 alignment |
CBI modernisation programme |
Payment messaging aligned with ISO 20022 standard |
ASSESS |
Evaluate ISO 20022 readiness |
MEDIUM |
Iraq-Specific Notes:
- The 2024 regulation is a significant upgrade from the 2014 framework — verify full compliance with the new requirements
- IQD 10 billion minimum capital (~USD 7.6M) is a substantial requirement
- The 3-year feasibility study requirement includes detailed technical infrastructure and security documentation
- Iraq's financial system is heavily influenced by US sanctions compliance (OFAC) — additional sanctions screening infrastructure may be required
13.5 PCI DSS v4.0.1 (Cards Product)
Standard: PCI DSS v4.0.1 (mandatory as of 31 March 2025)
PCI DSS applies specifically to the Cards product (Visa/Mastercard acquiring). All systems that store, process, or transmit cardholder data are in scope.
| Requirement Area |
PCI DSS Requirement |
Infrastructure Control |
Current Status |
Gap |
Priority |
| Network segmentation |
Req 1: Install and maintain network security controls |
CDE (Cardholder Data Environment) MUST be isolated in a dedicated subnet with strict firewall rules; micro-segmentation recommended |
ASSESS |
Verify CDE isolation |
CRITICAL |
| Secure configuration |
Req 2: Apply secure configurations to all system components |
Hardened OS images, no default credentials, unnecessary services disabled |
ASSESS |
Configuration baseline needed |
HIGH |
| Data protection (stored) |
Req 3: Protect stored account data |
PAN encrypted with AES-256; hash or truncate where possible; encryption keys managed separately from data |
ASSESS |
Verify encryption implementation |
CRITICAL |
| Data protection (transit) |
Req 4: Protect cardholder data with strong cryptography during transmission |
TLS 1.2+ for all cardholder data transmission; no SSL or early TLS |
PARTIAL — mTLS for Cards product |
Verify all transmission paths |
CRITICAL |
| Malware protection |
Req 5: Protect all systems and networks from malicious software |
Anti-malware on all CDE systems; regular scanning |
ASSESS |
Deploy and monitor |
HIGH |
| Secure development |
Req 6: Develop and maintain secure systems and software |
Secure coding practices, vulnerability patching within 30 days (critical) |
ASSESS |
SDLC security review needed |
HIGH |
| Access control |
Req 7 & 8: Restrict access; identify users and authenticate |
MFA mandatory for ALL CDE access (PCI DSS 4.0 requirement); role-based access; unique IDs |
ASSESS |
Implement MFA for all CDE access |
CRITICAL |
| Physical security |
Req 9: Restrict physical access to cardholder data |
Physical access controls for CDE infrastructure (if on-premise) |
N/A (cloud) |
Cloud provider responsibility; verify AWS compliance |
MEDIUM |
| Logging and monitoring |
Req 10: Log and monitor all access to system components and cardholder data |
All CDE access logged; logs tamper-evident; reviewed daily; retained 12 months (3 months immediately accessible) |
ASSESS |
Implement comprehensive CDE logging |
CRITICAL |
| Vulnerability management |
Req 11: Test security of systems and networks regularly |
Internal vulnerability scan quarterly; external ASV scan quarterly; penetration test annually; segmentation test every six months |
ASSESS |
Establish scanning programme |
CRITICAL |
| Organisational policies |
Req 12: Support information security with organisational policies and programmes |
Security policy, risk assessment, incident response plan, security awareness training |
ASSESS |
Formalise security programme |
HIGH |
PCI DSS 4.0 New Requirements (Mandatory from March 2025):
| New Requirement |
Description |
Infrastructure Impact |
| Targeted risk analysis |
Customised approach for each requirement based on risk |
Risk analysis documentation for each CDE control |
| MFA everywhere |
MFA for ALL access to CDE (not just remote) |
Deploy MFA for console, SSH, application access to CDE |
| Authenticated vulnerability scanning |
Internal scans must use authenticated scanning |
Scanning tools need credentials for CDE systems |
| Automated log review |
Automated mechanisms to detect security events |
SIEM/OpenSearch with automated alerting rules for CDE |
| Web application firewall |
WAF or equivalent for public-facing web applications |
Cloudflare WAF / KrakenD for card payment endpoints |
| Script management |
Inventory and integrity of payment page scripts |
CSP headers, SRI, script inventory for card entry pages |
| Enhanced encryption |
Disk-level encryption alone is insufficient |
Application-level encryption for stored PAN |
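For the script management requirement, Subresource Integrity plus a CSP header might look like this sketch for card entry pages; the CDN host and digest are placeholders, not real values.

```html
<!-- SRI: the browser rejects the script if its content no longer matches
     the declared hash. The digest below is a placeholder. -->
<script
  src="https://cdn.example.com/card-form.js"
  integrity="sha384-REPLACE_WITH_REAL_DIGEST"
  crossorigin="anonymous"></script>

<!-- Paired with a CSP response header restricting script sources:
     Content-Security-Policy: script-src 'self' https://cdn.example.com -->
```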
PCI DSS Scoping Notes:
- CDE MUST be clearly defined and documented
- All systems connected to or that could impact the CDE are in scope
- Network segmentation reduces scope — strongly recommended
- If Cloudflare or KrakenD processes card data, it is in PCI DSS scope
- Annual PCI DSS assessment (SAQ or ROC depending on transaction volume)
13.6 Compliance Summary Matrix
| Jurisdiction |
Data Localisation |
Incident Reporting SLA |
Record Retention |
Licensing Status |
PCI DSS Required |
| Pakistan |
Required (processing in-country; PDPB 2023 pending) |
TBC (SBP circulars) |
5+ years |
VERIFY |
Yes (Cards) |
| Bangladesh |
Mandatory (DC inspection by Bangladesh Bank) |
TBC |
TBC |
VERIFY |
TBC |
| Nepal |
Mandatory (govt-approved DC only) |
TBC |
TBC |
VERIFY |
Mandatory (NRB directive) |
| Iraq |
TBC (new 2024 regulation) |
TBC |
5 years (minimum) |
VERIFY |
TBC |
| PCI DSS |
N/A |
72 hours (breach notification) |
12 months (3 months immediately accessible) |
N/A |
Yes (Cards) |
13.7 Prioritised Compliance Actions
| Priority |
Action |
Jurisdictions |
Timeline |
| 1 |
Verify all current licence and authorisation statuses |
All |
Immediate |
| 2 |
Data localisation assessment — where is data stored/processed for each jurisdiction? |
PK, BD, NP |
Q2 2026 |
| 3 |
PCI DSS v4.0.1 gap assessment for Cards product |
Global |
Q2 2026 |
| 4 |
Implement 2-hour incident reporting capability (best practice across all markets) |
All |
Q2 2026 |
| 5 |
Formalise record retention policies meeting all jurisdictional minimums |
All |
Q2 2026 |
| 6 |
DR/BCP documentation and testing |
All (regulatory requirement in most jurisdictions) |
Q2-Q3 2026 |
| 7 |
AML/CFT system verification across all jurisdictions |
All |
Q3 2026 |
| 8 |
ISO 27001 certification (required for Nepal, beneficial for all) |
NP (mandatory), all |
Q3-Q4 2026 |
| 9 |
Prepare for Pakistan PDPB enactment |
PK |
Q3 2026 |
| 10 |
Iraq 2024 regulation full compliance assessment |
IQ |
Q3 2026 |
14. Infrastructure as Code
14.1 Tool Selection
| Tool |
Pros |
Cons |
Recommendation |
| Terraform |
Industry standard, large ecosystem, HCL is declarative, multi-cloud |
State management complexity, HCL learning curve, BSL licence (OpenTofu as alternative) |
Evaluate |
| Pulumi |
Real programming languages (Go, TypeScript), strong typing, testing |
Smaller ecosystem, less community content, state management similar to Terraform |
Evaluate (strong fit with Go stack) |
| AWS CDK |
Native AWS integration, TypeScript/Go support |
AWS-only (not multi-cloud), CloudFormation under the hood |
Lower priority (multi-cloud needed for Cloudflare) |
| OpenTofu |
Terraform-compatible, open source (MPL 2.0) |
Younger project, smaller team |
Evaluate (if Terraform BSL is a concern) |
Decision required: IaC tool selection is TBC. Recommendation: evaluate Pulumi (Go alignment) and Terraform/OpenTofu (ecosystem breadth) in a spike. Whichever tool is chosen, the standards below apply.
14.2 Repository Structure
infrastructure/
├── modules/ # Reusable modules
│ ├── vpc/ # VPC, subnets, NAT, security groups
│ ├── compute/ # EC2/containers, ASG, ALB
│ ├── database/ # RDS, SurrealDB, ElastiCache
│ ├── observability/ # OpenSearch, Grafana, Jaeger, OTel Collector
│ ├── gateway/ # KrakenD deployment
│ ├── cloudflare/ # DNS, WAF, Workers, Pages, R2
│ └── security/ # WAF rules, security groups, KMS
├── environments/
│ ├── sandbox/ # Sandbox environment configuration
│ ├── dev/ # Dev environment configuration
│ ├── test/ # Test environment configuration
│ └── prod/ # Prod environment configuration
├── policies/ # OPA/Sentinel policies for compliance
└── README.md
14.3 Module Design Principles
- One module per concern: VPC, compute, database, observability are separate modules
- Inputs validated: All module inputs MUST have type constraints and validation rules
- Outputs explicit: Modules MUST export IDs, ARNs, endpoints needed by dependent modules
- No hardcoded values: All environment-specific values passed as variables
- Tagging enforced: Every resource MUST be tagged (see Cost Management section)
- Documentation: Every module MUST have a README with inputs, outputs, and examples
14.4 State Management
| Requirement |
Standard |
| Remote state |
S3 bucket (encrypted, versioned) + DynamoDB table (locking) |
| State per environment |
Separate state file per environment (never shared) |
| State locking |
Mandatory — prevent concurrent modifications |
| State encryption |
AES-256 encryption at rest |
| State access |
Restricted to CI/CD pipeline service account and designated operators |
| State backup |
S3 versioning provides history; cross-region replication for DR |
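Assuming Terraform or OpenTofu is selected (the decision is TBC), the requirements above map onto a backend block like this; the bucket, key, and table names are illustrative.

```hcl
# One backend configuration per environment; state is never shared.
terraform {
  backend "s3" {
    bucket         = "simpaisa-iac-state"      # encrypted, versioned bucket
    key            = "prod/terraform.tfstate"  # separate key per environment
    region         = "ap-south-1"
    encrypt        = true                      # AES-256 at rest
    dynamodb_table = "simpaisa-iac-locks"      # mandatory state locking
  }
}
```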
14.5 Drift Detection
- Drift detection MUST run daily on all environments
- Drift detection MUST run before every deployment
- Any detected drift MUST be reported as a P2 alert
- Drift MUST be resolved before the next planned deployment
- Unplanned manual changes to infrastructure are prohibited
15. CI/CD Pipeline Standards
Jenkins will NOT be used. CI/CD tool is TBC. The standards below are tool-agnostic.
15.1 Tool Selection
| Tool |
Pros |
Cons |
Status |
| Bitbucket Pipelines |
Native Bitbucket integration, simple YAML config |
Limited compute, caching limitations |
Evaluate (Simpaisa uses Bitbucket) |
| Dagger |
Containerised pipelines, language-native (Go SDK), portable |
Newer, smaller community |
Evaluate (strong fit with Go + AI SDLC) |
| Buildkite |
Fast, self-hosted agents, YAML config, scalable |
Requires agent infrastructure |
Evaluate |
| Woodpecker CI |
Open source, Drone-compatible, container-native |
Smaller community |
Evaluate |
15.2 Pipeline Stages
┌──────┐   ┌──────┐   ┌───────┐   ┌───────────────┐   ┌────────┐   ┌────────┐
│ Lint │ → │ Test │ → │ Build │ → │ Security Scan │ → │ Deploy │ → │ Verify │
└──────┘   └──────┘   └───────┘   └───────────────┘   └────────┘   └────────┘
| Stage |
Activities |
Failure Action |
| Lint |
Code formatting, linting, static analysis |
Block — fix before proceeding |
| Test |
Unit tests, integration tests (with coverage) |
Block — tests must pass |
| Build |
Compile, build container image, generate artefacts |
Block — build must succeed |
| Security Scan |
Dependency vulnerability scan, SAST, secret scanning, container scan |
Block if critical/high findings |
| Deploy |
Deploy to target environment (blue/green or canary) |
Automatic rollback on failure |
| Verify |
Smoke tests, health checks, synthetic transactions |
Automatic rollback if verification fails |
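Assuming Bitbucket Pipelines were selected (the tool is TBC), the stages might be sketched as follows; the commands and script names such as deploy.sh and smoke-tests.sh are illustrative, not existing repository files.

```yaml
# Sketch only: step names mirror the stages above.
pipelines:
  branches:
    main:
      - step:
          name: Lint
          script:
            - golangci-lint run ./...
      - step:
          name: Test
          script:
            - go test -race -coverprofile=coverage.out ./...
      - step:
          name: Build
          script:
            - docker build -t "$SERVICE:$BITBUCKET_COMMIT" .
      - step:
          name: Security Scan
          script:
            - trivy image --exit-code 1 --severity CRITICAL,HIGH "$SERVICE:$BITBUCKET_COMMIT"
      - step:
          name: Deploy
          deployment: production
          script:
            - ./deploy.sh canary "$SERVICE:$BITBUCKET_COMMIT"
      - step:
          name: Verify
          script:
            - ./smoke-tests.sh && ./health-check.sh
```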
15.3 Quality Gates
| Gate |
Requirement |
Blocks Deployment? |
| Code coverage |
Minimum 80% for new code, 60% overall |
Yes |
| Security vulnerabilities |
Zero critical, zero high (for Prod) |
Yes (Prod), Warning (Dev/Test) |
| Secret scanning |
No secrets detected in code or config |
Yes (all environments) |
| Dependency vulnerabilities |
No known critical CVEs in dependencies |
Yes (Prod) |
| Container scan |
No critical vulnerabilities in container image |
Yes (Prod) |
| Performance regression |
No P95 latency regression > 10% (payment paths) |
Yes (Prod) |
| API contract |
OpenAPI spec validation passes |
Yes (all environments) |
15.4 Artefact Management
| Artefact |
Storage |
Retention |
Naming |
| Container images |
Container registry (evaluate: ECR, Cloudflare Container Registry, or self-hosted) |
90 days for non-production tags, indefinite for production tags |
<service>:<git-sha>-<build-number> |
| Go binaries |
R2/S3 artefact bucket |
90 days |
<service>-<version>-<os>-<arch> |
| IaC plans |
R2/S3 artefact bucket |
365 days |
<environment>-<timestamp>-<git-sha>.plan |
| Test reports |
R2/S3 artefact bucket |
365 days |
<service>-<timestamp>-test-report.xml |
15.5 Deployment Automation
| Requirement |
Standard |
| No manual deployments |
All deployments MUST go through the CI/CD pipeline |
| Reproducible |
Same artefact deployed to all environments (configuration differs, not code) |
| Auditable |
Every deployment logged: who triggered, what version, when, which environment |
| Rollback |
One-click rollback to previous version (< 5 minutes) |
| Deployment windows |
Prod deployments during business hours (UTC+5) unless emergency |
| Feature flags |
Use PostHog feature flags for gradual rollout, not deployment gating |
16. Cost Management
16.1 Tagging Strategy
All AWS and Cloudflare resources MUST have the following tags:
| Tag Key |
Description |
Example Values |
Required |
| Environment |
Deployment environment |
sandbox, dev, test, prod |
Yes |
| Service |
Service name |
pay-in-service, krakend, grafana |
Yes |
| Product |
Product line |
pay-ins, pay-outs, remittances, cards, platform |
Yes |
| Owner |
Team or individual responsible |
engineering, platform, security |
Yes |
| CostCentre |
Financial cost centre |
TECH-001, SEC-001 |
Yes |
| ManagedBy |
IaC tool or manual |
terraform, pulumi, manual |
Yes |
| Criticality |
Service tier |
tier-1, tier-2, tier-3, tier-4 |
Yes |
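If Terraform is chosen, the mandatory tags can be enforced centrally via the AWS provider's default_tags block rather than repeated on every resource; the values shown are examples.

```hcl
# Every resource created by this provider inherits these tags.
provider "aws" {
  default_tags {
    tags = {
      Environment = "prod"
      Service     = "pay-in-service"
      Product     = "pay-ins"
      Owner       = "engineering"
      CostCentre  = "TECH-001"
      ManagedBy   = "terraform"
      Criticality = "tier-1"
    }
  }
}
```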
16.2 Budget Alerts
| Alert Level |
Threshold |
Notification |
Action |
| Info |
50% of monthly budget |
Email to engineering lead |
Review spending trend |
| Warning |
75% of monthly budget |
Slack notification to engineering |
Investigate and optimise |
| Critical |
90% of monthly budget |
SMS to CDO + engineering lead |
Immediate cost review |
| Breach |
100% of monthly budget |
Call to CDO |
Emergency cost reduction |
16.3 Reserved Capacity Planning
| Resource |
Strategy |
Review Cadence |
| EC2 instances |
Reserved Instances (1-year) for baseline, On-Demand for burst |
Quarterly |
| RDS |
Reserved Instances (1-year) for all Prod databases |
Annually |
| ElastiCache |
Reserved Nodes for Prod |
Annually |
| Cloudflare |
Enterprise plan (annual commitment) |
Annually |
| Data transfer |
Cloudflare reduces AWS egress; monitor and optimise |
Monthly |
16.4 Cost Optimisation Reviews
| Review |
Frequency |
Owner |
Focus |
| Resource utilisation |
Monthly |
Engineering |
Right-sizing instances, identifying idle resources |
| Data transfer costs |
Monthly |
Engineering |
Optimise cross-AZ and internet egress traffic |
| Reserved Instance coverage |
Quarterly |
CDO + Engineering |
Ensure RI coverage matches usage |
| Architecture cost review |
Quarterly |
CDO |
Evaluate architectural changes for cost impact |
| Vendor negotiation |
Annually |
CDO |
AWS, Cloudflare, ControlPlane.com contract review |
17. Migration Roadmap
Phase Overview
- Phase 1: Observability (Q2 2026)
- Phase 2: Edge (Q2-Q3 2026)
- Phase 3: API Gateway (Q3 2026)
- Phase 4: Identity (Q3-Q4 2026)
- Phase 5: Compute (Q4 2026-Q1 2027)
- Phase 6: Data (2027)
Phase 1: Observability (Replace CloudWatch)
Timeline: Q2 2026
Priority: CRITICAL — prerequisite for all other phases
| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Deploy OpenTelemetry Collector | Central telemetry pipeline (gateway mode) | None | 1 week |
| Deploy Grafana | Dashboards and alerting | None | 1 week |
| Deploy Prometheus | Metrics storage | Grafana | 1 week |
| Deploy Jaeger (or Tempo) | Distributed tracing | OpenSearch (for storage) | 1 week |
| Deploy OpenSearch | Log aggregation and trace storage | None | 2 weeks |
| Instrument existing services | Add OTel SDK to Spring Boot services | OTel Collector | 2-3 weeks |
| Build dashboards | Per-product, per-channel, infrastructure, SLA | Grafana + data flowing | 2 weeks |
| Configure alerting | Alert rules for all P1/P2 scenarios | Grafana | 1 week |
| Decommission CloudWatch dependency | Remove CloudWatch alarms, switch to Grafana | All above complete | 1 week |
| Deploy PostHog | Product analytics | None | 1 week |
Success Criteria:
- All services emit traces, metrics, and structured logs via OTel
- End-to-end transaction tracing works for all products
- Grafana dashboards operational for all products
- Alerting functional with correct escalation paths
- CloudWatch no longer primary monitoring tool
Phase 2: Edge (Cloudflare)
Timeline: Q2-Q3 2026
Priority: HIGH
| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Migrate DNS to Cloudflare | Authoritative DNS for all domains | None | 1 week |
| Enable Cloudflare CDN | Cache static assets, configure cache rules | DNS migration | 1 week |
| Configure Cloudflare WAF | Payment API protection rules | DNS migration | 1 week |
| Deploy Cloudflare Workers | Geo-routing, rate limiting, header injection | DNS migration | 2 weeks |
| Migrate static sites to Pages | Corporate site, developer portal | DNS migration | 2 weeks |
| Configure R2 buckets | Merchant reports, transaction receipts | None | 1 week |
| Implement Authenticated Origin Pulls | Secure Cloudflare-to-ALB connection | CDN enabled | 1 week |
| Configure bot management | Bot detection and challenge rules | WAF configured | 1 week |
Success Criteria:
- All traffic routes through Cloudflare
- WAF blocking malicious traffic
- Static sites served from Cloudflare Pages
- Origin servers only accessible from Cloudflare IPs
- DDoS protection active
Phase 3: API Gateway (KrakenD)
Timeline: Q3 2026
Priority: CRITICAL
| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Deploy KrakenD to Test | Initial deployment with basic configuration | Phase 1 (observability) | 1 week |
| Define API specifications | OpenAPI 3.1 specs for all endpoints | None | 2 weeks |
| Configure auth verification | JWT validation, API key verification | None | 1 week |
| Configure rate limiting | Per-merchant, per-product, per-endpoint limits | None | 1 week |
| Configure error standardisation | RFC 9457 error responses | None | 1 week |
| Deploy to Sandbox | Merchant-facing test environment | Test deployment stable | 1 week |
| Merchant migration (phased) | Migrate merchants to gateway-fronted endpoints | Sandbox proven | 4-6 weeks |
| Deploy to Prod | Production deployment with blue/green | Merchant migration tested | 1 week |
Success Criteria:
- All API traffic routes through KrakenD
- Rate limiting enforced per merchant
- Auth verification at gateway level
- Standardised error responses
- OpenAPI validation rejecting malformed requests
Phase 4: Identity (ControlPlane.com)
Timeline: Q3-Q4 2026
Priority: HIGH
| Task |
Description |
Dependencies |
Effort |
| ControlPlane.com setup |
Account, organisation, initial configuration |
None |
1 week |
| Workload identity |
Migrate service-to-service auth from IAM to ControlPlane |
Phase 3 (KrakenD) |
2-3 weeks |
| Merchant identity |
Design merchant RBAC model |
None |
1 week |
| KrakenD integration |
JWT issuance and validation via ControlPlane |
Phase 3 + workload identity |
2 weeks |
| SSO for internal tools |
Grafana, OpenSearch, merchant portal via SSO |
ControlPlane setup |
2 weeks |
| Policy-as-code |
Define and test access policies |
All above |
2 weeks |
Phase 5: Compute Modernisation
Timeline: Q4 2026 - Q1 2027
Priority: MEDIUM
| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Container platform selection | Evaluate ECS Fargate vs EKS vs ControlPlane.com | Phase 4 | 1-2 weeks |
| Deploy Caddy | Per-service reverse proxy with mTLS | Container platform | 2 weeks |
| First Go service | New service built in Go, deployed as container | Container platform | 4-6 weeks |
| Blue/green deployment | Implement for Tier 1 services | Container platform | 2 weeks |
| Canary deployment | Implement for API Gateway and payment initiation | Blue/green working | 2 weeks |
| Unikraft evaluation | Assess Unikraft for security-critical payment processing | Go service proven | 4 weeks |
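The canary task implies a stable traffic split: the same caller should consistently land on the same side, so a bad canary affects a bounded, identifiable slice of merchants. A minimal Go sketch of deterministic cohort assignment — the FNV hash and the per-merchant keying are illustrative assumptions, not the chosen mechanism:

```go
package main

import "hash/fnv"

// InCanary deterministically assigns a merchant to the canary cohort.
// percent is 0-100; the same merchantID always yields the same answer,
// so rollback and blast-radius analysis stay tractable.
func InCanary(merchantID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(merchantID))
	return h.Sum32()%100 < percent
}
```

In practice the split would live in the load balancer or gateway config rather than service code; the property worth preserving is the determinism, not the implementation.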
Phase 6: Data Infrastructure
Timeline: 2027
Priority: MEDIUM
| Task |
Description |
Dependencies |
Effort |
| RDS split |
Separate shared RDS into per-service instances |
None (can start earlier) |
4-6 weeks |
| SurrealDB pilot |
Deploy SurrealDB for first new Go service |
Phase 5 (Go service) |
2-3 weeks |
| NSQ deployment |
Replace Kafka with NSQ for inter-service messaging |
None |
3-4 weeks |
| Meilisearch deployment |
Merchant-facing search in portal |
None |
2 weeks |
| Redis cluster mode |
Enable cluster mode, per-service namespacing |
None (can start earlier) |
1-2 weeks |
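Per-service Redis namespacing is mostly a key-naming discipline enforced in code. A small Go helper, assuming a hypothetical `service:v1:entity:id` convention (the layout is an illustration, not an existing Simpaisa standard):

```go
package main

import (
	"fmt"
	"strings"
)

// NSKey builds a namespaced Redis key such as "payins:v1:session:abc".
// Rejecting embedded colons keeps the segments unambiguous when keys
// are parsed back or scanned per service.
func NSKey(service, entity, id string) (string, error) {
	for _, part := range []string{service, entity, id} {
		if part == "" || strings.Contains(part, ":") {
			return "", fmt.Errorf("invalid key part %q", part)
		}
	}
	return fmt.Sprintf("%s:v1:%s:%s", service, entity, id), nil
}
```

Under cluster mode, related keys that must share a hash slot would additionally use Redis hash tags (braces around the common segment).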
Migration Risk Register
| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Service disruption during KrakenD rollout | HIGH | Medium | Blue/green deployment, gradual merchant migration, instant rollback |
| Cloudflare outage impacts all services | HIGH | Low | Document emergency bypass procedure; monitor Cloudflare status |
| Data loss during RDS split | CRITICAL | Low | Extensive testing in Test environment; point-in-time recovery enabled; rollback plan |
| ControlPlane.com integration delays | MEDIUM | Medium | Keep existing auth as fallback; phased migration |
| Compliance issues with new infrastructure | HIGH | Medium | Engage regulators early; legal review of each technology change |
| Team skill gap (Go, new tooling) | MEDIUM | High | Training programme; gradual adoption; AI SDLC augmentation |
18. Appendix: Infrastructure Controls Checklist
Use this checklist for infrastructure reviews and compliance audits.
A. Network Security
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| N-01 | All public endpoints behind Cloudflare (no direct origin access) | Security standard | ☐ |
| N-02 | ALB accepts traffic only from Cloudflare IP ranges | Security standard | ☐ |
| N-03 | Security groups follow least-privilege (no 0.0.0.0/0 inbound) | PCI DSS, all regulators | ☐ |
| N-04 | NACLs configured as defence in depth | Security standard | ☐ |
| N-05 | VPC flow logs enabled and exported to OpenSearch | PCI DSS, audit requirement | ☐ |
| N-06 | No public IP addresses on application or database instances | Security standard | ☐ |
| N-07 | CDE network segment isolated (Cards product) | PCI DSS 4.0 | ☐ |
| N-08 | DDoS protection active (Cloudflare) | All regulators | ☐ |
| N-09 | WAF rules configured for payment API protection | PCI DSS, security standard | ☐ |
| N-10 | DNSSEC enabled on all zones | Security standard | ☐ |
B. Encryption
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| E-01 | TLS 1.2+ on all external connections | PCI DSS 4.0, all regulators | ☐ |
| E-02 | TLS 1.3 preferred where supported | Security standard | ☐ |
| E-03 | mTLS for all service-to-service communication | Security standard | ☐ |
| E-04 | Database encryption at rest (AES-256) | PCI DSS, all regulators | ☐ |
| E-05 | S3/R2 bucket encryption enabled | Security standard | ☐ |
| E-06 | PAN encrypted at application level (not just disk-level encryption) | PCI DSS 4.0 | ☐ |
| E-07 | Encryption keys managed in KMS (separate from data) | PCI DSS 4.0 | ☐ |
| E-08 | Certificate auto-renewal configured | Operational | ☐ |
| E-09 | No SSL or early TLS anywhere | PCI DSS 4.0 | ☐ |
C. Access Control
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| A-01 | MFA enabled for all CDE access | PCI DSS 4.0 | ☐ |
| A-02 | MFA enabled for all infrastructure access | Security standard, all regulators | ☐ |
| A-03 | No shared accounts or credentials | PCI DSS, security standard | ☐ |
| A-04 | Service accounts use workload identity (no static credentials) | Security standard | ☐ |
| A-05 | Quarterly access review completed | PCI DSS, all regulators | ☐ |
| A-06 | Privileged access logged and alerted | PCI DSS, all regulators | ☐ |
| A-07 | Break-glass procedure documented and tested | Operational | ☐ |
| A-08 | Terminated employee access revoked within 24 hours | PCI DSS, all regulators | ☐ |
D. Logging and Monitoring
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| L-01 | Structured JSON logging on all services | Observability standard | ☐ |
| L-02 | Trace ID propagated end-to-end | Observability standard | ☐ |
| L-03 | CDE access logs tamper-evident | PCI DSS 4.0 | ☐ |
| L-04 | Log retention meets jurisdictional requirements (up to 7 years) | PK, BD, NP, IQ, EG regulators | ☐ |
| L-05 | Automated log review for security events | PCI DSS 4.0 | ☐ |
| L-06 | Alerting configured for all P1/P2 scenarios | Operational | ☐ |
| L-07 | Dashboards operational for all products | Operational | ☐ |
| L-08 | No sensitive data in logs (PAN, CVV, PIN, full CNIC) | PCI DSS, PDPA | ☐ |
| L-09 | Audit trail for all infrastructure changes | All regulators | ☐ |
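Control L-08 implies redaction before a line ever reaches the log pipeline. A hedged Go sketch of PAN masking — the regex is deliberately broad (any 13-19 digit run); a production redactor would also handle separators and Luhn-check candidates to cut false positives:

```go
package main

import "regexp"

// panPattern matches bare 13-19 digit runs, the typical PAN length range.
var panPattern = regexp.MustCompile(`\b\d{13,19}\b`)

// MaskPAN redacts probable card numbers before log emission,
// keeping only the last four digits for support correlation.
func MaskPAN(s string) string {
	return panPattern.ReplaceAllStringFunc(s, func(m string) string {
		masked := make([]byte, len(m))
		for i := range masked {
			masked[i] = '*'
		}
		copy(masked[len(m)-4:], m[len(m)-4:])
		return string(masked)
	})
}
```

Redaction belongs in the logging layer itself (e.g. a slog handler wrapper), not in each call site, so a missed call cannot leak a PAN.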
E. Backup and Recovery
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| R-01 | Automated daily backups for all databases | All regulators | ☐ |
| R-02 | Backup restore tested monthly | DR standard | ☐ |
| R-03 | RPO/RTO defined per service tier | DR standard | ☐ |
| R-04 | DR runbooks documented | All regulators, DR standard | ☐ |
| R-05 | DR exercise conducted twice per year | DR standard | ☐ |
| R-06 | Backups encrypted | PCI DSS, security standard | ☐ |
| R-07 | Backups stored in a different location from primary | DR standard | ☐ |
F. Compliance
| # | Control | Required By | Status |
| --- | --- | --- | --- |
| C-01 | Data localisation requirements met per jurisdiction | PK, BD, NP | ☐ |
| C-02 | Incident reporting capability (2-hour internal SLA) | All | ☐ |
| C-03 | Transaction record retention (minimum 5 years) | PK, IQ, PCI DSS | ☐ |
| C-04 | PCI DSS v4.0.1 assessment current | PCI DSS | ☐ |
| C-05 | AML/CFT transaction monitoring operational | All jurisdictions | ☐ |
| C-06 | Sanctions screening integrated | All jurisdictions (especially IQ) | ☐ |
| C-07 | Regulatory technology change approvals obtained | PK (SBP), BD, NP | ☐ |
| C-08 | ISO 27001 certification (required for Nepal) | NP (NRB) | ☐ |
| C-09 | Annual PCI DSS assessment scheduled | PCI DSS | ☐ |
| C-10 | Quarterly vulnerability scanning programme | PCI DSS 4.0 | ☐ |
Document Control
| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-04-03 | CDO (AI SDLC) | Initial version — AI SDLC prototype and showcase |
Review Schedule: Quarterly (next review: Q3 2026)
Distribution: Architecture & Engineering Leadership
This document was generated as part of the Simpaisa AI SDLC prototype. All compliance information should be verified with legal counsel and regulatory advisors in each jurisdiction.