
Simpaisa Infrastructure Standards

Version: 1.0.0
Date: 2026-04-03
Owner: CDO (Daniel O'Reilly)
Classification: Internal (Architecture & Engineering Leadership)
Status: Living Document (Prototype / AI SDLC Showcase)


Table of Contents

  1. Executive Summary
  2. Infrastructure Principles
  3. Environment Strategy
  4. Compute Standards
  5. Networking Standards
  6. Edge & CDN (Cloudflare)
  7. API Gateway (KrakenD)
  8. Observability Stack
  9. Identity & Access (ControlPlane.com)
  10. Data Infrastructure
  11. Secret Management
  12. Disaster Recovery & Business Continuity
  13. Compliance Infrastructure Requirements
  14. Infrastructure as Code
  15. CI/CD Pipeline Standards
  16. Cost Management
  17. Migration Roadmap
  18. Appendix: Infrastructure Controls Checklist

1. Executive Summary

This document defines the infrastructure standards for Simpaisa's payment gateway platform, which processes 270M+ transactions worth $1B+ across Pakistan, Bangladesh, Nepal, Iraq, and Egypt. It covers four product lines: Pay-Ins, Pay-Outs, Remittances, and Cards.

Context

This is a prototype and showcase of AI SDLC capabilities. The organisation is adopting an agentic AI SDLC-first approach — the team structure will be reorganised as required to support this model.

Current State Summary

Simpaisa runs on AWS with sound foundational infrastructure (Multi-AZ, WAF, ALB, ASG, RDS, ElastiCache) but has significant gaps in observability, API gateway, disaster recovery documentation, and distributed tracing. The platform uses Spring Boot / Java services on EC2.

Target State Summary

The target architecture moves towards a cloud-native, multi-provider model:

Layer Current Target
Edge/CDN AWS WAF only Cloudflare (CDN, WAF, DDoS, Workers, Pages, R2, DNS)
API Gateway None KrakenD
Compute EC2 + ASG (Spring Boot/Java) Containers (Go services) + Unikraft unikernels (assess)
Reverse Proxy ALB direct Caddy (per-service, mTLS) behind ALB
Identity Custom auth ControlPlane.com
Observability CloudWatch OpenTelemetry → Grafana / Jaeger / OpenSearch
Analytics None PostHog
Database RDS MySQL (shared) SurrealDB (new services) + MySQL (existing)
Messaging Kafka NSQ
Search None Meilisearch (merchant-facing) + OpenSearch (logs)
Workflow None Temporal
Hosting AWS only Cloudflare preferred + AWS for existing

Critical Gaps

Gap Priority Impact
No API Gateway CRITICAL No centralised rate limiting, auth verification, or request validation
Single shared RDS CRITICAL Single point of failure, no service isolation
No DR documentation CRITICAL Unknown recovery posture
No distributed tracing HIGH Cannot trace transactions end-to-end across services
No CDN HIGH Latency for merchant-facing assets, no edge caching
Single ElastiCache cluster HIGH Cache failure impacts all services
No blue/green or canary MEDIUM Risky deployments with potential downtime
No IaC documented MEDIUM Infrastructure drift, no reproducibility

2. Infrastructure Principles

2.1 Cloud-Native

All new services MUST be designed as cloud-native, containerised workloads. Infrastructure MUST be provisioned through APIs, not manual console operations.

2.2 Infrastructure as Code

All infrastructure MUST be defined in version-controlled code. Manual provisioning and manual configuration changes MUST NOT be made in any environment. Drift detection MUST run on every deployment.

2.3 Immutable Deployments

Infrastructure and application artefacts MUST be immutable. Running instances MUST NOT be updated in place. Every deployment creates new artefacts; rollback means redeploying the previous artefact.

2.4 Observability-First

Every service MUST emit structured logs, metrics, and traces from day one. Observability is not optional — it is a deployment prerequisite. OpenTelemetry is the mandatory instrumentation standard.

2.5 Security by Default

All network traffic MUST be encrypted in transit (TLS 1.2 minimum, TLS 1.3 preferred). All data at rest MUST be encrypted. Networking MUST be zero-trust, with no implicit trust between services, and all service-to-service communication MUST use mTLS.

2.6 Multi-Jurisdiction Compliance

Infrastructure MUST satisfy regulatory requirements across all operating jurisdictions (Pakistan, Bangladesh, Nepal, Iraq, Egypt). Data residency requirements MUST be met per jurisdiction. Compliance controls MUST be auditable and evidenced.

2.7 Least Privilege

All access — human and machine — MUST follow the principle of least privilege. Service accounts MUST have only the permissions required for their function. Permissions MUST be reviewed quarterly.

2.8 Automation Over Process

Automate everything that can be automated. Manual processes are a source of error and a barrier to scale. If a runbook step is repeated more than twice, it MUST be automated.


3. Environment Strategy

3.1 Environment Definitions

Environment Purpose URL Pattern Access Data
Sandbox Merchant-facing testing and integration sandbox.simpaisa.com Merchants + internal Synthetic test data only
Dev Internal development and experimentation dev.internal.simpaisa.com Engineering only Synthetic / anonymised
Test Automated testing, QA, UAT test.internal.simpaisa.com Engineering + QA Synthetic / anonymised
Prod Live production traffic api.simpaisa.com Controlled access Real customer/merchant data

3.2 Environment Parity Requirements

Aspect Requirement
Architecture All environments MUST use the same architectural patterns (ALB, ASG, VPC layout)
Configuration Same configuration structure, different values per environment
Infrastructure Dev/Test may use smaller instance sizes; architecture MUST match Prod
Networking Same VPC/subnet design; security groups MUST be equivalent
Secrets Each environment has its own secrets; NEVER share across environments
Databases Same engine and version across all environments
Monitoring All environments MUST have observability; alerting thresholds differ

3.3 Data Segregation

  • Production data MUST NEVER be copied to lower environments without anonymisation
  • Each environment MUST have its own database instances, cache clusters, and message queues
  • Payment channel credentials MUST be environment-specific (sandbox credentials for Sandbox, live for Prod)
  • PII MUST NOT exist in Dev or Test environments
  • Sandbox MUST simulate realistic payment channel responses (success, failure, timeout, partial)

3.4 Promotion Workflow

Dev → Test → Prod
 │      │      │
 │      │      └── Requires: all quality gates passed, change approval, deployment window
 │      └────────── Requires: all automated tests pass, security scan clean
 └───────────────── Requires: code review, unit tests pass, lint clean

Gate Dev → Test Test → Prod
Code review Required N/A (already done)
Unit tests Pass Pass
Integration tests Run Pass (mandatory)
Security scan Run Pass (mandatory, zero critical/high)
Performance test Optional Required for payment-path changes
Change approval Not required Required (CDO or delegate)
Deployment window Any time Scheduled (avoid peak transaction hours)
Rollback plan Documented Documented and tested
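
The Test → Prod column can be read as an all-gates-must-pass check. The sketch below is a minimal illustration in Python; the gate key names and the function itself are hypothetical and do not describe an existing Simpaisa pipeline.

```python
# Gates taken from the Test → Prod column above; the key names are hypothetical.
TEST_TO_PROD_GATES = (
    "unit_tests_pass",
    "integration_tests_pass",
    "security_scan_zero_critical_high",
    "change_approved",
    "in_deployment_window",
    "rollback_plan_tested",
)


def can_promote_to_prod(results, payment_path_change):
    """Return (ok, failed_gates) for a Test → Prod promotion request."""
    required = list(TEST_TO_PROD_GATES)
    if payment_path_change:
        # Performance tests are only mandatory for payment-path changes.
        required.append("performance_test_pass")
    # A gate that is missing from the results counts as failed.
    failed = [gate for gate in required if not results.get(gate, False)]
    return (not failed, failed)
```

Treating a missing result as a failure keeps the check fail-closed: a pipeline step that never ran cannot silently allow a promotion.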

3.5 Sandbox-Specific Requirements

The Sandbox environment is merchant-facing and MUST:

  • Be available 99.5% of the time (separate SLA from Prod)
  • Provide realistic response times (within 2x of Prod P95)
  • Support all payment channels with simulated responses
  • Provide test credentials and documentation
  • Allow merchants to trigger specific scenarios (success, decline, timeout, insufficient funds)
  • Have its own KrakenD instance with the same rate limiting configuration as Prod
  • Log all requests for merchant support and debugging

4. Compute Standards

4.1 Current State

Aspect Detail
Platform AWS EC2 instances
Scaling Auto Scaling Groups (ASG)
Runtime Spring Boot / Java
Load Balancing Application Load Balancer (ALB) in public subnets
Availability Multi-AZ deployment
Deployment Rolling updates via ASG

4.2 Target State

Aspect Detail
New services Go services in containers
Security-critical Unikraft unikernels (assess phase — evaluate for payment processing core)
Reverse proxy Caddy per-service (behind ALB, providing mTLS termination)
Orchestration TBC — evaluate ECS Fargate, Kubernetes (EKS), or ControlPlane.com
Existing services Spring Boot / Java on EC2 (maintained until rewritten)

4.3 Sizing Guidelines

Service Tier Description Min Instances Instance Type (Current) Auto-Scale Trigger
Tier 1 — Payment Critical Pay-In initiation, Pay-Out execution, Remittance processing, Card auth 3 (Multi-AZ) m5.xlarge or equivalent CPU > 60%, request latency P99 > 500ms
Tier 2 — Merchant Facing API Gateway, Sandbox, Developer Portal, Merchant Dashboard 2 (Multi-AZ) m5.large or equivalent CPU > 70%, request latency P99 > 1s
Tier 3 — Internal Reporting, reconciliation, back-office 2 m5.large or equivalent CPU > 75%
Tier 4 — Infrastructure Observability, logging, search indexing 2 r5.large or equivalent Disk > 80%, memory > 85%

4.4 Auto-Scaling Policies for Payment Workloads

Payment services MUST scale based on:
  - Request rate (transactions per second)
  - Response latency (P95 and P99)
  - CPU utilisation
  - Queue depth (for async processing)

Scale-out: aggressive (1-minute evaluation, 2-minute cooldown)
Scale-in: conservative (5-minute evaluation, 10-minute cooldown)

Payment services MUST NOT scale to zero.
Minimum capacity MUST handle 2x average traffic without scaling.
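
As a worked illustration of the minimum-capacity rule, the sketch below derives an instance floor from average traffic. The per-instance throughput figure is an assumption that would have to come from load testing; the function name and parameters are hypothetical.

```python
import math


def min_instances(avg_tps, per_instance_tps, tier_floor=3):
    """Smallest instance count that absorbs 2x average traffic without scaling.

    per_instance_tps is an assumed, load-tested throughput per instance.
    tier_floor is the tier minimum from section 4.3 (3 for Tier 1).
    """
    if per_instance_tps <= 0:
        raise ValueError("per_instance_tps must be positive")
    needed = math.ceil(2 * avg_tps / per_instance_tps)
    return max(tier_floor, needed)
```

For example, at an average of 900 TPS and an assumed 250 TPS per instance, the floor is 8 instances; at 100 TPS the Tier 1 minimum of 3 still applies.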

4.5 Deployment Strategies

Strategy Current State Target State
Rolling update Yes (ASG) Maintained for non-critical services
Blue/green No Required for Tier 1 (payment-critical) services
Canary No Required for API Gateway and payment initiation

Blue/Green Requirements:

  • Two identical environments (blue and green)
  • Traffic switch at ALB level (weighted target groups)
  • Automated health checks before full cutover
  • Instant rollback capability (switch back to previous colour)
  • Both environments kept warm for minimum 30 minutes post-deployment

Canary Requirements:

  • Initial canary: 5% of traffic
  • Automated metric comparison (error rate, latency, success rate)
  • Automatic rollback if error rate increases by > 0.1%
  • Progressive rollout: 5% → 25% → 50% → 100%
  • Minimum 10 minutes at each stage
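
The canary rules can be sketched as a small decision function. This is an illustration only: it assumes the 0.1% threshold means 0.1 percentage points above the baseline error rate, and the function and constant names are hypothetical.

```python
STAGES = [5, 25, 50, 100]        # percent of traffic, from the canary requirements
MIN_STAGE_MINUTES = 10
MAX_ERROR_RATE_INCREASE = 0.1    # assumption: percentage points over baseline


def next_canary_step(stage_pct, minutes_at_stage, baseline_error_pct, canary_error_pct):
    """Return ("rollback", 0), ("hold", stage_pct) or ("promote", next_stage_pct)."""
    if canary_error_pct - baseline_error_pct > MAX_ERROR_RATE_INCREASE:
        return ("rollback", 0)
    if minutes_at_stage < MIN_STAGE_MINUTES or stage_pct == STAGES[-1]:
        return ("hold", stage_pct)
    return ("promote", STAGES[STAGES.index(stage_pct) + 1])
```

The rollback check runs first so that a degraded canary is withdrawn even if it has not yet completed its minimum time at the current stage.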


5. Networking Standards

5.1 VPC Design

Each environment MUST have its own VPC. VPCs MUST NOT be shared across environments.

Environment VPC CIDR Region
Prod 10.0.0.0/16 TBC (primary)
Test 10.1.0.0/16 TBC (same region as Prod)
Dev 10.2.0.0/16 TBC (same region as Prod)
Sandbox 10.3.0.0/16 TBC (same region as Prod)

Note: CIDR ranges are illustrative. Final allocation requires network planning exercise including payment channel VPN requirements.

5.2 Subnet Strategy

Each VPC MUST have three subnet tiers across a minimum of two Availability Zones:

Subnet Tier Purpose Internet Access Examples
Public Edge / ingress Direct (IGW) ALB, NAT Gateway, Bastion (if required)
Private Application workloads Outbound only (NAT) EC2 instances, containers, KrakenD, Caddy
Isolated Data stores None RDS, ElastiCache, SurrealDB, OpenSearch

5.3 Security Groups and NACLs

Security Group Rules:

Component Inbound Outbound
ALB 443 (HTTPS) from Cloudflare IPs only Application ports to private subnets
Application instances Application port from ALB SG only 443 to NAT GW (external APIs), DB ports to isolated subnet
KrakenD 8080 from ALB SG Application ports to private subnets
RDS MySQL 3306 from application SG only None (stateful return traffic)
ElastiCache Redis 6379 from application SG only None
OpenSearch 9200 from observability SG only None

NACL Rules:

  • NACLs provide defence in depth at the subnet level
  • Deny all by default, explicitly allow required traffic
  • NACLs MUST mirror security group intent but at the subnet level

5.4 NAT Gateway Configuration

  • One NAT Gateway per Availability Zone for high availability
  • All private subnet outbound traffic routes through NAT Gateway
  • NAT Gateway MUST be in the public subnet
  • Elastic IP allocated per NAT Gateway

5.5 DNS: Cloudflare DNS

Aspect Standard
Primary DNS Cloudflare (authoritative)
Internal DNS Route 53 Private Hosted Zones (for VPC-internal resolution)
TTL 300s for API endpoints, 3600s for static assets
DNSSEC Enabled on all public zones
Records A/AAAA records proxied through Cloudflare (orange cloud)

5.6 DDoS Protection

Layer Current Target
Layer 7 AWS WAF Cloudflare WAF (primary) + AWS WAF (transitional)
Layer 3/4 AWS Shield Standard Cloudflare DDoS protection
Rate limiting None centralised Cloudflare rate limiting + KrakenD per-merchant limits
Bot management None Cloudflare Bot Management

6. Edge & CDN (Cloudflare)

6.1 Cloudflare as Primary Edge

Cloudflare MUST be the primary edge for all Simpaisa public-facing services. All traffic MUST pass through Cloudflare before reaching AWS infrastructure.

Service Cloudflare Product Purpose
CDN Cloudflare CDN Cache static assets, reduce origin load
WAF Cloudflare WAF Application-layer attack protection
DDoS Cloudflare DDoS Protection Volumetric and protocol attack mitigation
DNS Cloudflare DNS Authoritative DNS with global anycast
Workers Cloudflare Workers Edge logic (rate limiting, validation, geo-routing)
Pages Cloudflare Pages Static site hosting (corporate site, developer portal)
R2 Cloudflare R2 Object storage (reports, receipts, merchant documents)
Bot Management Cloudflare Bot Management Distinguish legitimate traffic from bots

6.2 Cloudflare Workers Use Cases

Use Case Description Priority
Geo-routing Route requests to appropriate regional backend based on merchant jurisdiction HIGH
Request validation Validate request structure before forwarding to origin HIGH
Rate limiting First-pass rate limiting at the edge (before KrakenD) HIGH
A/B testing Route percentage of traffic to canary deployments MEDIUM
IP allowlisting Enforce merchant IP allowlists at the edge MEDIUM
Response caching Cache merchant configuration, channel status responses MEDIUM
Header injection Add tracing headers (X-Request-ID, X-Trace-ID) at the edge HIGH

6.3 Cloudflare Pages

Site Repository Domain
Corporate website simpaisa.com repo www.simpaisa.com
Developer portal developer-portal repo developer.simpaisa.com
Status page status repo status.simpaisa.com

6.4 Cloudflare R2

Bucket Purpose Retention Access
merchant-reports Generated merchant reports (CSV, PDF) 90 days Merchant portal (signed URLs)
transaction-receipts Payment receipts 7 years (compliance) Internal + merchant API
merchant-documents KYC/KYB documents 10 years (compliance) Internal only
static-assets Images, fonts, scripts Indefinite Public (CDN)

6.5 WAF Rules for Payment API Protection

Rule Action Description
Block non-HTTPS Block All payment API traffic MUST be HTTPS
Block non-JSON Block Payment APIs accept JSON only; block other content types
Block oversized requests Block Maximum 1MB request body for payment APIs
Rate limit by merchant Challenge/Block Per-merchant TPS limits enforced at edge
Block known bad IPs Block Threat intelligence feed integration
SQL injection detection Block OWASP CRS rules for SQLi patterns
Geographic restrictions Block Block traffic from sanctioned jurisdictions
Bot score filtering Challenge Challenge requests with bot score < 30

6.6 Cloudflare-to-Origin Security

  • Authenticated Origin Pulls: Cloudflare presents a client certificate to the ALB; the ALB validates it
  • Origin CA: Use Cloudflare Origin CA certificates on ALB
  • Strict SSL mode: Full (Strict) — Cloudflare validates origin certificate
  • IP allowlisting: ALB security group MUST only allow Cloudflare IP ranges (published at cloudflare.com/ips)
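
The IP allowlisting rule amounts to a CIDR membership test. The sketch below illustrates that test with Python's standard library. The two ranges are a sample believed to be on Cloudflare's published list; a real implementation would load the current list from cloudflare.com/ips rather than hard-code it, and the names used here are hypothetical.

```python
import ipaddress

# Sample only: replace with the full, current list published by Cloudflare.
SAMPLE_CLOUDFLARE_RANGES = ["173.245.48.0/20", "103.21.244.0/22"]


def is_allowed_source(ip, ranges=SAMPLE_CLOUDFLARE_RANGES):
    """Return True if `ip` falls inside any allowlisted CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)
```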

7. API Gateway (KrakenD)

7.1 Deployment Architecture

Cloudflare Edge → ALB → KrakenD Cluster → Caddy (mTLS) → Backend Services

Aspect Standard
Deployment Containerised, minimum 3 instances across AZs
Configuration Declarative JSON, version-controlled
Health check /health endpoint, 10-second interval
Scaling Horizontal, based on request rate and latency
State Stateless — no persistent storage required

7.2 Configuration Management

  • KrakenD configuration MUST be declarative JSON stored in Git
  • Configuration changes MUST go through the standard promotion workflow (Dev → Test → Prod)
  • Configuration MUST be validated (krakend check) before deployment
  • Flexible Configuration (FC) MUST be used to template environment-specific values
  • Configuration MUST be generated from OpenAPI specifications where possible

7.3 Auth Verification at Gateway

Auth Method Product KrakenD Handling
JWT validation All (target) Validate JWT signature, expiry, issuer, audience
API key Sandbox Validate against key store, inject merchant context
RSA signature Pay-Outs, Remittances (current) Pass through to backend (gateway validates timestamp freshness)
mTLS Cards Terminate at Caddy, KrakenD receives forwarded client cert info

7.4 Rate Limiting Tiers

Tier Scope Default Limit Burst Notes
Global All merchants 10,000 req/s 15,000 Platform-wide safety limit
Per-merchant Individual merchant 100 req/s 200 Configurable per merchant agreement
Per-product Product line 5,000 req/s 7,500 Pay-Ins, Pay-Outs, Remittances, Cards
Per-endpoint Specific endpoint Varies Varies e.g., payment initiation: 50 req/s per merchant
Sandbox Sandbox environment 20 req/s per merchant 30 Lower limits for testing

Rate limit responses MUST include:

  • X-RateLimit-Limit: maximum requests allowed
  • X-RateLimit-Remaining: requests remaining in window
  • X-RateLimit-Reset: seconds until window resets
  • HTTP 429 status code with a JSON error body
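
One common way to implement a sustained limit with a burst allowance is a token bucket. The sketch below is an illustration of that technique using the per-merchant defaults (100 req/s, burst 200); it is not KrakenD's implementation. Because a token bucket has no fixed window, X-RateLimit-Reset is approximated here as the seconds until the next token, which is an assumption.

```python
import math


class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full so an idle merchant can burst
        self.updated = now

    def allow(self, now):
        """Return (http_status, headers) for one request arriving at `now` seconds."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        allowed = self.tokens >= 1.0
        if allowed:
            self.tokens -= 1.0
        # Assumption: "reset" is approximated as seconds until the next token.
        reset = 0 if allowed else math.ceil((1.0 - self.tokens) / self.rate)
        headers = {
            "X-RateLimit-Limit": str(self.rate),
            "X-RateLimit-Remaining": str(int(self.tokens)),
            "X-RateLimit-Reset": str(reset),
        }
        return (200 if allowed else 429), headers
```

Passing the clock in as `now` keeps the sketch deterministic; a real limiter would read a monotonic clock and hold one bucket per merchant.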

7.5 OpenAPI Validation

  • All API endpoints MUST have an OpenAPI 3.1 specification
  • KrakenD MUST validate incoming requests against the OpenAPI schema
  • Invalid requests MUST be rejected at the gateway (400 Bad Request)
  • Request body validation: required fields, types, format constraints
  • Query parameter validation: allowed values, types

7.6 Error Response Standardisation

All error responses from KrakenD MUST follow RFC 9457 (Problem Details for HTTP APIs):

{
  "type": "https://api.simpaisa.com/errors/rate-limited",
  "title": "Rate limit exceeded",
  "status": 429,
  "detail": "Merchant has exceeded 100 requests per second",
  "instance": "/v1/pay-ins/transactions",
  "traceId": "abc123-def456-ghi789"
}
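
A body of this shape could be assembled by a small helper such as the sketch below. The helper name and its parameters are hypothetical; the `application/problem+json` media type is the one RFC 9457 defines, and `traceId` is carried as an extension member.

```python
import json


def problem_response(error_slug, title, status, detail, instance, trace_id):
    """Build (status, headers, body) for an RFC 9457 problem details response."""
    body = {
        "type": f"https://api.simpaisa.com/errors/{error_slug}",
        "title": title,
        "status": status,
        "detail": detail,
        "instance": instance,
        "traceId": trace_id,  # extension member, as in the example above
    }
    headers = {"Content-Type": "application/problem+json"}
    return status, headers, json.dumps(body)
```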

7.7 High Availability

Requirement Standard
Minimum instances 3 (one per AZ)
Health check HTTP 200 on /health within 5 seconds
Graceful shutdown Drain connections for 30 seconds before termination
Configuration reload Zero-downtime reload on configuration change
Failover ALB removes unhealthy instances within 30 seconds
Availability target 99.99% (gateway MUST NOT be the bottleneck)

8. Observability Stack

CloudWatch will NOT be used. The observability stack is built on open standards (OpenTelemetry) with open-source tooling.

8.1 Architecture Overview

Services (OTel SDK) → OTel Collector → ┬→ Prometheus (metrics) → Grafana
                                        ├→ Jaeger / Tempo (traces) → Grafana
                                        └→ OpenSearch (logs) → Grafana / OpenSearch Dashboards

PostHog ← (product events from frontend + backend)

8.2 OpenTelemetry Collector

The OpenTelemetry Collector is the unified telemetry pipeline. All services MUST send telemetry to the OTel Collector — never directly to backends.

Aspect Standard
Deployment Agent mode (sidecar or daemonset) + Gateway mode (central)
Receivers OTLP (gRPC and HTTP), Prometheus scrape, Fluent Forward
Processors Batch, memory limiter, attribute enrichment, tail sampling
Exporters Prometheus Remote Write, Jaeger/Tempo OTLP, OpenSearch
Configuration Version-controlled YAML, per-environment

8.3 Traces: Jaeger (or Grafana Tempo)

Aspect Standard
Tool Jaeger (evaluate Grafana Tempo as alternative)
Storage OpenSearch (Jaeger backend) or S3 (Tempo)
Retention 30 days hot, 90 days cold
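Sampling Tail-based (OTel Collector tail sampling processor, since error status is only known once a trace completes): 100% of traces containing errors, 10% of successful traces (adjust per traffic)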
Context propagation W3C Trace Context (mandatory), B3 (for legacy compatibility)

Mandatory Trace Spans:

Every payment transaction MUST include the following spans:

Span Service Description
gateway.receive KrakenD Request received at gateway
auth.verify KrakenD / Auth service Authentication/authorisation check
payment.initiate Payment service Payment initiation logic
channel.request Channel adapter Request sent to payment channel (Easypaisa, JazzCash, etc.)
channel.response Channel adapter Response received from channel
payment.complete Payment service Transaction finalisation
callback.dispatch Callback service Webhook sent to merchant

8.4 Metrics: Prometheus + Grafana

Aspect Standard
Collection Prometheus (via OTel Collector remote write)
Visualisation Grafana
Retention 15 days high-resolution, 1 year downsampled
Naming convention simpaisa_<product>_<metric>_<unit>

Mandatory Metrics:

Metric Type Labels Description
simpaisa_transaction_total Counter product, channel, status, merchant Total transactions
simpaisa_transaction_duration_seconds Histogram product, channel, merchant Transaction processing time
simpaisa_transaction_amount_total Counter product, channel, currency Total transaction value
simpaisa_channel_request_duration_seconds Histogram channel, operation Time to get response from payment channel
simpaisa_channel_availability Gauge channel Channel health (1 = up, 0 = down)
simpaisa_gateway_request_total Counter method, path, status API gateway requests
simpaisa_gateway_latency_seconds Histogram method, path API gateway response time
simpaisa_error_total Counter product, error_type, severity Errors by type

8.5 Logs: OpenSearch with Structured Logging

Aspect Standard
Format JSON structured logging (mandatory)
Transport OTel Collector → OpenSearch
Retention 7 days hot, 90 days warm, 7 years cold (compliance)
Index pattern simpaisa-<service>-<environment>-YYYY.MM.DD
ISM Policy Hot → Warm at 7 days, Warm → Cold at 90 days, Delete at 7 years
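
The index pattern and ISM transitions can be illustrated with two small helpers. This is a sketch: the function names are hypothetical, and 7 years is approximated as 7 × 365 days.

```python
from datetime import date


def index_name(service, environment, day):
    """Build an index name following simpaisa-<service>-<environment>-YYYY.MM.DD."""
    return f"simpaisa-{service}-{environment}-{day:%Y.%m.%d}"


def ism_phase(age_days):
    """Map index age to the ISM phase given by the policy row above."""
    if age_days < 7:
        return "hot"
    if age_days < 90:
        return "warm"
    if age_days < 7 * 365:  # assumption: 7 years approximated as 2555 days
        return "cold"
    return "delete"
```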

Mandatory Log Fields:

{
  "timestamp": "2026-04-03T10:30:00.000Z",
  "level": "INFO",
  "service": "pay-in-service",
  "traceId": "abc123",
  "spanId": "def456",
  "merchantId": "M12345",
  "transactionId": "TXN-789",
  "channel": "easypaisa",
  "message": "Transaction initiated",
  "environment": "prod"
}
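
A service could enforce these fields at the point of logging. The sketch below is one minimal way to do that in Python with the standard library only; the helper name is hypothetical, and it assumes every field is required on every log line, which may be stricter than intended for non-transactional logs.

```python
import json
from datetime import datetime, timezone

MANDATORY_FIELDS = (
    "timestamp", "level", "service", "traceId", "spanId",
    "merchantId", "transactionId", "channel", "message", "environment",
)


def log_line(**fields):
    """Return one JSON log line, or raise if a mandatory field is missing."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    record = {"timestamp": now.replace("+00:00", "Z"), **fields}
    missing = [name for name in MANDATORY_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing mandatory log fields: {missing}")
    return json.dumps(record)
```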

Sensitive Data Rules:

  • NEVER log card numbers, CVV, PINs, or full account numbers
  • Mask mobile numbers: 03XX-XXXX-1234 (show last 4 digits only)
  • Mask CNICs: XXXXX-XXXXXXX-3 (show last digit only)
  • Log transaction IDs, merchant IDs, and channel references; these are required for tracing
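
The two masking rules can share one helper that hides every digit except the last few while preserving separators. The sketch below follows the stated text (last 4 digits for mobile numbers, last digit for CNICs) rather than reproducing the exact example patterns; the function name is hypothetical.

```python
def mask_digits(value, keep_last):
    """Replace every digit except the last `keep_last` digits with 'X'.

    Non-digit characters such as hyphens are kept in place.
    """
    total = sum(ch.isdigit() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total - keep_last else "X")
        else:
            out.append(ch)
    return "".join(out)
```

For instance, `mask_digits("12345-1234567-3", 1)` yields `XXXXX-XXXXXXX-3`, matching the CNIC pattern in the rule.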

8.6 Alerting

Aspect Standard
Tool Grafana Alerting (evaluate PagerDuty/OpsGenie for escalation)
Channels Slack (info/warning), SMS/call (critical), email (summary)
Escalation P1: immediate call → CDO + on-call engineer; P2: Slack + 15min response; P3: next business day

Alert Definitions:

Alert Severity Condition Action
Transaction success rate drop P1 Success rate < 95% for any channel over 5 minutes Immediate investigation
Payment channel down P1 Channel health check fails for 3 consecutive checks Failover / merchant notification
API latency spike P2 P99 latency > 2s for 5 minutes Scale out / investigate
Error rate increase P2 Error rate > 5% over 5 minutes Investigate
Disk space critical P2 Any data store > 85% disk usage Expand / clean up
Certificate expiry P3 Any certificate expiring within 14 days Renew
Deployment failed P2 Deployment health check fails Automatic rollback
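
The two P1 conditions reduce to simple predicates over a window of observations. The sketch below is illustrative only; in practice these rules would be expressed as Grafana alert queries, and the function names and the handling of an empty window are assumptions.

```python
def success_rate_breached(success_count, total_count, threshold_pct=95.0):
    """True if the success rate over the window is below the threshold."""
    if total_count == 0:
        return False  # assumption: no traffic in the window is not a breach
    return 100.0 * success_count / total_count < threshold_pct


def channel_down(health_checks, consecutive=3):
    """True if the most recent `consecutive` health checks all failed."""
    recent = health_checks[-consecutive:]
    return len(recent) == consecutive and not any(recent)
```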

8.7 Dashboards

Dashboard Audience Key Metrics
Executive Overview CDO, leadership Total transactions, value, success rate, revenue by product
Per-Product Product owners Transaction volume, success/failure rates, channel mix, latency
Per-Channel Operations Channel availability, response times, error rates, queue depth
Per-Merchant Support, account managers Merchant transaction volume, errors, rate limit hits
Infrastructure Engineering CPU, memory, disk, network, scaling events
Security Security team WAF blocks, auth failures, suspicious patterns, rate limit events
SLA Monitoring Operations, leadership P95/P99 latency per endpoint, uptime percentages

8.8 Transaction Tracing

End-to-end transaction tracing is the highest priority observability feature. Every merchant request MUST be traceable from Cloudflare edge → KrakenD → service → payment channel → callback.

Requirement Standard
Trace ID Generated at Cloudflare edge (Worker), propagated through all services
Correlation Trace ID MUST appear in logs, metrics labels, and traces
Merchant visibility Trace ID returned in API response headers (X-Trace-Id)
Support lookup Support team can search by trace ID, transaction ID, or merchant reference
Channel correlation Map Simpaisa trace ID to channel reference number
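
Trace propagation under W3C Trace Context relies on the `traceparent` header, which has the form `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. The sketch below shows how such a header can be generated and parsed. It is a Python illustration of the header format only; the edge Worker itself would be written in another language, and the function names are hypothetical.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Create a version-00 traceparent value with random trace and span IDs."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid in W3C Trace Context
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 1),
    }
```

The parsed `trace_id` is the value that would be echoed in X-Trace-Id and written to the `traceId` log field.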

8.9 PostHog for Product Analytics

Aspect Standard
Deployment Self-hosted (data residency compliance) or cloud (evaluate)
Events Merchant portal interactions, developer portal usage, API adoption
Feature flags PostHog feature flags for gradual rollout
Session replay Enabled for merchant portal (with PII redaction)
Funnels Merchant onboarding, first transaction, product adoption

9. Identity & Access (ControlPlane.com)

9.1 Overview

ControlPlane.com provides Universal Cloud Identity, enabling workloads to consume cloud resources from multiple providers without storing credentials. It employs a zero-trust architecture where every access request is fully authenticated and authorised.

9.2 Centralised Identity Management

Aspect Current State Target State
Human access AWS IAM users + console ControlPlane.com SSO → cloud provider roles
Service identity AWS IAM roles (per-service) ControlPlane.com workload identity
Merchant identity Custom auth (JSESSIONID / RSA) ControlPlane.com + KrakenD JWT validation
Audit trail CloudTrail (AWS only) ControlPlane.com tamper-proof audit trail + CloudTrail

9.3 Service-to-Service Authentication

Requirement Standard
Protocol mTLS (mutual TLS) via Caddy
Certificate management ControlPlane.com or automated CA (evaluate)
Rotation Automatic, maximum 24-hour certificate lifetime
Verification Both client and server certificates validated
No shared secrets Services MUST NOT use shared API keys for inter-service communication

9.4 Merchant Identity and RBAC

Role Permissions Description
Merchant Admin Full access to merchant's resources Account owner, manages users and settings
Merchant Operator Initiate transactions, view reports Day-to-day operational access
Merchant Viewer Read-only access Reporting and audit access
Merchant Developer Sandbox access, API key management Integration and testing

9.5 Integration with KrakenD

Merchant Request → Cloudflare → KrakenD → ControlPlane.com (token validation)
                                    ↓
                              Valid JWT with claims:
                              - merchant_id
                              - roles[]
                              - products[]
                              - rate_limit_tier
                                    ↓
                              Backend Service (receives validated claims as headers)

9.6 Policy-as-Code

  • Access policies MUST be defined as code and version-controlled
  • Policy changes MUST go through the same review process as code changes
  • Policies MUST be testable (unit tests for policy logic)
  • ControlPlane.com policies define: who can access what resources, from which networks, at which times

10. Data Infrastructure

10.1 Overview

Technology Role Current State Target State
RDS MySQL Primary transactional database Shared single instance, Multi-AZ Per-service instances, read replicas, automated backups
SurrealDB New service database Not deployed Clustered deployment for new Go services
ElastiCache Redis Caching and session store Single shared cluster Cluster mode enabled, per-service namespacing
NSQ Message queue Not deployed (Kafka currently) Replace Kafka for inter-service messaging
Meilisearch Merchant-facing search Not deployed Merchant/transaction search in portal
OpenSearch Log storage and search Not deployed Log aggregation, Jaeger trace storage

10.2 RDS MySQL (Existing)

Aspect Current Target Priority
Instances 1 shared instance Per-service instances (minimum: separate Pay-Ins, Pay-Outs, Remittances, Cards) CRITICAL
Multi-AZ Yes Yes (maintained)
Read replicas None 1 per service instance (reporting queries) HIGH
Backups TBC Automated daily, 35-day retention, point-in-time recovery CRITICAL
Encryption at rest TBC AES-256 (AWS KMS managed key) CRITICAL
Encryption in transit TBC TLS mandatory for all connections CRITICAL
Version TBC MySQL 8.0+ (latest stable) MEDIUM
Monitoring CloudWatch Prometheus exporter → Grafana HIGH
Slow query log TBC Enabled, threshold 1s, exported to OpenSearch HIGH

10.3 SurrealDB (New Services)

Aspect Standard
Deployment Clustered (minimum 3 nodes for Prod)
Storage backend TiKV (distributed) or RocksDB (single-node for Dev/Test)
Backup Automated daily export, stored in R2
Access Namespace and database per service, scoped authentication
Schema Schemaful tables for payment data, schemafree for flexible data
Monitoring Prometheus metrics endpoint → Grafana

10.4 Redis (ElastiCache)

Aspect Current Target Priority
Mode Single cluster, no cluster mode Cluster mode enabled HIGH
Failover Multi-AZ with automatic failover Maintained
Namespacing None (shared keyspace) Prefix per service: payin:, payout:, remit:, cards: HIGH
Encryption TBC In-transit (TLS) and at-rest encryption HIGH
Eviction TBC allkeys-lru for caches, noeviction for session stores MEDIUM
Monitoring CloudWatch Prometheus exporter → Grafana HIGH
Backup TBC Daily snapshots, 7-day retention MEDIUM

10.5 NSQ (Messaging)

Aspect Standard
Deployment nsqlookupd (3 instances) + nsqd (per application host)
Topics One topic per event type: payment.initiated, payment.completed, payment.failed, callback.pending, etc.
Channels One channel per consumer group (e.g., payment.completed#notification, payment.completed#reconciliation)
Message retention In-memory with disk overflow; messages purged after successful consumption
Dead letter Failed messages after 5 retries → dead letter topic for manual investigation
Monitoring nsqadmin + Prometheus exporter → Grafana
Ordering Message ordering is not guaranteed and delivery is at-least-once; consumers MUST use idempotency keys so that duplicate deliveries have no additional effect
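
NSQ delivers each message at least once and does not guarantee ordering, so a consumer can receive the same event twice. The sketch below illustrates idempotent handling with an idempotency key. The `idempotency_key` field name and the class are hypothetical, and the in-memory set stands in for a shared store with expiry that a real deployment would need.

```python
class IdempotentHandler:
    """Wrap a message handler so duplicate deliveries are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: a production consumer would use a shared store with a TTL.
        self._seen = set()

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        key = message["idempotency_key"]  # hypothetical field name
        if key in self._seen:
            return False
        self._handler(message)
        # Record the key only after success so a failed attempt can be retried.
        self._seen.add(key)
        return True
```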

10.6 Meilisearch (Merchant-Facing Search)

Aspect Standard
Purpose Fast search in merchant portal (transactions, customers, reports)
Deployment Single instance per environment (evaluate clustering for Prod)
Indices transactions, merchants, customers, reports
Refresh strategy Near-real-time: primary write to MySQL/SurrealDB, async index update via NSQ
Security API key per merchant, tenant isolation via filterable attributes
Monitoring Health check endpoint + Prometheus metrics

10.7 OpenSearch (Logs and Traces)

Aspect Standard
Deployment 3 master nodes + 3 data nodes (Prod minimum)
Indices simpaisa-logs-*, simpaisa-jaeger-*, simpaisa-audit-*
ISM Policies Hot (7 days, SSD) → Warm (90 days, HDD) → Cold (7 years, S3/R2) → Delete
Retention Logs: 7 years (compliance), Traces: 90 days, Audit: 10 years
Security OpenSearch Security plugin, RBAC per index, TLS
Backup Snapshot to S3/R2, daily
Monitoring Built-in performance analyser + Prometheus exporter

11. Secret Management

11.1 Current State

Aspect Detail
Tool AWS Systems Manager Parameter Store (SecureString)
Encryption AWS KMS managed keys
Access IAM role-based
Rotation Manual
Audit CloudTrail

11.2 Target State

Aspect Detail Priority
Tool Evaluate: ControlPlane.com secrets, HashiCorp Vault, AWS Secrets Manager HIGH
Rotation Automated rotation for all secrets; maximum 90-day lifetime HIGH
Access Workload identity (no static credentials); secrets injected at runtime HIGH
Audit All secret access logged and alerted on anomalous patterns HIGH

11.3 Secret Policies

| Policy | Requirement |
| --- | --- |
| No secrets in code | NEVER commit secrets, tokens, keys, or passwords to source control |
| No secrets in config files | Configuration files MUST reference secret paths, not values |
| No secrets in environment variables | Prefer mounted secrets or secret injection; env vars can leak through process inspection (e.g. /proc/<pid>/environ), crash dumps, and inherited child processes |
| No secrets in container images | Build-time secrets MUST use multi-stage builds with secret mounts |
| Secret scanning in CI | Every commit MUST be scanned for secret patterns (pre-commit hook + CI step) |
| Rotation on compromise | If a secret is suspected compromised, rotate immediately (< 1 hour) |
| Shared secrets | NEVER share secrets between environments; each environment has its own |
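
The secret-scanning policy can be sketched as a small pattern scan. The three patterns and their names below are illustrative assumptions, not the rule set of any particular scanner; a dedicated scanning tool would cover far more cases.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; names and coverage are assumptions.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "hardcoded_password": re.compile(
        r"(?i)\bpassword\s*[:=]\s*['\"][^'\"]+['\"]"
    ),
}


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, pattern_name) for every suspected secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


sample = (
    'db_host = "10.0.0.5"\n'
    'password = "example-only"\n'
    'password_path = "/prod/pay-in/db"\n'
)
findings = scan_text(sample)
```

The literal password on line 2 is flagged, while the path reference on line 3 is not, matching the rule that configuration should reference secret paths rather than values.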

11.4 Secret Categories and Rotation

| Category | Examples | Max Lifetime | Rotation Method |
| --- | --- | --- | --- |
| Database credentials | MySQL, SurrealDB, Redis passwords | 90 days | Automated (dual-user pattern) |
| API keys | Payment channel API keys, merchant API keys | 365 days | Merchant-initiated or scheduled |
| TLS certificates | Service certificates, mTLS certs | 90 days (target: 24 hours via ControlPlane) | Automated |
| Signing keys | RSA keys for Pay-Outs/Remittances | 365 days | Coordinated rotation with merchants |
| OAuth tokens | Service-to-service tokens | 1 hour | Automatic refresh |
| Encryption keys | KMS keys, data encryption keys | Annual rotation | AWS KMS automatic rotation |
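
A rotation check against these maximum lifetimes could look like the sketch below. The category identifiers and the function name are hypothetical; only the lifetimes come from the table.

```python
from datetime import datetime, timedelta

# Maximum lifetimes from the table above; the dictionary keys are
# hypothetical identifiers chosen for this sketch.
MAX_LIFETIME = {
    "database_credential": timedelta(days=90),
    "api_key": timedelta(days=365),
    "tls_certificate": timedelta(days=90),
    "signing_key": timedelta(days=365),
    "oauth_token": timedelta(hours=1),
    "encryption_key": timedelta(days=365),
}


def rotation_due(category: str, issued_at: datetime, now: datetime) -> bool:
    """True when a secret has reached the maximum lifetime for its category."""
    return now - issued_at >= MAX_LIFETIME[category]


now = datetime(2026, 4, 3)
old_db_password = rotation_due("database_credential", datetime(2026, 1, 1), now)
recent_db_password = rotation_due("database_credential", datetime(2026, 3, 1), now)
```

A scheduled job could run a check like this over the secret inventory and raise an alert, or trigger automated rotation, before a secret exceeds its lifetime.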

12. Disaster Recovery & Business Continuity

12.1 Service Tier Classification

| Tier | Services | RPO | RTO | Description |
| --- | --- | --- | --- | --- |
| Tier 1: Payment Critical | Pay-In processing, Pay-Out execution, Card auth, Remittance processing | 0 (zero data loss) | < 5 minutes | Direct revenue impact; customer-facing payment flows |
| Tier 2: Merchant Facing | API Gateway, Merchant Portal, Sandbox | < 5 minutes | < 15 minutes | Merchant experience; no direct payment loss |
| Tier 3: Operational | Reporting, reconciliation, settlement, back-office | < 1 hour | < 4 hours | Internal operations; deferred processing acceptable |
| Tier 4: Supporting | Developer portal, corporate website, analytics | < 24 hours | < 24 hours | No operational impact |
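
The tier targets lend themselves to an automated check of measured results, for example after a DR exercise. This is a sketch only; the tier identifiers and function name are assumptions.

```python
from datetime import timedelta

# (RPO target, RTO target) per tier, from the table above.
TIER_TARGETS = {
    "tier-1": (timedelta(0), timedelta(minutes=5)),
    "tier-2": (timedelta(minutes=5), timedelta(minutes=15)),
    "tier-3": (timedelta(hours=1), timedelta(hours=4)),
    "tier-4": (timedelta(hours=24), timedelta(hours=24)),
}


def meets_targets(tier: str, measured_rpo: timedelta, measured_rto: timedelta) -> bool:
    """Compare measured RPO/RTO from an exercise with the tier targets."""
    rpo_target, rto_target = TIER_TARGETS[tier]
    if rpo_target == timedelta(0):
        # Tier 1 allows no data loss at all.
        rpo_ok = measured_rpo == timedelta(0)
    else:
        rpo_ok = measured_rpo < rpo_target
    return rpo_ok and measured_rto < rto_target
```

For example, a Tier 1 exercise that recovers in three minutes but loses two seconds of data fails the check, because the Tier 1 RPO is zero.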

12.2 Backup Strategy

| Resource | Backup Method | Frequency | Retention | Testing |
| --- | --- | --- | --- | --- |
| RDS MySQL | Automated snapshots + binlog replication | Continuous (point-in-time) | 35 days | Monthly restore test |
| SurrealDB | Export + snapshot | Daily | 35 days | Monthly restore test |
| Redis | AOF + RDB snapshots | Hourly (RDB), continuous (AOF) | 7 days | Weekly restore test |
| OpenSearch | Snapshot to S3/R2 | Daily | 90 days (snapshots) | Quarterly restore test |
| KrakenD config | Git repository | Every change | Indefinite (Git history) | On every deployment |
| IaC state | Remote state backend + versioning | Every change | Indefinite | On every deployment |
| Secrets | AWS backup + encrypted export | Daily | 35 days | Quarterly |
| R2/S3 objects | Cross-region replication | Continuous | Per retention policy | Quarterly |

12.3 Current: Multi-AZ

| Component | Multi-AZ Status | Failover |
| --- | --- | --- |
| EC2/ASG | Yes (instances spread across AZs) | Automatic (ASG replaces failed instances) |
| ALB | Yes (cross-AZ load balancing) | Automatic |
| RDS | Yes (standby in different AZ) | Automatic failover (< 2 minutes) |
| ElastiCache | Yes (replica in different AZ) | Automatic failover |
| NAT Gateway | One per AZ | Route table failover needed |

12.4 Target: Multi-Region

| Phase | Scope | Timeline |
| --- | --- | --- |
| Phase 1 | Document current DR posture, define RPO/RTO, create runbooks | Q2 2026 |
| Phase 2 | Cross-region backup replication (S3/R2), read replicas in secondary region | Q3 2026 |
| Phase 3 | Active-passive multi-region for Tier 1 services | Q4 2026 |
| Phase 4 | Active-active multi-region (evaluate need based on jurisdiction requirements) | 2027 |

12.5 Failover Procedures

| Scenario | Detection | Response | Recovery |
| --- | --- | --- | --- |
| Single instance failure | ASG health check (30s) | ASG launches replacement | Automatic (< 5 min) |
| AZ failure | ALB health checks + CloudWatch | Traffic shifts to healthy AZs | Automatic (< 5 min) |
| RDS primary failure | RDS event + monitoring alert | Automatic failover to standby | Automatic (< 2 min) |
| Redis primary failure | ElastiCache failover | Automatic promotion of replica | Automatic (< 1 min) |
| Payment channel outage | Health check failure (3 consecutive) | Disable channel, notify merchants | Manual channel re-enable after verification |
| Region failure | Multi-region health check | DNS failover to secondary region | Manual (Phase 1) → Automatic (Phase 3) |
| Cloudflare incident | External monitoring | Evaluate: bypass to ALB direct (emergency only) | Manual |
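
The payment channel rule (disable after three consecutive health check failures, re-enable manually) can be sketched as a small state machine. The class and method names are hypothetical.

```python
class ChannelHealth:
    """Tracks consecutive health check failures for one payment channel."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, healthy: bool) -> bool:
        """Record one health check result; return whether the channel is enabled."""
        if healthy:
            # A success resets the counter but does NOT re-enable a disabled
            # channel: re-enabling is a manual step after verification.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.enabled = False
        return self.enabled

    def re_enable(self) -> None:
        """Manual re-enable by an operator once the channel is verified."""
        self.consecutive_failures = 0
        self.enabled = True
```

Requiring consecutive failures avoids disabling a channel on a single transient error, and keeping re-enable manual matches the Recovery column for this scenario.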

12.6 DR Testing Cadence

| Test Type | Frequency | Scope | Owner |
| --- | --- | --- | --- |
| Backup restore | Monthly | Restore latest backup to Test environment | Engineering |
| AZ failover | Quarterly | Simulate AZ failure, verify continued operation | Engineering + Operations |
| Full DR exercise | Twice a year | Full failover simulation, measure actual RTO/RPO | CDO + Engineering |
| Tabletop exercise | Quarterly | Walk through failure scenarios with all stakeholders | CDO |
| Chaos engineering | Monthly (target) | Controlled failure injection in Test/Prod | Engineering |

12.7 Runbooks

The following runbooks MUST be created, tested, and maintained:

| Runbook | Status |
| --- | --- |
| RDS failover procedure | TO CREATE |
| Redis cluster failover | TO CREATE |
| Payment channel outage response | TO CREATE |
| Full region failover | TO CREATE |
| KrakenD configuration rollback | TO CREATE |
| Cloudflare bypass (emergency) | TO CREATE |
| Data corruption recovery | TO CREATE |
| DDoS attack response | TO CREATE |
| Certificate emergency rotation | TO CREATE |
| Merchant communication during outage | TO CREATE |

13. Compliance Infrastructure Requirements

This section documents the infrastructure controls required by regulators in each jurisdiction where Simpaisa operates. Compliance is not optional — failure to meet these requirements risks licence revocation.

Note: Regulatory requirements are subject to change. This section MUST be reviewed quarterly and updated when new circulars or regulations are issued.

13.1 Pakistan — State Bank of Pakistan (SBP)

Governing Legislation:

  • Payment Systems and Electronic Fund Transfers Act, 2007 (PSEFT Act)
  • Rules for Payment System Operators and Payment Service Providers, 2014 (PSO/PSP Rules)
  • Electronic Fund Transfer Regulations
  • Personal Data Protection Bill, 2023 (pending enactment; draft approved by Federal Cabinet)

Infrastructure Requirements:

| Requirement | Regulation Source | Infrastructure Control | Current Status | Gap | Priority |
| --- | --- | --- | --- | --- | --- |
| Data localisation | PSO/PSP Rules 2014, PDPB 2023 (draft) | Processing systems MUST be located within Pakistan; critical personal data stored on servers in Pakistan | ASSESS: AWS has no Pakistan region; verify where processing occurs and whether a Pakistan-based DC is required | TBC | CRITICAL |
| Technology platform approval | PSO/PSP Rules 2014 | Prior SBP approval required for changes to technology platforms | ASSESS: Determine if current changes require approval | TBC | CRITICAL |
| Transaction record retention | PSO/PSP Rules 2014 | All transaction records retained for minimum 5 years (10 years recommended) | ASSESS | Log retention policy needed | HIGH |
| Information security | PSO/PSP Rules 2014 | Appropriate measures for security, integrity, and confidentiality of financial transactions | PARTIAL: AWS infrastructure sound, but gaps in observability and access control | Strengthen controls | HIGH |
| Risk management | PSO/PSP Rules 2014 | Documented risk management framework for payment operations | ASSESS | Documentation needed | HIGH |
| Audit trail | PSO/PSP Rules 2014 | Complete audit trail of all transactions and system changes | PARTIAL: Transaction logs exist but no centralised audit system | Implement centralised audit logging | HIGH |
| Business continuity | PSO/PSP Rules 2014 | Documented BCP/DR plan, tested regularly | GAP: No DR documentation | Create and test DR plan | CRITICAL |
| Incident reporting | SBP circulars | Timely reporting of security incidents and system outages to SBP | ASSESS | Formalise incident reporting procedure | HIGH |
| AML/CFT systems | PSEFT Act, FATF requirements | Transaction monitoring, sanctions screening, STR filing | ASSESS | Verify integration with FMU reporting | HIGH |

Pakistan-Specific Notes:

  • AWS does not have a region in Pakistan. Simpaisa MUST verify with SBP whether AWS ap-south-1 (Mumbai) is acceptable, or whether co-location in a Pakistan-based data centre is required for certain data categories
  • The Personal Data Protection Bill 2023 introduces strict data localisation once enacted: "critical personal data shall only be processed in servers within Pakistan"
  • SBP requires prior approval for changes to technology platforms; the migration to ControlPlane.com, KrakenD, and other new technologies may require SBP notification/approval

13.2 Bangladesh — Bangladesh Bank

Governing Legislation:

  • Payment and Settlement Systems Act, 2024
  • Mobile Financial Services Regulations, 2022
  • Bangladesh Bank Payment Systems Department circulars
  • Bangladesh Financial Intelligence Unit (BFIU) guidelines

Infrastructure Requirements:

| Requirement | Regulation Source | Infrastructure Control | Current Status | Gap | Priority |
| --- | --- | --- | --- | --- | --- |
| Data localisation (mandatory) | MFS Regulations 2022, PSS Act 2024 | IT infrastructure and data centres MUST be located within Bangladesh; data localisation is mandatory | ASSESS: Verify hosting for Bangladesh operations | If not locally hosted, establish local DC or partner | CRITICAL |
| On-site inspection readiness | MFS Regulations 2022 | Bangladesh Bank conducts on-site inspections of IT infrastructure after setup | ASSESS | Ensure infrastructure meets inspection standards | CRITICAL |
| Biometric e-KYC | BFIU guidelines | Electronic KYC with biometric verification required | ASSESS | Integration with national ID system needed | HIGH |
| AML/CFT compliance | BFIU guidelines | Suspicious Transaction Report filing, transaction monitoring | ASSESS | Verify STR filing integration | HIGH |
| Two-phase licensing | MFS Regulations 2022 | Phase 1: NOC to set up infrastructure; Phase 2: licence to operate | ASSESS: Verify current licence status | Follow licensing process | HIGH |
| Transaction reporting | Bangladesh Bank circulars | Regular transaction reports to Bangladesh Bank PSD | ASSESS | Automated reporting needed | HIGH |
| Capital adequacy | MFS Regulations 2022 | Minimum paid-up capital BDT 450 million for MFS (bank-led model) | ASSESS | Verify capital structure | MEDIUM |

Bangladesh-Specific Notes:

  • Data localisation is non-negotiable in Bangladesh; on-site infrastructure inspection is conducted by Bangladesh Bank
  • Two-phase licensing means infrastructure MUST be built before the operational licence is granted
  • BFIU compliance is separate from Bangladesh Bank payment licensing and adds infrastructure requirements for transaction monitoring

13.3 Nepal — Nepal Rastra Bank (NRB)

Governing Legislation:

  • Payment and Settlement Act, 2019 (2075 BS)
  • NRB PSO/PSP licensing directives
  • Data Center and Cloud Services (Operation and Management) Directive, 2081 (2024)
  • NRB Cyber Resilience Guidelines
  • NRB IT Guidelines

Infrastructure Requirements:

| Requirement | Regulation Source | Infrastructure Control | Current Status | Gap | Priority |
| --- | --- | --- | --- | --- | --- |
| Data centre approval | Data Center Directive 2081 | Data MUST be stored in centres approved by Nepal's IT Department; centres MUST comply with the Directive | ASSESS | Identify approved data centres in Nepal | CRITICAL |
| PCI DSS compliance | NRB IT Guidelines | Licensed institutions MUST adhere to PCI DSS standards | ASSESS | PCI DSS certification required | CRITICAL |
| ISO 27001 certification | NRB IT Guidelines | Financial institutions involved in payment processing require ISO 27001 certification | ASSESS | ISO 27001 audit and certification needed | HIGH |
| Cyber resilience | NRB Cyber Resilience Guidelines | Governance, cyber risk culture, training, resilience testing, recovery planning | ASSESS | Formalise cyber resilience programme | HIGH |
| EMV compliance | NRB IT Guidelines | EMV and EMV Contactless standards compliance for card processing | ASSESS | Verify EMV compliance for Cards product | HIGH |
| Licensing requirements | Payment and Settlement Act 2019 | Prior NRB approval/licence for PSO/PSP operations; 12-18 month process | ASSESS | Verify licence status | CRITICAL |
| Capital requirements | NRB directives | NPR 150M (domestic PSP) / NPR 250M (foreign investment PSP) | ASSESS | Verify capital compliance | MEDIUM |
| Technical assessment | NRB licensing | NRB assesses system security, reliability, and technical standards compliance | ASSESS | Prepare for technical assessment | HIGH |

Nepal-Specific Notes:

  • Nepal has explicit data centre approval requirements: data MUST reside in government-approved centres within Nepal
  • PCI DSS and ISO 27001 are explicitly mandated (not merely recommended) for payment processors
  • The 12-18 month licensing timeline means infrastructure investment precedes revenue

13.4 Iraq — Central Bank of Iraq (CBI)

Governing Legislation:

  • Electronic Payment Services Regulation, 2024 (replaced 2014 framework)
  • Central Bank of Iraq circulars on digital banking and payment systems
  • AML/CFT regulations (aligned with FATF recommendations)

Infrastructure Requirements:

| Requirement | Regulation Source | Infrastructure Control | Current Status | Gap | Priority |
| --- | --- | --- | --- | --- | --- |
| CBI licensing | Electronic Payment Services Regulation 2024 | Licence required from CBI for electronic payment services; 10-year licence validity | ASSESS | Verify licence status | CRITICAL |
| Minimum capital | Electronic Payment Services Regulation 2024 | Minimum IQD 10 billion company capital | ASSESS | Verify capital compliance | HIGH |
| Feasibility study | Electronic Payment Services Regulation 2024 | 3-year feasibility study required covering: economic projections, technical infrastructure, information security, AML systems, dispute resolution | ASSESS | Prepare or update feasibility study | HIGH |
| 5-year record retention | Electronic Payment Services Regulation 2024 | All electronic payment transactions and related data retained for minimum 5 years | ASSESS | Implement 5-year retention policy | HIGH |
| Cybersecurity infrastructure | Electronic Payment Services Regulation 2024, CBI circulars | Advanced cybersecurity measures to safeguard banking systems; compliance with international standards | ASSESS | Cybersecurity posture assessment needed | HIGH |
| AML/CFT systems | Electronic Payment Services Regulation 2024 | Sanctions list screening, transaction monitoring, daily transaction reporting | ASSESS | Verify AML system integration | CRITICAL |
| Business continuity | CBI circulars | Business continuity during crises; DR planning | ASSESS | DR plan required | HIGH |
| ISO 20022 alignment | CBI modernisation programme | Payment messaging aligned with ISO 20022 standard | ASSESS | Evaluate ISO 20022 readiness | MEDIUM |

Iraq-Specific Notes:

  • The 2024 regulation is a significant upgrade from the 2014 framework; verify full compliance with the new requirements
  • IQD 10 billion minimum capital (~USD 7.6M) is a substantial requirement
  • The 3-year feasibility study requirement includes detailed technical infrastructure and security documentation
  • Iraq's financial system is heavily influenced by US sanctions compliance (OFAC); additional sanctions screening infrastructure may be required

13.5 PCI DSS v4.0.1 (Cards Product)

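Standard: PCI DSS v4.0.1 (the only active version since v4.0 was retired on 31 December 2024; the future-dated requirements became mandatory on 31 March 2025)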

PCI DSS applies specifically to the Cards product (Visa/Mastercard acquiring). All systems that store, process, or transmit cardholder data are in scope.

| Requirement Area | PCI DSS Requirement | Infrastructure Control | Current Status | Gap | Priority |
| --- | --- | --- | --- | --- | --- |
| Network segmentation | Req 1: Install and maintain network security controls | CDE (Cardholder Data Environment) MUST be isolated in a dedicated subnet with strict firewall rules; micro-segmentation recommended | ASSESS | Verify CDE isolation | CRITICAL |
| Secure configuration | Req 2: Apply secure configurations to all system components | Hardened OS images, no default credentials, unnecessary services disabled | ASSESS | Configuration baseline needed | HIGH |
| Data protection (stored) | Req 3: Protect stored account data | PAN encrypted with AES-256; hash or truncate where possible; encryption keys managed separately from data | ASSESS | Verify encryption implementation | CRITICAL |
| Data protection (transit) | Req 4: Protect cardholder data with strong cryptography during transmission | TLS 1.2+ for all cardholder data transmission; no SSL or early TLS | PARTIAL: mTLS for Cards product | Verify all transmission paths | CRITICAL |
| Malware protection | Req 5: Protect all systems and networks from malicious software | Anti-malware on all CDE systems; regular scanning | ASSESS | Deploy and monitor | HIGH |
| Secure development | Req 6: Develop and maintain secure systems and software | Secure coding practices, vulnerability patching within 30 days (critical) | ASSESS | SDLC security review needed | HIGH |
| Access control | Req 7 & 8: Restrict access; identify users and authenticate | MFA mandatory for ALL CDE access (PCI DSS 4.0 requirement); role-based access; unique IDs | ASSESS | Implement MFA for all CDE access | CRITICAL |
| Physical security | Req 9: Restrict physical access to cardholder data | Physical access controls for CDE infrastructure (if on-premise) | N/A (cloud) | Cloud provider responsibility; verify AWS compliance | MEDIUM |
| Logging and monitoring | Req 10: Log and monitor all access to system components and cardholder data | All CDE access logged; logs tamper-evident; reviewed daily; retained 12 months (3 months immediately accessible) | ASSESS | Implement comprehensive CDE logging | CRITICAL |
| Vulnerability management | Req 11: Test security of systems and networks regularly | Internal vulnerability scan quarterly; external ASV scan quarterly; penetration test annually; segmentation test every six months | ASSESS | Establish scanning programme | CRITICAL |
| Organisational policies | Req 12: Support information security with organisational policies and programmes | Security policy, risk assessment, incident response plan, security awareness training | ASSESS | Formalise security programme | HIGH |

PCI DSS 4.0 New Requirements (Mandatory from March 2025):

| New Requirement | Description | Infrastructure Impact |
| --- | --- | --- |
| Targeted risk analysis | Documented risk analysis for each requirement that allows a flexible frequency, and for each control met through the customised approach | Risk analysis documentation for each affected CDE control |
| MFA everywhere | MFA for ALL access to CDE (not just remote) | Deploy MFA for console, SSH, application access to CDE |
| Authenticated vulnerability scanning | Internal scans must use authenticated scanning | Scanning tools need credentials for CDE systems |
| Automated log review | Automated mechanisms to detect security events | SIEM/OpenSearch with automated alerting rules for CDE |
| Web application firewall | WAF or equivalent for public-facing web applications | Cloudflare WAF / KrakenD for card payment endpoints |
| Script management | Inventory and integrity of payment page scripts | CSP headers, SRI, script inventory for card entry pages |
| Enhanced encryption | Disk-level encryption alone is insufficient | Application-level encryption for stored PAN |
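
For the script management row, a Subresource Integrity (SRI) value is the base64-encoded digest of the script content, prefixed with the hash algorithm. The sketch below computes one with SHA-384; the function name is hypothetical and the script content is a placeholder.

```python
import base64
import hashlib


def sri_hash(script_bytes: bytes) -> str:
    """Return an SRI integrity value (sha384) for the given script content."""
    digest = hashlib.sha384(script_bytes).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


# Placeholder content standing in for a payment page script.
integrity = sri_hash(b"console.log('checkout');")
```

The resulting value goes in the `integrity` attribute of the `<script>` tag, so the browser refuses to run the script if its content changes. Recording the value for each script in the inventory also gives a simple way to detect unauthorised changes.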

PCI DSS Scoping Notes:

  • CDE MUST be clearly defined and documented
  • All systems connected to or that could impact the CDE are in scope
  • Network segmentation reduces scope and is strongly recommended
  • Cloudflare and KrakenD processing card data brings them into scope
  • Annual PCI DSS assessment (SAQ or ROC depending on transaction volume)

13.6 Compliance Summary Matrix

| Jurisdiction | Data Localisation | Incident Reporting SLA | Record Retention | Licensing Status | PCI DSS Required |
| --- | --- | --- | --- | --- | --- |
| Pakistan | Required (processing in-country; PDPB 2023 pending) | TBC (SBP circulars) | 5+ years | VERIFY | Yes (Cards) |
| Bangladesh | Mandatory (DC inspection by Bangladesh Bank) | TBC | TBC | VERIFY | TBC |
| Nepal | Mandatory (govt-approved DC only) | TBC | TBC | VERIFY | Mandatory (NRB directive) |
| Iraq | TBC (new 2024 regulation) | TBC | 5 years (minimum) | VERIFY | TBC |
| PCI DSS | N/A | 72 hours (breach notification) | 12 months (3 months immediately accessible) | N/A | Yes (Cards) |

13.7 Compliance Remediation Priorities

| Priority | Action | Jurisdictions | Timeline |
| --- | --- | --- | --- |
| 1 | Verify all current licence and authorisation statuses | All | Immediate |
| 2 | Data localisation assessment: where is data stored/processed for each jurisdiction? | PK, BD, NP | Q2 2026 |
| 3 | PCI DSS v4.0.1 gap assessment for Cards product | Global | Q2 2026 |
| 4 | Implement 2-hour incident reporting capability (best practice across all markets) | All | Q2 2026 |
| 5 | Formalise record retention policies meeting all jurisdictional minimums | All | Q2 2026 |
| 6 | DR/BCP documentation and testing | All (regulatory requirement in most jurisdictions) | Q2-Q3 2026 |
| 7 | AML/CFT system verification across all jurisdictions | All | Q3 2026 |
| 8 | ISO 27001 certification (required for Nepal, beneficial for all) | NP (mandatory), all | Q3-Q4 2026 |
| 9 | Prepare for Pakistan PDPB enactment | PK | Q3 2026 |
| 10 | Iraq 2024 regulation full compliance assessment | IQ | Q3 2026 |
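
Priority 5 amounts to taking the longest minimum across all applicable regimes while tracking which ones are still unconfirmed. A minimal sketch, using the retention figures from the summary matrix in 13.6 (`None` stands for TBC; the names are hypothetical):

```python
from typing import Dict, List, Optional, Tuple

# Minimum record retention in years, from the 13.6 summary matrix.
# None marks a value still to be confirmed (TBC).
MINIMUM_RETENTION_YEARS: Dict[str, Optional[int]] = {
    "PK": 5,
    "BD": None,
    "NP": None,
    "IQ": 5,
    "PCI DSS": 1,
}


def required_retention(minimums: Dict[str, Optional[int]]) -> Tuple[int, List[str]]:
    """Return the longest known minimum and the regimes still unconfirmed."""
    known = [years for years in minimums.values() if years is not None]
    unresolved = sorted(name for name, years in minimums.items() if years is None)
    return max(known), unresolved


floor_years, unresolved = required_retention(MINIMUM_RETENTION_YEARS)
```

With the current figures the known floor is five years, and Bangladesh and Nepal remain unresolved, so the policy cannot be finalised until those values are confirmed.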

14. Infrastructure as Code

14.1 Tool Selection (TBC)

| Tool | Pros | Cons | Recommendation |
| --- | --- | --- | --- |
| Terraform | Industry standard, large ecosystem, HCL is declarative, multi-cloud | State management complexity, HCL learning curve, BSL licence (OpenTofu as alternative) | Evaluate |
| Pulumi | Real programming languages (Go, TypeScript), strong typing, testing | Smaller ecosystem, less community content, state management similar to Terraform | Evaluate (strong fit with Go stack) |
| AWS CDK | Native AWS integration, TypeScript/Go support | AWS-only (not multi-cloud), CloudFormation under the hood | Lower priority (multi-cloud needed for Cloudflare) |
| OpenTofu | Terraform-compatible, open source (MPL 2.0) | Younger project, smaller team | Evaluate (if Terraform BSL is a concern) |

Decision required: IaC tool selection is TBC. Recommendation: evaluate Pulumi (Go alignment) and Terraform/OpenTofu (ecosystem breadth) in a spike. Whichever tool is chosen, the standards below apply.

14.2 Repository Structure

infrastructure/
├── modules/                    # Reusable modules
│   ├── vpc/                    # VPC, subnets, NAT, security groups
│   ├── compute/                # EC2/containers, ASG, ALB
│   ├── database/               # RDS, SurrealDB, ElastiCache
│   ├── observability/          # OpenSearch, Grafana, Jaeger, OTel Collector
│   ├── gateway/                # KrakenD deployment
│   ├── cloudflare/             # DNS, WAF, Workers, Pages, R2
│   └── security/               # WAF rules, security groups, KMS
├── environments/
│   ├── sandbox/                # Sandbox environment configuration
│   ├── dev/                    # Dev environment configuration
│   ├── test/                   # Test environment configuration
│   └── prod/                   # Prod environment configuration
├── policies/                   # OPA/Sentinel policies for compliance
└── README.md

14.3 Module Design Principles

  • One module per concern: VPC, compute, database, observability are separate modules
  • Inputs validated: All module inputs MUST have type constraints and validation rules
  • Outputs explicit: Modules MUST export IDs, ARNs, endpoints needed by dependent modules
  • No hardcoded values: All environment-specific values passed as variables
  • Tagging enforced: Every resource MUST be tagged (see Cost Management section)
  • Documentation: Every module MUST have a README with inputs, outputs, and examples

14.4 State Management

| Requirement | Standard |
| --- | --- |
| Remote state | S3 bucket (encrypted, versioned) + DynamoDB table (locking) |
| State per environment | Separate state file per environment (never shared) |
| State locking | Mandatory: prevents concurrent modifications |
| State encryption | AES-256 encryption at rest |
| State access | Restricted to CI/CD pipeline service account and designated operators |
| State backup | S3 versioning provides history; cross-region replication for DR |

14.5 Drift Detection

  • Drift detection MUST run daily on all environments
  • Drift detection MUST run before every deployment
  • Any detected drift MUST be reported as a P2 alert
  • Drift MUST be resolved before the next planned deployment
  • Unplanned manual changes to infrastructure are prohibited
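
Conceptually, drift detection compares the attributes declared in code with those observed on the live resource. The sketch below shows that comparison on plain dictionaries. It is tool-agnostic and purely illustrative; in practice the chosen IaC tool's plan or preview output would supply the comparison.

```python
from typing import Any, Dict, Tuple

ABSENT = "<absent>"  # marker for an attribute missing on one side


def detect_drift(
    declared: Dict[str, Any], observed: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (declared, observed)} for every mismatch."""
    drift = {}
    for key in sorted(set(declared) | set(observed)):
        want = declared.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical resource attributes for illustration.
declared = {"instance_type": "m5.large", "multi_az": True}
observed = {"instance_type": "m5.xlarge", "multi_az": True, "public": True}
drift = detect_drift(declared, observed)
```

Any non-empty result would raise the P2 alert described above and block the next planned deployment until resolved.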

15. CI/CD Pipeline Standards

Jenkins will NOT be used. CI/CD tool is TBC. The standards below are tool-agnostic.

15.1 Tool Evaluation

| Tool | Pros | Cons | Status |
| --- | --- | --- | --- |
| Bitbucket Pipelines | Native Bitbucket integration, simple YAML config | Limited compute, caching limitations | Evaluate (Simpaisa uses Bitbucket) |
| Dagger | Containerised pipelines, language-native (Go SDK), portable | Newer, smaller community | Evaluate (strong fit with Go + AI SDLC) |
| Buildkite | Fast, self-hosted agents, YAML config, scalable | Requires agent infrastructure | Evaluate |
| Woodpecker CI | Open source, Drone-compatible, container-native | Smaller community | Evaluate |

15.2 Pipeline Stages

┌──────┐   ┌──────┐   ┌───────┐   ┌───────────────┐   ┌────────┐   ┌────────┐
│ Lint │ → │ Test │ → │ Build │ → │ Security Scan │ → │ Deploy │ → │ Verify │
└──────┘   └──────┘   └───────┘   └───────────────┘   └────────┘   └────────┘

| Stage | Activities | Failure Action |
| --- | --- | --- |
| Lint | Code formatting, linting, static analysis | Block: fix before proceeding |
| Test | Unit tests, integration tests (with coverage) | Block: tests must pass |
| Build | Compile, build container image, generate artefacts | Block: build must succeed |
| Security Scan | Dependency vulnerability scan, SAST, secret scanning, container scan | Block if critical/high findings |
| Deploy | Deploy to target environment (blue/green or canary) | Automatic rollback on failure |
| Verify | Smoke tests, health checks, synthetic transactions | Automatic rollback if verification fails |

15.3 Quality Gates

| Gate | Requirement | Blocks Deployment? |
| --- | --- | --- |
| Code coverage | Minimum 80% for new code, 60% overall | Yes |
| Security vulnerabilities | Zero critical, zero high (for Prod) | Yes (Prod), Warning (Dev/Test) |
| Secret scanning | No secrets detected in code or config | Yes (all environments) |
| Dependency vulnerabilities | No known critical CVEs in dependencies | Yes (Prod) |
| Container scan | No critical vulnerabilities in container image | Yes (Prod) |
| Performance regression | No P95 latency regression > 10% (payment paths) | Yes (Prod) |
| API contract | OpenAPI spec validation passes | Yes (all environments) |
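
Several of these gates reduce to simple threshold checks. The sketch below evaluates a subset of them (coverage, vulnerabilities, secrets, P95 regression). The `BuildMetrics` fields and the function name are hypothetical, and how each figure is collected depends on the CI tool eventually chosen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildMetrics:
    new_code_coverage: float   # percent
    overall_coverage: float    # percent
    critical_vulns: int
    high_vulns: int
    secrets_found: int
    p95_baseline_ms: float
    p95_candidate_ms: float


def blocking_failures(m: BuildMetrics, environment: str) -> List[str]:
    """Return the names of gates that block deployment to the environment."""
    failures = []
    if m.new_code_coverage < 80 or m.overall_coverage < 60:
        failures.append("code-coverage")
    if m.secrets_found > 0:
        failures.append("secret-scanning")
    if environment == "prod":
        # These gates only warn outside Prod, per the table above.
        if m.critical_vulns > 0 or m.high_vulns > 0:
            failures.append("security-vulnerabilities")
        regression = (m.p95_candidate_ms - m.p95_baseline_ms) / m.p95_baseline_ms
        if regression > 0.10:
            failures.append("performance-regression")
    return failures


metrics = BuildMetrics(85.0, 64.0, 0, 1, 0, 100.0, 111.0)
```

With these sample figures, the build passes for Dev but is blocked for Prod by one high-severity vulnerability and an 11% P95 regression.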

15.4 Artefact Management

| Artefact | Storage | Retention | Naming |
| --- | --- | --- | --- |
| Container images | Container registry (evaluate: ECR, Cloudflare Container Registry, or self-hosted) | 90 days for non-production tags, indefinite for production tags | <service>:<git-sha>-<build-number> |
| Go binaries | R2/S3 artefact bucket | 90 days | <service>-<version>-<os>-<arch> |
| IaC plans | R2/S3 artefact bucket | 365 days | <environment>-<timestamp>-<git-sha>.plan |
| Test reports | R2/S3 artefact bucket | 365 days | <service>-<timestamp>-test-report.xml |

15.5 Deployment Automation

| Requirement | Standard |
| --- | --- |
| No manual deployments | All deployments MUST go through the CI/CD pipeline |
| Reproducible | Same artefact deployed to all environments (configuration differs, not code) |
| Auditable | Every deployment logged: who triggered, what version, when, which environment |
| Rollback | One-click rollback to previous version (< 5 minutes) |
| Deployment windows | Prod deployments during business hours (UTC+5) unless emergency |
| Feature flags | Use PostHog feature flags for gradual rollout, not deployment gating |

16. Cost Management

16.1 Tagging Strategy

All AWS and Cloudflare resources MUST have the following tags:

| Tag Key | Description | Example Values | Required |
| --- | --- | --- | --- |
| Environment | Deployment environment | sandbox, dev, test, prod | Yes |
| Service | Service name | pay-in-service, krakend, grafana | Yes |
| Product | Product line | pay-ins, pay-outs, remittances, cards, platform | Yes |
| Owner | Team or individual responsible | engineering, platform, security | Yes |
| CostCentre | Financial cost centre | TECH-001, SEC-001 | Yes |
| ManagedBy | IaC tool or manual | terraform, pulumi, manual | Yes |
| Criticality | Service tier | tier-1, tier-2, tier-3, tier-4 | Yes |
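
Tag enforcement can be expressed as a policy check that runs in CI before any resource is created. The sketch below assumes that the example values for Environment, Product, and Criticality are exhaustive; the table does not state this, so treat those sets as assumptions.

```python
from typing import Dict, List

REQUIRED_TAGS = (
    "Environment", "Service", "Product", "Owner",
    "CostCentre", "ManagedBy", "Criticality",
)

# Assumed to be closed sets, based on the example values in the table.
ALLOWED_VALUES = {
    "Environment": {"sandbox", "dev", "test", "prod"},
    "Product": {"pay-ins", "pay-outs", "remittances", "cards", "platform"},
    "Criticality": {"tier-1", "tier-2", "tier-3", "tier-4"},
}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a list of human-readable tagging violations for one resource."""
    violations = []
    for key in REQUIRED_TAGS:
        if not tags.get(key):
            violations.append(f"missing tag: {key}")
        elif key in ALLOWED_VALUES and tags[key] not in ALLOWED_VALUES[key]:
            violations.append(f"invalid value for {key}: {tags[key]}")
    return violations


resource_tags = {
    "Environment": "production",  # invalid: the expected value is "prod"
    "Service": "krakend",
    "Product": "platform",
    "Owner": "platform",
    "CostCentre": "TECH-001",
    "ManagedBy": "terraform",
}
violations = tag_violations(resource_tags)
```

Here the check reports the invalid Environment value and the missing Criticality tag. The same rules could later be ported to the OPA/Sentinel policies kept in the repository's `policies/` directory.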

16.2 Budget Alerts

| Alert Level | Threshold | Notification | Action |
| --- | --- | --- | --- |
| Info | 50% of monthly budget | Email to engineering lead | Review spending trend |
| Warning | 75% of monthly budget | Slack notification to engineering | Investigate and optimise |
| Critical | 90% of monthly budget | SMS to CDO + engineering lead | Immediate cost review |
| Breach | 100% of monthly budget | Call to CDO | Emergency cost reduction |
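
The mapping from spend to alert level is a descending threshold lookup, sketched below. The function name is hypothetical; the thresholds are taken from the table.

```python
from typing import Optional

# (fraction of monthly budget, alert level), highest threshold first.
THRESHOLDS = [
    (1.00, "Breach"),
    (0.90, "Critical"),
    (0.75, "Warning"),
    (0.50, "Info"),
]


def alert_level(spend: float, monthly_budget: float) -> Optional[str]:
    """Return the highest alert level reached, or None below 50%."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend / monthly_budget
    for threshold, level in THRESHOLDS:
        if ratio >= threshold:
            return level
    return None
```

Checking the highest threshold first means a single evaluation always returns the most severe applicable level, for example `alert_level(7600, 10000)` yields `"Warning"`.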

16.3 Reserved Capacity Planning

| Resource | Strategy | Review Cadence |
| --- | --- | --- |
| EC2 instances | Reserved Instances (1-year) for baseline, On-Demand for burst | Quarterly |
| RDS | Reserved Instances (1-year) for all Prod databases | Annually |
| ElastiCache | Reserved Nodes for Prod | Annually |
| Cloudflare | Enterprise plan (annual commitment) | Annually |
| Data transfer | Cloudflare reduces AWS egress; monitor and optimise | Monthly |

16.4 Cost Optimisation Reviews

| Review | Frequency | Owner | Focus |
| --- | --- | --- | --- |
| Resource utilisation | Monthly | Engineering | Right-sizing instances, identifying idle resources |
| Data transfer costs | Monthly | Engineering | Optimise cross-AZ and internet egress traffic |
| Reserved Instance coverage | Quarterly | CDO + Engineering | Ensure RI coverage matches usage |
| Architecture cost review | Quarterly | CDO | Evaluate architectural changes for cost impact |
| Vendor negotiation | Annually | CDO | AWS, Cloudflare, ControlPlane.com contract review |

17. Migration Roadmap

Phase Overview

Phase 1          Phase 2        Phase 3         Phase 4          Phase 5           Phase 6
Observability    Edge           Gateway         Identity         Compute           Data
Q2 2026          Q2-Q3 2026     Q3 2026         Q3-Q4 2026       Q4 2026-Q1 2027   2027
─────────────────────────────────────────────────────────────────────────────────────────►

Phase 1: Observability (Replace CloudWatch)

Timeline: Q2 2026
Priority: CRITICAL (prerequisite for all other phases)

| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Deploy OpenTelemetry Collector | Central telemetry pipeline (gateway mode) | None | 1 week |
| Deploy Grafana | Dashboards and alerting | None | 1 week |
| Deploy Prometheus | Metrics storage | Grafana | 1 week |
| Deploy Jaeger (or Tempo) | Distributed tracing | OpenSearch (for storage) | 1 week |
| Deploy OpenSearch | Log aggregation and trace storage | None | 2 weeks |
| Instrument existing services | Add OTel SDK to Spring Boot services | OTel Collector | 2-3 weeks |
| Build dashboards | Per-product, per-channel, infrastructure, SLA | Grafana + data flowing | 2 weeks |
| Configure alerting | Alert rules for all P1/P2 scenarios | Grafana | 1 week |
| Decommission CloudWatch dependency | Remove CloudWatch alarms, switch to Grafana | All above complete | 1 week |
| Deploy PostHog | Product analytics | None | 1 week |

Success Criteria:

  • All services emit traces, metrics, and structured logs via OTel
  • End-to-end transaction tracing works for all products
  • Grafana dashboards operational for all products
  • Alerting functional with correct escalation paths
  • CloudWatch no longer primary monitoring tool

Phase 2: Edge (Cloudflare)

Timeline: Q2-Q3 2026
Priority: HIGH

| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Migrate DNS to Cloudflare | Authoritative DNS for all domains | None | 1 week |
| Enable Cloudflare CDN | Cache static assets, configure cache rules | DNS migration | 1 week |
| Configure Cloudflare WAF | Payment API protection rules | DNS migration | 1 week |
| Deploy Cloudflare Workers | Geo-routing, rate limiting, header injection | DNS migration | 2 weeks |
| Migrate static sites to Pages | Corporate site, developer portal | DNS migration | 2 weeks |
| Configure R2 buckets | Merchant reports, transaction receipts | None | 1 week |
| Implement Authenticated Origin Pulls | Secure Cloudflare-to-ALB connection | CDN enabled | 1 week |
| Configure bot management | Bot detection and challenge rules | WAF configured | 1 week |

Success Criteria:

  • All traffic routes through Cloudflare
  • WAF blocking malicious traffic
  • Static sites served from Cloudflare Pages
  • Origin servers only accessible from Cloudflare IPs
  • DDoS protection active

Phase 3: API Gateway (KrakenD)

Timeline: Q3 2026
Priority: CRITICAL

| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| Deploy KrakenD to Test | Initial deployment with basic configuration | Phase 1 (observability) | 1 week |
| Define API specifications | OpenAPI 3.1 specs for all endpoints | None | 2 weeks |
| Configure auth verification | JWT validation, API key verification | None | 1 week |
| Configure rate limiting | Per-merchant, per-product, per-endpoint limits | None | 1 week |
| Configure error standardisation | RFC 9457 error responses | None | 1 week |
| Deploy to Sandbox | Merchant-facing test environment | Test deployment stable | 1 week |
| Merchant migration (phased) | Migrate merchants to gateway-fronted endpoints | Sandbox proven | 4-6 weeks |
| Deploy to Prod | Production deployment with blue/green | Merchant migration tested | 1 week |

Success Criteria:

  • All API traffic routes through KrakenD
  • Rate limiting enforced per merchant
  • Auth verification at gateway level
  • Standardised error responses
  • OpenAPI validation rejecting malformed requests

Phase 4: Identity (ControlPlane.com)

Timeline: Q3-Q4 2026
Priority: HIGH

| Task | Description | Dependencies | Effort |
| --- | --- | --- | --- |
| ControlPlane.com setup | Account, organisation, initial configuration | None | 1 week |
| Workload identity | Migrate service-to-service auth from IAM to ControlPlane | Phase 3 (KrakenD) | 2-3 weeks |
| Merchant identity | Design merchant RBAC model | None | 1 week |
| KrakenD integration | JWT issuance and validation via ControlPlane | Phase 3 + workload identity | 2 weeks |
| SSO for internal tools | Grafana, OpenSearch, merchant portal via SSO | ControlPlane setup | 2 weeks |
| Policy-as-code | Define and test access policies | All above | 2 weeks |

Phase 5: Compute Modernisation

Timeline: Q4 2026 - Q1 2027 Priority: MEDIUM

| Task | Description | Dependencies | Effort |
|------|-------------|--------------|--------|
| Container platform selection | Evaluate ECS Fargate vs EKS vs ControlPlane.com | Phase 4 | 1-2 weeks |
| Deploy Caddy | Per-service reverse proxy with mTLS | Container platform | 2 weeks |
| First Go service | New service built in Go, deployed as container | Container platform | 4-6 weeks |
| Blue/green deployment | Implement for Tier 1 services | Container platform | 2 weeks |
| Canary deployment | Implement for API Gateway and payment initiation | Blue/green working | 2 weeks |
| Unikraft evaluation | Assess Unikraft for security-critical payment processing | Go service proven | 4 weeks |
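The canary deployment task does not prescribe a traffic-splitting mechanism. One common option, sketched here purely as an assumption, is a sticky split keyed on merchant ID, so that a given merchant consistently sees either the stable or the canary version. The function name `routes_to_canary` and the choice of SHA-256 are illustrative.

```python
import hashlib


def routes_to_canary(merchant_id: str, canary_percent: int) -> bool:
    """Deterministically place a merchant in the canary cohort."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(merchant_id.encode("utf-8")).digest()
    # Map the hash to a stable bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent
```

Because the bucket is fixed per merchant, raising `canary_percent` only ever adds merchants to the canary cohort; none are moved back to the stable version mid-rollout.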

Phase 6: Data Infrastructure

Timeline: 2027 Priority: MEDIUM

| Task | Description | Dependencies | Effort |
|------|-------------|--------------|--------|
| RDS split | Separate shared RDS into per-service instances | None (can start earlier) | 4-6 weeks |
| SurrealDB pilot | Deploy SurrealDB for first new Go service | Phase 5 (Go service) | 2-3 weeks |
| NSQ deployment | Replace Kafka with NSQ for inter-service messaging | None | 3-4 weeks |
| Meilisearch deployment | Merchant-facing search in portal | None | 2 weeks |
| Redis cluster mode | Enable cluster mode, per-service namespacing | None (can start earlier) | 1-2 weeks |

Migration Risk Register

| Risk | Impact | Likelihood | Mitigation |
|------|--------|------------|------------|
| Service disruption during KrakenD rollout | HIGH | Medium | Blue/green deployment, gradual merchant migration, instant rollback |
| Cloudflare outage impacts all services | HIGH | Low | Document emergency bypass procedure; monitor Cloudflare status |
| Data loss during RDS split | CRITICAL | Low | Extensive testing in Test environment; point-in-time recovery enabled; rollback plan |
| ControlPlane.com integration delays | MEDIUM | Medium | Keep existing auth as fallback; phased migration |
| Compliance issues with new infrastructure | HIGH | Medium | Engage regulators early; legal review of each technology change |
| Team skill gap (Go, new tooling) | MEDIUM | High | Training programme; gradual adoption; AI SDLC augmentation |
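The register above rates impact and likelihood qualitatively and does not define a numeric scale. As an illustration only, the sketch below assigns assumed weights to each rating and ranks the risks by impact multiplied by likelihood. The weights in `IMPACT` and `LIKELIHOOD` are hypothetical and should be replaced by whatever scale the organisation's risk framework uses.

```python
# Hypothetical weights (assumption): the source register defines no numeric scale.
IMPACT = {"MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}
LIKELIHOOD = {"Low": 1, "Medium": 2, "High": 3}

# (risk, impact, likelihood) transcribed from the register.
RISKS = [
    ("Service disruption during KrakenD rollout", "HIGH", "Medium"),
    ("Cloudflare outage impacts all services", "HIGH", "Low"),
    ("Data loss during RDS split", "CRITICAL", "Low"),
    ("ControlPlane.com integration delays", "MEDIUM", "Medium"),
    ("Compliance issues with new infrastructure", "HIGH", "Medium"),
    ("Team skill gap (Go, new tooling)", "MEDIUM", "High"),
]


def ranked(risks):
    """Return (score, risk) pairs, highest score first; ties keep register order."""
    scored = [
        (IMPACT[impact] * LIKELIHOOD[likelihood], name)
        for name, impact, likelihood in risks
    ]
    return sorted(scored, key=lambda pair: -pair[0])


for score, name in ranked(RISKS):
    print(score, name)
```

With these assumed weights, the KrakenD rollout, compliance, and skill-gap risks tie for the highest score, while the Cloudflare outage ranks lowest because of its low likelihood.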

18. Appendix: Infrastructure Controls Checklist

Use this checklist for infrastructure reviews and compliance audits.

A. Network Security

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| N-01 | All public endpoints behind Cloudflare (no direct origin access) | Security standard | |
| N-02 | ALB accepts traffic only from Cloudflare IP ranges | Security standard | |
| N-03 | Security groups follow least-privilege (no 0.0.0.0/0 inbound) | PCI DSS, all regulators | |
| N-04 | NACLs configured as defence in depth | Security standard | |
| N-05 | VPC flow logs enabled and exported to OpenSearch | PCI DSS, audit requirement | |
| N-06 | No public IP addresses on application or database instances | Security standard | |
| N-07 | CDE network segment isolated (Cards product) | PCI DSS 4.0 | |
| N-08 | DDoS protection active (Cloudflare) | All regulators | |
| N-09 | WAF rules configured for payment API protection | PCI DSS, security standard | |
| N-10 | DNSSEC enabled on DNS zones | Security standard | |
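Control N-03 lends itself to an automated check. The sketch below scans a security group description for inbound rules open to the whole internet. It assumes the input is a plain dictionary shaped like one entry of the EC2 `DescribeSecurityGroups` response (`IpPermissions`, `IpRanges[].CidrIp`, `Ipv6Ranges[].CidrIpv6`); no AWS call is made, and the function name and the sample group are hypothetical.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_ingress_rules(security_group: dict) -> list:
    """Return (protocol, from_port, to_port, cidr) for world-open inbound rules."""
    findings = []
    for permission in security_group.get("IpPermissions", []):
        cidrs = [r["CidrIp"] for r in permission.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in permission.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            if cidr in OPEN_CIDRS:
                findings.append((
                    permission.get("IpProtocol"),
                    permission.get("FromPort"),
                    permission.get("ToPort"),
                    cidr,
                ))
    return findings


# Hypothetical example group: one restricted rule, one world-open rule.
sample_group = {
    "GroupId": "sg-example",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "173.245.48.0/20"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
print(open_ingress_rules(sample_group))
```

An empty result for every security group in scope would be the evidence to record against N-03 during a review.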

B. Encryption

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| E-01 | TLS 1.2+ on all external connections | PCI DSS 4.0, all regulators | |
| E-02 | TLS 1.3 preferred where supported | Security standard | |
| E-03 | mTLS for all service-to-service communication | Security standard | |
| E-04 | Database encryption at rest (AES-256) | PCI DSS, all regulators | |
| E-05 | S3/R2 bucket encryption enabled | Security standard | |
| E-06 | PAN encrypted at application level (not only disk-level encryption) | PCI DSS 4.0 | |
| E-07 | Encryption keys managed in KMS (separate from data) | PCI DSS 4.0 | |
| E-08 | Certificate auto-renewal configured | Operational | |
| E-09 | No SSL or early TLS (TLS 1.0/1.1) anywhere | PCI DSS 4.0 | |
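For client code that calls external services, controls E-01 and E-09 translate into refusing anything older than TLS 1.2 at the socket layer. The following is a minimal sketch using Python's standard `ssl` module; the function name `strict_client_context` is hypothetical, and server-side enforcement at the edge or load balancer is configured separately.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client-side TLS context that refuses SSL and TLS versions below 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Explicitly set the floor so the policy does not depend on library defaults.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

TLS 1.3 is still negotiated automatically when both sides support it, which is in line with E-02.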

C. Access Control

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| A-01 | MFA enabled for all CDE access | PCI DSS 4.0 | |
| A-02 | MFA enabled for all infrastructure access | Security standard, all regulators | |
| A-03 | No shared accounts or credentials | PCI DSS, security standard | |
| A-04 | Service accounts use workload identity (no static credentials) | Security standard | |
| A-05 | Quarterly access review completed | PCI DSS, all regulators | |
| A-06 | Privileged access logged and alerted | PCI DSS, all regulators | |
| A-07 | Break-glass procedure documented and tested | Operational | |
| A-08 | Terminated employee access revoked within 24 hours | PCI DSS, all regulators | |

D. Logging and Monitoring

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| L-01 | Structured JSON logging on all services | Observability standard | |
| L-02 | Trace ID propagated end-to-end | Observability standard | |
| L-03 | CDE access logs tamper-evident | PCI DSS 4.0 | |
| L-04 | Log retention meets jurisdictional requirements (up to 7 years) | PK, BD, NP, IQ, EG regulators | |
| L-05 | Automated log review for security events | PCI DSS 4.0 | |
| L-06 | Alerting configured for all P1/P2 scenarios | Operational | |
| L-07 | Dashboards operational for all products | Operational | |
| L-08 | No sensitive data in logs (PAN, CVV, PIN, full CNIC) | PCI DSS, PDPA | |
| L-09 | Audit trail for all infrastructure changes | All regulators | |
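Controls L-01, L-02, and L-08 can be combined in a single log formatter: emit JSON, carry a trace ID, and mask anything that looks like a card number before it is written. The sketch below is one possible approach, not the platform's logging library. The field names, the `trace_id` record attribute, and the masking policy (keep only the last four digits) are assumptions, and the pattern only catches unbroken digit runs, so PANs written with spaces or dashes would need additional handling.

```python
import json
import logging
import re

PAN_CANDIDATE = re.compile(r"\b\d{13,19}\b")  # unbroken runs of 13-19 digits


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to reduce false positives on long numeric IDs."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_pans(text: str) -> str:
    """Mask Luhn-valid digit runs, keeping only the last four digits."""
    def mask(match):
        digits = match.group(0)
        if not luhn_valid(digits):
            return digits
        return "*" * (len(digits) - 4) + digits[-4:]
    return PAN_CANDIDATE.sub(mask, text)


class JsonFormatter(logging.Formatter):
    """Formats records as JSON with a trace ID and PAN-masked message."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": redact_pans(record.getMessage()),
            # Set via logger.info(..., extra={"trace_id": ...}); None if absent.
            "trace_id": getattr(record, "trace_id", None),
        })
```

Redaction at the formatter is a last line of defence; the primary control remains not passing cardholder data to the logger in the first place.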

E. Backup and Recovery

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| R-01 | Automated daily backups for all databases | All regulators | |
| R-02 | Backup restore tested monthly | DR standard | |
| R-03 | RPO/RTO defined per service tier | DR standard | |
| R-04 | DR runbooks documented | All regulators, DR standard | |
| R-05 | DR exercise conducted bi-annually | DR standard | |
| R-06 | Backups encrypted | PCI DSS, security standard | |
| R-07 | Backups stored in different location from primary | DR standard | |
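Control R-01 can be verified mechanically by comparing the timestamp of the most recent backup with the current time. The sketch below assumes a 24-hour threshold derived from "daily"; the function name and default are hypothetical, and a tighter threshold may be appropriate where the RPO for a service tier is shorter.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(
    last_backup: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed reading of "daily"
) -> bool:
    """Return True if the latest backup is no older than `max_age`."""
    if last_backup.tzinfo is None or now.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    age = now - last_backup
    # A backup timestamp in the future indicates clock or data problems.
    return timedelta(0) <= age <= max_age


now = datetime(2026, 4, 3, 12, 0, tzinfo=timezone.utc)
print(backup_is_fresh(datetime(2026, 4, 3, 1, 0, tzinfo=timezone.utc), now))
print(backup_is_fresh(datetime(2026, 4, 1, 1, 0, tzinfo=timezone.utc), now))
```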

F. Compliance

| # | Control | Required By | Status |
|---|---------|-------------|--------|
| C-01 | Data localisation requirements met per jurisdiction | PK, BD, NP | |
| C-02 | Incident reporting capability (2-hour internal SLA) | All | |
| C-03 | Transaction record retention (minimum 5 years) | PK, IQ, PCI DSS | |
| C-04 | PCI DSS v4.0.1 assessment current | PCI DSS | |
| C-05 | AML/CFT transaction monitoring operational | All jurisdictions | |
| C-06 | Sanctions screening integrated | All jurisdictions (especially IQ) | |
| C-07 | Regulatory technology change approvals obtained | PK (SBP), BD, NP | |
| C-08 | ISO 27001 certification (required for Nepal) | NP (NRB) | |
| C-09 | Annual PCI DSS assessment scheduled | PCI DSS | |
| C-10 | Quarterly vulnerability scanning programme | PCI DSS 4.0 | |

Document Control

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0.0 | 2026-04-03 | CDO (AI SDLC) | Initial version: AI SDLC prototype and showcase |

Review Schedule: Quarterly (next review: Q3 2026)

Distribution: Architecture & Engineering Leadership


This document was generated as part of the Simpaisa AI SDLC prototype. All compliance information should be verified with legal counsel and regulatory advisors in each jurisdiction.