# Simpaisa Incident Response Playbook
| Owner | Classification | Review Date | Status |
|---|---|---|---|
| Security | Confidential | April 2027 | Active |
Document Owner: Daniel O'Reilly, Chief Digital Officer
Version: 1.0
Created: 2026-04-03
Last Reviewed: 2026-04-03
Next Review: 2026-07-03
Classification: Internal — Confidential
Maturity Level: 1/5 (Initial) — Target: 3/5 within 6 months
## Table of Contents

1. Executive Summary and Purpose
2. Scope
3. Severity Classification
4. Roles and Responsibilities
5. Incident Lifecycle
6. Detection and Alerting
7. Communication Templates
8. Runbooks
9. Escalation Matrix
10. Post-Incident Review (PIR)
11. Regulatory Reporting Requirements
12. Business Continuity
13. Testing and Drills
14. Tools and Access
15. Appendix: Quick Reference Card
## 1. Executive Summary and Purpose
Simpaisa is a payment gateway processing 270M+ transactions worth $1B+ annually across Pakistan, Bangladesh, Nepal, Iraq, and Egypt. Our platform handles Pay-Ins, Pay-Outs, Remittances, and Cards through 20+ payment channels including mobile wallets, bank integrations, direct carrier billing, card networks, and national payment rails.
Any disruption to our services has immediate financial, reputational, and regulatory consequences. This playbook establishes a structured, repeatable incident response process to:
- Minimise downtime and financial impact to merchants and end-users
- Ensure regulatory compliance across all jurisdictions (including Simpaisa's 2-hour internal incident reporting SLA)
- Provide clear escalation paths so the right people are engaged at the right time
- Enable blameless learning from every incident to continuously improve platform resilience
- Standardise communication to merchants, regulators, and internal stakeholders during incidents
This is a living document. As our incident response maturity improves, this playbook will be updated to reflect new tooling, processes, and lessons learnt.
### Agentic AI Integration

Simpaisa follows an Agentic AI SDLC-first approach. AI agents assist with:

- Automated incident detection and initial triage
- Log correlation and root cause analysis suggestions
- Drafting communication templates and status updates
- Post-incident review data gathering and timeline reconstruction
Human judgement remains authoritative for all escalation, resolution, and communication decisions.
## 2. Scope

### What Constitutes an Incident

An incident is any event that:

- Causes or threatens to cause degradation or loss of payment processing capability
- Results in unauthorised access to payment data, merchant data, or internal systems
- Triggers or may trigger regulatory reporting obligations in any jurisdiction
- Causes financial loss to Simpaisa, merchants, or end-users through system malfunction
- Results in incorrect transaction processing (duplicate charges, incorrect amounts, misrouted payments)
- Causes breach of SLA commitments to merchants or payment channel partners
### In Scope
| Category | Examples |
|---|---|
| Payment Processing | Transaction failures, channel outages, settlement errors, reconciliation mismatches |
| Platform Availability | Service outages, degraded performance, capacity exhaustion |
| Data Security | Breaches, unauthorised access, data exfiltration, credential compromise |
| Infrastructure | AWS failures, database issues, cache failures, messaging system failures |
| Integration | Payment channel API failures, webhook delivery failures, partner connectivity loss |
| Compliance | PCI DSS violations, regulatory reporting failures |
### Out of Scope

- Planned maintenance windows (covered by change management)
- Feature requests or enhancement work
- Non-production environment issues (unless blocking critical deployments)
- Individual merchant configuration issues (handled by support)
- General IT support issues (workstation, email, etc.)
## 3. Severity Classification

### Severity Definitions
| Severity | Name | Definition | Examples |
|---|---|---|---|
| SEV1 | Critical | Complete loss of payment processing or confirmed data breach affecting multiple merchants or jurisdictions | Payment processing fully down across all channels; confirmed data breach with cardholder data exposure; all merchants unable to process; complete database failure; regulatory reporting triggered |
| SEV2 | High | Single product or payment channel down, significant performance degradation, or partial data exposure | Pay-Ins completely unavailable; Easypaisa channel down; P95 latency >5s for >5 minutes; partial data exposure; single country entirely affected |
| SEV3 | Medium | Non-critical service degraded, single merchant affected, or elevated error rates not yet impacting overall availability | Single merchant experiencing failures; error rate elevated but less than 10 percent of transactions; non-critical API endpoint degraded; webhook delivery delays |
| SEV4 | Low | Cosmetic issues, documentation errors, non-production problems, or minor anomalies with no user impact | UI rendering issue on merchant dashboard; incorrect error message text; staging environment instability; minor log anomalies |
### SLA Targets
| Metric | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Response Time | 5 minutes | 15 minutes | 1 hour | Next business day |
| Acknowledge | 10 minutes | 30 minutes | 2 hours | Next business day |
| First Update | 15 minutes | 30 minutes | 4 hours | N/A |
| Update Frequency | Every 15 minutes | Every 30 minutes | Every 4 hours | As needed |
| Resolution Target | 1 hour | 4 hours | 24 hours | 5 business days |
| Post-Incident Review | Mandatory within 24 hours | Mandatory within 48 hours | Required within 1 week | Optional |
| CDO Notification | Immediate | Within 30 minutes | Daily summary | N/A |
### Severity Upgrade Criteria

An incident must be upgraded if:

- A SEV2 is not resolved within 2 hours — upgrade to SEV1
- A SEV3 affects additional merchants or channels — upgrade to SEV2
- Any severity reveals a data breach component — upgrade to SEV1
- Any severity triggers regulatory reporting obligations — upgrade to at least SEV2
- Customer/merchant financial impact is confirmed — upgrade by one level
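Where tooling automates severity handling, the rules above can be encoded directly. The sketch below is illustrative only: the function shape, parameter names, and the order in which the rules are applied are assumptions, not part of this playbook.

```python
# Illustrative only: encodes the upgrade rules above as a pure function.
# Severity is 1 (critical) to 4 (low); a lower number is more severe.

def upgraded_severity(
    severity: int,
    unresolved_minutes: int = 0,
    additional_merchants_or_channels: bool = False,
    data_breach: bool = False,
    regulatory_reporting: bool = False,
    financial_impact_confirmed: bool = False,
) -> int:
    """Return the severity after applying the playbook's upgrade criteria."""
    if data_breach:
        return 1  # any data breach component is SEV1
    if severity == 2 and unresolved_minutes >= 120:
        severity = 1  # SEV2 unresolved for 2 hours becomes SEV1
    if severity == 3 and additional_merchants_or_channels:
        severity = 2  # a spreading SEV3 becomes SEV2
    if regulatory_reporting:
        severity = min(severity, 2)  # at least SEV2
    if financial_impact_confirmed:
        severity = max(severity - 1, 1)  # upgrade by one level
    return severity

assert upgraded_severity(3, additional_merchants_or_channels=True) == 2
assert upgraded_severity(2, unresolved_minutes=130) == 1
```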
## 4. Roles and Responsibilities

### Incident Roles

#### Incident Commander (IC)
The IC owns the incident from declaration to closure. They do not fix the problem themselves — they coordinate.
Responsibilities:
- Declare the incident and assign severity
- Open the incident Slack channel (naming convention: inc-YYYY-MM-DD-brief-description)
- Assign Technical Lead and Communications Lead
- Maintain the incident timeline
- Make escalation decisions
- Authorise communications to merchants and regulators
- Call for Post-Incident Review
- Create and track the beads issue for the incident
Who: Senior engineer or engineering manager on call. For SEV1, the most senior available engineer assumes IC until explicitly handed over.
#### Technical Lead (TL)
Responsibilities:
- Lead the technical investigation and diagnosis
- Coordinate engineering resources working on the incident
- Implement containment and resolution actions
- Provide technical updates to the IC
- Document technical findings for the PIR
- Verify resolution and confirm service recovery
Who: The engineer with deepest knowledge of the affected system. May change during an incident if a different specialist is needed.
#### Communications Lead (Comms)
Responsibilities:
- Draft and send all external communications (merchant notifications, status page updates)
- Draft regulatory notifications for IC/CDO approval
- Manage the Slack incident channel information flow
- Ensure update frequency SLAs are met
- Coordinate with the merchant support team
Who: Product or support team member. For SEV1, a dedicated person must be assigned.
### CDO Escalation Criteria

The CDO (Daniel O'Reilly) must be notified when:

- Any SEV1 incident is declared — immediately
- Any SEV2 incident is declared — within 30 minutes
- Any incident requires regulatory reporting in any jurisdiction
- Any incident involves a confirmed or suspected data breach
- Any incident has confirmed financial impact to merchants or Simpaisa
- Any incident is likely to receive media attention
- A SEV2 incident is not resolved within 2 hours
- A single payment channel has been down for more than 30 minutes during peak hours
Contact method: Slack DM + phone call for SEV1. Slack DM for SEV2. Daily summary for SEV3.
### On-Call Rotation Structure
STATUS: To be established. The following is the target structure.
| Rotation | Coverage | Staffing |
|---|---|---|
| Primary On-Call | 24/7, weekly rotation | Engineers (minimum 4 in rotation) |
| Secondary On-Call | Escalation backup | Senior engineers (minimum 3 in rotation) |
| IC On-Call | Weekday business hours + weekend coverage | Engineering managers / senior leads |
| Comms On-Call | Business hours with on-call for SEV1 | Product / support team |
On-call expectations:
- Acknowledge pages within 5 minutes
- Have laptop and VPN access available at all times during on-call
- No travel to areas without reliable internet during on-call
- Handover briefing at rotation change
## 5. Incident Lifecycle

### Overview

Detection → Triage → Containment → Resolution → Recovery → Post-Incident Review
### Phase 1: Detection
Objective: Identify that an incident is occurring as quickly as possible.
Automated Detection:

- Grafana alert fires based on predefined thresholds (see Section 6)
- Alert routes to on-call via the configured notification channel
- On-call acknowledges the alert within 5 minutes

Manual Detection:

- Engineer, merchant, or support team identifies anomalous behaviour
- Reporter posts in the incidents Slack channel with initial details
- On-call engineer investigates and determines whether an incident should be declared

AI-Assisted Detection:

- Agentic AI monitors aggregate patterns across channels and products
- AI flags anomalies that may not trigger individual threshold alerts
- On-call engineer reviews AI-flagged anomalies and determines severity
Decision: Is this an incident?
- If transaction success rate has dropped — yes, declare an incident
- If a payment channel is returning errors — yes, declare an incident
- If latency is elevated but transactions succeed — monitor for 5 minutes, then declare if not improving
- If only non-production is affected — no, handle as BAU
- If a single merchant reports issues but metrics look normal — investigate as a potential SEV3
### Phase 2: Triage
Objective: Classify the incident, mobilise the right people, and establish communications.
Steps:

1. Assign severity using the classification matrix in Section 3
2. Create the incident channel using the naming convention inc-YYYY-MM-DD-brief-description
3. Create a beads issue with the severity label
4. Post the incident header in the channel (use the template from Section 7)
5. Assign roles: IC, Technical Lead, Communications Lead
6. Notify stakeholders per the escalation matrix
7. Assess blast radius:
    - Which products are affected? (Pay-Ins, Pay-Outs, Remittances, Cards)
    - Which channels are affected?
    - Which countries/jurisdictions are affected?
    - How many merchants are impacted?
    - Is there a regulatory reporting obligation?
### Phase 3: Containment
Objective: Stop the incident from getting worse. Limit the blast radius.
Steps:

1. Identify containment options:
    - Can the affected channel be isolated without impacting others?
    - Can traffic be rerouted to healthy channels?
    - Should the affected service be taken out of the load balancer?
    - Is a rollback of a recent deployment needed?
2. Implement containment — prefer reversible actions:
    - Disable the failing channel in configuration
    - Scale up healthy instances to absorb redirected traffic
    - Enable circuit breakers if not already triggered
    - Roll back the last deployment if suspected as the cause
3. Verify containment:
    - Confirm the blast radius is not expanding
    - Confirm healthy services remain healthy
    - Monitor error rates for secondary impacts
4. Communicate containment status to the IC and update the incident channel
### Phase 4: Resolution
Objective: Fix the root cause and restore full service.
Steps:

1. Diagnose the root cause:
    - Review Grafana dashboards for anomalies
    - Search Jaeger traces for failing transactions
    - Query OpenSearch logs for errors and exceptions
    - Check recent deployments and configuration changes
    - Review the AWS health dashboard for infrastructure issues
2. Implement the fix:
    - Apply a code fix, configuration change, or infrastructure remediation
    - For code changes: follow the expedited deployment process (peer review still required for SEV1/2)
    - Document all changes made during the incident
3. Test the fix:
    - Verify in staging if time permits (SEV3/4)
    - For SEV1/2: verify the fix addresses the root cause; accept the risk of direct-to-production if necessary
    - Run synthetic transactions through the affected paths
### Phase 5: Recovery
Objective: Restore full normal operations and verify stability.
Steps:

1. Re-enable affected services/channels gradually
2. Monitor recovery metrics:
    - Transaction success rate returning to baseline
    - Latency returning to normal
    - Error rates dropping to normal levels
    - Queue backlogs draining
3. Process backlogged transactions if applicable
4. Verify data consistency:
    - Check for stuck transactions
    - Reconcile any transactions that were in flight during the incident
    - Verify settlement files are correct
5. Communicate resolution to merchants and stakeholders
6. Stand down the incident team — the IC makes the call
### Phase 6: Post-Incident Review

Objective: Learn from the incident and prevent recurrence. See Section 10 for full details.

### Decision Trees

#### Transaction Failure Rate Increasing
Transaction failure rate >5%?

- YES: Which channels are affected?
    - ALL channels: SEV1 — likely a platform issue (DB, cache, app). Check RDS health, Redis health, application logs
    - MULTIPLE channels: SEV2 — possible shared dependency. Check shared middleware, network connectivity, DNS
    - SINGLE channel: SEV2 — channel-specific issue. Check channel API status, integration logs, circuit breaker state
- NO: Monitor. Set an alert if >3% is sustained for 5 minutes.
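For teams automating triage, this decision tree maps onto a small helper. A minimal sketch, assuming a simple channel-to-failure-rate input; it is not an existing Simpaisa tool.

```python
# Illustrative triage helper mirroring the decision tree above.
# channel_failure_rates maps channel name -> failure rate (0.0 to 1.0);
# the channel names and thresholds are assumptions for illustration.

def triage_failure_rates(channel_failure_rates: dict[str, float]) -> str:
    failing = [c for c, rate in channel_failure_rates.items() if rate > 0.05]
    if not failing:
        return "monitor"  # below 5%: watch for >3% sustained over 5 minutes
    if len(failing) == len(channel_failure_rates):
        return "SEV1: likely platform issue (DB, cache, app)"
    if len(failing) > 1:
        return "SEV2: possible shared dependency (middleware, network, DNS)"
    return f"SEV2: channel-specific issue on {failing[0]}"

print(triage_failure_rates({"easypaisa": 0.40, "jazzcash": 0.01, "bkash": 0.02}))
```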
#### Latency Spike

P95 latency >2s?

- YES: Is it affecting transaction success?
    - YES: Transactions timing out — SEV2. Check DB query performance, connection pools, external API latency
    - NO: Transactions slow but completing — SEV3. Check query plans, cache hit rates, JVM garbage collection
- NO: Monitor. Investigate if >1.5s is sustained.
#### Suspected Security Incident

- Confirmed data breach: SEV1 IMMEDIATELY. Engage the security team, isolate affected systems, notify the CDO
- Suspicious activity (unusual patterns): SEV2. Investigate scope, preserve evidence, assess data exposure
- Automated scan/probe detected: SEV4. Verify WAF rules, monitor for escalation
- Credential compromise suspected: SEV2. Rotate credentials immediately, audit access logs
## 6. Detection and Alerting

### Observability Stack
| Component | Tool | Purpose |
|---|---|---|
| Metrics | OpenTelemetry to Grafana | Time-series metrics, dashboards, alerting |
| Traces | OpenTelemetry to Jaeger | Distributed tracing across services |
| Logs | OpenTelemetry to OpenSearch | Centralised log aggregation and search |
### Key Metrics to Monitor

#### Payment Processing Metrics
| Metric | Normal Range | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Transaction Success Rate (overall) | >99% | <98% | <95% |
| Transaction Success Rate (per channel) | >97% | <95% | <90% |
| P50 Latency | <500ms | >1s | >2s |
| P95 Latency | <1.5s | >3s | >5s |
| P99 Latency | <3s | >5s | >10s |
| Error Rate (per channel) | <1% | >3% | >5% |
| Transactions Per Second | Baseline ±20% | ±40% | ±60% or zero |
#### Infrastructure Metrics

| Metric | Normal Range | Warning Threshold | Critical Threshold |
|---|---|---|---|
| RDS CPU Utilisation | <60% | >75% | >90% |
| RDS Connection Count | <70% of max | >80% of max | >90% of max |
| RDS Replication Lag | <100ms | >500ms | >2s |
| Redis Memory Usage | <70% | >80% | >90% |
| Redis Connection Count | <80% of max | >85% | >95% |
| Kafka Consumer Lag | <1,000 messages | >5,000 messages | >50,000 messages |
| JVM Heap Usage | <70% | >80% | >90% |
| ALB 5xx Rate | <0.1% | >0.5% | >1% |
| ALB Healthy Host Count | All targets healthy | 1 unhealthy | >1 unhealthy |
### Grafana Alert Rules

#### Critical Alerts (Page immediately)

- TransactionSuccessRateCritical: overall transaction success rate <95% for 2 minutes
- ChannelSuccessRateCritical: channel success rate <90% for 3 minutes
- TransactionLatencyP95Critical: P95 transaction latency >5s for 3 minutes
- ZeroTransactions: no transactions processed in a 5-minute window, sustained for 2 minutes
- DBConnectionPoolExhausted: database connection pool >90% utilised for 1 minute
#### Warning Alerts (Notify on-call, investigate)

- ElevatedErrorRate: transaction error rate >3% for 5 minutes
- KafkaConsumerLagHigh: Kafka consumer lag >5,000 messages for 5 minutes
- RedisMemoryHigh: Redis memory usage >80% for 5 minutes
### Alert Routing and Escalation

When an alert fires in Grafana:

- Critical: page on-call immediately (PagerDuty/Opsgenie — TBC)
    - Not acknowledged in 5 min: page secondary on-call
    - Not acknowledged in 10 min: page IC on-call + CDO
- Warning: Slack alerts channel + on-call notification
    - Not acknowledged in 15 min: page on-call
    - Sustained above 30 min: upgrade to Critical
- Info: Slack alerts channel (no page)
### Synthetic Monitoring
STATUS: To be implemented.
Target synthetic checks:
| Check | Frequency | Timeout | Description |
|---|---|---|---|
| Pay-In Health | Every 1 minute | 10s | End-to-end test transaction through each active channel |
| Pay-Out Health | Every 1 minute | 10s | Verification of pay-out initiation path |
| API Gateway | Every 30 seconds | 5s | Health check endpoint for each API gateway instance |
| Channel Connectivity | Every 2 minutes | 15s | Connectivity check to each payment channel partner |
| Webhook Delivery | Every 5 minutes | 30s | End-to-end webhook delivery verification |
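Each synthetic check reduces to a small scheduled probe that emits a pass/fail signal. Below is a minimal sketch of the API Gateway check, assuming a hypothetical health URL and the 5s timeout from the table above; the real checks would run from the monitoring stack and feed Grafana.

```python
# A minimal synthetic-check sketch, not production code. The endpoint
# URL and response shape are assumptions for illustration only.
import requests

GATEWAY_HEALTH_URL = "https://api.example.internal/health"  # hypothetical URL
TIMEOUT_SECONDS = 5  # matches the API Gateway check in the table above

def check_api_gateway() -> bool:
    """Return True if the gateway health endpoint responds 200 in time."""
    try:
        resp = requests.get(GATEWAY_HEALTH_URL, timeout=TIMEOUT_SECONDS)
        return resp.status_code == 200
    except requests.RequestException:
        return False  # timeouts and connection errors count as failures

if __name__ == "__main__":
    # A scheduler (cron, Lambda, etc.) would run this every 30 seconds
    # and push the result to Grafana as a metric or raise an alert.
    print("healthy" if check_api_gateway() else "unhealthy")
```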
## 7. Communication Templates

### Internal Communications

#### Incident Channel Header (Post at Incident Declaration)
INCIDENT DECLARED

- Severity: SEV 1/2/3/4
- Summary: One-line description of the incident
- Impact: Products/channels/merchants affected
- Countries: Affected jurisdictions
- Roles: IC @name, Tech Lead @name, Comms @name
- Beads: issue ID
- Dashboard: Grafana link
- Timeline: HH:MM UTC — incident detected; HH:MM UTC — incident declared
- Next update: HH:MM UTC
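If the header is posted automatically at declaration, a Slack incoming webhook is one straightforward option. The sketch below uses a placeholder webhook URL and a reduced field set; Simpaisa's actual tooling may differ.

```python
# Hedged sketch: post the incident header to Slack via an incoming
# webhook. The webhook URL and field values are placeholders.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_incident_header(severity: int, summary: str, impact: str,
                         ic: str, beads_issue: str) -> None:
    lines = [
        ":rotating_light: INCIDENT DECLARED",
        f"Severity: SEV{severity}",
        f"Summary: {summary}",
        f"Impact: {impact}",
        f"Roles: IC {ic}",
        f"Beads: {beads_issue}",
    ]
    body = json.dumps({"text": "\n".join(lines)}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # raises on HTTP errors

post_incident_header(2, "Easypaisa channel degraded", "Pay-Ins in Pakistan",
                     "@oncall-ic", "inc-2026-04-03-easypaisa")
```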
#### Status Update Template

INCIDENT UPDATE — HH:MM UTC

- Severity: SEV X (unchanged/upgraded/downgraded)
- Status: Investigating / Contained / Resolving / Resolved
- What we know: Key findings
- What we are doing: Actions with owners
- Impact update: Current impact assessment
- Next update: HH:MM UTC
### Merchant Communications

#### Outage Notification Email
Subject: Simpaisa Service Disruption — Product/Channel — Date
Dear Merchant Name,
We are currently experiencing a service disruption affecting specific product/channel.
What is affected: Description of affected functionality
What this means for you: Specific impact on the merchant operations
What we are doing: Our engineering team is actively investigating and working to resolve the issue.
Current workaround (if applicable): Any alternative processing options available
We will provide updates every frequency until the issue is resolved.
If you have urgent questions, please contact support channel.
Regards, Simpaisa Operations Team
#### Resolution Notification Email
Subject: Simpaisa Service Restored — Product/Channel — Date
Dear Merchant Name,
The service disruption affecting specific product/channel has been resolved.
Resolution time: HH:MM UTC on Date. Total duration: X hours Y minutes
What was affected: Summary of impact
What happened: Brief, non-technical explanation of the root cause
What we have done: Summary of resolution actions. Any preventive measures being implemented.
Impact on your transactions: Details on any transactions that need attention, reconciliation notes, etc.
A detailed post-incident summary will be shared within timeframe.
If you notice any ongoing issues, please contact support channel.
Regards, Simpaisa Operations Team
#### Post-Incident Summary for Merchants

Subject: Simpaisa Post-Incident Summary — Brief Description — Date

Incident Summary:

- Duration: Start time to End time UTC (total duration)
- Severity: Level
- Affected services: List
- Transaction impact: Number of failed/delayed transactions
Root Cause: Non-technical explanation of what caused the incident
Resolution: What was done to fix the issue
Preventive Measures: What we are doing to prevent recurrence — concrete actions and timelines
Transaction Reconciliation: Any guidance on reconciling transactions during the incident window
We apologise for the disruption and are committed to continuously improving our platform reliability.
### Regulatory Communications

#### SBP Payment System Disruption Report (Pakistan)

- Reporting Entity: Simpaisa (Pvt.) Limited
- Date/Time of Disruption: Start time — PKT
- Date/Time of Resolution: End time — PKT / Ongoing
- Nature of Disruption: Description
- Payment Systems Affected: Channel/rail affected
- Volume Impact: Transactions affected (count); Value affected (PKR amount)
- Root Cause: Known / Under investigation
- Remediation Steps: Actions taken
- Preventive Measures: Planned improvements
- Contact Person: Name, Designation, Phone, Email — TBC
### Status Page Updates
Investigating: We are investigating reports of issue description. Affected: channels/products. Some transactions may fail/be delayed. We will provide an update within timeframe.
Identified: The issue has been identified as brief cause. Our team is implementing a fix. Channel X transactions are currently failing/delayed. Estimated resolution: time estimate or TBC.
Resolved: The issue affecting component has been resolved. All services are operating normally. Duration: X hours Y minutes. We apologise for any inconvenience.
## 8. Runbooks

### Payment Channel Failures

#### 8.1 Single Channel Down (e.g., Easypaisa Unavailable)
Symptoms:

- Grafana alert: ChannelSuccessRateCritical for a specific channel
- Spike in HTTP 5xx or timeout errors from the channel API
- Circuit breaker tripped for the channel
- Merchant reports of failed transactions on a specific payment method
Impact Assessment:

- Which merchants rely primarily on this channel?
- What percentage of total transaction volume flows through this channel?
- Are there alternative channels available in the same country?
- Is this during peak transaction hours?
Diagnostic Steps:

1. Check circuit breaker state via the Spring Boot actuator health endpoint
2. Check channel-specific error rates in Grafana (Payment Channels dashboard → the affected channel)
3. Review channel integration logs in OpenSearch
4. Check Jaeger for failing traces — filter by service and error status
5. Verify channel partner status — check the partner status page, contact partner technical support
6. Check for recent deployments
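For step 1, if services expose the Resilience4j circuit breaker health indicator through the actuator, breaker state can be read programmatically. The host, endpoint layout, and response shape below are assumptions; verify against the actual actuator configuration.

```python
# Hedged sketch of diagnostic step 1: read circuit breaker state from a
# Spring Boot actuator health endpoint. Host and breaker names are
# placeholders; the JSON layout varies by actuator configuration.
import requests

ACTUATOR_URL = "http://pay-in-service.internal:8080/actuator/health"  # placeholder

def circuit_breaker_states() -> dict[str, str]:
    """Return breaker name -> state (e.g. CLOSED/OPEN/HALF_OPEN) if exposed."""
    health = requests.get(ACTUATOR_URL, timeout=5).json()
    breakers = health.get("components", {}).get("circuitBreakers", {})
    # Depending on configuration, per-breaker entries may sit under
    # "components" or "details"; check both.
    entries = breakers.get("components") or breakers.get("details") or {}
    return {
        name: entry.get("details", {}).get("state", "UNKNOWN")
        for name, entry in entries.items()
    }

for name, state in circuit_breaker_states().items():
    print(f"{name}: {state}")
```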
Resolution Steps:

- If the channel partner is down: confirm with the partner support team, disable the channel in configuration to fail fast, enable alternative channels if available, notify affected merchants
- If our integration is failing: check for API contract changes, review recent code/config changes, roll back if a recent deployment is the cause
- If timeout/latency is the issue: increase timeout thresholds temporarily (if safe), check whether the partner API is responding slowly, verify network connectivity
Verification Steps:

- Confirm transaction success rate for the channel returns to >97%
- Process test transactions through the channel
- Verify the circuit breaker has closed
- Check that no transactions are stuck in a pending state
Post-Resolution Checks:

- Reconcile transactions during the outage window
- Identify any transactions that need manual intervention
- Verify settlement processing for the affected period
#### 8.2 Multiple Channels Down Simultaneously

Symptoms:

- Multiple ChannelSuccessRateCritical alerts firing
- Overall transaction success rate dropping across products
- Multiple circuit breakers tripping
Impact Assessment:

- This is likely SEV1 — escalate immediately
- Determine whether this is a platform issue (shared dependency) or coincidental channel failures
Diagnostic Steps:

1. Determine the common factor: same country? Same protocol? Same start time?
2. Check shared infrastructure: NAT Gateway, DNS resolution, ALB health
3. Check for recent platform-wide changes: config management, infrastructure changes, certificate expirations
4. Review application logs for common errors
Resolution Steps:

- If shared infrastructure issue: fix the infrastructure component, verify connectivity is restored
- If bad deployment: immediate rollback to the last known good version
- If configuration change: revert the configuration change, verify each channel individually
#### 8.3 Channel Returning Incorrect Responses

Impact Assessment:

- This is a financial integrity issue — may be SEV1 depending on scale
Resolution Steps:

1. Immediately halt transactions to the affected channel if financial integrity is at risk
2. Fix the parsing/mapping issue or contact the partner
3. Test with controlled transactions before re-enabling
4. Mark all affected transactions for manual review and reconciliation
#### 8.4 Channel Timeout / Latency Spike

Resolution Steps:

- If the partner is slow: reduce the timeout to fail faster, enable the circuit breaker, route traffic to alternatives
- If on our side: scale up instances, investigate and fix the bottleneck
### Database Failures

#### 8.5 RDS Primary Failure / Failover

Impact Assessment: SEV1 — RDS is a shared resource across all products

Resolution Steps:

- For automatic failover (Multi-AZ): monitor recovery; applications should reconnect automatically
- If failover does not complete: engage AWS Support immediately (SEV1), consider promoting a read replica manually
- Verify data integrity after failover
#### 8.6 Replication Lag

Resolution Steps:

- If caused by heavy write load: identify and optimise it, defer non-critical batch operations
- If caused by long-running queries on the replica: terminate the problematic queries, optimise or reschedule them
- If replication is broken: rebuild the replica from a snapshot, engage AWS Support
#### 8.7 Connection Pool Exhaustion

Impact Assessment: SEV1 if affecting all services (shared database)

Resolution Steps:

1. Terminate long-running queries/transactions
2. Restart the leaking service instance
3. Increase the connection pool size temporarily
4. Investigate the root cause: missing @Transactional timeout, unclosed connections, deadlocks
#### 8.8 Slow Query Causing Transaction Timeouts

Resolution Steps:

1. Terminate the immediate problem query
2. Add the missing index
3. Optimise the query and deploy the fix
4. If a batch job is causing contention, stop or reschedule it
#### 8.9 Shared Database Contention Between Products

Impact Assessment: SEV1 — cross-product impact due to the shared resource

Resolution Steps:

- Stop the offending batch/query
- Reschedule the operation to off-peak hours
- Longer term: evaluate database separation per product
### Cache Failures

#### 8.10 Redis Cluster Failure

Impact Assessment: SEV2 initially — degrades to SEV1 if the database cannot handle the additional load

Resolution Steps:

- If single node failure: failover should be automatic
- If full cluster failure: verify AWS ElastiCache health, create a new cluster from backup if necessary
- If memory exhaustion: check for key space explosion, review the eviction policy, scale up
#### 8.11 Cache Poisoning (Incorrect Data Served)

Impact Assessment: SEV1 if affecting transaction processing integrity

Resolution Steps:

1. Flush the affected cache keys
2. If the scope is uncertain, flush the entire cache
3. Fix the root cause (cache update logic, race condition)
#### 8.12 ElastiCache Failover
Failover should complete automatically (typically 30-60 seconds). If applications do not reconnect, verify DNS resolution and consider rolling restart.
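To check the stale-DNS case from a suspect instance, resolving the cache endpoint directly shows what that instance currently sees. A hedged sketch with a placeholder endpoint name:

```python
# Hedged sketch: verify that the ElastiCache endpoint resolves to the
# new primary after failover. The endpoint name is a placeholder.
import socket

REDIS_ENDPOINT = "simpaisa-cache.example.cache.amazonaws.com"  # placeholder

def resolve(endpoint: str) -> list[str]:
    """Return the addresses the instance currently sees for the endpoint."""
    infos = socket.getaddrinfo(endpoint, 6379, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# If the resolved addresses still point at the failed node, applications
# holding stale DNS (or stale connection pools) need a rolling restart.
print(resolve(REDIS_ENDPOINT))
```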
### Application Failures

#### 8.13 Service Crash / Restart Loop

Resolution Steps:

- If caused by a recent deployment: roll back immediately
- If OOM: increase JVM heap settings temporarily
- If configuration error: fix and redeploy
- If external dependency failure: add a circuit breaker
#### 8.14 Memory Leak

Resolution Steps:

1. Immediate: restart the affected instance
2. Set up automated restarts as a temporary mitigation
3. Fix the leak in code and deploy
#### 8.15 Thread Pool Exhaustion

Resolution Steps:

- If threads are blocked on a slow external call: reduce the timeout, enable the circuit breaker, scale up
- If deadlock: restart the instance, fix the deadlock condition
- If a genuine capacity issue: increase the thread pool size, scale out instances
#### 8.16 Deployment Failure / Bad Deploy

Resolution Steps:

1. Immediately roll back to the previous version
2. If rollback is not possible (database migration applied): fix forward with a hotfix
3. Verify rollback success
### Messaging Failures

#### 8.17 Kafka Broker Down

Impact Assessment: SEV2 if a single broker; SEV1 if multiple brokers or the entire cluster

Resolution Steps:

- Single broker failure: partitions redistribute automatically; investigate the root cause, restart or replace the broker
- Multiple broker failure: assess data durability, check AWS MSK health, restore from backup if necessary

Note: Simpaisa is evaluating migration from Kafka to NSQ.
#### 8.18 Consumer Lag Causing Webhook Delays

Resolution Steps:

- If consumer instances are unhealthy: restart them, scale up
- If a poison message: skip or dead-letter it, fix the processing logic
- If a genuine throughput issue: scale the consumer group, increase the batch size
- Communicate to merchants: webhooks are delayed but will be delivered
#### 8.19 Message Queue Backlog

Resolution Steps:

- Scale consumers
- If a downstream bottleneck: fix the bottleneck first
- If a traffic spike: consider rate limiting at the producer
### Security Incidents

#### 8.20 Suspected Data Breach

Impact Assessment: SEV1 — ALWAYS

Resolution Steps:

1. Contain: isolate affected systems, revoke credentials, block offending IPs
2. Notify: CDO immediately, legal counsel immediately, PCI QSA within 24 hours if cardholder data is involved
3. Investigate: engage forensics, full scope assessment, timeline reconstruction
4. Remediate: patch the vulnerability, rotate credentials, implement additional monitoring
#### 8.21 DDoS Attack

Resolution Steps:

1. Enable AWS Shield Advanced
2. Activate WAF rate limiting rules
3. Scale up ALB and application instances
4. Engage the AWS DDoS Response Team
#### 8.22 Credential Compromise

Resolution Steps:

1. Immediately rotate the compromised credential
2. Revoke all active sessions
3. Audit all activity using the compromised credential
4. Notify the affected merchant
#### 8.23 Webhook Spoofing Detected

Resolution Steps:

1. Block the offending source
2. Verify webhook signature validation is enforced
3. Audit all webhook-triggered actions during the suspicious period
4. Rotate webhook secrets
5. Notify affected merchants
#### 8.24 Unusual Transaction Patterns (Potential Fraud)

Resolution Steps:

1. Do NOT immediately block — assess first
2. Analyse the pattern
3. If fraud is confirmed: suspend processing, notify the merchant, report to the authorities
### Infrastructure Failures

#### 8.25 AWS Availability Zone Failure

Impact Assessment: SEV1 if single-AZ deployment; SEV2 if Multi-AZ with healthy failover

Resolution Steps:

- If Multi-AZ is working: monitor the remaining AZs, scale up if needed
- If failover failed: manually remove unhealthy targets, scale up in healthy AZs
- Post-recovery: verify services redeploy, rebalance traffic
#### 8.26 ALB Unhealthy Targets

Resolution Steps:

- If instances crashed: restart or replace them
- If the health check is failing but the app is running: fix the health check issue
- If a capacity issue: scale up
#### 8.27 NAT Gateway Failure (Outbound to Channels Blocked)

Impact Assessment: SEV1 — all payment channel connectivity lost

Resolution Steps:

- If the NAT Gateway failed: create a new NAT Gateway in a healthy AZ, update route tables
- If throttled: split traffic across multiple NAT Gateways
- If the route table is misconfigured: correct the route table entry
#### 8.28 DNS Resolution Failure

Resolution Steps:

- If a Route 53 issue: check the AWS Service Health Dashboard
- If a VPC resolver issue: restart the VPC DNS resolver
- If a specific domain: check the domain's DNS configuration
- Temporary workaround: add hosts file entries on critical instances (last resort)
## 9. Escalation Matrix

### Time-Based Escalation
| Time Since Detection | Action |
|---|---|
| 0 minutes | Primary on-call alerted |
| 5 minutes | If not acknowledged: Secondary on-call alerted |
| 10 minutes | If not acknowledged: IC on-call + CDO alerted |
| 15 minutes | If SEV1 not contained: All senior engineers engaged |
| 30 minutes | If SEV1 not resolved: CDO to consider merchant communication |
| 1 hour | If SEV1 not resolved: CDO to consider regulatory notification |
| 2 hours | If SEV1/SEV2 not resolved: Executive review of situation |
### Severity-Based Escalation
| Severity | Immediate | 15 Minutes | 30 Minutes | 1 Hour | 2 Hours |
|---|---|---|---|---|---|
| SEV1 | On-call, IC, CDO | All senior engineers | Merchant comms drafted | Regulatory assessment | Executive review |
| SEV2 | On-call, IC | CDO notified | Senior engineer if needed | Merchant comms if needed | Upgrade to SEV1 if unresolved |
| SEV3 | On-call | IC if complex | — | CDO daily summary | — |
| SEV4 | On-call (next business day) | — | — | — | — |
### CDO Notification Criteria
| Scenario | Notification Timing |
|---|---|
| SEV1 declared | Immediately |
| SEV2 declared | Within 30 minutes |
| Any data breach (suspected or confirmed) | Immediately |
| Regulatory reporting required | Immediately |
| Merchant financial impact confirmed | Within 30 minutes |
| Single channel down above 30 min (peak hours) | Within 30 minutes |
| Media enquiry about an incident | Immediately |
| SEV2 unresolved after 2 hours | Immediately (upgrade to SEV1) |
### Regulatory Notification Criteria
| Jurisdiction | Trigger | SLA |
|---|---|---|
| Pakistan (SBP) | Payment system disruption | As per SBP guidelines |
| Bangladesh (BB) | MFS service disruption | As per BB guidelines |
| All (PCI DSS) | Cardholder data breach | 24 hours to QSA |
## 10. Post-Incident Review (PIR)

### Blameless Review Process

Simpaisa conducts blameless post-incident reviews. The principles are:

- People are not the root cause. Systems, processes, and tooling are.
- The goal is learning, not blame. We want to understand what happened and prevent recurrence.
- Hindsight is 20/20. Decisions made during the incident were the best possible with the information available at the time.
- Every incident is an opportunity to improve our systems and processes.
### PIR Scheduling
| Severity | PIR Required? | Scheduling |
|---|---|---|
| SEV1 | Mandatory | Within 24 hours of resolution |
| SEV2 | Mandatory | Within 48 hours of resolution |
| SEV3 | Required | Within 1 week of resolution |
| SEV4 | Optional | At team discretion |
### PIR Attendance

- Incident Commander
- Technical Lead
- All engineers who worked on the incident
- Product owner for affected product(s)
- CDO (for SEV1; optional for SEV2)
- Anyone else who can provide context
### PIR Template

Post-Incident Review: Incident Title

- Date, Incident Date, Severity, Duration, IC, Tech Lead, Beads Issue
- Summary: 2-3 sentence summary
- Timeline table: Time (UTC) and Event
- Impact: transactions affected, financial impact, merchants affected, countries, duration, regulatory reporting
- Root Cause: detailed technical explanation
- Contributing Factors
- What Went Well
- What Could Be Improved
- Action Items table: ID, Action, Owner, Due Date, Beads Issue
- Lessons Learnt
### Action Item Tracking

- All PIR action items must be tracked as beads issues
- Review open PIR action items in weekly engineering stand-ups
- Action items should have clear owners and due dates
### Trend Analysis

On a quarterly basis, review:

- Total incidents by severity
- Mean time to detect (MTTD)
- Mean time to acknowledge (MTTA)
- Mean time to resolve (MTTR)
- Most common root cause categories
- Repeat incidents (same root cause recurring)
- Action item completion rate
- PIR completion rate
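As a sketch of how the MTTA/MTTR roll-up might be computed from exported incident records (the record shape is assumed for illustration; real data would come from beads exports):

```python
# Illustrative sketch of the quarterly MTTA/MTTR roll-up. The incident
# record shape is an assumption, not a defined export format.
from datetime import datetime
from statistics import mean

incidents = [  # hypothetical quarterly export
    {"detected": datetime(2026, 4, 3, 9, 0), "acknowledged": datetime(2026, 4, 3, 9, 4),
     "resolved": datetime(2026, 4, 3, 9, 50)},
    {"detected": datetime(2026, 5, 11, 14, 0), "acknowledged": datetime(2026, 5, 11, 14, 10),
     "resolved": datetime(2026, 5, 11, 16, 0)},
]

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

mtta = mean(minutes_between(i["detected"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```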
## 11. Regulatory Reporting Requirements

### Pakistan — State Bank of Pakistan (SBP)
| Item | Details |
|---|---|
| SLA | As per SBP Payment Systems Department guidelines |
| Trigger | Disruption to payment systems operating under SBP licence |
| Contact | SBP Payment Systems Department — TBC |
Payment systems requiring reporting:

- 1Link transactions
- RAAST transactions
- Mobile wallet integrations (Easypaisa, JazzCash)
- Any disruption to interbank payment clearing
### Bangladesh — Bangladesh Bank
| Item | Details |
|---|---|
| SLA | As per Bangladesh Bank MFS guidelines |
| Trigger | Disruption to Mobile Financial Services (MFS) |
| Contact | Bangladesh Bank Payment Systems Department — TBC |
MFS services requiring reporting:

- bKash integration disruptions
- BRAC Bank integration disruptions
- Any disruption affecting Bangladeshi mobile money transactions
### PCI DSS — Breach Notification
| Item | Details |
|---|---|
| SLA | Notify QSA within 24 hours; card brands within 24-72 hours |
| Trigger | Confirmed or suspected breach of cardholder data |
| QSA Contact | TBC |
Cardholder data breach includes:

- Primary Account Number (PAN) exposure
- CVV/CVC exposure
- Track data exposure
- PIN data exposure
- Any unauthorised access to the cardholder data environment
PCI DSS Breach Response Requirements:

1. Immediately contain and limit the exposure
2. Notify the acquirer and payment brands
3. Engage a PCI Forensic Investigator (PFI) if required
4. Preserve all evidence
5. Provide all required documentation to the card brands
### Nepal and Iraq
STATUS: Regulatory reporting requirements for Nepal and Iraq to be documented. Engage local legal counsel to confirm obligations.
| Jurisdiction | Regulator | Reporting Requirements |
|---|---|---|
| Nepal | Nepal Rastra Bank | To be confirmed |
| Iraq | Central Bank of Iraq | To be confirmed |
## 12. Business Continuity

### Degraded Mode Operations
| Failed Component | Can Still Process | Cannot Process | Notes |
|---|---|---|---|
| Single payment channel | All other channels | Transactions for that specific channel | Merchants can be routed to alternative channels |
| Multiple channels in one country | All other countries, unaffected channels | Affected country transactions through those channels | Consider if alternative local channels exist |
| RDS primary | Nothing during failover (60-120s) | All transactions | Multi-AZ failover should restore automatically |
| Redis | Transactions (with degraded performance) | Rate limiting, session management | Application falls back to DB; monitor DB load |
| Kafka | Synchronous transaction processing | Asynchronous webhooks, notifications | Webhooks will be delayed, not lost |
| Single AZ | All transactions (with reduced capacity) | — | Scale remaining AZ; monitor capacity |
| ALB | — | All inbound traffic | Failover to backup ALB or DNS-based routing |
| NAT Gateway | Internal processing | All outbound channel calls | Create new NAT Gateway in healthy AZ |
### Manual Fallback Procedures

STATUS: To be developed.

- Manual transaction processing: document a manual process for critical merchant transactions with dual authorisation
- Offline reconciliation: template spreadsheets for manual reconciliation
- Partner direct communication: contact list for all payment channel partner operations teams
### Merchant Communication During Extended Outages
| Duration | Action |
|---|---|
| 0-15 minutes | Internal investigation; no external communication |
| 15-30 minutes | Status page updated; large merchants notified via email |
| 30-60 minutes | All affected merchants notified; estimated recovery time provided |
| 1-4 hours | Hourly updates to merchants; CDO involved in merchant communications |
| 4+ hours | Key merchant account managers engaged for direct outreach |
### Recovery Priority Order
| Priority | Component | Rationale |
|---|---|---|
| 1 | Database (RDS) | Foundation for all services |
| 2 | Cache (Redis) | Required for performance and rate limiting |
| 3 | Pay-Ins | Highest transaction volume; direct merchant revenue impact |
| 4 | Pay-Outs | Merchant settlements and disbursements |
| 5 | Messaging (Kafka) | Webhooks and async processing |
| 6 | Cards | Card transaction processing |
| 7 | Remittances | Cross-border transfers |
| 8 | Merchant Dashboard | Merchant self-service (not transaction-critical) |
## 13. Testing and Drills

### Quarterly Incident Response Drills

Frequency: quarterly (minimum). Duration: 1-2 hours. Participants: all engineers who may be on-call.
| Quarter | Drill Type | Scenario |
|---|---|---|
| Q1 | Tabletop | SEV1: Complete payment processing failure |
| Q2 | Live simulation | SEV2: Single channel failure with cascade |
| Q3 | Tabletop | SEV1: Data breach with regulatory reporting |
| Q4 | Live simulation | SEV2: Database failover with data consistency check |
Drill Process:

1. Scenario prepared in advance (not shared with participants)
2. Incident is declared — participants respond as in a real incident
3. IC assigned, roles filled
4. Team works through diagnosis and resolution
5. Drill coordinator introduces complications
6. Drill ends with a debrief
### Chaos Engineering / Game Days

STATUS: To be established after incident response maturity reaches 2/5.

Scope for game days:

- Single channel connectivity failure
- Cache failure (Redis node termination)
- Single application instance failure
- Increased latency injection into external calls
### Tabletop Exercises for SEV1 Scenarios

- Complete platform outage — RDS failure with failover not working
- Data breach — cardholder data exposed via an API vulnerability
- Multi-channel failure — major channels down in Pakistan during peak hours
- Security incident — compromised credentials used to exfiltrate data
- Regulatory crisis — incident triggering the 2-hour internal regulatory reporting deadline
- Bad deployment — code change causes silent data corruption
- DDoS during peak — volumetric attack during Eid/holiday peak processing
### Post-Drill Review and Improvement

- Conduct a 30-minute debrief immediately after the drill
- Document findings in a beads issue
- Update this playbook with process changes
- Update runbooks with new diagnostic or resolution steps
- Track improvements as beads issues with owners and due dates
- Review improvement completion before the next drill
## 14. Tools and Access

### Observability Tools
| Tool | Purpose | URL | Access |
|---|---|---|---|
| Grafana | Metrics, dashboards, alerting | TBC | SSO / LDAP |
| Jaeger | Distributed tracing | TBC | SSO / LDAP |
| OpenSearch | Centralised logs | TBC | SSO / LDAP |
### Key Grafana Dashboards
| Dashboard | Purpose |
|---|---|
| Payment Processing Overview | Overall transaction success rates, volumes, latency |
| Channel Health | Per-channel success rates, error rates, latency |
| Infrastructure | RDS, Redis, Kafka, ALB, EC2/ECS metrics |
| Application | JVM metrics, thread pools, connection pools |
| Alerts | Active and recent alert history |
### Jaeger Trace Search
| Search | Purpose |
|---|---|
| Service: pay-in-service, Status: Error | Find failing pay-in transactions |
| Service: channel-integration, Tag: channel=name | Traces for a specific channel |
| Min Duration: 3s | Find slow transactions |
| Tag: transaction.id=id | Trace a specific transaction end-to-end |
### AWS Console Access
| Service | Purpose | Notes |
|---|---|---|
| RDS | Database health, failover, performance insights | Shared across products |
| ElastiCache | Redis cluster health | |
| EC2 / ECS | Application instance health | |
| ALB | Load balancer health, target groups | |
| VPC | Network configuration, NAT Gateways | |
| CloudTrail | API audit trail | For security investigations |
### Slack Channels
| Channel | Purpose |
|---|---|
| alerts | Automated alert notifications from Grafana |
| incidents | General incident discussion, incident declaration |
| inc-YYYY-MM-DD-brief-description | Per-incident channel (created at declaration) |
| on-call | On-call handover, scheduling, general on-call discussion |
| post-incident | PIR scheduling, action item tracking |
### Incident Tracking

All incidents are tracked in beads (bd CLI):

- Create an incident issue: `bd create "INC: Brief description of the incident"`
- Update status during the incident: `bd update <id> --status in-progress`
- Close when resolved: `bd close <id>`
- Link PIR action items: `bd dep <action-id> --on <incident-id>`
## 15. Appendix: Quick Reference Card

SIMPAISA INCIDENT RESPONSE — QUICK REFERENCE

SEVERITY DEFINITIONS

- SEV1 Critical: payment processing down / data breach / all merchants / regulatory triggered
- SEV2 High: single product down / single channel down / degraded >5 min / partial data exposure
- SEV3 Medium: non-critical degraded / single merchant / elevated errors
- SEV4 Low: cosmetic / docs / non-prod

RESPONSE TIMES

- SEV1: respond 5 min, update 15 min, resolve 1 hr
- SEV2: respond 15 min, update 30 min, resolve 4 hr
- SEV3: respond 1 hr, update 4 hr, resolve 24 hr
- SEV4: next business day
FIRST STEPS WHEN PAGED

1. Acknowledge the alert
2. Open Grafana — assess impact
3. Assign severity
4. Create the Slack channel: inc-YYYY-MM-DD-description
5. Create a beads issue
6. Post the incident header in the channel
7. If SEV1/2: notify the CDO (Daniel O'Reilly)
8. Assign IC, Tech Lead, Comms
CDO NOTIFICATION REQUIRED

- Any SEV1 — immediately
- Any SEV2 — within 30 minutes
- Data breach (suspected or confirmed)
- Regulatory reporting triggered
- Financial impact confirmed

REGULATORY SLAs

- Simpaisa internal: 2 HOURS from detection (all markets)
- Pakistan (SBP): per SBP guidelines
- Bangladesh (BB): per BB MFS guidelines
- PCI DSS (breach): 24 hours to QSA
KEY DASHBOARDS

- Grafana: TBC
- Jaeger: TBC
- OpenSearch: TBC
- AWS Console: TBC

SLACK CHANNELS

- Alerts: alerts
- Incidents: incidents
- On-call: on-call

ESCALATION CONTACTS

- Primary On-Call: TBC — rotation schedule
- Secondary On-Call: TBC — rotation schedule
- IC On-Call: TBC — rotation schedule
- CDO: Daniel O'Reilly
## Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-03 | Daniel O'Reilly (CDO) | Initial version |
## Items Requiring Follow-Up

- Grafana, Jaeger, and OpenSearch dashboard URLs
- On-call rotation tool selection and setup (PagerDuty / Opsgenie)
- On-call rotation participants and schedule
- SBP contact details and reporting requirements
- Bangladesh Bank contact details and MFS reporting guidelines
- Nepal Rastra Bank reporting requirements
- Central Bank of Iraq reporting requirements
- PCI QSA contact details
- Visa and Mastercard breach notification contacts (via acquirer)
- Synthetic monitoring implementation
- Manual fallback procedures development
- Payment channel partner operations contact list
- Chaos engineering programme initiation (maturity gate: 2/5)
- Status page tool selection and setup
- Alert routing integration (Grafana to PagerDuty/Opsgenie)
This playbook is effective immediately and supersedes any previous incident response documentation. All engineering staff are expected to familiarise themselves with this document and participate in quarterly incident response drills.
## Related Documents
| Document | Relevance |
|---|---|
| Security Incident Response Procedure (SIRP) | ISMS incident response procedure |
| W-12: Security Operations Ways of Work | SecOps ways of work |
| ADR-SECURITY-2026-04-048: Audit Trail Architecture | Audit trail used during incident investigation |
| ADR-INFRA-2026-04-066: DNS Failover Strategy | DNS failover during infrastructure incidents |
| Threat Model: API Gateway & Platform | Gateway threats that trigger incident response |