# Simpaisa Incident Response Playbook
| Owner | Classification | Review Date | Status |
|---|---|---|---|
| Security | Confidential | April 2027 | Active |
Document Owner: Daniel O'Reilly, Chief Digital Officer
Version: 1.0
Created: 2026-04-03
Last Reviewed: 2026-04-03
Next Review: 2026-07-03
Classification: Internal — Confidential
Maturity Level: 1/5 (Initial) — Target: 3/5 within 6 months
## Table of Contents

1. Executive Summary and Purpose
2. Scope
3. Severity Classification
4. Roles and Responsibilities
5. Incident Lifecycle
6. Detection and Alerting
7. Communication Templates
8. Runbooks
9. Escalation Matrix
10. Post-Incident Review (PIR)
11. Regulatory Reporting Requirements
12. Business Continuity
13. Testing and Drills
14. Tools and Access
15. Appendix: Quick Reference Card
## 1. Executive Summary and Purpose
Simpaisa is a payment gateway processing 270M+ transactions worth $1B+ annually across Pakistan, Bangladesh, Nepal, Iraq, and Egypt. Our platform handles Pay-Ins, Pay-Outs, Remittances, and Cards through 20+ payment channels including mobile wallets, bank integrations, direct carrier billing, card networks, and national payment rails.
Any disruption to our services has immediate financial, reputational, and regulatory consequences. This playbook establishes a structured, repeatable incident response process to:
- Minimise downtime and financial impact to merchants and end-users
- Ensure regulatory compliance across all jurisdictions (including Simpaisa's 2-hour internal incident reporting SLA)
- Provide clear escalation paths so the right people are engaged at the right time
- Enable blameless learning from every incident to continuously improve platform resilience
- Standardise communication to merchants, regulators, and internal stakeholders during incidents
This is a living document. As our incident response maturity improves, this playbook will be updated to reflect new tooling, processes, and lessons learnt.
### Agentic AI Integration

Simpaisa follows an Agentic AI SDLC-first approach. AI agents assist with:

- Automated incident detection and initial triage
- Log correlation and root cause analysis suggestions
- Drafting communication templates and status updates
- Post-incident review data gathering and timeline reconstruction
Human judgement remains authoritative for all escalation, resolution, and communication decisions.
## 2. Scope

### What Constitutes an Incident

An incident is any event that:

- Causes or threatens to cause degradation or loss of payment processing capability
- Results in unauthorised access to payment data, merchant data, or internal systems
- Triggers or may trigger regulatory reporting obligations in any jurisdiction
- Causes financial loss to Simpaisa, merchants, or end-users through system malfunction
- Results in incorrect transaction processing (duplicate charges, incorrect amounts, misrouted payments)
- Causes breach of SLA commitments to merchants or payment channel partners
### In Scope
| Category | Examples |
|---|---|
| Payment Processing | Transaction failures, channel outages, settlement errors, reconciliation mismatches |
| Platform Availability | Service outages, degraded performance, capacity exhaustion |
| Data Security | Breaches, unauthorised access, data exfiltration, credential compromise |
| Infrastructure | AWS failures, database issues, cache failures, messaging system failures |
| Integration | Payment channel API failures, webhook delivery failures, partner connectivity loss |
| Compliance | PCI DSS violations, regulatory reporting failures |
### Out of Scope

- Planned maintenance windows (covered by change management)
- Feature requests or enhancement work
- Non-production environment issues (unless blocking critical deployments)
- Individual merchant configuration issues (handled by support)
- General IT support issues (workstation, email, etc.)
## 3. Severity Classification

### Severity Definitions
| Severity | Name | Definition | Examples |
|---|---|---|---|
| SEV1 | Critical | Complete loss of payment processing or confirmed data breach affecting multiple merchants or jurisdictions | Payment processing fully down across all channels; confirmed data breach with cardholder data exposure; all merchants unable to process; complete database failure; regulatory reporting triggered |
| SEV2 | High | Single product or payment channel down, significant performance degradation, or partial data exposure | Pay-Ins completely unavailable; Easypaisa channel down; P95 latency >5s for >5 minutes; partial data exposure; single country entirely affected |
| SEV3 | Medium | Non-critical service degraded, single merchant affected, or elevated error rates not yet impacting overall availability | Single merchant experiencing failures; error rate elevated but less than 10 percent of transactions; non-critical API endpoint degraded; webhook delivery delays |
| SEV4 | Low | Cosmetic issues, documentation errors, non-production problems, or minor anomalies with no user impact | UI rendering issue on merchant dashboard; incorrect error message text; staging environment instability; minor log anomalies |
### SLA Targets
| Metric | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Response Time | 5 minutes | 15 minutes | 1 hour | Next business day |
| Acknowledge | 10 minutes | 30 minutes | 2 hours | Next business day |
| First Update | 15 minutes | 30 minutes | 4 hours | N/A |
| Update Frequency | Every 15 minutes | Every 30 minutes | Every 4 hours | As needed |
| Resolution Target | 1 hour | 4 hours | 24 hours | 5 business days |
| Post-Incident Review | Mandatory within 24 hours | Mandatory within 48 hours | Required within 1 week | Optional |
| CDO Notification | Immediate | Within 30 minutes | Daily summary | N/A |
### Severity Upgrade Criteria

An incident must be upgraded if:

- A SEV2 is not resolved within 2 hours — upgrade to SEV1
- A SEV3 affects additional merchants or channels — upgrade to SEV2
- Any severity reveals a data breach component — upgrade to SEV1
- Any severity triggers regulatory reporting obligations — upgrade to at least SEV2
- Customer/merchant financial impact is confirmed — upgrade by one level
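Where tooling automates severity handling, the rules above can be encoded directly. The sketch below is illustrative only: the function shape, parameter names, and the order in which the rules are applied are assumptions, not part of this playbook.

```python
# Illustrative only: encodes the upgrade rules above as a pure function.
# Severity is 1 (critical) to 4 (low); a lower number is more severe.

def upgraded_severity(
    severity: int,
    unresolved_minutes: int = 0,
    additional_merchants_or_channels: bool = False,
    data_breach: bool = False,
    regulatory_reporting: bool = False,
    financial_impact_confirmed: bool = False,
) -> int:
    """Return the severity after applying the playbook's upgrade criteria."""
    if data_breach:
        return 1  # any data breach component is SEV1
    if severity == 2 and unresolved_minutes >= 120:
        severity = 1  # SEV2 unresolved for 2 hours becomes SEV1
    if severity == 3 and additional_merchants_or_channels:
        severity = 2  # a spreading SEV3 becomes SEV2
    if regulatory_reporting:
        severity = min(severity, 2)  # at least SEV2
    if financial_impact_confirmed:
        severity = max(severity - 1, 1)  # upgrade by one level
    return severity

assert upgraded_severity(3, additional_merchants_or_channels=True) == 2
assert upgraded_severity(2, unresolved_minutes=130) == 1
```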
## 4. Roles and Responsibilities

### Incident Roles

#### Incident Commander (IC)
The IC owns the incident from declaration to closure. They do not fix the problem themselves — they coordinate.
Responsibilities:
- Declare the incident and assign severity
- Open the incident Slack channel (naming convention: inc-YYYY-MM-DD-brief-description)
- Assign Technical Lead and Communications Lead
- Maintain the incident timeline
- Make escalation decisions
- Authorise communications to merchants and regulators
- Call for Post-Incident Review
- Create and track the beads issue for the incident
Who: Senior engineer or engineering manager on call. For SEV1, the most senior available engineer assumes IC until explicitly handed over.
#### Technical Lead (TL)
Responsibilities:
- Lead the technical investigation and diagnosis
- Coordinate engineering resources working on the incident
- Implement containment and resolution actions
- Provide technical updates to the IC
- Document technical findings for the PIR
- Verify resolution and confirm service recovery
Who: The engineer with deepest knowledge of the affected system. May change during an incident if a different specialist is needed.
#### Communications Lead (Comms)
Responsibilities:
- Draft and send all external communications (merchant notifications, status page updates)
- Draft regulatory notifications for IC/CDO approval
- Manage the Slack incident channel information flow
- Ensure update frequency SLAs are met
- Coordinate with the merchant support team
Who: Product or support team member. For SEV1, a dedicated person must be assigned.
### CDO Escalation Criteria

The CDO (Daniel O'Reilly) must be notified when:

- Any SEV1 incident is declared — immediately
- Any SEV2 incident is declared — within 30 minutes
- Any incident requires regulatory reporting in any jurisdiction
- Any incident involves a confirmed or suspected data breach
- Any incident has confirmed financial impact to merchants or Simpaisa
- Any incident is likely to receive media attention
- A SEV2 incident is not resolved within 2 hours
- A single payment channel has been down for more than 30 minutes during peak hours
Contact method: Slack DM + phone call for SEV1. Slack DM for SEV2. Daily summary for SEV3.
### On-Call Rotation Structure
STATUS: To be established. The following is the target structure.
| Rotation | Coverage | Staffing |
|---|---|---|
| Primary On-Call | 24/7, weekly rotation | Engineers (minimum 4 in rotation) |
| Secondary On-Call | Escalation backup | Senior engineers (minimum 3 in rotation) |
| IC On-Call | Weekday business hours + weekend coverage | Engineering managers / senior leads |
| Comms On-Call | Business hours with on-call for SEV1 | Product / support team |
On-call expectations:
- Acknowledge pages within 5 minutes
- Have laptop and VPN access available at all times during on-call
- No travel to areas without reliable internet during on-call
- Handover briefing at rotation change
## 5. Incident Lifecycle

### Overview

Detection → Triage → Containment → Resolution → Recovery → Post-Incident Review
### Phase 1: Detection
Objective: Identify that an incident is occurring as quickly as possible.
Automated Detection:

- Grafana alert fires based on predefined thresholds (see Section 6)
- Alert routes to on-call via the configured notification channel
- On-call acknowledges the alert within 5 minutes

Manual Detection:

- Engineer, merchant, or support team identifies anomalous behaviour
- Reporter posts in the incidents Slack channel with initial details
- On-call engineer investigates and determines whether an incident should be declared

AI-Assisted Detection:

- Agentic AI monitors aggregate patterns across channels and products
- AI flags anomalies that may not trigger individual threshold alerts
- On-call engineer reviews AI-flagged anomalies and determines severity
Decision: Is this an incident?
- If transaction success rate has dropped — yes, declare an incident
- If a payment channel is returning errors — yes, declare an incident
- If latency is elevated but transactions succeed — monitor for 5 minutes, then declare if not improving
- If only non-production is affected — no, handle as BAU
- If a single merchant reports issues but metrics look normal — investigate as a potential SEV3
### Phase 2: Triage
Objective: Classify the incident, mobilise the right people, and establish communications.
Steps:

1. Assign severity using the classification matrix in Section 3
2. Create the incident channel using the naming convention inc-YYYY-MM-DD-brief-description
3. Create a beads issue with the severity label
4. Post the incident header in the channel (use the template from Section 7)
5. Assign roles: IC, Technical Lead, Communications Lead
6. Notify stakeholders per the escalation matrix
7. Assess blast radius:
    - Which products are affected? (Pay-Ins, Pay-Outs, Remittances, Cards)
    - Which channels are affected?
    - Which countries/jurisdictions are affected?
    - How many merchants are impacted?
    - Is there a regulatory reporting obligation?
### Phase 3: Containment
Objective: Stop the incident from getting worse. Limit the blast radius.
Steps:

1. Identify containment options:
    - Can the affected channel be isolated without impacting others?
    - Can traffic be rerouted to healthy channels?
    - Should the affected service be taken out of the load balancer?
    - Is a rollback of a recent deployment needed?
2. Implement containment — prefer reversible actions:
    - Disable the failing channel in configuration
    - Scale up healthy instances to absorb redirected traffic
    - Enable circuit breakers if not already triggered
    - Roll back the last deployment if suspected as the cause
3. Verify containment:
    - Confirm the blast radius is not expanding
    - Confirm healthy services remain healthy
    - Monitor error rates for secondary impacts
4. Communicate containment status to the IC and update the incident channel
### Phase 4: Resolution
Objective: Fix the root cause and restore full service.
Steps:

1. Diagnose the root cause:
    - Review Grafana dashboards for anomalies
    - Search Jaeger traces for failing transactions
    - Query OpenSearch logs for errors and exceptions
    - Check recent deployments and configuration changes
    - Review the AWS health dashboard for infrastructure issues
2. Implement the fix:
    - Apply a code fix, configuration change, or infrastructure remediation
    - For code changes: follow the expedited deployment process (peer review still required for SEV1/2)
    - Document all changes made during the incident
3. Test the fix:
    - Verify in staging if time permits (SEV3/4)
    - For SEV1/2: verify the fix addresses the root cause; accept the risk of direct-to-production if necessary
    - Run synthetic transactions through the affected paths
### Phase 5: Recovery
Objective: Restore full normal operations and verify stability.
Steps:

1. Re-enable affected services/channels gradually
2. Monitor recovery metrics:
    - Transaction success rate returning to baseline
    - Latency returning to normal
    - Error rates dropping to normal levels
    - Queue backlogs draining
3. Process backlogged transactions if applicable
4. Verify data consistency:
    - Check for stuck transactions
    - Reconcile any transactions that were in flight during the incident
    - Verify settlement files are correct
5. Communicate resolution to merchants and stakeholders
6. Stand down the incident team — the IC makes the call
### Phase 6: Post-Incident Review

Objective: Learn from the incident and prevent recurrence. See Section 10 for full details.

### Decision Trees

#### Transaction Failure Rate Increasing
Transaction failure rate >5%?

- YES: Which channels are affected?
    - ALL channels: SEV1 — likely a platform issue (DB, cache, app). Check RDS health, Redis health, application logs
    - MULTIPLE channels: SEV2 — possible shared dependency. Check shared middleware, network connectivity, DNS
    - SINGLE channel: SEV2 — channel-specific issue. Check channel API status, integration logs, circuit breaker state
- NO: Monitor. Set an alert if >3% is sustained for 5 minutes.
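For teams automating triage, this decision tree maps onto a small helper. A minimal sketch, assuming a simple channel-to-failure-rate input; it is not an existing Simpaisa tool.

```python
# Illustrative triage helper mirroring the decision tree above.
# channel_failure_rates maps channel name -> failure rate (0.0 to 1.0);
# the channel names and thresholds are assumptions for illustration.

def triage_failure_rates(channel_failure_rates: dict[str, float]) -> str:
    failing = [c for c, rate in channel_failure_rates.items() if rate > 0.05]
    if not failing:
        return "monitor"  # below 5%: watch for >3% sustained over 5 minutes
    if len(failing) == len(channel_failure_rates):
        return "SEV1: likely platform issue (DB, cache, app)"
    if len(failing) > 1:
        return "SEV2: possible shared dependency (middleware, network, DNS)"
    return f"SEV2: channel-specific issue on {failing[0]}"

print(triage_failure_rates({"easypaisa": 0.40, "jazzcash": 0.01, "bkash": 0.02}))
```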
#### Latency Spike

P95 latency >2s?

- YES: Is it affecting transaction success?
    - YES: Transactions timing out — SEV2. Check DB query performance, connection pools, external API latency
    - NO: Transactions slow but completing — SEV3. Check query plans, cache hit rates, JVM garbage collection
- NO: Monitor. Investigate if >1.5s is sustained.
#### Suspected Security Incident

- Confirmed data breach: SEV1 IMMEDIATELY. Engage the security team, isolate affected systems, notify the CDO
- Suspicious activity (unusual patterns): SEV2. Investigate scope, preserve evidence, assess data exposure
- Automated scan/probe detected: SEV4. Verify WAF rules, monitor for escalation
- Credential compromise suspected: SEV2. Rotate credentials immediately, audit access logs
## 6. Detection and Alerting

### Observability Stack
| Component | Tool | Purpose |
|---|---|---|
| Metrics | OpenTelemetry to Grafana | Time-series metrics, dashboards, alerting |
| Traces | OpenTelemetry to Jaeger | Distributed tracing across services |
| Logs | OpenTelemetry to OpenSearch | Centralised log aggregation and search |
### Key Metrics to Monitor

#### Payment Processing Metrics
| Metric | Normal Range | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Transaction Success Rate (overall) | >99% | <98% | <95% |
| Transaction Success Rate (per channel) | >97% | <95% | <90% |
| P50 Latency | <500ms | >1s | >2s |
| P95 Latency | <1.5s | >3s | >5s |
| P99 Latency | <3s | >5s | >10s |
| Error Rate (per channel) | <1% | >3% | >5% |
| Transactions Per Second | Baseline ±20% | ±40% | ±60% or zero |
#### Infrastructure Metrics

| Metric | Normal Range | Warning Threshold | Critical Threshold |
|---|---|---|---|
| RDS CPU Utilisation | <60% | >75% | >90% |
| RDS Connection Count | <70% of max | >80% of max | >90% of max |
| RDS Replication Lag | <100ms | >500ms | >2s |
| Redis Memory Usage | <70% | >80% | >90% |
| Redis Connection Count | <80% of max | >85% | >95% |
| Kafka Consumer Lag | <1,000 messages | >5,000 messages | >50,000 messages |
| JVM Heap Usage | <70% | >80% | >90% |
| ALB 5xx Rate | <0.1% | >0.5% | >1% |
| ALB Healthy Host Count | All targets healthy | 1 unhealthy | >1 unhealthy |
### Grafana Alert Rules

#### Critical Alerts (Page immediately)

- TransactionSuccessRateCritical: overall transaction success rate <95% for 2 minutes
- ChannelSuccessRateCritical: channel success rate <90% for 3 minutes
- TransactionLatencyP95Critical: P95 transaction latency >5s for 3 minutes
- ZeroTransactions: no transactions processed in a 5-minute window, sustained for 2 minutes
- DBConnectionPoolExhausted: database connection pool >90% utilised for 1 minute
#### Warning Alerts (Notify on-call, investigate)

- ElevatedErrorRate: transaction error rate >3% for 5 minutes
- KafkaConsumerLagHigh: Kafka consumer lag >5,000 messages for 5 minutes
- RedisMemoryHigh: Redis memory usage >80% for 5 minutes
### Alert Routing and Escalation

When an alert fires in Grafana:

- Critical: page on-call immediately (PagerDuty/Opsgenie — TBC)
    - Not acknowledged in 5 min: page secondary on-call
    - Not acknowledged in 10 min: page IC on-call + CDO
- Warning: Slack alerts channel + on-call notification
    - Not acknowledged in 15 min: page on-call
    - Sustained above 30 min: upgrade to Critical
- Info: Slack alerts channel (no page)
### Synthetic Monitoring
STATUS: To be implemented.
Target synthetic checks:
| Check | Frequency | Timeout | Description |
|---|---|---|---|
| Pay-In Health | Every 1 minute | 10s | End-to-end test transaction through each active channel |
| Pay-Out Health | Every 1 minute | 10s | Verification of pay-out initiation path |
| API Gateway | Every 30 seconds | 5s | Health check endpoint for each API gateway instance |
| Channel Connectivity | Every 2 minutes | 15s | Connectivity check to each payment channel partner |
| Webhook Delivery | Every 5 minutes | 30s | End-to-end webhook delivery verification |
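Each synthetic check reduces to a small scheduled probe that emits a pass/fail signal. Below is a minimal sketch of the API Gateway check, assuming a hypothetical health URL and the 5s timeout from the table above; the real checks would run from the monitoring stack and feed Grafana.

```python
# A minimal synthetic-check sketch, not production code. The endpoint
# URL and response shape are assumptions for illustration only.
import requests

GATEWAY_HEALTH_URL = "https://api.example.internal/health"  # hypothetical URL
TIMEOUT_SECONDS = 5  # matches the API Gateway check in the table above

def check_api_gateway() -> bool:
    """Return True if the gateway health endpoint responds 200 in time."""
    try:
        resp = requests.get(GATEWAY_HEALTH_URL, timeout=TIMEOUT_SECONDS)
        return resp.status_code == 200
    except requests.RequestException:
        return False  # timeouts and connection errors count as failures

if __name__ == "__main__":
    # A scheduler (cron, Lambda, etc.) would run this every 30 seconds
    # and push the result to Grafana as a metric or raise an alert.
    print("healthy" if check_api_gateway() else "unhealthy")
```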
## 7. Communication Templates

### Internal Communications

#### Incident Channel Header (Post at Incident Declaration)
INCIDENT DECLARED

- Severity: SEV 1/2/3/4
- Summary: One-line description of the incident
- Impact: Products/channels/merchants affected
- Countries: Affected jurisdictions
- Roles: IC @name, Tech Lead @name, Comms @name
- Beads: issue ID
- Dashboard: Grafana link
- Timeline: HH:MM UTC — incident detected; HH:MM UTC — incident declared
- Next update: HH:MM UTC
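If the header is posted automatically at declaration, a Slack incoming webhook is one straightforward option. The sketch below uses a placeholder webhook URL and a reduced field set; Simpaisa's actual tooling may differ.

```python
# Hedged sketch: post the incident header to Slack via an incoming
# webhook. The webhook URL and field values are placeholders.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_incident_header(severity: int, summary: str, impact: str,
                         ic: str, beads_issue: str) -> None:
    lines = [
        ":rotating_light: INCIDENT DECLARED",
        f"Severity: SEV{severity}",
        f"Summary: {summary}",
        f"Impact: {impact}",
        f"Roles: IC {ic}",
        f"Beads: {beads_issue}",
    ]
    body = json.dumps({"text": "\n".join(lines)}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # raises on HTTP errors

post_incident_header(2, "Easypaisa channel degraded", "Pay-Ins in Pakistan",
                     "@oncall-ic", "inc-2026-04-03-easypaisa")
```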
#### Status Update Template

INCIDENT UPDATE — HH:MM UTC

- Severity: SEV X (unchanged/upgraded/downgraded)
- Status: Investigating / Contained / Resolving / Resolved
- What we know: Key findings
- What we are doing: Actions with owners
- Impact update: Current impact assessment
- Next update: HH:MM UTC
### Merchant Communications

#### Outage Notification Email
Subject: Simpaisa Service Disruption — Product/Channel — Date
Dear Merchant Name,
We are currently experiencing a service disruption affecting specific product/channel.
What is affected: Description of affected functionality
What this means for you: Specific impact on the merchant operations
What we are doing: Our engineering team is actively investigating and working to resolve the issue.
Current workaround (if applicable): Any alternative processing options available
We will provide updates every frequency until the issue is resolved.
If you have urgent questions, please contact support channel.
Regards, Simpaisa Operations Team
#### Resolution Notification Email
Subject: Simpaisa Service Restored — Product/Channel — Date
Dear Merchant Name,
The service disruption affecting specific product/channel has been resolved.
Resolution time: HH:MM UTC on Date. Total duration: X hours Y minutes
What was affected: Summary of impact
What happened: Brief, non-technical explanation of the root cause
What we have done: Summary of resolution actions. Any preventive measures being implemented.
Impact on your transactions: Details on any transactions that need attention, reconciliation notes, etc.
A detailed post-incident summary will be shared within timeframe.
If you notice any ongoing issues, please contact support channel.
Regards, Simpaisa Operations Team
#### Post-Incident Summary for Merchants

Subject: Simpaisa Post-Incident Summary — Brief Description — Date

Incident Summary:

- Duration: Start time to End time UTC (total duration)
- Severity: Level
- Affected services: List
- Transaction impact: Number of failed/delayed transactions
Root Cause: Non-technical explanation of what caused the incident
Resolution: What was done to fix the issue
Preventive Measures: What we are doing to prevent recurrence — concrete actions and timelines
Transaction Reconciliation: Any guidance on reconciling transactions during the incident window
We apologise for the disruption and are committed to continuously improving our platform reliability.
### Regulatory Communications

#### SBP Payment System Disruption Report (Pakistan)

- Reporting Entity: Simpaisa (Pvt.) Limited
- Date/Time of Disruption: Start time — PKT
- Date/Time of Resolution: End time — PKT / Ongoing
- Nature of Disruption: Description
- Payment Systems Affected: Channel/rail affected
- Volume Impact: Transactions affected (count); Value affected (PKR amount)
- Root Cause: Known / Under investigation
- Remediation Steps: Actions taken
- Preventive Measures: Planned improvements
- Contact Person: Name, Designation, Phone, Email — TBC
### Status Page Updates
Investigating: We are investigating reports of issue description. Affected: channels/products. Some transactions may fail/be delayed. We will provide an update within timeframe.
Identified: The issue has been identified as brief cause. Our team is implementing a fix. Channel X transactions are currently failing/delayed. Estimated resolution: time estimate or TBC.
Resolved: The issue affecting component has been resolved. All services are operating normally. Duration: X hours Y minutes. We apologise for any inconvenience.
## 8. Runbooks

### Payment Channel Failures

#### 8.1 Single Channel Down (e.g., Easypaisa Unavailable)
Symptoms:

- Grafana alert: ChannelSuccessRateCritical for a specific channel
- Spike in HTTP 5xx or timeout errors from the channel API
- Circuit breaker tripped for the channel
- Merchant reports of failed transactions on a specific payment method
Impact Assessment:

- Which merchants rely primarily on this channel?
- What percentage of total transaction volume flows through this channel?
- Are there alternative channels available in the same country?
- Is this during peak transaction hours?
Diagnostic Steps:

1. Check circuit breaker state via the Spring Boot actuator health endpoint
2. Check channel-specific error rates in Grafana (Payment Channels dashboard → the affected channel)
3. Review channel integration logs in OpenSearch
4. Check Jaeger for failing traces — filter by service and error status
5. Verify channel partner status — check the partner status page, contact partner technical support
6. Check for recent deployments
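For step 1, if services expose the Resilience4j circuit breaker health indicator through the actuator, breaker state can be read programmatically. The host, endpoint layout, and response shape below are assumptions; verify against the actual actuator configuration.

```python
# Hedged sketch of diagnostic step 1: read circuit breaker state from a
# Spring Boot actuator health endpoint. Host and breaker names are
# placeholders; the JSON layout varies by actuator configuration.
import requests

ACTUATOR_URL = "http://pay-in-service.internal:8080/actuator/health"  # placeholder

def circuit_breaker_states() -> dict[str, str]:
    """Return breaker name -> state (e.g. CLOSED/OPEN/HALF_OPEN) if exposed."""
    health = requests.get(ACTUATOR_URL, timeout=5).json()
    breakers = health.get("components", {}).get("circuitBreakers", {})
    # Depending on configuration, per-breaker entries may sit under
    # "components" or "details"; check both.
    entries = breakers.get("components") or breakers.get("details") or {}
    return {
        name: entry.get("details", {}).get("state", "UNKNOWN")
        for name, entry in entries.items()
    }

for name, state in circuit_breaker_states().items():
    print(f"{name}: {state}")
```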
Resolution Steps:

- If the channel partner is down: confirm with the partner support team, disable the channel in configuration to fail fast, enable alternative channels if available, notify affected merchants
- If our integration is failing: check for API contract changes, review recent code/config changes, roll back if a recent deployment is the cause
- If timeout/latency is the issue: increase timeout thresholds temporarily (if safe), check whether the partner API is responding slowly, verify network connectivity
Verification Steps:

- Confirm transaction success rate for the channel returns to >97%
- Process test transactions through the channel
- Verify the circuit breaker has closed
- Check that no transactions are stuck in a pending state
Post-Resolution Checks:

- Reconcile transactions during the outage window
- Identify any transactions that need manual intervention
- Verify settlement processing for the affected period
#### 8.2 Multiple Channels Down Simultaneously

Symptoms:

- Multiple ChannelSuccessRateCritical alerts firing
- Overall transaction success rate dropping across products
- Multiple circuit breakers tripping
Impact Assessment:

- This is likely SEV1 — escalate immediately
- Determine whether this is a platform issue (shared dependency) or coincidental channel failures
Diagnostic Steps:

1. Determine the common factor: same country? Same protocol? Same start time?
2. Check shared infrastructure: NAT Gateway, DNS resolution, ALB health
3. Check for recent platform-wide changes: config management, infrastructure changes, certificate expirations
4. Review application logs for common errors
Resolution Steps:

- If shared infrastructure issue: fix the infrastructure component, verify connectivity is restored
- If bad deployment: immediate rollback to the last known good version
- If configuration change: revert the configuration change, verify each channel individually
#### 8.3 Channel Returning Incorrect Responses

Impact Assessment:

- This is a financial integrity issue — may be SEV1 depending on scale
Resolution Steps:

1. Immediately halt transactions to the affected channel if financial integrity is at risk
2. Fix the parsing/mapping issue or contact the partner
3. Test with controlled transactions before re-enabling
4. Mark all affected transactions for manual review and reconciliation
#### 8.4 Channel Timeout / Latency Spike

Resolution Steps:

- If the partner is slow: reduce the timeout to fail faster, enable the circuit breaker, route traffic to alternatives
- If on our side: scale up instances, investigate and fix the bottleneck
### Database Failures

#### 8.5 RDS Primary Failure / Failover

Impact Assessment: SEV1 — RDS is a shared resource across all products

Resolution Steps:

- For automatic failover (Multi-AZ): monitor recovery; applications should reconnect automatically
- If failover does not complete: engage AWS Support immediately (SEV1), consider promoting a read replica manually
- Verify data integrity after failover
#### 8.6 Replication Lag

Resolution Steps:

- If caused by heavy write load: identify and optimise it, defer non-critical batch operations
- If caused by long-running queries on the replica: terminate the problematic queries, optimise or reschedule them
- If replication is broken: rebuild the replica from a snapshot, engage AWS Support
#### 8.7 Connection Pool Exhaustion

Impact Assessment: SEV1 if affecting all services (shared database)

Resolution Steps:

1. Terminate long-running queries/transactions
2. Restart the leaking service instance
3. Increase the connection pool size temporarily
4. Investigate the root cause: missing @Transactional timeout, unclosed connections, deadlocks
#### 8.8 Slow Query Causing Transaction Timeouts

Resolution Steps:

1. Terminate the immediate problem query
2. Add the missing index
3. Optimise the query and deploy the fix
4. If a batch job is causing contention, stop or reschedule it
#### 8.9 Shared Database Contention Between Products

Impact Assessment: SEV1 — cross-product impact due to the shared resource

Resolution Steps:

- Stop the offending batch/query
- Reschedule the operation to off-peak hours
- Longer term: evaluate database separation per product
### Cache Failures

#### 8.10 Redis Cluster Failure

Impact Assessment: SEV2 initially — degrades to SEV1 if the database cannot handle the additional load

Resolution Steps:

- If single node failure: failover should be automatic
- If full cluster failure: verify AWS ElastiCache health, create a new cluster from backup if necessary
- If memory exhaustion: check for key space explosion, review the eviction policy, scale up
#### 8.11 Cache Poisoning (Incorrect Data Served)

Impact Assessment: SEV1 if affecting transaction processing integrity

Resolution Steps:

1. Flush the affected cache keys
2. If the scope is uncertain, flush the entire cache
3. Fix the root cause (cache update logic, race condition)
#### 8.12 ElastiCache Failover
Failover should complete automatically (typically 30-60 seconds). If applications do not reconnect, verify DNS resolution and consider rolling restart.
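To check the stale-DNS case from a suspect instance, resolving the cache endpoint directly shows what that instance currently sees. A hedged sketch with a placeholder endpoint name:

```python
# Hedged sketch: verify that the ElastiCache endpoint resolves to the
# new primary after failover. The endpoint name is a placeholder.
import socket

REDIS_ENDPOINT = "simpaisa-cache.example.cache.amazonaws.com"  # placeholder

def resolve(endpoint: str) -> list[str]:
    """Return the addresses the instance currently sees for the endpoint."""
    infos = socket.getaddrinfo(endpoint, 6379, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# If the resolved addresses still point at the failed node, applications
# holding stale DNS (or stale connection pools) need a rolling restart.
print(resolve(REDIS_ENDPOINT))
```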
### Application Failures

#### 8.13 Service Crash / Restart Loop

Resolution Steps:

- If caused by a recent deployment: roll back immediately
- If OOM: increase JVM heap settings temporarily
- If configuration error: fix and redeploy
- If external dependency failure: add a circuit breaker
#### 8.14 Memory Leak

Resolution Steps:

1. Immediate: restart the affected instance
2. Set up automated restarts as a temporary mitigation
3. Fix the leak in code and deploy
#### 8.15 Thread Pool Exhaustion

Resolution Steps:

- If threads are blocked on a slow external call: reduce the timeout, enable the circuit breaker, scale up
- If deadlock: restart the instance, fix the deadlock condition
- If a genuine capacity issue: increase the thread pool size, scale out instances
#### 8.16 Deployment Failure / Bad Deploy

Resolution Steps:

1. Immediately roll back to the previous version
2. If rollback is not possible (database migration applied): fix forward with a hotfix
3. Verify rollback success
### Messaging Failures

#### 8.17 Kafka Broker Down

Impact Assessment: SEV2 if a single broker; SEV1 if multiple brokers or the entire cluster

Resolution Steps:

- Single broker failure: partitions redistribute automatically; investigate the root cause, restart or replace the broker
- Multiple broker failure: assess data durability, check AWS MSK health, restore from backup if necessary

Note: Simpaisa is evaluating migration from Kafka to NSQ.
#### 8.18 Consumer Lag Causing Webhook Delays

Resolution Steps:

- If consumer instances are unhealthy: restart them, scale up
- If a poison message: skip or dead-letter it, fix the processing logic
- If a genuine throughput issue: scale the consumer group, increase the batch size
- Communicate to merchants: webhooks are delayed but will be delivered
#### 8.19 Message Queue Backlog

Resolution Steps:

- Scale consumers
- If a downstream bottleneck: fix the bottleneck first
- If a traffic spike: consider rate limiting at the producer
### Security Incidents

#### 8.20 Suspected Data Breach

Impact Assessment: SEV1 — ALWAYS

Resolution Steps:

1. Contain: isolate affected systems, revoke credentials, block offending IPs
2. Notify: CDO immediately, legal counsel immediately, PCI QSA within 24 hours if cardholder data is involved
3. Investigate: engage forensics, full scope assessment, timeline reconstruction
4. Remediate: patch the vulnerability, rotate credentials, implement additional monitoring
#### 8.21 DDoS Attack

Resolution Steps:

1. Enable AWS Shield Advanced
2. Activate WAF rate limiting rules
3. Scale up ALB and application instances
4. Engage the AWS DDoS Response Team
#### 8.22 Credential Compromise

Resolution Steps:

1. Immediately rotate the compromised credential
2. Revoke all active sessions
3. Audit all activity using the compromised credential
4. Notify the affected merchant
#### 8.23 Webhook Spoofing Detected

Resolution Steps:

1. Block the offending source
2. Verify webhook signature validation is enforced
3. Audit all webhook-triggered actions during the suspicious period
4. Rotate webhook secrets
5. Notify affected merchants
#### 8.24 Unusual Transaction Patterns (Potential Fraud)

Resolution Steps:

1. Do NOT immediately block — assess first
2. Analyse the pattern
3. If fraud is confirmed: suspend processing, notify the merchant, report to the authorities
### Infrastructure Failures

#### 8.25 AWS Availability Zone Failure

Impact Assessment: SEV1 if single-AZ deployment; SEV2 if Multi-AZ with healthy failover

Resolution Steps:

- If Multi-AZ is working: monitor the remaining AZs, scale up if needed
- If failover failed: manually remove unhealthy targets, scale up in healthy AZs
- Post-recovery: verify services redeploy, rebalance traffic
#### 8.26 ALB Unhealthy Targets

Resolution Steps:

- If instances crashed: restart or replace them
- If the health check is failing but the app is running: fix the health check issue
- If a capacity issue: scale up
#### 8.27 NAT Gateway Failure (Outbound to Channels Blocked)

Impact Assessment: SEV1 — all payment channel connectivity lost

Resolution Steps:

- If the NAT Gateway failed: create a new NAT Gateway in a healthy AZ, update route tables
- If throttled: split traffic across multiple NAT Gateways
- If the route table is misconfigured: correct the route table entry
#### 8.28 DNS Resolution Failure

Resolution Steps:

- If a Route 53 issue: check the AWS Service Health Dashboard
- If a VPC resolver issue: restart the VPC DNS resolver
- If a specific domain: check the domain's DNS configuration
- Temporary workaround: add hosts file entries on critical instances (last resort)
## 9. Escalation Matrix

### Time-Based Escalation
| Time Since Detection | Action |
|---|---|
| 0 minutes | Primary on-call alerted |
| 5 minutes | If not acknowledged: Secondary on-call alerted |
| 10 minutes | If not acknowledged: IC on-call + CDO alerted |
| 15 minutes | If SEV1 not contained: All senior engineers engaged |
| 30 minutes | If SEV1 not resolved: CDO to consider merchant communication |
| 1 hour | If SEV1 not resolved: CDO to consider regulatory notification |
| 2 hours | If SEV1/SEV2 not resolved: Executive review of situation |
### Severity-Based Escalation
| Severity | Immediate | 15 Minutes | 30 Minutes | 1 Hour | 2 Hours |
|---|---|---|---|---|---|
| SEV1 | On-call, IC, CDO | All senior engineers | Merchant comms drafted | Regulatory assessment | Executive review |
| SEV2 | On-call, IC | CDO notified | Senior engineer if needed | Merchant comms if needed | Upgrade to SEV1 if unresolved |
| SEV3 | On-call | IC if complex | — | CDO daily summary | — |
| SEV4 | On-call (next business day) | — | — | — | — |
### CDO Notification Criteria
| Scenario | Notification Timing |
|---|---|
| SEV1 declared | Immediately |
| SEV2 declared | Within 30 minutes |
| Any data breach (suspected or confirmed) | Immediately |
| Regulatory reporting required | Immediately |
| Merchant financial impact confirmed | Within 30 minutes |
| Single channel down above 30 min (peak hours) | Within 30 minutes |
| Media enquiry about an incident | Immediately |
| SEV2 unresolved after 2 hours | Immediately (upgrade to SEV1) |
### Regulatory Notification Criteria
| Jurisdiction | Trigger | SLA |
|---|---|---|
| Pakistan (SBP) | Payment system disruption | As per SBP guidelines |
| Bangladesh (BB) | MFS service disruption | As per BB guidelines |
| All (PCI DSS) | Cardholder data breach | 24 hours to QSA |
## 10. Post-Incident Review (PIR)

### Blameless Review Process

Simpaisa conducts blameless post-incident reviews. The principles are:

- People are not the root cause. Systems, processes, and tooling are.
- The goal is learning, not blame. We want to understand what happened and prevent recurrence.
- Hindsight is 20/20. Decisions made during the incident were the best possible with the information available at the time.
- Every incident is an opportunity to improve our systems and processes.
### PIR Scheduling
| Severity | PIR Required? | Scheduling |
|---|---|---|
| SEV1 | Mandatory | Within 24 hours of resolution |
| SEV2 | Mandatory | Within 48 hours of resolution |
| SEV3 | Required | Within 1 week of resolution |
| SEV4 | Optional | At team discretion |
### PIR Attendance

- Incident Commander
- Technical Lead
- All engineers who worked on the incident
- Product owner for affected product(s)
- CDO (for SEV1; optional for SEV2)
- Anyone else who can provide context
### PIR Template

Post-Incident Review: Incident Title

- Date, Incident Date, Severity, Duration, IC, Tech Lead, Beads Issue
- Summary: 2-3 sentence summary
- Timeline table: Time (UTC) and Event
- Impact: transactions affected, financial impact, merchants affected, countries, duration, regulatory reporting
- Root Cause: detailed technical explanation
- Contributing Factors
- What Went Well
- What Could Be Improved
- Action Items table: ID, Action, Owner, Due Date, Beads Issue
- Lessons Learnt
### Action Item Tracking

- All PIR action items must be tracked as beads issues
- Review open PIR action items in weekly engineering stand-ups
- Action items should have clear owners and due dates
### Trend Analysis

On a quarterly basis, review:

- Total incidents by severity
- Mean time to detect (MTTD)
- Mean time to acknowledge (MTTA)
- Mean time to resolve (MTTR)
- Most common root cause categories
- Repeat incidents (same root cause recurring)
- Action item completion rate
- PIR completion rate
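As a sketch of how the MTTA/MTTR roll-up might be computed from exported incident records (the record shape is assumed for illustration; real data would come from beads exports):

```python
# Illustrative sketch of the quarterly MTTA/MTTR roll-up. The incident
# record shape is an assumption, not a defined export format.
from datetime import datetime
from statistics import mean

incidents = [  # hypothetical quarterly export
    {"detected": datetime(2026, 4, 3, 9, 0), "acknowledged": datetime(2026, 4, 3, 9, 4),
     "resolved": datetime(2026, 4, 3, 9, 50)},
    {"detected": datetime(2026, 5, 11, 14, 0), "acknowledged": datetime(2026, 5, 11, 14, 10),
     "resolved": datetime(2026, 5, 11, 16, 0)},
]

def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60

mtta = mean(minutes_between(i["detected"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["detected"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```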
## 11. Regulatory Reporting Requirements

### Pakistan — State Bank of Pakistan (SBP)
| Item | Details |
|---|---|
| SLA | As per SBP Payment Systems Department guidelines |
| Trigger | Disruption to payment systems operating under SBP licence |
| Contact | SBP Payment Systems Department — TBC |
Payment systems requiring reporting:

- 1Link transactions
- RAAST transactions
- Mobile wallet integrations (Easypaisa, JazzCash)
- Any disruption to interbank payment clearing
### Bangladesh — Bangladesh Bank
| Item | Details |
|---|---|
| SLA | As per Bangladesh Bank MFS guidelines |
| Trigger | Disruption to Mobile Financial Services (MFS) |
| Contact | Bangladesh Bank Payment Systems Department — TBC |
MFS services requiring reporting:

- bKash integration disruptions
- BRAC Bank integration disruptions
- Any disruption affecting Bangladeshi mobile money transactions
### PCI DSS — Breach Notification
| Item | Details |
|---|---|
| SLA | Notify QSA within 24 hours; card brands within 24-72 hours |
| Trigger | Confirmed or suspected breach of cardholder data |
| QSA Contact | TBC |
Cardholder data breach includes:

- Primary Account Number (PAN) exposure
- CVV/CVC exposure
- Track data exposure
- PIN data exposure
- Any unauthorised access to the cardholder data environment
PCI DSS Breach Response Requirements:

1. Immediately contain and limit the exposure
2. Notify the acquirer and payment brands
3. Engage a PCI Forensic Investigator (PFI) if required
4. Preserve all evidence
5. Provide all required documentation to the card brands
### Nepal and Iraq
STATUS: Regulatory reporting requirements for Nepal and Iraq to be documented. Engage local legal counsel to confirm obligations.
| Jurisdiction | Regulator | Reporting Requirements |
|---|---|---|
| Nepal | Nepal Rastra Bank | To be confirmed |
| Iraq | Central Bank of Iraq | To be confirmed |
## 12. Business Continuity

### Degraded Mode Operations
| Failed Component | Can Still Process | Cannot Process | Notes |
|---|---|---|---|
| Single payment channel | All other channels | Transactions for that specific channel | Merchants can be routed to alternative channels |
| Multiple channels in one country | All other countries, unaffected channels | Affected country transactions through those channels | Consider if alternative local channels exist |
| RDS primary | Nothing during failover (60-120s) | All transactions | Multi-AZ failover should restore automatically |
| Redis | Transactions (with degraded performance) | Rate limiting, session management | Application falls back to DB; monitor DB load |
| Kafka | Synchronous transaction processing | Asynchronous webhooks, notifications | Webhooks will be delayed, not lost |
| Single AZ | All transactions (with reduced capacity) | — | Scale remaining AZ; monitor capacity |
| ALB | — | All inbound traffic | Failover to backup ALB or DNS-based routing |
| NAT Gateway | Internal processing | All outbound channel calls | Create new NAT Gateway in healthy AZ |
### Manual Fallback Procedures

STATUS: To be developed.

- Manual transaction processing: document a manual process for critical merchant transactions with dual authorisation
- Offline reconciliation: template spreadsheets for manual reconciliation
- Partner direct communication: contact list for all payment channel partner operations teams
### Merchant Communication During Extended Outages
| Duration | Action |
|---|---|
| 0-15 minutes | Internal investigation; no external communication |
| 15-30 minutes | Status page updated; large merchants notified via email |
| 30-60 minutes | All affected merchants notified; estimated recovery time provided |
| 1-4 hours | Hourly updates to merchants; CDO involved in merchant communications |
| 4+ hours | Key merchant account managers engaged for direct outreach |
### Recovery Priority Order
| Priority | Component | Rationale |
|---|---|---|
| 1 | Database (RDS) | Foundation for all services |
| 2 | Cache (Redis) | Required for performance and rate limiting |
| 3 | Pay-Ins | Highest transaction volume; direct merchant revenue impact |
| 4 | Pay-Outs | Merchant settlements and disbursements |
| 5 | Messaging (Kafka) | Webhooks and async processing |
| 6 | Cards | Card transaction processing |
| 7 | Remittances | Cross-border transfers |
| 8 | Merchant Dashboard | Merchant self-service (not transaction-critical) |
## 13. Testing and Drills

### Quarterly Incident Response Drills

Frequency: quarterly (minimum). Duration: 1-2 hours. Participants: all engineers who may be on-call.
| Quarter | Drill Type | Scenario |
|---|---|---|
| Q1 | Tabletop | SEV1: Complete payment processing failure |
| Q2 | Live simulation | SEV2: Single channel failure with cascade |
| Q3 | Tabletop | SEV1: Data breach with regulatory reporting |
| Q4 | Live simulation | SEV2: Database failover with data consistency check |
Drill Process:

1. Scenario prepared in advance (not shared with participants)
2. Incident is declared — participants respond as in a real incident
3. IC assigned, roles filled
4. Team works through diagnosis and resolution
5. Drill coordinator introduces complications
6. Drill ends with a debrief
### Chaos Engineering / Game Days

STATUS: To be established after incident response maturity reaches 2/5.

Scope for game days:

- Single channel connectivity failure
- Cache failure (Redis node termination)
- Single application instance failure
- Increased latency injection into external calls
### Tabletop Exercises for SEV1 Scenarios

- Complete platform outage — RDS failure with failover not working
- Data breach — cardholder data exposed via an API vulnerability
- Multi-channel failure — major channels down in Pakistan during peak hours
- Security incident — compromised credentials used to exfiltrate data
- Regulatory crisis — incident triggering the 2-hour internal regulatory reporting deadline
- Bad deployment — code change causes silent data corruption
- DDoS during peak — volumetric attack during Eid/holiday peak processing
### Post-Drill Review and Improvement

- Conduct a 30-minute debrief immediately after the drill
- Document findings in a beads issue
- Update this playbook with process changes
- Update runbooks with new diagnostic or resolution steps
- Track improvements as beads issues with owners and due dates
- Review improvement completion before the next drill
## 14. Tools and Access

### Observability Tools
| Tool | Purpose | URL | Access |
|---|---|---|---|
| Grafana | Metrics, dashboards, alerting | TBC | SSO / LDAP |
| Jaeger | Distributed tracing | TBC | SSO / LDAP |
| OpenSearch | Centralised logs | TBC | SSO / LDAP |
### Key Grafana Dashboards
| Dashboard | Purpose |
|---|---|
| Payment Processing Overview | Overall transaction success rates, volumes, latency |
| Channel Health | Per-channel success rates, error rates, latency |
| Infrastructure | RDS, Redis, Kafka, ALB, EC2/ECS metrics |
| Application | JVM metrics, thread pools, connection pools |
| Alerts | Active and recent alert history |
### Jaeger Trace Search
| Search | Purpose |
|---|---|
| Service: pay-in-service, Status: Error | Find failing pay-in transactions |
| Service: channel-integration, Tag: channel=name | Traces for a specific channel |
| Min Duration: 3s | Find slow transactions |
| Tag: transaction.id=id | Trace a specific transaction end-to-end |
### AWS Console Access
| Service | Purpose | Notes |
|---|---|---|
| RDS | Database health, failover, performance insights | Shared across products |
| ElastiCache | Redis cluster health | |
| EC2 / ECS | Application instance health | |
| ALB | Load balancer health, target groups | |
| VPC | Network configuration, NAT Gateways | |
| CloudTrail | API audit trail | For security investigations |
### Slack Channels
| Channel | Purpose |
|---|---|
| alerts | Automated alert notifications from Grafana |
| incidents | General incident discussion, incident declaration |
| inc-YYYY-MM-DD-brief-description | Per-incident channel (created at declaration) |
| on-call | On-call handover, scheduling, general on-call discussion |
| post-incident | PIR scheduling, action item tracking |
### Incident Tracking

All incidents are tracked in beads (bd CLI):

- Create an incident issue: `bd create "INC: Brief description of the incident"`
- Update status during the incident: `bd update <id> --status in-progress`
- Close when resolved: `bd close <id>`
- Link PIR action items: `bd dep <action-id> --on <incident-id>`
## 15. Appendix: Quick Reference Card

SIMPAISA INCIDENT RESPONSE — QUICK REFERENCE

SEVERITY DEFINITIONS

- SEV1 Critical: payment processing down / data breach / all merchants / regulatory triggered
- SEV2 High: single product down / single channel down / degraded >5 min / partial data exposure
- SEV3 Medium: non-critical degraded / single merchant / elevated errors
- SEV4 Low: cosmetic / docs / non-prod

RESPONSE TIMES

- SEV1: respond 5 min, update 15 min, resolve 1 hr
- SEV2: respond 15 min, update 30 min, resolve 4 hr
- SEV3: respond 1 hr, update 4 hr, resolve 24 hr
- SEV4: next business day
FIRST STEPS WHEN PAGED

1. Acknowledge the alert
2. Open Grafana — assess impact
3. Assign severity
4. Create the Slack channel: inc-YYYY-MM-DD-description
5. Create a beads issue
6. Post the incident header in the channel
7. If SEV1/2: notify the CDO (Daniel O'Reilly)
8. Assign IC, Tech Lead, Comms
CDO NOTIFICATION REQUIRED

- Any SEV1 — immediately
- Any SEV2 — within 30 minutes
- Data breach (suspected or confirmed)
- Regulatory reporting triggered
- Financial impact confirmed

REGULATORY SLAs

- Simpaisa internal: 2 HOURS from detection (all markets)
- Pakistan (SBP): per SBP guidelines
- Bangladesh (BB): per BB MFS guidelines
- PCI DSS (breach): 24 hours to QSA
KEY DASHBOARDS

- Grafana: TBC
- Jaeger: TBC
- OpenSearch: TBC
- AWS Console: TBC

SLACK CHANNELS

- Alerts: alerts
- Incidents: incidents
- On-call: on-call

ESCALATION CONTACTS

- Primary On-Call: TBC — rotation schedule
- Secondary On-Call: TBC — rotation schedule
- IC On-Call: TBC — rotation schedule
- CDO: Daniel O'Reilly
## Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-03 | Daniel O'Reilly (CDO) | Initial version |
## Items Requiring Follow-Up

- Grafana, Jaeger, and OpenSearch dashboard URLs
- On-call rotation tool selection and setup (PagerDuty / Opsgenie)
- On-call rotation participants and schedule
- SBP contact details and reporting requirements
- Bangladesh Bank contact details and MFS reporting guidelines
- Nepal Rastra Bank reporting requirements
- Central Bank of Iraq reporting requirements
- PCI QSA contact details
- Visa and Mastercard breach notification contacts (via acquirer)
- Synthetic monitoring implementation
- Manual fallback procedures development
- Payment channel partner operations contact list
- Chaos engineering programme initiation (maturity gate: 2/5)
- Status page tool selection and setup
- Alert routing integration (Grafana to PagerDuty/Opsgenie)
This playbook is effective immediately and supersedes any previous incident response documentation. All engineering staff are expected to familiarise themselves with this document and participate in quarterly incident response drills.
## Related Documents
| Document | Relevance |
|---|---|
| Security Incident Response Procedure (SIRP) | ISMS incident response procedure |
| W-12: Security Operations Ways of Work | SecOps ways of work |
| ADR-SECURITY-2026-04-048: Audit Trail Architecture | Audit trail used during incident investigation |
| ADR-INFRA-2026-04-066: DNS Failover Strategy | DNS failover during infrastructure incidents |
| Threat Model: API Gateway & Platform | Gateway threats that trigger incident response |