

Simpaisa Incident Response Playbook

Document Owner: Daniel O'Reilly, Chief Digital Officer
Version: 1.0
Created: 2026-04-03
Last Reviewed: 2026-04-03
Next Review: 2026-07-03
Classification: Internal — Confidential
Maturity Level: 1/5 (Initial) — Target: 3/5 within 6 months


Table of Contents

  1. Executive Summary and Purpose

  2. Scope

  3. Severity Classification

  4. Roles and Responsibilities

  5. Incident Lifecycle

  6. Detection and Alerting

  7. Communication Templates

  8. Runbooks

  9. Escalation Matrix

  10. Post-Incident Review (PIR)

  11. Regulatory Reporting Requirements

  12. Business Continuity

  13. Testing and Drills

  14. Tools and Access

  15. Appendix: Quick Reference Card


1. Executive Summary and Purpose

Simpaisa is a payment gateway processing 270M+ transactions worth $1B+ annually across Pakistan, Bangladesh, Nepal, Iraq, and Egypt. Our platform handles Pay-Ins, Pay-Outs, Remittances, and Cards through 20+ payment channels including mobile wallets, bank integrations, direct carrier billing, card networks, and national payment rails.

Any disruption to our services has immediate financial, reputational, and regulatory consequences. This playbook establishes a structured, repeatable incident response process to:

  • Minimise downtime and financial impact to merchants and end-users

  • Ensure regulatory compliance across all jurisdictions (including Simpaisa's 2-hour internal incident reporting SLA)

  • Provide clear escalation paths so the right people are engaged at the right time

  • Enable blameless learning from every incident to continuously improve platform resilience

  • Standardise communication to merchants, regulators, and internal stakeholders during incidents

This is a living document. As our incident response maturity improves, this playbook will be updated to reflect new tooling, processes, and lessons learnt.

Agentic AI Integration

Simpaisa follows an Agentic AI SDLC-first approach. AI agents assist with:

  • Automated incident detection and initial triage

  • Log correlation and root cause analysis suggestions

  • Drafting communication templates and status updates

  • Post-incident review data gathering and timeline reconstruction

Human judgement remains authoritative for all escalation, resolution, and communication decisions.


2. Scope

What Constitutes an Incident

An incident is any event that:

  • Causes or threatens to cause degradation or loss of payment processing capability

  • Results in unauthorised access to payment data, merchant data, or internal systems

  • Triggers or may trigger regulatory reporting obligations in any jurisdiction

  • Causes financial loss to Simpaisa, merchants, or end-users through system malfunction

  • Results in incorrect transaction processing (duplicate charges, incorrect amounts, misrouted payments)

  • Causes breach of SLA commitments to merchants or payment channel partners

In Scope

Category Examples
Payment Processing Transaction failures, channel outages, settlement errors, reconciliation mismatches
Platform Availability Service outages, degraded performance, capacity exhaustion
Data Security Breaches, unauthorised access, data exfiltration, credential compromise
Infrastructure AWS failures, database issues, cache failures, messaging system failures
Integration Payment channel API failures, webhook delivery failures, partner connectivity loss
Compliance PCI DSS violations, regulatory reporting failures

Out of Scope

  • Planned maintenance windows (covered by change management)

  • Feature requests or enhancement work

  • Non-production environment issues (unless blocking critical deployments)

  • Individual merchant configuration issues (handled by support)

  • General IT support issues (workstation, email, etc.)


3. Severity Classification

Severity Definitions

SEV1 (Critical)
Definition: Complete loss of payment processing or confirmed data breach affecting multiple merchants or jurisdictions.
Examples: Payment processing fully down across all channels; confirmed data breach with cardholder data exposure; all merchants unable to process; complete database failure; regulatory reporting triggered.

SEV2 (High)
Definition: Single product or payment channel down, significant performance degradation, or partial data exposure.
Examples: Pay-Ins completely unavailable; Easypaisa channel down; P95 latency >5s for >5 minutes; partial data exposure; single country entirely affected.

SEV3 (Medium)
Definition: Non-critical service degraded, single merchant affected, or elevated error rates not yet impacting overall availability.
Examples: Single merchant experiencing failures; error rate elevated but below 10 percent of transactions; non-critical API endpoint degraded; webhook delivery delays.

SEV4 (Low)
Definition: Cosmetic issues, documentation errors, non-production problems, or minor anomalies with no user impact.
Examples: UI rendering issue on merchant dashboard; incorrect error message text; staging environment instability; minor log anomalies.

SLA Targets

Metric SEV1 SEV2 SEV3 SEV4
Response Time 5 minutes 15 minutes 1 hour Next business day
Acknowledge 10 minutes 30 minutes 2 hours Next business day
First Update 15 minutes 30 minutes 4 hours N/A
Update Frequency Every 15 minutes Every 30 minutes Every 4 hours As needed
Resolution Target 1 hour 4 hours 24 hours 5 business days
Post-Incident Review Mandatory within 24 hours Mandatory within 48 hours Required within 1 week Optional
CDO Notification Immediate Within 30 minutes Daily summary N/A

Severity Upgrade Criteria

An incident must be upgraded if:

  • A SEV2 is not resolved within 2 hours — Upgrade to SEV1

  • A SEV3 affects additional merchants or channels — Upgrade to SEV2

  • Any severity reveals a data breach component — Upgrade to SEV1

  • Any severity triggers regulatory reporting obligations — Upgrade to at least SEV2

  • Customer/merchant financial impact is confirmed — Upgrade by one level


4. Roles and Responsibilities

Incident Roles

Incident Commander (IC)

The IC owns the incident from declaration to closure. They do not fix the problem themselves — they coordinate.

Responsibilities:

  • Declare the incident and assign severity

  • Open the incident Slack channel (naming convention: inc-YYYY-MM-DD-brief-description)

  • Assign Technical Lead and Communications Lead

  • Maintain the incident timeline

  • Make escalation decisions

  • Authorise communications to merchants and regulators

  • Call for Post-Incident Review

  • Create and track the beads issue for the incident

Who: Senior engineer or engineering manager on call. For SEV1, the most senior available engineer assumes IC until explicitly handed over.

Technical Lead (TL)

Responsibilities:

  • Lead the technical investigation and diagnosis

  • Coordinate engineering resources working on the incident

  • Implement containment and resolution actions

  • Provide technical updates to the IC

  • Document technical findings for PIR

  • Verify resolution and confirm service recovery

Who: The engineer with deepest knowledge of the affected system. May change during an incident if a different specialist is needed.

Communications Lead (Comms)

Responsibilities:

  • Draft and send all external communications (merchant notifications, status page updates)

  • Draft regulatory notifications for IC/CDO approval

  • Manage the Slack incident channel information flow

  • Ensure update frequency SLAs are met

  • Coordinate with merchant support team

Who: Product or support team member. For SEV1, a dedicated person must be assigned.

CDO Escalation Criteria

The CDO (Daniel O'Reilly) must be notified when:

  • Any SEV1 incident is declared — immediately

  • Any SEV2 incident is declared — within 30 minutes

  • Any incident requires regulatory reporting in any jurisdiction

  • Any incident involves confirmed or suspected data breach

  • Any incident has confirmed financial impact to merchants or Simpaisa

  • Any incident is likely to receive media attention

  • A SEV2 incident is not resolved within 2 hours

  • A single payment channel has been down for more than 30 minutes during peak hours

Contact method: Slack DM + phone call for SEV1. Slack DM for SEV2. Daily summary for SEV3.

On-Call Rotation Structure

STATUS: To be established. The following is the target structure.

Rotation Coverage Staffing
Primary On-Call 24/7, weekly rotation Engineers (minimum 4 in rotation)
Secondary On-Call Escalation backup Senior engineers (minimum 3 in rotation)
IC On-Call Weekday business hours + weekend coverage Engineering managers / senior leads
Comms On-Call Business hours with on-call for SEV1 Product / support team

On-call expectations:

  • Acknowledge pages within 5 minutes

  • Have laptop and VPN access available at all times during on-call

  • No travel to areas without reliable internet during on-call

  • Handover briefing at rotation change


5. Incident Lifecycle

Overview

Detection → Triage → Containment → Resolution → Recovery → Post-Incident Review

Phase 1: Detection

Objective: Identify that an incident is occurring as quickly as possible.

Automated Detection:

  1. Grafana alert fires based on predefined thresholds (see Section 6)

  2. Alert routes to on-call via configured notification channel

  3. On-call acknowledges the alert within 5 minutes

Manual Detection:

  1. Engineer, merchant, or support team identifies anomalous behaviour

  2. Reporter posts in incidents Slack channel with initial details

  3. On-call engineer investigates and determines if an incident should be declared

AI-Assisted Detection:

  1. Agentic AI monitors aggregate patterns across channels and products

  2. AI flags anomalies that may not trigger individual threshold alerts

  3. On-call engineer reviews AI-flagged anomalies and determines severity

Decision: Is this an incident?

  • If transaction success rate has dropped — Yes, declare incident

  • If a payment channel is returning errors — Yes, declare incident

  • If latency is elevated but transactions succeed — Monitor for 5 minutes, then declare if not improving

  • If only non-production is affected — No, handle as BAU

  • If a single merchant reports issues but metrics look normal — Investigate as potential SEV3

Phase 2: Triage

Objective: Classify the incident, mobilise the right people, and establish communications.

Steps:

  1. Assign severity using the classification matrix in Section 3

  2. Create incident channel using naming convention inc-YYYY-MM-DD-brief-description

  3. Create beads issue with severity label

  4. Post incident header in channel (use template from Section 7)

  5. Assign roles: IC, Technical Lead, Communications Lead

  6. Notify stakeholders per escalation matrix

  7. Assess blast radius:

    • Which products are affected? (Pay-Ins, Pay-Outs, Remittances, Cards)

    • Which channels are affected?

    • Which countries/jurisdictions are affected?

    • How many merchants are impacted?

    • Is there a regulatory reporting obligation?

Phase 3: Containment

Objective: Stop the incident from getting worse. Limit the blast radius.

Steps:

  1. Identify containment options:

    • Can the affected channel be isolated without impacting others?

    • Can traffic be rerouted to healthy channels?

    • Should the affected service be taken out of the load balancer?

    • Is a rollback of a recent deployment needed?

  2. Implement containment — prefer reversible actions:

    • Disable the failing channel in configuration

    • Scale up healthy instances to absorb redirected traffic

    • Enable circuit breakers if not already triggered

    • Roll back the last deployment if suspected as cause

  3. Verify containment:

    • Confirm the blast radius is not expanding

    • Confirm healthy services remain healthy

    • Monitor error rates for secondary impacts

  4. Communicate containment status to IC and update incident channel

Phase 4: Resolution

Objective: Fix the root cause and restore full service.

Steps:

  1. Diagnose root cause:

    • Review Grafana dashboards for anomalies

    • Search Jaeger traces for failing transactions

    • Query OpenSearch logs for errors and exceptions

    • Check recent deployments and configuration changes

    • Review AWS health dashboard for infrastructure issues

  2. Implement fix:

    • Apply code fix, configuration change, or infrastructure remediation

    • For code changes: follow expedited deployment process (peer review still required for SEV1/2)

    • Document all changes made during the incident

  3. Test the fix:

    • Verify in staging if time permits (SEV3/4)

    • For SEV1/2: verify fix addresses root cause, accept risk of direct-to-production if necessary

    • Run synthetic transactions through affected paths

Phase 5: Recovery

Objective: Restore full normal operations and verify stability.

Steps:

  1. Re-enable affected services/channels gradually

  2. Monitor recovery metrics:

    • Transaction success rate returning to baseline

    • Latency returning to normal

    • Error rates dropping to normal levels

    • Queue backlogs draining

  3. Process backlogged transactions if applicable

  4. Verify data consistency:

    • Check for stuck transactions

    • Reconcile any transactions that were in-flight during the incident

    • Verify settlement files are correct

  5. Communicate resolution to merchants and stakeholders

  6. Stand down incident team — IC makes the call

Phase 6: Post-Incident Review

Objective: Learn from the incident and prevent recurrence. See Section 10 for full details.

Decision Trees

Transaction Failure Rate Increasing

Transaction failure rate above 5 percent?

  • YES: Which channels are affected?

    • ALL channels: SEV1 — Likely platform issue (DB, cache, app). Check RDS health, Redis health, application logs

    • MULTIPLE channels: SEV2 — Possible shared dependency. Check shared middleware, network connectivity, DNS

    • SINGLE channel: SEV2 — Channel-specific issue. Check channel API status, integration logs, circuit breaker state

  • NO: Monitor. Set alert if above 3 percent sustained for 5 minutes.
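The failure-rate branch above can be summarised in code. The sketch below is illustrative only: the class, method, and enum names are hypothetical and simply encode the percentages from this decision tree, not an existing Simpaisa service.

```java
// Illustrative only: hypothetical helper encoding the failure-rate decision tree above.
public final class FailureRateTriage {

    public enum Severity { SEV1, SEV2, MONITOR }

    /**
     * @param failureRatePercent overall transaction failure rate (e.g. 6.5 for 6.5%)
     * @param affectedChannels   number of channels showing elevated failures
     * @param totalChannels      number of active channels
     */
    public static Severity classify(double failureRatePercent, int affectedChannels, int totalChannels) {
        if (failureRatePercent <= 5.0) {
            // Below the 5% threshold: keep monitoring (alert if above 3% sustained for 5 minutes).
            return Severity.MONITOR;
        }
        if (affectedChannels >= totalChannels) {
            return Severity.SEV1;   // All channels failing: likely platform issue (DB, cache, app)
        }
        return Severity.SEV2;       // One or several channels: channel-specific or shared dependency
    }

    public static void main(String[] args) {
        System.out.println(classify(7.2, 1, 20));   // SEV2: single channel
        System.out.println(classify(9.0, 20, 20));  // SEV1: all channels
        System.out.println(classify(2.1, 1, 20));   // MONITOR
    }
}
```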

Latency Spike

P95 latency above 2s?

  • YES: Is it affecting transaction success?

    • YES: Transactions timing out — SEV2. Check DB query performance, connection pools, external API latency

    • NO: Transactions slow but completing — SEV3. Check query plans, cache hit rates, JVM garbage collection

  • NO: Monitor. Investigate if above 1.5s sustained.

Suspected Security Incident

  • Confirmed data breach: SEV1 IMMEDIATELY. Engage security team, isolate affected systems, notify CDO

  • Suspicious activity (unusual patterns): SEV2. Investigate scope, preserve evidence, assess data exposure

  • Automated scan/probe detected: SEV4. Verify WAF rules, monitor for escalation

  • Credential compromise suspected: SEV2. Rotate credentials immediately, audit access logs


6. Detection and Alerting

Observability Stack

Component Tool Purpose
Metrics OpenTelemetry to Grafana Time-series metrics, dashboards, alerting
Traces OpenTelemetry to Jaeger Distributed tracing across services
Logs OpenTelemetry to OpenSearch Centralised log aggregation and search

Key Metrics to Monitor

Payment Processing Metrics

Metric Normal Range Warning Threshold Critical Threshold
Transaction Success Rate (overall) above 99 percent below 98 percent below 95 percent
Transaction Success Rate (per channel) above 97 percent below 95 percent below 90 percent
P50 Latency below 500ms above 1s above 2s
P95 Latency below 1.5s above 3s above 5s
P99 Latency below 3s above 5s above 10s
Error Rate (per channel) below 1 percent above 3 percent above 5 percent
Transactions Per Second Baseline plus/minus 20 percent plus/minus 40 percent plus/minus 60 percent or zero
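The per-channel success-rate and latency figures above have to be emitted by the services themselves. The following is a minimal sketch assuming the OpenTelemetry Java API (the stack named in the Observability Stack table); the metric and attribute names are illustrative, not Simpaisa's actual naming scheme.

```java
// Sketch only: assumes the OpenTelemetry Java API; metric and attribute names are illustrative.
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class PaymentMetrics {

    private final LongCounter transactions;
    private final DoubleHistogram latencyMs;

    public PaymentMetrics() {
        Meter meter = GlobalOpenTelemetry.getMeter("payments");
        transactions = meter.counterBuilder("payments.transactions.total")
                .setDescription("Count of processed transactions by channel and outcome")
                .build();
        latencyMs = meter.histogramBuilder("payments.transaction.duration")
                .setUnit("ms")
                .build();
    }

    /** Success rate in Grafana = success count / (success + failure count), per channel. */
    public void record(String channel, boolean success, long durationMs) {
        Attributes attrs = Attributes.of(
                AttributeKey.stringKey("channel"), channel,
                AttributeKey.stringKey("outcome"), success ? "success" : "failure");
        transactions.add(1, attrs);
        latencyMs.record(durationMs, attrs);   // feeds the P50/P95/P99 latency panels
    }
}
```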

Infrastructure Metrics

Metric Normal Range Warning Threshold Critical Threshold
RDS CPU Utilisation below 60 percent above 75 percent above 90 percent
RDS Connection Count below 70 percent of max above 80 percent of max above 90 percent of max
RDS Replication Lag below 100ms above 500ms above 2s
Redis Memory Usage below 70 percent above 80 percent above 90 percent
Redis Connection Count below 80 percent of max above 85 percent above 95 percent
Kafka Consumer Lag below 1000 messages above 5000 messages above 50000 messages
JVM Heap Usage below 70 percent above 80 percent above 90 percent
ALB 5xx Rate below 0.1 percent above 0.5 percent above 1 percent
ALB Healthy Host Count All targets healthy 1 unhealthy more than 1 unhealthy

Grafana Alert Rules

Critical Alerts (Page immediately)

  • TransactionSuccessRateCritical: Overall transaction success rate below 95 percent for 2 minutes

  • ChannelSuccessRateCritical: Channel success rate below 90 percent for 3 minutes

  • TransactionLatencyP95Critical: P95 transaction latency exceeds 5 seconds for 3 minutes

  • ZeroTransactions: No transactions processed in the last 5 minutes, sustained for 2 minutes

  • DBConnectionPoolExhausted: Database connection pool above 90 percent utilised for 1 minute

Warning Alerts (Notify on-call, investigate)

  • ElevatedErrorRate: Transaction error rate above 3 percent for 5 minutes

  • KafkaConsumerLagHigh: Kafka consumer lag exceeds 5000 messages for 5 minutes

  • RedisMemoryHigh: Redis memory usage above 80 percent for 5 minutes

Alert Routing and Escalation

Alert fires in Grafana:

  • Critical: Page on-call immediately (PagerDuty/Opsgenie — TBC)

    • Not acknowledged in 5 min: Page secondary on-call

    • Not acknowledged in 10 min: Page IC on-call + CDO

  • Warning: Slack alerts channel + on-call notification

    • Not acknowledged in 15 min: Page on-call

    • Sustained above 30 min: Upgrade to Critical

  • Info: Slack alerts channel (no page)

Synthetic Monitoring

STATUS: To be implemented.

Target synthetic checks:

Check Frequency Timeout Description
Pay-In Health Every 1 minute 10s End-to-end test transaction through each active channel
Pay-Out Health Every 1 minute 10s Verification of pay-out initiation path
API Gateway Every 30 seconds 5s Health check endpoint for each API gateway instance
Channel Connectivity Every 2 minutes 15s Connectivity check to each payment channel partner
Webhook Delivery Every 5 minutes 30s End-to-end webhook delivery verification
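As a starting point for the implementation work, the sketch below shows what the API Gateway check from the table above might look like as a scheduled probe. The endpoint URL is a placeholder and the alerting action is stubbed out; this is not an agreed design.

```java
// Sketch only: synthetic monitoring is not yet implemented; endpoint URL is a placeholder.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ApiGatewaySyntheticCheck {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // "API Gateway" check from the table above: every 30 seconds, 5 second timeout.
        scheduler.scheduleAtFixedRate(ApiGatewaySyntheticCheck::probe, 0, 30, TimeUnit.SECONDS);
    }

    private static void probe() {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.internal/actuator/health")) // placeholder URL
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                // In production this would emit a metric or page on-call rather than print.
                System.err.println("Synthetic check failed: HTTP " + response.statusCode());
            }
        } catch (Exception e) {
            System.err.println("Synthetic check failed: " + e.getMessage());
        }
    }
}
```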

7. Communication Templates

Internal Communications

Incident Channel Header (Post at Incident Declaration)

INCIDENT DECLARED

  • Severity: SEV 1/2/3/4

  • Summary: One-line description of the incident

  • Impact: Products/channels/merchants affected

  • Countries: Affected jurisdictions

  • Roles: IC @name, Tech Lead @name, Comms @name

  • Beads: issue ID

  • Dashboard: Grafana link

  • Timeline: HH:MM UTC — Incident detected, HH:MM UTC — Incident declared

  • Next update: HH:MM UTC

Status Update Template

INCIDENT UPDATE — HH:MM UTC

  • Severity: SEV X (unchanged/upgraded/downgraded)

  • Status: Investigating / Contained / Resolving / Resolved

  • What we know: Key findings

  • What we are doing: Actions with owners

  • Impact update: Current impact assessment

  • Next update: HH:MM UTC

Merchant Communications

Outage Notification Email

Subject: Simpaisa Service Disruption — Product/Channel — Date

Dear Merchant Name,

We are currently experiencing a service disruption affecting specific product/channel.

What is affected: Description of affected functionality

What this means for you: Specific impact on the merchant operations

What we are doing: Our engineering team is actively investigating and working to resolve the issue.

Current workaround (if applicable): Any alternative processing options available

We will provide updates every frequency until the issue is resolved.

If you have urgent questions, please contact support channel.

Regards, Simpaisa Operations Team

Resolution Notification Email

Subject: Simpaisa Service Restored — Product/Channel — Date

Dear Merchant Name,

The service disruption affecting specific product/channel has been resolved.

Resolution time: HH:MM UTC on Date. Total duration: X hours Y minutes

What was affected: Summary of impact

What happened: Brief, non-technical explanation of the root cause

What we have done: Summary of resolution actions. Any preventive measures being implemented.

Impact on your transactions: Details on any transactions that need attention, reconciliation notes, etc.

A detailed post-incident summary will be shared within timeframe.

If you notice any ongoing issues, please contact support channel.

Regards, Simpaisa Operations Team

Post-Incident Summary for Merchants

Subject: Simpaisa Post-Incident Summary — Brief Description — Date

Incident Summary:

  • Duration: Start time to End time UTC (total duration)

  • Severity: Level

  • Affected services: List

  • Transaction impact: Number of failed/delayed transactions

Root Cause: Non-technical explanation of what caused the incident

Resolution: What was done to fix the issue

Preventive Measures: What we are doing to prevent recurrence — concrete actions and timelines

Transaction Reconciliation: Any guidance on reconciling transactions during the incident window

We apologise for the disruption and are committed to continuously improving our platform reliability.

Regulatory Communications

SBP Payment System Disruption Report (Pakistan)

  1. Reporting Entity: Simpaisa (Pvt.) Limited

  2. Date/Time of Disruption: Start time — PKT

  3. Date/Time of Resolution: End time — PKT / Ongoing

  4. Nature of Disruption: Description

  5. Payment Systems Affected: Channel/rail affected

  6. Volume Impact: Transactions affected count, Value affected PKR amount

  7. Root Cause: Known/Under investigation

  8. Remediation Steps: Actions taken

  9. Preventive Measures: Planned improvements

  10. Contact Person: Name, Designation, Phone, Email — TBC

Status Page Updates

Investigating: We are investigating reports of issue description. Affected: channels/products. Some transactions may fail/be delayed. We will provide an update within timeframe.

Identified: The issue has been identified as brief cause. Our team is implementing a fix. Channel X transactions are currently failing/delayed. Estimated resolution: time estimate or TBC.

Resolved: The issue affecting component has been resolved. All services are operating normally. Duration: X hours Y minutes. We apologise for any inconvenience.


8. Runbooks

Payment Channel Failures

8.1 Single Channel Down (e.g., Easypaisa Unavailable)

Symptoms:

  • Grafana alert: ChannelSuccessRateCritical for a specific channel

  • Spike in HTTP 5xx or timeout errors from the channel API

  • Circuit breaker tripped for the channel

  • Merchant reports of failed transactions on a specific payment method

Impact Assessment:

  • Which merchants rely primarily on this channel?

  • What percentage of total transaction volume flows through this channel?

  • Are there alternative channels available in the same country?

  • Is this during peak transaction hours?

Diagnostic Steps:

  1. Check circuit breaker state via Spring Boot actuator health endpoint

  2. Check channel-specific error rates in Grafana: Dashboards → Payment Channels → the affected channel

  3. Review channel integration logs in OpenSearch

  4. Check Jaeger for failing traces — filter by service and error status

  5. Verify channel partner status — check partner status page, contact partner technical support

  6. Check for recent deployments

Resolution Steps:

  1. If channel partner is down: Confirm with partner support team, disable the channel in configuration to fail fast, enable alternative channels if available, notify affected merchants

  2. If our integration is failing: Check for API contract changes, review recent code/config changes, roll back if recent deployment is the cause

  3. If timeout/latency is the issue: Increase timeout thresholds temporarily (if safe), check if partner API is responding slowly, verify network connectivity

Verification Steps:

  • Confirm transaction success rate for the channel returns to above 97 percent

  • Process test transactions through the channel

  • Verify circuit breaker has closed

  • Check that no transactions are stuck in pending state

Post-Resolution Checks:

  • Reconcile transactions during the outage window

  • Identify any transactions that need manual intervention

  • Verify settlement processing for the affected period
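For reference, the per-channel circuit breaker mentioned in the symptoms, diagnostic, and verification steps above might be configured roughly as follows. This sketch assumes Resilience4j; the thresholds and the channel name are illustrative, not production settings.

```java
// Sketch only: assumes Resilience4j; thresholds and the channel call are illustrative.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.function.Supplier;

public class ChannelCircuitBreakerExample {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open when >=50% of recent calls fail
                .slidingWindowSize(20)                           // judged over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30)) // fail fast for 30s before half-open probes
                .build();

        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        CircuitBreaker breaker = registry.circuitBreaker("easypaisa");

        // Wrap the channel call so that, once the breaker opens, callers fail fast
        // instead of waiting on a degraded partner API.
        Supplier<String> guardedCall =
                CircuitBreaker.decorateSupplier(breaker, ChannelCircuitBreakerExample::callChannelApi);

        System.out.println(guardedCall.get());
        System.out.println("Breaker state: " + breaker.getState()); // CLOSED / OPEN / HALF_OPEN
    }

    private static String callChannelApi() {
        return "OK"; // placeholder for the real channel integration call
    }
}
```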


8.2 Multiple Channels Down Simultaneously

Symptoms:

  • Multiple ChannelSuccessRateCritical alerts firing

  • Overall transaction success rate dropping across products

  • Multiple circuit breakers tripping

Impact Assessment:

  • This is likely SEV1 — escalate immediately

  • Determine if this is a platform issue (shared dependency) vs coincidental channel failures

Diagnostic Steps:

  1. Determine the common factor: Same country? Same protocol? Same start time?

  2. Check shared infrastructure: NAT Gateway, DNS resolution, ALB health

  3. Check for recent platform-wide changes: config management, infrastructure changes, certificate expirations

  4. Review application logs for common errors

Resolution Steps:

  1. If shared infrastructure issue: Fix the infrastructure component, verify connectivity is restored

  2. If bad deployment: Immediate rollback to last known good version

  3. If configuration change: Revert the configuration change, verify each channel individually


8.3 Channel Returning Incorrect Responses

Impact Assessment:

  • This is a financial integrity issue — may be SEV1 depending on scale

Resolution Steps:

  1. Immediately halt transactions to the affected channel if financial integrity is at risk

  2. Fix the parsing/mapping issue or contact the partner

  3. Test with controlled transactions before re-enabling

  4. Mark all affected transactions for manual review and reconciliation


8.4 Channel Timeout / Latency Spike

Resolution Steps:

  1. If partner is slow: Reduce timeout to fail faster, enable circuit breaker, route traffic to alternatives

  2. If our side: Scale up instances, investigate and fix the bottleneck


Database Failures

8.5 RDS Primary Failure / Failover

Impact Assessment: SEV1 — RDS is a shared resource across all products

Resolution Steps:

  1. For automatic failover (Multi-AZ): Monitor recovery, applications should reconnect automatically

  2. If failover does not complete: Engage AWS Support immediately (SEV1), consider promoting read replica manually

  3. Verify data integrity after failover


8.6 Replication Lag

Resolution Steps:

  1. If caused by heavy write load: Identify and optimise, defer non-critical batch operations

  2. If caused by long-running queries on replica: Terminate problematic queries, optimise or reschedule

  3. If replication is broken: Rebuild replica from snapshot, engage AWS Support


8.7 Connection Pool Exhaustion

Impact Assessment: SEV1 if affecting all services (shared database)

Resolution Steps:

  1. Terminate long-running queries/transactions

  2. Restart the leaking service instance

  3. Increase connection pool size temporarily

  4. Investigate root cause: missing @Transactional timeout, unclosed connections, deadlocks
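A sketch of the pool settings involved, assuming HikariCP (the default connection pool in Spring Boot); the endpoint and values are illustrative, not recommended production figures.

```java
// Sketch only: assumes HikariCP; endpoint, pool sizes, and thresholds are illustrative.
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class ConnectionPoolSketch {

    public static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://rds-host:5432/payments"); // placeholder endpoint
        config.setMaximumPoolSize(30);            // temporary increases must stay within the RDS connection limit
        config.setConnectionTimeout(5_000);       // fail fast (5s) instead of queueing callers indefinitely
        config.setLeakDetectionThreshold(30_000); // log a suspected leak if a connection is held longer than 30s
        config.setMaxLifetime(1_800_000);         // recycle connections every 30 minutes
        return new HikariDataSource(config);
    }
}
```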


8.8 Slow Query Causing Transaction Timeouts

Resolution Steps:

  1. Terminate the immediate problem query

  2. Add missing index

  3. Optimise the query and deploy fix

  4. If a batch job is causing contention, stop or reschedule it


8.9 Shared Database Contention Between Products

Impact Assessment: SEV1 — cross-product impact due to shared resource

Resolution Steps:

  1. Stop the offending batch/query

  2. Reschedule the operation to off-peak hours

  3. Longer term: Evaluate database separation per product


Cache Failures

8.10 Redis Cluster Failure

Impact Assessment: SEV2 initially — degrades to SEV1 if database cannot handle additional load

Resolution Steps:

  1. If single node failure: Failover should be automatic

  2. If full cluster failure: Verify AWS ElastiCache health, create new cluster from backup if necessary

  3. If memory exhaustion: Check for key space explosion, review eviction policy, scale up


8.11 Cache Poisoning (Incorrect Data Served)

Impact Assessment: SEV1 if affecting transaction processing integrity

Resolution Steps:

  1. Flush affected cache keys

  2. If scope is uncertain, flush entire cache

  3. Fix the root cause (cache update logic, race condition)
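Targeted key invalidation (step 1) might look like the sketch below, assuming the Lettuce Redis client; the key pattern and endpoint are hypothetical examples.

```java
// Sketch only: assumes the Lettuce client; endpoint and key pattern are hypothetical.
import io.lettuce.core.KeyScanCursor;
import io.lettuce.core.RedisClient;
import io.lettuce.core.ScanArgs;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.sync.RedisCommands;

public class CacheFlushSketch {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://elasticache-host:6379"); // placeholder endpoint
        try (StatefulRedisConnection<String, String> connection = client.connect()) {
            RedisCommands<String, String> redis = connection.sync();

            // SCAN (not KEYS) so the flush does not block the cluster under load.
            ScanArgs args = ScanArgs.Builder.matches("channel:easypaisa:*").limit(500);
            KeyScanCursor<String> cursor = redis.scan(args);
            while (true) {
                if (!cursor.getKeys().isEmpty()) {
                    redis.del(cursor.getKeys().toArray(new String[0]));
                }
                if (cursor.isFinished()) {
                    break;
                }
                cursor = redis.scan(cursor, args);
            }
        } finally {
            client.shutdown();
        }
    }
}
```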


8.12 ElastiCache Failover

Failover should complete automatically (typically 30-60 seconds). If applications do not reconnect, verify DNS resolution and consider rolling restart.


Application Failures

8.13 Service Crash / Restart Loop

Resolution Steps:

  1. If caused by recent deployment: roll back immediately

  2. If OOM: increase JVM heap settings temporarily

  3. If configuration error: fix and redeploy

  4. If external dependency failure: add circuit breaker


8.14 Memory Leak

Resolution Steps:

  1. Immediate: Restart the affected instance

  2. Set up automated restarts as temporary mitigation

  3. Fix the leak in code and deploy


8.15 Thread Pool Exhaustion

Resolution Steps:

  1. If threads blocked on slow external call: reduce timeout, enable circuit breaker, scale up

  2. If deadlock: restart the instance, fix the deadlock condition

  3. If genuine capacity issue: increase thread pool size, scale out instances
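Where the root cause is an unbounded or oversized pool, a bounded pool with back-pressure is the usual mitigation. The sketch below uses java.util.concurrent directly; the sizes, queue depth, and rejection policy are illustrative choices, not prescribed values.

```java
// Sketch only: pool sizes, queue depth, and rejection policy are illustrative.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorSketch {

    public static ThreadPoolExecutor build() {
        return new ThreadPoolExecutor(
                20,                                         // core threads
                50,                                         // max threads under load
                60, TimeUnit.SECONDS,                       // idle threads above core reclaimed after 60s
                new ArrayBlockingQueue<>(200),              // bounded queue: backlog is visible, not unbounded
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure instead of silent queue growth
    }
}
```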


8.16 Deployment Failure / Bad Deploy

Resolution Steps:

  1. Immediate rollback to previous version

  2. If rollback not possible (database migration applied): fix forward with hotfix

  3. Verify rollback success


Messaging Failures

8.17 Kafka Broker Down

Impact Assessment: SEV2 if single broker, SEV1 if multiple brokers or entire cluster

Resolution Steps:

  1. Single broker failure: Partitions redistribute automatically, investigate root cause, restart or replace

  2. Multiple broker failure: Assess data durability, check AWS MSK health, restore from backup if necessary

Note: Simpaisa is evaluating migration from Kafka to NSQ.


8.18 Consumer Lag Causing Webhook Delays

Resolution Steps:

  1. If consumer instances unhealthy: restart, scale up

  2. If poison message: skip or dead-letter, fix processing logic

  3. If genuine throughput issue: scale consumer group, increase batch size

  4. Communicate to merchants: webhooks delayed but will be delivered
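Step 2 above refers to skipping or dead-lettering a poison message. A minimal sketch, assuming Spring Kafka: a failed record is retried a few times and then published to a dead-letter topic so one bad message cannot stall the consumer group. Bean wiring, retry counts, and topic naming are illustrative.

```java
// Sketch only: assumes Spring Kafka; retry counts and topic naming are illustrative.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class WebhookConsumerErrorHandling {

    @Bean
    public DefaultErrorHandler webhookErrorHandler(KafkaTemplate<Object, Object> template) {
        // Retry a failed record 3 times (1s apart), then publish it to "<original topic>.DLT"
        // so a single poison message cannot block the rest of the partition.
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                (ConsumerRecord<?, ?> rec, Exception ex) ->
                        new TopicPartition(rec.topic() + ".DLT", rec.partition()));
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
    }
}
```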


8.19 Message Queue Backlog

Resolution Steps:

  1. Scale consumers

  2. If downstream bottleneck: fix the bottleneck first

  3. If traffic spike: consider rate limiting at the producer


Security Incidents

8.20 Suspected Data Breach

Impact Assessment: SEV1 — ALWAYS

Resolution Steps:

  1. Contain: Isolate affected systems, revoke credentials, block offending IPs

  2. Notify: CDO immediately, legal counsel immediately, PCI QSA within 24 hours if cardholder data involved

  3. Investigate: Engage forensics, full scope assessment, timeline reconstruction

  4. Remediate: Patch vulnerability, rotate credentials, implement additional monitoring


8.21 DDoS Attack

Resolution Steps:

  1. Enable AWS Shield Advanced

  2. Activate WAF rate limiting rules

  3. Scale up ALB and application instances

  4. Engage AWS DDoS Response Team


8.22 Credential Compromise

Resolution Steps:

  1. Immediately rotate the compromised credential

  2. Revoke all active sessions

  3. Audit all activity using the compromised credential

  4. Notify the affected merchant


8.23 Webhook Spoofing Detected

Resolution Steps:

  1. Block the offending source

  2. Verify webhook signature validation is enforced

  3. Audit all webhook-triggered actions during suspicious period

  4. Rotate webhook secrets

  5. Notify affected merchants
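Signature validation (step 2) typically means an HMAC over the raw payload compared in constant time. The sketch below assumes HMAC-SHA256 hex signatures; the surrounding header handling and secret management are out of scope, and the class is illustrative, not Simpaisa's actual webhook contract.

```java
// Sketch only: HMAC-SHA256 verification; signature format and secret handling are illustrative.
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class WebhookSignatureVerifier {

    private static final String ALGORITHM = "HmacSHA256";

    /** Returns true only if the received signature matches the HMAC of the raw payload. */
    public static boolean isValid(String rawPayload, String receivedSignatureHex, byte[] sharedSecret) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(sharedSecret, ALGORITHM));
            byte[] expected = mac.doFinal(rawPayload.getBytes(StandardCharsets.UTF_8));
            byte[] received = HexFormat.of().parseHex(receivedSignatureHex);
            // Constant-time comparison to avoid timing side channels.
            return MessageDigest.isEqual(expected, received);
        } catch (Exception e) {
            return false; // treat any parsing or crypto failure as an invalid signature
        }
    }
}
```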


8.24 Unusual Transaction Patterns (Potential Fraud)

Resolution Steps:

  1. Do NOT immediately block — assess first

  2. Analyse the pattern

  3. If fraud is confirmed: suspend processing, notify merchant, report to authorities


Infrastructure Failures

8.25 AWS Availability Zone Failure

Impact Assessment: SEV1 if single-AZ deployment, SEV2 if Multi-AZ with healthy failover

Resolution Steps:

  1. If Multi-AZ working: Monitor remaining AZs, scale up if needed

  2. If failover failed: Manually remove unhealthy targets, scale up in healthy AZs

  3. Post-recovery: Verify services redeploy, rebalance traffic


8.26 ALB Unhealthy Targets

Resolution Steps:

  1. If instances crashed: restart or replace

  2. If health check failing but app running: fix the health check issue

  3. If capacity issue: scale up


8.27 NAT Gateway Failure (Outbound to Channels Blocked)

Impact Assessment: SEV1 — all payment channel connectivity lost

Resolution Steps:

  1. If NAT Gateway failed: Create new NAT Gateway in healthy AZ, update route tables

  2. If throttled: Split traffic across multiple NAT Gateways

  3. If route table misconfigured: Correct the route table entry


8.28 DNS Resolution Failure

Resolution Steps:

  1. If Route 53 issue: check AWS Service Health Dashboard

  2. If VPC resolver issue: restart VPC DNS resolver

  3. If specific domain: check domain DNS configuration

  4. Temporary workaround: add hosts file entries on critical instances (last resort)


9. Escalation Matrix

Time-Based Escalation

Time Since Detection Action
0 minutes Primary on-call alerted
5 minutes If not acknowledged: Secondary on-call alerted
10 minutes If not acknowledged: IC on-call + CDO alerted
15 minutes If SEV1 not contained: All senior engineers engaged
30 minutes If SEV1 not resolved: CDO to consider merchant communication
1 hour If SEV1 not resolved: CDO to consider regulatory notification
2 hours If SEV1/SEV2 not resolved: Executive review of situation

Severity-Based Escalation

Severity Immediate 15 Minutes 30 Minutes 1 Hour 2 Hours
SEV1 On-call, IC, CDO All senior engineers Merchant comms drafted Regulatory assessment Executive review
SEV2 On-call, IC CDO notified Senior engineer if needed Merchant comms if needed Upgrade to SEV1 if unresolved
SEV3 On-call IC if complex CDO daily summary
SEV4 On-call (next business day)

CDO Notification Criteria

Scenario Notification Timing
SEV1 declared Immediately
SEV2 declared Within 30 minutes
Any data breach (suspected or confirmed) Immediately
Regulatory reporting required Immediately
Merchant financial impact confirmed Within 30 minutes
Single channel down above 30 min (peak hours) Within 30 minutes
Media enquiry about an incident Immediately
SEV2 unresolved after 2 hours Immediately (upgrade to SEV1)

Regulatory Notification Criteria

Jurisdiction Trigger SLA
Pakistan (SBP) Payment system disruption As per SBP guidelines
Bangladesh (BB) MFS service disruption As per BB guidelines
All (PCI DSS) Cardholder data breach 24 hours to QSA

10. Post-Incident Review (PIR)

Blameless Review Process

Simpaisa conducts blameless post-incident reviews. The principles are:

  1. People are not the root cause. Systems, processes, and tooling are.

  2. The goal is learning, not blame. We want to understand what happened and prevent recurrence.

  3. Hindsight is 20/20. Decisions made during the incident were the best decisions possible with the information available at the time.

  4. Every incident is an opportunity to improve our systems and processes.

PIR Scheduling

Severity PIR Required? Scheduling
SEV1 Mandatory Within 24 hours of resolution
SEV2 Mandatory Within 48 hours of resolution
SEV3 Required Within 1 week of resolution
SEV4 Optional At team discretion

PIR Attendance

  • Incident Commander

  • Technical Lead

  • All engineers who worked on the incident

  • Product owner for affected product(s)

  • CDO (for SEV1, optional for SEV2)

  • Anyone else who can provide context

PIR Template

Post-Incident Review: Incident Title

  • Date, Incident Date, Severity, Duration, IC, Tech Lead, Beads Issue

  • Summary: 2-3 sentence summary

  • Timeline table: Time (UTC) and Event

  • Impact: Transactions affected, financial impact, merchants affected, countries, duration, regulatory reporting

  • Root Cause: Detailed technical explanation

  • Contributing Factors

  • What Went Well

  • What Could Be Improved

  • Action Items table: ID, Action, Owner, Due Date, Beads Issue

  • Lessons Learnt

Action Item Tracking

  • All PIR action items must be tracked as beads issues

  • Review open PIR action items in weekly engineering stand-ups

  • Action items should have clear owners and due dates

Trend Analysis

On a quarterly basis, review:

  • Total incidents by severity

  • Mean time to detect (MTTD)

  • Mean time to acknowledge (MTTA)

  • Mean time to resolve (MTTR)

  • Most common root cause categories

  • Repeat incidents (same root cause recurring)

  • Action item completion rate

  • PIR completion rate
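MTTD, MTTA, and MTTR are straightforward to derive from incident timestamps, for example MTTR as the average of (resolved time minus detected time). The sketch below is illustrative; the incident fields are hypothetical, not a defined beads schema.

```java
// Sketch only: incident fields are illustrative, not a defined beads schema.
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class IncidentTrendStats {

    record Incident(Instant detectedAt, Instant resolvedAt) {}

    /** Mean time to resolve, in minutes: average of (resolvedAt - detectedAt) across incidents. */
    static double mttrMinutes(List<Incident> incidents) {
        return incidents.stream()
                .mapToLong(i -> Duration.between(i.detectedAt(), i.resolvedAt()).toMinutes())
                .average()
                .orElse(0);
    }

    public static void main(String[] args) {
        List<Incident> quarter = List.of(
                new Incident(Instant.parse("2026-04-10T08:00:00Z"), Instant.parse("2026-04-10T09:15:00Z")),
                new Incident(Instant.parse("2026-05-02T14:30:00Z"), Instant.parse("2026-05-02T16:10:00Z")));
        System.out.printf("MTTR: %.1f minutes%n", mttrMinutes(quarter)); // (75 + 100) / 2 = 87.5
    }
}
```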


11. Regulatory Reporting Requirements

Pakistan — State Bank of Pakistan (SBP)

Item Details
SLA As per SBP Payment Systems Department guidelines
Trigger Disruption to payment systems operating under SBP licence
Contact SBP Payment Systems Department — TBC

Payment systems requiring reporting:

  • 1Link transactions

  • RAAST transactions

  • Mobile wallet integrations (Easypaisa, JazzCash)

  • Any disruption to interbank payment clearing

Bangladesh — Bangladesh Bank

Item Details
SLA As per Bangladesh Bank MFS guidelines
Trigger Disruption to Mobile Financial Services (MFS)
Contact Bangladesh Bank Payment Systems Department — TBC

MFS services requiring reporting:

  • bKash integration disruptions

  • BRAC Bank integration disruptions

  • Any disruption affecting Bangladeshi mobile money transactions

PCI DSS — Breach Notification

Item Details
SLA Notify QSA within 24 hours; card brands within 24-72 hours
Trigger Confirmed or suspected breach of cardholder data
QSA Contact TBC

Cardholder data breach includes:

  • Primary Account Number (PAN) exposure

  • CVV/CVC exposure

  • Track data exposure

  • PIN data exposure

  • Any unauthorised access to cardholder data environment

PCI DSS Breach Response Requirements:

  1. Immediately contain and limit the exposure

  2. Notify acquirer and payment brands

  3. Engage PCI Forensic Investigator (PFI) if required

  4. Preserve all evidence

  5. Provide all required documentation to card brands

Nepal and Iraq

STATUS: Regulatory reporting requirements for Nepal and Iraq to be documented. Engage local legal counsel to confirm obligations.

Jurisdiction Regulator Reporting Requirements
Nepal Nepal Rastra Bank To be confirmed
Iraq Central Bank of Iraq To be confirmed

12. Business Continuity

Degraded Mode Operations

Failed Component | Can Still Process | Cannot Process | Notes
Single payment channel | All other channels | Transactions for that specific channel | Merchants can be routed to alternative channels
Multiple channels in one country | All other countries, unaffected channels | Affected country transactions through those channels | Consider if alternative local channels exist
RDS primary | Nothing during failover (60-120s) | All transactions | Multi-AZ failover should restore automatically
Redis | Transactions (with degraded performance) | Rate limiting, session management | Application falls back to DB; monitor DB load
Kafka | Synchronous transaction processing | Asynchronous webhooks, notifications | Webhooks will be delayed, not lost
Single AZ | All transactions (with reduced capacity) | None | Scale remaining AZ; monitor capacity
ALB | None | All inbound traffic | Failover to backup ALB or DNS-based routing
NAT Gateway | Internal processing | All outbound channel calls | Create new NAT Gateway in healthy AZ

Manual Fallback Procedures

STATUS: To be developed.

  1. Manual transaction processing: Document manual process for critical merchant transactions with dual authorisation

  2. Offline reconciliation: Template spreadsheets for manual reconciliation

  3. Partner direct communication: Contact list for all payment channel partner operations teams

Merchant Communication During Extended Outages

Duration Action
0-15 minutes Internal investigation; no external communication
15-30 minutes Status page updated; large merchants notified via email
30-60 minutes All affected merchants notified; estimated recovery time provided
1-4 hours Hourly updates to merchants; CDO involved in merchant communications
4+ hours Key merchant account managers engaged for direct outreach

Recovery Priority Order

Priority Component Rationale
1 Database (RDS) Foundation for all services
2 Cache (Redis) Required for performance and rate limiting
3 Pay-Ins Highest transaction volume; direct merchant revenue impact
4 Pay-Outs Merchant settlements and disbursements
5 Messaging (Kafka) Webhooks and async processing
6 Cards Card transaction processing
7 Remittances Cross-border transfers
8 Merchant Dashboard Merchant self-service (not transaction-critical)

13. Testing and Drills

Quarterly Incident Response Drills

Frequency: Quarterly (minimum). Duration: 1-2 hours. Participants: All engineers who may be on-call

Quarter Drill Type Scenario
Q1 Tabletop SEV1: Complete payment processing failure
Q2 Live simulation SEV2: Single channel failure with cascade
Q3 Tabletop SEV1: Data breach with regulatory reporting
Q4 Live simulation SEV2: Database failover with data consistency check

Drill Process:

  1. Scenario prepared in advance (not shared with participants)

  2. Incident is declared — participants respond as in a real incident

  3. IC assigned, roles filled

  4. Team works through diagnosis and resolution

  5. Drill coordinator introduces complications

  6. Drill ends with debrief

Chaos Engineering / Game Days

STATUS: To be established after incident response maturity reaches 2/5.

Scope for game days:

  • Single channel connectivity failure

  • Cache failure (Redis node termination)

  • Single application instance failure

  • Increased latency injection to external calls

Tabletop Exercises for SEV1 Scenarios

  1. Complete platform outage — RDS failure with failover not working

  2. Data breach — Cardholder data exposed via API vulnerability

  3. Multi-channel failure — Major channels down in Pakistan during peak hours

  4. Security incident — Compromised credentials used to exfiltrate data

  5. Regulatory crisis — incident triggering 2-hour internal regulatory reporting deadline

  6. Bad deployment — Code change causes silent data corruption

  7. DDoS during peak — Volumetric attack during Eid/holiday peak processing

Post-Drill Review and Improvement

  1. Conduct 30-minute debrief immediately after drill

  2. Document findings in a beads issue

  3. Update this playbook with process changes

  4. Update runbooks with new diagnostic or resolution steps

  5. Track improvements as beads issues with owners and due dates

  6. Review improvement completion before the next drill


14. Tools and Access

Observability Tools

Tool Purpose URL Access
Grafana Metrics, dashboards, alerting TBC SSO / LDAP
Jaeger Distributed tracing TBC SSO / LDAP
OpenSearch Centralised logs TBC SSO / LDAP

Key Grafana Dashboards

Dashboard Purpose
Payment Processing Overview Overall transaction success rates, volumes, latency
Channel Health Per-channel success rates, error rates, latency
Infrastructure RDS, Redis, Kafka, ALB, EC2/ECS metrics
Application JVM metrics, thread pools, connection pools
Alerts Active and recent alert history

Key Jaeger Trace Searches

Search Purpose
Service: pay-in-service, Status: Error Find failing pay-in transactions
Service: channel-integration, Tag: channel=name Traces for a specific channel
Min Duration: 3s Find slow transactions
Tag: transaction.id=id Trace a specific transaction end-to-end

AWS Console Access

Service Purpose Notes
RDS Database health, failover, performance insights Shared across products
ElastiCache Redis cluster health
EC2 / ECS Application instance health
ALB Load balancer health, target groups
VPC Network configuration, NAT Gateways
CloudTrail API audit trail For security investigations

Slack Channels

Channel Purpose
alerts Automated alert notifications from Grafana
incidents General incident discussion, incident declaration
inc-YYYY-MM-DD-brief-description Per-incident channel (created at declaration)
on-call On-call handover, scheduling, general on-call discussion
post-incident PIR scheduling, action item tracking

Incident Tracking

All incidents are tracked in beads (bd CLI):

  • Create an incident issue: bd create "INC: <brief description of the incident>"

  • Update status during incident: bd update <id> --status in-progress

  • Close when resolved: bd close <id>

  • Link PIR action items: bd dep <action-id> --on <incident-id>


15. Appendix: Quick Reference Card

SIMPAISA INCIDENT RESPONSE — QUICK REFERENCE

SEVERITY DEFINITIONS

  • SEV1 Critical: Payment processing down / data breach / all merchants / regulatory triggered

  • SEV2 High: Single product down / single channel down / degraded above 5min / partial data exposure

  • SEV3 Medium: Non-critical degraded / single merchant / elevated errors

  • SEV4 Low: Cosmetic / docs / non-prod

RESPONSE TIMES

  • SEV1: Respond 5min, Update 15min, Resolve 1hr

  • SEV2: Respond 15min, Update 30min, Resolve 4hr

  • SEV3: Respond 1hr, Update 4hr, Resolve 24hr

  • SEV4: Next business day

FIRST STEPS WHEN PAGED

  1. Acknowledge the alert

  2. Open Grafana — assess impact

  3. Assign severity

  4. Create Slack channel: inc-YYYY-MM-DD-description

  5. Create beads issue

  6. Post incident header in channel

  7. If SEV1/2: Notify CDO (Daniel O'Reilly)

  8. Assign IC, Tech Lead, Comms

CDO NOTIFICATION REQUIRED

  • Any SEV1 — immediately

  • Any SEV2 — within 30 minutes

  • Data breach (suspected or confirmed)

  • Regulatory reporting triggered

  • Financial impact confirmed

REGULATORY SLAs

  • Simpaisa internal: 2 HOURS from detection (all markets)

  • Pakistan (SBP): Per SBP guidelines

  • Bangladesh (BB): Per BB MFS guidelines

  • PCI DSS (breach): 24 hours to QSA

KEY DASHBOARDS

  • Grafana: TBC

  • Jaeger: TBC

  • OpenSearch: TBC

  • AWS Console: TBC

SLACK CHANNELS

  • Alerts: alerts

  • Incidents: incidents

  • On-call: on-call

ESCALATION CONTACTS

  • Primary On-Call: TBC — rotation schedule

  • Secondary On-Call: TBC — rotation schedule

  • IC On-Call: TBC — rotation schedule

  • CDO: Daniel O'Reilly


Document Control

Version Date Author Changes
1.0 2026-04-03 Daniel O'Reilly (CDO) Initial version

Items Requiring Follow-Up

  • Grafana, Jaeger, and OpenSearch dashboard URLs

  • On-call rotation tool selection and setup (PagerDuty / Opsgenie)

  • On-call rotation participants and schedule

  • SBP contact details and reporting requirements

  • Bangladesh Bank contact details and MFS reporting guidelines

  • Nepal Rastra Bank reporting requirements

  • Central Bank of Iraq reporting requirements

  • PCI QSA contact details

  • Visa and Mastercard breach notification contacts (via acquirer)

  • Synthetic monitoring implementation

  • Manual fallback procedures development

  • Payment channel partner operations contact list

  • Chaos engineering programme initiation (maturity gate: 2/5)

  • Status page tool selection and setup

  • Alert routing integration (Grafana to PagerDuty/Opsgenie)


This playbook is effective immediately and supersedes any previous incident response documentation. All engineering staff are expected to familiarise themselves with this document and participate in quarterly incident response drills.

Document Relevance
Security Incident Response Procedure (SIRP) ISMS incident response procedure
W-12: Security Operations Ways of Work SecOps ways of work
ADR-SECURITY-2026-04-048: Audit Trail Architecture Audit trail used during incident investigation
ADR-INFRA-2026-04-066: DNS Failover Strategy DNS failover during infrastructure incidents
Threat Model: API Gateway & Platform Gateway threats that trigger incident response