STD-PRODUCT-105: Channel Health Monitoring Standard

Version: 1.0.0
Effective Date: 2026-04-03
Owner: CDO / Architecture
Status: Active
Applicability: All Simpaisa payment channels across PK, BD, NP, and IQ markets

Compliance: All channel integrations MUST implement health monitoring as defined in this standard. New channel integrations MUST comply before production launch. Existing channels MUST comply within 60 days of this standard's effective date.


1. Purpose

This standard defines how Simpaisa monitors the health of every payment channel (mobile wallets, bank APIs, card acquirers, remittance corridors) in real time. Channel health scores drive routing decisions (ADR-PRODUCT-2026-04-104), merchant-facing status information, and operational alerting.

2. Health Score Computation

Each channel receives a composite health score (0.0–1.0) computed from four weighted metrics:

Metric | Weight | Calculation
Availability | 30% | Successful responses / total requests (5-minute window)
P95 Latency | 25% | 95th percentile response time vs channel-specific SLA
Success Rate | 30% | Successful transactions / total attempts (5-minute window)
Error Rate | 15% | Provider errors / total requests (5-minute window)

Score formula:

health = (availability * 0.30) + (latency_score * 0.25) + (success_rate * 0.30) + ((1 - error_rate) * 0.15)

Where latency_score = max(0, 1 - (p95_latency / sla_latency_threshold)).
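The formula above can be sketched in Go as follows. This is a minimal illustration, not the service's actual code: the `Metrics` struct, its field names, and the handling of a missing SLA threshold are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// Metrics holds the inputs to the composite score.
// The struct and its field names are illustrative, not defined by this standard.
type Metrics struct {
	Availability float64 // successful responses / total requests, 0.0-1.0
	SuccessRate  float64 // successful transactions / total attempts, 0.0-1.0
	ErrorRate    float64 // provider errors / total requests, 0.0-1.0
	P95LatencyMs float64 // observed 95th percentile latency
	SLALatencyMs float64 // channel-specific SLA latency threshold
}

// LatencyScore implements max(0, 1 - p95_latency / sla_latency_threshold).
func LatencyScore(p95Ms, slaMs float64) float64 {
	if slaMs <= 0 {
		return 0 // assumption: a missing SLA threshold is scored as the worst case
	}
	return math.Max(0, 1-p95Ms/slaMs)
}

// HealthScore applies the 30/25/30/15 weighting from the score formula.
func HealthScore(m Metrics) float64 {
	return m.Availability*0.30 +
		LatencyScore(m.P95LatencyMs, m.SLALatencyMs)*0.25 +
		m.SuccessRate*0.30 +
		(1-m.ErrorRate)*0.15
}

func main() {
	m := Metrics{Availability: 0.99, SuccessRate: 0.98, ErrorRate: 0.02,
		P95LatencyMs: 600, SLALatencyMs: 3000}
	fmt.Printf("health = %.3f\n", HealthScore(m)) // health = 0.938
}
```

In this example the latency term contributes 0.8 * 0.25 = 0.20, because a P95 of 600 ms is one fifth of the 3,000 ms threshold. Note that under this formula the latency term reaches zero once P95 latency equals the SLA threshold.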

Score interpretation:

Score Range | Status | Routing Impact
0.90–1.00 | Healthy | Full traffic eligible
0.70–0.89 | Degraded | Reduced preference in routing
0.50–0.69 | Impaired | Only used if no healthy alternative
0.00–0.49 | Down | Excluded from routing
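A small Go sketch of how a score might be mapped to a status. The type and function names are hypothetical. The table lists ranges to two decimal places, so the sketch assumes each band starts at its lower bound and runs up to the next band (a score of 0.895 is treated as degraded).

```go
package main

import "fmt"

// Status mirrors the four states in the score interpretation table.
type Status string

const (
	Healthy  Status = "healthy"
	Degraded Status = "degraded"
	Impaired Status = "impaired"
	Down     Status = "down"
)

// Classify maps a composite health score to a status.
// Assumption: each band is half-open, [lower bound, next band's lower bound).
func Classify(score float64) Status {
	switch {
	case score >= 0.90:
		return Healthy
	case score >= 0.70:
		return Degraded
	case score >= 0.50:
		return Impaired
	default:
		return Down
	}
}

// RoutingEligible reports whether the routing engine may consider the channel at all.
// Only "down" channels are excluded; preference among the rest is left to the router.
func RoutingEligible(s Status) bool {
	return s != Down
}

func main() {
	for _, score := range []float64{0.95, 0.84, 0.55, 0.20} {
		st := Classify(score)
		fmt.Println(score, st, RoutingEligible(st))
	}
}
```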

Scores are recomputed every 30 seconds and published to an in-memory cache accessible by the routing engine.

3. Degradation Detection

Automated degradation detection triggers when any of the following conditions are met:

Condition | Detection Rule | Action
Consecutive failures | 3 consecutive failed requests | Mark channel as impaired
Error rate spike | >10% error rate in a 5-minute window | Mark channel as degraded
Latency spike | P95 latency >3x SLA threshold for 2 minutes | Mark channel as degraded
Total outage | 0 successful responses for 60 seconds | Mark channel as down

Recovery detection: a channel returns to healthy status when the following conditions are met:

  • Success rate exceeds 95% for 5 consecutive minutes
  • P95 latency returns below the SLA threshold

After recovery, traffic ramps up gradually: 10% → 25% → 50% → 100% over 10 minutes.
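The detection rules and the recovery ramp can be sketched in Go as below. The `Snapshot` type and both function names are hypothetical. The sketch makes two assumptions that this standard does not state: when several rules fire at once the most severe action wins, and the ramp reaches 100% at the 10-minute mark through three equal steps of 3 minutes 20 seconds.

```go
package main

import (
	"fmt"
	"time"
)

// Snapshot is a hypothetical summary of a channel's recent behaviour.
type Snapshot struct {
	ConsecutiveFailures int
	ErrorRate5m         float64       // 0.0-1.0 over the 5-minute window
	LatencyOver3xSLA    time.Duration // how long P95 has stayed above 3x the SLA threshold
	SinceLastSuccess    time.Duration // time since the last successful response
}

// Detect returns the status forced by the detection rules, or "" if none fires.
// Assumption: when several rules fire, the most severe action wins.
func Detect(s Snapshot) string {
	switch {
	case s.SinceLastSuccess >= 60*time.Second:
		return "down"
	case s.ConsecutiveFailures >= 3:
		return "impaired"
	case s.ErrorRate5m > 0.10, s.LatencyOver3xSLA >= 2*time.Minute:
		return "degraded"
	}
	return ""
}

// RampShare returns the fraction of normal traffic for a recovered channel.
// Assumption: 10% at recovery, then 25%, 50%, 100% at equal intervals,
// reaching 100% exactly 10 minutes after recovery.
func RampShare(sinceRecovery time.Duration) float64 {
	steps := []float64{0.10, 0.25, 0.50, 1.00}
	if sinceRecovery < 0 {
		return steps[0]
	}
	stepLen := 10 * time.Minute / 3
	i := int(sinceRecovery / stepLen)
	if i >= len(steps) {
		i = len(steps) - 1
	}
	return steps[i]
}

func main() {
	fmt.Println(Detect(Snapshot{ConsecutiveFailures: 3})) // impaired
	fmt.Println(RampShare(4 * time.Minute))               // 0.25
}
```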

4. Per-Channel SLA Baselines

Each channel has market-specific SLA baselines used for scoring:

Market | Channel Type | Expected Availability | Expected P95 Latency
PK | Mobile Wallet (JazzCash, Easypaisa) | 99.0% | 3,000ms
PK | Bank API (HBL, UBL) | 98.5% | 5,000ms
PK | Card Acquirer | 99.5% | 2,000ms
BD | Mobile Wallet (bKash, Nagad) | 98.5% | 4,000ms
BD | Bank API | 98.0% | 6,000ms
NP | Mobile Wallet (eSewa, Khalti) | 98.0% | 4,000ms
IQ | Mobile Wallet (Zain Cash) | 97.0% | 5,000ms

Baselines are reviewed quarterly and updated based on observed provider performance.
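One way to make these baselines available to the scoring code is a lookup table keyed by market and channel type, sketched below in Go. The values are copied from the table above; the type names and the channel-type keys (`mobile_wallet`, `bank_api`, `card_acquirer`) are hypothetical identifiers chosen for the example.

```go
package main

import "fmt"

// BaselineKey identifies a row in the SLA baseline table.
// The channel-type strings used below are illustrative identifiers.
type BaselineKey struct {
	Market      string
	ChannelType string
}

// Baseline holds the expected availability and P95 latency for one row.
type Baseline struct {
	AvailabilityPct float64
	P95LatencyMs    int
}

// baselines mirrors the per-channel SLA baseline table.
var baselines = map[BaselineKey]Baseline{
	{"PK", "mobile_wallet"}: {99.0, 3000},
	{"PK", "bank_api"}:      {98.5, 5000},
	{"PK", "card_acquirer"}: {99.5, 2000},
	{"BD", "mobile_wallet"}: {98.5, 4000},
	{"BD", "bank_api"}:      {98.0, 6000},
	{"NP", "mobile_wallet"}: {98.0, 4000},
	{"IQ", "mobile_wallet"}: {97.0, 5000},
}

// Lookup returns the baseline for a market and channel type.
// The boolean is false when no baseline is defined for that combination.
func Lookup(market, channelType string) (Baseline, bool) {
	b, ok := baselines[BaselineKey{market, channelType}]
	return b, ok
}

func main() {
	if b, ok := Lookup("BD", "bank_api"); ok {
		fmt.Printf("BD bank API: %.1f%% availability, %d ms P95\n", b.AvailabilityPct, b.P95LatencyMs)
	}
}
```

Because the baselines are reviewed quarterly, a production service would more plausibly load them from configuration than compile them in; the map is only meant to show the shape of the data.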

5. Status Page Integration

A public-facing channel status page displays:

  • Current health status per channel per market (healthy/degraded/impaired/down)
  • Incident timeline (last 90 days)
  • Scheduled maintenance windows
  • Current and historical uptime percentages

The status page is hosted on Cloudflare Pages (separate from merchant portal) and updated via API from the health monitoring service. Updates propagate within 60 seconds of status change.

6. Merchant-Facing Channel Status API

Merchants can query channel health via API:

GET /v1/channels/health?market=PK

Response:

{
  "market": "PK",
  "timestamp": "2026-04-03T14:30:00Z",
  "channels": [
    {
      "channelId": "jazzcash_pk",
      "name": "JazzCash",
      "status": "healthy",
      "healthScore": 0.95,
      "p95LatencyMs": 1850,
      "availabilityPct": 99.7,
      "successRatePct": 98.2
    }
  ]
}

The endpoint is rate limited to 10 requests per minute per merchant, and responses are cached with a 30-second TTL.
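On the merchant side, the response can be decoded with Go's standard `encoding/json` package. The sketch below is illustrative: the JSON field names come from the sample response above, while the Go type and function names are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ChannelHealth matches one entry of the "channels" array in the sample response.
type ChannelHealth struct {
	ChannelID       string  `json:"channelId"`
	Name            string  `json:"name"`
	Status          string  `json:"status"`
	HealthScore     float64 `json:"healthScore"`
	P95LatencyMs    int     `json:"p95LatencyMs"`
	AvailabilityPct float64 `json:"availabilityPct"`
	SuccessRatePct  float64 `json:"successRatePct"`
}

// HealthResponse matches the top-level response body.
type HealthResponse struct {
	Market    string          `json:"market"`
	Timestamp time.Time       `json:"timestamp"` // RFC 3339, as in the sample
	Channels  []ChannelHealth `json:"channels"`
}

// ParseHealth decodes a response body from GET /v1/channels/health.
func ParseHealth(body []byte) (HealthResponse, error) {
	var r HealthResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	body := []byte(`{"market":"PK","timestamp":"2026-04-03T14:30:00Z","channels":[
		{"channelId":"jazzcash_pk","name":"JazzCash","status":"healthy",
		 "healthScore":0.95,"p95LatencyMs":1850,"availabilityPct":99.7,"successRatePct":98.2}]}`)
	r, err := ParseHealth(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, c := range r.Channels {
		fmt.Println(r.Market, c.Name, c.Status, c.HealthScore)
	}
}
```

Since responses are cached for 30 seconds, polling once every 30 seconds per market (2 requests per minute) sees every cache refresh while staying well inside the limit of 10 requests per minute.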

7. Grafana Dashboards

Three mandatory dashboards per market:

1. Channel Overview:

  • Health scores for all channels in market (heatmap)
  • Real-time success rate and latency per channel
  • Transaction volume per channel (throughput)

2. Incident Timeline:

  • Degradation and outage events plotted on timeline
  • Duration and impact (estimated failed transactions)
  • Correlation with routing failover events

3. Trend Analysis:

  • 30-day availability trend per channel
  • Latency percentile trends (P50, P95, P99)
  • Success rate trends with anomaly highlighting

Dashboards are sourced from Prometheus metrics exported by the health monitoring service. Retention: raw metrics for 30 days, 5-minute aggregates for 1 year.

8. Alerting

Alert | Condition | Recipient | Notification Channel
Channel degraded | Health score drops below 0.70 | On-call engineer | PagerDuty
Channel down | Health score drops below 0.50 | On-call engineer + Engineering Lead | PagerDuty + Slack
Prolonged degradation | Degraded for >30 minutes | Market Operations Lead | Slack + Email
Recovery | Channel returns to healthy | On-call engineer | Slack
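The alert table can be read as a function from a score change to a set of notifications, sketched below in Go. The `Alert` type and `AlertsFor` function are hypothetical. The sketch assumes that "returns to healthy" means the score crossing back to 0.90 or above, that the caller tracks how long a channel has been degraded, and that de-duplication of repeated alerts is handled elsewhere.

```go
package main

import (
	"fmt"
	"time"
)

// Alert describes one notification to send. The type is illustrative.
type Alert struct {
	Name       string
	Recipients []string
	Via        []string
}

// AlertsFor compares the previous and current scores and returns the alerts to raise.
// Assumptions: thresholds are crossed when the previous score was at or above the
// threshold and the current one is below it; degradedFor is tracked by the caller.
func AlertsFor(prev, cur float64, degradedFor time.Duration) []Alert {
	var out []Alert
	if prev >= 0.70 && cur < 0.70 {
		out = append(out, Alert{"Channel degraded",
			[]string{"On-call engineer"}, []string{"PagerDuty"}})
	}
	if prev >= 0.50 && cur < 0.50 {
		out = append(out, Alert{"Channel down",
			[]string{"On-call engineer", "Engineering Lead"}, []string{"PagerDuty", "Slack"}})
	}
	if degradedFor > 30*time.Minute {
		out = append(out, Alert{"Prolonged degradation",
			[]string{"Market Operations Lead"}, []string{"Slack", "Email"}})
	}
	if prev < 0.90 && cur >= 0.90 {
		out = append(out, Alert{"Recovery",
			[]string{"On-call engineer"}, []string{"Slack"}})
	}
	return out
}

func main() {
	// A drop from 0.82 to 0.45 crosses both thresholds in one evaluation.
	for _, a := range AlertsFor(0.82, 0.45, 0) {
		fmt.Println(a.Name, a.Recipients, a.Via)
	}
}
```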

9. Data Collection

Health metrics are collected from two sources:

  • Active probes: synthetic transactions (lightweight balance-check or status endpoint) every 30 seconds per channel
  • Passive monitoring: real transaction outcomes tagged with channel, latency, and status

Both signals feed into the health score computation.

Active probes use dedicated test credentials per channel and MUST NOT affect production transaction counts or merchant billing.

10. Implementation Requirements

  • Health monitoring service deployed as a Go microservice
  • Metrics exported to Prometheus via OpenTelemetry
  • Health scores cached in Redis with 30-second TTL
  • NSQ events published on status transitions for downstream consumers (routing engine, notification engine, status page)
  • All health data retained for 1 year for SLA reporting (STD-PRODUCT-106)
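Publishing an event only on status transitions, rather than on every 30-second recomputation, can be sketched in Go as follows. Everything here is illustrative: the `Publisher` interface, the `Tracker` type, the event payload, and the topic name `channel.status.changed` are assumptions, and the in-memory publisher stands in for a real NSQ producer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Publisher is a hypothetical abstraction over the message bus client.
type Publisher interface {
	Publish(topic string, body []byte) error
}

// Transition is an assumed event payload; this standard does not define the schema.
type Transition struct {
	ChannelID string `json:"channelId"`
	From      string `json:"from"`
	To        string `json:"to"`
}

// Tracker remembers the last status per channel and publishes only on change.
type Tracker struct {
	pub  Publisher
	last map[string]string
}

func NewTracker(p Publisher) *Tracker {
	return &Tracker{pub: p, last: make(map[string]string)}
}

// Observe records the latest status and publishes an event if it differs
// from the previous one. The first observation of a channel publishes nothing.
func (t *Tracker) Observe(channelID, status string) error {
	prev, seen := t.last[channelID]
	t.last[channelID] = status
	if !seen || prev == status {
		return nil
	}
	body, err := json.Marshal(Transition{ChannelID: channelID, From: prev, To: status})
	if err != nil {
		return err
	}
	return t.pub.Publish("channel.status.changed", body) // topic name is an assumption
}

// memPublisher collects messages in memory for demonstration.
type memPublisher struct{ msgs [][]byte }

func (m *memPublisher) Publish(_ string, body []byte) error {
	m.msgs = append(m.msgs, body)
	return nil
}

func main() {
	p := &memPublisher{}
	tr := NewTracker(p)
	for _, s := range []string{"healthy", "healthy", "degraded", "degraded", "healthy"} {
		if err := tr.Observe("jazzcash_pk", s); err != nil {
			fmt.Println("publish failed:", err)
		}
	}
	for _, m := range p.msgs {
		fmt.Println(string(m))
	}
}
```

Keeping the publisher behind a small interface lets the routing engine, notification engine, and status page consumers be tested without a running message bus.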