STD-PRODUCT-105: Channel Health Monitoring Standard
Version: 1.0.0
Effective Date: 2026-04-03
Owner: CDO / Architecture
Status: Active
Applicability: All Simpaisa payment channels across PK, BD, NP, and IQ markets
Compliance: All channel integrations MUST implement health monitoring as defined in this standard. New channel integrations MUST comply before production launch. Existing channels MUST comply within 60 days of this standard's effective date.
1. Purpose
This standard defines how Simpaisa monitors the health of every payment channel (mobile wallets, bank APIs, card acquirers, remittance corridors) in real time. Channel health scores drive routing decisions (ADR-PRODUCT-2026-04-104), merchant-facing status information, and operational alerting.
2. Health Score Computation
Each channel receives a composite health score (0.0–1.0) computed from four weighted metrics:
| Metric | Weight | Calculation |
|---|---|---|
| Availability | 30% | Successful responses / total requests (5-minute window) |
| P95 Latency | 25% | 95th percentile response time vs channel-specific SLA |
| Success Rate | 30% | Successful transactions / total attempts (5-minute window) |
| Error Rate | 15% | Provider errors / total requests (5-minute window) |
Score formula:
health = (availability * 0.30) + (latency_score * 0.25) + (success_rate * 0.30) + ((1 - error_rate) * 0.15)
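Where latency_score = max(0, 1 - (p95_latency / sla_latency_threshold)), so the latency component falls linearly from 1 at zero latency to 0 once P95 latency reaches the channel's SLA threshold. availability, success_rate, and error_rate are fractions between 0 and 1.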
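The formula above can be sketched in Go as follows. This is a minimal illustration, not the service's actual code: the `ChannelMetrics` type, its field names, and the handling of a missing SLA threshold are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// ChannelMetrics holds the four inputs to the composite score.
// Ratios are fractions in [0, 1]; latencies are in milliseconds.
// The type and field names are hypothetical.
type ChannelMetrics struct {
	Availability float64
	SuccessRate  float64
	ErrorRate    float64
	P95LatencyMs float64
	SLALatencyMs float64
}

// LatencyScore returns max(0, 1 - p95/sla).
func LatencyScore(p95Ms, slaMs float64) float64 {
	if slaMs <= 0 {
		return 0 // assumption: treat a missing SLA threshold as worst case
	}
	return math.Max(0, 1-p95Ms/slaMs)
}

// HealthScore applies the weights 0.30 / 0.25 / 0.30 / 0.15 from the standard.
func HealthScore(m ChannelMetrics) float64 {
	return m.Availability*0.30 +
		LatencyScore(m.P95LatencyMs, m.SLALatencyMs)*0.25 +
		m.SuccessRate*0.30 +
		(1-m.ErrorRate)*0.15
}

func main() {
	m := ChannelMetrics{
		Availability: 0.997,
		SuccessRate:  0.982,
		ErrorRate:    0.01,
		P95LatencyMs: 1850,
		SLALatencyMs: 3000,
	}
	fmt.Printf("%.3f\n", HealthScore(m)) // prints 0.838
}
```

With these sample inputs, a P95 of 1,850 ms against a 3,000 ms threshold gives a latency component of about 0.38, so the composite lands near 0.838 even though availability and success rate are both high.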
Score interpretation:
| Score Range | Status | Routing Impact |
|---|---|---|
| ≥ 0.90 | Healthy | Full traffic eligible |
| 0.70 to < 0.90 | Degraded | Reduced preference in routing |
| 0.50 to < 0.70 | Impaired | Only used if no healthy alternative |
| < 0.50 | Down | Excluded from routing |
Scores are recomputed every 30 seconds and published to an in-memory cache accessible by the routing engine.
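A short Go sketch of the score-to-status mapping in the table above. It assumes half-open bands, so a score such as 0.895 is treated as degraded rather than falling between two rows; the `Status` type and `Classify` function are hypothetical names.

```go
package main

import "fmt"

// Status is one of the four statuses defined by the standard.
type Status string

const (
	Healthy  Status = "healthy"
	Degraded Status = "degraded"
	Impaired Status = "impaired"
	Down     Status = "down"
)

// Classify maps a composite health score to a status.
// Assumption: each band includes its lower bound and excludes its upper bound.
func Classify(score float64) Status {
	switch {
	case score >= 0.90:
		return Healthy
	case score >= 0.70:
		return Degraded
	case score >= 0.50:
		return Impaired
	default:
		return Down
	}
}

func main() {
	for _, s := range []float64{0.95, 0.895, 0.70, 0.62, 0.10} {
		fmt.Printf("%.3f -> %s\n", s, Classify(s))
	}
}
```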
3. Degradation Detection
Automated degradation detection triggers when any of the following conditions are met:
| Condition | Detection Rule | Action |
|---|---|---|
| Consecutive failures | 3 consecutive failed requests | Mark channel as impaired |
| Error rate spike | >10% error rate in a 5-minute window | Mark channel as degraded |
| Latency spike | P95 latency >3x SLA threshold for 2 minutes | Mark channel as degraded |
| Total outage | 0 successful responses for 60 seconds | Mark channel as down |
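The detection rules in the table can be sketched as a single Go function. The `Signals` type and its field names are hypothetical, and the standard does not say which rule wins when several match at once; this sketch assumes the most severe status takes precedence.

```go
package main

import "fmt"

// Signals summarises recent observations for one channel.
// Field names are hypothetical.
type Signals struct {
	ConsecutiveFailures   int
	ErrorRate5m           float64 // fraction of requests with provider errors over 5 minutes
	SecondsP95Above3xSLA  float64 // how long P95 latency has stayed above 3x the SLA threshold
	SecondsWithoutSuccess float64 // time since the last successful response, while requests were sent
}

// Detect returns the status forced by the detection rules,
// or "" when no rule fires.
// Assumption: rules are checked from most to least severe.
func Detect(s Signals) string {
	switch {
	case s.SecondsWithoutSuccess >= 60:
		return "down"
	case s.ConsecutiveFailures >= 3:
		return "impaired"
	case s.ErrorRate5m > 0.10, s.SecondsP95Above3xSLA >= 120:
		return "degraded"
	default:
		return ""
	}
}

func main() {
	fmt.Println(Detect(Signals{ConsecutiveFailures: 3}))                      // impaired
	fmt.Println(Detect(Signals{ErrorRate5m: 0.12}))                           // degraded
	fmt.Println(Detect(Signals{ErrorRate5m: 0.12, SecondsWithoutSuccess: 75})) // down
}
```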
Recovery detection: a channel returns to healthy status when the following conditions are met:
- Success rate exceeds 95% for 5 consecutive minutes
- P95 latency returns below SLA threshold

After recovery, traffic is ramped up gradually: 10% → 25% → 50% → 100% over 10 minutes.
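The post-recovery ramp-up (10% → 25% → 50% → 100% over 10 minutes) could be expressed as a step function. The standard gives the steps and total duration but not the spacing, so this sketch assumes evenly spaced steps, with 100% reached at the 10-minute mark; `TrafficShare` is a hypothetical name.

```go
package main

import "fmt"

// rampSteps are the traffic shares from the standard.
var rampSteps = []float64{0.10, 0.25, 0.50, 1.00}

// rampTotalSeconds is the 10-minute ramp duration.
const rampTotalSeconds = 600.0

// TrafficShare returns the fraction of eligible traffic a recovered channel
// may receive, given the seconds elapsed since recovery was detected.
// Assumption: steps are evenly spaced (every 200 seconds here).
func TrafficShare(elapsedSeconds float64) float64 {
	if elapsedSeconds < 0 {
		return 0
	}
	interval := rampTotalSeconds / float64(len(rampSteps)-1)
	idx := int(elapsedSeconds / interval)
	if idx >= len(rampSteps) {
		idx = len(rampSteps) - 1
	}
	return rampSteps[idx]
}

func main() {
	for _, t := range []float64{0, 200, 400, 599, 600, 900} {
		fmt.Printf("t=%4.0fs share=%.2f\n", t, TrafficShare(t))
	}
}
```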
4. Per-Channel SLA Baselines
Each channel has market-specific SLA baselines used for scoring:
| Market | Channel Type | Expected Availability | Expected P95 Latency |
|---|---|---|---|
| PK | Mobile Wallet (JazzCash, Easypaisa) | 99.0% | 3,000ms |
| PK | Bank API (HBL, UBL) | 98.5% | 5,000ms |
| PK | Card Acquirer | 99.5% | 2,000ms |
| BD | Mobile Wallet (bKash, Nagad) | 98.5% | 4,000ms |
| BD | Bank API | 98.0% | 6,000ms |
| NP | Mobile Wallet (eSewa, Khalti) | 98.0% | 4,000ms |
| IQ | Mobile Wallet (Zain Cash) | 97.0% | 5,000ms |
Baselines are reviewed quarterly and updated based on observed provider performance.
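One way to make these baselines available to the scoring code is a lookup table keyed by market and channel type. The sketch below copies the values from the table above; the `Baseline` type, the `LookupBaseline` function, and the channel-type identifiers (`mobile_wallet`, `bank_api`, `card_acquirer`) are hypothetical.

```go
package main

import "fmt"

// Baseline is the expected performance of a channel type in one market.
type Baseline struct {
	Availability float64 // expected availability as a fraction
	P95LatencyMs int     // expected P95 latency, used as the SLA threshold
}

// baselineKey identifies a row of the baseline table.
type baselineKey struct {
	Market      string
	ChannelType string
}

// baselines mirrors the table in this section.
var baselines = map[baselineKey]Baseline{
	{"PK", "mobile_wallet"}: {0.990, 3000},
	{"PK", "bank_api"}:      {0.985, 5000},
	{"PK", "card_acquirer"}: {0.995, 2000},
	{"BD", "mobile_wallet"}: {0.985, 4000},
	{"BD", "bank_api"}:      {0.980, 6000},
	{"NP", "mobile_wallet"}: {0.980, 4000},
	{"IQ", "mobile_wallet"}: {0.970, 5000},
}

// LookupBaseline returns the baseline for a market and channel type,
// and false when the table has no matching row.
func LookupBaseline(market, channelType string) (Baseline, bool) {
	b, ok := baselines[baselineKey{market, channelType}]
	return b, ok
}

func main() {
	if b, ok := LookupBaseline("PK", "mobile_wallet"); ok {
		fmt.Printf("availability=%.3f p95=%dms\n", b.Availability, b.P95LatencyMs)
	}
}
```

Because baselines are reviewed quarterly, a real deployment would more likely load them from configuration than compile them in.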
5. Status Page Integration
A public-facing channel status page displays:
- Current health status per channel per market (healthy/degraded/impaired/down)
- Incident timeline (last 90 days)
- Scheduled maintenance windows
- Current and historical uptime percentages
The status page is hosted on Cloudflare Pages (separate from merchant portal) and updated via API from the health monitoring service. Updates propagate within 60 seconds of status change.
6. Merchant-Facing Channel Status API
Merchants can query channel health via API:
GET /v1/channels/health?market=PK
Example response (values are illustrative):
{
  "market": "PK",
  "timestamp": "2026-04-03T14:30:00Z",
  "channels": [
    {
      "channelId": "jazzcash_pk",
      "name": "JazzCash",
      "status": "healthy",
      "healthScore": 0.95,
      "p95LatencyMs": 1850,
      "availabilityPct": 99.7,
      "successRatePct": 98.2
    }
  ]
}
The endpoint is rate limited to 10 requests per minute per merchant, and responses are cached with a 30-second TTL.
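A merchant integration might decode this response as sketched below. The JSON field names come from the example response; the Go type names and the `ParseHealthResponse` and `HealthyChannelIDs` helpers are hypothetical, and the payload is embedded as a constant so the example runs without network access.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ChannelHealth mirrors one entry of the "channels" array.
type ChannelHealth struct {
	ChannelID       string  `json:"channelId"`
	Name            string  `json:"name"`
	Status          string  `json:"status"`
	HealthScore     float64 `json:"healthScore"`
	P95LatencyMs    int     `json:"p95LatencyMs"`
	AvailabilityPct float64 `json:"availabilityPct"`
	SuccessRatePct  float64 `json:"successRatePct"`
}

// HealthResponse mirrors the body of GET /v1/channels/health.
type HealthResponse struct {
	Market    string          `json:"market"`
	Timestamp time.Time       `json:"timestamp"`
	Channels  []ChannelHealth `json:"channels"`
}

// samplePayload is the example response from this section.
const samplePayload = `{
  "market": "PK",
  "timestamp": "2026-04-03T14:30:00Z",
  "channels": [
    {"channelId": "jazzcash_pk", "name": "JazzCash", "status": "healthy",
     "healthScore": 0.95, "p95LatencyMs": 1850,
     "availabilityPct": 99.7, "successRatePct": 98.2}
  ]
}`

// ParseHealthResponse decodes a response body.
func ParseHealthResponse(body []byte) (HealthResponse, error) {
	var r HealthResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

// HealthyChannelIDs returns the IDs of channels reported as healthy.
func HealthyChannelIDs(r HealthResponse) []string {
	var ids []string
	for _, c := range r.Channels {
		if c.Status == "healthy" {
			ids = append(ids, c.ChannelID)
		}
	}
	return ids
}

func main() {
	r, err := ParseHealthResponse([]byte(samplePayload))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(r.Market, HealthyChannelIDs(r))
}
```

Given the 30-second cache TTL and the limit of 10 requests per minute, polling more often than once every 30 seconds would not be expected to return fresher data.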
7. Grafana Dashboards
Three mandatory dashboards per market:
1. Channel Overview:
   - Health scores for all channels in market (heatmap)
   - Real-time success rate and latency per channel
   - Transaction volume per channel (throughput)
2. Incident Timeline:
   - Degradation and outage events plotted on timeline
   - Duration and impact (estimated failed transactions)
   - Correlation with routing failover events
3. Trend Analysis:
   - 30-day availability trend per channel
   - Latency percentile trends (P50, P95, P99)
   - Success rate trends with anomaly highlighting
Dashboards sourced from Prometheus metrics exported by the health monitoring service. Retention: raw metrics 30 days, 5-minute aggregates 1 year.
8. Alerting
| Alert | Condition | Recipient | Channel |
|---|---|---|---|
| Channel degraded | Health score drops below 0.70 | On-call engineer | PagerDuty |
| Channel down | Health score drops below 0.50 | On-call engineer + Engineering Lead | PagerDuty + Slack |
| Prolonged degradation | Degraded for >30 minutes | Market Operations Lead | Slack + Email |
| Recovery | Channel returns to healthy | On-call engineer | Slack |
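The score-based rows of the alert table can be read as threshold crossings between two consecutive score computations. The sketch below is one hedged interpretation: it assumes alerts fire only on a crossing (not on every computation below a threshold), that the most severe crossing wins, and that "returns to healthy" means the score rises back to 0.90 or above. The `Alert` type and `AlertsOnTransition` function are hypothetical, and the time-based "Prolonged degradation" rule is left out.

```go
package main

import "fmt"

// Alert describes who is notified and through which channels.
type Alert struct {
	Name       string
	Recipients []string
	Channels   []string
}

// AlertsOnTransition compares the previous and current health scores and
// returns the alerts to raise. Assumption: edge-triggered, most severe first.
func AlertsOnTransition(prev, curr float64) []Alert {
	switch {
	case prev >= 0.50 && curr < 0.50:
		return []Alert{{
			Name:       "Channel down",
			Recipients: []string{"On-call engineer", "Engineering Lead"},
			Channels:   []string{"PagerDuty", "Slack"},
		}}
	case prev >= 0.70 && curr < 0.70:
		return []Alert{{
			Name:       "Channel degraded",
			Recipients: []string{"On-call engineer"},
			Channels:   []string{"PagerDuty"},
		}}
	case prev < 0.90 && curr >= 0.90:
		return []Alert{{
			Name:       "Recovery",
			Recipients: []string{"On-call engineer"},
			Channels:   []string{"Slack"},
		}}
	}
	return nil
}

func main() {
	for _, a := range AlertsOnTransition(0.95, 0.40) {
		fmt.Println(a.Name, a.Recipients, a.Channels)
	}
}
```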
9. Data Collection
Health metrics are collected from two sources:
- Active probes: synthetic transactions (lightweight balance-check or status endpoint) every 30 seconds per channel
- Passive monitoring: real transaction outcomes tagged with channel, latency, and status

Both signals feed into the health score computation.
Active probes use dedicated test credentials per channel and MUST NOT affect production transaction counts or merchant billing.
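As an illustration of how probe results and real transaction outcomes could feed the same 5-minute window, here is a minimal sliding-window sketch. The standard does not specify how the two signals are combined; this example assumes they are weighted equally, and the `Sample` and `SlidingWindow` types are hypothetical. The `Probe` flag is kept on each sample so that synthetic traffic can be filtered out of production counts and billing elsewhere.

```go
package main

import "fmt"

// Sample is one observed outcome for a channel.
type Sample struct {
	AtUnix int64 // observation time, Unix seconds
	OK     bool  // true when the request succeeded
	Probe  bool  // true for synthetic probes, false for real transactions
}

// SlidingWindow keeps the samples seen within the last SpanSeconds.
type SlidingWindow struct {
	SpanSeconds int64
	samples     []Sample
}

// Record adds a sample from either source.
func (w *SlidingWindow) Record(s Sample) {
	w.samples = append(w.samples, s)
}

// SuccessRatio drops samples older than the window and returns
// successes / total for the rest. The second result is false when
// the window holds no samples.
// Assumption: probes and real transactions count equally.
func (w *SlidingWindow) SuccessRatio(nowUnix int64) (float64, bool) {
	cutoff := nowUnix - w.SpanSeconds
	kept := w.samples[:0]
	total, success := 0, 0
	for _, s := range w.samples {
		if s.AtUnix <= cutoff {
			continue // expired
		}
		kept = append(kept, s)
		total++
		if s.OK {
			success++
		}
	}
	w.samples = kept
	if total == 0 {
		return 0, false
	}
	return float64(success) / float64(total), true
}

func main() {
	w := &SlidingWindow{SpanSeconds: 300}
	w.Record(Sample{AtUnix: 0, OK: false, Probe: true}) // expires before t=400
	w.Record(Sample{AtUnix: 200, OK: true})
	w.Record(Sample{AtUnix: 350, OK: false})
	ratio, ok := w.SuccessRatio(400)
	fmt.Println(ratio, ok) // prints 0.5 true
}
```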
10. Implementation Requirements
- Health monitoring service deployed as a Go microservice
- Metrics exported to Prometheus via OpenTelemetry
- Health scores cached in Redis with 30-second TTL
- NSQ events published on status transitions for downstream consumers (routing engine, notification engine, status page)
- All health data retained for 1 year for SLA reporting (STD-PRODUCT-106)
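The status-transition events could be sketched as follows. The event fields, the topic name `channel.health.status_changed`, and the `Publisher` interface are all hypothetical; the interface is shaped so that an NSQ producer client exposing a `Publish(topic, body)` method could be plugged in, and an in-memory stand-in keeps the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StatusEvent is a hypothetical payload for a status transition.
type StatusEvent struct {
	ChannelID   string  `json:"channelId"`
	From        string  `json:"from"`
	To          string  `json:"to"`
	HealthScore float64 `json:"healthScore"`
	Timestamp   string  `json:"timestamp"`
}

// Publisher abstracts the message queue client.
type Publisher interface {
	Publish(topic string, body []byte) error
}

// statusTopic is an assumed topic name, not one defined by the standard.
const statusTopic = "channel.health.status_changed"

// PublishTransition publishes an event only when the status actually changed.
// It reports whether an event was sent.
func PublishTransition(p Publisher, ev StatusEvent) (bool, error) {
	if ev.From == ev.To {
		return false, nil // not a transition, nothing to publish
	}
	body, err := json.Marshal(ev)
	if err != nil {
		return false, err
	}
	if err := p.Publish(statusTopic, body); err != nil {
		return false, err
	}
	return true, nil
}

// memoryPublisher records published messages; it stands in for a real client.
type memoryPublisher struct {
	topics []string
	bodies [][]byte
}

func (m *memoryPublisher) Publish(topic string, body []byte) error {
	m.topics = append(m.topics, topic)
	m.bodies = append(m.bodies, body)
	return nil
}

func main() {
	p := &memoryPublisher{}
	sent, err := PublishTransition(p, StatusEvent{
		ChannelID:   "jazzcash_pk",
		From:        "healthy",
		To:          "degraded",
		HealthScore: 0.82,
		Timestamp:   "2026-04-03T14:30:00Z",
	})
	fmt.Println(sent, err, string(p.bodies[0]))
}
```

Publishing only on transitions, rather than on every 30-second recomputation, keeps downstream consumers (routing engine, notification engine, status page) from processing redundant events.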