STD-PRODUCT-105: Channel Health Monitoring Standard

Version: 1.0.0
Effective Date: 2026-04-03
Owner: CDO / Architecture
Status: Active
Applicability: All Simpaisa payment channels across PK, BD, NP, and IQ markets

Compliance: All channel integrations MUST implement health monitoring as defined in this standard. New channel integrations MUST comply before production launch. Existing channels MUST comply within 60 days of this standard's effective date.

1. Purpose

This standard defines how Simpaisa monitors the health of every payment channel (mobile wallets, bank APIs, card acquirers, remittance corridors) in real time. Channel health scores drive routing decisions (ADR-PRODUCT-2026-04-104), merchant-facing status information, and operational alerting.

2. Health Score Computation

Each channel receives a composite health score (0.0–1.0) computed from four weighted metrics:

Metric	Weight	Calculation
Availability	30%	Successful responses / total requests (5-minute window)
P95 Latency	25%	95th percentile response time vs channel-specific SLA
Success Rate	30%	Successful transactions / total attempts (5-minute window)
Error Rate	15%	Provider errors / total requests (5-minute window)

Score formula:

health = (availability * 0.30) + (latency_score * 0.25) + (success_rate * 0.30) + ((1 - error_rate) * 0.15)

Where latency_score = max(0, 1 - (p95_latency / sla_latency_threshold)).
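The formula can be sketched in Python as follows. The function and parameter names are illustrative and not part of this standard; inputs are assumed to be fractions in [0, 1], with latencies in milliseconds. Note that, by the formula, a P95 latency equal to the SLA threshold already yields a latency score of 0.

```python
def latency_score(p95_latency_ms: float, sla_latency_ms: float) -> float:
    """Return max(0, 1 - p95 / sla), as defined in this section."""
    if sla_latency_ms <= 0:
        raise ValueError("sla_latency_ms must be positive")
    return max(0.0, 1.0 - p95_latency_ms / sla_latency_ms)


def health_score(
    availability: float,
    success_rate: float,
    error_rate: float,
    p95_latency_ms: float,
    sla_latency_ms: float,
) -> float:
    """Weighted composite score: 30% / 25% / 30% / 15%."""
    return (
        availability * 0.30
        + latency_score(p95_latency_ms, sla_latency_ms) * 0.25
        + success_rate * 0.30
        + (1.0 - error_rate) * 0.15
    )
```

For example, availability 0.99, success rate 0.98, error rate 0.01, and a P95 of 600 ms against a 3,000 ms SLA give a latency score of 0.8 and a health score of approximately 0.9395.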

Score interpretation:

Score Range	Status	Routing Impact
0.90–1.00	Healthy	Full traffic eligible
0.70–0.89	Degraded	Reduced preference in routing
0.50–0.69	Impaired	Only used if no healthy alternative
0.00–0.49	Down	Excluded from routing
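A minimal sketch of the mapping from score to status. The table lists ranges to two decimal places; this sketch assumes the lower bound of each band is inclusive, so a value such as 0.895 falls into the lower band. The function name is hypothetical.

```python
def status_for_score(score: float) -> str:
    """Map a composite health score to the status bands in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0.0, 1.0]")
    if score >= 0.90:
        return "healthy"
    if score >= 0.70:
        return "degraded"
    if score >= 0.50:
        return "impaired"
    return "down"
```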

Scores are recomputed every 30 seconds and published to an in-memory cache accessible by the routing engine.

3. Degradation Detection

Automated degradation detection triggers when any of the following conditions are met:

Condition	Detection Rule	Action
Consecutive failures	3 consecutive failed requests	Mark channel as impaired
Error rate spike	>10% error rate in a 5-minute window	Mark channel as degraded
Latency spike	P95 latency >3x SLA threshold for 2 minutes	Mark channel as degraded
Total outage	0 successful responses for 60 seconds	Mark channel as down
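The rules in the table can be sketched as a single evaluation step. The `ChannelWindow` fields and the `detect` function are hypothetical names for counters the monitoring service would maintain. This standard does not say which rule wins when several fire at once; the sketch assumes the most severe status is applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelWindow:
    consecutive_failures: int          # failed requests in a row
    error_rate_5m: float               # fraction, 5-minute window
    seconds_p95_above_3x_sla: float    # how long P95 has exceeded 3x SLA
    seconds_since_last_success: float  # time since last successful response


# Assumption: higher number means more severe.
SEVERITY = {"degraded": 1, "impaired": 2, "down": 3}


def detect(window: ChannelWindow) -> Optional[str]:
    """Return the status to apply, or None if no rule fires."""
    triggered = []
    if window.seconds_since_last_success >= 60:
        triggered.append("down")
    if window.consecutive_failures >= 3:
        triggered.append("impaired")
    if window.error_rate_5m > 0.10:
        triggered.append("degraded")
    if window.seconds_p95_above_3x_sla >= 120:
        triggered.append("degraded")
    return max(triggered, key=SEVERITY.get, default=None)
```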

Recovery detection: a channel returns to healthy status when:

- Success rate exceeds 95% for 5 consecutive minutes
- P95 latency returns below the SLA threshold

After recovery, traffic is ramped up gradually: 10% → 25% → 50% → 100% over 10 minutes.
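The ramp-up could be sketched as a step function of time since recovery. This standard gives the steps and the 10-minute total but not the timing of each step; the sketch assumes evenly spaced steps, with 100% reached at the 10-minute mark. All names are illustrative.

```python
RAMP_STEPS = (0.10, 0.25, 0.50, 1.00)  # traffic share at each step
RAMP_DURATION_S = 600                  # 10 minutes


def traffic_share(seconds_since_recovery: float) -> float:
    """Return the fraction of eligible traffic to send to a recovering channel."""
    if seconds_since_recovery < 0:
        raise ValueError("seconds_since_recovery must be non-negative")
    # Assumption: steps are evenly spaced, so each lasts 200 seconds.
    interval = RAMP_DURATION_S / (len(RAMP_STEPS) - 1)
    index = min(int(seconds_since_recovery // interval), len(RAMP_STEPS) - 1)
    return RAMP_STEPS[index]
```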

4. Per-Channel SLA Baselines

Each channel has market-specific SLA baselines used for scoring:

Market	Channel Type	Expected Availability	Expected P95 Latency
PK	Mobile Wallet (JazzCash, Easypaisa)	99.0%	3,000ms
PK	Bank API (HBL, UBL)	98.5%	5,000ms
PK	Card Acquirer	99.5%	2,000ms
BD	Mobile Wallet (bKash, Nagad)	98.5%	4,000ms
BD	Bank API	98.0%	6,000ms
NP	Mobile Wallet (eSewa, Khalti)	98.0%	4,000ms
IQ	Mobile Wallet (Zain Cash)	97.0%	5,000ms

Baselines are reviewed quarterly and updated based on observed provider performance.
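One way to hold these baselines is a lookup keyed by market and channel type, so the scoring step can fetch the P95 threshold for a channel. The key strings (`mobile_wallet`, `bank_api`, `card_acquirer`) and the function name are assumptions for illustration; the values are copied from the table above.

```python
from typing import NamedTuple


class Baseline(NamedTuple):
    availability: float   # expected availability as a fraction
    p95_latency_ms: int   # expected P95 latency (SLA threshold)


SLA_BASELINES = {
    ("PK", "mobile_wallet"): Baseline(0.990, 3000),
    ("PK", "bank_api"): Baseline(0.985, 5000),
    ("PK", "card_acquirer"): Baseline(0.995, 2000),
    ("BD", "mobile_wallet"): Baseline(0.985, 4000),
    ("BD", "bank_api"): Baseline(0.980, 6000),
    ("NP", "mobile_wallet"): Baseline(0.980, 4000),
    ("IQ", "mobile_wallet"): Baseline(0.970, 5000),
}


def baseline_for(market: str, channel_type: str) -> Baseline:
    """Look up the SLA baseline, failing loudly for undefined combinations."""
    try:
        return SLA_BASELINES[(market, channel_type)]
    except KeyError:
        raise LookupError(f"no SLA baseline for {market}/{channel_type}") from None
```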

5. Status Page Integration

A public-facing channel status page displays:

Current health status per channel per market (healthy/degraded/impaired/down)
Incident timeline (last 90 days)
Scheduled maintenance windows
Current and historical uptime percentages

The status page is hosted on Cloudflare Pages (separate from merchant portal) and updated via API from the health monitoring service. Updates propagate within 60 seconds of status change.

6. Merchant-Facing Channel Status API

Merchants can query channel health via API:

GET /v1/channels/health?market=PK

Response:

{
  "market": "PK",
  "timestamp": "2026-04-03T14:30:00Z",
  "channels": [
    {
      "channelId": "jazzcash_pk",
      "name": "JazzCash",
      "status": "healthy",
      "healthScore": 0.95,
      "p95LatencyMs": 520,
      "availabilityPct": 99.7,
      "successRatePct": 98.2
    }
  ]
}

The endpoint is rate limited to 10 requests per minute per merchant, and responses are cached with a 30-second TTL.
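On the merchant side, the response might be consumed as in the sketch below, which filters channels by status. The helper name is hypothetical, and `easypaisa_pk` is an assumed `channelId` used only to make the example non-trivial; only the fields needed for filtering are shown.

```python
import json

# Trimmed sample body in the documented response shape.
SAMPLE_BODY = """
{
  "market": "PK",
  "timestamp": "2026-04-03T14:30:00Z",
  "channels": [
    {"channelId": "jazzcash_pk", "name": "JazzCash",
     "status": "healthy", "healthScore": 0.95},
    {"channelId": "easypaisa_pk", "name": "Easypaisa",
     "status": "degraded", "healthScore": 0.78}
  ]
}
"""


def channels_with_status(body: str, wanted=("healthy",)):
    """Return the channelIds whose status is one of `wanted`."""
    data = json.loads(body)
    return [c["channelId"] for c in data["channels"] if c["status"] in wanted]
```

Because responses are cached for 30 seconds and the limit is 10 requests per minute, polling more often than once every 30 seconds yields no fresher data.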

7. Grafana Dashboards

Three mandatory dashboards per market:

1. Channel Overview:
   - Health scores for all channels in market (heatmap)
   - Real-time success rate and latency per channel
   - Transaction volume per channel (throughput)

2. Incident Timeline:
   - Degradation and outage events plotted on timeline
   - Duration and impact (estimated failed transactions)
   - Correlation with routing failover events

3. Trend Analysis:
   - 30-day availability trend per channel
   - Latency percentile trends (P50, P95, P99)
   - Success rate trends with anomaly highlighting

Dashboards are sourced from Prometheus metrics exported by the health monitoring service. Retention: raw metrics for 30 days, 5-minute aggregates for 1 year.

8. Alerting

Alert	Condition	Recipient	Channel
Channel degraded	Health score drops below 0.70	On-call engineer	PagerDuty
Channel down	Health score drops below 0.50	On-call engineer + Engineering Lead	PagerDuty + Slack
Prolonged degradation	Degraded for >30 minutes	Market Operations Lead	Slack + Email
Recovery	Channel returns to healthy	On-call engineer	Slack
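The alert table could be evaluated on each score update roughly as sketched below. The function name and tuple layout are illustrative. The sketch assumes that a single drop crossing both 0.70 and 0.50 raises only the more severe alert, and that the caller tracks how long a channel has been degraded; neither detail is specified here, and alert deduplication is left out.

```python
from typing import List, Tuple

Alert = Tuple[str, str, str]  # (alert name, recipient, delivery channel)


def alerts_for(prev_score: float, score: float,
               degraded_minutes: float = 0.0) -> List[Alert]:
    """Return the alerts triggered by moving from prev_score to score."""
    alerts: List[Alert] = []
    if prev_score >= 0.50 > score:
        alerts.append(("Channel down",
                       "On-call engineer + Engineering Lead",
                       "PagerDuty + Slack"))
    elif prev_score >= 0.70 > score:
        alerts.append(("Channel degraded", "On-call engineer", "PagerDuty"))
    elif prev_score < 0.90 <= score:
        alerts.append(("Recovery", "On-call engineer", "Slack"))
    if degraded_minutes > 30:
        alerts.append(("Prolonged degradation",
                       "Market Operations Lead", "Slack + Email"))
    return alerts
```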

9. Data Collection

Health metrics are collected via:

Active probes: synthetic transactions (lightweight balance-check or status endpoint) every 30 seconds per channel
Passive monitoring: real transaction outcomes tagged with channel, latency, and status
Both signals feed into the health score computation

Active probes use dedicated test credentials per channel and MUST NOT affect production transaction counts or merchant billing.
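A sliding-window ratio is one way to turn these samples into the 5-minute metrics used for scoring. The class below is a sketch with hypothetical names. It assumes probe results and real transaction outcomes are weighted equally and recorded in timestamp order; how the two signals are combined is not specified in this standard.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingRatio:
    """Fraction of successful samples within a trailing time window."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, ok: bool) -> None:
        # Called for both active probe results and passive transaction outcomes.
        self._events.append((timestamp, ok))

    def ratio(self, now: float) -> Optional[float]:
        # Evict samples that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        if not self._events:
            return None  # no data in the window
        return sum(ok for _, ok in self._events) / len(self._events)
```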

10. Implementation Requirements

Health monitoring service deployed as a Go microservice
Metrics exported to Prometheus via OpenTelemetry
Health scores cached in Redis with 30-second TTL
NSQ events published on status transitions for downstream consumers (routing engine, notification engine, status page)
All health data retained for 1 year for SLA reporting (STD-PRODUCT-106)
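Publishing only on status transitions, rather than on every 30-second recomputation, could look like the sketch below. The `publish` callback stands in for an NSQ producer, and the topic name `channel.status` and the payload fields are assumptions, not defined by this standard. The first observation of a channel is treated as a transition.

```python
import json
from typing import Callable, Dict


class TransitionPublisher:
    """Emit an event only when a channel's status changes."""

    def __init__(self, publish: Callable[[str, bytes], None],
                 topic: str = "channel.status") -> None:
        self._publish = publish
        self._topic = topic
        self._last: Dict[str, str] = {}

    def observe(self, channel_id: str, status: str, score: float) -> bool:
        """Record the latest status; return True if an event was published."""
        previous = self._last.get(channel_id)
        self._last[channel_id] = status
        if previous == status:
            return False
        payload = json.dumps({
            "channelId": channel_id,
            "previousStatus": previous,
            "status": status,
            "healthScore": score,
        }).encode("utf-8")
        self._publish(self._topic, payload)
        return True
```

In a test, `publish` can be a function that appends to a list, which keeps the transition logic independent of the message broker.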