STD-PRODUCT-106: SLA Management Framework
Version: 1.0.0
Effective Date: 2026-04-03
Owner: CDO / Architecture
Status: Active
Applicability: All Simpaisa products (Pay-Ins, Pay-Outs, Remittances, Cards) across PK, BD, NP, and IQ markets
Compliance: All products MUST have SLA definitions registered in this framework before general availability. SLA tracking MUST be operational within 30 days of product launch. Exceptions require written CDO approval.
1. Purpose
This standard defines how Simpaisa measures, tracks, and reports on Service Level Agreements across all products and markets. It establishes SLA definitions, automated tracking via OpenTelemetry, breach alerting, merchant reporting, and SLA credit calculations for Enterprise tier merchants.
2. SLA Definitions
2.1 Per-Product, Per-Market SLAs
| Product | Market | Availability SLA | P95 Latency SLA | Success Rate SLA |
|---|---|---|---|---|
| Pay-In | PK | 99.50% | 3,000ms | 95.0% |
| Pay-In | BD | 99.00% | 4,000ms | 93.0% |
| Pay-In | NP | 99.00% | 4,000ms | 93.0% |
| Pay-In | IQ | 98.50% | 5,000ms | 90.0% |
| Pay-Out | PK | 99.50% | 5,000ms | 96.0% |
| Pay-Out | BD | 99.00% | 6,000ms | 94.0% |
| Pay-Out | NP | 99.00% | 6,000ms | 94.0% |
| Pay-Out | IQ | 98.50% | 8,000ms | 92.0% |
| Remittance | PK | 99.50% | 8,000ms | 97.0% |
| Remittance | BD | 99.00% | 10,000ms | 96.0% |
| Remittance | NP | 99.00% | 10,000ms | 96.0% |
| Cards | PK | 99.50% | 2,000ms | 96.0% |
SLA targets are reviewed quarterly. Adjustments require CDO approval and 30-day merchant notice.
2.2 Measurement Definitions
- Availability: percentage of 1-minute intervals in the measurement window during which the product API is available. A minute counts as "unavailable" if more than 50% of its requests return 5xx status codes.
- P95 Latency: 95th percentile end-to-end response time, measured from KrakenD request receipt to response dispatch. Excludes client network latency.
- Success Rate: percentage of transactions that reach a terminal success state (completed, settled) out of total initiated transactions. Excludes transactions declined due to payer-side issues (insufficient funds, invalid credentials).
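The three definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SLA Tracking Service's implementation. The names `MinuteSample`, `availability_pct`, `p95_latency_ms`, and `success_rate_pct` are hypothetical. It assumes that a minute with no traffic counts as available, that the nearest-rank method is an acceptable percentile definition, and that payer-side declines are removed from the denominator of the success rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinuteSample:
    """Request counts for one product-market in one 1-minute interval."""
    total_requests: int
    responses_5xx: int


def minute_is_unavailable(sample: MinuteSample) -> bool:
    # A minute is unavailable if more than 50% of its requests return 5xx.
    # Assumption: a minute with no traffic counts as available.
    if sample.total_requests == 0:
        return False
    return sample.responses_5xx / sample.total_requests > 0.5


def availability_pct(samples: list[MinuteSample]) -> float:
    """Share of 1-minute intervals that were available, as a percentage."""
    if not samples:
        raise ValueError("no samples in window")
    unavailable = sum(minute_is_unavailable(s) for s in samples)
    return 100.0 * (len(samples) - unavailable) / len(samples)


def p95_latency_ms(durations_ms: list[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    if not durations_ms:
        raise ValueError("no durations in window")
    ordered = sorted(durations_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def success_rate_pct(succeeded: int, initiated: int, payer_declined: int) -> float:
    """Successful transactions over initiated ones, excluding payer-side declines."""
    eligible = initiated - payer_declined
    if eligible <= 0:
        raise ValueError("no eligible transactions")
    return 100.0 * succeeded / eligible
```

Under these assumptions, a minute in which exactly half of the requests fail with 5xx still counts as available, because the definition requires more than 50%.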
2.3 Measurement Window
- Monthly SLA period: calendar month in UTC
- Excluded periods: pre-announced maintenance windows (minimum 48-hour advance notice), force majeure events documented within 24 hours
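As a rough illustration of the window rules, the sketch below computes how many minutes of a UTC calendar month remain in scope once excluded periods are removed. The function names are hypothetical, and the treatment of overlapping exclusions (merged so that no minute is subtracted twice) is an assumption rather than something this standard specifies.

```python
from datetime import datetime, timedelta, timezone

Period = tuple[datetime, datetime]  # (start, end) in UTC, end exclusive


def month_bounds_utc(year: int, month: int) -> Period:
    """Return the [start, end) bounds of a calendar month in UTC."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year + month // 12, month % 12 + 1, 1, tzinfo=timezone.utc)
    return start, end


def maintenance_notice_ok(announced_at: datetime, window_start: datetime) -> bool:
    """A maintenance window is excludable only with at least 48 hours of notice."""
    return window_start - announced_at >= timedelta(hours=48)


def measured_minutes(year: int, month: int, excluded: list[Period]) -> int:
    """Minutes of the month that fall outside every excluded period."""
    start, end = month_bounds_utc(year, month)
    total = int((end - start).total_seconds() // 60)
    # Clip each period to the month, then walk them in order so that
    # overlapping periods are only counted once.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in excluded if s < end and e > start
    )
    excluded_minutes = 0
    cursor = start
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            excluded_minutes += int((e - s).total_seconds() // 60)
            cursor = e
    return total - excluded_minutes
```

The resulting minute count would serve as the denominator for the monthly availability figure.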
3. Automated Tracking via OpenTelemetry
3.1 Instrumentation
All services MUST export the following metrics via OpenTelemetry:
simpaisa.transaction.duration (histogram, labels: product, market, channel, status)
simpaisa.transaction.count (counter, labels: product, market, channel, status)
simpaisa.api.request.duration (histogram, labels: product, market, endpoint, status_code)
simpaisa.api.request.count (counter, labels: product, market, endpoint, status_code)
simpaisa.api.availability (gauge, labels: product, market, value: 0 or 1 per minute)
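One way to keep services aligned with this list is a conformance check, for example in CI. The sketch below is a hypothetical helper using only the Python standard library. It does not call the OpenTelemetry SDK, and the input shape (metric name mapped to instrument kind and label names) is an assumption about how a service's exported metrics might be summarised.

```python
# Metric catalogue transcribed from the list above:
# name -> (instrument kind, required label names).
REQUIRED_METRICS = {
    "simpaisa.transaction.duration": ("histogram", {"product", "market", "channel", "status"}),
    "simpaisa.transaction.count": ("counter", {"product", "market", "channel", "status"}),
    "simpaisa.api.request.duration": ("histogram", {"product", "market", "endpoint", "status_code"}),
    "simpaisa.api.request.count": ("counter", {"product", "market", "endpoint", "status_code"}),
    "simpaisa.api.availability": ("gauge", {"product", "market"}),
}


def validate_export(exported: dict[str, tuple[str, set[str]]]) -> list[str]:
    """Return a list of problems; an empty list means the export conforms."""
    problems = []
    for name, (kind, labels) in REQUIRED_METRICS.items():
        if name not in exported:
            problems.append(f"{name}: missing")
            continue
        got_kind, got_labels = exported[name]
        if got_kind != kind:
            problems.append(f"{name}: expected {kind}, got {got_kind}")
        missing = labels - set(got_labels)
        if missing:
            problems.append(f"{name}: missing labels {sorted(missing)}")
    return problems
```

Extra labels beyond the required set are tolerated in this sketch; whether they should be rejected is a policy choice left open here.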
3.2 Collection Pipeline
Service (OTel SDK) → OTel Collector → Prometheus (metrics) → Grafana (dashboards)
Prometheus (metrics) → SLA Tracking Service (Go) → SurrealDB (SLA records)
The SLA Tracking Service consumes metrics from Prometheus, computes rolling SLA values, and persists them for reporting and credit calculation.
3.3 Computation Frequency
| Metric | Computation Interval | Retention |
|---|---|---|
| Real-time SLA status | Every 1 minute | 30 days (raw) |
| Hourly SLA aggregate | Every 1 hour | 1 year |
| Daily SLA aggregate | Every 24 hours | 3 years |
| Monthly SLA report | End of month | 7 years |
4. Breach Alerting
4.1 Alert Thresholds
| Alert Level | Condition | Notification |
|---|---|---|
| Warning | SLA metric within 10% of breach threshold for 15 minutes | Slack notification to product channel |
| Breach | SLA metric in breach of its threshold for 5 consecutive minutes | PagerDuty alert to on-call engineer |
| Critical | SLA metric in breach of its threshold for 30 minutes | PagerDuty escalation to Engineering Lead + CDO notification |
| Extended | SLA metric in breach of its threshold for 2 hours | Executive alert + merchant communication triggered |
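The escalation ladder in the table can be sketched as a lookup on the length of the current breach. This is a minimal sketch, assuming that the 30-minute and 2-hour conditions also refer to consecutive minutes and that one boolean per minute is available for each product-market metric. The Warning level is left out because the table does not say how "within 10%" is measured. `ESCALATION` and `alert_level` are hypothetical names.

```python
from typing import Optional

# (minimum consecutive minutes in breach, alert level), most severe first.
ESCALATION = [
    (120, "Extended"),
    (30, "Critical"),
    (5, "Breach"),
]


def consecutive_breach_minutes(in_breach: list[bool]) -> int:
    """Length of the run of breached minutes ending at the most recent sample."""
    run = 0
    for flag in reversed(in_breach):
        if not flag:
            break
        run += 1
    return run


def alert_level(in_breach: list[bool]) -> Optional[str]:
    """Highest alert level reached by the current breach, or None."""
    run = consecutive_breach_minutes(in_breach)
    for minutes, level in ESCALATION:
        if run >= minutes:
            return level
    return None
```

A single healthy minute resets the run in this sketch; a production rule might prefer some tolerance for brief recoveries.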
4.2 Breach Documentation
Every SLA breach MUST be documented with:
- Start time and end time (UTC)
- Affected product(s), market(s), and channel(s)
- Root cause classification (infrastructure, channel provider, code defect, capacity)
- Customer impact (estimated failed/delayed transactions)
- Remediation actions taken
- Prevention measures for recurrence
Breach records are stored in SurrealDB and linked to incident records.
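The required fields could be captured in a record type along the following lines. This is a sketch only: the field names, the `BreachRecord` class, and the JSON shape are hypothetical and do not describe the actual SurrealDB schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime

# Root cause classifications named in this section.
ROOT_CAUSES = {"infrastructure", "channel provider", "code defect", "capacity"}


@dataclass
class BreachRecord:
    start_utc: datetime
    end_utc: datetime
    products: list[str]
    markets: list[str]
    channels: list[str]
    root_cause: str
    estimated_affected_transactions: int
    remediation_actions: list[str]
    prevention_measures: list[str]

    def __post_init__(self) -> None:
        if self.root_cause not in ROOT_CAUSES:
            raise ValueError(f"unknown root cause: {self.root_cause!r}")
        if self.end_utc < self.start_utc:
            raise ValueError("end_utc precedes start_utc")

    def duration_minutes(self) -> int:
        return int((self.end_utc - self.start_utc).total_seconds() // 60)

    def to_json(self) -> str:
        """Serialise with ISO 8601 timestamps."""
        payload = asdict(self)
        payload["start_utc"] = self.start_utc.isoformat()
        payload["end_utc"] = self.end_utc.isoformat()
        return json.dumps(payload)
```

Validating the root cause at construction time keeps free-text classifications out of the record, which makes later trend analysis by cause simpler.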
5. Monthly SLA Reports
5.1 Merchant-Facing Reports
Reports are generated automatically at month-end for all Premium and Enterprise merchants.
Report contents:
- SLA targets vs actual performance per product per market
- Availability timeline (minute-by-minute for breach periods, hourly otherwise)
- P95 latency trend (daily)
- Success rate trend (daily)
- Incident summary (if any breaches occurred)
- SLA credit calculation (Enterprise tier only)
Delivery:
- Available in merchant self-service portal (ADR-PRODUCT-2026-04-099)
- Emailed to merchant's designated finance/operations contact
- PDF format with machine-readable JSON supplement
5.2 Internal Reports
Monthly internal SLA review includes:
- Cross-market comparison dashboards
- Channel-level SLA breakdown (identifying weakest channels)
- Trend analysis (improving/declining per product-market)
- Capacity planning inputs (correlation between volume and SLA performance)
6. SLA Credit Calculation (Enterprise Tier)
6.1 Credit Schedule
When monthly SLA targets are missed, Enterprise merchants receive service credits:
| Availability Achieved | Credit (% of Monthly Fees) |
|---|---|
| 99.0%–99.49% (PK) / 98.5%–98.99% (others) | 5% |
| 98.0%–98.99% (PK) / 97.0%–98.49% (others) | 10% |
| 95.0%–97.99% (PK) / 94.0%–96.99% (others) | 25% |
| Below 95.0% (PK) / Below 94.0% (others) | 50% |
6.2 Credit Rules
- Credits are applied automatically to the next billing cycle
- Maximum credit per month: 50% of monthly fees (not cumulative)
- Credits are calculated per product per market; a breach in PK Pay-In does not trigger credits for BD Pay-Out
- Excluded: breaches caused by merchant-side issues (invalid API calls, exceeded rate limits)
- Credit calculation is automated by the billing engine (ADR-PRODUCT-2026-04-097)
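The schedule and rules above can be sketched as a band lookup. This is an illustrative sketch, not the billing engine's logic. It assumes that no credit is due whenever the achieved availability meets the product-market target from Section 2.1, and that values falling between two printed bands (for example 99.495%) belong to the band whose lower bound they meet. `credit_pct` and `credit_amount` are hypothetical names.

```python
# Lower bound of each credit band, as (minimum availability %, credit %).
# Availability below the last bound earns the maximum credit.
PK_BANDS = [(99.0, 5), (98.0, 10), (95.0, 25)]
OTHER_BANDS = [(98.5, 5), (97.0, 10), (94.0, 25)]
MAX_CREDIT_PCT = 50


def credit_pct(market: str, target: float, achieved: float) -> int:
    """Credit as a percentage of monthly fees for one product-market."""
    if achieved >= target:
        return 0  # assumption: a met target never earns a credit
    bands = PK_BANDS if market == "PK" else OTHER_BANDS
    for lower_bound, pct in bands:
        if achieved >= lower_bound:
            return pct
    return MAX_CREDIT_PCT


def credit_amount(
    monthly_fees: float,
    market: str,
    target: float,
    achieved: float,
    merchant_caused: bool = False,
) -> float:
    """Credit in currency units; breaches caused by the merchant are excluded."""
    if merchant_caused:
        return 0.0
    return monthly_fees * credit_pct(market, target, achieved) / 100
```

Checking the target first matters for markets whose target sits inside the first "others" band: with a 98.50% target, an achieved 98.7% earns no credit in this sketch.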
6.3 Credit Dispute Process
Merchants may dispute SLA calculations within 30 days of report generation. Disputes are reviewed by the operations team, which has access to the raw metric data. The CDO is the final arbiter for disputed credits.
7. Grafana Dashboards
7.1 Required Dashboards
Executive SLA Overview:
- Single-pane view of all product-market SLA status (green/amber/red)
- Month-to-date SLA performance vs target
- Breach count and total downtime minutes
Product Deep-Dive (one per product):
- Real-time availability, latency, and success rate
- Per-channel breakdown
- Historical trend (30/60/90 day)
SLA Credit Tracker (Enterprise):
- Current month projected credits per merchant
- Historical credit payouts
- Correlation with breach incidents
8. Governance
- SLA targets reviewed quarterly at architecture review
- New market launches MUST define SLA targets before go-live
- Channel provider contracts MUST include SLA terms aligned with (or exceeding) Simpaisa's merchant-facing SLAs
- Annual benchmarking against industry standards for each market