
STD-PRODUCT-106: SLA Management Framework

Version: 1.0.0
Effective Date: 2026-04-03
Owner: CDO / Architecture
Status: Active
Applicability: All Simpaisa products (Pay-Ins, Pay-Outs, Remittances, Cards) across PK, BD, NP, and IQ markets

Compliance: All products MUST have SLA definitions registered in this framework before general availability. SLA tracking MUST be operational within 30 days of product launch. Exceptions require written CDO approval.


1. Purpose

This standard defines how Simpaisa measures, tracks, and reports on Service Level Agreements across all products and markets. It establishes SLA definitions, automated tracking via OpenTelemetry, breach alerting, merchant reporting, and SLA credit calculations for Enterprise tier merchants.

2. SLA Definitions

2.1 Per-Product, Per-Market SLAs

Product    | Market | Availability SLA | P95 Latency SLA (ms) | Success Rate SLA
-----------|--------|------------------|----------------------|-----------------
Pay-In     | PK     | 99.50%           | 3,000                | 95.0%
Pay-In     | BD     | 99.00%           | 4,000                | 93.0%
Pay-In     | NP     | 99.00%           | 4,000                | 93.0%
Pay-In     | IQ     | 98.50%           | 5,000                | 90.0%
Pay-Out    | PK     | 99.50%           | 5,000                | 96.0%
Pay-Out    | BD     | 99.00%           | 6,000                | 94.0%
Pay-Out    | NP     | 99.00%           | 6,000                | 94.0%
Pay-Out    | IQ     | 98.50%           | 8,000                | 92.0%
Remittance | PK     | 99.50%           | 8,000                | 97.0%
Remittance | BD     | 99.00%           | 10,000               | 96.0%
Remittance | NP     | 99.00%           | 10,000               | 96.0%
Cards      | PK     | 99.50%           | 2,000                | 96.0%

SLA targets are reviewed quarterly. Adjustments require CDO approval and 30-day merchant notice.

2.2 Measurement Definitions

  • Availability: percentage of 1-minute intervals in the measurement window during which the product API is available. A minute is "unavailable" if more than 50% of the requests received in it return 5xx status codes; otherwise it counts as available.
  • P95 Latency: 95th percentile server-side response time, measured at the KrakenD API gateway from request receipt to response dispatch. Client network latency is therefore excluded.
  • Success Rate: percentage of initiated transactions that reach a terminal success state (completed, settled). Transactions declined for payer-side reasons (insufficient funds, invalid credentials) are excluded from the calculation entirely, so they count neither as successes nor as failures.
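The three measurement rules above can be sketched in Go, the language of the SLA Tracking Service in Section 3.2. This is a minimal illustration under stated assumptions, not the production implementation: all type and function names are hypothetical, minutes with no traffic are assumed to count as available (the standard does not say), and P95 uses the nearest-rank method over raw samples, whereas a real deployment would more likely derive it from the `simpaisa.api.request.duration` histogram.

```go
package main

import "sort"

// MinuteCounts holds the API request totals for one 1-minute interval (hypothetical type).
type MinuteCounts struct {
	Total     int // all requests received in the minute
	ServerErr int // requests that returned a 5xx status code
}

// MinuteAvailable applies the Section 2.2 rule: a minute is unavailable when
// more than 50% of its requests returned 5xx. Zero-traffic minutes are
// treated as available (an assumption).
func MinuteAvailable(m MinuteCounts) bool {
	if m.Total == 0 {
		return true
	}
	return m.ServerErr*2 <= m.Total
}

// AvailabilityPct is the share of measured minutes that were available.
func AvailabilityPct(minutes []MinuteCounts) float64 {
	if len(minutes) == 0 {
		return 100
	}
	up := 0
	for _, m := range minutes {
		if MinuteAvailable(m) {
			up++
		}
	}
	return 100 * float64(up) / float64(len(minutes))
}

// Outcome is a simplified terminal state for one initiated transaction (hypothetical).
type Outcome int

const (
	Succeeded     Outcome = iota // completed or settled
	Failed                       // any other outcome counted against the SLA
	PayerDeclined                // insufficient funds, invalid credentials, etc.
)

// SuccessRatePct drops payer-side declines from both numerator and denominator.
func SuccessRatePct(outcomes []Outcome) float64 {
	ok, counted := 0, 0
	for _, o := range outcomes {
		switch o {
		case PayerDeclined:
			continue // skip this transaction entirely
		case Succeeded:
			ok++
		}
		counted++
	}
	if counted == 0 {
		return 100
	}
	return 100 * float64(ok) / float64(counted)
}

// P95Ms returns the nearest-rank 95th percentile of latencies in milliseconds.
func P95Ms(latencies []float64) float64 {
	if len(latencies) == 0 {
		return 0
	}
	s := append([]float64(nil), latencies...)
	sort.Float64s(s)
	rank := (95*len(s) + 99) / 100 // ceil(0.95 * n), 1-based
	return s[rank-1]
}
```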

2.3 Measurement Window

  • Monthly SLA period: calendar month in UTC
  • Excluded periods: pre-announced maintenance windows (minimum 48-hour advance notice), force majeure events documented within 24 hours

3. Automated Tracking via OpenTelemetry

3.1 Instrumentation

All services MUST export the following metrics via OpenTelemetry:

simpaisa.transaction.duration     (histogram, labels: product, market, channel, status)
simpaisa.transaction.count        (counter, labels: product, market, channel, status)
simpaisa.api.request.duration     (histogram, labels: product, market, endpoint, status_code)
simpaisa.api.request.count        (counter, labels: product, market, endpoint, status_code)
simpaisa.api.availability         (gauge, labels: product, market; value: 0 or 1 per minute)
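One way to enforce the MUST above is a conformance check, run in CI or at the collector, that flags data points missing required labels. The sketch below is hypothetical (function names and error text are illustrative) and checks only label presence and the 0/1 gauge value, not the allowed label vocabularies.

```go
package main

import "fmt"

// requiredLabels mirrors the Section 3.1 metric list.
var requiredLabels = map[string][]string{
	"simpaisa.transaction.duration": {"product", "market", "channel", "status"},
	"simpaisa.transaction.count":    {"product", "market", "channel", "status"},
	"simpaisa.api.request.duration": {"product", "market", "endpoint", "status_code"},
	"simpaisa.api.request.count":    {"product", "market", "endpoint", "status_code"},
	"simpaisa.api.availability":     {"product", "market"},
}

// MissingLabels returns required labels that are absent or empty on a data
// point (hypothetical helper), or an error for a metric outside the standard.
func MissingLabels(metric string, attrs map[string]string) ([]string, error) {
	req, ok := requiredLabels[metric]
	if !ok {
		return nil, fmt.Errorf("metric %q is not defined in STD-PRODUCT-106", metric)
	}
	var missing []string
	for _, l := range req {
		if v, present := attrs[l]; !present || v == "" {
			missing = append(missing, l)
		}
	}
	return missing, nil
}

// ValidAvailabilityValue checks that the availability gauge carries 0 or 1.
func ValidAvailabilityValue(v float64) bool { return v == 0 || v == 1 }
```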

3.2 Collection Pipeline

Service (OTel SDK) → OTel Collector → Prometheus (metrics) → Grafana (dashboards)
                                                           → SLA Tracking Service (Go) → SurrealDB (SLA records)

The SLA Tracking Service consumes metrics from Prometheus, computes rolling SLA values, and persists them for reporting and credit calculation.

3.3 Computation Frequency

Metric               | Computation Interval | Retention
---------------------|----------------------|--------------
Real-time SLA status | Every 1 minute       | 30 days (raw)
Hourly SLA aggregate | Every 1 hour         | 1 year
Daily SLA aggregate  | Every 24 hours       | 3 years
Monthly SLA report   | End of month         | 7 years

4. Breach Alerting

4.1 Alert Thresholds

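Alert Level | Condition                                                 | Notification
------------|-----------------------------------------------------------|-----------------------------------------------------------
Warning     | SLA metric within 10% of breach threshold for 15 minutes  | Slack notification to product channel
Breach      | SLA metric worse than threshold for 5 consecutive minutes | PagerDuty alert to on-call engineer
Critical    | SLA metric worse than threshold for 30 minutes            | PagerDuty escalation to Engineering Lead + CDO notification
Extended    | SLA metric worse than threshold for 2 hours               | Executive alert + merchant communication triggered

"Worse than threshold" means below the target for availability and success rate, and above the target for P95 latency.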
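The escalation ladder can be expressed as a small classifier. This sketch assumes the caller tracks two streaks per product-market metric: consecutive minutes worse than target, and consecutive minutes inside the warning band. How to compute "within 10% of breach threshold" is left to the caller because the table does not define what the 10% is relative to. All names are hypothetical.

```go
package main

// Level is the escalation tier from the Section 4.1 table (hypothetical type).
type Level int

const (
	None Level = iota
	Warning
	Breach
	Critical
	Extended
)

func (l Level) String() string {
	return [...]string{"none", "warning", "breach", "critical", "extended"}[l]
}

// Worse reports whether a value is on the wrong side of its target: below it
// for availability and success rate, above it for P95 latency.
func Worse(value, target float64, higherIsBetter bool) bool {
	if higherIsBetter {
		return value < target
	}
	return value > target
}

// Classify maps the current streaks (whole minutes) to an alert level.
// breachMins: consecutive minutes worse than target.
// nearMins: consecutive minutes in the warning band (caller-defined).
func Classify(breachMins, nearMins int) Level {
	switch {
	case breachMins >= 120:
		return Extended
	case breachMins >= 30:
		return Critical
	case breachMins >= 5:
		return Breach
	case nearMins >= 15:
		return Warning
	default:
		return None
	}
}
```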

4.2 Breach Documentation

Every SLA breach MUST be documented with:
  • Start time and end time (UTC)
  • Affected product(s), market(s), and channel(s)
  • Root cause classification (infrastructure, channel provider, code defect, capacity)
  • Customer impact (estimated failed/delayed transactions)
  • Remediation actions taken
  • Prevention measures for recurrence

Breach records are stored in SurrealDB and linked to the corresponding incident records.

5. Monthly SLA Reports

5.1 Merchant-Facing Reports

Generated automatically at month-end for all Premium and Enterprise merchants:

Report contents:
  • SLA targets vs actual performance per product per market
  • Availability timeline (minute-by-minute for breach periods, hourly otherwise)
  • P95 latency trend (daily)
  • Success rate trend (daily)
  • Incident summary (if any breaches occurred)
  • SLA credit calculation (Enterprise tier only)

Delivery:
  • Available in merchant self-service portal (ADR-PRODUCT-2026-04-099)
  • Emailed to merchant's designated finance/operations contact
  • PDF format with machine-readable JSON supplement

5.2 Internal Reports

Monthly internal SLA review includes:
  • Cross-market comparison dashboards
  • Channel-level SLA breakdown (identifying weakest channels)
  • Trend analysis (improving/declining per product-market)
  • Capacity planning inputs (correlation between volume and SLA performance)

6. SLA Credit Calculation (Enterprise Tier)

6.1 Credit Schedule

When monthly SLA targets are missed, Enterprise merchants receive service credits:

Availability Achieved (PK) | Availability Achieved (BD, NP, IQ) | Credit (% of Monthly Fees)
---------------------------|------------------------------------|---------------------------
99.0%–99.49%               | 98.5%–98.99%                       | 5%
98.0%–98.99%               | 97.0%–98.49%                       | 10%
95.0%–97.99%               | 94.0%–96.99%                       | 25%
Below 95.0%                | Below 94.0%                        | 50%
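The credit schedule can be encoded as a lookup. This sketch works in hundredths of a percent so that tier edges such as 99.49% compare exactly, treats BD, NP and IQ as the non-PK column, and returns no credit when the Section 2.1 availability target is met, following the opening sentence of 6.1. `CreditPct` is a hypothetical name.

```go
package main

// CreditPct returns the service credit (% of monthly fees) for one product
// and market (hypothetical helper). availability and target are hundredths
// of a percent, so 99.50% is 9950.
func CreditPct(market string, availability, target int) int {
	if availability >= target {
		return 0 // target met: no credit
	}
	// Lower bounds of the 5%, 10% and 25% tiers; anything lower earns 50%.
	bounds := [3]int{9850, 9700, 9400} // BD, NP, IQ
	if market == "PK" {
		bounds = [3]int{9900, 9800, 9500}
	}
	switch {
	case availability >= bounds[0]:
		return 5
	case availability >= bounds[1]:
		return 10
	case availability >= bounds[2]:
		return 25
	default:
		return 50
	}
}
```

Because IQ's availability target is 98.50%, the gating check means IQ availability between 98.50% and 98.99% earns no credit, and an IQ miss lands in the 10% tier or lower.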

6.2 Credit Rules

  • Credits apply to the next billing cycle automatically
  • Maximum credit per month: 50% of monthly fees (not cumulative)
  • Credits are calculated per product per market — a breach in PK Pay-In does not trigger credits for BD Pay-Out
  • Excluded: breaches caused by merchant-side issues (invalid API calls, exceeded rate limits)
  • Credit calculation automated by billing engine (ADR-PRODUCT-2026-04-097)
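Applying these rules to an invoice can be sketched as follows, with fees in minor currency units. Each product-market is credited independently, merchant-caused breaches are skipped, and the 50% cap applies per product-market. Reading "not cumulative" as "overlapping credits for the same product-market do not add up; the largest applies" is an assumption, as are the type names and rounding down to whole minor units. The authoritative calculation remains the billing engine's (ADR-PRODUCT-2026-04-097).

```go
package main

// ProductMarket identifies the scope at which credits are calculated.
type ProductMarket struct{ Product, Market string }

// Line is one product-market's monthly fee (minor units) and earned credit
// tier (hypothetical type). MerchantCaused marks excluded breaches.
type Line struct {
	Key            ProductMarket
	MonthlyFee     int64
	CreditPct      int
	MerchantCaused bool
}

// NextCycleCredits returns the credit to apply on the next invoice for each
// product-market, capped at 50% of that product-market's monthly fee.
func NextCycleCredits(lines []Line) map[ProductMarket]int64 {
	const maxPct = 50
	out := map[ProductMarket]int64{}
	for _, l := range lines {
		if l.MerchantCaused || l.CreditPct <= 0 {
			continue
		}
		pct := l.CreditPct
		if pct > maxPct {
			pct = maxPct
		}
		credit := l.MonthlyFee * int64(pct) / 100 // rounds down
		if credit > out[l.Key] {
			out[l.Key] = credit // overlapping credits do not stack (assumption)
		}
	}
	return out
}
```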

6.3 Credit Dispute Process

Merchants may dispute SLA calculations within 30 days of report generation. The operations team reviews each dispute against the raw metric data. The CDO is the final arbiter for disputed credits.

7. Grafana Dashboards

7.1 Required Dashboards

Executive SLA Overview:
  • Single-pane view of all product-market SLA status (green/amber/red)
  • Month-to-date SLA performance vs target
  • Breach count and total downtime minutes

Product Deep-Dive (one per product):
  • Real-time availability, latency, and success rate
  • Per-channel breakdown
  • Historical trend (30/60/90 day)

SLA Credit Tracker (Enterprise):
  • Current month projected credits per merchant
  • Historical credit payouts
  • Correlation with breach incidents
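The green/amber/red tiles on the Executive SLA Overview need a status rule, which the standard does not spell out. The sketch below assumes red means the month-to-date value misses its target, amber means it is within a configurable margin of the target, and green otherwise, with each tile showing the worst of its three metrics. All names and the margin parameter are hypothetical.

```go
package main

// RAG is the traffic-light status of one dashboard tile (hypothetical type).
type RAG string

const (
	Green RAG = "green"
	Amber RAG = "amber"
	Red   RAG = "red"
)

// Status colours one metric from its month-to-date value (assumed rule).
// higherIsBetter is true for availability and success rate, false for P95.
func Status(mtd, target, warnMargin float64, higherIsBetter bool) RAG {
	headroom := mtd - target
	if !higherIsBetter {
		headroom = target - mtd
	}
	switch {
	case headroom < 0:
		return Red
	case headroom < warnMargin:
		return Amber
	default:
		return Green
	}
}

// Worst combines per-metric statuses into one tile status.
func Worst(statuses ...RAG) RAG {
	out := Green
	for _, s := range statuses {
		if s == Red {
			return Red
		}
		if s == Amber {
			out = Amber
		}
	}
	return out
}
```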

8. Governance

  • SLA targets reviewed quarterly at architecture review
  • New market launches MUST define SLA targets before go-live
  • Channel provider contracts MUST include SLA terms aligned with (or exceeding) Simpaisa's merchant-facing SLAs
  • Annual benchmarking against industry standards for each market