STD-INFRA-067: Load Testing Standards
| Owner | Classification | Review Date | Status |
|---|---|---|---|
| Infrastructure | Internal | April 2027 | Active |
| Field | Value |
|---|---|
| Owner | Platform Engineering |
| Approved By | CDO |
| Date | 2026-04-03 |
| Review Cycle | Quarterly |
| Last Review | — |
Purpose
This standard defines mandatory load testing requirements for Simpaisa's payment platform, which processes 270M+ transactions worth over $1B annually across five markets (PK, BD, NP, IQ, EG). Load testing ensures that Go microservices behind KrakenD can handle baseline, peak, and surge traffic without degradation.
Scope
Applies to all services in the payment processing path: KrakenD gateway, Pay-In, Pay-Out, Remittance, and Cards services, plus shared services (merchant-svc, auth-svc). For non-payment services (analytics, reporting), load testing is recommended but not mandatory.
Tooling
- Primary tool: k6 (Go-based, scriptable in JavaScript). Selected for Go ecosystem alignment, CI/CD integration, and support for gRPC and HTTP/2.
- Test scripts: Stored in `infra/loadtest/` in each service repository. Committed alongside application code.
- Results storage: k6 results exported to InfluxDB via the k6-to-influxdb extension. Grafana dashboards for trending.
- Execution environment: Dedicated k6 runners in the `test-*` Kubernetes namespace. Never run load tests against production.
Mandatory Testing Gates
Pre-Release Load Test
Every service release that touches payment endpoints MUST pass a load test before promotion to production. Failures block the release pipeline.
Test Profiles
| Profile | Description | Duration | Target RPS (per service) |
|---|---|---|---|
| Baseline | Normal weekday traffic | 15 min | Per service baseline (see below) |
| Peak | Friday salary-day traffic (PK/BD) | 15 min | 3× baseline |
| Surge | Eid/Black Friday spike | 15 min | 5× baseline |
| Soak | Sustained load to detect memory leaks and connection pool exhaustion | 4 hours | 1.5× baseline |
| Stress | Ramp to breaking point to find capacity ceiling | 30 min (ramp) | Ramp from baseline to 10× |
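The fixed-rate profiles in the table map naturally onto k6's `constant-arrival-rate` executor, which starts iterations at a set rate regardless of response time. The sketch below is illustrative only: it assumes one request per iteration, and the names `PROFILES` and `buildScenario`, along with the VU sizing, are assumptions rather than part of this standard. The Stress profile is omitted because it ramps rather than holding a fixed rate.

```javascript
// Multipliers and durations copied from the Test Profiles table.
const PROFILES = {
  baseline: { multiplier: 1, duration: '15m' },
  peak: { multiplier: 3, duration: '15m' },
  surge: { multiplier: 5, duration: '15m' },
  soak: { multiplier: 1.5, duration: '4h' },
};

// Build a k6 constant-arrival-rate scenario for one profile.
// buildScenario is a hypothetical helper; the VU numbers are illustrative
// guesses and should be tuned per service.
function buildScenario(profileName, baselineRps) {
  const profile = PROFILES[profileName];
  if (!profile) {
    throw new Error(`unknown profile: ${profileName}`);
  }
  const rate = Math.round(baselineRps * profile.multiplier);
  return {
    executor: 'constant-arrival-rate',
    rate, // iterations started per timeUnit
    timeUnit: '1s',
    duration: profile.duration,
    preAllocatedVUs: Math.ceil(rate / 10), // assumption, not a requirement
    maxVUs: rate, // assumption, not a requirement
  };
}

// In a real k6 script this object would be exported as `options`.
const exampleOptions = { scenarios: { peak: buildScenario('peak', 1500) } };
console.log(JSON.stringify(exampleOptions, null, 2));
```

Keeping the multipliers in one table-like object means a change to a profile in this standard needs a one-line change in the scripts.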
Service Baselines
| Service | Baseline RPS | Peak RPS | Surge RPS |
|---|---|---|---|
| KrakenD gateway | 3,000 | 9,000 | 15,000 |
| payin-svc | 1,500 | 4,500 | 7,500 |
| payout-svc | 800 | 2,400 | 4,000 |
| remit-svc | 400 | 1,200 | 2,000 |
| cards-svc | 600 | 1,800 | 3,000 |
| auth-svc | 2,000 | 6,000 | 10,000 |
Baselines are recalculated quarterly based on actual production traffic (see STD-INFRA-069).
Pass/Fail Thresholds
| Metric | Threshold | Action on Breach |
|---|---|---|
| P50 latency | ≤100ms (gateway), ≤50ms (service) | Warning |
| P95 latency | ≤300ms (gateway), ≤150ms (service) | Release blocked |
| P99 latency | ≤1,000ms (gateway), ≤500ms (service) | Release blocked |
| Error rate (5xx) | <0.1% | Release blocked |
| Transaction success rate | ≥99.5% | Release blocked |
| CPU utilisation | <80% at peak | Warning |
| Memory utilisation | <85% at peak | Warning |
| Connection pool exhaustion | 0 occurrences | Release blocked |
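The release-blocking latency and error limits above could be expressed as k6 `thresholds`, which fail the run when breached. This is a minimal sketch under stated assumptions: `buildThresholds` is a hypothetical helper, and `txn_success` is a hypothetical custom Rate metric that the test script would have to define itself. Note that the built-in `http_req_failed` metric by default counts any response outside 200-399 as failed, so it is stricter than the 5xx-only rule in the table. The warning-level rows (P50, CPU, memory) are left out because a k6 threshold breach fails the run, and CPU and memory are not visible to k6.

```javascript
// Release-blocking latency limits in milliseconds, from the table above.
const LATENCY_LIMITS_MS = {
  gateway: { p95: 300, p99: 1000 },
  service: { p95: 150, p99: 500 },
};

// Build a k6 thresholds object for a gateway or service test.
function buildThresholds(tier) {
  const limits = LATENCY_LIMITS_MS[tier];
  if (!limits) {
    throw new Error(`unknown tier: ${tier}`);
  }
  return {
    // Built-in k6 Trend metric for total request duration.
    http_req_duration: [`p(95)<=${limits.p95}`, `p(99)<=${limits.p99}`],
    // Built-in k6 Rate metric; approximates the <0.1% error rule (see lead-in).
    http_req_failed: ['rate<0.001'],
    // Hypothetical custom Rate metric for the >=99.5% success rule.
    txn_success: ['rate>=0.995'],
  };
}

console.log(buildThresholds('service'));
```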
Soak Testing Requirements
Soak tests run for 4 hours at 1.5× baseline and verify:
- Memory stability: RSS does not grow more than 10% over the test duration (detects Go memory leaks, goroutine leaks).
- Connection pool health: Database and Redis connection counts remain stable (no leak, no exhaustion).
- Latency stability: P95 latency does not degrade more than 20% between the first and last hour.
- Error accumulation: No increasing error rate trend over time.
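The two numeric soak criteria (memory growth and latency degradation) can be checked mechanically once the run's metrics are collected. The sketch below assumes a hypothetical input shape with RSS at the start and end of the run and P95 latency for the first and last hour; how those values are pulled from InfluxDB or Grafana is out of scope here. Connection pool health and error-rate trend are not covered, since this standard does not quantify them.

```javascript
const MAX_RSS_GROWTH = 0.10; // 10% over the test duration
const MAX_P95_DEGRADATION = 0.20; // 20% between first and last hour

// Relative change from `before` to `after`, e.g. 0.25 means +25%.
function relativeChange(before, after) {
  return (after - before) / before;
}

// evaluateSoak is a hypothetical helper; the field names are assumptions.
function evaluateSoak(run) {
  const rssGrowth = relativeChange(run.rssStartBytes, run.rssEndBytes);
  const p95Degradation = relativeChange(run.p95FirstHourMs, run.p95LastHourMs);
  const memoryStable = rssGrowth <= MAX_RSS_GROWTH;
  const latencyStable = p95Degradation <= MAX_P95_DEGRADATION;
  return {
    rssGrowth,
    p95Degradation,
    memoryStable,
    latencyStable,
    passed: memoryStable && latencyStable,
  };
}

// Example: memory grows 7.5% (within limit), P95 degrades 25% (over limit).
const soakResult = evaluateSoak({
  rssStartBytes: 400e6,
  rssEndBytes: 430e6,
  p95FirstHourMs: 120,
  p95LastHourMs: 150,
});
console.log(soakResult.passed); // false, because latency degraded by 25%
```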
Stress Testing Requirements
Stress tests ramp traffic from baseline to 10× baseline over 30 minutes to:
- Identify breaking point: The RPS at which error rate exceeds 1% or P95 latency exceeds 2 seconds.
- Validate graceful degradation: Services should return 429 (rate limited) or 503 (circuit open), not crash or corrupt data.
- Document capacity ceiling: Results feed into quarterly capacity planning (STD-INFRA-069).
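As an illustration of the ramp and the breaking-point rule, the sketch below builds a k6 `ramping-arrival-rate` scenario and scans per-interval results for the first breach. It assumes one request per iteration and a hypothetical sample shape (`rps`, `errorRate`, `p95Ms`) sorted by ascending load; the helper names and VU sizing are assumptions, and the sample values are made up for the example.

```javascript
// Ramp from baseline to 10x baseline over 30 minutes.
function buildStressScenario(baselineRps) {
  return {
    executor: 'ramping-arrival-rate',
    startRate: baselineRps,
    timeUnit: '1s',
    preAllocatedVUs: baselineRps, // assumption, tune per service
    maxVUs: baselineRps * 10, // assumption, tune per service
    stages: [{ target: baselineRps * 10, duration: '30m' }],
  };
}

// Return the first RPS at which error rate exceeds 1% or P95 exceeds 2 s,
// or null if the service never breached either limit.
function findBreakingPoint(samples) {
  const breached = samples.find((s) => s.errorRate > 0.01 || s.p95Ms > 2000);
  return breached ? breached.rps : null;
}

// Made-up samples for a service with an 800 RPS baseline.
const stressSamples = [
  { rps: 800, errorRate: 0.0002, p95Ms: 90 },
  { rps: 4000, errorRate: 0.004, p95Ms: 600 },
  { rps: 5600, errorRate: 0.013, p95Ms: 1400 },
  { rps: 8000, errorRate: 0.09, p95Ms: 3100 },
];
console.log(findBreakingPoint(stressSamples)); // 5600 (error rate 1.3% > 1%)
```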
Reporting and Trending
- Per-test report: k6 summary output stored as a CI/CD artefact. Includes all threshold results.
- Trending dashboard: Grafana dashboard (`Load Test Trends`) shows P95 latency, throughput, and error rate over time per service.
- Regression detection: If P95 latency increases by more than 20% between releases, the release is flagged for performance review.
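The regression rule compares each release with the one before it. A minimal sketch, assuming a hypothetical list of releases in chronological order with one P95 figure each (the version strings and latencies below are made up):

```javascript
const MAX_P95_INCREASE = 0.20; // 20% between consecutive releases

// flagRegressions is a hypothetical helper. It returns the versions whose
// P95 latency rose by more than 20% compared with the previous release.
function flagRegressions(releases) {
  const flagged = [];
  for (let i = 1; i < releases.length; i += 1) {
    const previous = releases[i - 1];
    const current = releases[i];
    const increase = (current.p95Ms - previous.p95Ms) / previous.p95Ms;
    if (increase > MAX_P95_INCREASE) {
      flagged.push(current.version);
    }
  }
  return flagged;
}

const history = [
  { version: 'v1.4.0', p95Ms: 110 },
  { version: 'v1.5.0', p95Ms: 118 }, // about +7%, not flagged
  { version: 'v1.6.0', p95Ms: 150 }, // about +27%, flagged
];
console.log(flagRegressions(history)); // [ 'v1.6.0' ]
```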
Scheduling
| Test Type | Trigger | Frequency |
|---|---|---|
| Baseline + Peak | Pre-release CI/CD gate | Every release |
| Surge | Manual or scheduled | Monthly |
| Soak | Scheduled | Weekly (Friday night) |
| Stress | Manual | Quarterly (before capacity review) |
Exceptions
Services that do not process payment transactions may request an exemption from mandatory pre-release load testing. Exemptions must be approved by the Platform Engineering lead and documented in the service's README.md.