STD-DEVEX-089: Canary Deployment Automation
| Owner | Classification | Review Date | Status |
|---|---|---|---|
| CDO Office | Internal | April 2027 | Active |
| Field | Value |
|---|---|
| Standard | STD-DEVEX-089 |
| Title | Canary Deployment Automation |
| Status | Draft |
| Owner | Platform Team |
| Created | 2026-04-03 |
| Review | Quarterly |
Purpose
Define automated canary deployment procedures for Simpaisa's payment gateway. With 270M+ transactions annually and $1B+ in volume, a bad deployment can cause immediate financial impact across PK, BD, NP, IQ and EG. Canary releases limit blast radius by gradually shifting traffic to new versions with automated rollback on degradation.
Scope
All production deployments of Go services behind the KrakenD API gateway. Applies to Pay-Ins, Pay-Outs, Remittances, Cards and supporting platform services. Infrastructure changes (Cloudflare, database migrations) follow separate runbooks.
Current State
- Deployments are blue-green via Kubernetes, but traffic shift is manual (0% or 100%).
- No automated rollback: engineers monitor dashboards and revert manually.
- KrakenD configuration supports weighted routing, but it is not used for canary.
- Deployment-related incidents account for ~30% of SEV1/2 incidents.
Gaps
- No gradual traffic ramp: deployments are all-or-nothing.
- No automated rollback triggers: human reaction time is the bottleneck.
- No per-canary observability: metrics are aggregated across versions.
- Bake time is not enforced: deployments are promoted immediately if the release "looks fine".
- KrakenD weighted routing is not leveraged.
Target State
- Every production deployment follows an automated canary ramp.
- Automated rollback triggers fire within 60 seconds of threshold breach.
- Per-canary Grafana dashboards provide real-time comparison against baseline.
- KrakenD weighted routing manages traffic distribution.
Traffic Ramp Schedule
| Stage | Traffic | Bake Time | Rollback Window |
|---|---|---|---|
| 1 | 1% | 15 min | Immediate |
| 2 | 5% | 15 min | Immediate |
| 3 | 25% | 15 min | Immediate |
| 4 | 50% | 15 min | Immediate |
| 5 | 100% | 30 min | Manual revert |
Total ramp time: ~90 minutes minimum for a clean deployment.
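The schedule above can be expressed as data so that the total ramp time is computed rather than maintained by hand. This is a minimal sketch: the `Stage` class and the `total_ramp_minutes` and `validate` helpers are hypothetical names, not part of any existing Simpaisa tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    traffic_percent: int
    bake_minutes: int
    auto_rollback: bool  # False means "manual revert" in the table


# Values copied from the ramp schedule table.
RAMP = [
    Stage(1, 1, 15, True),
    Stage(2, 5, 15, True),
    Stage(3, 25, 15, True),
    Stage(4, 50, 15, True),
    Stage(5, 100, 30, False),
]


def total_ramp_minutes(stages):
    """Minimum duration of a clean deployment: the sum of all bake times."""
    return sum(stage.bake_minutes for stage in stages)


def validate(stages):
    """Raise ValueError unless traffic strictly increases and ends at 100%."""
    percents = [stage.traffic_percent for stage in stages]
    if percents != sorted(set(percents)):
        raise ValueError("traffic percentages must strictly increase")
    if percents[-1] != 100:
        raise ValueError("final stage must carry 100% of traffic")


validate(RAMP)
print(total_ramp_minutes(RAMP))
```

Keeping the schedule in one validated structure means a change to a bake time updates the total automatically.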
Automated Rollback Triggers
Rollback is triggered automatically if any condition is met during stages 1-4:
| Metric | Threshold |
|---|---|
| Error rate (5xx) | 1% above baseline |
| P95 latency | 2x baseline |
| Transaction success rate | Drops >0.5% below baseline |
| Pod crash loops | Any canary pod restarts >2 times |
| Health check failures | Any canary endpoint unresponsive |
Baseline is calculated from the preceding 30-minute window of the stable version.
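The trigger table can be sketched as a single comparison function. The code below makes several assumptions that this standard does not state: the error-rate and success-rate thresholds are read as percentage points, the baseline is a simple average of samples taken from the stable version's 30-minute window, and the `Snapshot`, `baseline_from` and `breached_triggers` names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Snapshot:
    error_rate_pct: float    # 5xx responses as a percentage of requests
    p95_latency_ms: float
    success_rate_pct: float  # successful transactions as a percentage
    max_pod_restarts: int = 0
    healthy: bool = True


def baseline_from(samples):
    """Average stable-version samples from the preceding 30-minute window.

    Averaging P95 values is a simplification; a real pipeline would
    recompute the percentile over the whole window.
    """
    return Snapshot(
        error_rate_pct=mean([s.error_rate_pct for s in samples]),
        p95_latency_ms=mean([s.p95_latency_ms for s in samples]),
        success_rate_pct=mean([s.success_rate_pct for s in samples]),
    )


def breached_triggers(canary, baseline):
    """Return the names of every rollback trigger the canary has breached."""
    reasons = []
    if canary.error_rate_pct > baseline.error_rate_pct + 1.0:
        reasons.append("error_rate")
    if canary.p95_latency_ms > 2 * baseline.p95_latency_ms:
        reasons.append("p95_latency")
    if canary.success_rate_pct < baseline.success_rate_pct - 0.5:
        reasons.append("success_rate")
    if canary.max_pod_restarts > 2:
        reasons.append("crash_loop")
    if not canary.healthy:
        reasons.append("health_check")
    return reasons
```

Returning the list of breached triggers, rather than a single boolean, lets the rollback record state which condition fired.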
KrakenD Configuration
- Weighted backend routing directs traffic percentage to canary vs stable service.
- Canary deployments tagged with version label in Kubernetes.
- KrakenD configuration updated via CI pipeline at each ramp stage.
- OpenTelemetry propagates the `x-canary-version` header for per-version metric segmentation.
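This standard does not specify how the weights are encoded in the gateway configuration. One generic way a CI step could approximate a percentage split, assuming the gateway balances round-robin over a list of backend hosts (an assumption, not a statement about KrakenD's weighted-routing syntax), is to repeat each host in proportion to its share. The `weighted_hosts` function and the host URLs are hypothetical.

```python
from math import gcd


def weighted_hosts(stable, canary, canary_percent):
    """Build the shortest host list whose round-robin split matches the ratio.

    Assumes the consumer picks hosts round-robin from the returned list.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    if canary_percent == 0:
        return [stable]
    if canary_percent == 100:
        return [canary]
    # Reduce the ratio, e.g. 25:75 becomes 1:3, to keep the list short.
    divisor = gcd(canary_percent, 100)
    canary_slots = canary_percent // divisor
    stable_slots = (100 - canary_percent) // divisor
    return [canary] * canary_slots + [stable] * stable_slots


# Hypothetical in-cluster service addresses.
print(weighted_hosts("http://pay-stable:8080", "http://pay-canary:8080", 25))
```

At the 1% stage this produces a 100-entry list, which is one reason a native weight setting, where the gateway offers one, is preferable to this approximation.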
Observability
- Per-canary Grafana dashboard: error rate, latency (P50/P95/P99), throughput and transaction success rate, segmented by version.
- OpenTelemetry spans tagged with deployment version for trace-level comparison.
- PostHog feature flags used for canary targeting when user-level segmentation is needed (e.g., merchant cohort testing).
- Alerting: PagerDuty alerts fire on rollback trigger conditions.
Process
- CI builds the container image and tags it with the commit SHA and version.
- Deploy stage 1: Kubernetes deploys canary pods. KrakenD routes 1% of traffic to them.
- Automated monitor: the observability pipeline compares canary against baseline.
- Ramp or rollback: if thresholds hold, advance to the next stage. If any threshold is breached, roll back.
- Full promotion: at 100%, old version pods are scaled down after the 30-minute bake.
- The Beads issue is updated with the deployment outcome (success, or rollback with metrics).
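The ramp-or-rollback loop in the steps above can be sketched as a small driver with its side effects injected. All names here (`run_canary`, `set_canary_percent`, `is_breached`, `wait_minutes`) are hypothetical. The sketch assumes a one-minute polling interval, chosen to match the 60-second target for rollback triggers, and it disables automatic rollback at the 100% stage because the schedule lists that stage as manual revert.

```python
# (traffic percent, bake minutes, automatic rollback enabled)
STAGES = [
    (1, 15, True),
    (5, 15, True),
    (25, 15, True),
    (50, 15, True),
    (100, 30, False),
]


def run_canary(stages, set_canary_percent, is_breached, wait_minutes):
    """Walk the ramp, polling once per minute during each bake period.

    set_canary_percent(p) applies the traffic split, is_breached() reports
    whether any rollback trigger fired, and wait_minutes(n) blocks for n
    minutes. All three are supplied by the caller.
    """
    for percent, bake_minutes, auto_rollback in stages:
        set_canary_percent(percent)
        for _ in range(bake_minutes):
            wait_minutes(1)
            if auto_rollback and is_breached():
                set_canary_percent(0)  # send all traffic back to stable
                return f"rolled back at {percent}%"
    return "promoted"
```

Injecting the three callables keeps the loop testable without a cluster: a test can pass a list's `append` method as `set_canary_percent` and a no-op as `wait_minutes`.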
Exceptions
- Database migration deployments require a separate coordination process (see DATABASE-SCHEMA-CHANGE-STANDARD.md).
- Emergency hotfixes (P1) may use an accelerated ramp: 5% → 50% → 100% with 5-minute bake times, with CDO approval.
Actions
| # | Action | Owner | Deadline |
|---|---|---|---|
| 1 | Implement KrakenD weighted routing in CI pipeline | Platform Team | 2026-Q2 |
| 2 | Build per-canary Grafana dashboard template | Platform Team | 2026-Q2 |
| 3 | Implement automated rollback controller | Platform Team | 2026-Q2 |
| 4 | Tag OpenTelemetry spans with deployment version | Platform Team | 2026-Q2 |
| 5 | Run first canary deployment on non-critical service | Platform Team | 2026-Q3 |
| 6 | Roll out to payment-critical services | Platform Team | 2026-Q3 |
References
- DEPLOYMENT-STANDARD.md
- DATABASE-SCHEMA-CHANGE-STANDARD.md
- LOGGING-STANDARD.md
- INCIDENT-RESPONSE-PLAYBOOK.md