STD-DEVEX-089: Canary Deployment Automation

| Owner | Classification | Review Date | Status |
| --- | --- | --- | --- |
| CDO Office | Internal | April 2027 | Active |

| Field | Value |
| --- | --- |
| Standard | STD-DEVEX-089 |
| Title | Canary Deployment Automation |
| Status | Draft |
| Owner | Platform Team |
| Created | 2026-04-03 |
| Review | Quarterly |

Purpose

Define automated canary deployment procedures for Simpaisa's payment gateway. With 270M+ transactions annually and $1B+ in volume, a bad deployment can cause immediate financial impact across PK, BD, NP, IQ and EG. Canary releases limit blast radius by gradually shifting traffic to new versions with automated rollback on degradation.

Scope

All production deployments of Go services behind the KrakenD API gateway. Applies to Pay-Ins, Pay-Outs, Remittances, Cards and supporting platform services. Infrastructure changes (Cloudflare, database migrations) follow separate runbooks.

Current State

  • Deployments are blue-green via Kubernetes, but traffic shift is manual (0% or 100%).

  • No automated rollback — engineers monitor dashboards and manually revert.

  • KrakenD configuration supports weighted routing but it is not used for canary.

  • Deployment-related incidents account for ~30% of SEV1/2 incidents.

Gaps

  1. No gradual traffic ramp — all-or-nothing deployments.

  2. No automated rollback triggers — human reaction time is the bottleneck.

  3. No per-canary observability — metrics are aggregated across versions.

  4. Bake time not enforced: deployments are promoted immediately if it "looks fine."

  5. KrakenD weighted routing not leveraged.

Target State

  • Every production deployment follows an automated canary ramp.

  • Automated rollback triggers fire within 60 seconds of threshold breach.

  • Per-canary Grafana dashboards provide real-time comparison against baseline.

  • KrakenD weighted routing manages traffic distribution.

Traffic Ramp Schedule

| Stage | Traffic | Bake Time | Rollback Window |
| --- | --- | --- | --- |
| 1 | 1% | 15 min | Immediate |
| 2 | 5% | 15 min | Immediate |
| 3 | 25% | 15 min | Immediate |
| 4 | 50% | 15 min | Immediate |
| 5 | 100% | 30 min | Manual revert |

Total ramp time: ~90 minutes minimum for a clean deployment.

Automated Rollback Triggers

Rollback is triggered automatically if any condition is met during stages 1-4:

| Metric | Threshold |
| --- | --- |
| Error rate (5xx) | 1% above baseline |
| P95 latency | 2x baseline |
| Transaction success rate | Drops >0.5% below baseline |
| Pod crash loops | Any canary pod restarts >2 times |
| Health check failures | Any canary endpoint unresponsive |

Baseline is calculated from the preceding 30-minute window of the stable version.
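
The trigger table above expressed as a rollback decision function, as an added example that is not the Platform Team's controller code. The `Metrics` type, its field names and the trigger labels are hypothetical; it assumes the 1% and 0.5% thresholds are percentage points and that a threshold is breached only when strictly exceeded.

```go
package main

import "fmt"

// Metrics is one observation window for a single version.
// Rates are percentages (0-100); all names here are assumptions for this example.
type Metrics struct {
	ErrorRate5xx   float64 // % of requests answered with 5xx
	P95LatencyMs   float64 // P95 latency in milliseconds
	TxnSuccessRate float64 // % of transactions that succeeded
	MaxPodRestarts int     // highest restart count of any pod
	UnhealthyEPs   int     // endpoints failing health checks
}

// Breaches returns the rollback triggers the canary violates at a ramp stage.
// Assumption: "1%" and "0.5%" are percentage points relative to baseline,
// and a threshold is breached only when strictly exceeded.
func Breaches(stage int, base, canary Metrics) []string {
	if stage < 1 || stage > 4 {
		return nil // automatic triggers apply to stages 1-4; stage 5 is manual revert
	}
	var out []string
	if canary.ErrorRate5xx > base.ErrorRate5xx+1.0 {
		out = append(out, "error_rate_5xx")
	}
	if canary.P95LatencyMs > 2*base.P95LatencyMs {
		out = append(out, "p95_latency")
	}
	if base.TxnSuccessRate-canary.TxnSuccessRate > 0.5 {
		out = append(out, "txn_success_rate")
	}
	if canary.MaxPodRestarts > 2 {
		out = append(out, "pod_crash_loop")
	}
	if canary.UnhealthyEPs > 0 {
		out = append(out, "health_check")
	}
	return out
}

func main() {
	// Constructed sample values, not real measurements.
	base := Metrics{ErrorRate5xx: 0.2, P95LatencyMs: 180, TxnSuccessRate: 99.4}
	canary := Metrics{ErrorRate5xx: 0.3, P95LatencyMs: 410, TxnSuccessRate: 98.7, MaxPodRestarts: 1}
	fmt.Println("stage 2 breaches:", Breaches(2, base, canary))
	fmt.Println("stage 5 breaches:", Breaches(5, base, canary))
}
```

Returning every breached trigger rather than a single boolean lets the rollback record name which metrics caused the revert.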

KrakenD Configuration

  • Weighted backend routing directs traffic percentage to canary vs stable service.

  • Canary deployments tagged with version label in Kubernetes.

  • KrakenD configuration updated via CI pipeline at each ramp stage.

  • OpenTelemetry propagates x-canary-version header for per-version metric segmentation.

Observability

  • Per-canary Grafana dashboard: error rate, latency (P50/P95/P99), throughput, transaction success rate, all segmented by version.

  • OpenTelemetry spans tagged with deployment version for trace-level comparison.

  • PostHog feature flags used for canary targeting when user-level segmentation is needed (e.g., merchant cohort testing).

  • Alerting: PagerDuty alerts fire on rollback trigger conditions.

Process

  1. CI builds container image, tags with commit SHA and version.

  2. Deploy stage 1: Kubernetes deploys canary pods. KrakenD routes 1% traffic.

  3. Automated monitor: Observability pipeline compares canary vs baseline.

  4. Ramp or rollback: If thresholds hold, advance to next stage. If breached, rollback.

  5. Full promotion: At 100%, old version pods are scaled down after 30-minute bake.

  6. Beads issue updated with deployment outcome (success/rollback with metrics).

Exceptions

  • Database migration deployments require a separate coordination process (see DATABASE-SCHEMA-CHANGE-STANDARD.md).

  • Emergency hotfixes (P1) may use accelerated ramp: 5% → 50% → 100% with 5-minute bake times, with CDO approval.

Actions

| # | Action | Owner | Deadline |
| --- | --- | --- | --- |
| 1 | Implement KrakenD weighted routing in CI pipeline | Platform Team | 2026-Q2 |
| 2 | Build per-canary Grafana dashboard template | Platform Team | 2026-Q2 |
| 3 | Implement automated rollback controller | Platform Team | 2026-Q2 |
| 4 | Tag OpenTelemetry spans with deployment version | Platform Team | 2026-Q2 |
| 5 | Run first canary deployment on non-critical service | Platform Team | 2026-Q3 |
| 6 | Roll out to payment-critical services | Platform Team | 2026-Q3 |

References

  • DEPLOYMENT-STANDARD.md

  • DATABASE-SCHEMA-CHANGE-STANDARD.md

  • LOGGING-STANDARD.md

  • INCIDENT-RESPONSE-PLAYBOOK.md