JD: Reliability Engineer

Owner	Classification	Review Date	Status
People Operations	Internal	April 2027	Active

Job Description: Reliability Engineer

Department: Technology & Digital — Production Engineering
Reports to: Head of Production Engineering

Role Overview

Simpaisa's payments infrastructure runs 24/7 across 7 markets. When it goes down, money stops moving. The Reliability Engineer's job is to make sure that doesn't happen — and when it does, to fix it fast and make sure it doesn't happen the same way twice.

This is a Production Engineering role within a team that owns deployment, observability, incident response, SLA tracking, and the operational health of every payment corridor in production. You will work closely with Solution Engineers (who build the product) and the Service Delivery team (who manage partner-facing issues), but your primary focus is system reliability, not feature delivery.

Key Responsibilities

Monitor the health of Simpaisa's payment infrastructure — transaction success rates, API latency, operator connectivity, settlement pipeline — across all live corridors.
Respond to production incidents: diagnose, contain, and resolve issues within SLA windows. Participate in the on-call rotation.
Write and maintain runbooks for all production corridors and components — if an incident happens at 3am, the runbook is what the on-call engineer reaches for.
Lead post-incident reviews (PIRs) within 48 hours of any P1/P2 incident. Identify root cause. Write the blameless post-mortem. Track action items to closure.
Own deployment pipelines — manage canary rollouts, validate SLO compliance during traffic ramp-up, and execute rollbacks when required (target: automated rollback within 5 minutes).
Build and maintain observability infrastructure: dashboards (Grafana or similar), alerting rules (PagerDuty), and log aggregation.
Track SLA performance per corridor and operator; surface degradation trends before they become outages.
Work with Solution Engineers during Phase 7–8 of the SDLC to ensure new deployments have appropriate monitoring, alerting, and rollback capability before traffic hits.

Required Skills and Experience

Linux and systems: Comfortable on the command line. Ability to diagnose production issues using logs, metrics, and traces without a GUI.
Observability: Experience with Grafana, Prometheus, ELK/OpenSearch, Datadog, or similar. Ability to build dashboards that actually help on-call engineers.
CI/CD and deployment: Experience with deployment pipelines, canary releases, blue/green deployments, and rollback procedures.
Cloud infrastructure: AWS or Azure — understanding of networking, compute, managed databases, and load balancing in a cloud environment.
Incident management: Experience with on-call rotations, PagerDuty (or similar), and structured incident response (SEV levels, war rooms, PIRs).
Scripting: Python, Bash, or similar for automation — runbooks, monitoring scripts, deployment tooling.
Payments domain (preferred): Experience supporting payment systems or high-availability financial infrastructure is a strong advantage.
ITIL / SRE principles: Familiarity with SLOs, SLIs, error budgets, and toil reduction.

General Requirements

Bachelor's degree in Computer Science, Engineering, or a related field.
3+ years of experience in site reliability engineering, DevOps, or production operations.
Demonstrated experience maintaining high-availability systems in production.

What We Offer

Competitive salary benchmarked to your local market.
On-call compensation in addition to base salary.
Work on systems where reliability directly impacts financial outcomes for millions of people.
Clear career path: Reliability Engineer → Senior Reliability Engineer → Lead Reliability Engineer → Head of Production Engineering.
Flexible hybrid working (on-call schedule respected).