Senior Production Engineer

Job Title: Senior Production Engineer

Department: Production Engineering

Reports to: Chief Digital Officer

Role Overview:

We are seeking an experienced Senior Production Engineer with a site reliability engineering (SRE) background to join Simpaisa Holdings, a cross-border payments and remittances company operating across the Middle East and South Asia. The successful candidate will be responsible for the reliability, performance, and observability of production payment processing systems that require 99.99% uptime. The role covers defining and maintaining Service Level Objectives (SLOs), managing error budgets, building observability platforms, conducting chaos engineering exercises, and leading on-call operations for mission-critical payment infrastructure. Strong expertise in SRE principles, cloud infrastructure (AWS), and production operations for financial systems is essential. Experience with agile methodologies and with collaborating across development and security teams is preferred.

Key Responsibilities:

Define, implement, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all payment processing systems, including Pay-In/Pay-Out channels, FX engines, settlement, and reconciliation services.
Manage error budgets across services, using burn-rate alerts and budget policies to balance reliability with feature velocity.
Design and operate the organisation's observability platform, encompassing metrics, logging, tracing, and alerting across all payment processing infrastructure.
Conduct chaos engineering experiments to proactively identify failure modes in payment processing systems, ensuring graceful degradation and rapid recovery.
Lead and participate in on-call rotations for critical payment processing incidents, driving rapid resolution and conducting blameless post-incident reviews.
Design and implement infrastructure automation, including infrastructure-as-code (Terraform, CloudFormation), CI/CD pipelines (Bitbucket Pipelines), and automated deployment strategies (canary, blue-green).
Monitor and optimise system performance, capacity, and cost across all production environments, ensuring payment processing latency and throughput targets are met.
Implement and maintain security and compliance controls in production environments, aligned with PCI-DSS, ISO 27001, and regulatory requirements (DFSA, SBP, SAMA).
Collaborate with development, security, and data teams to resolve complex technical challenges and improve service delivery.
Develop and maintain runbooks, incident response procedures, and disaster recovery plans for payment processing systems.
Stay up to date with SRE practices, cloud technologies, and reliability engineering best practices for financial systems.

Required Skills and Experience:

Agile: Understanding of agile methodologies and how production engineering supports agile development teams through CI/CD, feature flags, and progressive delivery.
Communication: Excellent written and verbal communication skills with the ability to articulate technical issues, incident reports, and reliability metrics clearly to both technical and non-technical audiences.
Strategy and Planning: Ability to develop and execute reliability strategies, capacity plans, and disaster recovery plans for payment processing infrastructure. Strong organisational skills for managing priorities across multiple production systems.
Leadership & Influence Skills: Ability to lead incident response, drive post-incident reviews, and influence engineering teams to adopt reliability best practices. Experience mentoring junior engineers.
Problem-Solving and Analytical Skills: Exceptional problem-solving and troubleshooting skills to diagnose complex production issues under pressure, particularly in distributed payment processing systems.
SRE Expertise: Deep understanding of site reliability engineering principles, including SLOs/SLIs, error budgets, toil reduction, and capacity planning. Experience with cloud platforms (AWS), container orchestration (Kubernetes, ECS), observability tools (Datadog, Prometheus, Grafana, ELK), infrastructure-as-code (Terraform), and CI/CD pipelines. Strong scripting skills (Python, Bash, Go). Familiarity with PCI-DSS production environment requirements.
Payments Domain Awareness: Understanding of payment processing system architecture, transaction flow reliability requirements, and the operational impact of downtime on financial transactions and regulatory obligations.
Stakeholder Management: Ability to build and maintain strong relationships with development teams, security teams, and business stakeholders, ensuring alignment on reliability targets and incident management processes.

General Requirements for the Role:

Bachelor's degree in a related field: A bachelor's degree in Information Systems, Computer Science, Engineering, or a closely related STEM field is required.
6+ years of experience in SRE, DevOps, or production engineering: Minimum of 6 years of progressive experience operating and scaling production systems, preferably within financial services or payments.
Experience with observability and incident management: Demonstrated experience in building observability platforms, defining SLOs, and leading incident response for high-availability systems.
Proven track record of maintaining reliable production systems: A verifiable history of contributing to 99.9%+ uptime for mission-critical systems in regulated environments.

Benefits and Perks:

Competitive salary and comprehensive benefits package.
Opportunity to work with cutting-edge payments and fintech infrastructure and collaborate with skilled professionals across multiple markets.
Professional development and training opportunities, including cloud and SRE certification sponsorship.
Inclusive company culture that values diversity and innovation.