Core Responsibilities

Team Leadership & Operational Management

- Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.

- Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.

- Oversee incident management processes—ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.

- Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and hold the team accountable for meeting them.

- Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.

- Ensure the team’s processes, documentation, and runbooks are kept current and audited.

Technical Oversight

- Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.

- Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.

- Review and approve reliability design work—monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.

- Participate in high-severity incidents as the escalation point and technical lead when needed.

- Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.

Cross-Functional Leadership

- Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.

- Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.

- Align teams around shared reliability objectives, ensuring that corrective actions, automation priorities, and capacity planning are carried through to completion.

- Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.

Required Qualifications

- 6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.

- Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.

- Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.

- Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).

- Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.

- Excellent operational judgment: triage, prioritization, team load balancing, and process design.

- Professional-level cloud provider certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).

Nice-to-Have

- Prior experience running a distributed or follow-the-sun SRE practice.

- Exposure to chaos engineering, fault injection, or reliability stress testing.

- Familiarity with cloud cost governance and rightsizing strategies.

- Experience improving or scaling on-call systems.

Practice Technical Manager