Posted 2d ago

Practice Technical Manager

@ Datavail
Canada
Remote, Full Time
Responsibilities: Lead team, Define roadmaps, Improve reliability
Requirements Summary: 6–10 years in SRE/Operations/Platform roles with at least 2 years leading engineers; cloud platforms (AWS/Azure/GCP) and Kubernetes; defining and operating SLIs/SLOs; IaC with Terraform; CI/CD; observability tools; strong coaching and judgment.
Technical Tools Mentioned: Terraform, CI/CD, Kubernetes, Prometheus, Grafana, OpenTelemetry, ELK, AWS, Azure, GCP
Job Description

Core Responsibilities

Team Leadership & Operational Management

- Run the daily operations of the SRE practice: team planning, shift assignments, escalation routing, and workload balancing.
   
- Maintain a healthy on-call program: define rotation rules, track fatigue, ensure coverage, and continuously improve response maturity.
   
- Oversee incident management processes—ensuring consistent triage, high-quality postmortems, and follow-through on remediation work.
   
- Establish operational KPIs for the team (MTTA, MTTR, on-call load, ticket aging, toil reduction) and drive accountability.
   
- Coach and develop SREs at all levels through 1:1s, technical guidance, and structured growth plans.
   
- Ensure the team’s processes, documentation, and runbooks stay current and audited.
   

Technical Oversight

- Provide architecture-level guidance on resilience, observability, and reliability patterns; step in directly when the team is blocked or customer-impacting work demands senior technical judgment.
   
- Validate SLIs/SLOs and error budgets across services; ensure consistent implementation and reporting.
   
- Review and approve reliability design work—monitoring strategies, automation initiatives, CI/CD changes, deployment safety controls, and cloud cost/performance optimizations.
   
- Participate in high-severity incidents as an escalation point and technical lead when needed.
   
- Ensure engineering quality for IaC, CI/CD, observability instrumentation, and Kubernetes platform operations.
   

Cross-Functional Leadership

- Act as the primary point of contact for internal stakeholders (Dev, Product, Architecture, Cloud) regarding reliability strategy and prioritization.
   
- Translate business priorities into reliability roadmaps, staffing plans, and operational improvements.
   
- Align teams around shared reliability objectives—ensuring corrective actions, automation priorities, and capacity planning are actually executed.
   
- Support customer-facing conversations when reliability posture, operational processes, or technical improvements require leadership representation.
   

Required Qualifications

- 6–10 years in SRE/Operations/Platform roles, with at least 2 years leading or managing engineers.
   
- Hands-on technical background across cloud platforms (AWS/Azure/GCP) and Kubernetes.
   
- Experience defining and operating SLIs/SLOs, incident response, and postmortem programs.
   
- Strong grounding in Terraform or similar IaC, CI/CD systems, and observability technologies (Prometheus, Grafana, OpenTelemetry, ELK).
   
- Ability to assess technical work, coach engineers through complex problems, and make informed trade-offs under pressure.
   
- Excellent operational judgment: triage, prioritization, team load balancing, and process design.
 
- Cloud provider certification: professional-level certification in AWS (Solutions Architect), Azure (Solutions Architect Expert), GCP (Professional Cloud Architect), or Oracle Cloud (Architect Professional).

 

Nice-to-Have

- Prior experience running a distributed or follow-the-sun SRE practice.
   
- Exposure to chaos engineering, fault injection, or reliability stress testing.
   
- Familiarity with cloud cost governance and rightsizing strategies.
   
- Experience improving or scaling on-call systems.