In this role, you will:
Architect and optimize scalable systems: You will design, implement, and continuously improve highly reliable infrastructure, directly impacting the success and safety of Zoox's autonomous vehicle platform.
Build proactive monitoring solutions: You will develop advanced monitoring, alerting, and reporting tools to ensure potential issues are identified and resolved before they affect production.
Collaborate across engineering: You will partner closely with software engineering teams to elevate our system architecture, streamline deployment processes, and drive automation initiatives.
Lead incident resolution: You will conduct thorough root cause analyses on production issues and rapidly deploy corrective actions to maintain a resilient and stable environment.
Ensure business continuity: You will safeguard the company's operations by designing and implementing robust disaster recovery plans that keep the Zoox fleet running through outages and other disruptions.
Qualifications
SRE & Distributed Systems Experience: 5+ years of experience in site reliability engineering or a similar role, with a strong background in managing large-scale distributed systems.
Cloud & Infrastructure as Code (IaC): Proven experience operating on a major cloud platform (AWS, GCP, or Azure) and using IaC tools such as Terraform, Ansible, Salt, or CloudFormation.
Container Orchestration: Technical expertise in deploying, managing, and scaling systems using container orchestration technologies such as Kubernetes.
Core Infrastructure Knowledge: Deep, foundational understanding of networking protocols, storage solutions, and database technologies.
Programming Proficiency: Strong, demonstrable programming and scripting skills in languages such as Python, Go, C/C++, or Java.
Bonus Qualifications
Experience in the automotive or autonomous vehicle industry.
Knowledge of security best practices and compliance requirements.