Posted 2mo ago

Site Reliability Engineer — Info Apps

@ Apple
Austin or San Francisco or San Diego
Hybrid, Full Time
Responsibilities: Shape observability, lead incidents, automate toil
Requirements Summary: 5+ years in SRE/DevOps or infrastructure; Kubernetes (EKS/GKE); observability (Prometheus, Grafana, Splunk); cloud (AWS/GCP/Azure); scripting (Python/Go); IaC (Terraform/CloudFormation/Pulumi); incident leadership; CI/CD and RCA.
Technical Tools Mentioned: Kubernetes, Prometheus, Grafana, Splunk, Terraform, CloudFormation, Pulumi, Python, Go, EKS, GKE, AWS, GCP, Azure, Elasticsearch, Solr, Kafka, Postgres, Istio, Linkerd, TCP/IP, ELB, ALB, RCA, Blameless post-mortems
Job Description

Do you love building and scaling infrastructure that delights millions of customers? At Apple, we believe reliability is a feature. We are looking for a Site Reliability Engineer to join our team and oversee the performance and availability of the core backend services behind our News, Stocks, Weather, Books, and Creator Studio applications.

Description

As an SRE, you won’t just be responding to alerts; you will shape the evolution of our observability strategy, mentor others in incident management, and champion automation. You will help us refine our "Golden Signals" and ensure our Kubernetes-based ecosystem remains world-class.

Minimum Qualifications

  • Experience: 5+ years in SRE, DevOps, or Infrastructure roles with a proven track record of managing high-traffic, internet-facing production environments.
  • Kubernetes Expertise: Deep experience building and operating container orchestration systems (EKS/GKE/Vanilla K8s). You should be comfortable troubleshooting from the networking layer up to the application pod.
  • Observability Champion: Expert knowledge of the four Golden Signals (Latency, Traffic, Errors, and Saturation). Proficiency with tools like Prometheus, Grafana, and Splunk is essential.
  • Cloud Proficiency: Hands-on experience designing and maintaining resilient infrastructure on public cloud providers (AWS, GCP, or Azure).
  • Scripting & Automation: Strong ability to code at a scripting level (Python or Go preferred) to automate toil and build self-healing systems.
  • Incident Leadership: Experience leading incident response, performing Root Cause Analysis (RCA), and implementing blameless post-mortems to improve system resilience.
  • Infrastructure as Code: Proficient in Terraform, CloudFormation, or Pulumi to manage immutable infrastructure.
  • Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).

Preferred Qualifications

  • Search & Data: Specialized experience operating and tuning Solr or Elasticsearch at scale.
  • Networking: Strong understanding of TCP/IP, Load Balancing (ELB/ALB), and Service Mesh (Istio/Linkerd).
  • Data Systems: Experience with Kafka, Cassandra, or Postgres in a distributed environment.