- Lead and manage a team of Site Reliability Engineers (SREs) to ensure the reliability, availability, and performance of critical systems and applications.
- Develop and implement SRE best practices using Python, enhancing the automation of operational tasks and processes.
- Oversee the design, deployment, and maintenance of infrastructure and services, ensuring they meet business requirements and performance standards.
- Collaborate with cross-functional teams to define service level objectives (SLOs) and service level indicators (SLIs) that align with organizational goals.
- Mentor and guide team members in technical skills and career development, fostering a culture of continuous improvement and learning.
- Conduct regular reviews of system performance metrics and s, addressing any issues proactively to minimize downtime.
- Drive incident management processes, ensuring effective response and resolution strategies are in place.
- Stay up-to-date with industry trends and emerging technologies to enhance the SRE practice within the organization.