Seeking a highly skilled Site Reliability Engineer to work closely with engineering teams to ensure applications are highly available and meet both performance standards and the reliability expectations of business stakeholders. As a Site Reliability Engineer, you will identify and deliver automation solutions that ensure high availability and resiliency, drawing on your expertise in software development, complexity analysis, and scalable system design.
Duties and Responsibilities
- Monitor system performance, identify areas for improvement, and implement solutions to enhance reliability and availability.
- Guide architecture and development teams on how to make applications highly available, reliable, and performant at a global scale.
- Collaborate with product owners to implement and monitor key metrics to meet SLOs and SLAs.
- Collaborate with development team members to troubleshoot and resolve problems.
- Drive root cause analysis of production issues and other failures within the supported application software stack.
- Design, build, and champion automated solutions and tasks to optimize application/service/platform uptime with minimal human intervention.
- Develop tools and processes to monitor cloud resources and applications.
- Use Kubernetes to deploy platform services.
- Create and implement standards and best practices, driving adoption across development teams and external vendors as applicable.
Requirements and Qualifications
Expertise and/or relevant experience in the following areas is mandatory:
- Bachelor's degree or higher in Computer Science or a related technical discipline.
- 4+ years of experience in the deployment, administration, and troubleshooting of large-scale distributed systems.
- 4+ years of experience working with Linux terminal tools and writing shell scripts within a Linux environment.
- Strong understanding of SLAs, SLOs, and SLIs.
- Strong understanding of public cloud service concepts.
- Strong understanding of Unix/Linux operating systems internals and administration (Debian is preferred but not required).
- Strong understanding of networking (e.g. TCP/IP, routing, network topologies, and hardware).
- Strong experience in debugging and optimizing code and automating routine tasks.
- Strong skills in problem-solving and communication.
- SRE experience including monitoring, alert creation, and tuning.
- Willing to work during, and provide support across, West Coast hours (9 AM – 6 PM PST).
- Willing to join an on-call rotation and participate in troubleshooting and communication efforts outside of normal business hours.
Expertise and/or relevant experience in the following areas is preferred:
- Experience with the following or equivalent technologies: Kubernetes, Docker, OpenStack, Relational Databases, NoSQL Databases.
- Strong communication skills and presentation skills.
- Demonstrated determination and willingness to take action and achieve results.
- Excellent command of the English language (written and spoken).
- Excellent organizational skills in planning and prioritizing own workload and initiatives.
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
IT Services and IT Consulting