Seeking an experienced Site Reliability Engineer (SRE) responsible for ensuring the availability, scalability, and performance of AWS-hosted systems while leveraging the HashiCorp tool stack for automation and orchestration. This role will require deep technical expertise, a proactive approach to reliability, and the ability to collaborate with cross-functional engineering teams.
Responsibilities
- Design, deploy, and maintain AWS infrastructure supporting production and development environments.
- Manage and optimize EC2, EKS, RDS, IAM roles/policies, and cross-account access.
- Build and manage infrastructure using Terraform and HashiCorp tools (Nomad, Vault, Consul).
- Monitor system performance, automate operational tasks, and ensure high system availability.
- Troubleshoot incidents, perform root cause analysis, and implement preventive measures.
- Collaborate with development teams to support applications built in Clojure and Python.
- Maintain infrastructure documentation and follow security/compliance best practices.
Must-Have Skills
- AWS Cloud Services – EC2, EKS, RDS, IAM roles & policies, cross-account permissions.
- Infrastructure as Code – Strong proficiency in Terraform.
- HashiCorp Stack – Nomad, Vault, Consul.