Kaztronix
Job Location:
Sunnyvale, CA, USA
Posted on:
2025-08-01T17:05:39Z
Job Description:
A global government contracting company is seeking a Site Reliability Engineer to join its team in Sunnyvale, CA! As a Site Reliability Engineer, you will:
- Design, implement, and maintain highly available and scalable systems and infrastructure to support classified applications and services.
- Develop and implement reliability-focused engineering practices, such as continuous integration, continuous deployment, and continuous monitoring, while ensuring compliance with classified system requirements.
- Collaborate with development teams to ensure that reliability and scalability are considered throughout the software development lifecycle, while maintaining the security and integrity of the classified system.
- Identify and mitigate potential sources of downtime and performance degradation, including infrastructure, application, and network issues, while ensuring that all troubleshooting and debugging activities are conducted in accordance with classified system procedures.
- Develop and maintain technical documentation, including system diagrams, architecture documents, and runbooks, while ensuring that all documentation is properly marked and handled in accordance with classified system requirements.
- Lead and participate in incident response and post-incident reviews to identify root causes and implement corrective actions, while ensuring that all incident response activities are conducted in accordance with classified system procedures.
- Collaborate with other teams, including development, operations, and security, to ensure that reliability and scalability are considered in all aspects of system design and operation, while maintaining the security and integrity of the classified system.
- Develop and maintain metrics and monitoring systems to measure system reliability and performance, while ensuring that all monitoring activities are conducted in accordance with classified system requirements.
- Stay up to date with industry trends and emerging technologies, and apply this knowledge to continuously improve system reliability and scalability, while maintaining the security and integrity of the classified system.

Basic Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- Minimum of 8 years of experience in site reliability engineering, DevOps, or a related field, with a focus on classified systems
- Must possess, or be able to obtain within 6 months of the start date, a valid IAT Level II or III DoD-approved 8140 (DoD 8570) certification, such as Security+, in good standing
- Ability to obtain and maintain a Top Secret security clearance; US citizenship required
- Experience with production use of vSphere/ESXi/vCenter and RHEL
- Advanced proficiency in Python, Bash, Ansible, Puppet, and Chef for system administration
- Demonstrable proficiency with MRTG/PRTG, Nagios, SolarWinds, or similar tools
- Proven ability with cloud and container technologies: Kubernetes, Docker/Mirantis, AWS, and/or Azure
- Strong technical background in systems administration, networking, and software development, with a focus on classified systems
- Excellent problem-solving skills, with the ability to analyze complex systems and identify root causes of issues, while maintaining the security and integrity of the classified system
- Networking fundamentals, including TCP/IP, DNS, and routing protocols

Desired Skills
- System integration experience with large-scale distributed infrastructure systems
- Master's degree in Computer Engineering or a related field
- Data center operations/system administrator experience, preferably in a DoD environment (RMF, STIG, or NISPOM)
- Certification in site reliability engineering, DevOps, or a related field, with a focus on classified systems
- Experience with machine learning and artificial intelligence technologies, with a focus on classified systems
- Strong knowledge of security principles and practices, including secure coding, secure deployment, and secure operations, with a focus on classified systems
- Strong understanding of networking fundamentals, including TCP/IP, DNS, and routing protocols, with a focus on classified systems
- Ability to support 24x7 on-call and off-shift work for mission-critical events/operations that may require extended hours or weekend support
- Comfortable working in a fast-paced, dynamic, multidisciplinary environment

Location: Sunnyvale, CA
Work Schedule: 9/80, onsite, with on-call rotations