Site Reliability Engineer - X.AI Corporation : Job Details

Site Reliability Engineer

X.AI Corporation

Job Location : Palo Alto, CA, USA

Posted on : 2025-08-05T17:25:52Z

Job Description :
As a Data Center Site Reliability Engineer (SRE) at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure, including the Colossus supercluster in Memphis, the world's largest AI training cluster with over 100,000 liquid-cooled Nvidia GPUs and plans for expansion to 1 million. This infrastructure powers advanced AI workloads, massive-scale model training, and products like Grok, enabling breakthroughs in understanding the universe. You will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for large-scale distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Key Responsibilities
Maintain and improve the reliability and uptime of xAI's on-premises and cloud-based data ...