Job Description
For two decades, we have pioneered visual computing, the art and science of computer graphics. Our invention of the GPU has expanded the field to AI-powered video games, social networking, web search, IC and product design, medical diagnosis, and scientific research. Today, visual computing is central to deep learning-based AI, including ChatGPT, and is transforming entertainment and interaction. Join us to carry visual computing and AI into their next chapter.
We are seeking a Product Development Engineer to serve as a Subject Matter Expert (SME) and drive key aspects of Reliability, Availability, and Serviceability (RAS) / Resilience features at the chip, module, and server levels for our next-generation AI products. The ideal candidate will have deep expertise in RAS / Resilience testing, characterization, analysis, benchmarking, and risk assessment for large AI training or HPC cluster systems built on InfiniBand or enhanced Ethernet.
Responsibilities
- Serve as the SME for manufacturing test requirements, methodologies, plans, and flows for AI system RAS / Resilience features, ensuring comprehensive test coverage and successful production ramp-up.
- Own the development and validation of AI system RAS / Resilience models, benchmarking, and risk assessments.
- Lead troubleshooting and root cause analysis of RAS / Resilience failures in manufacturing and field deployments.
- Drive end-to-end RAS efforts from chip to system to reduce Failures in Time (FIT) rates.
- Analyze RAS / Resilience logs to refine testing methodologies and manufacturing processes; influence software tools and infrastructure for product development, validation, and production.
- Collaborate with architecture, hardware, software, and product engineering teams throughout the product lifecycle.
- Assess new hardware features and architect manufacturing RAS tests, flows, and methodologies.
- Develop a deep understanding of NVIDIA's AI hardware and software architecture.
Qualifications
- Bachelor's degree or higher in EE, CE, CS, Mathematics, or related field.
- 12+ years of hands-on experience designing, testing, benchmarking, and assessing the risk of system RAS / Resilience features in large compute, AI, or HPC systems.
- Proficiency in RAS / Resilience modeling theory and methodology.
- Knowledge of HPC or AI system architecture and cluster interconnect technologies.
- Experience with test equipment, Linux commands, and benchmarking utilities for testing and troubleshooting compute systems.
- Strong problem-solving skills and experience with root-cause analysis.
- Self-motivated, with strong interpersonal skills and adaptability to new technologies.
- Knowledge of or experience with HPC or MLPerf benchmarking is a plus.
NVIDIA is a leading employer in technology, known for forward-thinking and hardworking teams. We value creativity and autonomy. If you fit this profile, we want to hear from you!
Compensation and Benefits
The base salary range is $188,000 - $356,500, determined by location, experience, and comparable roles. Employees are also eligible for equity and other benefits.
NVIDIA is an equal opportunity employer committed to diversity and inclusion, and we do not discriminate based on race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or any other protected characteristic.