Member of Technical Staff - ML Infrastructure Engineer Freiburg (Germany), San Francisco (USA),[...]

Global Trade Plaza

Job Location : all cities, MS, USA

Posted on : 2025-08-06T01:11:59Z

Job Description :

Black Forest Labs is a startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is looking for a strong candidate to help develop and maintain our ML infrastructure, including large GPU training and inference clusters.

Role:

  • Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
  • Implement and manage network-based cloud file systems and blob/S3 storage solutions
  • Develop and maintain Infrastructure as Code (IaC) for resource provisioning
  • Implement and optimize CI/CD pipelines for ML workflows
  • Design and implement custom autoscaling solutions for ML workloads
  • Ensure security best practices across the ML infrastructure
  • Provide developer-friendly tools and practices for efficient ML operations

Ideal Experience:

  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with a focus on ML/AI services
  • Extensive experience with Kubernetes and Slurm cluster management
  • Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
  • Proven track record in managing and optimizing network-based cloud file systems and object storage
  • Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
  • Strong understanding of security principles and best practices in cloud environments
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki)
  • Familiarity with ML workflows and GPU infrastructure management
  • Demonstrated ability to handle complex migrations and breaking changes in production environments

Nice to have:

  • Experience with custom autoscaling solutions for ML workloads
  • Knowledge of cost optimization strategies for cloud-based ML infrastructure
  • Familiarity with MLOps practices and tools
  • Experience with high-performance computing (HPC) environments
  • Understanding of data versioning and experiment tracking for ML
  • Knowledge of network optimization for distributed ML training
  • Experience with multi-cloud or hybrid cloud architectures
  • Familiarity with container security and vulnerability scanning tools