Member of Technical Staff - ML Infrastructure Engineer Freiburg (Germany), San Francisco (USA),[...]

Global Trade Plaza

Job Location : all cities, MS, USA

Posted on : 2025-08-06T01:11:59Z

Job Description :

Black Forest Labs is a startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is looking for a strong candidate to help develop and maintain our ML infrastructure, including large GPU training and inference clusters.

Role:

Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
Implement and manage network-based cloud file systems and blob/S3 storage solutions
Develop and maintain Infrastructure as Code (IaC) for resource provisioning
Implement and optimize CI/CD pipelines for ML workflows
Design and implement custom autoscaling solutions for ML workloads
Ensure security best practices across the ML infrastructure
Provide developer-friendly tools and practices for efficient ML operations

Ideal Experience:

Strong proficiency in cloud platforms (AWS, Azure, or GCP) with a focus on ML/AI services
Extensive experience with Kubernetes and Slurm cluster management
Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
Proven track record in managing and optimizing network-based cloud file systems and object storage
Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
Strong understanding of security principles and best practices in cloud environments
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki)
Familiarity with ML workflows and GPU infrastructure management
Demonstrated ability to handle complex migrations and breaking changes in production environments

Nice to have:

Experience with custom autoscaling solutions for ML workloads
Knowledge of cost optimization strategies for cloud-based ML infrastructure
Familiarity with MLOps practices and tools
Experience with high-performance computing (HPC) environments
Understanding of data versioning and experiment tracking for ML
Knowledge of network optimization for distributed ML training
Experience with multi-cloud or hybrid cloud architectures
Familiarity with container security and vulnerability scanning tools

