Principal Kafka Site Reliability Engineer DevOps
: Job Details :

Palo Alto Networks

Job Location : Santa Clara, CA, USA

Posted on : 2025-09-12T13:29:26Z

Job Description :

We are reshaping the cybersecurity market through our cloud-delivered security services, and our cloud infrastructure is growing rapidly, with a global footprint. We're looking for great SREs, as well as software engineers interested in production engineering, to help us scale the largest enterprise security cloud infrastructure in the world.

Description

Palo Alto Networks reinvented the enterprise firewall, growing from a start-up to a multi-billion-dollar company. Our Application Framework, the latest offering in our cloud-delivered security services, ingests security events from hundreds of thousands of firewalls deployed across the globe to provide a massive data analytics platform for deep inspection, anomaly detection, and actionable security automation. Our cloud infrastructure hosts a series of massive and complex distributed systems and virtualization software platforms that enable big data processing for security services, sandboxing and malware detection, URL categorization, and malicious site/domain identification, as well as security research and response.

RESPONSIBILITIES:

You will be responsible for maintaining and scaling production Kafka clusters with very high ingestion rates, ZooKeeper clusters, and other big data pipeline systems such as HDFS.

You will work on improving scalability, service reliability, capacity, and performance.

You will develop automation code for managing, monitoring, measuring, expanding, and healing clusters.

You are an experienced software engineer focused on operations, not just an operator.

You will perform Kafka tuning, capacity planning, and deep-dive troubleshooting.

You will participate in occasional on-call rotations supporting the infrastructure.

You will troubleshoot incidents, formulate hypotheses, test them, and identify root causes.

QUALIFICATIONS:

Hands-on experience managing production Kafka clusters.

Strong development and automation skills, especially with Python; familiarity with Kafka source code is a plus.

Deep understanding of Kafka internals, ZooKeeper, partitioning, topic replication, and mirroring.

Excellent monitoring, metrics collection, performance tuning, and troubleshooting skills for distributed systems.

Tools-first mindset: building tools to increase efficiency and simplify tasks.

Organized, focused on delivery, good communicator, team player, and proactive in ownership.
