Cloud Reliability Engineer

Compunnel

Job Location : Charleston, SC, USA

Posted on : 2025-07-15T01:35:01Z

Job Description :

Reporting to the Head of Cloud/API Engineering, the Cloud Reliability Engineer will play a key role in driving innovation and growth for the Banking Solutions business. The position contributes to the company's digital transformation journey by advancing customer-centric innovation and positioning the organization as a leader in the competitive digital banking landscape. The Cloud Reliability Engineer will be responsible for ensuring the reliability, availability, and performance of applications and services as the company transitions from private to public cloud, while implementing automation and improving system resilience.

Key Responsibilities:

  • Reliability Engineering Strategy: Drive the transition from private to public cloud and implement foundational reliability engineering practices to ensure high availability, performance, and reliability of services.
  • Incident Response: Lead incident response efforts including identification, triage, resolution, and post-incident analysis to prevent recurrence and enhance system resilience.
  • Monitoring & Alerting: Develop and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics, enabling proactive issue detection and mitigation.
  • Automation: Implement tools and processes that automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.
  • Capacity Planning & Optimization: Conduct capacity planning, performance tuning, and resource optimization in collaboration with development and operations teams to meet scalability and performance goals.
  • Security Collaboration: Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.
  • Deployment & Configuration Management: Manage deployment pipelines, release processes, and configuration management to ensure consistency, reliability, and version control across environments.
  • Data-Driven Improvement: Identify areas for improvement in reliability, performance, and efficiency through data analysis, root cause analysis, and trend analysis. Drive initiatives to enhance system reliability and operational efficiency.
  • Documentation & Knowledge Sharing: Create and maintain documentation, runbooks, troubleshooting guides, and best practices, and encourage knowledge sharing within the team.
  • Disaster Recovery: Develop and test disaster recovery plans, backup strategies, and failover mechanisms to ensure business continuity and data integrity in case of failures or disasters.
  • Collaboration & Coordination: Work closely with development, QA, DevOps, and product teams to ensure alignment on reliability goals, performance metrics, release schedules, and incident response processes.
  • On-call Support: Participate in on-call rotations and provide 24/7 support for critical incidents, including troubleshooting issues and coordinating with teams on resolution and follow-up actions in line with defined SLAs.

Required Qualifications:

  • Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent work experience.
  • Proven experience with cloud technologies and environments, specifically public cloud platforms (e.g., AWS, Azure, GCP).
  • Strong experience in reliability engineering, incident management, and system optimization.
  • Solid knowledge of monitoring, alerting, and observability tools (e.g., Prometheus, Grafana, Datadog).
  • Expertise in automation tools and CI/CD processes (e.g., Jenkins, GitLab CI, Terraform, Ansible).
  • Hands-on experience in application and system performance tuning, capacity planning, and resource management.
  • Experience with deployment pipelines, release processes, and configuration management tools (e.g., Jenkins, Spinnaker).
  • Familiarity with disaster recovery planning, backup strategies, and failover mechanisms.
  • Excellent problem-solving, troubleshooting, and communication skills.
  • Strong collaboration skills and ability to work effectively across teams.

Preferred Qualifications:

  • Experience with containerized environments and orchestration tools (e.g., Docker, Kubernetes).
  • Familiarity with security best practices, vulnerability assessments, and compliance standards (e.g., PCI-DSS, SOC 2).
  • Experience designing high-performance, fault-tolerant systems, with emphasis on reliability and scalability.

Certifications (if any):

Cloud certifications (e.g., AWS Certified Solutions Architect, Microsoft Certified: Azure Solutions Architect Expert, Google Professional Cloud Architect) are highly preferred.

Certified Kubernetes Administrator (CKA) or similar certifications are a plus.
