Reliability Engineer - Ansible and DataDog - WFH - 1099 / C2C ok - Datamanagementgroup : Job Details

Reliability Engineer - Ansible and DataDog - WFH - 1099 / C2C ok

Datamanagementgroup

Job Location: All cities, SC, USA

Posted on : 2025-05-07T00:53:48Z

Job Description :

Looking for an experienced Reliability Engineer to support critical projects for our Technology, Infrastructure & Operations teams.

Work from home; work is to be done primarily during US Eastern Time hours.

  • Minimum of 7 years of performance engineering and performance testing experience
  • MUST HAVE 3+ years of recent work with Ansible
  • MUST HAVE 4+ years of work with DataDog
  • Excellent English communication skills, verbal and written (idiomatic English)
  • Experience managing performance engineering efforts for applications strongly preferred.
  • Knowledge of developing monitoring scripts using PowerShell, Python, and shell scripting.
  • 5 years of Splunk programming proficiency is highly preferred.
  • 5-6 years of experience with .NET and Java applications and application monitoring tools such as AppDynamics or DataDog is highly preferred.
  • Proficiency in performance tuning is preferred.
  • Good understanding of the UI, middleware, and back-end databases
  • BA/BS degree in Information Technology, Computer Science or related field of study

Duties include:

  • Develop and maintain comprehensive monitoring solutions for cloud-based services and applications.
  • Configure monitoring tools and systems to collect relevant metrics, logs, and traces.
  • Create custom monitoring dashboards and reports using Splunk, DataDog, DynaTrace, or other tools to provide real-time insights into system performance and health.
  • Continuously monitor the cloud infrastructure's performance and capacity, anticipating and addressing potential scalability issues.
  • Proactively suggest and implement improvements to enhance the system's reliability, resilience, and fault tolerance.
  • Work on automating tasks to streamline operational processes and reduce manual intervention.
  • Collaborate with cross-functional teams to investigate and resolve critical incidents, ensuring minimal impact on end-users.
  • Work with the Problem Management team to complete post-mortem analysis of incidents, identify root causes, and implement preventive measures.
  • Understand the overall architecture of our systems to identify gaps in monitoring and troubleshoot issues.
  • Configure and maintain custom dashboards and alerts in various monitoring tools.
  • Create custom reports and deliver report presentations to various stakeholders.
  • Develop monitoring scripts using PowerShell, Python, and shell scripting.
  • Develop metrics for both the business and technical teams to determine the health of systems.
  • Provide on-call support as needed.
  • Lead and coordinate performance engineering for medium to large initiatives.
  • Collect and document expected system performance and operational characteristics.
  • Collect and/or prepare test data for test execution.
  • Develop and execute performance tests including load, stress, endurance, fail-over and interoperability.
  • Conduct technical analysis of performance test results and production systems, and provide recommendations on performance tuning, systems, and infrastructure.
  • Identify, report, and review defects in assessing system performance and stability.
  • Define the strategy for enabling performance diagnostics and monitoring using an Application Performance Management (APM) tool, other monitoring tools, and diagnostic techniques.
  • Collaborate with developers to promote performance engineering during all phases of the SDLC so that performance issues are detected and corrected earlier in the lifecycle.
  • Lead peer reviews to ensure the completeness of all test assets created.
  • Resolve performance and stability issues in the performance test environment.
  • Develop performance engineering work plan structure and project schedule.
  • Review architectural design for performance risks and potential issues.
  • Prepare capacity analysis when applicable.