DevOpsHunt

Position: Site Reliability Engineer (SRE) - Infrastructure

Location: Atlanta, GA

Employment Type: Full-Time

Work Arrangement: Onsite Hybrid

Overview

The Site Reliability Engineer (SRE) will ensure the reliability, scalability, and performance of enterprise applications and services across cloud and on-premises environments. This role focuses on automation, monitoring, and incident response to minimize downtime and enhance operational efficiency. The position requires close collaboration with development, quality assurance, and operations teams to deliver secure and resilient systems.

What You Will Do

• Design, build, and maintain secure, compliant infrastructure using Infrastructure as Code tools such as Terraform and Ansible

• Automate provisioning and management of servers, storage, networks, Kubernetes clusters, and related systems across cloud and on-premises environments

• Develop tools and processes for automated deployment, configuration, monitoring, and alerting

• Collaborate with cross-functional teams to implement scalable and reliable cloud and data center solutions

• Participate in incident response, on-call rotations, and post-incident reviews to improve system resilience

• Monitor system performance and availability against service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs); proactively troubleshoot and resolve reliability, performance, and security issues

• Create and maintain disaster recovery and business continuity plans for critical systems

• Continuously analyze and improve infrastructure efficiency, scalability, and performance

• Stay current with emerging technologies and recommend tools or practices to enhance platform capabilities

• Share technical expertise and mentor team members to strengthen internal capabilities

What We Are Looking For

Required Qualifications

• Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience

• Proven experience as a Site Reliability Engineer or Systems Engineer

• Strong proficiency in Terraform and Ansible for infrastructure automation

• Hands-on experience with containers and container orchestration (e.g., Docker, Kubernetes)

• Proficiency in scripting languages such as Python or Bash

• In-depth knowledge of Google Cloud Platform (GCP) services, including compute, networking, storage, Kubernetes (GKE), and security

• Solid understanding of VMware virtualization and enterprise storage systems (e.g., Pure Storage)

• Experience with networking technologies including VLANs, VPNs, and routing protocols

• Strong grasp of IT infrastructure and operations principles, including systems integration and automation best practices

• Excellent communication and collaboration skills

• Ability to manage multiple priorities under pressure with strong problem-solving skills

Preferred Qualifications

• Terraform Associate certification

• GCP certification (e.g., Cloud Architect)

• Relevant certifications such as ITIL, PMP, or CISSP

• Experience in regulated or enterprise environments

Core Competencies

• Communication and collaboration across technical and business teams

• Problem-solving and analytical thinking

• Ownership and accountability for system reliability

• Adaptability to emerging technologies and changing business needs

• Leadership and mentorship within technical teams

Site Reliability Engineer - Infrastructure 4872
