DevOpsHunt

Job Title

Site Reliability Engineer (Remote – United States)

Job Summary

We are seeking a skilled and dependable Site Reliability Engineer to join our engineering team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our systems and services. You will work closely with software engineers, product teams, and operations to build resilient infrastructure and improve system availability while supporting a culture of continuous improvement.

This position is fully remote within the United States.

Key Responsibilities

Design, implement, and maintain reliable, scalable, and highly available systems
Monitor system performance and availability, identifying and resolving issues proactively
Automate operational tasks to improve efficiency and reduce manual intervention
Participate in incident response, root cause analysis, and post-incident reviews
Collaborate with development teams to improve application reliability and deployment processes
Maintain and enhance monitoring, alerting, and logging systems
Ensure systems meet security, compliance, and performance standards
Contribute to documentation, runbooks, and best practices for operational excellence
Support on-call rotations as needed to maintain system uptime

Required Qualifications

Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience
3+ years of experience in Site Reliability Engineering, DevOps, or systems engineering roles
Strong experience with Linux/Unix-based systems
Proficiency in scripting or programming languages such as Python, Go, or Bash
Experience with cloud platforms (AWS, Azure, or Google Cloud Platform)
Familiarity with containerization and orchestration tools (Docker, Kubernetes)
Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog)
Strong problem-solving and troubleshooting skills

Preferred Qualifications

Experience with Infrastructure as Code tools (Terraform, CloudFormation)
Knowledge of CI/CD pipelines and deployment automation
Understanding of networking, security best practices, and distributed systems
Prior experience supporting high-availability or large-scale production environments

Soft Skills

Strong communication and collaboration skills
Ability to work independently in a remote environment
Detail-oriented with a proactive approach to problem-solving
Commitment to reliability, quality, and continuous improvement
