Site Reliability Engineer
Description
Overview
Our client is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of mission-critical production systems. This role applies software engineering principles to infrastructure and operations, partnering closely with Product, Platform, and Software Engineering teams to improve system resilience, automate operational workflows, and drive reliability best practices across a cloud-native environment.
Key Responsibilities
- Design, implement, and operate highly reliable, scalable infrastructure and services in cloud environments.
- Establish and enforce service-level objectives (SLOs), service-level indicators (SLIs), and error budgets in partnership with engineering teams.
- Build and maintain automation to reduce toil and improve operational efficiency across systems and services.
- Monitor system health, performance, and availability; respond to incidents and lead root cause analysis and post-incident reviews.
- Collaborate with software engineers to improve system design, operability, and production readiness.
- Develop and maintain observability solutions, including monitoring, logging, and alerting frameworks.
- Participate in on-call rotations and continuously improve incident response processes and documentation.
- Identify and mitigate reliability risks related to scalability, capacity, security, and dependencies.
- Advocate for reliability-focused engineering practices throughout the development lifecycle.
Operating Context & Impact
- Operates in a cloud-first, distributed systems environment with frequent production deployments.
- Supports multiple product and platform teams, enabling reliable delivery at scale.
- Success is measured by system uptime, SLO attainment, mean time to recovery (MTTR), incident frequency, and reduction of operational toil.
Required Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 4+ years of professional experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Strong experience operating distributed systems in cloud environments (AWS, GCP, or Azure).
- Proficiency in at least one programming or scripting language (e.g., Python, Go, Bash).
- Experience with containerization and orchestration technologies (Docker, Kubernetes).
- Hands-on experience with monitoring, logging, and alerting systems.
- Solid understanding of Linux systems, networking, and cloud security fundamentals.
- Strong analytical, troubleshooting, and cross-functional communication skills.
Core Skills
- Site Reliability Engineering
- Cloud Infrastructure & Distributed Systems
- Automation & Reliability Tooling
- Monitoring, Logging & Observability
- Incident Response & Root Cause Analysis
- SLOs, SLIs & Error Budgets
- Cross-Functional Collaboration
By applying, you:
- Join our candidate network for current and future opportunities with our hiring partners.
- May receive feedback on your resume and job search approach.
- Will hear from us directly when we see a live opportunity that matches your background and preferences.