DevOpsHunt

Senior Site Reliability Engineer

Remote (USA)

About the Opportunity:

The organization operates a Kubernetes-native distributed system that orchestrates many components to serve and train large neural networks efficiently. This role focuses on solving infrastructure challenges in both cloud and on-premises environments and on managing the broader cloud infrastructure and development tooling.

Responsibilities:

• Ensure smooth operation and high availability of core services

• Monitor system performance, identify bottlenecks, and implement optimizations

• Develop Kubernetes resources and custom tooling for cloud and on-premises deployments

• Design and implement scalable, secure, and cost-effective infrastructure solutions

• Collaborate with teams across the organization to resolve engineering challenges

Requirements:

• BS/BA in Computer Science or a related field

• Good knowledge of cloud providers such as AWS, GCP, or similar platforms

• Expertise with Kubernetes (EKS, GKE, self-hosted) and Infrastructure as Code using Terraform and Helm

• Solid understanding of web and networking protocols, including HTTP, TLS, and DNS, as well as certificate management

• Experience with CI/CD pipelines using tools such as GitHub Actions, ArgoCD, and Atlantis

• Strong interpersonal skills for collaboration across different time zones and regions

Benefits & Perks:

No benefits or perks were specified.

Note:

“RemoteHunter is not the Employer of Record (EOR) for this role. Our purpose in this opportunity is to connect exceptional candidates with leading employers. We help job seekers worldwide discover roles that match their goals and guide them to complete their full application directly through the hiring company’s career page or ATS.”
