Site Reliability Engineer
CareerUS Solutions · United States
Visa: unknown · Salary: unknown · Work mode: unknown
Skills
aws, azure, bash, ci/cd, cloudformation, datadog, devops, docker, go, grafana, kubernetes, linux, prometheus, python, terraform
Description
Job Title
Site Reliability Engineer (Remote – United States)
Job Summary
We are seeking a skilled and dependable Site Reliability Engineer to join our engineering team. In this role, you will be responsible for ensuring the reliability, scalability, and performance of our systems and services. You will work closely with software engineers, product teams, and operations to build resilient infrastructure and improve system availability while supporting a culture of continuous improvement.
This position is fully remote within the United States.
Key Responsibilities
- Design, implement, and maintain reliable, scalable, and highly available systems
- Monitor system performance and availability, identifying and resolving issues proactively
- Automate operational tasks to improve efficiency and reduce manual intervention
- Participate in incident response, root cause analysis, and post-incident reviews
- Collaborate with development teams to improve application reliability and deployment processes
- Maintain and enhance monitoring, alerting, and logging systems
- Ensure systems meet security, compliance, and performance standards
- Contribute to documentation, runbooks, and best practices for operational excellence
- Support on-call rotations as needed to maintain system uptime
Required Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of experience in Site Reliability Engineering, DevOps, or systems engineering roles
- Strong experience with Linux/Unix-based systems
- Proficiency in scripting or programming languages such as Python, Go, or Bash
- Experience with cloud platforms (AWS, Azure, or Google Cloud Platform)
- Familiarity with containerization and orchestration tools (Docker, Kubernetes)
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog)
- Strong problem-solving and troubleshooting skills
Preferred Qualifications
- Experience with Infrastructure as Code tools (Terraform, CloudFormation)
- Knowledge of CI/CD pipelines and deployment automation
- Understanding of networking, security best practices, and distributed systems
- Prior experience supporting high-availability or large-scale production environments
Soft Skills
- Strong communication and collaboration skills
- Ability to work independently in a remote environment
- Detail-oriented with a proactive approach to problem-solving
- Commitment to reliability, quality, and continuous improvement