Infra Engineer - SRE
Description
About GMI
We are a fast-growing AI infrastructure startup based in Silicon Valley, working on cutting-edge technologies that power the future of artificial intelligence. We power developers, startups, and enterprises with scalable GPU cloud and inference solutions, helping AI builders turn ideas into reality. As we expand globally, we are looking for a dynamic and hands-on Site Reliability Engineer.
Role Overview
We are seeking a skilled Site Reliability Engineer to join the GMI Global Infrastructure team. This hands-on role is critical to ensuring the stability, efficiency, and reliability of the large-scale, high-performance AI/ML clusters in our data center. The ideal candidate will bring expertise in system-level troubleshooting, AI cluster maintenance, and operational excellence to ensure maximum performance for our infrastructure. Experience with large-scale infrastructure automation is a strong plus.
Responsibilities
- Design, implement and maintain scalable AI/ML infrastructure solutions.
- Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems.
- Automate deployment, configuration and management of infrastructure resources.
- Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades and decommissioning.
- Implement CI/CD pipelines for infrastructure deployment and orchestration.
- Ensure security, compliance and best practices across infrastructure.
- Manage incident response for infrastructure resources (GPU, CPU, storage, network).
- Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction.
- Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate.
- Travel regionally and internationally to GMI data center locations.
Qualifications
- Bachelor’s degree in Computer Science or related field.
- 3+ years of experience in data center operations, infrastructure, or systems engineering.
- Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform).
- Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm).
- Familiarity with Linux system administration and scripting (Python, Bash).
- Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki.
- Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks is an added advantage.
- Strong troubleshooting skills and ability to analyze system logs and performance metrics.
- Excellent communication and teamwork abilities.
Meeting every qualification is not required—if you’re excited about this role, we’d love to hear from you. We believe diverse perspectives and experiences strengthen our team.