DevOpsHunt

Role Summary:

This role is the team-facing, consultative side of observability. The senior engineer partners directly with internal engineering teams to understand their systems, pain points, and reliability gaps. They translate team needs into observability solutions: dashboards, metrics, SLOs, SLIs, alerting strategies, and visibility improvements. They'll help establish the client's enterprise-wide SRE standards and ensure teams adopt consistent best practices.

How this role works day to day:

● Meet with internal teams to gather technical and operational requirements

● Design and implement tailored observability solutions across tools like Grafana, Sumo, AppDynamics, and New Relic

● Build more detailed dashboards for product teams and executive visibility

● Define and maintain SLOs, SLIs, and reliability reporting patterns

● Identify gaps in monitoring or alerting and lead the solutioning

● Partner with embedded SREs across Alaska's hub-and-spoke model

● Influence tool consolidation, standards, and enterprise reliability strategy

This role acts as an internal consultant and technical leader for observability and reliability practices.

Top 3 Skills:

● Advanced Grafana Expertise - Strong ability to create complex dashboards, build transformations, define SLOs/SLIs, and integrate with multiple data sources.

● SRE Principles and Systems Thinking - Deep understanding of service health, SLOs, SLIs, error budgets, incident patterns, distributed systems, and reliability engineering fundamentals.

● Cross-Team Collaboration and Technical Requirements Gathering - Ability to sit with teams, understand their needs, translate them into observability solutions, and deliver dashboards, alerting, and reliability patterns.

Core Responsibilities:

● Build dashboards in Grafana for internal teams and leadership.

● Maintain observability tools and handle incoming requests.

● Connect data sources across tools (Grafana, Sumo, AppD, New Relic).

● Assist teams with setting up alerting, logging structure, and basic SLOs.

● Instrument new apps into monitoring tools.

● Create repeatable patterns and templates for team onboarding.

● Build playbooks and small automation tasks using Ansible Automation Platform.

Required Skills:

● 3+ years of hands-on observability experience (Grafana required plus supporting tools)

● 2+ years practicing SRE fundamentals (SLOs/SLIs, incident patterns, distributed systems, reliability engineering)

● 5+ total years in SRE, DevOps, cloud, systems, platform, or monitoring engineering roles

● Experience partnering with application teams to gather requirements and deliver solutions

● Strong ability to explain complex concepts clearly to non-SRE partners

Nice to Have:

● Experience with ThousandEyes, AppDynamics, New Relic, or Sumo Logic

● Familiarity with Azure, Kubernetes, CI/CD pipelines, or software delivery platforms

● Experience contributing to observability standards at scale

● Background in high-uptime industries such as travel, finance, telecom, or cloud-based SaaS

Site Reliability Engineer

Description