Site Reliability Engineer

STN Inc · San Francisco Bay Area, CA, United States
San Francisco Bay Area, CA, United States · Exp: 5+ yrs · Remote
Remuneration
Not specified
Location
San Francisco Bay Area, CA, United States
Visa sponsorship
Not specified

Job summary

The Site Reliability Engineer (SRE) owns reliability, observability, and incident response for the GPU One platform.

Qualifications

  • 5+ years in SRE, DevOps, or production engineering roles
  • Strong programming skills in Go, Python, or both
  • Hands-on experience operating Kubernetes-based platforms at scale
  • Deep familiarity with observability tooling (e.g., Prometheus, Grafana, Datadog, OpenTelemetry)
  • Strong incident management experience including major-incident command

Responsibilities

  • Define and operate Service Level Objectives (SLOs) aligned with customer SLAs
  • Build and maintain the observability stack including metrics, logs, traces, and alerting
  • Lead incident response and chair post-incident reviews
  • Drive automation to reduce toil and shorten mean time to recovery (MTTR)
  • Author and maintain operational runbooks alongside the NOC
  • Manage the on-call rotation, escalation paths, and incident-management tooling
  • Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering
  • Drive chaos engineering, game days, and reliability testing programs
  • Produce SLA performance reports in coordination with the SLA Manager
  • Mentor junior engineers and contribute to engineering culture

Skills

Datadog · Go · Grafana · Kubernetes · OpenTelemetry · Prometheus · Python

Relocation

No