Site Reliability Engineer

STN Inc · San Francisco Bay Area, CA, United States

San Francisco Bay Area, CA, United States · Exp: 5+ yrs · Remote

Remuneration

Not specified

Location

San Francisco Bay Area, CA, United States

Visa sponsorship

Not specified

Job summary

The Site Reliability Engineer (SRE) is responsible for the reliability and observability of the GPU One platform and for its incident response.

Define and operate Service Level Objectives (SLOs) aligned with customer SLAs
Build and maintain the observability stack including metrics, logs, traces, and alerting
Lead incident response and chair post-incident reviews
Drive automation to reduce toil and improve mean time to recovery (MTTR)
Author and maintain operational runbooks alongside the NOC
Manage on-call rotation, escalation paths, and incident-management tooling
Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering
Drive chaos engineering, game days, and reliability testing programs
Produce SLA performance reports in coordination with the SLA Manager
Mentor junior engineers and contribute to engineering culture

Datadog · Go · Grafana · Kubernetes · OpenTelemetry · Prometheus · Python