Site Reliability Engineer

Fab*** · New York, NY, United States

New York, NY, United States · Full time · Exp: 5+ yrs · 135,000-160,000 USD/yearly · Remote

Remuneration

135,000-160,000 USD/yearly

Location

New York, NY, United States

Visa sponsorship

Sponsors visa

Job summary

Fab*** is seeking a Site Reliability Engineer to own and evolve the infrastructure powering healthcare experiences for millions of patients. This role involves bridging traditional infrastructure excellence with AI-driven operations, acting as a primary architect for AWS and Kubernetes (EKS) environments, and exploring agentic workflows to modernize SRE practices. The engineer will be a steward of Fabric’s production integrity, leading strategy for infrastructure automation, observability, and system resilience.

Benefits

Medical insurance
Dental insurance
Vision insurance
Unlimited PTO
401(k) plan
Stock options
Bonuses

Qualifications

5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
Strong coding and scripting skills in Python, Bash, Ruby, or Go
Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency
A rigor-first mindset with a dedication to HIPAA-compliant, high-availability architecture

Responsibilities

Own and evolve the infrastructure powering healthcare experiences for millions of patients
Act as a primary architect for AWS and Kubernetes (EKS) environments
Ensure the platform is resilient, scalable, and compliant
Explore how agentic workflows can modernize SRE practices
Be a steward of Fabric’s production integrity
Lead the strategy for infrastructure automation, observability, and system resilience
Design, deploy, and maintain production Kubernetes (EKS) clusters to ensure enterprise-grade availability
Eliminate manual configuration by building and managing scalable infrastructure state through Terraform
Optimize the AWS footprint (EC2, RDS, S3) for performance, cost-efficiency, and reliability
Explore and deploy agentic workflows and AI-assisted runbooks to automate complex operational decisions and repetitive tasks
Build and evolve deployment pipelines using GitHub Actions or Semaphore for rapid and safe delivery
Focus on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
Drive the evolution of the observability stack in Datadog by implementing sophisticated metrics, traces, and logs to meet SLOs
Lead incident response efforts and facilitate blameless postmortems to systematically reduce mean time to recovery (MTTR)
Define and monitor SLIs and SLOs to ensure the platform consistently meets rigorous healthcare performance standards
Ensure all infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
Mentor engineers across the company on reliability best practices
Contribute a clinical-safety perspective to cross-functional design reviews

Skills

Go
Kubernetes
GitHub Actions
Terraform
Ruby
S3
Cloud Monitoring
EKS
Datadog
AWS
Bash
GitHub
Python
REST
