Site Reliability Engineer

Fab*** · New York, NY, United States
New York, NY, United States · Full time · Exp: 5+ yrs · 135,000-160,000 USD/yearly · Remote
Remuneration
135,000-160,000 USD/yearly
Location
New York, NY, United States
Visa sponsorship
Sponsors visa

Job summary

Fab*** is seeking a Site Reliability Engineer to own and evolve the infrastructure powering healthcare experiences for millions of patients. This role involves bridging traditional infrastructure excellence with AI-driven operations, acting as a primary architect for AWS and Kubernetes (EKS) environments, and exploring agentic workflows to modernize SRE practices. The engineer will be a steward of Fabric’s production integrity, leading strategy for infrastructure automation, observability, and system resilience.

Benefits

Medical insurance, Dental insurance, Vision insurance, Unlimited PTO, 401(k) plan, Stock options, Bonuses

Qualifications

  • 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
  • Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
  • Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
  • Strong coding and scripting skills in Python, Bash, Ruby, or Go
  • Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency
  • A rigor-first mindset with a dedication to HIPAA-compliant, high-availability architecture

Responsibilities

  • Own and evolve the infrastructure powering healthcare experiences for millions of patients
  • Act as a primary architect for AWS and Kubernetes (EKS) environments
  • Ensure the platform is resilient, scalable, and compliant
  • Explore how agentic workflows can modernize SRE practices
  • Be a steward of Fabric’s production integrity
  • Lead the strategy for infrastructure automation, observability, and system resilience
  • Design, deploy, and maintain production Kubernetes (EKS) clusters to ensure enterprise-grade availability
  • Eliminate manual configuration by building and managing scalable infrastructure state through Terraform
  • Optimize the AWS footprint (EC2, RDS, S3) for performance, cost-efficiency, and reliability
  • Explore and deploy agentic workflows for AI-assisted runbooks to automate complex operational decisions and repetitive tasks
  • Build and evolve deployment pipelines using GitHub Actions or Semaphore for rapid and safe delivery
  • Focus on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
  • Drive the evolution of the observability stack in Datadog by implementing sophisticated metrics, traces, and logs to meet SLOs
  • Lead incident response efforts and facilitate blameless postmortems to systematically reduce mean time to recovery (MTTR)
  • Define and monitor SLIs and SLOs to ensure the platform consistently meets rigorous healthcare performance standards
  • Ensure all infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
  • Mentor engineers across the company on reliability best practices
  • Contribute a clinical-safety perspective to cross-functional design reviews

Skills

Go, Kubernetes, GitHub Actions, Terraform, Ruby, S3, Cloud Monitoring, EKS, Datadog, AWS, Bash, GitHub, Python, REST

Languages

Python, Bash, Ruby, Go

Industry

Healthcare

Relocation

No