Site Reliability Engineer
Fab*** · New York, NY, United States
New York, NY, United States · Full time · Exp: 5+ yrs · 135,000-160,000 USD/yearly · Remote
Remuneration
135,000-160,000 USD/yearly
Location
New York, NY, United States
Visa sponsorship
Sponsors visa
Job summary
Fab*** is seeking a Site Reliability Engineer to own and evolve the infrastructure powering healthcare experiences for millions of patients. This role involves bridging traditional infrastructure excellence with AI-driven operations, acting as a primary architect for AWS and Kubernetes (EKS) environments, and exploring agentic workflows to modernize SRE practices. The engineer will be a steward of Fabric’s production integrity, leading strategy for infrastructure automation, observability, and system resilience.
Benefits
Medical insurance, Dental insurance, Vision insurance, Unlimited PTO, 401(k) plan, Stock options, Bonuses
Qualifications
- 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale
- Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems
- Strong coding and scripting skills in Python, Bash, Ruby, or Go
- Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency
- A rigor-first mindset with a dedication to HIPAA-compliant, high-availability architecture
Responsibilities
- Own and evolve the infrastructure powering healthcare experiences for millions of patients
- Act as a primary architect for AWS and Kubernetes (EKS) environments
- Ensure the platform is resilient, scalable, and compliant
- Explore how agentic workflows can modernize SRE practices
- Be a steward of Fabric’s production integrity
- Lead the strategy for infrastructure automation, observability, and system resilience
- Design, deploy, and maintain production Kubernetes (EKS) clusters to ensure enterprise-grade availability
- Eliminate manual configuration by building and managing scalable infrastructure state through Terraform
- Optimize the AWS footprint (EC2, RDS, S3) for performance, cost-efficiency, and reliability
- Explore and deploy agentic workflows for AI-assisted runbooks to automate complex operational decisions and repetitive tasks
- Build and evolve deployment pipelines using GitHub Actions or Semaphore for rapid and safe delivery
- Focus on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems
- Drive the evolution of the observability stack in Datadog by implementing sophisticated metrics, traces, and logs to meet SLOs
- Lead incident response efforts and facilitate blameless postmortems to systematically reduce mean time to recovery (MTTR)
- Define and monitor SLIs and SLOs to ensure the platform consistently meets rigorous healthcare performance standards
- Ensure all infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements
- Mentor engineers across the company on reliability best practices
- Contribute a clinical-safety perspective to cross-functional design reviews
Skills
Go, Kubernetes, GitHub Actions, Terraform, Ruby, S3, Cloud Monitoring, EKS, Datadog, AWS, Bash, GitHub, Python, REST
Languages
Python, Bash, Ruby, Go
Industry
Healthcare
Relocation
No