Senior II Site Reliability Engineer

Akamai · Canada

Canada · Exp: 8+ yrs · 120,400-216,600 CAD/yearly · Remote

Remuneration

120,400-216,600 CAD/yearly

Location

Canada

Visa sponsorship

Not specified

Job summary

Akamai is seeking a Senior II Site Reliability Engineer to join its Cloud Networking SRE Team. The role involves owning the reliability of cloud load balancing infrastructure, designing and implementing observability frameworks, leading incident response, and driving safe deployment practices. The ideal candidate has extensive experience in SRE or infrastructure engineering with a focus on large-scale distributed systems.

Benefits

Healthcare
RRSP
Company holidays
Vacation
Sick time
Family friendly benefits
Employee assistance program
Mental wellness
Financial wellness
Annual bonus
Incentives
Equity awards
Employee Stock Purchase Plan (ESPP)

Qualifications

8+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
Deep expertise in Linux networking fundamentals and in diagnosing issues at the packet level using tcpdump, netstat, and similar tools
Hands-on experience with L4/L7 load balancing technologies covering configuration, health checking, high availability, and failure modes at scale
Track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
Expertise in Kubernetes and containerization at scale including workload scheduling, networking, resource management, and operating stateful or network-intensive workloads in a cluster environment
Ability to build automation and tooling using Python or Go, with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and deployment safety instincts

Responsibilities

Own the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
Design and implement SLO/SLI frameworks that reflect customer experience for load balancing services and drive action when error budgets are at risk
Build and maintain observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
Lead technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
Develop and automate safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
Review design documents and product-requirement documents, and produce actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
Build automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability

Skills

Akamai
Ansible
Go
Kubernetes
Linux
Python
SaltStack
Terraform
