Senior II Site Reliability Engineer
Akamai · Canada
Canada · Exp: 8+ yrs · 120,400-216,600 CAD/yearly · Remote
Remuneration
120,400-216,600 CAD/yearly
Location
Canada
Visa sponsorship
Not specified
Job summary
Akamai is seeking a Senior II Site Reliability Engineer to join its Cloud Networking SRE team. The role involves owning the reliability of cloud load balancing infrastructure, designing and implementing observability frameworks, leading incident response, and driving safe deployment practices. The ideal candidate has extensive experience in SRE or infrastructure engineering with a focus on large-scale distributed systems.
Benefits
- Healthcare
- RRSP
- Company holidays
- Vacation
- Sick time
- Family friendly benefits
- Employee assistance program
- Mental wellness
- Financial wellness
- Annual bonus
- Incentives
- Equity awards
- Employee Stock Purchase Plan (ESPP)
Qualifications
- 8+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
- Deep expertise in Linux networking fundamentals and in diagnosing issues at the packet level using tcpdump, netstat, and similar tools
- Hands-on experience with L4/L7 load balancing technologies covering configuration, health checking, high availability, and failure modes at scale
- Track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
- Expertise in Kubernetes and containerization at scale including workload scheduling, networking, resource management, and operating stateful or network-intensive workloads in a cluster environment
- Ability to build automation and tooling using Python or Go, with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and deployment safety instincts
Responsibilities
- Own the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
- Design and implement SLO/SLI frameworks that reflect customer experience for load balancing services and drive action when error budgets are at risk
- Build and maintain observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
- Lead technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
- Develop and automate safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
- Review design documents and product-requirement documents, and produce actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
- Build automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability
Skills
- Akamai
- Ansible
- Go
- Kubernetes
- Linux
- Python
- SaltStack
- Terraform
Relocation
No