Ibrahim Jafri

Available to hire

Hi, I’m Ibrahim Jafri, a Site Reliability Engineer with over a decade of hands-on experience automating, architecting, and operating cloud-based, large-scale systems. I specialize in Linux, cloud platforms, Kubernetes, and infrastructure as code to drive reliability, scalability, and cost efficiency. I’ve led incident response and built robust observability to help product teams move faster with confidence. I’m passionate about turning complex reliability challenges into repeatable, automated solutions that empower teams.

I thrive in collaborative environments, partnering closely with development and product teams to continually refine deployment safety, resilience, and performance so that teams can innovate with confidence.

Language

English
Fluent

Work Experience

Site Reliability Engineer at Shopify
January 1, 2021 - Present
Design, operate, and improve the reliability of large-scale distributed systems running on Kubernetes. Define and implement SLOs/SLIs and error budgets. Automate infrastructure provisioning and lifecycle management using Terraform and internal platform tooling. Participate in on-call rotations, incident command, and post-incident reviews. Improve observability through metrics, logging, alerting, and reliability dashboards. Collaborate with application teams to improve deployment safety, resilience, and system performance. Contribute to platform reliability initiatives supporting high-traffic production services.
Senior DevOps Engineer at Clio
January 1, 2017 - December 31, 2020
Led DevOps initiatives for a growing SaaS platform. Designed and managed Kubernetes (EKS) clusters for production workloads. Built and maintained cloud infrastructure using Terraform and Helm. Developed and maintained CI/CD pipelines for safe, repeatable deployments. Implemented monitoring and alerting using Prometheus and Grafana. Led incident response efforts and authored postmortems. Mentored junior engineers and helped establish DevOps best practices across teams.
Cloud Engineer at TextNow
June 1, 2014 - December 31, 2016
Managed and scaled AWS-based infrastructure supporting high-traffic services. Designed cloud networking, load balancing, and auto-scaling solutions. Introduced Docker-based workflows for application deployment. Built CI/CD pipelines to automate build and release processes. Improved system reliability and uptime through proactive monitoring and automation. Worked closely with development teams to optimize cloud performance and operational costs.
Automation Engineer at Absorb Software
June 1, 2012 - May 31, 2014
Automated deployment and operational workflows using Bash and Python. Managed Linux-based servers and early cloud infrastructure. Built and maintained CI pipelines using Jenkins. Assisted development teams with release automation and environment provisioning. Implemented basic monitoring and alerting solutions. Supported production systems and participated in on-call rotations.

Education

Bachelor’s Degree at PUCIT
Until January 1, 2012
Coursework: Operating Systems Using Command Line; CCNA 2; CCNA 1; Network+ at Bow Valley College
January 11, 2030 - January 23, 2026

Qualifications

CCNA 2
January 11, 2030 - January 23, 2026
CCNA 1
January 11, 2030 - January 23, 2026
Network+
January 11, 2030 - January 23, 2026
Operating Systems Using Command Line
January 11, 2030 - January 23, 2026

Industry Experience

Software & Internet, Professional Services, Media & Entertainment