Hi, I’m Shahzad Khawar, a Site Reliability Engineer (SRE) and DevOps professional with 6+ years of experience designing, automating, and maintaining highly available, multi-cloud infrastructure. I’m passionate about operational excellence, scalability, and reliability across hybrid and multi-cloud environments. I enjoy bridging gaps between development, operations, and business teams, mentoring juniors, and building robust monitoring, incident response, and automation solutions that prevent problems before they occur.

Shahzad Khawar

Available to hire

Experience Level

Expert
Expert
Expert
Expert
Expert
Intermediate

Work Experience

SRE, DevOps, & Cloud Automation at AT&T
April 1, 2019 - June 1, 2023
Led incident management processes, bridged communication gaps between development, operations, and business teams, and implemented proactive monitoring using Grafana. Built and maintained CI/CD pipelines with Jenkins, implemented IaC with Terraform, and used Ansible for configuration management. Managed Kubernetes clusters and automated deployment pipelines for speed and reliability.
SRE, DevOps, & Cloud Automation at IBM
June 1, 2023 - Present
Mentored junior engineers on best practices in CI/CD, IaC (Terraform), and SRE principles, improving team capability and code quality. Bridged communication gaps between development, operations, and business teams to improve deployment velocity. Drove SRE principles and practices to maintain high availability across multi-cloud production environments. Established end-to-end incident management with structured root cause analysis and post-mortems to reduce MTTR. Implemented comprehensive monitoring and alerting using Grafana to proactively detect and mitigate performance bottlenecks. Managed end-to-end CI/CD pipelines with Jenkins, automating build/test/deploy cycles for speed and reliability. Configured GitOps workflows with GitLab for a single source of truth and immutability, and maintained core cloud resources with Terraform for AWS, GCP, and Azure. Automated configuration management and deployments across hundreds of nodes using Ansible, boosting infrastructure consistency and scaling capabilities. Deployed, managed, and troubleshot Kubernetes clusters, optimizing resource allocation and deployment strategies. Wrote robust Python and Bash automation scripts for daily operations.
Linux Systems & Infrastructure at AT&T
April 1, 2019 - June 30, 2023
Administered critical enterprise systems on RHEL and Ubuntu with a focus on security and compliance. Managed centralized patching, provisioning, and content management for RHEL environments using Red Hat Satellite. Configured and managed AWS services (EC2, S3, VPC) with proper security and tagging. Codified initial infrastructure deployments in AWS using Terraform and maintained version control for core cloud resources. Supported physical bare-metal servers and VMware-based virtual infrastructure, ensuring hybrid environment stability. Wrote Bash scripts to automate routine administration, health checks, and data backups; performed in-depth performance analysis to optimize resources. Implemented structured post-mortems and runbooks to retain institutional knowledge and enable repeatable success.
Linux Systems & Infrastructure at AT&T
April 1, 2015 - April 1, 2019
Early career role focusing on Linux systems administration and infrastructure management. Administered RHEL/Ubuntu systems, implemented patching strategies, and supported cloud and on-prem resources. Contributed to automation and scripting efforts to streamline operational tasks and improve reliability.

Education

Bachelor of Science at CUNY Brooklyn College

Qualifications

Red Hat Enterprise Linux (RHEL)
Ubuntu

Industry Experience

Telecommunications, Software & Internet, Professional Services