I’m a Site Reliability Engineer, Developer, and DevOps Specialist with 5+ years of experience in Java/Spring Boot, AWS, and CI/CD automation. Skilled in monitoring, performance optimization, and cloud infrastructure, I build reliable, scalable systems and deliver fast, high-quality results. Based in Toronto, Canada, I’m a quick learner passionate about automation and uptime.

Harshita Kaur Chugh

Available to hire

Experience Level

Expert
Expert
Expert
Expert
Expert
Expert
Expert
Expert
Expert
Intermediate

Language

English
Fluent

Work Experience

Site Reliability Engineer at American Express
June 1, 2025 - June 1, 2025
Led production support and mission-critical incident management for 5+ Java/Spring Boot applications in enterprise environments. Coordinated RCA and incident response for 100+ high-severity incidents. Implemented proactive monitoring with Splunk, Dynatrace, and Grafana; provided CI/CD support with Jenkins, GitHub Actions, and Bitbucket; reduced MTTR and ensured reliable production operations through permanent fixes and automation.
Production Support & Incident Management Engineer at American Express
June 1, 2025 - June 1, 2025
Led incident response to 100+ high-severity incidents in a 24/7 production environment, coordinating cross-functional teams to restore services quickly and drive root-cause analysis (RCA). Reduced system downtime by 30% through proactive monitoring, RCA ownership, and implementation of permanent fixes. Applied ITIL/ITSM processes to standardize production support workflows. Defined and tracked SLIs/SLOs for key services to align with business SLAs. Designed alert strategies in Splunk, Dynatrace, and Grafana, reducing alert noise and MTTR. Led Post-Implementation Validation automation, saving 200+ man-hours across 100+ apps. Ensured smooth execution of batch scheduling jobs and production workflows, eliminating delays and data mismatches. Provided DB & platform support with SQL/PL-SQL for PostgreSQL and Oracle. Supported Spring Boot/J2EE applications, identified performance bottlenecks, and coordinated fixes with development teams. Managed secure file transfer operations by renewing 100

Education

Bachelor of Technology - Computer Science Engineering at University of Petroleum and Energy Studies
January 1, 2016 - January 1, 2020

Qualifications

ITIL / ITSM
January 11, 2030 - November 3, 2025
Post-Implementation Validation Automation
January 11, 2030 - November 3, 2025
ITIL Certification
January 11, 2030 - November 3, 2025

Industry Experience

Software & Internet, Professional Services, Financial Services