Available to hire

Azure Infrastructure & Kubernetes: Lead the SRE team in building and operating Kubernetes and VM clusters for an AIOps platform with 10M+ DAU, managing 10,000+ Linux and Windows servers. Develop Helm charts for microservice deployment and versioning across multiple environments, maintaining 99.99% availability.

Terraform & IaC: Build production workloads at scale on AWS using Terraform. Manage high-concurrency services, including Nginx, LVS, and Kafka, eliminating configuration drift.

Automation & Scripting: Developed Python, Bash, and PowerShell scripts to automate Kubernetes and VM cluster operations. Used Ansible and Argo Workflows for OS hardening and patching, reducing manual effort by 80% and MTTR by 35%.

Monitoring & Incident Response: Engineered full-stack observability using Prometheus, Grafana, and SQL. Designed advanced alerting logic and led on-call rotations, investigating root causes and implementing permanent fixes for high-severity outages.

Backup & Disaster Recovery: Designed and tested multi-region Disaster Recovery (DR) strategies and automated backup/recovery pipelines for PostgreSQL, MySQL, and S3, ensuring data integrity and meeting strict regulatory standards.

CI/CD & Security: Built and managed automated CI/CD pipelines using GitHub Actions, GitLab CI, and Jenkins, integrating ArgoCD for GitOps-based Kubernetes deployments.

Identity & Access Management (IAM): Implemented least-privilege access controls in Linux and Windows environments. Managed user permissions and system hardening to ensure 99% configuration compliance.

Networking & Connectivity: Optimized load balancing (Nginx, LVS) and resolved complex TCP/IP, DNS resolution, response latency, and routing issues. Managed VPN and firewall configurations to ensure secure connectivity across distributed environments.

Message Queue: Optimized kernel page cache parameters on Kafka brokers to reduce the volume of topic writes hitting broker disks, stabilizing Kafka P99 write latency within 15 ms.

Collaboration & Documentation: Maintain comprehensive runbooks and technical documentation in English to streamline team hand-offs and improve system reliability.

Experience Level

Expert
Expert
Expert
Expert
Expert
Expert
Intermediate
Intermediate
Intermediate
Intermediate

Language

English: Fluent
Chinese: Advanced

Work Experience

Site Reliability Engineer at Huawei Technologies
March 1, 2025 - Present
Lead the SRE team in building and operating Kubernetes and VM clusters on the AIOps platform with 10M+ DAU, managing 10,000+ Linux and Windows servers. Designed Helm charts for microservices deployment and versioning across environments; implemented Terraform/IaC for production workloads on Azure. Managed high-availability services including Nginx, LVS, and Kafka, eliminating configuration drift and reducing MTTR. Automated pipelines (GitHub Actions, GitLab CI, Jenkins) with Argo CD for GitOps deployments; implemented IAM and system hardening; built full-stack observability with Prometheus, Grafana, and SQL. Led on-call rotations and incident response, investigating root causes and implementing permanent fixes for high-severity outages.
DevOps Engineer, Infrastructure at Shanghai Gan Gyu Industrial Co., Ltd.
June 1, 2021 - August 1, 2023
Network engineering: optimized load balancing (Nginx, LVS) and resolved complex TCP/IP and DNS resolution issues. Managed VPN and firewall configurations to ensure secure connectivity across distributed environments. Tuned Kafka broker kernel page cache parameters to stabilize write latency (P99 < 15 ms). Developed runbooks and documentation in English to streamline hand-offs and improve reliability. Led automation initiatives across Linux and Windows stacks, including orchestration and patching, with an emphasis on security and incident response. Contributed to Knowledge Orchestration Platform projects to auto-detect issues and coordinate responses.

Education

Master of Science in Software Engineering at Maynooth University
September 1, 2023 - June 1, 2024
Bachelor of Computer Science and Technology at Chengdu Normal University
September 1, 2017 - June 1, 2021

Industry Experience

Computers & Electronics, Software & Internet, Professional Services, Media & Entertainment, Telecommunications, Education, Other