Antonio Marchi

Available to hire

I’m Antonio Marchi, a Site Reliability Engineer with a focus on ML Ops and on-prem platform engineering. I’ve led migrations across cloud and on-prem, cut millions in infrastructure spend, and designed reusable blueprints by pooling knowledge from senior engineers across London and NYC. My experience spans public cloud and on-prem environments, with a recent emphasis on low-latency systems and end-to-end data center management. I even built a data center in my basement to run ML/AI projects that inform investment decisions (my geeky gambling hobby).

I’ve led CI/CD redesigns, telemetry adoption, and production readiness across distributed teams, and I’ve worked in London and NYC to support 80+ engineers building proprietary trading platforms. I enjoy hands-on engineering and collaboration with other seasoned engineers to deliver reliable, scalable platforms.

Experience Level

Expert: 9 skills
Intermediate: 3 skills

Language

English
Fluent

Work Experience

Senior Site Reliability Engineer at ThinkAlpha NYC
July 1, 2024 - November 17, 2025
Remote SRE leading deployment and observability initiatives for telemetry collection, consolidating logs, metrics, and traces into Honeycomb. Established SLOs and error budgets; redesigned CI/CD into centralized GitHub workflows with shared Helm chart repos and an ArgoCD App-of-Apps pattern. Reworked environments by decoupling staging from production and adding centralized CI/CD dev environments, container scanning, stop-on-fail guards, and staging canary deployments.
Site Reliability Engineer at Rain Bird Technologies
January 31, 2024 - January 31, 2024
Led migration of platforms to AWS EKS; improved scalability and stability across development and production environments. Implemented ELK stack, Grafana, and Prometheus-based monitoring; established SLOs and error budgets; reduced firefighting through automation.
Site Reliability Engineer at OnSatis Services Ltd
July 1, 2024 - July 1, 2024
Migrated a bank authentication system from on-prem to Kubernetes on AWS; led 10+ production migrations; saved over £3 million in cloud costs. Deployed Proxmox-based infrastructure and Kubernetes clusters, and implemented latency optimizations; decoupled staging from production and introduced centralized observability.
Senior Site Reliability Engineer at Rain Bird Technologies Ltd
January 31, 2024 - January 31, 2024
Led migration of platform to AWS EKS, established SLOs and error budgets, consolidated logs and traces into Honeycomb, redesigned CI/CD into centralized GitHub workflows, and built shared Helm charts and an Argo CD App-of-Apps pattern. Implemented decoupled staging/production environments with end-to-end testing, container scanning, stop-on-fail guards, and canary deployments.
Senior Site Reliability Engineer at ThinkAlpha NY (Avatar Securities spin-off, Citadel Investment Group)
June 30, 2025 - June 30, 2025
Led migration of platforms to AWS EKS and Proxmox; improved observability with ELK/Prometheus/Grafana, established SLOs and error budgets; reduced firefighting with PagerDuty integration; migrated to centralized CI/CD and end-to-end testing for trading systems; supported 80+ engineers delivering proprietary trading platforms.
Senior Site Reliability Engineer at Rain Bird Technologies
June 1, 2021 - January 1, 2024
Led migration of platforms to AWS EKS; improved scalability and stability across development and production; implemented centralized monitoring with ELK, Grafana and Prometheus; established SLOs and error budgets; streamlined CI/CD via GitLab CI and ArgoCD-related patterns; reduced firefighting with on-call improvements and guard rails.
Site Reliability Engineer at ThinkAlpha NYC
June 1, 2024 - June 1, 2025
Member of a team of 6 SREs at a spin-off of Avatar Securities (Citadel Investment Group), supporting 80+ engineers building proprietary trading platforms in Python and C++. Integrated Datadog APM and implemented a FinOps strategy, cutting AWS costs by ~35% and saving ~$1.2M annually. Migrated the development environment and ML training to colocated servers, and re-architected the Python ML training jobs to run on Proxmox VM infrastructure. Stack included Proxmox, AlmaLinux 9, Kubernetes 1.28, ZFS on Linux, OpenTelemetry, Grafana, and Prometheus, among others.
Platform Lead (Contract) at Satis Services Ltd
February 1, 2024 - July 1, 2024
Led a two-person team migrating operations from vSphere to Kubernetes on ZFS over iSCSI; performed latency optimization with topology-aware routing, CNIs, and file system tuning; latency reduced from 250-400 ms to 15-40 ms; stack included Proxmox 7, AlmaLinux 9, Kubernetes 1.28, ZFS on Linux, OpenTelemetry, Grafana; implemented container scanning and staged canary deployments.
Senior Site Reliability Engineer at Rain Bird Technologies
June 1, 2025 - November 1, 2025
Led deployment of telemetry agents for on-call telemetry collection; consolidated logs, metrics, and traces into Honeycomb; established SLOs and error budgets; redesigned environments by centralizing CI/CD and enabling staging canary deployments; improved governance with ELK/Grafana/Prometheus stack.

Education

MSc Data Science at University of Strathclyde
January 1, 2018 - January 1, 2019
BSc Accounting & Finance - Quantitative Finance at Robert Gordon University
January 1, 2014 - January 1, 2018
MSc Data Science at University of Strathclyde
January 1, 2018 - December 31, 2019
BSc Accounting & Finance - Quantitative Finance at Robert Gordon University
October 1, 2019 - May 1, 2021
MSc Data Science at University of Strathclyde, Glasgow, UK
January 1, 2018 - January 1, 2020
BSc Accounting & Finance (Quantitative Finance) at Robert Gordon University, Aberdeen, UK
January 1, 2014 - January 1, 2018

Industry Experience

Software & Internet, Professional Services, Financial Services
