Antonio Marchi

DevOps Developer, Cloud Developer, Full Stack Developer





Available to hire

I’m Antonio Marchi, a Site Reliability Engineer with a focus on ML Ops and on-prem platform engineering. I’ve led migrations across cloud and on-prem, cut millions of infra spend, and designed reusable blueprints by pooling knowledge from senior engineers across London and NYC. My experience spans public cloud and on-prem environments, with a recent emphasis on low-latency systems and end-to-end data center management. I even built a data center in my basement to run ML/AI projects that inform investment decisions (my geeky gambling hobby).

I’ve led CI/CD redesigns, telemetry adoption, and production readiness across distributed teams, and I’ve worked in London and NYC to support 80+ engineers building proprietary trading platforms. I enjoy hands-on engineering and collaboration with other seasoned engineers to deliver reliable, scalable platforms.

Skills

Skill: Experience Level

[Names of 15 further skills were not preserved: 14 rated Expert, 1 rated Intermediate]

TypeScript: Intermediate

JavaScript: Intermediate

Language

English: Fluent

Work Experience

Senior Site Reliability Engineer at ThinkAlpha NY (Avatar Securities spin-off, Citadel Investment Group)

June 30, 2025 - June 30, 2025

Led migration of platforms to AWS EKS and Proxmox; improved observability with ELK/Prometheus/Grafana, established SLOs and error budgets; reduced firefighting with PagerDuty integration; migrated to centralized CI/CD and end-to-end testing for trading systems; supported 80+ engineers delivering proprietary trading platforms.

Senior Site Reliability Engineer at Rain Bird Technologies

June 1, 2025 - November 1, 2025

Led deployment of telemetry agents for on-call telemetry collection; consolidated logs, metrics, and traces into Honeycomb; established SLOs and error budgets; redesigned environments by centralizing CI/CD and enabling staging canary deployments; improved governance with ELK/Grafana/Prometheus stack.

Site Reliability Engineer at OnSatis Services Ltd

July 1, 2024 - July 1, 2024

Migrated a bank authentication system from on-prem to Kubernetes on AWS; led 10+ production migrations; saved over £3 million in cloud costs. Deployed Proxmox-based infrastructure and Kubernetes clusters, implemented latency optimizations, decoupled staging from production, and introduced centralized observability.

Senior Site Reliability Engineer at ThinkAlpha NYC

July 1, 2024 - November 17, 2025

Remote SRE leading deployment and observability initiatives: rolled out telemetry collection and consolidated logs, metrics, and traces into Honeycomb. Established SLOs and error budgets; redesigned CI/CD into centralized GitHub workflows with shared Helm chart repos and an Argo CD App-of-Apps pattern. Redesigned environments by decoupling staging from production and adding centralized CI/CD dev environments, container scanning, stop-on-fail guards, and staging canary deployments.

Site Reliability Engineer at ThinkAlpha NYC

June 1, 2024 - June 1, 2025

Spin-off of Avatar Securities; part of a Citadel Investment Group team of 6 SREs supporting 80+ engineers building proprietary trading platforms in Python and C++. Integrated Datadog APM and implemented a FinOps strategy, cutting AWS costs by ~35% and saving ~$1.2M annually. Migrated the development environment and ML training to colocated servers, and re-architected ML training jobs (Python) to run on Proxmox VM infrastructure. Stack included Proxmox, AlmaLinux 9, Kubernetes 1.28, ZFS on Linux, OpenTelemetry, Grafana, Prometheus, and more.

Platform Lead (Contract) at Satis Services Ltd

February 1, 2024 - July 1, 2024

Led a two-person team migrating operations from vSphere to Kubernetes on ZFS over iSCSI. Optimized latency with topology-aware routing, CNIs, and file system tuning, reducing it from 250-400 ms to 15-40 ms. Stack included Proxmox 7, AlmaLinux 9, Kubernetes 1.28, ZFS on Linux, OpenTelemetry, and Grafana. Implemented container scanning and staged canary deployments.

Senior Site Reliability Engineer at Rain Bird Technologies Ltd

January 31, 2024 - January 31, 2024

Led migration of platform to AWS EKS, established SLOs and error budgets, consolidated logs and traces into Honeycomb, redesigned CI/CD into centralized GitHub workflows, and built shared Helm charts and an Argo CD App-of-Apps pattern. Implemented decoupled staging/production environments with end-to-end testing, container scanning, stop-on-fail guards, and canary deployments.

Site Reliability Engineer at Rain Bird Technologies

January 31, 2024 - January 31, 2024

Led migration of platforms to AWS EKS; improved scalability and stability across development and production environments. Implemented ELK stack, Grafana, and Prometheus-based monitoring; established SLOs and error budgets; reduced firefighting through automation.

Senior Site Reliability Engineer at Rain Bird Technologies

June 1, 2021 - January 1, 2024

Led migration of platforms to AWS EKS; improved scalability and stability across development and production; implemented centralized monitoring with ELK, Grafana, and Prometheus; established SLOs and error budgets; streamlined CI/CD with GitLab CI and Argo CD-based patterns; reduced firefighting with on-call improvements and guard rails.