Sai Teja Gurrapu

Available to hire

Hi, I’m Sai Teja Gurrapu, an AI/ML Engineer with 5+ years focused on multimodal AI, generative AI, and large-scale ML deployments. I enjoy accelerating analytics with GPU computing, building RAG systems, and co-developing enterprise AI platforms. I thrive when turning complex data into practical business solutions and collaborating with cross-functional teams across global industries.

My toolkit spans MLOps, cloud-native integration, and Responsible AI practices. I’ve led scalable data pipelines, monitoring for GPU clusters, and multi-agent AI orchestration to deliver measurable improvements in research productivity and production performance. I’m passionate about shared learning, experimentation, and building robust, auditable AI systems.

Experience Level

Expert (10 entries)
Intermediate (2 entries)

Language

English
Fluent

Work Experience

AI/ML Engineer at Meta
February 1, 2024 - Present
Scaled Meta’s GPU infrastructure powering LLMs such as LLaMA and enabled efficient distributed training, boosting AI research productivity and reducing model development cycles. Built high-throughput compute pipelines across 24,000+ H100 GPUs and contributed to global AI hardware standards, reducing training costs by ~20%. Engineered distributed training workflows in PyTorch with tensor and pipeline parallelism to optimize GPU utilization during pretraining, fine-tuning, and large-scale inference. Developed AI-assisted pipelines automating dataset preprocessing, fault recovery, and experiment tracking for models with billions of parameters. Designed and deployed GPU cluster monitoring with AutoGen agents for workload management, anomaly detection, and dynamic scaling. Implemented custom schedulers coordinating thousands of nodes for optimized resource allocation and integrated PyTorch Distributed with AI tools. Partnered with research teams on LLaMA training and inference, applying AutoGen multi-agent orchestration to scale both. Built Python automation scripts for multimodal dataset orchestration and reproducible experiments across Meta’s infrastructure platforms.
AI/ML Engineer at HCL Technologies
November 1, 2022 - October 9, 2025
Assisted in detecting anomalies in critical medical devices using predictive maintenance ML models, improving uptime by ~30% and reducing unplanned downtime. Co-automated fault detection on edge compute and implemented anomaly alerts and health monitoring dashboards that deliver real-time insights. Co-designed supervised ML pipelines (SVM, Random Forest, Neural Networks) on IoT telemetry to predict equipment failures and support preventive maintenance, and co-built unsupervised anomaly detection (PCA, K-means) to identify unusual sensor patterns and enhance early failure detection. Engineered IoT data pipelines with Fourier transforms, noise filtering, and feature engineering to convert raw signals into structured, model-ready data. Deployed models in on-premise edge environments with Flask REST APIs and Docker, co-integrating with HCL’s IoT Works platform for live monitoring. Helped establish retraining and monitoring pipelines to maintain predictive accuracy. Co-prototyped computer vision and NLP solutions in Next.ai Lab using NVIDIA DGX-1 GPUs, TensorFlow, PyTorch, and Scikit-learn, and collaborated with junior engineers on GPU utilization, AI experimentation, and MLOps best practices.
AI/ML Engineer at HCL Technologies
June 1, 2019 - November 1, 2022
Contributed to predictive maintenance initiatives by developing ML models to detect anomalies in critical medical devices, improving uptime and reducing unplanned downtime. Co-automated fault detection on edge devices with real-time dashboards, boosting operational efficiency. Implemented anomaly alerts and health monitoring dashboards to accelerate decision-making and maintenance scheduling. Built supervised ML pipelines (SVM, Random Forest, Neural Networks) on IoT telemetry for predicting equipment failures. Developed unsupervised anomaly detection using PCA and K-means to extend coverage. Engineered IoT data pipelines with preprocessing (Fourier transforms, noise filtering, feature engineering) to prepare high-frequency signals for modeling. Deployed AI models on on-premise edge environments via Flask REST APIs and Docker, integrating with HCL’s IoT Works platform for real-time monitoring. Implemented model retraining and monitoring pipelines to maintain accuracy as equipment beha

Education

Master of Science in Information Systems Technology at Wilmington University
January 11, 2030 - October 9, 2025

Industry Experience

Software & Internet, Computers & Electronics, Media & Entertainment, Professional Services, Healthcare