Zhilin Xu

Available to hire

Hi there! I’m Zhilin Xu, a senior full-stack AI/ML engineer based in Santa Clara, CA. I design and deploy production-grade AI systems, including real-time anomaly detection, RAG/LLM workflows, and cloud-native observability platforms.

With a strong foundation in deep learning, computer vision, NLP, and embedded inference, I love building robust data pipelines, telemetry, and interactive visualizations to empower engineering teams and drive impactful outcomes across enterprise and edge environments.

Language

English
Fluent

Work Experience

Machine Learning Engineer at AppDynamics
August 1, 2022 - January 1, 2026
Designed an AI-driven observability platform that combines Splunk analytics with AppDynamics telemetry to enable end-to-end incident diagnosis across hybrid and cloud-native applications. Extended RCA capabilities to integrate with Splunk Observability Cloud, correlating metrics, traces, and deployment metadata. Built multi-stage analytics combining statistical anomaly detection, causal inference, and LLM-based reasoning to translate raw signals into actionable incident narratives. Implemented integration between AppDynamics APM telemetry and Splunk Observability Cloud; built production monitoring pipelines using Prometheus, Grafana, and OpenTelemetry. Containerized services with Docker and deployed to AWS; ensured prompt safety and reliability by securely handling prompts. Developed an interactive RCA visualization UI with React/TypeScript/WebSockets to explore causal graphs in real time. Contributed to cross-platform observability architecture aligning traditional APM monitoring with cl
AI Platform Engineer at Black Sesame Technologies Inc
April 1, 2021 - August 1, 2022
Implemented graph optimization and INT8/FP16 quantization modules in Python to improve inference throughput on edge devices. Designed a runtime scheduling layer in C++ for QNX RTOS and AUTOSAR, coordinating NPU-CPU workloads. Extended the Shanghai AI Toolchain with ONNX/TensorFlow/PyTorch importers, graph parsers, layer fusion, and tensor memory reuse for unified model conversion and optimized runtime across CPU, GPU, and NPU under QNX RTOS. Built benchmarking and validation pipelines with CUDA profilers for OEM integration. Developed Python FastAPI microservices that integrate AppDynamics APIs, telemetry streams, and model inference services for real-time diagnostics. Built streaming telemetry ingestion pipelines with Kafka, Redis, and OpenTelemetry for large-scale metrics. Implemented LangChain-based agent orchestration for anomaly detection and dependency analysis, and an LLM-driven explanation generator. Built an interactive RCA visualization UI using React/TS/WebSockets/Plotly/Rec
Software Engineer at Cornami, Inc.
May 1, 2019 - April 1, 2021
Designed and developed an AI framework to drive Cornami's low-power, high-performance AI chips for competitive training and inference. Accelerated ML models by implementing hardware-specific operators capable of processing gigabytes of batched and streamed data. Automated operator generation during compilation and built test suites to validate ONNX and JIT benchmarks. Contributed to CI/CD and hardware-software co-design processes to improve overall AI chip utilization.
Software Engineer Intern at KLA
May 1, 2018 - August 1, 2018
Automated wafer defect matching and generated data statistics and visualizations for SurfScan SPS/SP7, gaining insights into wafer characteristics. Improved wafer defect matching efficiency approximately 15x with a clustering-based defect-pairing algorithm. Led workshops teaching Python to teams.
Software Engineer Intern at IBM
January 1, 2015 - April 1, 2015
Code review tooling project: integrated Orion compare widget into IBM’s Rational Team Concert code review web app to improve client-side review efficiency. Resolved cross-team code-change merge discrepancies using Java/JavaScript. Addressed 100+ code review bugs within two weeks and led cross-team UI/UX discussions to improve backend processing efficiency.
Software Engineer Intern at Zynga
May 1, 2014 - August 1, 2014
Part of an Android game development team; contributed to UI updates and achieved a 4.5 Google Play rating for Word Streak: Words with Friends. Implemented a one-way friend-matching algorithm to improve social features and user engagement.
Software Engineer Intern at InfoMax Technologies Corporation
September 1, 2014 - December 1, 2014
Developed digital pen solutions, building web and Android apps to register handwritten forms using digital pens.
Software Engineer Intern at TELUS
January 1, 2013 - April 1, 2013
Ottawa internship focusing on code development; contributed to software projects and collaborated across teams to improve processing efficiency.
Software Engineer Intern at TELUS
September 1, 2013 - December 1, 2013
Campbell, Ontario internship focusing on software development and hands-on coding experience.
Research Assistant at University of Waterloo
September 1, 2015 - December 1, 2015
Robot communication development: built distributed control for multiple robots using C/C++ and Java; incorporated stream-based processing to enable real-time voice input to robot systems; built cloud applications to facilitate multi-robot access; collaborated with an autonomous robotics team on PID control and movement planning.
Database Administrator Intern at TELUS
September 1, 2013 - December 1, 2013
Database administration internship focusing on data management and optimization tasks.

Education

Master’s degree in Computer Science in Data Science at University of Southern California
August 1, 2017 - May 1, 2019
Bachelor’s degree in Computer Engineering at University of Waterloo
September 1, 2012 - May 1, 2017

Industry Experience

Software & Internet, Media & Entertainment, Professional Services