Rohan Munshi

Data Scientist, AI Engineer, Developer





Available to hire

I’m a data and AI engineer with a passion for turning complex data into actionable insights. Over the years I have built and scaled end-to-end data pipelines, deployed LLM-powered solutions, and partnered with cross-functional teams to drive enterprise AI initiatives.

Currently, I design and productionize RAG pipelines, document-intelligence systems, and production-grade APIs, while prioritizing data governance and scalable architecture across cloud platforms.

Skills

Experience Level

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Expert

Intermediate

Intermediate

Intermediate

Intermediate

Language

English

Fluent

German

Intermediate

Javanese

Advanced

Bashkir

Advanced

Work Experience

Data and AI Engineer at Talonic

July 1, 2025 - July 1, 2025

Designed and deployed ETL processes and document intelligence pipelines using AWS Glue, S3, and Lambda; established an event-driven architecture to ingest and normalize unstructured documents. Engineered RAG pipelines with vector search and language agents; developed and implemented advanced prompt engineering techniques to improve enterprise knowledge retrieval speed by 2x. Deployed LLM-powered APIs into production using FastAPI and TypeScript with Prisma, integrating with Supabase for database management and ensuring high availability with Docker. Engineered a full-stack data pipeline using PySpark on Databricks for pre-processing and feeding into fine-tuned AI models, directly supporting business-critical enterprise applications. Partnered with stakeholders to improve data architecture and governance, ensuring quality and compliance.

Data and AI Engineer at Talonic

November 1, 2024 - Present

Designed and deployed document intelligence pipelines using AWS Glue, S3, and Lambda to ingest and normalize unstructured documents, integrating OCR and Generative AI for enriched extraction. Built and optimized Spark data pipelines on Databricks, enhancing AI model performance on large-scale data. Engineered Retrieval-Augmented Generation (RAG) pipelines using vector search and language agents to double knowledge retrieval speed while reducing manual overhead. Deployed LLM-powered APIs with FastAPI and Docker, incorporating logging and monitoring for quality assurance. Collaborated with stakeholders to improve data architecture and governance.

Data and AI Engineer at Talonic

November 1, 2024 - Present

Designed end-to-end document intelligence pipelines converting unstructured documents into structured datasets. Fine-tuned large language and vision models for insights extraction supporting R&D workflows. Developed retrieval-augmented generation (RAG) workflows to accelerate knowledge discovery for enterprise clients. Implemented quality assurance for structured output validation ensuring data integrity. Collaborated with non-technical stakeholders to build data-driven automation pipelines enhancing operational efficiency.

Data Scientist (Intern) at Index Exchange

October 31, 2024 - August 27, 2025

Collaborated with cross-functional teams to enhance system efficiency via data-driven insights and machine learning algorithms. Used AWS S3 for data processing, and managed workflows with Spark, Kafka, and Airflow. Articulated research hypotheses supporting automated decision-making. Trained ML models using AWS SageMaker. Implemented MLOps with Docker, Kubernetes, Airflow, and Kafka for real-time ingestion and streamlined deployment. Designed data pipelines and APIs for real-time data flow in ad exchange platforms.

Data Scientist at Index Exchange

October 31, 2024 - July 25, 2025

Conducted exploratory data analysis and statistical modeling to optimize ad exchange operations. Built and validated machine learning models in SageMaker to improve targeting efficiency. Developed real-time inference workflows using Airflow, Kafka, and Docker. Collaborated to deploy scalable ML pipelines with high reliability and traceability. Engineered APIs and Spark-based data flows to streamline model integration and automated decision-making.

Data Scientist (Intern) at Index Exchange

October 1, 2024 - October 1, 2024

Collaborated with cross-functional teams to enhance system efficiency through data-driven insights and ML algorithms. Clearly articulated research hypotheses, decisions, and results, supporting automated decision-making algorithms. Trained and evaluated ML models using AWS SageMaker to optimize performance and accuracy. Developed and maintained data workflows using Apache Airflow for orchestration, scheduling complex ETL tasks. Implemented MLOps with Docker and Kubernetes; integrated Kafka to build a real-time, event-driven architecture for data ingestion, streamlining algorithm deployment. Designed and implemented robust data pipelines and APIs, supporting seamless data flow and real-time processing across platforms.

Data Engineer at Ek Robotics

March 31, 2024 - March 31, 2024

Designed and developed a scalable data warehouse using AWS Redshift, leveraging deep knowledge of its internals for performance tuning, including optimizing distribution/sort keys, analyzing query execution plans, and configuring workload management (WLM). Implemented AWS Lambda and Step Functions to automate business processes, significantly improving operational efficiency. Developed ETL/ELT pipelines with Apache Spark on Databricks, integrating diverse data sources into cloud-based data lakes. Enhanced OLAP DataMart through dimensional modeling (Kimball methodology), optimizing performance and accessibility for business intelligence. Developed conceptual, logical, and physical data models utilizing Amazon DynamoDB and AWS Data Modeler, while exploring AWS CDK for infrastructure as code to improve deployment processes. Implemented streaming data pipelines with Amazon Kinesis, ensuring real-time data ingestion and processing. Actively monitored data pipelines and warehouse performance using AWS Clo

Data Engineer (Work-student) at Ek Robotics

March 31, 2024 - August 27, 2025

Designed a scalable data warehouse using AWS S3, Data Lake, and Redshift to manage large data volumes efficiently. Automated business processes with AWS Lambda and Step Functions resulting in operational improvements. Developed ETL/ELT pipelines with Apache Spark on Databricks integrating diverse data sources into cloud data lakes. Enhanced OLAP DataMart via dimensional modeling. Created conceptual, logical, and physical data models using Amazon DynamoDB and AWS Data Modeler; explored AWS CDK for infrastructure as code. Built dashboards and reports using AWS QuickSight. Implemented real-time streaming pipelines with Amazon Kinesis.

Data Engineer at Ek Robotics

March 31, 2024 - July 25, 2025

Performed data wrangling and quality enhancement using Python libraries, adhering to coding standards. Developed ML pipelines for sentiment analysis using Scikit-learn and TensorFlow. Designed scalable data warehouse architectures with AWS S3, Data Lake, and Redshift. Automated business processes with AWS Lambda and Step Functions. Integrated Apache Spark with AWS Glue to optimize data pipelines and reduce latency. Modeled relational and NoSQL databases and explored infrastructure as code practices for deployment.

Master’s Thesis Student at Pragmatic industries GmbH

March 1, 2024 - November 21, 2025

Architected a two-stage, hybrid model for document intelligence, first leveraging a YOLOv8-based detector to identify key layout components (e.g., tables, figures), then applying a fine-tuned LayoutLMv3 model to these specific regions for high-precision, region-specific data extraction. Implemented a layout-aware OCR pipeline using PaddleOCR, leveraging the hybrid model's output to correct segmentation errors and generate clean markdown data. Orchestrated a zero-shot extraction pipeline by fine-tuning Llama3-8B with PEFT/LoRA. Aligned the model for high-fidelity schema generation using Direct Preference Optimization (DPO), leveraging a curated preference dataset of correct vs. incorrect schema outputs to directly steer the model’s policy.

Master’s Thesis Student at Pragmatic industries GmbH

March 1, 2024 - Present

Extracted key information from over 1,000 technical PDFs using multimodal ML models like CLIP for combined visual and textual embeddings. Developed rule-based models to structure initial data extraction and annotated datasets to improve entity recognition training. Built and fine-tuned Named Entity Recognition models using BERT for precise classification of technical details. Leveraged LLM tooling, including Ollama and LlamaIndex, to improve extraction, interpretation, and comprehension of unstructured document data.

Data Engineer at Tata Consultancy Services

March 31, 2022 - August 27, 2025

Delivered customized ETL solutions using Informatica PowerCenter handling millions of data records efficiently. Designed and implemented complex PL/SQL database processes integrating NoSQL with batch and ETL functionality. Managed large datasets with HDFS and Hive ensuring efficient distributed querying and storage. Translated business requirements into scalable data architectures using DBT and enhanced backend integrations using TypeScript. Connected diverse data sources through RESTful APIs. Led statistical data analysis and information flow automation using Python scripting. Conducted data profiling and quality assurance using Informatica Data Quality, SQL, and Python on Teradata, MS SQL Server, and MongoDB.

Data Engineer at Tata Consultancy Services

March 31, 2022 - July 25, 2025

Developed customized ETL solutions using Informatica handling millions of records. Designed complex PL/SQL processes integrating NoSQL with batch and ETL workflows. Managed large datasets with HDFS and Hive for distributed querying. Led scalable data architecture projects using DBT and enhanced backend processes with TypeScript. Improved data pipeline integration with RESTful APIs. Performed data profiling on multiple database platforms ensuring data accuracy.

Data Engineer at Tata Consultancy Services

March 1, 2022 - March 1, 2022

Developed customized ETL solutions using Informatica PowerCenter, handling millions of data records efficiently. Designed and implemented complex PL/SQL database processes integrating a NoSQL database with batch and ETL functionality. Managed large datasets using HDFS and leveraged Hive for distributed querying, ensuring efficient data processing and storage. Built and maintained modular, reusable data transformation models using DBT; integrated data versioning with Git and implemented CI/CD practices to automate testing and deployment of DBT workflows. Enhanced data pipeline integration by utilizing RESTful APIs to connect various data sources and services. Coordinated statistical data analysis, design, and information flow for efficient data processing using Python scripting. Performed data profiling on Teradata, MS SQL Server, and MongoDB environments using Informatica Data Quality, SQL queries, and Python to ensure data accuracy and integrity across platforms.