---

RAVI MAKKENA

---


Work Experience

Data Engineer at Change Healthcare, TX
March 1, 2022 - Present
In my role at Change Healthcare since March 2022, I design and develop a common data architecture for storing retail data across the enterprise and building a data lake on Azure. I develop Spark and PySpark ETL applications in Azure Databricks for data extraction, transformation, and aggregation, handling diverse sources such as FTP and CSV files, ingesting Azure Event Hub streams into Delta tables in reload, append, and merge modes, and storing results in Azure Data Lake Storage Gen2 and Snowflake. I implemented XML parsing in Spark with Scala and loaded the parsed data into Azure Blob Storage. I optimize data pipelines in Databricks and Azure Data Factory to migrate campaign and customer data into ORC, Parquet, and text formats, reducing organizational costs by 30%, and built Data Factory pipelines to copy Parquet files from ADLS Gen2 into Azure Synapse Analytics. I also replaced Hive scripts with Spark DataFrame transformations for faster data analysis and manage streaming ingestion via Kafka and Spark Streaming, using Airflow for task orchestration.
Data Engineer at Vaayuja Info Solutions, India
December 1, 2018 - December 1, 2021
At Vaayuja Info Solutions from December 2018 to December 2021, I worked extensively with the big data and Hadoop ecosystem, converting Hive/SQL queries into Spark transformations using Scala. I created Spark DataFrames and Spark SQL queries and stored analytics-ready data in AWS S3. I loaded data from Kafka into HBase via a REST API, developed batch and streaming transformation scripts in Scala and Spark for near real-time ingestion, enriched clickstream data with customer profile information, and automated AWS EMR cluster provisioning and termination. I also tuned Spark and Hive scripts for performance, used Sqoop to move data between RDBMS and S3, worked with AWS Kinesis, EMR, and Glue for real-time processing, and contributed to data partitioning strategies and pipeline optimization for loading data into AWS Redshift. I supported testing teams and developed data verification and control processes.

Education

Masters in Information Systems at University of Memphis
Bachelors in Computer Science at Acharya Nagarjuna University

Industry Experience

Computers & Electronics, Financial Services, Healthcare, Software & Internet, Professional Services, Media & Entertainment