Sowmya Cherukuri

Available to hire

I’m a data engineering professional with 8 years of experience in Azure Cloud, Big Data, and data warehousing. I design and optimize end-to-end ETL/ELT pipelines using Azure Data Factory, Databricks, Snowflake, and a wide range of data technologies to enable reliable, scalable analytics. I enjoy collaborating with cross-functional teams to solve complex data problems, improve data governance, and drive performance through thoughtful architecture and automation.

Outside of work, I love exploring new data tech, sharing learnings through documentation and automation, and contributing to open data projects whenever possible. I bring a pragmatic, user-focused mindset to data platforms, ensuring trusted analytics while keeping security and compliance top of mind.


Experience Level

Expert

Language

English
Fluent

Work Experience

Senior Data Engineer at Goldman Sachs
June 1, 2024 - November 7, 2025
Led development of end-to-end ETL/ELT pipelines using Azure Data Factory, Azure Databricks, and Snowflake over large-scale datasets, integrating on-premises and cloud sources (MySQL, Cassandra, Blob Storage, and Azure SQL DB). Designed and optimized SQL with indexes, views, stored procedures, functions, and packages to boost query performance. Applied data warehousing techniques in Snowflake, including cleansing, Slowly Changing Dimensions (SCD), surrogate keys, and change data capture. Built and tuned ETL transformations with Spark SQL and Spark DataFrames in Databricks, orchestrated via ADF pipelines. Developed Python and SnowSQL scripts for data movement and processing. Leveraged Azure Functions, Logic Apps, and Service Bus for automation, data integration, and pipeline monitoring. Built real-time streaming pipelines using Apache Kafka, Spark Streaming, and Hive to ingest, transform, and analyze streaming data. Maintained Hadoop and RDBMS data integration programs with Hive on Spark and Spark SQL.
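
As a flavor of the change-data-capture staging described above, here is a minimal PySpark sketch; the table layout and column names (customer_id, updated_at) are invented for illustration and are not taken from the actual project.

    # Illustrative sketch only; names and sample values are hypothetical.
    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("cdc_latest_record_sketch").getOrCreate()

    # Stand-in for change records landed by ADF into a Databricks staging table.
    changes = spark.createDataFrame(
        [(1, "alice@old.com", "2024-06-01"),
         (1, "alice@new.com", "2024-06-15"),
         (2, "bob@mail.com",  "2024-06-10")],
        ["customer_id", "email", "updated_at"],
    )

    # Keep only the most recent change per business key before merging downstream.
    latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    current = (changes
               .withColumn("rn", F.row_number().over(latest))
               .filter("rn = 1")
               .drop("rn"))
    current.show()

In practice the deduplicated output would feed a merge into the Snowflake dimension rather than a console display.
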
Senior Data Engineer at Mayo Clinic
November 1, 2022 - May 1, 2024
Handled large-scale data engineering initiatives leveraging Hadoop Cloudera and Microsoft Azure. Ingested data from diverse sources into Azure Data Lake, Azure Storage, Azure SQL, and Azure Synapse Analytics; processed with Databricks. Migrated on-premises Oracle ETL processes and SQL workloads to Azure using Data Factory for batch, near-real-time, and event-driven data movement. Built scalable batch and real-time data pipelines using Apache Kafka, Spark Streaming, Hive, and Databricks for ingestion, transformation, and analytics. Processed structured, semi-structured, and unstructured healthcare data (HL7, FHIR) with Scala, PySpark, Spark SQL, and Hive, implementing UDFs, partitioning, and bucketing. Designed lakehouse architectures on Azure Data Lake Storage Gen2 with hierarchical namespace for HIPAA-compliant storage. Developed DBT models encoding business logic, data masking, dimensional modeling, and slowly changing dimensions. Orchestrated workflows with Apache Airflow, automating ETL pipelines, data quality checks, and failure recovery. Optimized ADF pipelines for performance.
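
To illustrate the Airflow-based data quality and failure-recovery orchestration mentioned above, here is a minimal sketch; the DAG id, task name, and check logic are hypothetical, and it assumes Airflow 2.4+ (older versions use schedule_interval instead of schedule).

    # Illustrative sketch only; DAG id, task names, and the check are placeholders.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_quality_check(**_):
        # Placeholder for a row-count or null-rate check against the curated layer;
        # raising here fails the task and triggers the retry/alerting path.
        row_count = 42  # stand-in value
        if row_count == 0:
            raise ValueError("Curated table is empty")

    with DAG(
        dag_id="curated_layer_quality_check",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    ) as dag:
        quality_check = PythonOperator(
            task_id="quality_check",
            python_callable=run_quality_check,
        )
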
Big Data Developer at Lowe's Companies, Inc.
January 1, 2021 - October 1, 2022
Imported data from MySQL to HDFS using Sqoop; migrated data from Oracle to Hadoop. Performed ETL and aggregations using Apache Spark (Scala, PySpark, Spark SQL); loaded results into Hive. Designed pipelines using Flume, Sqoop, Kafka, Spark, and Hive for ingesting and analyzing customer and streaming data. Implemented CI/CD pipelines for deployments in Hadoop environments; used Zookeeper for coordination and Oozie for workflow scheduling. Worked with AWS EC2, Python, Shell scripting, and Ambari for cluster management. Developed UNIX scripts and YAML automation for job scheduling and builds. Architected an AWS data lake using S3, AWS Glue, and Redshift; designed a GCP data lake using Cloud Storage (GCS) and BigQuery; developed pipelines using GCP Dataflow (Apache Beam) for batch and streaming, and used GCP Dataproc with BigQuery for scalable analytics. Used Infrastructure as Code (IaC) with Terraform for AWS and GCP. Focused on cloud cost optimization through resource rightsizing and reserved instances.
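
The Dataflow work above can be sketched as a small Apache Beam pipeline in Python; the bucket paths and the sales record layout are made up for illustration, and the pipeline runs locally on the DirectRunner unless Dataflow runner options are supplied.

    # Illustrative sketch only; bucket paths and field positions are hypothetical.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_sale(line):
        # Expecting CSV lines of the form "store_id,amount,...".
        store_id, amount = line.split(",")[:2]
        return store_id, float(amount)

    # On GCP the same pipeline would be submitted with --runner=DataflowRunner
    # plus project, region, and temp_location options.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | "Read"   >> beam.io.ReadFromText("gs://example-bucket/sales/*.csv")
         | "Parse"  >> beam.Map(parse_sale)
         | "Sum"    >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda store, total: f"{store},{total}")
         | "Write"  >> beam.io.WriteToText("gs://example-bucket/output/daily_totals"))
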
Hadoop Developer at KeyBank
February 1, 2019 - December 1, 2020
Maintained source code in Git and GitHub; prepared an ETL framework using Sqoop, Pig, and Hive for data ingestion. Processed HDFS data, created external Hive tables, and built reusable ingestion and repair scripts. Developed ETL jobs using Spark-Scala (RDDs, DataFrames, Spark SQL) for migration and reporting; built Spark Streaming applications for real-time analytics. Analyzed source data and modified data types for efficient handling. Designed PySpark solutions for SQL script analysis. Extracted data using Sqoop; performed transformations using Hive and MapReduce. Implemented automation for deployments using YAML scripts. Worked with Hive, Pig, HBase, Spark, Zookeeper, Flume, Kafka, and Sqoop; deployed on AWS (S3, DMS, RDS) with Git-based workflows. Implemented data classification algorithms using MapReduce, optimized with partitioning, combiners, and distributed cache. Created technical documentation for ETL processes, data flows, and system architecture.
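
A minimal Structured Streaming sketch of the Kafka-to-Spark real-time analytics described above follows; the broker address and topic name are placeholders, and the job assumes the Spark-Kafka connector package is available on the cluster.

    # Illustrative sketch only; broker, topic, and windowing choices are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("txn_stream_sketch").getOrCreate()

    # Read raw events from Kafka; the value column arrives as bytes and is cast here.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "transactions")
              .load()
              .selectExpr("CAST(value AS STRING) AS payload", "timestamp"))

    # Count events per one-minute window as a simple real-time aggregate.
    counts = events.groupBy(F.window("timestamp", "1 minute")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()
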
Data Warehouse Developer at Atlantis Architects
September 1, 2017 - November 1, 2018
Worked as a SQL Server Analyst/Developer/DBA on SQL Server 2012-2016. Managed and updated Erwin models (logical and physical) for CDS, ADM, and Reference DB. Tracked source control and deployments using TFS across environments. Wrote triggers, stored procedures, and functions, and maintained structures using T-SQL. Managed files and filegroups, table/index associations, and query and performance tuning. Implemented ETL processes using SSIS packages for data extraction, transformation, and loading. Optimized SQL performance with indexing, reducing report generation time by 25%. Built data validation and cleansing routines in SSIS to ensure data quality. Implemented slowly changing dimensions (SCD Type 1 and 2) in the data warehouse. Created aggregate and summary tables to enhance reporting performance.
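
For the SCD Type 2 handling mentioned above, here is a simplified two-step pattern expressed as T-SQL driven from Python; the dimension and staging table names, tracked column, and connection string are hypothetical, and a production version would typically live in an SSIS package or stored procedure instead.

    # Illustrative sketch only; table names, columns, and DSN are placeholders.
    import pyodbc

    SCD2_CLOSE_AND_INSERT = """
    -- Step 1: close out current rows whose tracked attribute changed.
    UPDATE d
    SET d.EndDate = GETDATE(), d.IsCurrent = 0
    FROM dbo.DimCustomer AS d
    JOIN staging.Customer AS s
      ON s.CustomerID = d.CustomerID
    WHERE d.IsCurrent = 1 AND s.Email <> d.Email;

    -- Step 2: insert a new current version for changed or brand-new customers.
    INSERT INTO dbo.DimCustomer (CustomerID, Email, StartDate, EndDate, IsCurrent)
    SELECT s.CustomerID, s.Email, GETDATE(), NULL, 1
    FROM staging.Customer AS s
    LEFT JOIN dbo.DimCustomer AS d
      ON d.CustomerID = s.CustomerID AND d.IsCurrent = 1
    WHERE d.CustomerID IS NULL;
    """

    conn = pyodbc.connect("DSN=warehouse;Trusted_Connection=yes")
    cursor = conn.cursor()
    cursor.execute(SCD2_CLOSE_AND_INSERT)
    conn.commit()
    conn.close()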

Education

Bachelor of Pharmacy (B.Pharm) at Malla Reddy College of Pharmacy, Hyderabad, India
January 11, 2030 - January 1, 2017

Industry Experience

Software & Internet, Healthcare, Professional Services, Financial Services, Computers & Electronics