Frontiers in Emerging Computer Science and Information Technology

Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines

Authors

  • Khalid Al-Mutairi, Center for Data Analytics, King Saud University, Riyadh, Saudi Arabia
  • Mariam Al Mazrouei, College of Engineering and IT, Ajman University, UAE

Keywords:

Delta Lake, Apache Parquet, Python pipelines, PySpark, ACID transactions

Abstract

The evolution of big data architectures from static warehouses to dynamic lakehouses has driven demand for storage formats that balance performance, scalability, and data reliability. This paper presents a comparative study of Delta Lake and Apache Parquet in Python-based data pipelines, examining their architectural differences, performance benchmarks, and operational implications. Using simulated workloads and a review of peer-reviewed literature, the study evaluates read/write throughput, concurrency, schema evolution, governance, and total cost of ownership. The results indicate that Parquet remains highly efficient for stable, append-only analytical workloads, while Delta Lake, which layers a transaction log over Parquet data files, adds ACID transactions, schema evolution, and time travel, improving reliability and flexibility in mixed batch and streaming environments. The findings are intended to inform practitioners designing lakehouse systems that integrate analytical and operational workloads under a unified governance framework.

References

Armbrust, M., Das, T., Sun, L., et al. (2020). Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proceedings of the VLDB Endowment, 13(12), 3411–3424. https://doi.org/10.14778/3415478.3415560

Apache Parquet. (2023). Apache Parquet File Format. Retrieved from https://parquet.apache.org

Camacho-Rodríguez, J., et al. (2023). LST-Bench: Benchmarking Log-Structured Tables in the Cloud. arXiv:2305.01120.

Geetla, D. (2025). Optimizing ETL Workflows Using Databricks and Delta Lake. HEXstream Tech Corner. https://www.hexstream.com/tech-corner

Le Dem, J. (2013). Parquet: Columnar Storage for the People. Strata + Hadoop World.

Library of Congress. (2023). Apache Parquet File Format (FDD000575). https://www.loc.gov/preservation/digital/formats

Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., & Vassilakis, T. (2010). Dremel: Interactive Analysis of Web-Scale Datasets. Proceedings of the VLDB Endowment, 3(1), 330–339.

Microsoft Fabric Documentation. (2025). Delta Table Optimization and V-Order. https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization

Powers, M. (n.d.). Delta Lake vs Parquet: Comparison. Delta.io Blog. https://delta.io/blog/delta-lake-vs-parquet-comparison

Evaluating Effectiveness of Delta Lake Over Parquet in Python Pipeline. (2025). International Journal of Data Science and Machine Learning, 5(02), 126–144. https://doi.org/10.55640/ijdsml-05-02-12

Saeedan, M., & Eldawy, A. (2022). Spatial Parquet: A Column File Format for Geospatial Data Lakes. arXiv:2209.02158.

Published

2025-09-18

How to Cite

Khalid Al-Mutairi, & Mariam Al Mazrouei. (2025). Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines. Frontiers in Emerging Computer Science and Information Technology, 2(09), 30–33. Retrieved from https://irjernet.com/index.php/fecsit/article/view/237