Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines

Khalid Al-Mutairi; Mariam Al Mazrouei

Open Access



Khalid Al-Mutairi ¹ Mariam Al Mazrouei ¹

⁴ Center for Data Analytics, King Saud University, Riyadh, Saudi Arabia

⁴ College of Engineering and IT, Ajman University, UAE

Abstract

The evolution of big data architectures from static warehouses to dynamic lakehouses has driven demand for storage formats that balance high performance, scalability, and strong data reliability. This paper provides an in-depth comparative study of Delta Lake and Apache Parquet within Python-based data pipelines, exploring their architectural differences, performance benchmarks, and operational implications. Using simulated workloads and a review of peer-reviewed literature, the study evaluates read/write throughput, concurrency, schema evolution, governance, and total cost of ownership. Results demonstrate that Parquet remains highly efficient for stable, append-only analytical workloads, while Delta Lake extends functionality with ACID transactions, schema evolution, and time travel, improving reliability and flexibility for mixed batch-streaming environments. The findings inform practitioners designing modern data lakehouse systems that integrate analytical and operational workloads under unified governance frameworks.
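The contrast drawn in the abstract, between plain immutable files and a table format that adds transactions and time travel, can be illustrated with a toy model. The sketch below is not the Delta Lake API and does not write Parquet: it uses only the Python standard library, stores rows as JSON files in place of Parquet files, and keeps a numbered commit log alongside them. All names (`commit`, `read`, `live_files`, the `_log` directory) are hypothetical and chosen for illustration; the only assumption taken from Delta Lake itself is the general idea of an ordered log of add/remove actions over immutable data files.

```python
import json
import os
import tempfile
import uuid


def live_files(table_dir, version=None):
    """Replay the log up to `version` and return the data files visible there."""
    log_dir = os.path.join(table_dir, "_log")
    live = []
    for name in sorted(os.listdir(log_dir)):
        if version is not None and int(os.path.splitext(name)[0]) > version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if "add" in action:
                    live.append(action["add"])
                else:
                    live.remove(action["remove"])
    return live


def commit(table_dir, rows, overwrite=False):
    """Write rows to a new immutable data file and record it as one commit."""
    log_dir = os.path.join(table_dir, "_log")
    os.makedirs(log_dir, exist_ok=True)
    version = len(os.listdir(log_dir))  # next version number
    actions = []
    if overwrite:
        # Old files are only marked as removed; they stay on disk,
        # which is what makes reading an earlier version possible.
        actions = [{"remove": name} for name in live_files(table_dir)]
    data_name = f"part-{uuid.uuid4().hex}.json"
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(rows, f)
    actions.append({"add": data_name})
    # Mode "x" raises FileExistsError if this version was already committed,
    # a simplified stand-in for an atomic, conflict-detecting commit.
    with open(os.path.join(log_dir, f"{version:05d}.json"), "x") as f:
        json.dump(actions, f)
    return version


def read(table_dir, version=None):
    """Return the rows visible at `version` (latest if None)."""
    rows = []
    for data_name in live_files(table_dir, version):
        with open(os.path.join(table_dir, data_name)) as f:
            rows.extend(json.load(f))
    return rows


table = tempfile.mkdtemp()
v0 = commit(table, [{"id": 1, "city": "Riyadh"}])                  # append
v1 = commit(table, [{"id": 2, "city": "Ajman"}])                   # append
v2 = commit(table, [{"id": 1, "city": "Jeddah"}], overwrite=True)  # replace

latest = read(table)               # only the overwritten contents
as_of_v1 = read(table, version=1)  # "time travel" to the state after v1
```

In this model, a directory of data files with no log corresponds to the append-only case: readers simply list whatever files exist. The log is what lets a reader see a consistent snapshot and return to an earlier version, at the cost of retaining superseded files until they are cleaned up.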

How to Cite

Khalid Al-Mutairi, & Mariam Al Mazrouei. (2025). Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines. Frontiers in Emerging Computer Science and Information Technology, 2(09), 30–33. Retrieved from https://irjernet.com/index.php/fecsit/article/view/237



This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.