Open Access

Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines

Center for Data Analytics, King Saud University, Riyadh, Saudi Arabia
College of Engineering and IT, Ajman University, UAE

Abstract

The evolution of big data architectures from static warehouses to dynamic lakehouses has driven demand for storage formats that balance high performance, scalability, and strong data reliability. This paper provides an in-depth comparative study of Delta Lake and Apache Parquet within Python-based data pipelines, exploring their architectural differences, performance benchmarks, and operational implications. Using simulated workloads and a review of peer-reviewed literature, the study evaluates read/write throughput, concurrency, schema evolution, governance, and total cost of ownership. Results demonstrate that Parquet remains highly efficient for stable, append-only analytical workloads, while Delta Lake extends functionality with ACID transactions, schema evolution, and time travel, improving reliability and flexibility for mixed batch-streaming environments. The findings inform practitioners designing modern data lakehouse systems that integrate analytical and operational workloads under unified governance frameworks.

How to Cite

Khalid Al-Mutairi, & Mariam Al Mazrouei. (2025). Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines. Frontiers in Emerging Computer Science and Information Technology, 2(09), 30–33. Retrieved from https://irjernet.com/index.php/fecsit/article/view/237
