Open Access
Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines
Center for Data Analytics, King Saud University, Riyadh, Saudi Arabia
College of Engineering and IT, Ajman University, UAE
Abstract
The evolution of big data architectures from static warehouses to dynamic lakehouses has driven demand for storage formats that balance high performance, scalability, and strong data reliability. This paper provides an in-depth comparative study of Delta Lake and Apache Parquet within Python-based data pipelines, exploring their architectural differences, performance benchmarks, and operational implications. Using simulated workloads and a review of peer-reviewed literature, the study evaluates read/write throughput, concurrency, schema evolution, governance, and total cost of ownership. Results demonstrate that Parquet remains highly efficient for stable, append-only analytical workloads, while Delta Lake extends functionality with ACID transactions, schema evolution, and time travel, improving reliability and flexibility for mixed batch-streaming environments. The findings inform practitioners designing modern data lakehouse systems that integrate analytical and operational workloads under unified governance frameworks.
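The architectural difference summarized above can be illustrated with a toy model. Delta Lake stores table data in Parquet files and adds an ordered transaction log that records which files belong to each table version; replaying that log up to a given version yields a historical snapshot (time travel). The sketch below is a simplified, standard-library illustration of that idea and not the Delta Lake implementation: the `_log` directory name, the flat `add`/`remove` action format, and the helper names `commit` and `live_files` are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def commit(table: Path, version: int, actions: List[Dict[str, str]]) -> None:
    """Record one table version as a single log file (toy model)."""
    log_dir = table / "_log"  # hypothetical name for this sketch
    log_dir.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if this version was already written,
    # so two writers cannot both claim the same version number.
    with open(log_dir / f"{version:020d}.json", "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(table: Path, as_of: Optional[int] = None) -> List[str]:
    """Replay the log up to `as_of` and return the data files in that snapshot."""
    files = set()
    for path in sorted((table / "_log").glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return sorted(files)


# Two appends, then a rewrite that replaces the first file.
table = Path(tempfile.mkdtemp())
commit(table, 0, [{"add": "part-0.parquet"}])
commit(table, 1, [{"add": "part-1.parquet"}])
commit(table, 2, [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])

latest = live_files(table)               # current table contents
snapshot_v1 = live_files(table, as_of=1)  # contents as of version 1
```

A directory of plain Parquet files has no such log: readers list whatever files are present, so a partially completed rewrite can be visible and earlier versions cannot be reconstructed. That is the gap the transactional layer is meant to close.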
How to Cite
Khalid Al-Mutairi, & Mariam Al Mazrouei. (2025). Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines. Frontiers in Emerging Computer Science and Information Technology, 2(09), 30–33. Retrieved from https://irjernet.com/index.php/fecsit/article/view/237