Optimizing Modern Data Lakehouse Architectures: A Comprehensive Comparative Study of Delta Lake and Apache Parquet in Python Data Pipelines
Keywords:
Delta Lake, Apache Parquet, Python pipelines, PySpark, ACID transactions
Abstract
The evolution of big data architectures from static warehouses to dynamic lakehouses has driven demand for storage formats that balance high performance, scalability, and strong data reliability. This paper provides an in-depth comparative study of Delta Lake and Apache Parquet within Python-based data pipelines, exploring their architectural differences, performance benchmarks, and operational implications. Using simulated workloads and a review of peer-reviewed literature, the study evaluates read/write throughput, concurrency, schema evolution, governance, and total cost of ownership. Results demonstrate that Parquet remains highly efficient for stable, append-only analytical workloads, while Delta Lake extends functionality with ACID transactions, schema evolution, and time travel, improving reliability and flexibility for mixed batch-streaming environments. The findings inform practitioners designing modern data lakehouse systems that integrate analytical and operational workloads under unified governance frameworks.
References
Armbrust, M., Das, T., Li, Y., et al. (2020). Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. Proceedings of the VLDB Endowment, 13(12), 3411–3424. https://doi.org/10.14778/3415478.3415560
Apache Parquet. (2023). Apache Parquet File Format. Retrieved from https://parquet.apache.org
Camacho-Rodríguez, J., Nitu, F., Kemper, A., & Neumann, T. (2023). LST-Bench: Benchmarking Log-Structured Tables in the Cloud. arXiv:2305.01120.
Geetla, D. (2025). Optimizing ETL Workflows Using Databricks and Delta Lake. HEXstream Tech Corner. https://www.hexstream.com/tech-corner
Le Dem, J. (2013). Parquet: Columnar Storage for the People. Strata + Hadoop World.
Library of Congress. (2023). Apache Parquet File Format (FDD000575). https://www.loc.gov/preservation/digital/formats
Melnik, S., Gubarev, A., Long, J., Romer, G., Shivakumar, S., Tolton, M., & Vassilakis, T. (2010). Dremel: Interactive Analysis of Web-Scale Datasets. Proceedings of the VLDB Endowment, 3(1), 330–339.
Microsoft Fabric Documentation. (2025). Delta Table Optimization and V-Order. https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization
Powers, M. (n.d.). Delta Lake vs Parquet: Comparison. Delta.io Blog. https://delta.io/blog/delta-lake-vs-parquet-comparison
Evaluating Effectiveness of Delta Lake Over Parquet in Python Pipeline. (2025). International Journal of Data Science and Machine Learning, 5(02), 126–144. https://doi.org/10.55640/ijdsml-05-02-12
Saeedan, M., & Eldawy, A. (2022). Spatial Parquet: A Column File Format for Geospatial Data Lakes. arXiv:2209.02158.
License
Copyright (c) 2025 Khalid Al-Mutairi, Mariam Al Mazrouei

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.