Machine Learning and Large Language Model Approaches for Software Code Understanding, Prediction, and Architectural Decision Support

Theresa Korvic

Open Access

Department of Computer Science, University of Ljubljana, Slovenia

Abstract

The rapid growth of software systems and the increasing complexity of modern applications have created significant challenges for software development, maintenance, and architectural decision-making. Traditional software engineering techniques often struggle to scale with the volume of code produced in contemporary development ecosystems. In response, machine learning (ML) and, more recently, large language models (LLMs) have emerged as powerful tools for analyzing source code, predicting software defects, improving maintainability, and assisting developers with architectural design decisions. This study presents a theoretical and analytical investigation of ML and LLM-based approaches to software code understanding, defect prediction, automated code representation, and architectural knowledge management.

Drawing strictly from existing scholarly literature, this research synthesizes findings related to code representation learning, software defect prediction, program synthesis, architectural decision support, and automated software analysis. Prior studies demonstrate that machine learning methods can successfully extract semantic and syntactic patterns from large-scale code repositories, enabling tasks such as bug detection, code summarization, and maintainability prediction. Techniques such as tree-based ensembles, graph neural representations, and path-based embeddings have shown promising results in modeling complex program structures. Concurrently, large language models trained on extensive software corpora have demonstrated the ability to generate architectural components, assist in design decision-making, and provide real-time programming support.
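As a rough illustration of the path-based idea mentioned above, the following sketch extracts (token, path, token) triples from a Python abstract syntax tree, in the spirit of path-based approaches such as code2vec. It is a simplified sketch under stated assumptions, not the method of any cited work: the function names `leaf_paths` and `path_contexts`, the choice of leaf node types, and the `max_length` cutoff are all hypothetical choices made for this example.

```python
import ast
from itertools import combinations


def leaf_paths(tree):
    """Collect (token, root-to-leaf list of AST nodes) for each terminal.

    Assumption for this sketch: only names, function arguments, and
    constants are treated as terminals.
    """
    leaves = []

    def walk(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, trail)

    walk(tree, [])
    return leaves


def path_contexts(source, max_length=8):
    """Return (token_a, path, token_b) triples for every pair of terminals.

    The path lists node type names going up from token_a to the lowest
    common ancestor and then down to token_b. Pairs whose path has more
    than max_length nodes are skipped (hypothetical cutoff).
    """
    leaves = leaf_paths(ast.parse(source))
    contexts = []
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Count the nodes shared by both root-to-leaf trails; the last
        # shared node is the lowest common ancestor.
        shared = 0
        for x, y in zip(trail_a, trail_b):
            if x is not y:
                break
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = [type(trail_a[shared - 1]).__name__]
        down = [type(n).__name__ for n in trail_b[shared:]]
        path = tuple(up + top + down)
        if len(path) <= max_length:
            contexts.append((tok_a, path, tok_b))
    return contexts


if __name__ == "__main__":
    for context in path_contexts("x = a + b"):
        print(context)
```

In a learned model, each triple would typically be mapped to embedding vectors and aggregated into a single representation of the code snippet; that step is omitted here.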

The methodology is an integrative conceptual analysis of the referenced works, used to develop a framework explaining how ML and LLM technologies interact with software engineering workflows. The reviewed literature indicates that machine learning models can improve defect prediction accuracy, automate program comprehension tasks, and help developers navigate complex codebases. LLM-driven assistants extend these capabilities by enabling interactive architectural reasoning and generative code synthesis.

The discussion elaborates on the implications of these technologies for large-scale software development, including benefits for productivity, maintainability, and architectural knowledge management. It also examines limitations such as limited model interpretability, data bias, and overreliance on automated systems. The study concludes that the integration of machine learning and LLM-based approaches represents a transformative paradigm in software engineering, with significant potential for future research and practical application.

How to Cite

Theresa Korvic. (2024). Machine Learning and Large Language Model Approaches for Software Code Understanding, Prediction, and Architectural Decision Support. Frontiers in Emerging Multidisciplinary Sciences, 1(1), 25–34. Retrieved from https://irjernet.com/index.php/fems/article/view/318

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.