Department of Computer Science, ETH Zurich, Switzerland
Center for Artificial Intelligence, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Abstract
The increasing capabilities of generative artificial intelligence (AI), particularly large language models (LLMs) like those developed by OpenAI, raise significant questions about their potential impact on education and assessment. This study investigates the performance of three distinct ChatGPT models—ChatGPT 3.5, ChatGPT 4, and the recently introduced ChatGPT 4o-mini—on a standardized statistics examination. Utilizing a comprehensive set of exam questions, we compare the accuracy, reasoning ability, and response characteristics of each model. The findings provide insights into the evolving proficiency of LLMs in quantitative domains and their implications for educational practices, assessment design, and the future of AI as a learning aid or potential tool for academic dishonesty. While all models demonstrated some level of statistical reasoning, significant differences in performance were observed, highlighting the rapid advancements in newer iterations. The study underscores the need for educators and institutions to understand the current capabilities and limitations of these tools to adapt pedagogical strategies and assessment methods effectively.
How to Cite
Prof. Lukas J. Hoffmann, & Dr. Tobias Schmid. (2024). Assessing Large Language Model Proficiency: A Comparative Study on a Statistics Examination. Frontiers in Emerging Artificial Intelligence and Machine Learning, 1(1), 1–7. Retrieved from https://irjernet.com/index.php/feaiml/article/view/24