Frontiers in Emerging Artificial Intelligence and Machine Learning


Assessing Large Language Model Proficiency: A Comparative Study on a Statistics Examination

Authors

  • Prof. Lukas J. Hoffmann, Department of Computer Science, ETH Zurich, Switzerland
  • Dr. Tobias Schmid, Center for Artificial Intelligence, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Keywords:

Generative AI, Large Language Models, ChatGPT, Statistics Education

Abstract

The increasing capabilities of generative artificial intelligence (AI), particularly large language models (LLMs) like those developed by OpenAI, raise significant questions about their potential impact on education and assessment. This study investigates the performance of three distinct ChatGPT models—ChatGPT 3.5, ChatGPT 4, and the recently introduced ChatGPT 4o-mini—on a standardized statistics examination. Utilizing a comprehensive set of exam questions, we compare the accuracy, reasoning ability, and response characteristics of each model. The findings provide insights into the evolving proficiency of LLMs in quantitative domains and their implications for educational practices, assessment design, and the future of AI as a learning aid or potential tool for academic dishonesty. While all models demonstrated some level of statistical reasoning, significant differences in performance were observed, highlighting the rapid advancements in newer iterations. The study underscores the need for educators and institutions to understand the current capabilities and limitations of these tools to adapt pedagogical strategies and assessment methods effectively.
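The comparison the abstract describes — grading each model's exam answers and aggregating a per-model accuracy — can be sketched as follows. This is a minimal illustration only; the model labels and graded responses below are placeholders, not the study's actual data or methodology.

```python
# Hypothetical sketch: aggregate per-model accuracy from graded exam responses.
from collections import defaultdict

def accuracy_by_model(graded):
    """graded: iterable of (model_name, is_correct) pairs.
    Returns {model_name: fraction of correct answers}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for model, ok in graded:
        totals[model] += 1
        correct[model] += int(ok)
    return {m: correct[m] / totals[m] for m in totals}

# Placeholder grades, not results from the paper.
sample = [
    ("ChatGPT 3.5", True), ("ChatGPT 3.5", False),
    ("ChatGPT 4", True), ("ChatGPT 4", True),
]
print(accuracy_by_model(sample))  # → {'ChatGPT 3.5': 0.5, 'ChatGPT 4': 1.0}
```

A real evaluation would also need rubric-based scoring for open-ended reasoning questions, which a simple right/wrong tally cannot capture.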

Published

2024-12-08

How to Cite

Prof. Lukas J. Hoffmann, & Dr. Tobias Schmid. (2024). Assessing Large Language Model Proficiency: A Comparative Study on a Statistics Examination. Frontiers in Emerging Artificial Intelligence and Machine Learning, 1(1), 1–7. Retrieved from https://irjernet.com/index.php/feaiml/article/view/24