A Student Dropout Risk Prediction Model Based on Supervised Learning Techniques and Large Language Models (LLMs)
Abstract:
Early prediction of student dropout risk is an essential but challenging task in Vietnamese higher education. This study proposes a novel model that combines supervised machine learning with large language models (LLMs) to predict student dropout risk. The model uses both structured and unstructured data to analyze the influencing factors comprehensively. By converting student data into natural language and processing it with pre-trained LLMs, the model can capture the context and the complex relationships between factors, thereby improving prediction accuracy compared with traditional methods. The study's main contributions are to propose an architecture that integrates LLMs into the dropout risk classification problem, to identify the critical factors influencing the decision to drop out, and to discuss the model's potential practical application in supporting early intervention.
Keywords:
Large Language Model (LLM), prediction, dropout risk, machine learning, supervised learning
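The abstract's step of converting student data into natural language could look roughly like the sketch below. It is only an illustration under stated assumptions: the field names (`gpa`, `credits_failed`, `works_part_time`, `advisor_note`), the sentence templates, and the function `record_to_text` are hypothetical and are not taken from the study, which does not list its features here.

```python
from typing import Mapping

# Hypothetical field names and phrasings; the study's real feature set is not given.
TEMPLATES = {
    "gpa": "The student's current GPA is {value}.",
    "credits_failed": "They have failed {value} credits so far.",
    "works_part_time": "They {value} a part-time job.",
}


def record_to_text(record: Mapping[str, object]) -> str:
    """Render one structured student record as a short natural-language description."""
    sentences = []
    for field, template in TEMPLATES.items():
        if field not in record or record[field] is None:
            continue  # skip missing values rather than inventing them
        value = record[field]
        if isinstance(value, bool):
            # Turn boolean flags into a verb phrase that fits the template.
            value = "have" if value else "do not have"
        sentences.append(template.format(value=value))
    # Unstructured free text (assumed field) is appended unchanged.
    note = record.get("advisor_note")
    if note:
        sentences.append(f"Advisor note: {note}")
    return " ".join(sentences)


example = {
    "gpa": 2.1,
    "credits_failed": 6,
    "works_part_time": True,
    "advisor_note": "Missed several classes this term.",
}
text = record_to_text(example)
print(text)
```

The resulting string is the kind of input that a pre-trained LLM could encode or classify. That downstream step is left out here because the abstract does not say which model or fine-tuning setup is used.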