Predicting Academic Failure: Estimating Semester Grade Point Average with Data Mining Methods

This study proposes a new model that analyzes students' grade point averages (GPAs) from the previous semester using data mining algorithms and predicts the final GPAs that students may receive in the following semester in three gradually expanding categories (department, faculty, and university). The performances of the Random Forest, Linear Regression, Support Vector Machines, and k-Nearest Neighbors algorithms were calculated and compared to estimate the students' GPAs at the end of the semester. This study focused on three parameters. The first was to predict academic performance with a single independent variable. The second was to compare the performance indicators of the four algorithms. The third was to compare the predictions made in the three categories. All of the applied algorithms correctly classified the samples at rates varying between 92% and 94%. The proposed model correctly estimated students' grade point averages at the end of the semester with an average deviation of 0.28 points on a 4-point scale using a single variable. Students with a high risk of failure can thus be identified in advance by estimating their final grade point averages.


Introduction
Data Mining (DM) is the process of extracting new orientations and new patterns from big data using different classification algorithms (Baker & Yacef, 2009). In other words, it is the discovery of useful information from big data. On the other hand, educational data mining (EDM) "develops and adapts statistical, machine-learning and data-mining methods to study educational data generated basically by students and instructors" (Calvet Liñán & Juan Pérez, 2015, p.100). EDM develops new methods to reveal meaningful sections, original structures, and hidden patterns in the data obtained from educational environments. The main purpose of EDM is to extract information from educational data to support decision-making for educational issues (Calvet Liñán & Juan Pérez, 2015). It involves the collection and interpretation of data produced by learners in order to assess academic performance, predict future performance, and identify current issues, etc. In short, EDM includes the automatic extraction of previously unknown, hidden, meaningful, and beneficial patterns from huge amounts of data.
Various DM algorithms have been successfully applied to identify students at risk of failure in terms of academic performance (e.g., Hu, Lo, & Shih, 2014). Both students and teachers benefit from the knowledge discovered through the use of these algorithms, which serve to improve learning/teaching processes (Akçapınar, Altun, & Aşkar, 2019). In today's education systems, a vast amount of data, including demographic data and students' academic grades, is stored in electronic environments. This data can be obtained from various learning management systems (LMS) and student information systems (SIS).
The rapid increase in the amount of educational data available can contribute to improving students' learning outcomes (e.g., Shorfuzzaman et al., 2019; Viberg et al., 2018).
EDM provides statistical information to identify the relationship between relevant stakeholders and optimize the learning environment by discovering hidden patterns in the educational environment (Fernandes et al., 2019). For instance, some studies on EDM have compared e-learning systems (e.g., Lara et al., 2014), some have classified educational data (e.g., Chakraborty et al., 2016), and others have tried to estimate student performance (e.g., Fernandes et al., 2019). Thus, corrective strategies and pedagogical methods can then be developed by identifying both successful and at-risk students (Casquero et al., 2016; Fidalgo-Blanco et al., 2015).
Ahmad and Shahzadi (2018) developed a model with machine learning methods to identify students who may potentially fail academically. They determined students' learning skills, study habits, and academic interaction characteristics to be independent variables. The model had an 85% success rate. Cruz-Jesus et al. (2020) attempted to predict students' academic performance using their demographic characteristics. The k-nearest neighbors (kNN), logistic regression, random forest (RF), and support vector machines (SVM) algorithms correctly estimated the academic performance of 65% of students. Researchers aiming to predict academic performance and retention have applied a range of techniques, including neural networks, decision trees, logit, probit, and regression (Nandeshwar, Menzies, & Nelson, 2011). However, most recent studies have adopted RF (e.g., Hung et al., 2020), genetic programming (e.g., Pillay, 2020), and Naïve Bayes (e.g., Sutoyo & Almaarif, 2020) algorithms. When the literature in this area was examined, it was found that a wide variety of variables were used in these studies, such as homework, projects, and quizzes (e.g., Kardaş & Güvenir, 2020).
It is observed that the classification accuracy rate varies between 70% and 95% in almost all models developed in such studies. However, collecting and processing such a variety of data takes a lot of time and requires expert knowledge. Furthermore, Hoffait and Schyns (2017) stated that socio-economic data (e.g., parents' educational level and occupation) are unnecessary and that collecting so much data is difficult. Moreover, such demographic or socio-economic data may not always provide the right ideas for how to prevent failure (Bernacki et al., 2020). In this context, this study aimed to develop a new model that can analyze the grade point averages (GPAs) of the previous semester with data mining methods and predict the final GPAs in the following semesters in three categories (department, faculty, and university). The dataset was divided into these three categories so that the performance of the developed model could be evaluated per group. For this general purpose, it was determined which DM algorithms had the highest performance. This will contribute to the development of pedagogical interventions and new policies supporting students' academic development. In this way, the number of students with the potential to fail can be reduced through the assessments made at the end of each academic term.

Dataset
Educational institutions regularly store data about students electronically. This data can be of a wide variety and volume, ranging from students' demographic characteristics to their academic performance. In this study, the data were obtained from the SIS, which holds all the student records of a state university in Turkey. The records of students in the department of primary education were selected as the dataset for the department category, the records of students in the faculty of education were selected for the faculty category, and a total of 5,649 records of students enrolled in the fall and spring semesters of the 2017-2018 academic year were selected for the university category. The dataset was divided into three categories to evaluate the significance and consistency of the model's performance across different groups; in other words, the dataset was grouped to determine the performance of the model at the level of the department, the faculty, and the university as a whole. The distribution of the students by academic unit is given in Table 1.
The GPA at the end of the fall semester of the 2017-2018 academic year was determined as the independent variable, and the GPA at the end of the spring semester of the same academic year as the dependent variable. The model, developed based on the students' GPAs in the fall semester, estimates their GPAs for the spring semester. In other words, it was examined to what extent a student's academic performance in the fall semester explained their potential academic performance in the spring semester. This leaves approximately five months in which to perform corrective activities for students identified as potentially failing according to their estimated GPA.

Data preparation
Data preparation is the process of making the data ready for use, i.e., converting raw data into processable, noise-free data. For this purpose, a total of 1,080 records with incomplete values were deleted (e.g., the records of students who attended classes in the fall semester but did not attend in the spring semester or who canceled their registration). Furthermore, the dataset obtained from the SIS contained the grades for each student's midterm, final, and make-up exams for each course, with each grade recorded as a separate row. These entries, consisting of many rows per student, were grouped on a semester basis and averaged. The midterm, final, and make-up exam grades were then transposed from rows into columns and turned into features.
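As a sketch of this step, the row-to-column transformation described above can be reproduced with a pivot. The column names and grade values below are illustrative assumptions, not the study's actual SIS schema.

```python
import pandas as pd

# Hypothetical long-format SIS records: one row per exam grade.
records = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "semester":   ["fall"] * 6,
    "exam_type":  ["midterm", "final", "makeup"] * 2,
    "grade":      [70, 80, None, 55, 60, 65],
})

# Drop incomplete records, then average each exam type per student
# and semester, turning exam-type rows into feature columns.
clean = records.dropna(subset=["grade"])
features = clean.pivot_table(
    index=["student_id", "semester"],
    columns="exam_type",
    values="grade",
    aggfunc="mean",
).reset_index()
```

The pivot collapses many grade rows per student into one feature row per student and semester, which is the wide format the algorithms in the next section expect.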

Applying the algorithms
After data identification and collection, the development phase of the model began. For this purpose, DM algorithms were applied. In this context, linear regression (LR), RF, SVM, and kNN were applied to predict students' academic performance, in line with earlier studies (e.g., Akçapınar et al., 2019; Cruz-Jesus et al., 2020; Zabriskie et al., 2019). Thus, students who had the potential to fail and who were likely to drop out of the course/university were identified. In the faculty and university categories, 70% of the data were allocated as training data and 30% as test data; in the department category, 95% were allocated as training data and 5% as test data. Table 2 shows the distribution of training and test data by category.
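The study ran these algorithms in Orange; as an illustrative equivalent, the same four regressors can be fitted in scikit-learn on a single predictor. The synthetic GPA data below (and its noise level) is an assumption for the sketch, not the study's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: fall GPA (the single predictor) and a
# correlated spring GPA, both on a 4-point scale.
fall_gpa = rng.uniform(0.0, 4.0, size=500).reshape(-1, 1)
spring_gpa = np.clip(fall_gpa.ravel() + rng.normal(0, 0.3, 500), 0, 4)

# 70/30 split, as used for the faculty and university categories.
X_train, X_test, y_train, y_test = train_test_split(
    fall_gpa, spring_gpa, test_size=0.3, random_state=42)

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(random_state=42),
    "SVM": SVR(),
    "kNN": KNeighborsRegressor(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Evaluate with the regression metrics used in the Results section.
    scores[name] = (r2_score(y_test, pred),
                    mean_absolute_error(y_test, pred))
```

Because the task is regression on a continuous dependent variable (the spring GPA), each model is scored with R² and MAE rather than classification accuracy.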

Results
The entire experimental stage was carried out with the Orange software (Ratra & Gulia, 2020). The data included the 2017-2018 fall and spring semester GPAs of 426 students in the department of primary education, 2,379 students in the faculty of education, and 5,649 students at the university. Since each observation in the dataset could be represented with a sufficient number of training samples, no dataset imbalance occurred in the pre-processing phase. The fall semester GPA was the independent variable in the design of the model; the variable to be explained was the spring semester GPA. Table 3 shows the model variables.
In Table 4, the values in the 2017-2018 Spring column are the actual values. The values in the LR, RF, SVM, and kNN columns are the values estimated by the relevant model. For example, the spring semester GPA of the student numbered std1 in the department category was 3.05. The predicted values of the LR, RF, SVM, and kNN models were 2.86, 2.97, 2.91, and 2.87, respectively. As can be seen from the first example, the models made accurate predictions with a deviation of about 0.28 points.
DM methods analyze measured data and predict the results of samples in similar situations. Two types of these methods are regression and classification algorithms. While regression algorithms predict continuous values, classification algorithms predict categorical values; that is, the output variable is numerical (continuous) for regression and categorical (discrete) for classification. In this study, the dependent variable was continuous, so the accuracy of the prediction results was measured with regression metrics. The estimated GPAs were evaluated using four different metrics, including the Coefficient of Determination (CoD, or R²) and the Mean Absolute Error (MAE). Table 5 shows the results of the analysis regarding the estimation of the students' final GPAs. In the department category, the kNN algorithm gave the highest R² (0.775) value. According to this finding, there was a very high-level correlation between the data predicted in the department category and the actual data. Furthermore, according to the MAE value (0.250), the actual value was predicted with a deviation of 0.250 points up or down. As a result, the kNN algorithm was approximately 94% correct in its classification of the samples.
In the faculty category, the LR algorithm gave the highest R² (0.543) value. According to this finding, there was a medium-level correlation between the data predicted in the faculty category and the actual data. Moreover, according to the MAE value (0.296), the actual value was predicted with a deviation of 0.296 points up or down. As a result, the LR algorithm was approximately 93% correct in its classification of the samples.
In the university category, the LR and SVM algorithms gave the highest R² (0.723) value. According to this finding, there was a very high-level correlation between the data predicted in the university category and the actual data. In addition, according to the MAE value (0.315), the actual value was predicted with a deviation of 0.315 points up or down. As a result, the LR and SVM algorithms were approximately 92% correct in their classification of the samples.
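The rounded percentages above appear consistent with reading the MAE as a share of the 4-point GPA scale, i.e., 1 − MAE/4; this interpretation is an assumption, but a quick check reproduces the reported figures:

```python
# MAE values per category, taken from the results above.
mae_by_category = {"department": 0.250, "faculty": 0.296, "university": 0.315}

# Assumed reading: percent correct = (1 - MAE / scale) * 100, scale = 4.0.
accuracy = {cat: round((1 - mae / 4.0) * 100)
            for cat, mae in mae_by_category.items()}
# department -> 94, faculty -> 93, university -> 92
```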

Discussion And Conclusion
This study proposed a new model based on DM algorithms to identify students who have the potential to fail and who may be likely to drop out of the university. This new model analyzes the students' GPAs from the previous semester with DM algorithms and predicts the GPAs they may receive in the following semesters in three categories (department, faculty, and university). The performance of the developed model was evaluated on the basis of these categories. In addition, the performance indicators of four algorithms (LR, RF, SVM, and kNN) were compared. In short, this study focused on three parameters. The first was to predict academic performance with a single independent variable. The second was to compare the performance indicators of the four algorithms. The third was to compare the predictions made in the three categories.
With regard to the LR, RF, SVM, and kNN algorithms in the university category and the RF and kNN algorithms in the department category, there was a high-level correlation between the students' GPAs for the previous semester and their GPAs for the following semester. In addition to the high performance indicators of the algorithms, the fact that predictions were made using only one variable indicates the originality of the study. The findings allow it to be stated that the students' GPAs for the previous semester explained the GPAs they would receive in the following semester at a high level.
In the faculty category, the LR, RF, SVM, and kNN algorithms demonstrated a medium-level correlation between the students' GPAs for the previous semester and their GPAs for the following semester. Although there was a high-level correlation in the department and university categories, the medium-level correlation in the faculty category can be explained by the high number of departments in that category (eight). This may be attributable to placement scores: while the placement score was very high in some departments of the faculty of education, in others it was very low.
As of the date of this study, no studies have been found in which the GPAs for the following semester/academic period were predicted based on the previous semester's GPAs using a single variable. Therefore, the results of the research were compared with studies that tried to predict students' academic performance based on various demographic and socio-economic variables. Hoffait and Schyns (2017) developed a new model with DM methods to identify students at high risk of academic failure based on their various demographic characteristics. They compared the performance indicators of the Logistic Regression, Artificial Neural Network (ANN), and RF algorithms and were able to predict students at high risk of academic failure with 90% accuracy. Waheed et al. (2020) used deep learning models to identify students who were at risk of poor academic performance and had the potential to leave the course. They developed a model with a total of 54 student behavioral characteristics in the LMS along with the demographic characteristics of the students. The model had an average of 88% accuracy in making correct classifications, and it was claimed that the results obtained would contribute to decision-making processes. Similarly, Xu et al. (2019) examined the relationship between students' internet usage behaviours and academic performance through machine learning methods. In like manner, Bernacki et al. (2020) tried to predict students' academic performance based on the digital traces they left in an LMS. They had a 75% success rate in predicting which students would need to repeat the course. Ahmad and Shahzadi (2018) determined the relationship between students' study habits, learning skills, academic interaction, and academic performance through machine learning methods. The model they proposed had a predictive accuracy of 85%.
As a result, a high level of relationship was found between these variables, and they argued that machine learning methods would contribute to the development of educational management. Machine learning methods have thus had very successful results in determining the relationship between students' demographic and socio-economic characteristics and academic performance (Cruz-Jesus et al., 2020; Costa-Mendes et al., 2020). However, the prediction model in all of these studies was established with a large number of independent variables.
The model proposed here accurately estimated the students' GPAs at the end of the semester with an average deviation of seven points out of a hundred, using a single variable. By estimating final GPAs, students who are at risk of failure or of dropping out can be identified, and education and training authorities can be given opportunities to implement corrective actions for these students. Modules that predict academic performance with DM methods can also be added to LMSs, making it possible to produce accurate predictions automatically and quickly. In short, teaching-learning processes can be managed more effectively and more efficiently thanks to the academic-performance predictions made by DM methods, and timely, targeted individual interventions can be ensured.
In conclusion, although this research uses various predictors, different algorithms, and different approaches to determine students' GPAs, the results are consistent with previous research and confirm that DM methods can create an effective model for predicting student academic performance. The LR, RF, SVM, and kNN algorithms had high performance rates. It was also observed that these algorithms can be applied at the level of the department, the faculty, and the university as a whole. It can be said that such data-driven studies can make very important contributions to decision-making processes. It is, however, also necessary to support students, manage decision-making processes, and develop corrective strategies to ensure students' attendance.
The present study aimed to analyze students' GPAs in the previous semester using data mining methods and to predict the final GPAs that students may receive in the following semesters, using the data of students at a state university in Turkey. Therefore, students of different education levels can be studied in future research. Furthermore, future studies can be planned by taking into account the various individual differences that affect students' academic performance.