Data Mining (DM) is the process of extracting new trends and patterns from big data using various classification algorithms (Baker & Yacef, 2009). In other words, it is the discovery of useful information in large datasets. Educational data mining (EDM), in turn, “develops and adapts statistical, machine-learning and data-mining methods to study educational data generated basically by students and instructors” (Calvet Liñán & Juan Pérez, 2015, p. 100). EDM develops new methods to reveal meaningful segments, original structures, and hidden patterns in data obtained from educational environments. Its main purpose is to extract information from educational data to support decision-making on educational issues (Calvet Liñán & Juan Pérez, 2015). It involves collecting and interpreting data produced by learners in order to assess academic performance, predict future performance, and identify current issues. In short, EDM is the automatic extraction of previously unknown, hidden, meaningful, and beneficial patterns from huge amounts of data.
Various DM algorithms have been successfully applied to identify students at risk of academic failure (e.g., Hu, Lo, & Shih, 2014). Both students and teachers benefit from the knowledge discovered through these algorithms, which serves to improve learning and teaching processes (Akçapınar, Altun, & Aşkar, 2019). In today’s education systems, large amounts of data, including demographic information and students’ academic grades, are stored in electronic environments. These data can be obtained from various learning management systems (LMS) and student information systems (SIS). The rapid increase in the amount of available educational data can contribute to improving students’ learning outcomes (e.g., Shorfuzzaman et al., 2019; Viberg et al., 2018).
EDM provides statistical information to identify the relationship between relevant stakeholders and optimize the learning environment by discovering hidden patterns in the educational environment (Fernandes et al., 2019). For instance, some studies on EDM have compared e-learning systems (e.g., Lara et al., 2014), some have classified educational data (e.g., Chakraborty et al., 2016), and others have tried to estimate student performance (e.g., Fernandes et al., 2019). Thus, corrective strategies and pedagogical methods can then be developed by identifying both successful and at-risk students (Casquero et al., 2016; Fidalgo-Blanco et al., 2015).
Ahmad and Shahzadi (2018) developed a model with machine learning methods to identify students who may potentially fail academically. They determined students’ learning skills, study habits, and academic interaction characteristics to be independent variables. The model had an 85% success rate. Cruz-Jesus et al. (2020) attempted to predict students’ academic performance using their demographic characteristics. The k-nearest neighbors (kNN), logistic regression, random forest (RF), and support vector machines (SVM) algorithms correctly estimated the academic performance of 65% of students. Furthermore, Fernandes et al. (2019) developed a model using the demographic characteristics of the students and their achievement scores for in-term activities. Moreover, Musso et al. (2020) developed a machine learning model based on students’ socio-economic characteristics and academic performance.
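As a hedged illustration of the kind of comparison reported by Cruz-Jesus et al. (2020), the sketch below trains the same four algorithms (kNN, logistic regression, RF, SVM) on synthetic stand-in data. The generated features, train/test split, and all parameter choices are illustrative assumptions, not the original study's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for demographic features (age, income band, etc.)
# and a binary pass/fail target.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "LogReg": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(),
}

# Fit each model and record its held-out accuracy.
accuracies = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
              for name, m in models.items()}
```

On real student data, the relative ranking of these algorithms depends heavily on the feature set, which is one reason the studies cited above report such different accuracy figures.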
Waheed et al. (2020) took a different approach, developing a machine learning model based on students’ interactions within an LMS. Their model achieved 85% prediction accuracy, and they found that students who browsed previous course content online were more successful. Xu et al. (2019) examined the relationship between university students’ internet usage characteristics and their academic performance; the model they developed predicted student performance with a high accuracy rate. Burgos et al. (2018) examined the relationship between students’ academic performance in past semesters and their performance in subsequent semesters.
In summary, EDM enables early estimation of the probability of outcomes such as students dropping out of university or losing interest in a course, analyzes the internal factors affecting student performance, and uses statistical techniques to measure that performance. Various DM methods can be used to predict students’ performance, identify slow learners, and identify students likely to leave university (e.g., Hardman, Paucar-Caceres, & Fielding, 2013; Kaur, Singh, & Josan, 2015). In this context, early prediction is a relatively new approach that uses assessment methods to support students by proposing appropriate corrective strategies and policies (Akçapınar et al., 2019; Waheed et al., 2020).
Researchers aiming to predict academic performance and retention have applied a range of techniques, including neural networks, decision trees, logit, probit, and regression (Nandeshwar, Menzies, & Nelson, 2011). However, most recent studies have adopted RF (e.g., Hung et al., 2020), genetic programming (e.g., Pillay, 2020), and Naïve Bayes (e.g., Sutoyo & Almaarif, 2020) algorithms. A review of the literature in this area shows that a wide variety of variables have been used in these studies:
• Various digital traces that students leave on the internet (browsing, time spent watching courses, attendance percentage) (e.g., Fernandes et al., 2019; Waheed et al., 2020; Xu et al., 2019),
• The demographic characteristics of students (gender, age, economic status, number of courses attended, internet access, etc.) (e.g., Aydemir, 2017; Bernacki et al., 2020; Cruz-Jesus et al., 2020; García-González & Skrita, 2019; Rebai, Yahia, & Essid, 2020; Rizvi, Rienties, & Ahmed, 2019),
• Learning skills, study approaches, study habits (e.g., Ahmad & Shahzadi, 2018),
• Learning strategies, perception of social support, motivation, health, academic performance characteristics (e.g., Costa-Mendes et al., 2020; Musso et al., 2020; Kılınç, 2015; Gök, 2017),
• Homework, projects, quizzes (e.g., Kardaş & Güvenir, 2020).
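Because these studies mix categorical demographics with numeric traces and grades, a common preprocessing step is to encode all of these variables into a single numeric feature table before applying a DM algorithm. A minimal sketch using scikit-learn, where the column names and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical student records mixing categorical and numeric variables.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "internet_access": ["yes", "no", "yes", "yes"],
    "age": [19, 20, 22, 21],
    "prev_gpa": [2.8, 3.1, 2.2, 3.6],
})

# One-hot encode the categorical columns; standardize the numeric ones.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["gender", "internet_access"]),
    ("num", StandardScaler(), ["age", "prev_gpa"]),
])
X = pre.fit_transform(df)  # 4 rows, 6 encoded feature columns
```

The resulting matrix can be passed directly to any of the classifiers named above.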
The classification accuracy rates of almost all models developed in such studies vary between 70% and 95%. However, collecting and processing such a variety of data takes considerable time and requires expert knowledge. Furthermore, Hoffait and Schyns (2017) stated that socio-economic data (e.g., parents’ educational level and occupation) are unnecessary and that collecting so much data is difficult. Moreover, demographic or socio-economic data may not always offer actionable insight into how failure can be prevented (Bernacki et al., 2020). In this context, this study aimed to develop a new model that analyzes previous-semester grade point averages (GPAs) with data mining methods and predicts final GPAs in subsequent semesters at three levels (department, faculty, and university). The dataset was divided according to these three levels so that the performance of the developed model could be evaluated per group. To this end, the DM algorithms with the highest performance were identified. The findings will support the development of pedagogical interventions and new policies that foster students’ academic development. In this way, the number of students at risk of failure can be reduced through assessments made at the end of each academic term.
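The core prediction task described above can be sketched as follows, using synthetic GPA data: previous-semester GPAs serve as the only features, and a hypothetical at-risk label (next-semester GPA below 2.0) as the target. The threshold, the data-generating process, and the choice of RF are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300

# Synthetic GPAs (0.0-4.0 scale) for two previous semesters.
prev_gpa = rng.uniform(0.0, 4.0, size=(n, 2))

# Simulate the next-semester GPA as depending on the previous ones,
# plus noise; label a student "at risk" below a 2.0 threshold.
next_gpa = 0.8 * prev_gpa.mean(axis=1) + 0.4 + rng.normal(0.0, 0.3, size=n)
at_risk = (next_gpa < 2.0).astype(int)

# Estimate how well previous GPAs alone predict the at-risk label.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, prev_gpa, at_risk, cv=5)
mean_acc = scores.mean()
```

In the actual study, the same kind of evaluation would be repeated separately for each group (department, faculty, and university) to compare model performance across levels.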