Machine Learning Classification Algorithms for Systematic Analysis to Understand Learner Dropout from MOOC Courses

The increasing popularity of massive open online courses (MOOCs) has been attracting many learners. Despite this popularity, it has been observed that a significant percentage of learners discontinue courses and drop out of the platform. This is a problem that most MOOC courses face. The dropout probability of any student depends on his/her interaction with the platform and the features of the course in which the student has enrolled. This research work is intended to study and analyze the dropout behavior of students in online learning, to identify the reasons behind it, and to understand their impact. The current research accounts for the activity logs of learners across 13 different online courses offered by Harvard and MIT during 2012 to 2013. The work examines the attributes that affect the student dropout rate. The research can be useful in improving the existing features of MOOC courses and content to ensure persistent turnout of their learners. The dropout rate in the student data of 13 MOOCs and over 300,000 learners was calculated using 5 machine learning (predictive) models: Deep Learning, Gradient Boosted Trees, Logistic Regression, Decision Tree and Neural Networks. The performance matrix of these algorithms displayed class prediction accuracy above 88%. Most of the dropout students belonged to the age group of 19 to 25, and they were majorly from the United States, India and Europe. The results revealed the significance of a few less explored features not addressed in previous works.


Introduction
The prevalence of MOOCs is currently at its peak. A MOOC serves as a good alternative in cases of geographical incompatibility, or when a student cannot manage to attend regular classes. Since MOOCs are easily accessible from anywhere at any time, they naturally become a preferred choice. Most MOOCs are self-paced, so the student can study comfortably. Coursera leads the MOOC industry with over 53 million students and thousands of specialization courses and degree programs. Followed by edX and Udacity, there isn't any subject matter, theme or field left untouched by these platforms. MOOCs provide their students an opportunity to harness knowledge at ease. According to Class Central's MOOC report (edX, 2018), it has been observed that most online learners are beyond the age of 25, which categorizes them as 'continuing learners', suggesting that they either take up these courses in addition to their full-time studies, or out of the need to keep oneself updated as required by the professional environment. The onset of successful MOOCs was marked by an online course on 'Artificial Intelligence' in fall 2011. It was organized by ex-Stanford professors Sebastian Thrun and Peter Norvig (who founded Udacity the same year). There was a huge number of enrollments from all over the world, and thus, by 2012, the era of online learning through MOOCs had set in.
Even though the introduction of MOOCs as a part of (distance) learning has transformed the teaching system and benefitted many individuals, there is a persistent problem that most MOOCs still face after over eight years of their foundation: a high dropout/withdrawal rate of students. A research study by Justin et al. (2019) suggests that the low completion rate of MOOCs has not improved over the past six years. Nearly 52% of students never explored the course content after their enrollment, and the dropout rate is highest in the first two weeks after course initiation.
This fact is confirmed by Rosé, C. P. et al., where it is asserted that it is important to keep the student interested and engaged during the first two weeks to avoid attrition. Also, a research study by Jacobsen, D. Y. (2019) claims that learners who explore fewer pages of the course, and do not attempt and submit assignments regularly, are more likely to drop out; however, if learners apply whatever knowledge is gained through a MOOC in practice, then that is a positive outcome of the course. To keep a record of students' intentions and understand them better, HarvardX conducts a pre-course survey that includes questions not only about the general information of the student, but also about the reason for enrollment and confidence regarding completion of the course, and most importantly, asks the student to formulate a plan for how he/she is going to accomplish that goal. HarvardX has thus taken a step towards examining the reasons why students may tend to drop out, as suggested by the enrollees themselves. Figure 1 shows a fragment of a Harvard University course survey form.
In this research, it is intended to observe the trends in the characteristics of a typical dropout student, and also to study the weight of the different attributes or factors leading to dropout. There are many purposes for which students enroll in a course: exploring the course structure, aspiring to earn a certificate, or simply curiosity. By monitoring the clickstream activity and performance of a learner, it is possible to identify the probability of that student completing a particular course. These insights can be obtained by applying educational data mining to the student data. The rationale behind this research is to bring to light the underlying factors affecting the attrition and dropout rate of students.

Literature Review
There can be numerous reasons that result in people dropping out, or in people completing a course but never coming back to a particular website again. Now that dropout is a popular interest among researchers, there is a large amount of relevant literature available that focuses on this particular area. Table 1 gives an overview of the authors and the techniques they have used to address the problem and give possible solutions. It has been observed that most of these research works are based on clickstream analysis and the study of patterns in student engagement. Balakrishnan & Coetzee (2013) used Hidden Markov Models and Support Vector Machines to create a better prediction model. "HMMs associate different states of the data with a probability distribution. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate. Then, classification is performed by finding the hyper-plane that differentiates the two classes very well." Lastly, Rebecca M. Stein and Gloria Allione (2014) also used the Cox proportional hazard model, "which assumes that the covariates affect dropout in a proportional manner that is time invariant".
Systematic Literature Review: We have reviewed the related work to get an overview of the current status of research on machine learning classification techniques. As the field of machine learning classification is developing rapidly (e.g., with applications in the academic sector through deep learning), this study aims to provide an outline of the current trends of these procedures. Moreover, we see that most other secondary studies emphasize the study of different techniques as part of the systematic literature review process. This study aims to offer an overview of all steps of the process. The current work analyses the following keywords: "Machine", "Learning", "Classification", and "MOOC".
The above Fig. 3 characterizes the year-wise document types over the years 2013-2021. It can be observed that the number of documents has risen from 2016 onwards. From Table 1, and referring to various other research works, the commonly used attributes for analysis were identified. These have been outlined in Fig. 9.
It can be inferred from Table 2 that the most researched attributes in existing literature are student engagement, student feedback & reviews and also the quality and type of content the course offers. It can be seen that the language barrier, the personality of students (punctuality) and freedom of choice within a course (skipping assignments, tutorials) have not been popular choices; hence, some of these factors are also considered in this research. The considered dataset (refer to Table 3) includes some of the new data introduced for evaluation purposes.
2.1 Description of Machine Learning Algorithms used

Decision Tree
The Decision Tree is a supervised learning technique that branches a dataset on the basis of different features. Starting from the root node, the tree picks the most important input attribute at each level and splits into branches according to the class label predictions. The leaf nodes hold the values of the column to be predicted; hence, the appropriate prediction for a given input is the leaf node reached in the tree.
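As a minimal illustration of how such a split is chosen (a pure-Python sketch, not the RapidMiner model used in this study; the feature values and labels below are hypothetical), the tree picks the threshold that minimises the weighted Gini impurity of the two resulting branches:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_idx):
    """Threshold on one numeric feature minimising weighted Gini
    impurity of the two branches; returns (threshold, impurity)."""
    best = (None, float("inf"))
    for t in sorted({r[feature_idx] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature_idx] <= t]
        right = [l for r, l in zip(rows, labels) if r[feature_idx] > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: feature = days between enrolment and first activity,
# label = 1 (dropout) / 0 (completed).
rows = [[2], [5], [8], [22], [30], [45]]
labels = [0, 0, 0, 1, 1, 1]
threshold, score = best_split(rows, labels, 0)
```

With these toy values, the chosen threshold separates the two classes perfectly, driving the weighted impurity to zero.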

Logistic Regression
Logistic Regression is a supervised classification algorithm used to predict binary dependent variables, resulting in a probability figure that strictly lies between 0 and 1. In this case, the dependent variable is the column indicating whether a student is a dropout or not. The algorithm presents P(Y = 1) as a function of X.
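For instance, with a single feature and hypothetical coefficients b0 and b1 (illustrative values, not fitted to the paper's dataset), P(Y = 1) is the logistic function of a linear combination of the inputs:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_dropout(x, b0, b1):
    """P(Y = 1 | X = x) for a one-feature logistic model."""
    return sigmoid(b0 + b1 * x)

# Hypothetical coefficients; x = 30-day gap before first activity.
p = p_dropout(30, b0=-2.0, b1=0.1)
```

The output is always strictly between 0 and 1, which is what allows it to be read as a dropout probability.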

Neural Networks
Neural Networks are a set of algorithms, inspired by human neurons (modeled as perceptrons in machine learning), that are used to identify patterns in large data. They label, cluster and classify raw input data to identify patterns, arranging the inputs into layers of class-wise nodes and thereby encoding the patterns in vectors.
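A single forward pass through one hidden layer of sigmoid units can be sketched as follows (a toy pure-Python illustration with hand-picked weights, not the network trained in this study):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> one hidden layer of sigmoid
    perceptrons -> single sigmoid output in (0, 1)."""
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out)

# Two inputs, two hidden units, one output; weights are illustrative.
out = forward([1.0, 0.0],
              w_hidden=[[1.0, 1.0], [-1.0, 1.0]],
              b_hidden=[0.0, 0.0],
              w_out=[1.0, 1.0],
              b_out=0.0)
```

Training would adjust the weights by backpropagation; here only the forward computation is shown.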

Gradient Boosted Trees
The Gradient Boosting algorithm turns weak learners on a training dataset into strong learners, usually by means of decision trees. By up-weighting the examples that are difficult to classify and using those as the basis for the next improved trees, the prediction of the final model is a summation of the previous boosted trees.
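The additive idea can be shown in a minimal pure-Python sketch: each round fits a one-split "stump" to the current residuals and adds a damped copy of it to the running prediction (toy data and learning rate are hypothetical; real implementations such as the one used in this study build far richer trees):

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: (threshold, left mean, right mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def boost(xs, ys, rounds=10, lr=0.5):
    """Gradient boosting for squared loss: each round fits a stump
    to the residuals and adds lr times its output to the prediction."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, ml, mr = fit_stump(xs, residuals)
        pred = [p + lr * (ml if x <= t else mr) for x, p in zip(xs, pred)]
    return pred

preds = boost([1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0])
```

After a few rounds the summed stumps reproduce the targets almost exactly, which is precisely the "summation of previous boosted trees" described above.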

Deep Learning
Deep Learning is an implementation of artificial neural networks. The algorithm processes data through several layers (more layers than a shallow neural network, hence the name 'deep') to draw conclusions from it. It uses different input data to produce insights and then draw larger conclusions (or predictions). This is similar to the human behavior of 'learning by experience/example'.

Dataset selection
With reference to the attributes in Table 2, the chosen dataset is from Harvard and MIT MOOCs that includes relevant information. A detailed description about the dataset is given in Sect. 4.1.

Data preprocessing
To make the data useful, all missing and invalid values were eliminated. Furthermore, to improve the integrity of the data, the nominal independent variables were converted using the 'Nominal to Numerical' operator in RapidMiner Studio. This helped in acquiring fast and precise results because all the values follow the same scale. Using the 'Split Data' operator, the dataset was divided into two parts: 75% for training and 25% for the prediction test.
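The two operators can be approximated in plain Python as below (a sketch only: RapidMiner's 'Nominal to Numerical' operator supports several coding schemes, and this sketch uses simple unique-integer coding on a hypothetical gender column; the split shown is the same 75/25 ratio as in the study):

```python
def nominal_to_numerical(values):
    """Map each distinct category to an integer code (unique-integer
    coding; one of the schemes RapidMiner's operator offers)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def split_data(rows, train_ratio=0.75):
    """Split rows into a training part and a held-out test part."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

# Hypothetical nominal column from the learner records.
genders = ["m", "f", "f", "m", "o", "f", "m", "m"]
encoded = nominal_to_numerical(genders)
train, test = split_data(encoded, 0.75)
```

In practice the split should be randomised (or stratified by the dropout label) before cutting; the sequential cut here is only for brevity.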

Feature Selection
The prepared dataset was examined for the most relevant input features. A research study by Gupta, S., & Sabitha, A. S. (2019) presented its findings according to the clickstream analysis of the students and their discussion participation, which included 'number of times videos played', 'exploration of course' and 'number of forum posts by student'. In this research, four new features were introduced into the dataset. Therefore, a total of 12 attributes were chosen as input for the prediction model. The dataset fields are explained in Table 3 of Sect. 4.

Determining the significance of given characteristics
Five different machine learning algorithms were used in this research to compare the weights of the input features and determine the important ones that impact the dropout rate of students in any MOOC.
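One of the weighting schemes reported later ("weight by correlation") can be sketched directly: the weight of a numeric feature is the absolute Pearson correlation between that feature and the binary dropout label (toy values below are hypothetical, not taken from the HarvardX/MITx data):

```python
import math

def weight_by_correlation(feature, labels):
    """Absolute Pearson correlation between one numeric feature and
    the binary dropout label, used as the feature's weight."""
    n = len(feature)
    mx = sum(feature) / n
    my = sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(feature, labels))
    sx = math.sqrt(sum((x - mx) ** 2 for x in feature))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return abs(cov / (sx * sy))

# Toy column: larger first-activity gap co-occurring with dropout = 1.
w = weight_by_correlation([1, 2, 3, 4], [0, 0, 1, 1])
```

A weight near 1 indicates a strong (positive or negative) association with dropping out; a weight near 0 indicates little linear association, as reported for 'lang' in the results.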

Dataset
The data used for analysis is from the 2012-2013 academic year, the first year after the introduction of MOOCs by HarvardX and MITx, retrieved from Kaggle. The features of this dataset are explained in Table 2.

Characteristics of dropouts evident from the dataset
The distribution of the noticeable characteristics of the dropout students can be explained as follows. The statistical study of this dataset revealed that the dropouts mainly belonged to a certain age group and country. This confirms the impact of demographics on the retention of a student. The maximum number of dropouts were from one particular course (6.00x), which implies that the course in which a student is enrolled may affect his/her engagement. It was also seen that most dropouts held a bachelor's level of education, and those with 'doctorate' as the highest education level comprised the smallest share of the dropouts (2.7%). This implies that the education level of a student should match the prerequisites of the course taken (for example, 6.00x was an undergraduate-level course, and 32.3% of its dropouts had secondary education or less as their highest level).
However, this study did not focus on all course-related and student-related factors that were derived through the literature survey. For this purpose, the new data according to those features were gathered through appropriate resources, to be examined as well (Refer Sect. 3.2). These data were then incorporated into the research methods adopted, to retrieve new relevant results.

Case Study 2: Prediction of Student Attrition
The attributes chosen for the further analysis of the dataset to produce insights about dropout rates were 'date', 'difference', 'duration', 'lang', 'course_id', 'semester', 'final_cc_cname_DI', 'LoE_DI' (LoE), 'Gender', 'Age', 'start_time_DI', and 'incomplete_flag'. To understand the impact of these new features, five predictive machine learning algorithms: deep learning, neural networks, decision tree, logistic regression and gradient boosted trees, were applied to the inputs, so as to identify the weights of all the features and determine whether there are new dominant features that can contribute towards the tendency of a student dropping out of an online course. The machine learning models predicted whether a student will drop out (1) or not (0). The weights of the features thus calculated suggested their respective impacts on the prediction results. The following section describes these results.

Attribute Weights
The student attrition probability was predicted using various input values so as to determine which attribute had the most impact on the results. Each algorithm assigned weights to the input features according to the prediction results. A brief description of the same is given in the subsequent paragraphs: i) Neural Network-The algorithm laid emphasis on three input attributes, i.e., 'semester = Spring', with a weight by correlation of 0.138; 'semester = Fall', with a weight by correlation of 0.133; and 'course_id = 8.02x', with a weight by correlation of 0.108, which roughly translates to 13.8%, 13.3% and 10.8% respectively. Figure 5 shows the values of the Sigmoid function for an improved neural network formation.
ii) Decision Tree-According to this algorithm, the attributes with the highest weights were 'difference', with a weight of 0.416, and 'age'. The decision tree for this prediction is given in Fig. 6 below. It shows that the first attribute chosen for branching is 'difference', because it has the greatest impact on the result.
iii) Logistic Regression-Using this algorithm, it was observed that 'course_id' being 6.00x, 6.002x or CB22x meant a greater probability of a student dropping out of these courses. This implies that the course contents play an important role in the dropout behavior of a student. The combined weight of these features was calculated as 14.825%. It was seen that students who belonged to the United States or India were more likely to drop out. Also, attributes like 'semester' as Fall weighted on the negative scale, and 'lang' (language proficiency of the student) carried a near-zero weight. A bar chart comparing the weights of various attribute values is given in Fig. 7.
iv) Gradient Boosted Trees-The algorithm created 100 progressive decision trees, portraying the best possible attribute relations so that their weights could be calculated. The results revealed that the 'difference' feature was present in most of the dropout cases (over 75,000 students). The most dominant attribute here was 'duration', with a weight of 19258. It was also seen that 'course_id = 6.002x' had an impact on the result. Again, this clearly lays emphasis on the duration, along with the type of course content the student is enrolled in. The Boosted Trees result also suggested an influence on the dropout behavior depending on the duration of a course. Table 4 below summarizes the prediction results related to the attribute weights.

Conclusion
This paper advances research work to study and analyze the dropout behavior of students in online learning, with identification of the reasons and an understanding of their impact.
Our systematic literature review identifies the theoretical approaches most used to study the development, potential and dynamics of the field. The work examines the attributes that affect the student dropout rate. The research can be useful in improving the existing features of MOOC courses and content to ensure persistent turnout of their learners. It was made clear through this research that the likelihood of a student dropping out of a MOOC essentially depends on both course-related and student behavioural factors. As an outcome of the previous literature analysis, many factors were marked as important in making an impact in this area. There was a great emphasis on the clickstream pattern of a student and interaction with co-learners, as this is the most obvious way to infer the interest of a student in a course. The dropout rate in the student data of 13 MOOCs and over 300,000 learners was calculated using 5 machine learning (predictive) models: Deep Learning, Gradient Boosted Trees, Logistic Regression, Decision Tree and Neural Networks. The performance matrix of these algorithms displayed class prediction accuracy above 88%. Most of the dropout students belonged to the age group of 19 to 25, and they were majorly from the United States, India and Europe. The results revealed the significance of a few less explored features not addressed in previous works. For example, the highest number of dropouts were from the course '6.00x', which was 'Introduction to Computer Science and Programming'. It was also seen that if a student had a first-activity difference of more than 20 days, he is more likely to drop out of the respective course. The results are consistent with Banerjee and Duflo (2014), who found that students who enroll one day late are 17 percent less likely to earn a certificate than students who enroll on time, confirming that patterns of retention can be detected by early behavior.
Hence, it was deduced that, in addition to student engagement intensity, the course content, the time of the course offering, the punctuality of a student regarding course activities, and sometimes the duration of the course also made a crucial impact on the dropout trend in MOOCs.

Declarations
Author Contributions Dr Seema Rawat conceived and designed the study, Ms Chhaya Khattri performed the research, Dr Deepak Kumar analyzed the data, and Dr Praveen Kumar contributed to editorial input.