Understanding Voter Turnout through Big Data Analytics

This study aims to predict voter turnout in Malaysia's general elections using classification tree algorithms. The task is treated as a binary classification problem: the model predicts whether each citizen who is entitled to vote will cast a vote in the general election. The Asian Barometer Survey 2010 and 2014 datasets were used. Three decision tree algorithms, CHAID, CART, and C5.0, were run under different experimental settings. The best prediction model was selected based on several popular performance metrics, and the variables most important in signaling voter turnout were identified from the selected model.

Abstract. In a democracy, citizens can vote to choose who they would prefer to lead the country in a competitive election. Malaysia is one of the countries that practices democracy in shaping the country's future. A total of 14 general elections have been held in Malaysia. The turnout rate of the 2013 general election, 84.8%, is the highest on record in Malaysian history. However, the actual participation rate is considered lower when voter turnout is compared with the voting age population and the number of eligible voters. Thus, this study's objective is to predict voter turnout in Malaysia's general election using classification tree algorithms. The datasets used in this study are the Asian Barometer Survey 2010 and 2014 datasets. The three decision tree algorithms used are CHAID, CART, and C5.0. Among these three methods, CHAID performs best in predicting turnout in Malaysia's general election.


Introduction
Malaysia has practiced democracy for over sixty years, allowing citizens to choose their desired leaders to shape the country's future. Malaysia has held 14 General Elections (GEs) since it was granted independence on 31 August 1957. Generally, a GE is held every five years. Malaysians aged 21 and above are eligible, and have a responsibility, to vote, provided that they register with the Election Commission (EC).
However, Malaysia does not enforce compulsory voting. In 2013, an estimated 17.9 million Malaysians fulfilled the requirements to vote, but only 13.2 million were registered with the EC, as shown in Table 1. Although the number of registered voters and the number of voters who turned out have increased significantly, the turnout rate is lower when it is measured against the voting age population (VAP). Fowler (2013) observes that election outcomes and public policies can change when the number of voters increases. Thus, if political participation or voter turnout is unbalanced, some policies may benefit only particular groups of people. As a result, the government's overall performance may decline because policies are poorly allocated and focus on specific clusters of people. Therefore, voter turnout plays an important role in deciding which political party or leader will lead the country in a competitive election. It is crucial for both ruling and opposition parties to identify who will vote, because the number of voters decides their victory.

Voter Turnout Model
Three contextual political participation models are used to explain voting patterns in Malaysia: the sociological model, political mobilization, and psychological involvement. The sociological model describes a voter's socio-demographic background, for instance age, gender, and education. Political mobilization concerns the communication between political candidates and citizens. Psychological involvement, in turn, captures how much interest voters have in politics.

Sociological Model
The sociological model suggests that socio-demographic variables such as age and education are important factors in explaining voter turnout. Numerous studies over the past decades have connected socio-demographic factors with voting behavior. For instance, Mohd Hed and Grasso (2020) reveal that young people participate in political engagement less than older people. Although education, ethnic group, and gender do not significantly affect political activism in Malaysia, Blais (2000) and Tenn (2005) suggest that education contributes to the turnout rate in the United States (US). Blais (2000) also points out that further variables, including government employment, marital status, religiosity, and income, are correlated with political participation in the US. In a study by Norris (2004), demographics including age, gender, income level, education level, membership, religiosity, and cultural attitudes were found to be significantly related to voter turnout. Therefore, attention to the societal context plays an important role in understanding voter turnout.

Political Mobilization
Political mobilization refers to actors' attempts to influence the existing distribution of power (Nedelmann, 1987). The political mobilization model was developed by Rosenstone and Hansen (1993), who point out that campaigns and interpersonal conversations about politics have a causal influence on voting. Political parties try to mobilize voters with a variety of campaigning methods, from offline, traditional campaigning to online, modern campaigning. Traditional campaigning requires many resources to gain people's support and trust (Farrell, 1996), whereas modern campaigning uses information and communication to influence voters. Welsh (2018) claims that modern campaigning through social media such as Facebook and WhatsApp encourages more people to engage in politics and indirectly changes Malaysia's political landscape.

Psychological Involvement
Voter turnout is strongly related to voters' political interest (Blais, 2007). In other words, the more interested a voter is in politics, the more likely that voter is to vote in an election. Verba, Schlozman, and Brady (1995) suggest that education and parental influence affect a person's political interest.
In addition, the Internet and social media are crucial variables in influencing voter turnout. The Internet is the fastest way to engage with young voters and contributes to higher turnout among them (Hirzalla, Van Zoonen, & de Ridder, 2011). A study by Rauf, Hamid, and Ishak (2016) shows a positive relationship between the ability to access political information and participation in politics. Wang (2007) maintains that the Internet serves as a political medium for seeking information and expressing opinions. Thus, the Internet plays an important role in giving voters information about elections, political parties, and candidates. Furthermore, it stimulates turnout and participation among young people in elections. Put differently, changes and improvements in communication and technology are influencing electoral behavior and the political landscape.

CART
Classification and Regression Trees (CART) was developed by Breiman, Friedman, Olshen, and Stone (1984). CART can be applied to both categorical and continuous data. CART performs binary splits and uses the Gini or entropy splitting rule to obtain nodes of optimal purity.
The following steps are taken to implement the CART algorithm:
(1) Start at the root node.
(2) Find the split that minimizes the sum of the Gini indexes of the resulting nodes, and use it to split the node into two child nodes so as to obtain nodes of optimal purity:

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2 \tag{3.1}$$

where p_i is the relative frequency of class i in D, m is the number of classes, and D is the dataset.
(3) If a stopping criterion is reached, then exit.
(4) Prune the tree using cost-complexity pruning.
(5) Test the accuracy of the model; if the model evaluation criteria are not satisfied, repeat steps 2 to 4.
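The Gini-based split search in step (2) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the toy ages and turnout labels are hypothetical, and weighting each child's Gini index by its share of the records is an assumption about how the "sum of Gini indexes" is formed.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    n = len(labels)
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right child empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Assumption: each child's Gini index is weighted by its share of records.
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: respondent age and whether the respondent voted.
ages = [22, 25, 28, 35, 41, 56]
voted = ["no", "no", "yes", "yes", "yes", "yes"]
threshold, score = best_binary_split(ages, voted)
print(threshold, score)
```

On this toy data the split at age 25 separates the two classes completely, so both child nodes are pure and the weighted Gini index is zero.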

CHAID
CHAID, or Chi-Squared Automatic Interaction Detection, was developed by Kass (1980). CHAID can only be applied to categorical data. CHAID determines the size of the tree based on the chi-square test for independence.
The following steps are taken in modelling with the CHAID algorithm:
(1) Start at the root node.
(2) Find the split based on the likelihood-ratio chi-squared statistic, and use it to split the node into child nodes so as to obtain nodes of optimal purity:

$$G^2 = 2 \sum o \, \ln\!\left(\frac{o}{e}\right)$$

where o is the observed value and e is the expected value in each cell of the predictor-by-class table, with e obtained from p_i, the relative frequency of class i in D, and D is the dataset.
(3) If a stopping criterion is reached, then exit.
(4) Prune the tree based on the chi-square test for independence.
(5) Test the accuracy of the model; if the model evaluation criteria are not satisfied, repeat steps 2 to 4.
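As an illustration of the likelihood-ratio chi-squared statistic used in the CHAID steps, the sketch below computes it for a small contingency table. The counts are hypothetical, and the expected values are taken from the usual independence assumption (row total times column total divided by the grand total); the study's own software may differ in detail.

```python
import math

def likelihood_ratio_chi2(table):
    """G^2 = 2 * sum(o * ln(o / e)) over the cells of a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if predictor category and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# Hypothetical counts: rows are age groups, columns are (voted, did not vote).
table = [[30, 20], [45, 5], [48, 2]]
g2 = likelihood_ratio_chi2(table)
dof = (len(table) - 1) * (len(table[0]) - 1)  # degrees of freedom
print(round(g2, 2), dof)
```

A larger statistic for the same degrees of freedom indicates a stronger association between the predictor categories and turnout, which is why the predictor with the most significant statistic is chosen for the split.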

C5.0
Quinlan (1993) combined and modified the earlier ID3 and C4.5 algorithms to create C5.0. C5.0 offers new features such as winnowing, boosting, smaller trees, and unequal costs for different types of errors (Kuhn & Johnson, 2013). C5.0 uses information gain as the purity criterion to split the dataset and applies pessimistic pruning in the pruning process.
The following steps are taken in implementing the C5.0 algorithm:
(1) Start at the root node.
(2) Find the split based on the entropy measure, and use it to split the node into child nodes so as to obtain nodes of optimal purity:

$$\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

where p_i is the relative frequency of class i in D, m is the number of classes, and D is the dataset.
(3) If a stopping criterion is reached, then exit.
(4) Prune the tree using pessimistic pruning.
(5) Test the accuracy of the model; if the model evaluation criteria are not satisfied, repeat steps 2 to 4.
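The entropy measure and the information gain it yields can be sketched as follows. This is only an illustration of the criterion named in the C5.0 steps, not the C5.0 implementation itself, and the feature name and toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the node minus the weighted entropy of its child nodes."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: rally attendance and whether the respondent voted.
attended_rally = ["yes", "yes", "no", "no"]
voted = ["yes", "yes", "no", "no"]
gain = information_gain(attended_rally, voted)
print(entropy(voted), gain)
```

Here the node starts with an entropy of 1 bit (two equally frequent classes), and the split removes all of it because each child node is pure.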

Data
The Asian Barometer Survey (ABS) collects public opinion on issues such as political values, democracy, and governance in 14 East Asian countries. The datasets used in this study are therefore secondary data drawn from the third and fourth waves of the ABS, which were carried out in October 2010 and 2014 respectively. These surveys were usually carried out 17 or 18 months after Malaysia's general election. Figure 1 shows the experiment flow in this study. The first stage is preprocessing the data.

Experimental Setting
Missing values arise mainly because respondents refused or were unable to answer a survey question. They are usually recorded as "No Response" (NR) or "Don't Know" (DK) and are filtered from the dataset. The data are then partitioned into a training set and a testing set in a 70:30 ratio. The records in the training set are selected by simple random sampling, and the testing set is used to validate the patterns generated from the training sample.
The third stage is building the classification models from the datasets. The classification models are CART, CHAID, and C5.0. The control parameters of each model are summarized in Table 3, Table 4, and Table 5 respectively. Lastly, the models are evaluated through accuracy, sensitivity, specificity, positive predictive value (precision), negative predictive value, and area under the ROC curve (AUC).

Parameter | Value | Description
… | … | A logical to toggle whether the final, global pruning step is needed to simplify the tree.
CF | 0.5 | A number in (0, 1) for the confidence factor. Lower values will likely prune away the leaves that over-specify the classification.
Mincases | 7 | An integer for the smallest number of samples that must be put in at least two of the splits.
fuzzyThreshold | TRUE | A logical toggle to evaluate possible advanced splits of the data.
earlyStopping | TRUE | A logical to toggle whether the internal method for stopping boosting should be used.

Figure 1. Experiment Setup
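The preprocessing and partitioning described in this section (dropping NR and DK answers, then a 70/30 simple random split) can be sketched in plain Python. The record layout, field names, and seed are hypothetical and are not the ABS variable names or the settings used in the study.

```python
import random

MISSING_CODES = {"NR", "DK"}  # "No Response" and "Don't Know"

def drop_missing(records):
    """Keep only records in which no answer is coded as NR or DK."""
    return [r for r in records if not any(v in MISSING_CODES for v in r.values())]

def train_test_split(records, train_fraction=0.7, seed=0):
    """Simple random sampling without replacement into training and testing sets."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records; the field names are illustrative only.
records = [{"age": 20 + i, "voted": "yes" if i % 5 else "no"} for i in range(10)]
records.append({"age": "NR", "voted": "yes"})
records.append({"age": 33, "voted": "DK"})

clean = drop_missing(records)
train, test = train_test_split(clean)
print(len(clean), len(train), len(test))
```

Every complete record ends up in exactly one of the two sets, so the testing set stays unseen while the tree is grown.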

Evaluation Metrics
The evaluation metrics used to measure the performance of the classification models in this study are the confusion matrix and the area under the receiver operating characteristic (ROC) curve (AUC).

Confusion Matrix
A confusion matrix is a table that compares the classification results with the actual values. Table 6 shows a sample confusion matrix. True negatives (TN) and true positives (TP) are the numbers of voters whose predicted class matches the actual class, whereas false negatives (FN) and false positives (FP) are the numbers of voters that have been incorrectly classified. Besides the confusion matrix, the performance of a predictive model is measured by accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. The formulas for these performance metrics are summarized in Table 7.
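The five performance metrics can be computed directly from the four cells of a confusion matrix. The sketch below uses the standard definitions, which are assumed to match those in Table 7, and the counts are hypothetical rather than taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard performance metrics derived from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts for 100 respondents, with "voted" as the positive class.
m = classification_metrics(tp=80, fp=10, tn=5, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 0.85 while the specificity is only one third, which shows how a model can look accurate on imbalanced data and still misclassify most non-voters.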

AUC under ROC
The receiver operating characteristic (ROC) can be drawn as a two-dimensional graph in which sensitivity is plotted against one minus specificity (1 - specificity). Both axes run from 0 to 1. Each point on the ROC curve gives a true-positive rate and a false-positive rate, so the curve is a visually attractive way to summarize the accuracy of predictions. The area under the ROC curve is called the Area Under Curve (AUC). The closer the value is to 1, the better the performance of a classification model.

Figure 2(F) illustrates the rule representations of the CART, CHAID, and C5.0 classification models for wave 3 and wave 4. Age is the first splitting variable in the tree for all three classification models. In CART and C5.0, age is split into two groups, below 30 and above 30. CHAID, on the other hand, splits age into 21-30, 31-50, and above 50 years old. It is interesting to note that, in the wave 3 dataset, respondents below 30 years old who helped out or worked for a candidate or party in the election are the most likely to vote. In the wave 4 dataset, young people who attended a rally or campaign meeting have a high probability of turning out to vote. Table 8 and Table 9 summarize model performance for the wave 3 and wave 4 datasets respectively.

Figure 2(A) -
Among the three algorithms, CHAID performs best, with accuracies of 84.03% and 85.57% on the wave 3 and wave 4 datasets respectively. Although the CART model has high sensitivity (97.31%) on the wave 3 dataset, its specificity is very low (33.85%). On the wave 3 dataset, C5.0 has the highest specificity, at 60%, but also the lowest negative predictive value. On the wave 4 dataset, C5.0 produces the highest true-positive rate, at 97.15%, but has a very low specificity, at 13.33%. In other words, C5.0 predicts that 86.67% of the respondents who did not vote would turn out. The specificity of CHAID is better than that of CART and C5.0, although its sensitivity is slightly lower. Most models perform fairly well in terms of AUC on the wave 3 and wave 4 datasets. AUC values range from 0.5 to 1.0: values of 0.9 to 1.0 indicate the best performance, whereas 0.5 means no predictive value.
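To make the AUC scale concrete, the sketch below computes the AUC as the share of (voter, non-voter) pairs in which the voter receives the higher predicted score, counting ties as one half. This pairwise form is equivalent to the area under the curve of sensitivity against 1 - specificity. The scores and labels are hypothetical and are not results from this study.

```python
def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each pair counts 1 if the positive scores higher and 0.5 if the scores tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of voting and actual outcomes (1 = voted).
scores = [0.9, 0.8, 0.7, 0.6, 0.4]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels))

# A model that gives every respondent the same score has no predictive value.
print(auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

The second call returns 0.5, the "no predictive value" end of the scale, while a model that ranks every voter above every non-voter would reach 1.0.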

Limitation and Future Research
This study has a few limitations. First, only three theoretical models were applied in examining the relationship between voter turnout and political participation. Future research on turnout behaviour could include other theoretical models such as cultural modernization and rational choice theory. Methodologically, both datasets are imbalanced. To illustrate, voter turnout in the 2013 data is 84%; that is to say, only 16% did not vote in the 2013 GE. Future studies should take this into account and apply feasible approaches such as resampling methods. Furthermore, more classification algorithms, such as Support Vector Machine (SVM), random forest, and boosted C5.0, can be used and evaluated to predict voter turnout.

Ethics Approval
Not Applicable

Consent to Participate
Not Applicable

Availability of Data and Material
Not Applicable

Competing Interests
To the best of my knowledge and belief this paper contains no material previously published by any other person except where due acknowledgment has been made. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors have no conflicts of interest to declare that are relevant to the content of this article.

This research was funded by Universiti Utara Malaysia (UUM) through the University Research Grant Scheme.

Authors' Contributions
Kueh Chong Hua extracted the data from Election Commission Malaysia. All authors conducted the analyses, wrote, and reviewed the manuscript.