Telecom Churn Prediction Using Seven Machine Learning Experiments Integrating Feature Engineering and Normalization

Abstract – Machine learning and deep learning classification have become important topics in the area of telecom churn prediction. Researchers have devised efficient experiments for churn prediction and given the telecommunication industry a new direction for retaining its customers. Companies are eagerly developing models for predicting churn and putting effort into saving potential churners. For a better churn prediction model, therefore, finding the factors behind churn is very important. This study aims to find the factors behind users' churn by evaluating their past service usage details. For this purpose, the study takes advantage of feature importance, feature normalisation, feature correlation and feature extraction. After feature selection and extraction, this study performs seven different experiments on the dataset to bring out the best results and compares the techniques. The first experiment is a hybrid model of Decision Tree and Logistic Regression; the second uses a deep learning technique, CNN-VAE (Convolutional Neural Network with Variational Autoencoder); the third combines PCA with Logistic Regression and LogitBoost; and the fourth, fifth, sixth and seventh experiments use Logistic Regression, LogitBoost, XGBoost and Random Forest respectively. The first three experiments are hybrid models and the rest use standalone techniques. The Orange dataset was used, which has 3333 subscriber entries and 21 features. These experiments are also compared with existing models developed in the literature. Performance was evaluated using accuracy, precision, recall, F-measure, confusion matrix, macro average and weighted average. This study obtained better results than the earlier models: Random Forest outperformed all others by achieving 95% accuracy, and the other experiments also produced very good results. The study demonstrates the importance of data mining techniques for a churn prediction model and proposes a comparison framework in which standalone machine learning techniques, a deep learning technique and hybrid models with feature extraction tasks are all evaluated on the same dataset, so that the techniques' performance can be compared better.


Introduction
In the present world, digital media has become a powerful tool for managing large data, especially in the telecom industry, where there is an essential need to store large datasets. A huge volume of data is being generated by telecom companies at an exceedingly fast rate [8]. The data generated in these companies are bulky, and managing them and extracting information from them is a major challenge.
Data mining resolves this issue. Data mining is the process of analysing data from various aspects and summarizing it into valuable information [4]. Since the early 1960s, data mining techniques have been considered an area of applied artificial intelligence [3]. A large number of data mining techniques are available to find hidden knowledge in customer data, among them clustering, classification, attribute selection and association. A churn prediction model is based purely on customers' past service usage behaviour data. Telecom companies develop churn prediction models to increase their client share, maximise profit and stay active in a competitive environment. Consumer churn is switching from one service provider to another. In today's competitive environment, customers have multiple options for better services and prices.
There are multiple reasons for customer churn. Unlike post-paid customers, prepaid customers are not bound to a service provider and may churn at any time [8]. Customer churn normally happens due to lack of engagement, lack of promotions or new offers, lack of customer service support, high call or SMS charges, unpaid bills, fraud or misuse of services, and change of location. When the number of customers drops, it causes major revenue loss. A churn prediction model uses a telecom database for prediction: it analyses customers' behaviour and predicts the future churners. Telecom databases run into terabytes and petabytes with large numbers of attributes, so modelling these complex datasets requires advanced data science models.
There has been huge advancement in the fields of big data and machine learning, and consequently many models have been developed. Researchers have developed and compared different machine learning techniques in their models.
Research [3] contributed a churn prediction model to assist telecom companies in predicting customers who are near to churn. That research compared the machine learning techniques XGBoost, Decision Tree, Random Forest and Gradient Boosted Machine Tree, and analysed the factors that played an important role in customer churn through feature engineering and selection. [8] also identified the factors that lead to customer churn by selecting features using correlation attributes, ranking attributes and information gain. These studies proved that factor identification is useful for churn prediction models. Research [10] showed that the chosen data preparation techniques affect churn prediction performance, and that Logistic Regression is competitive with advanced single and ensemble data mining techniques. [12] showed that customer misclassification, the amount of service used and some demographic attributes play an important role in customer churn; that research used binomial Logistic Regression for the prediction.
Research [9] used different data mining techniques for churn prediction and compared them using several evaluation metrics, and also worked on extracting dataset features. DT handles interaction effects between variables very well but has difficulty handling linear relations between variables. For LR the opposite is true: it handles linear relations between variables very well but does not detect and accommodate interaction effects between variables [1].
This study brings multiple feature engineering and selection functionalities together in one place and works on improving model performance. It draws on literature models, identifies new work, applies multiple feature analysis tasks to improve performance, and finally compares the resulting models with each other and with the literature. This study uses a correlation matrix, feature engineering, feature importance, handling of categorical features, handling of continuous features and feature normalisation, and feeds this transformed, informative dataset to four different hybrid models and five standalone techniques. Some hybrid models already include feature extraction functionality, so this adds a double feature extraction capability to the model. The idea of comparing hybrid and standalone techniques is very helpful for future research, and the double feature selection process genuinely improved performance. This paper is organized as follows: Section II is a literature review highlighting work already done by researchers; Section III details the proposed work, dataset and methodologies leveraged in this study; Section IV presents results and discussion; Section V concludes the paper, detailing what has been accomplished and what is planned for the future.

II. Literature Review
Telecom churn prediction is a crucial factor for companies to be concerned about, and much work has been done on it. In the literature, many techniques and methods have been used in prediction models, machine learning and data mining being the most common approaches. Most researchers have added one technique for knowledge gain and one technique for prediction, and many studies included factor identification at most. Some of the literature is compared and discussed in this section. Table 1 shows a state-of-the-art comparison of the literature work that this study builds on and aims to improve.

III. Proposed Work
This study proposes a good example of KDD. KDD (Knowledge Discovery in Databases) is defined as the "non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data" [17]. In churn prediction the customer's service usage data is very useful for prediction, but telecom companies hold bulk data, so there is a strong need to select important features.
Running a churn prediction model on the selected features makes prediction easy for the model and also saves time.
Figure 1 shows this study using feature importance, a correlation matrix, and handling of categorical and continuous features for feature extraction. This study runs multiple experiments, some of them hybrid methods and some single techniques. Later in the study, the experiments' performance is compared before and after feature selection and against similar literature work. Every prediction model starts the process with dataset acquisition and processing.

• Dataset
This study used the Orange dataset, which is publicly available on a data repository website. The dataset contains 3333 subscriber entries and 21 attributes: one target attribute, "Churn", 20 independent attributes, and 483 churner entries. All the independent fields are predictor attributes that will be used to find the target attribute; the model must be trained with the predictor attributes that possess the most information about the target attribute [7].

• Noise Removal and Filtering
Noise removal is an important phase in a prediction model. A dataset may initially contain missing values, outliers and null values that prevent machine learning from executing the process. When an attribute has a large share of missing or unknown values, that attribute is removed from the dataset; this dataset has no missing values, so no steps for missing values were taken. An outlier detection method is applied to the dataset. Outliers are unusual values, typically defined as being more than three standard deviations away from a variable's mean value [1]; values outside three standard deviations are transformed into accepted values. The next step in noise removal is handling NA values: all NA values are replaced with the mean of their column, so that rows with NA values stay in the average range (having many 0 values could degrade performance). One more essential pre-processing step is removing unique attributes. The Orange dataset has one unique attribute, "Phone"; such attributes are not used in the training process because they have a direct correlation with the target output, being specific to the customer itself [3].
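As a hedged illustration of these cleaning steps, the pandas sketch below clips values beyond three standard deviations, imputes NA values with column means and drops the unique "Phone" attribute; the file name is an assumption based on the description above.

```python
import pandas as pd

df = pd.read_csv("orange_churn.csv")  # hypothetical file name

# Drop the unique identifier attribute: it is specific to each customer.
df = df.drop(columns=["Phone"])

numeric_cols = df.select_dtypes(include="number").columns
for col in numeric_cols:
    mean, std = df[col].mean(), df[col].std()
    # Clip outliers to the accepted range of three standard deviations.
    df[col] = df[col].clip(lower=mean - 3 * std, upper=mean + 3 * std)
    # Replace NA values with the column mean so they stay in the average range.
    df[col] = df[col].fillna(mean)
```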

• Correlation
The correlation attribute ranking filter technique is used for selecting a subset of relevant features [8]. Correlation is used to find the variables or attributes that are related in the sense of dependency or association, and it is also a good way to impute missing values by predicting one attribute from another. This study used a correlation matrix and reduced the data dimensionality. Figure 2 depicts the correlation present in the dataset and clearly shows the relationships between the features: "Vmail Plan", "Day Charges", "Night Charges" and "International Charges" are correlated with "Vmail Messages", "Day Mins", "Eve Mins", "Night Mins" and "Intl Mins" respectively. These attributes depend on each other, so keeping all of them in the dataset is not worthwhile. In this study one attribute is removed from each pair, based on the feature importance discussed later.
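A minimal sketch of this step, assuming the pandas DataFrame `df` prepared above: compute the correlation matrix and inspect the strongly correlated pairs before dropping one attribute from each (the 0.9 cut-off is an assumption).

```python
# Correlation matrix over the numeric features (Pearson by default).
corr = df.select_dtypes(include="number").corr()

# List feature pairs whose absolute correlation exceeds a threshold.
threshold = 0.9  # assumed cut-off for "strongly correlated"
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(f"{a} <-> {b}: {corr.loc[a, b]:.2f}")

# One attribute of each correlated pair is later dropped by feature importance.
```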
• Handling Categorical and Continuous Variables
In a telecom dataset a number of categorical features exist; they may store useful information about customers, so these features are essential for the churn prediction model. Some categorical features may have no importance for prediction; features with little or no importance can be removed from the dataset, but removing categorical features of high importance will decrease model performance. Machine learning algorithms cannot handle these variables directly, so they need to be handled in such a way that model performance may increase. Encoding a categorical variable is a good approach. The dataset has two categorical variables, "State" and "Area". The "Area" feature is converted to 3 dummy features assigned 0 or 1 values, and the "State" feature is converted into 52 dummy features assigned 0 or 1 values, after which the dataset has 73 features in total. Dummy variables are created based on the values of the categorical variable, so the number of dummy variables depends on the number of values a feature has. Later in this study feature importance is checked for all features, and based on it dummy features are removed or kept for the prediction. A sketch of this encoding step follows.
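A hedged sketch of the one-hot encoding described above, assuming the categorical columns are named "State" and "Area Code" as in the dataset description:

```python
# One-hot encode the categorical variables into 0/1 dummy features.
df = pd.get_dummies(df, columns=["State", "Area Code"], dtype=int)
print(df.shape)  # the description above reports 73 features after encoding
```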
As stated in Table 2 there are 17 continuous features, and all of them have different value ranges. Therefore there is a real need to normalise all continuous fields: values in widely different ranges cause problems for machine learning algorithms. Normalisation is a scaling technique that scales feature values into the range 0 to 1 using a min-max scaler, which takes the maximum and minimum values of a feature and scales each value accordingly:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X_{max}$ and $X_{min}$ are the maximum and minimum values of the feature respectively. If $X$ is the minimum value of the feature, the numerator is 0 and $X'$ will be 0; if $X$ is the maximum value in the feature column, the numerator equals the denominator and the value will be 1; otherwise the value lies in the range 0 to 1. After creating the new dummy features and the new normalised features, the original features are removed from the dataset.
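As a minimal illustration, min-max scaling can be applied with scikit-learn's MinMaxScaler; the column list here is an assumed subset of the 17 continuous features.

```python
from sklearn.preprocessing import MinMaxScaler

continuous_cols = ["Day Mins", "Eve Mins", "Night Mins", "Intl Mins"]  # subset, for illustration

# X' = (X - X_min) / (X_max - X_min), applied per column.
scaler = MinMaxScaler()
df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
```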

• Feature Importance
One key purpose of a churn prediction model is to find the factors behind why customers churn. Feature importance is a very good technique to visualize the importance of each feature in the dataset, and Random Forest provides a very good ranking method called "feature importance". Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node, where the probability of a node is the number of samples that reach the node divided by the total number of samples; the higher the calculated value, the higher the importance. For a single tree, the importance of node $j$ is

$$ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)}$$

where $ni_j$ is the node importance, $w_j$ is the weighted number of samples reaching node $j$, $C_j$ is the impurity value of the node, and $left(j)$ and $right(j)$ are the left and right child nodes of the split. The importance of feature $j$ is then calculated as

$$fi_j = \frac{\sum_{i\,:\,\text{node } i \text{ splits on feature } j} ni_i}{\sum_{i \in \text{all nodes}} ni_i}$$

where $fi_j$ is the importance of feature $j$ and $ni_i$ is the importance of node $i$. After calculating the importance of all features, the values are normalised into the range 0 to 1:

$$normfi_j = \frac{fi_j}{\sum_{k} fi_k}$$

At the end, the final importance of a feature at the Random Forest level is calculated as its average over all the trees: the sum of the feature's importance on each tree divided by the total number of trees,

$$RFfi_f = \frac{1}{K}\sum_{i=1}^{K} normfi_{f,i}$$
Here $RFfi_f$ is the final feature importance, $normfi_{f,i}$ is the normalized importance of feature $f$ in tree $i$, and $K$ is the total number of trees. The feature importance in this study is visualized in Figure 3, which shows that the dummy features created by encoding the "State" feature add 52 new attributes to the dataset, none of which has much importance for the churn model; therefore the "State" feature is removed from the dataset before category encoding. The "Area Code" feature, by contrast, gives the dataset three new features with good importance, so "Area Code" is not removed. On the other hand, correlation analysis showed that "Vmail Plan", "Day Charges", "Night Charges" and "International Charges" are correlated with "Vmail Messages", "Day Mins", "Eve Mins", "Night Mins" and "Intl Mins" respectively; therefore, to remove features from the dataset effectively, features are removed based on their importance.
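A hedged sketch of extracting this ranking with scikit-learn, assuming `X` holds the engineered features and `y` the "Churn" target:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["Churn"])
y = df["Churn"]

# Fit a forest and read its impurity-based importances
# (scikit-learn already normalises them to sum to 1).
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```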

Experiments
• Decision Tree with Logistic Regression (Hybrid LLM)
Logistic Regression and Decision Tree are both very popular techniques, well known for good prediction and comprehensibility, and both are among the most used techniques in the literature. Alongside their strengths, both techniques have flaws. DT handles interaction effects between variables very well but has difficulty handling linear relations between variables; for LR the opposite is true: it handles linear relations between variables very well but does not detect and accommodate interaction effects between variables [1]. In this study the two techniques are combined to use the strength of both. The decision tree first splits the dataset into subsets based on similarity, and on each subset a Logistic Regression is fit for classification. The first experiment thus consists of two phases: the first phase uses a decision tree and returns homogeneous customer segments, and the second phase uses Logistic Regression for prediction on each customer segment. A decision tree recursively splits the data into smaller and purer subsets by repeatedly applying a greedy search through the space of possible decision tree branches and choosing optimal splits based on a splitting criterion [1], resulting in a disjoint subdivision of the training set $S$ into customer subsets $S_t$, where every subset is represented by a leaf $t$ of the tree. Decision trees use pruning to overcome overfitting, which occurs when repeated splitting makes the model overly complex. Logistic Regression is a single classification technique widely used for churn prediction and has proved to give very good prediction results in this field. This experiment was done on the R platform using the Orange dataset, divided into training and test sets in a 75:25 ratio.
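The study implemented this experiment in R; the following is a hedged Python sketch of the same idea, segmenting customers by decision tree leaf and fitting one Logistic Regression per segment (depth and other settings are assumptions):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Phase 1: a shallow tree partitions customers into homogeneous segments (leaves).
segmenter = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
train_leaves = segmenter.apply(X_train)

# Phase 2: fit one logistic regression per customer segment.
segment_models = {}
for leaf in set(train_leaves):
    mask = train_leaves == leaf
    if y_train[mask].nunique() > 1:  # a segment may already be single-class
        segment_models[leaf] = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])

# Prediction routes each test customer to its leaf's model
# (or to the leaf's majority class when no model was fit).
```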
• Convolutional Neural Network with Variational Autoencoder (Hybrid)
The second experiment was performed using deep learning techniques. This is also a hybrid model, implemented using a CNN (Convolutional Neural Network) and a VAE (Variational Autoencoder). A Convolutional Neural Network, also called ConvNet, is a deep learning algorithm of the artificial neural network type. The general idea behind a CNN can be explained in four steps: first convolution, second non-linearity, third pooling and finally classification [7]. The algorithm takes input (images/features) and assigns weights, or importance, to the input attributes so that they can be distinguished from one another. Churn prediction models contain one input layer that transfers all extracted features from the training set. A sigmoid function is used to calculate a weight, which is assigned to the input features; the weighted sum is sent to the activation functions in the hidden and output layers, which generate the output. Increasing the number of hidden layers can increase performance. Mathematically, a linear function $f(x; w, a) = w^{T}x + a$ is used in the input layer, whose output depends on the weight vector $w$, the bias term $a$ and the input vector $x$. The sigmoid function maps values into the range 0 to 1, which is more useful, and is given by

$$f(x; w, a) = \sigma(w^{T}x + a), \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}} \qquad (9)$$

In the last few years, deep learning based generative models have become a centre of interest. VAEs allow us to create complex generative models and fit them to large data. A variational autoencoder's encoding distribution is regularised during training, which ensures that its latent space has good properties and allows us to generate new data. A VAE automatically helps with dimensionality reduction, the process of reducing the number of features that describe the data (in classical machine learning, PCA is a very good technique for this). In a plain autoencoder, every data point is encoded as a real value, which leads to no reconstruction loss in decoding it: the autoencoder has a high degree of freedom that ensures no reconstruction loss and a low-dimensional latent space. However, this process has a major problem, overfitting: some decoded data points are meaningless. To resolve this, the model needs to ensure the latent space is regular enough; in a Variational Autoencoder the training process is regularised to avoid overfitting and make the latent space's properties more meaningful. VAEs can be implemented in Keras. VAE models are trained using a loss function that compares the original data with its reconstruction. For optimisation, a VAE is trained on a variational lower bound $\mathcal{L}$ using stochastic gradient ascent; its negative is used as the loss function, calculated as the sum of the reconstruction loss $L_{rec}$ and the Kullback-Leibler divergence loss $L_{KL}$:
$$L = L_{rec} + L_{KL} \qquad (10)$$

Here $x$ is the data to be reconstructed, $t$ is a latent space vector and $p(x|t)$ is the probabilistic decoder of the VAE. The CNN-VAE implementation was done using Keras libraries and includes the following phases: the input layers were defined by giving the dimension of the dataset; two encoder dense layers were created, each dense layer itself being a CNN (Convolutional Neural Network) that filters all the features by weighting them and generates an output layer holding the filtered feature set, changing the dimension of the input vector (a ReLU activation function was used in this dense layer and linear activation functions in the further hidden layers); the model then encodes the inputs into latent variables, with the output at any layer calculated as a mean, i.e. a centre point; finally, decoder dense layers, again a CNN, were created from the hidden layers and decode the output in the output layer.
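As a minimal, hedged sketch of such a model, patterned on the classic tf.keras VAE example (dense layers only; the study's convolutional layers and exact hyperparameters are not specified, and the input dimension is assumed from the encoded feature count; TensorFlow 2.x, pre-Keras-3, is assumed):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 73, 2  # 73 assumed from the encoded feature count

# Encoder: dense layers compress the features to a latent mean and log-variance.
inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(32, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

# Reparameterisation trick: z = mean + sigma * epsilon, epsilon ~ N(0, 1).
def sample(args):
    m, lv = args
    eps = tf.random.normal(tf.shape(m))
    return m + tf.exp(0.5 * lv) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder: reconstruct the normalised feature vector from the latent code.
outputs = layers.Dense(input_dim, activation="sigmoid")(layers.Dense(32, activation="relu")(z))

vae = keras.Model(inputs, outputs)

# Loss = reconstruction loss + Kullback-Leibler divergence, as in Eq. (10).
rec = tf.reduce_sum(tf.square(inputs - outputs), axis=-1)
kl = -0.5 * tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
vae.add_loss(tf.reduce_mean(rec + kl))
vae.compile(optimizer="adam")
# vae.fit(X_train, epochs=30, batch_size=64)  # X_train: min-max normalised features
```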

• PCA with Logistic Regression and LogitBoost
In the third experiment, PCA was used for further dimensionality reduction on the Orange dataset, and Logistic Regression and LogitBoost were used for prediction. PCA (Principal Component Analysis) is mainly used for dimensionality reduction: it transforms a large set of features into a smaller one that still contains most of the information relevant to churn prediction. A smaller dataset is easier to explore and visualise and makes the machine learning process easier; in PCA, dimensionality reduction trades a little accuracy for simplifying the process. PCA proceeds in a few steps. First, standardization: the ranges of the continuous features are standardized so that each feature contributes equally to the process; mathematically this works by subtracting the mean and dividing by the standard deviation for each value of each variable.
Second, covariance matrix computation: this identifies the relationships between variables so that variables carrying similar information and effect can be handled and redundancy removed.
Third, identifying the principal components, which are constructed by computing the eigenvectors and eigenvalues of the covariance matrix. PCA gives a new angle from which the data can be evaluated and seen effectively. The dataset produced by PCA is then given for prediction to two powerful techniques, Logistic Regression and LogitBoost. LogitBoost is an additive logistic regression model, similar to the AdaBoost model; the main idea behind LogitBoost is to apply boosting to building a logit model. LogitBoost repeatedly takes different training examples so that a "weak" or "base" learning algorithm generates a new weak prediction rule in each round, and the boosting algorithm then converts these weak rules into one strong prediction rule that normally becomes much more accurate than any of the weak rules. The difference between AdaBoost and LogitBoost lies in the weak classifier used. An additive logistic model forms the equation

$$F(x) = \sum_{m=1}^{M} f_m(x)$$
The monotonic logit transformation on the left side ensures that, for any value of $F(x) = \sum_{m=1}^{M} f_m(x) \in \mathbb{R}$, the estimated probability lies in the range 0 to 1. This is the fitting of additive logistic regression by stagewise optimisation of the log-likelihood. Denoting the probability that $y = 1$ by $p(x)$, inverting gives

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} \qquad (16)$$

Logistic regression is one machine learning algorithm that is not a black box model: black box models are normally complex, but with logistic regression we understand what it actually does. Logistic regression can be binary, multinomial or ordinal; in our case it is binary. Logistic regression takes real-valued inputs and predicts which class the input belongs to: if the predicted probability is greater than 0.5 the output is taken as class 1, otherwise as class 0 (here class 0 refers to non-churners and class 1 to churners). Logistic regression is achieved by taking the log odds

$$\ln\!\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

where $P$ is the probability of being churn or not churn; $P$ always lies in the range 0 to 1.
Here the $\beta$ are the coefficients to be learnt and $x_1, \dots, x_n$ are the independent variables; churn is the dependent variable and the remaining features are the independent variables. Taking the exponent of both sides:

$$\frac{P}{1 - P} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}$$

In machine learning, the coefficient values are estimated using stochastic gradient descent: the model computes a prediction for each instance in the training set, calculates the error of each prediction, and continues this process, updating the coefficients, until the model is accurate enough. For updating the coefficients, the standard update

$$\beta_j \leftarrow \beta_j + \alpha \,(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_j$$

is used, where $\alpha$ is the learning rate, $y$ the true label and $\hat{y}$ the current prediction. This experiment was performed in Weka; in this experiment the Orange dataset is evaluated twice, once in the earlier feature selection phase and a second time in PCA.
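The study ran this experiment in Weka; a hedged scikit-learn sketch of the same pipeline (standardize, project with PCA, then classify; LogitBoost has no scikit-learn equivalent, so only Logistic Regression is shown, and the variance threshold is an assumption) might look like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca_lr = Pipeline([
    ("scale", StandardScaler()),        # step 1: standardization
    ("pca", PCA(n_components=0.95)),    # steps 2-3: keep components explaining 95% variance
    ("lr", LogisticRegression(max_iter=1000)),
])
pca_lr.fit(X_train, y_train)
print(pca_lr.score(X_test, y_test))
```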

• Experiments on Single Techniques:
All the experiments discussed above are hybrid experiments. In this set of experiments, several standalone techniques were applied for prediction on the evaluated Orange dataset: Logistic Regression, LogitBoost, SVM, XGBoost and Random Forest. Logistic Regression and LogitBoost were discussed above in combination with PCA; here both techniques are used alone, and both performed well individually as well as with PCA. The next classification technique used is SVM, a powerful algorithm based on mapping training data points into a higher-dimensional space; this mapping is accomplished using a nonlinear function, after which the SVM performs linear regression in that space [5]. SVM does not minimise the training error but works on the generalisation error by minimising its upper bound. SVM plots the data points in an n-dimensional space, where n is the number of features the dataset has, then separates the data into two classes by defining a hyperplane. Support vectors are the coordinates of the observations that help build the SVM, and while building it the margin between these coordinates and the hyperplane is maximised. The (hinge) loss is calculated by

$$L\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y \cdot f(x)\bigr)$$

The loss is 0 when the actual and predicted values agree in sign (the point is on the correct side of the margin); otherwise the loss is calculated as above. To balance margin maximisation and loss, a regularisation parameter $\lambda$ is added to the cost function:

$$J(w) = \lambda \lVert w \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr)$$

After the loss function, partial derivatives are taken to find the gradients with respect to the weights, and the gradients are used to update the weights.

$$\frac{\delta}{\delta w_k} \lambda \lVert w \rVert^2 = 2\lambda w_k \qquad (22)$$

To train an SVM, two main parameters are required: C and sigma. The C parameter affects the prediction and indicates the cost of a penalty: a large value of C means high accuracy in training but low accuracy in testing, while a small value of C indicates unsatisfactory accuracy [14]. The sigma value strongly influences the partitioning of the hyperplane: a large value of sigma leads to overfitting and a small value to underfitting.
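A hedged sketch of setting these two parameters with scikit-learn's SVC, where `gamma` plays the role of the sigma-controlled kernel width (the exact values are assumptions):

```python
from sklearn.svm import SVC

# C trades training accuracy against generalisation; gamma controls the RBF kernel width.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```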

• XGBoost
Both AdaBoost (Adaptive Boosting) and Stochastic Gradient Boosting are ensemble algorithms based on the idea of boosting: they try to convert a set of weak learners into a stronger learner [14]. Boosting methods differ from Random Forests in following a constructive ensemble formation strategy; the idea behind boosting is to add new learning models continuously while building the ensemble [2]. XGBoost, the abbreviation of eXtreme Gradient Boosting, is an optimised distributed gradient boosting library designed to be highly efficient, flexible and portable; a primary reason for using XGBoost is its execution speed [16]. Gradient descent minimises a differentiable function, whereas in gradient boosting the average gradient components are computed. For each terminal node in the tree there is a factor $\gamma$ by which each learner $h_m(x)$ is multiplied; this accounts for the differing impact of each branch's split. Gradient boosting predicts the optimal gradient for the additive model, unlike classical gradient descent techniques, which reduce the output error at each iteration. Gradient boosting works in the following steps: the gradient of the loss function is computed iteratively;
each $h_m(x)$ is fit on the gradient obtained at each step; the multiplicative factor $\gamma_m$ for each terminal node is derived; and the boosted model is then given as

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

XGBoost is also known as regularised boosting: it helps reduce overfitting and performs parallel tasks that make it faster. The boosting methods differ from Random Forests in following a constructive ensemble formation strategy [2]. XGBoost adds new learning models continuously while building the ensemble: after every iteration a cumulative error is considered, and based on that error a new basic weak learner is trained. This experiment was performed in Python.
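Since the study ran this experiment in Python, a minimal sketch with the xgboost package (hyperparameters assumed, labels assumed encoded as 0/1) could be:

```python
from xgboost import XGBClassifier

# Gradient-boosted trees: each new tree is fit on the gradient of the loss.
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=4, eval_metric="logloss")
xgb.fit(X_train, y_train)  # y_train encoded as 0 (non-churner) / 1 (churner)
print(xgb.score(X_test, y_test))
```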

• Random Forest
Random Forest is an ensemble learning method that operates with multiple decision trees: the final decision is made by the majority vote of the trees, as chosen by the Random Forest.

Figure 7: Random Forest Structure.
Random Forest belongs to the family of classifiers that populate a forest of decision trees. In Random Forest, bootstrap aggregation, or bagging, is applied to the tree-learning training algorithm: given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly ($B$ times) draws random samples with replacement from the training set and fits trees to these samples. For $b = 1, \dots, B$: sample with replacement $n$ training examples from $X$ and $Y$, calling them $X_b$ and $Y_b$, then train a classification tree $f_b$ on $X_b, Y_b$. Predictions for unseen samples $x'$ are made by averaging the predictions of all the individual trees on $x'$:

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

Predictions on unseen samples $x'$ can also be made by majority vote. The bootstrap procedure of Random Forest improves model performance and does not increase the bias.
• Parameters: When fitting a Random Forest, some parameters need to be given, and tuning these parameters is essential for improving model efficiency. They are listed below (a minimal sketch using these settings follows the list): a) bootstrap: tells the Random Forest how to sample data points, i.e. whether samples are drawn with or without replacement; in this study it is set to False so that all samples are used. b) n_estimators: sets the number of trees. A larger number of trees generally gives better performance, but very large values can lead to overfitting; in this study the default value of 100 was used, since values above 100 did not affect performance and lower values decreased it.

c) max_depth: tells the Random Forest the maximum number of levels of each decision tree; in this study it is set to 20. d) max_features: the maximum number of features considered when splitting a node; it is set to 'auto'.
e) min_samples_split: the minimum number of data points that must be in a node before it is split; this is set to 2. f) min_samples_leaf: the minimum number of data points allowed in a leaf node; this is set to 1.
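A hedged sketch with the parameters described above (in recent scikit-learn versions 'auto' is no longer accepted for max_features in classification, so 'sqrt', its old equivalent, is used here):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    bootstrap=False,        # use every sample rather than bootstrap resamples
    n_estimators=100,       # number of trees
    max_depth=20,           # maximum levels per tree
    max_features="sqrt",    # features considered per split ('auto' in older versions)
    min_samples_split=2,    # minimum points in a node before splitting
    min_samples_leaf=1,     # minimum points allowed in a leaf
    random_state=42,
)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
```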
Random Forests can easily be deployed in a distributed manner because of their parallel execution, while gradient boosted trees cannot, as they execute trial after trial [2].

IV. Results and Discussion
In this section the result visualisation is presented. The results are analysed to compare the performance of the experiments done on the Orange dataset; all experiments were performed on the same dataset so that the results can be analysed better. This study deals with an unbalanced dataset and compares results in two scenarios: in the first scenario the feature engineering and feature selection tasks are not performed, and in the second scenario the feature extraction task is performed using feature engineering. In the first part, performance is evaluated using the accuracy score of all models and compared with similar existing models from the literature review; in the second part all the models are compared on other performance measures; and in the third part all models are compared using a confusion matrix.

• Accuracy
Accuracy indicates the ability to differentiate credible and non-credible cases correctly [14]: it is the proportion of true positives and true negatives among all predicted instances. Figure 8 shows the accuracy comparison of all developed models before and after the feature engineering (FE) and extraction tasks, compared with the accuracy achieved in the literature on the same models. Another performance measure added in this study is the macro average, which is used when the overall performance across all classes must be checked; the macro average is calculated simply by averaging the precision and recall achieved on all classes, and the macro-averaged F-score is the harmonic mean of the macro-averaged precision and recall. The weighted average is also an important performance metric for a machine learning model and likewise describes overall performance; it is calculated for precision, recall, F-score and support by taking each class's value and weighting it by the number of instances of that class. This study works on a two-class classification problem, so the weighted average is calculated over the two classes. Overall, Random Forest proved to be the best standalone technique for a churn prediction model. Random Forest (RF) is a useful algorithm that suits classification and can handle nonlinear data very efficiently; RF produces better results, accuracy and performance compared with the other techniques [12]. On the other hand, CNN with VAE proved to be the best hybrid technique for churn prediction models.
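For reference, a hedged sketch of how these per-class, macro and weighted figures can be obtained in scikit-learn:

```python
from sklearn.metrics import classification_report

y_pred = rf.predict(X_test)
# Per-class precision/recall/F-score plus macro and weighted averages.
print(classification_report(y_test, y_pred, digits=2))
```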

• Confusion Matrix
Given the number of categories C, a confusion matrix represents the results of a machine learning model in a C×C tabular format that displays the record counts by their actual and predicted classes. A confusion matrix is used to evaluate a classification model, not a regression model; it categorises the outcomes into two or more categories and is used to calculate performance measures such as precision, recall, F-score and error rate.
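A minimal sketch of deriving these measures from the 2×2 confusion matrix, assuming churn is the positive class:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)   # correct churn predictions among predicted churners
recall = tp / (tp + fn)      # churners correctly identified among actual churners
f_score = 2 * precision * recall / (precision + recall)
print(precision, recall, f_score)
```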

V. Conclusion
This study presents a very good comparison model for customer churn prediction in telecommunications using a wide variety of machine learning and deep learning techniques. Additionally, it sets a very good example of feature engineering and feature extraction for a churn prediction model. It also presented a state-of-the-art comparison of all the similar literature work that was used in this study; all literature works were then compared with the corresponding models developed here using feature engineering, based on accuracy, precision, recall, F-score, support, macro average, weighted average and confusion matrix. For feature engineering this study used a correlation matrix, handled continuous features, handled categorical features and used the feature importance function of Random Forest; these feature engineering tasks contributed most to improving the accuracy of the churn prediction models. This study used two types of models, hybrid models and standalone techniques, and compared all the techniques and models with each other, comparing the predictive accuracy and comprehensibility of explicit, implicit and hybrid machine learning models for telecom churn prediction on the Orange dataset. For the machine learning models, the Weka, R and Python platforms were used.
In this paper several promising machine learning models have been identified which are suitable for learning knowledge and for decision support, and these models produced very good and understandable results. This study also used several feature engineering tasks, such as the correlation matrix, feature normalisation, feature extraction, feature importance, and handling of categorical and continuous variables, which set an example of feature engineering and proved encouraging for future research. Random Forest and CNN with VAE achieved good prediction results, but all other models got similar results that still need to be improved. Future research may include two or more large datasets.

Figure 1: Proposed Model for Customer Churn Prediction.

Figure 2: Correlation representation of the dataset Orange.

Figure 3: Feature importance of the dataset features.

Figure 4: Structure of experiment one using Logistic Regression with Decision Tree.

Figure 5: Representation of Second Hybrid Model using CNN with VAE.

Figure 9: Macro Average and Weighted Average result comparison of all models.


Table 2: Orange Dataset Description.

Table 3: Data pre-processing handled based on Feature Importance.

Figure 8: Accuracy comparison of all models before and after FE, with literature work.

It can be seen that the maximum accuracy, achieved by the ensemble model Random Forest (RF), is 93% before FE and 95% after FE among the standalone techniques, where in literature it was 80%. Logistic Regression, LogitBoost and XGBoost achieved 85% accuracy before the FE tasks; after FE these techniques achieved 86%, 89% and 88% accuracy respectively, where the accuracies reported in literature were 79%, 87% and 78% respectively. SVM achieved 85% accuracy before FE and 89% after FE, where in literature SVM achieved 86%. The standalone technique Random Forest outperformed all others in terms of accuracy.

Among the hybrid techniques, the maximum accuracy, achieved by CNN with VAE (Convolutional Neural Network with Variational Autoencoder), is 88% before FE and 90% after FE; in literature the accuracy achieved on the same hybrid model was 76%. The Decision Tree with Logistic Regression (DTLR) achieved 87% accuracy before FE and 88% after FE; the literature model of DTLR achieved 80% accuracy. PCALR and PCALB both achieved 85% accuracy before FE, and 87% and 88% after FE respectively. Feature engineering plays an important role in correct prediction, as can be seen from the results. Accuracy is a good performance measure but is not by itself enough to establish that a model performs well, so more performance measures are essential to add. In the next part the models are compared according to further performance measures.

Table 4 shows all performance measures achieved by all models before and after FE. Among the hybrid models, CNN with VAE outperformed with 0.92 precision, 0.98 recall and 0.94 F-score before feature engineering; after feature engineering the CNN model got 0.88 precision, 0.99 recall and 0.93 F-score. Precision is how many of the predicted positives are actually positive. When a model achieves a low precision rate with high recall, or vice versa, it becomes difficult to judge model performance; in that case performance is assessed with the F-score, which uses the harmonic mean in place of the arithmetic mean, punishing extreme values more. The CNN model got a 0.93 F-score, which is a very good score, and the other hybrid models also got very good precision, recall and F-scores, though lower than the CNN model's. Among the single techniques, Random Forest outperformed with 0.93 precision, 1.00 recall and 0.96 F-score before the feature engineering process; after feature engineering Random Forest got 0.95 precision, 0.99 recall and 0.97 F-score. The other models also performed well, with good scores, given in Table 4.

Table 4: Comparison based on other performance measures.
Table 5 shows the confusion matrix achieved by the different models after feature engineering.

Table 5 : Confusion Matrix comparison.
The standalone techniques used in this study are Logistic Regression, LogitBoost, SVM, Random Forest and XGBoost. Out of all of them, Random Forest outperformed with 95% prediction accuracy after the feature engineering tasks, against 93% without them. SVM and LogitBoost got 89% prediction accuracy after feature engineering, the second highest. Random Forest also got the highest values, 0.93 precision, 1.00 recall and 0.96 F-score before the feature engineering process, and 0.95 precision, 0.99 recall and 0.97 F-score after it. The macro average and weighted average were also explained in this study and the achieved values listed. The other standalone techniques also performed well, with very good accuracy and other performance measures, but Random Forest proved to be the best standalone technique. In this study four hybrid models, LLM (Decision Tree with Logistic Regression), CNN with VAE, PCA with Logistic Regression and PCA with LogitBoost, were used for churn prediction. Out of these, CNN with VAE outperformed with 88% accuracy before feature engineering and 90% after, with 0.92 precision, 0.98 recall and 0.94 F-score before feature engineering and 0.88 precision, 0.99 recall and 0.93 F-score after. All hybrid models also performed well after the feature engineering tasks.