Recommender System for Telecom Packages Based on the Deep & Cross Network

Abstract: With the evolution of 5G technology, the telecom industry has significantly influenced people's livelihood and the development of the national economy. To increase revenue per customer and secure long-term user contracts, telecommunications enterprises have launched different types of telecom packages to satisfy the varying requirements of users. Although several recommender systems have been proposed in recent years for telecom package recommendation, extracting effective feature information from large and complex consumption data remains challenging. For the telecom package recommendation problem, traditional recommendation methods either rely on complex expert feature engineering or fail to perform end-to-end deep learning training. In this study, a recommender system based on the Deep & Cross Network (DCN), Deep Belief Network (DBN), Embedding, and Word2Vec is proposed, exploiting the powerful learning abilities of deep learning. The proposed system casts telecom package recommendation as click-through rate prediction, providing a potential solution to the recommendation challenges faced by telecom enterprises. The proposed model can effectively capture finite-order feature interactions and deep hidden features. Additionally, the text information in the data is fully exploited to further improve the recommendation ability of the model. Moreover, the proposed method does not require manual feature engineering. We conducted comprehensive experiments on real-world datasets, and the results verify that our method achieves better recommendation accuracy than the DBN, DCN, deep factorization machine (DeepFM), and deep neural network (DNN) models individually.


Introduction
The telecom industry has become a critical support industry, serving almost all aspects of people's livelihood and the national economy. In recent years, several deep learning-based models have been developed for recommendation. The Factorization-Machine based neural network (DeepFM) uses the power of factorization machines for recommendation and deep learning for feature learning [19]. The eXtreme Deep Factorization Machine (xDeepFM) combines a Compressed Interaction Network (CIN) with a classical DNN, which can learn certain bounded-degree feature interactions explicitly and arbitrary low-order and high-order feature interactions implicitly [20]. The Deep & Cross Network (DCN) retains the DNN's ability to generalize to unseen feature combinations while efficiently learning certain bounded-degree feature interactions [21]. Although these deep learning-based models were developed to predict the Click-Through Rate (CTR) on the Internet, their direct application to telecom package recommender systems remains a challenge.
Therefore, we propose a novel recommender system for telecom packages based on DCN, DBN, Embedding, and Word2Vec. We used Embedding [22] to map discrete categorical features into dense vectors, and Word2Vec [23] to capture similarities between numerical features; data wrangling (imputation and dropping) was also performed. Additionally, the raw dataset was converted to fit the format of the CTR prediction paradigm.
The remainder of the paper is organized as follows. A brief overview of the related work and the proposed model is presented in Section 2. Section 3 explains the comprehensive set of experiments performed to validate the proposed method by comparing its performance with those of other methods. Finally, the conclusions are summarized in Section 4.

Related Work
With the rapid advancement of smartphones, communication service providers need to establish a comprehensive recommender system to ensure high-quality and custom-made services for different customer demands. Before the development of deep learning, traditional machine learning and recommender systems were used for telecom package recommendations.
The K-Nearest Neighbor (KNN) algorithm [24] can separate historical customer datasets into several classes to predict the classification of a new customer. However, its computational cost is typically high when the feature dimension is large. Moreover, KNN is sensitive to noise in the dataset and exhibits low fault tolerance.
Collaborative Filtering (CF) is a popular technique for developing recommendation systems and has been employed in several applications [25][26][27]. Herein, user behaviors based on data collected in the past are analyzed to determine the connection between users and their items of interest, which aids in recommending items to users with similar preferences while taking the opinions of different users into account. A hybrid recommendation approach that combines user-based and item-based CF techniques was proposed for mobile product and service recommendation [28][29][30]. Although CF can achieve efficiency based on similarity measurements between users' interests and recommended items, it is difficult to fully exploit cross features owing to the lack of deep and effective feature extraction. Furthermore, CF recommendation systems suffer from the cold-start problem and low recommendation accuracy.
Although recommender systems are widely used in the modern telecom industry, the methods employed have focused on traditional machine learning and CF techniques. Only a few practical or academic cases exist in which state-of-the-art deep CTR models were used for telecom package recommendation. However, the significant success of deep learning has introduced deep learning-based CTR models into the field of recommender systems, where they have been applied to websites on the Internet.
The Factorization Machine (FM) [31] integrated with deep learning exhibited outstanding performance and achieved promising results in CTR prediction. However, the two types of deep learning models proposed, namely the FM-supported Neural Network and the Sampling-based Neural Network, can capture only high-order feature interactions [32]. Wide & Deep Learning [15], which trains wide linear models and deep neural networks simultaneously to combine the benefits of memorization and generalization for recommender systems, was initially introduced for app recommendation in Google Play. However, it requires designing cross-product transformations through expert feature engineering.
The Product-based Neural Network (PNN) [13] attempts to capture high-order feature interactions by introducing a product layer between the embedding and hidden layers. Unlike a traditional embedding combined with a multi-layer perceptron, PNN explicitly captures second-order feature correlations across multiple fields.
Influenced by Wide & Deep, DeepFM [19] replaced the "wide" segment of the Wide & Deep model with FM to learn the low-order features. The "deep" segment was developed with a feedforward neural network to learn high-order features. Both the "wide" and "deep" segments share identical raw feature inputs. In comparison with the Wide & Deep model, DeepFM is an end-to-end learning model that is independent of manual feature engineering. Moreover, the xDeepFM [20] explicitly models low-order and high-order feature interactions by using a novel CIN segment.
To address the drawbacks of the aforementioned models, we propose a novel recommender system for telecom packages based on DCN, DBN, Embedding, and Word2Vec, which can provide a potential solution to the recommendation challenges faced by telecom enterprises.

Proposed Method
In this section, we describe the architecture of the proposed model and present its underlying theory. The process of developing the dataset is explained, and the methods used to evaluate the proposed model are outlined. Fig. 1 illustrates the architecture of the proposed model based on the DCN. The model begins with an embedding and stacking layer, wherein the sparse, dense, and Word2Vec-converted features are stacked. Subsequently, the stacked vector is fed into a cross network and a deep network in parallel. The two results are combined in the combining layer, and the final output is computed through a single-layer neural network with a sigmoid activation. Unlike the original DCN, the Word2Vec model is combined with the DCN to train on text features, which solves the problem of the DCN being unable to process text information and enhances the recommendation accuracy of the proposed model. Moreover, replacing the "deep" segment of the original DCN with a DBN allows the advantages of the DBN to be exploited fully.
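To make the data flow concrete, the following is a minimal PyTorch sketch of the topology described above: the stacked input is fed to a cross branch and a deep branch in parallel, the two outputs are concatenated, and a single-layer perceptron with a sigmoid produces the click probability. This is only a structural sketch with illustrative layer sizes, not the authors' implementation; in particular, the DBN pre-training of the deep branch and the Word2Vec feature construction are omitted here and discussed in the following subsections.

```python
import torch
import torch.nn as nn

class CrossDeepModel(nn.Module):
    """Illustrative DCN-style topology: parallel cross and deep branches."""
    def __init__(self, input_dim: int, num_cross_layers: int = 3,
                 deep_dims=(32, 32, 32, 32)):
        super().__init__()
        # Cross branch: one weight vector and one bias vector per cross layer.
        self.cross_w = nn.ParameterList(
            [nn.Parameter(torch.randn(input_dim)) for _ in range(num_cross_layers)])
        self.cross_b = nn.ParameterList(
            [nn.Parameter(torch.zeros(input_dim)) for _ in range(num_cross_layers)])
        # Deep branch: plain fully connected ReLU layers (DBN pre-training omitted).
        layers, prev = [], input_dim
        for d in deep_dims:
            layers += [nn.Linear(prev, d), nn.ReLU()]
            prev = d
        self.deep = nn.Sequential(*layers)
        # Combining layer: single-layer perceptron over the concatenated outputs.
        self.out = nn.Linear(input_dim + deep_dims[-1], 1)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # Cross branch: x_{l+1} = x0 * (x_l^T w_l) + b_l + x_l (see the Cross Network subsection)
        xl = x0
        for w, b in zip(self.cross_w, self.cross_b):
            xl = x0 * (xl * w).sum(dim=1, keepdim=True) + b + xl
        # Deep branch operates on the same stacked input
        h = self.deep(x0)
        # Combine both branches and squash to a click probability
        return torch.sigmoid(self.out(torch.cat([xl, h], dim=1)))

# Example: a batch of 4 stacked feature vectors of dimension 17 (illustrative)
model = CrossDeepModel(input_dim=17)
print(model(torch.randn(4, 17)).shape)     # torch.Size([4, 1])
```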

Dataset
The dataset used in this study was provided by the China Unicom Research Labs and is available from DataFountain in csv format. The dataset comprises 743,990 samples. Each sample indicates a consumption record, implying that a telecom package is currently being used by a customer. The dataset comprises 27 fields, such as USERID, current_type, and service_type. Excluding the USERID field, 22 features, such as age, gender, and complaint level, describe a user, whereas four features, such as the type of package and whether a mixed package has been opted for, describe the current package. Each categorical field is represented as a one-hot encoded vector, whereas each continuous field is represented either by its value or as a one-hot encoded vector after discretization.
For the dataset to be suitable for the CTR prediction task, it must be transformed into binary classification data. In the CTR setting, each sample represents a historical click behavior of a specific user on a telecom package, which is treated as a displayed advertisement. All samples of the raw dataset serve as positive samples. The raw dataset comprises 12 unique telecom packages; negative samples were constructed by randomly choosing five of the other 11 telecom packages while keeping the user-side features consistent. Following the CTR paradigm, a "clicked" column was created in the dataset, and the click values of positive and negative samples were set to "1" and "0", respectively. A pandas sketch of this construction is shown below.
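As an illustration, the following pandas sketch shows one way to perform this conversion. The column name current_type for the package identifier, the clicked label column, and the per-record strategy of drawing five of the remaining packages are our reading of the description above, not the authors' exact preprocessing script.

```python
import numpy as np
import pandas as pd

def build_ctr_dataset(df: pd.DataFrame, package_col: str = "current_type",
                      n_neg: int = 5, seed: int = 0) -> pd.DataFrame:
    """Turn one-package-per-user records into binary CTR samples."""
    rng = np.random.default_rng(seed)
    all_packages = df[package_col].unique()

    positives = df.copy()
    positives["clicked"] = 1                      # every raw record is a positive sample

    negatives = []
    for _, row in df.iterrows():
        others = [p for p in all_packages if p != row[package_col]]
        for pkg in rng.choice(others, size=min(n_neg, len(others)), replace=False):
            neg = row.copy()
            neg[package_col] = pkg                # same user-side features, other package
            neg["clicked"] = 0
            negatives.append(neg)

    return pd.concat([positives, pd.DataFrame(negatives)], ignore_index=True)
```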

Data Wrangling
As the raw dataset contains missing values, we used the following two methods to impute them from the existing data.
1) For numerical features, the dataset was grouped by the current package; the mean of the non-missing values in each column was calculated within each group, and the missing values were replaced with this mean, independently of other groups.
2) For categorical features, the dataset was grouped by the current package, and the missing values were replaced with the most frequently occurring value within each group. A pandas sketch of this group-wise imputation is given below.
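The following pandas sketch illustrates this group-wise imputation; the grouping column name (current_type) and the feature lists are illustrative assumptions rather than the exact schema.

```python
import pandas as pd

def impute_by_package(df: pd.DataFrame,
                      group_col: str = "current_type",
                      numerical_cols=(),
                      categorical_cols=()):
    """Fill missing values group-wise: mean for numerical, mode for categorical."""
    df = df.copy()
    for col in numerical_cols:
        # Replace NaNs with the mean of the non-missing values of the same package group.
        df[col] = df.groupby(group_col)[col].transform(lambda s: s.fillna(s.mean()))
    for col in categorical_cols:
        # Replace NaNs with the most frequent value (mode) of the same package group.
        df[col] = df.groupby(group_col)[col].transform(
            lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)
    return df
```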

1) Input and Embedding Layer
We classified all features of the raw dataset into three types, namely categorical features, continuous integer features (such as the number of payments), and decimal numerical features. We used Embedding for dimensionality reduction to transform the categorical features into a low-dimensional space, and we performed normalization on the continuous integer features. For the decimal numerical features, we converted the decimal values to text and employed Word2Vec to transform them into embedding vectors. Finally, we stacked the embedding vectors of the categorical features, the normalized integer features, and the textual features into one vector,

$$\mathbf{x}_0 = [\mathbf{x}_{\text{embed}}^{\top}, \mathbf{x}_{\text{norm}}^{\top}, \mathbf{x}_{\text{text}}^{\top}]^{\top}, \tag{1}$$

where $\mathbf{x}_0$ serves as the input for both the cross and deep networks simultaneously.
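A minimal sketch of these three transformations and the final stacking is shown below, assuming PyTorch and gensim (version 4 or later); the feature names, dimensions, and the exact decimal-to-text conversion are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# --- categorical features: learnable embeddings (dimensionality reduction) ---
num_categories, embed_dim = 12, 8          # e.g. 12 package types, 8-dim embedding
cat_embedding = nn.Embedding(num_categories, embed_dim)
cat_ids = torch.tensor([3, 7])             # two example samples
x_embed = cat_embedding(cat_ids)           # shape: (2, 8)

# --- continuous integer features: simple min-max normalization ---
pay_times = torch.tensor([[4.0], [12.0]])
x_norm = (pay_times - pay_times.min()) / (pay_times.max() - pay_times.min() + 1e-8)

# --- decimal features: convert to text and embed with Word2Vec ---
decimal_as_text = [["36.80"], ["99.00"]]   # each sample's decimal value as a "sentence"
w2v = Word2Vec(sentences=decimal_as_text, vector_size=8, window=1, min_count=1)
x_text = torch.tensor(np.stack([w2v.wv[s[0]] for s in decimal_as_text]))  # (2, 8)

# --- stack everything into x0, the shared input of the cross and deep networks ---
x0 = torch.cat([x_embed, x_norm, x_text], dim=1)   # shape: (2, 17)
print(x0.shape)
```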

2) Cross Network
The cross network comprises several cross layers, each of which is calculated as

$$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^{\top} \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l, \tag{2}$$

where $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ represent the outputs of the $l$-th and $(l+1)$-th cross layers, respectively, and $\mathbf{w}_l$ and $\mathbf{b}_l$ denote the weight and bias parameters of the $l$-th layer, respectively; all of these variables are column vectors. The features of each layer are obtained by crossing the output of the previous layer with the original features $\mathbf{x}_0$ and adding the result back to the previous layer's output. This is similar to the structure of a residual network, wherein the function $f$ of each layer fits the residual $\mathbf{x}_{l+1} - \mathbf{x}_l$. Thus, the gradient dispersion problem encountered by DNNs can be alleviated by this residual structure.
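For clarity, the single cross-layer update of Eq. (2) can be restated as a standalone function: because $\mathbf{x}_l^{\top}\mathbf{w}_l$ is a scalar for each sample, the update reduces to scaling $\mathbf{x}_0$ and adding the bias and the previous output. The dimensions below are illustrative.

```python
import torch

def cross_layer(x0: torch.Tensor, xl: torch.Tensor,
                w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Eq. (2): x_{l+1} = x0 * (xl^T w) + b + xl, applied row-wise to a batch."""
    # (batch, d) -> (batch, 1): the scalar xl^T w for every sample
    interaction = (xl * w).sum(dim=1, keepdim=True)
    return x0 * interaction + b + xl           # broadcasting over the batch

batch, d = 4, 17
x0 = torch.randn(batch, d)
w, b = torch.randn(d), torch.randn(d)
x1 = cross_layer(x0, x0, w, b)                 # first cross layer: x_l = x_0
print(x1.shape)                                # torch.Size([4, 17])
```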

3) Deep Network
The deep network is a fully connected feedforward neural network that shares the same input as the cross network. It enables the model to learn higher-order nonlinear feature combinations. The output of each layer is calculated as

$$\mathbf{h}_{i+1} = F(\mathbf{W}_i \mathbf{h}_i + \mathbf{b}_i), \tag{3}$$

where $\mathbf{h}_{i+1}$ and $\mathbf{h}_i$ represent the outputs of the $(i+1)$-th and $i$-th layers, respectively; $\mathbf{W}_i$ and $\mathbf{b}_i$ denote the weight and bias parameters of the $i$-th layer, respectively; and $F$ indicates the activation function that enhances the nonlinear capability of the network. The Rectified Linear Unit (ReLU) [33] is used as the activation function in the proposed model owing to its computational simplicity. Moreover, the convergence speed of ReLU significantly outperforms those of other activation functions, such as sigmoid [34] and tanh [35].
When $\mathbf{x}_0$ flows through the cross network and the deep network, the outputs $\mathbf{x}_{L_1}$ and $\mathbf{h}_{L_2}$ are generated, respectively. In the combining layer, $\mathbf{x}_{L_1}$ and $\mathbf{h}_{L_2}$ are stacked into $\mathbf{x}_{\text{stack}}$ as follows:

$$\mathbf{x}_{\text{stack}} = [\mathbf{x}_{L_1}^{\top}, \mathbf{h}_{L_2}^{\top}]^{\top}. \tag{4}$$

The final output $p$ is then computed through a single-layer perceptron with a sigmoid activation function $\sigma$:

$$p = \sigma(\mathbf{x}_{\text{stack}}^{\top} \mathbf{w}_{\text{logit}} + b_{\text{logit}}), \tag{5}$$

where $\mathbf{w}_{\text{logit}}$ and $b_{\text{logit}}$ denote the weight vector and bias of the output layer, respectively.
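The combining layer and output of Eqs. (4) and (5) amount to a concatenation followed by a single linear unit and a sigmoid, as the following sketch shows; the two branch outputs are replaced by random tensors only to illustrate the shapes.

```python
import torch
import torch.nn as nn

batch, d_cross, d_deep = 4, 17, 32
x_L1 = torch.randn(batch, d_cross)            # stand-in for the cross-network output
h_L2 = torch.randn(batch, d_deep)             # stand-in for the deep-network output

x_stack = torch.cat([x_L1, h_L2], dim=1)      # Eq. (4): stack the two outputs
logit = nn.Linear(d_cross + d_deep, 1)        # single-layer perceptron
p = torch.sigmoid(logit(x_stack))             # Eq. (5): predicted click probability
print(p.shape)                                # torch.Size([4, 1])
```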

Results and Discussion
In this section, we evaluate the performance of the proposed model using the dataset provided by the China Unicom Research Labs. The experiments were conducted on a high-performance computer configured with dual Intel Xeon E5-2660 v4 processors at 2.00 GHz (56 cores), 630 GB of memory, a 1 TB hard disk, and three Tesla P40 graphics processing unit (GPU) cards. The system ran Ubuntu 18.04.1 LTS with Python 3.6.5, Torch 1.1.0, and TensorFlow 1.14.0.

Comparison of Models and Their Hyperparameters
We compared the proposed model with four models, namely the DBN, DCN, DeepFM, and DNN. All models used the Adam [36] optimizer with the learning rate and mini-batch size set to 0.001 and 1000, respectively. Furthermore, as dropout can prevent neural networks from overfitting [37], the dropout rate of all models was set to 0.2. The Adam optimizer addressed the sparseness faced by the models, whereas dropout prevented the neural networks from overfitting.
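The shared training configuration can be expressed in a few lines of PyTorch, as sketched below; the model, data, and number of epochs are hypothetical placeholders, while the optimizer, learning rate, mini-batch size, dropout rate, and binary cross-entropy (Logloss) objective follow the settings stated above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(17, 32), nn.ReLU(),
                      nn.Dropout(p=0.2),            # dropout rate of 0.2
                      nn.Linear(32, 1), nn.Sigmoid())
train_x = torch.randn(5000, 17)                      # placeholder features
train_y = torch.randint(0, 2, (5000, 1)).float()     # placeholder click labels
loader = DataLoader(TensorDataset(train_x, train_y), batch_size=1000, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()                             # Logloss on the sigmoid output

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```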
Proposed Model: Table 1 lists the hyperparameters of the proposed model. The dimension of word embedding was set to eight, with three cross layers and four hidden layers in the deep network. The number of hidden neurons in each hidden layer was 32, and the number of RBMs was set to four, with a network structure of [200, 100, 50, 20].
DBN: Table 2 lists the hyperparameters of the DBN. The value of k in CD-k was set to two, and four RBMs were used, with the network structure set to [200, 100, 50, 20].
DCN: Table 3 lists the hyperparameters of the DCN. As in the proposed model, the dimension of word embedding was set to eight, with three cross layers and four hidden layers in the deep network, and 32 hidden neurons in each hidden layer.
DeepFM: Table 4 lists the hyperparameters of the DeepFM. The dimension of word embedding was set to eight, with four hidden layers in the deep network and 32 hidden neurons in each hidden layer.
DNN: Table 5 lists the hyperparameters of the DNN. The deep network comprised four hidden layers, with the network structure set to [200, 100, 50, 20].
For all models, the mini-batch size and learning rate were 1000 and 0.001, respectively, as stated above.

Evaluation Methods
We used the Area Under the ROC Curve (AUC) [38][39][40] and the cross entropy (Logloss) [41], which are widely used metrics in CTR prediction, as the evaluation metrics in our experiments. The upper bound of the AUC is 1, and a larger value indicates better performance. Logloss measures the distance between two distributions, and a smaller value indicates better performance. All experiments were repeated 10 times and the results were averaged.
Fig. 2 illustrates the AUC metrics. As indicated in the figure, the AUC values obtained by the proposed model were larger than those of the other models. Furthermore, the AUC values increased rapidly as the number of epochs increased and subsequently converged to their best values. Fig. 3 illustrates the Logloss metrics and shows that the Logloss values obtained by the proposed model were smaller than those of the other models. Moreover, the Logloss values decreased rapidly as the number of epochs increased and then converged to their best values. Table 6 summarizes the best AUC and Logloss values obtained by all models.
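Both metrics can be computed with scikit-learn, as sketched below; the label and prediction arrays are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 1, 1, 0, 0])               # ground-truth click labels
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1])   # predicted click probabilities

auc = roc_auc_score(y_true, y_pred)                  # larger is better, upper bound 1
logloss = log_loss(y_true, y_pred)                   # smaller is better
print(f"AUC = {auc:.4f}, Logloss = {logloss:.4f}")
```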
In other words, the proposed model outperformed all the other models. First, in comparison with the models that do not consider text feature information, the proposed model is combined with the Word2Vec model, which can train on text information and thus extract richer textual features. This significantly improved the performance of the proposed model relative to the other models. Second, unlike the DBN and DNN, the proposed model considers both the extraction of high-level features and the impact of low-level features, which ensures better performance. Third, unlike the DeepFM and DCN, the proposed model uses a DBN rather than a DNN in its deep segment, which makes it less likely to become trapped in local optima; thus, the deep segment of the proposed model can be trained toward a more global optimum.

Conclusions
To provide a potential solution to the recommendation challenges faced by telecom enterprises, we proposed a recommender system for telecom packages using the DCN, DBN, Embedding, and Word2Vec models. The experimental results indicate that the proposed model outperforms the existing models, namely the DBN, DCN, DeepFM, and DNN, in terms of AUC and Logloss. The aspects contributing to the performance improvement of the proposed model can be summarized as follows: 1) the addition of new embedding features, generated by applying Word2Vec to the textual features originating from the decimal features, enables the model to learn more low-order and high-order feature interactions; 2) replacing the deep segment of the original DCN with a DBN renders the proposed model an efficient generative model that can exploit unlabeled data while providing a powerful predictive classification function. In the future, we intend to investigate distributed training on multiple GPUs and the imputation of missing values using machine learning techniques.

Availability of data and material
The dataset used in this paper was provided by China Unicom Research Labs and is available at https://www.datafountain.cn/competitions/311/datasets.

Competing interests
We declare that there is no conflict of interest regarding this submission.

Authors' contributions
Congming Shi and Wen Wang introduced the idea of this work and contributed to the English writing. Shoulin Wei and Feiya Lv designed the algorithms and contributed to the English writing. Shoulin Wei and Wen Wang implemented the algorithms and contributed to the English writing.