Exploring different interactions among features for CTR prediction

Advertising Click-Through Rate (CTR) prediction is one of the most successful applications in the field of recommender systems. Improving the accuracy of CTR prediction not only improves the user experience but also brings more benefits to advertising platforms and advertisers. The research landscape shows that learning the interactions between features has become a very important part of advertising CTR prediction models. Although existing deep-learning-based CTR prediction models have achieved good results, some of them consider only a single interaction mode and lack diversity in feature interaction. To resolve this problem, this paper proposes a CTR prediction model based on multi-feature interaction, called EDIF, which aims to enhance the diversity of feature interaction. Firstly, the model learns multiple different embedding vectors for each feature in the embedding layer, reflecting the correlation between features; secondly, for the high-order feature interaction, the embedding vectors of each feature are sum-pooled to form an aggregation vector of the whole feature as input, reflecting the integrity of the feature; finally, after the feature embedding operation, the model introduces two layers in parallel, a squeeze-and-excitation network (SENet) layer and an explicit high-order interaction layer, which improve the ability of feature interaction. We conducted extensive experiments on two public datasets, Avazu and Criteo. The results show that our model has clear advantages over the latest models.


Introduction
At present, we have entered the era of big data, whose most remarkable characteristic is the huge volume of data. From this large amount of data, useful information can be obtained through data mining technology (Abualigah et al. 2019). A recommendation system can help users find the information they want and make producers' information more targeted when displayed to users. In advertising, an application field of recommendation systems, CTR prediction is significant. There are three prominent roles in the advertising ecosystem: users, advertisers and advertising platforms. Users are the group that advertisers want to influence through the advertising platform; advertisers promote their products and improve their reputation through the advertising platform; and the advertising platform provides advertising space for advertisers to display advertisements. Advertising is an important business model for Internet companies. For advertisers, advertising promotes products and services, provides channels to acquire users and achieves rapid user growth. Advertising increases revenue and brings economic value to advertising platforms, so studying advertising CTR prediction and improving its accuracy is of great practical significance.
Logistic regression (LR) was one of the earliest methods proposed in the development of CTR prediction (Juan et al. 2016; Wang et al. 2017; Guo et al. 2017; Zhou et al. 2018b; Feng et al. 2019; Ouyang et al. 2019a; Huang et al. 2019; Zhou et al. 2019; Pi et al. 2019; Ouyang et al. 2019b; Lyu et al. 2020; Xu et al. 2020; Liu et al. 2020). LR uses one-hot coding, which converts categorical features into vectors as input. However, the features after one-hot coding are too sparse and lead to a large feature space; moreover, logistic regression relies on manual feature engineering. These problems are addressed by Factorization Machines (FM) (Rendle 2010), which learn combination feature weights by performing an inner product on two latent vectors, while Field-aware Factorization Machines (FFM) (Juan et al. 2016) add the concept of the field to FM: each feature uses different latent vectors for different fields, so each feature corresponds to a group of latent vectors, which enhances the expressiveness of the model. Since learning combined features is imperative in CTR prediction, Product-based Neural Networks (PNN) (Qu et al. 2016) learn feature interactions in the form of the inner product. Nevertheless, PNN does not use different vectors when one feature interacts with other features. Operation-aware Neural Networks (ONN) (Yang et al. 2020) perform the inner product operation after an operation-aware embedding, solving the problem of using different vectors when the same feature interacts with other features.
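To make the distinction between FM and FFM concrete, the following minimal NumPy sketch contrasts their second-order terms. It is an illustration of the standard formulations, not the implementation of any model in this paper; all names and shapes are assumptions.

```python
import numpy as np

def fm_pairwise_score(x, V):
    """FM second-order term: sum over pairs of <v_i, v_j> * x_i * x_j.

    x: feature values of shape (n,); V: latent matrix of shape (n, k).
    """
    n = len(x)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.dot(V[i], V[j]) * x[i] * x[j]
    return score

def ffm_pairwise_score(x, V, field):
    """Field-aware variant: feature i uses a different latent vector
    for each field of the partner feature.

    V: tensor of shape (n, num_fields, k); field[j] is the field of feature j.
    """
    n = len(x)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.dot(V[i, field[j]], V[j, field[i]]) * x[i] * x[j]
    return score
```

Note how FFM indexes the latent vector by the partner feature's field, which is exactly the property (a feature interacting differently with different partners) that the models discussed above build on.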
Some models consider only a single interaction mode and lack diversity in feature interaction. To resolve this problem, this paper proposes a model called EDIF, which aims to enhance the diversity of feature interaction. We know that the correlation between a feature and each of the other features is different; this correlation is ignored when a feature uses the same vector for all of its interactions. Therefore, our EDIF model uses different embedding vectors when a feature interacts with different features (Juan et al. 2016). In addition, during the high-order feature interaction, we sum-pool the embedding vectors of each feature to form the aggregation vector of the integral feature as input, which reflects the integrity of the feature. After the feature embedding operation, two layers are introduced in parallel in our model: the SENet layer (Hu et al. 2018) and the Cross layer (Wang et al. 2017). In different scenarios, users pay extra attention to different features; if we give the same weight to every cross-feature, much valuable information is ignored. The SENet layer therefore gives different weights to different interactive features in our model. Compared with previous deep neural networks, the Cross layer can explicitly carry out high-order interaction and learn more nonlinear relationships.
The main contributions of this paper can be stated as follows:
-Our model preserves the correlation between features in the second-order interaction after the embedding operation and preserves the integrity of features in the high-order interaction. Specifically, in the second-order feature interaction, a feature uses a different embedding vector when interacting with each other feature, reflecting the correlation between features; in the high-order feature interaction, the embedding vectors of each feature are added to form the aggregation vector of the integral feature as input, reflecting the integrity of the feature.
-Different interactive features are given different weights by introducing the SENet attention module after the second-order feature interactions.
-To improve the ability of feature crossing, both second-order and explicit high-order interactions are carried out to explore more nonlinear relationships between features.
The rest of this paper is structured as follows: the second section reviews work related to our model along four different directions; the third section introduces the model in detail; the fourth section describes the extensive experiments conducted on two open datasets; and the fifth section summarizes our work.

Related work

Deep learning evolution of the FM model
The CTR is a very important parameter in computational advertising and recommendation systems, and CTR prediction often requires combining multiple features. The FM handles high-dimensional, highly sparse input feature combinations well: a second-order part is added to a logistic regression base model, each dimension obtains a corresponding latent vector, the inner product of latent vectors models the combination weight, and feature combinations that never appear in the training set can still be learned effectively. Due to combinatorial explosion, however, the model is not easily extended to third-order feature crossing. FFM adds the concept of the "feature field" to the FM model so that each feature adopts different weights when crossing with features of different fields; compared with the FM, the ability to model feature interactions is further enhanced. By modifying the second-order part of FM, the Neural Factorization Machines (NFM) (He and Chua 2017) replace the feature crossing part of FM with a Deep Neural Network (DNN) preceded by a Bi-Interaction pooling layer, which can be regarded as an element-wise product of the embeddings of different features. Compared with FM, the NFM has stronger expression and feature crossing abilities. The FNN adopts the FM model for its embedding layer to conduct supervised dimension reduction of sparse features and transform them into dense, continuous features; initializing with FM parameters makes the convergence of FNN faster. The Translation-based Factorization Machines (TransFM) (Pasricha and McAuley 2018) combine the ideas of FM and TransRec and apply them to sequential recommendation: TransFM changes the inner product calculation of FM and uses the squared Euclidean distance to improve the transitivity between sample features.

Combination model
Combining different models is a common way to build a recommendation model that integrates the advantages of multiple models. Wide & Deep (Cheng et al. 2016) trains a Wide Linear Model and a DNN jointly, obtaining the memorization ability of the Wide part and the generalization ability of the Deep part at the same time; this had a significant impact on the subsequent development of deep learning recommendation models. However, the wide part requires manual screening of feature combinations. The Deep & Cross Network (DCN) (Wang et al. 2017) can effectively capture feature combinations of a specific order and learn highly nonlinear interactions without manual feature engineering. The xDeepFM (Lian et al. 2018) introduces ideas from convolutional neural networks to achieve explicit learning of high-order feature interactions.

The combination of attention mechanism and recommendation model
The "Attention mechanism" is inspired by human habits. For example, when people browse the website or Taobao home page casually, they are attracted by specific areas. Therefore, if the attention mechanism is taken into account in modeling, the accuracy of the recommendation results can be significantly improved. By introducing the attention mechanism, Attentional Factorization Machines (AFM) (Xiao et al. 2017) are used to assign different importance to different feature combinations where the weights in the network can be learned automatically without introducing any additional field knowledge. Moreover, the Deep Interest Network (DIN) introduced the attention mechanism based on the traditional deep learning recommendation system model and calculated the attention score by using the correlation between the historical items of user behavior and the target advertising items, and thus, according to the different target advertising items, more targeted recommendations are made. However, the user interests are constantly evolving, while the DIN extracts user interests that are independent of each other, without capturing the dynamic evolution of interests. ATRank (Zhou et al. 2018a) proposed a general user behavior sequence modeling framework to integrate different types of user behaviors and conduct more detailed processing of the user heterogeneous behavior data. Momenta's network architecture SENet, which is the champion of the ImageNet 2017 Challenge, can learn from the importance of different features, thereby weighting the important features and weakening the features that contain little information. The AutoInt (Song et al. 2019) studies the explicit learning problem of high-order feature interaction, which has good interpretability. The user behavior in each session is similar, but the difference between different sessions is significant. Therefore, the Deep Session Interest Network (DSIN) is proposed to model user behavior closely related to the session. The closer the user's conversational interest is to the target item, the greater is the weight assigned by DSIN through the attention mechanism. Additionally, the Deep Spatio-Temporal Neural Networks (DSTN) consider both the spatial and temporal field information to estimate the CTR of advertising. The DSIN puts forward two Attention models, one is the Self-Attention Model, and the other is the Interactive Attention Model, in which the latter improves the former.

Changing the way of features cross
The models in this direction enrich the way features cross in a deep learning network. The PNN learns feature interactions in the form of the inner product, the outer product, or both together; its product operation combines the features of different feature fields. Neural Collaborative Filtering (NCF) (He et al. 2017) replaces the traditional dot product operation with a neural network, using only the ID features of users and items and no other features. The cross-network in the DCN uses a multi-layer residual network to fully cross each dimension of the feature vectors.

The EDIF model

Figure 1 shows the structure of our EDIF model. In the EDIF model, the input is divided into sparse and dense inputs. The sparse input is one-hot coded, but this coding method suffers from data sparsity and large space occupation; hence, inspired by the FFM (Juan et al. 2016) and ONN (Yang et al. 2020), we embed the sparse features. After the embedding layer, shown on the left side of the model diagram, we perform the inner product operation to learn the second-order interactions between the features. After the feature interactions, we pass through the SENet layer (Hu et al. 2018) to learn the importance of the feature interactions. On the right side of the model diagram, each feature is turned into an aggregation vector by sum pooling the multiple embedding vectors it generates; the aggregation vectors and the dense features are then concatenated and fed into the explicit high-order interaction layer. The symbol definitions used in this paper are shown in Table 1.

Problem description
Advertising CTR prediction is defined as a binary classification problem: given the test sample features $x$ and the sample label $y$, $y \in \{0, 1\}$, the goal is to predict the probability that a user clicks the test sample. The advertisements are then ranked from top to bottom by their predicted click probability; placing the advertisements with higher probability at the front makes users more likely to click on them.

Input layer
In our model, the input is divided into sparse input and dense input, denoted as $x_{\text{sparse}}$ and $x_{\text{dense}}$, respectively, where $n$ is the number of features. Sparse input enters the model through one-hot coding, while dense features can be input directly. The one-hot coding process works as follows: a feature generates a new vector after the one-hot coding operation; each dimension of the vector represents one category of the original feature, with the significant bit set to 1 and the others set to 0. The values of "Click or not" in Table 2 are 0 and 1, indicating that the advertisement was not clicked and was clicked, respectively. "Gender," "Region" and "Advertisement Type" are the features; their coded forms are shown in Table 3. It can be seen that the feature space is very sparse after the one-hot coding operation. Therefore, our EDIF model uses an embedding method to turn sparse features into dense embedding vectors.
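As a toy illustration of the coding step, the sketch below one-hot encodes a single categorical feature; the category values are hypothetical stand-ins, since Tables 2 and 3 are not reproduced here.

```python
import numpy as np

# Hypothetical categorical feature with three categories, in the spirit of
# the "Region" example: each value maps to a vector with a single 1.
categories = ["north", "south", "east"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

print(one_hot("south"))  # [0. 1. 0.] -- sparse, and grows with the category count
```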

Embedding layer
The correlation between different pairs of features differs, and this correlation is ignored if a feature uses the same vector when interacting with every other feature. Suppose we have three features $x_1$, $x_2$, $x_3$, and $v_1$, $v_2$, $v_3$ are the embedding vectors of the three features, respectively. When $x_1$ interacts with $x_2$ and $x_3$, if different representations are not considered for different feature interactions, then the weights of the pairs $(x_1, x_2)$ and $(x_1, x_3)$ are $w_{x_1,x_2} = v_1 \cdot v_2$ and $w_{x_1,x_3} = v_1 \cdot v_3$, respectively. As far as we know, when $x_1$ interacts with $x_2$ and with $x_3$, the importance of each interaction is different. For example, if $x_1$ represents male users, $x_2$ represents basketball and $x_3$ represents lipstick, then in general male users prefer to buy basketballs rather than lipstick; therefore, we should consider different representations for different feature interactions. The weights of $(x_1, x_2)$ and $(x_1, x_3)$ then become $w_{x_1,x_2} = v_1^2 \cdot v_2^1$ and $w_{x_1,x_3} = v_1^3 \cdot v_3^1$, respectively, where $v_i^j$ represents the embedding vector that feature $x_i$ uses when interacting with feature $x_j$; that is, each feature learns a corresponding embedding vector for each of the other $n-1$ features. On the right side of Fig. 3, $e_i$ represents the $i$th aggregation vector, derived from feature $x_i$ and its embedding matrix $v_i$ by sum pooling:

$$e_i^k = \sum_{j \neq i} v_i^{j,k}$$

where $e_i$ is the aggregation vector after the sum pooling of the matrix $v_i$, and $e_i^k$, the $k$th dimension of $e_i$, is obtained by adding up the $k$th components of the $n-1$ embedding vectors in $v_i$. After the aggregation vectors of the $n$ features are concatenated, the vector $e = [e_1, e_2, \ldots, e_i, \ldots, e_n]$ is obtained.
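The following is a minimal NumPy sketch of this aggregation step under assumed sizes $n$ and $d$: each feature holds $n-1$ embedding vectors, and its aggregation vector is their componentwise sum.

```python
import numpy as np

n, d = 4, 2          # number of features, embedding dimension (assumed values)
rng = np.random.default_rng(0)

# Each feature i keeps one embedding vector per partner feature, so its
# embeddings form a matrix v_i of shape (n - 1, d).
v = [rng.normal(size=(n - 1, d)) for _ in range(n)]

# Aggregation vector e_i: sum-pool the rows of v_i so the whole feature
# is represented by a single d-dimensional vector.
e = [v_i.sum(axis=0) for v_i in v]

# e = [e_1, ..., e_n], the concatenated input to the high-order side.
e_concat = np.concatenate(e)
print(e_concat.shape)  # (n * d,)
```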

Second-order interaction layer
As shown in Fig. 4, each feature learns $n-1$ vectors, so each feature has a unique embedding vector corresponding to every other feature. The Hadamard product can therefore be performed on each pair of corresponding embedding vectors $v_i^j$ and $v_j^i$, $i, j \in [1, \ldots, n]$, $j \neq i$. For Fig. 4, the Hadamard product is calculated as:

$$p_{ij} = v_i^j \odot v_j^i = \left[ v_i^{j,1} v_j^{i,1}, v_i^{j,2} v_j^{i,2}, \ldots, v_i^{j,d} v_j^{i,d} \right]$$

where $d$ is the dimension of the embedding vector. After the second-order interaction, we obtain $m$ new vectors, represented as $P = [p_1, p_2, \ldots, p_m]$, where $m = n(n-1)/2$, and the vector $P$ is the input of the SENet layer.
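A short sketch of this pairwise interaction follows; the dictionary keyed by $(i, j)$ is an illustrative stand-in for the field-aware embedding table, not the paper's data structure.

```python
import numpy as np
from itertools import combinations

n, d = 4, 2
rng = np.random.default_rng(0)

# v[(i, j)] stands for the embedding feature i uses when paired with feature j.
v = {(i, j): rng.normal(size=d) for i in range(n) for j in range(n) if i != j}

# Second-order interaction: element-wise (Hadamard) product of the two
# field-aware embeddings for every unordered pair -> m = n(n-1)/2 vectors.
P = [v[(i, j)] * v[(j, i)] for i, j in combinations(range(n), 2)]
assert len(P) == n * (n - 1) // 2
```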

SENet layer
When people buy goods, they may prefer one kind of goods over another, so such goods should be given more weight in forecasting. For example, for people who usually buy skirts rather than lipstick, more weight should be given to skirts than to lipstick. To achieve this, we introduce the SENet, first applied in the image field, which can increase the weight of important features and weaken the weight of relatively unimportant features. The SENet block consists of three steps: the squeeze step, the excitation step, and the reweight step, as shown in Fig. 5.

We start with the squeeze step. In this step, the input is the vector $P = [p_1, p_2, \ldots, p_m]$. We compress each $d$-dimensional vector into a single dimension, turning the input into the vector $S = [s_1, s_2, \ldots, s_i, \ldots, s_m]$, where $s_i$ is a scalar, $i \in [1, \ldots, m]$. The squeeze is calculated as:

$$s_i = \frac{1}{d} \sum_{k=1}^{d} p_i^k \qquad (3)$$

Next, we introduce the excitation step. In this step, we use two fully connected layers to learn the weights of the vector $S$. The first layer reduces the dimension with the parameter $W_1$ and applies the activation function $\sigma_1$; the second layer restores the original dimension with $W_2$ and applies the activation function $\sigma_2$. We thus learn the new vector $A = [a_1, a_2, \ldots, a_m]$, calculated as:

$$A = \sigma_2\left(\sigma_1(S W_1)\, W_2\right)$$

where $A \in \mathbb{R}^m$, $W_1 \in \mathbb{R}^{m \times \frac{m}{r}}$, $W_2 \in \mathbb{R}^{\frac{m}{r} \times m}$, and $r$ is the dimension reduction ratio.
Finally, we introduce the reweight step. In this step, we multiply the vector $P$ element-wise by the corresponding entries of $A$, i.e., we reweight $P$ to get the final output of the SENet layer, $Q = [q_1, q_2, \ldots, q_m]$, calculated as:

$$Q = [a_1 \cdot p_1, a_2 \cdot p_2, \ldots, a_m \cdot p_m]$$

Here, $a_i \in \mathbb{R}$, $p_i \in \mathbb{R}^d$, $q_i \in \mathbb{R}^d$, and $d$ is the embedding dimension.
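A compact NumPy sketch of the three SENet steps is given below. Mean pooling for the squeeze and the ReLU/sigmoid pair for $\sigma_1$/$\sigma_2$ are assumptions, since the text does not pin the exact choices down.

```python
import numpy as np

def senet_block(P, W1, W2):
    """Squeeze -> excitation -> reweight over m interaction vectors.

    P: (m, d) interaction vectors; W1: (m, m // r); W2: (m // r, m).
    Mean pooling, ReLU and sigmoid are assumed choices, not confirmed ones.
    """
    S = P.mean(axis=1)                         # squeeze: one scalar per vector
    hidden = np.maximum(0.0, S @ W1)           # excitation layer 1 (ReLU assumed)
    A = 1.0 / (1.0 + np.exp(-(hidden @ W2)))   # excitation layer 2 (sigmoid assumed)
    return A[:, None] * P                      # reweight: q_i = a_i * p_i

m, d, r = 6, 2, 3
rng = np.random.default_rng(0)
Q = senet_block(rng.normal(size=(m, d)),
                rng.normal(size=(m, m // r)),
                rng.normal(size=(m // r, m)))
print(Q.shape)  # (6, 2): same shape as P, each p_i scaled by its weight a_i
```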

Explicit high-order interaction layer
This section introduces the cross-network used to learn the high-order interactions between features (Wang et al. 2017). The inputs of the cross-network are the aggregation vectors and the dense features. First, we flatten the aggregation vector $e$ into a single row and concatenate it with the dense features to generate a new vector $c$; the merging process is shown in Fig. 6.
After merging into one vector, we input $c$ into the cross-network. The interaction process of the cross-network is shown in Fig. 7.
From Fig. 7, we can see that each layer interacts with the input $c$: the deeper the layer, the higher the degree of interaction. The interaction process from the first layer to the $l$th layer is calculated as follows:

$$c_l = c_0 \, c_{l-1}^{\mathrm{T}} W_{l-1} + b_{l-1} + c_{l-1}$$
Here, $W_{l-1}$ and $b_{l-1}$ are the weight vector and bias of the $l$th cross layer, respectively.
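A minimal sketch of the merge and the cross-layer recurrence, following the DCN formulation cited above; `e_flat` and `dense` are illustrative placeholders for the flattened aggregation vectors and the dense features.

```python
import numpy as np

rng = np.random.default_rng(0)
e_flat = rng.normal(size=8)   # flattened aggregation vectors (illustrative)
dense = rng.normal(size=3)    # dense features (illustrative)
c0 = np.concatenate([e_flat, dense])  # the merged vector c of Fig. 6

def cross_network(c0, weights, biases):
    """DCN-style recurrence: c_l = c0 * (c_{l-1} . w_{l-1}) + b_{l-1} + c_{l-1}."""
    c = c0
    for w, b in zip(weights, biases):
        c = c0 * np.dot(c, w) + b + c  # every layer re-crosses with the input c0
    return c

dim, n_layers = c0.shape[0], 3
out = cross_network(c0,
                    [rng.normal(size=dim) for _ in range(n_layers)],
                    [rng.normal(size=dim) for _ in range(n_layers)])
print(out.shape)  # (dim,): the cross output keeps the input dimension
```

Because $c_0 c_{l-1}^{\mathrm{T}} W_{l-1}$ collapses to $c_0$ scaled by the dot product $c_{l-1} \cdot W_{l-1}$, each layer is cheap while still raising the interaction order by one.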

Output layer
We combine the linear part, the output of the cross-network, and the output of the SENet layer as the input of the output layer. The formula for the output layer is as follows:

$$\hat{y} = \sigma\left(W_o \left[x_{\text{linear}}; c_L; Q\right] + b_o\right)$$

Here, $W_o$ and $b_o$ are the weights and bias of the output layer, respectively, and $\sigma$ is the sigmoid function. We train the model by minimizing the log loss:

$$\text{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right)$$

where $y \in \{0, 1\}$ represents whether the user has clicked or not and $N$ is the number of training samples.
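A toy sketch of the output unit and the loss, with illustrative shapes and names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(linear_part, cross_out, senet_out, Wo, bo):
    """Concatenate the three components and apply a sigmoid output unit."""
    z = np.concatenate([linear_part, cross_out, senet_out.ravel()])
    return sigmoid(np.dot(Wo, z) + bo)

def log_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over the N samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```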

Datasets
We used two datasets to evaluate our model, as described below.
-Avazu dataset: Avazu is an Internet advertising company. The Avazu dataset comes from a Kaggle CTR prediction competition whose goal is to predict whether users will click on advertisements. We randomly selected 1 million records from the dataset: 800,000 for the training set and 200,000 for the test set. The dataset has 24 attribute columns.
-Criteo dataset: Criteo is a marketing technology company with a global reach. The Kaggle dataset comes from a display advertising challenge that aims to predict the CTR of ads. The label attribute in the data indicates whether the advertisement was clicked. I1-I13 are the numerical features, and C1-C26 are the categorical features.
We took the first one million records, with the training and test sets comprising 80% and 20% of these records, respectively.

Evaluation metrics
We used Log loss and AUC as evaluation metrics, described as follows (a toy computation of both is sketched below):
-Log loss: often used in offline evaluation, where the convergence of the model can be observed. The smaller the loss value, the better the model.
-AUC: used to evaluate the ranking quality of a recommendation model. The higher the AUC, the better the model.
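Both metrics are available off the shelf, for example in scikit-learn; the labels and predictions below are toy values for illustration only.

```python
from sklearn.metrics import log_loss, roc_auc_score

y_true = [0, 1, 1, 0, 1]            # toy click labels
y_pred = [0.1, 0.8, 0.6, 0.3, 0.9]  # toy predicted click probabilities

print("Log loss:", log_loss(y_true, y_pred))  # lower is better
print("AUC:", roc_auc_score(y_true, y_pred))  # higher is better
```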

Implementation details
In our experiment, the parameters are set as follows: the embedding dimension is 2, the SENet dimension reduction ratio is 3, and the batch size is 256. For the Avazu dataset, the number of cross-layers is 4; for the Criteo dataset, it is 3. Among many excellent optimization algorithms (Agushaka et al. 2022; Oyelade et al. 2022; Abualigah et al. 2022, 2021a), we found the Adam optimization algorithm (Kingma and Ba 2014) to be the most suitable for our model. All the methods mentioned above were implemented on an Intel i7-6700 3.4 GHz CPU with 16 GB RAM, using Python 3.7 and TensorFlow 1.14.

Model comparison
We compared our model with a number of baseline models; the results are reported in Table 4 and Figs. 8 and 9. The experimental results show that our model performs better in terms of both Log loss and AUC. This is because the EDIF model enhances the diversity of feature interactions and thereby improves the accuracy of advertising CTR prediction. Specifically, our EDIF model takes into account the correlation between each feature and the other features: it not only uses different embedding vectors when the same feature interacts with different features, but also sum-pools the embedding vectors of each feature into an aggregation vector that preserves the integrity of the feature during the high-order interaction. In addition, the SENet layer weights the interactive features, and the explicit high-order interaction layer captures more nonlinear relationships between features.

Study of the parameters
In this section, we study the influence of different hyperparameters on the results of the model. Table 5 shows our comparison results:

Embedding dimension
We performed experiments with embedding dimensions in the range of 2-12. From Figs. 10 and 11, we can see that the model works best when the embedding dimension is 2, on both the Avazu and Criteo datasets. Moreover, the larger the embedding dimension, the more complex the model, which may lead to overfitting.

Cross-layers
We performed experiments with 2-5 cross-layers. From Figs. 12 and 13, we can see that on the Avazu dataset the model improves steadily as the number of cross-layers increases from 2 to 4, while on the Criteo dataset it improves significantly as the number increases from 2 to 3. When there are too few layers, the model is insufficiently trained and cannot reach its best effect; when the number of cross-layers increases from 4 to 5, the effect no longer improves because of overfitting. Bold values in Table 5 indicate the parameter settings with the best model performance.

Reduction ratio
We performed experiments with the dimension reduction ratio between 2 and 5. From Figs. 14 and 15, we can see that on the Criteo dataset the model achieves its best effect when the dimension reduction ratio is 3 or 4, but performs poorly when it is 5, because the dimension is reduced too much and the learning ability of the model becomes insufficient. On the Avazu dataset, the model performs well when the dimension reduction ratio is 3 or 4. These comparative experiments show that a best-performing value can be found for each parameter, although finding it requires many experiments. We also found that values that are too small or too large cannot achieve the best effect: too few parameters may lead to underfitting, while too many lead to overfitting.

Conclusion
This paper proposes the EDIF model, which enhances the ability of feature interaction and improves the accuracy of CTR prediction. Firstly, the model learns several different embedding vectors for each feature to capture the correlation between features. Secondly, in the second-order feature interaction, the model obtains the interaction results of the different embedding vectors and uses a squeeze-and-excitation network to dynamically learn the importance of each feature interaction, improving the quality of feature interaction. Thirdly, the model uses the explicit high-order interaction layer to explicitly learn high-order interactions of features and capture more nonlinear relationships between them. Finally, the results of the model on two public datasets, Avazu and Criteo, are better than those of the latest models.
In this paper, the attention mechanism and other structures are used to capture the cross-relationships between features and improve the effect of advertising CTR prediction. To maintain this effect while making the model lighter, knowledge distillation can be used in the future: knowledge is transferred from a teacher network to a lighter student network, so that the student learns from the teacher's experience and achieves similar or better results.

Author Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Leilei Yang, Wenguang Zheng and Yingyuan Xiao. The first draft of the manuscript was written by Leilei Yang, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding No funding was received for conducting this study.

Declarations
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of interest
The authors declare that they have no conflict of interest.
Informed consent Informed consent was not required as no humans or animals were involved.