A model fusion method based on multi-source heterogeneous data for stock trading signal prediction

In the prediction of turning points (TPs) of time series, the improved model integrating piecewise linear representation and weighted support vector machine (IPLR-WSVM) has achieved good performance. However, due to its single data source and algorithmic limitations, IPLR-WSVM has encountered challenges in profitability. In this paper, a model fusion method based on multi-source heterogeneous data and different learning algorithms (MF-MSHD) is proposed for the prediction of TPs. The multi-source heterogeneous data include weighted unstructured information and structured information with different granularities. RF, WSVM, BPNN, GBDT, and LSTM are selected as the learning algorithms. The differences among meta-models are maximized by varying both inputs and algorithms, and a model fusion rule is designed to determine the final TPs. Moreover, the TPs are generated based on the characteristics of each individual stock. For sentiment analysis, a more accurate sentiment dictionary of stock market comments is established. In addition, fine-grained data is introduced to jointly determine the exact trading moment. The proposal improves accuracy and profitability, and also outperforms the composite indexes. Experimental results show that the profit rate of randomly selected stocks in MF-MSHD reaches 0.5172, while the highest value is 0.2841 for a single meta-model and 0.0992 for the buy-and-hold strategy. The other indicators, including accuracy, are also improved. Compared with the increases of 0.1648, 0.4051, and 0.3397 in the Shanghai Composite Index, Shenzhen Composite Index, and CSI 300 Index, MF-MSHD shows higher profitability in stock trading signal prediction.

The trend prediction of the stock market is challenging (Abu and Atiya 1996). In recent years, stock predictions have mainly been divided into two categories. One aims to predict the price (Ismail et al. 2020; Chandar 2020; Yang et al. 2021), which is a regression problem. The other aims to predict the turning points (TPs) of the trend (Long et al. 2020; Hao et al. 2021; Thakkar et al. 2021), which is a classification problem. Due to the high-frequency characteristics of stock data, the prediction of stock price has limited accuracy, while the prediction of TPs is relatively easier. Compared with price prediction, trend TP prediction is more attractive (Chang et al. 2009).
The analysis of the stock market is mainly based on technical and fundamental approaches. Technical analysis refers to judging market trends by analyzing transaction data. Fundamental analysis pays more attention to the financial information of listed stocks. Many methods have been applied to technical analysis, such as the back propagation neural network (BPNN) (Cao and Wang 2019) and the support vector machine (SVM) (Luo et al. 2017). As a combined classification algorithm, the random forest (RF) was used to predict long-term stock direction (Ballings et al. 2015). The gradient boosting decision tree (GBDT) shares some features with RF and is considered a useful algorithm for stock prediction (Gu et al. 2021). Long et al. (2020) proposed a deep neural network model based on transaction records and market information to predict stock price trends. Yu and Yan (2020) designed a deep neural network prediction model using the phase space reconstruction method and long short-term memory (LSTM) to predict stock prices; the results showed that the proposed model has higher prediction accuracy. In recent years, convolutional neural networks (CNNs) have been introduced to analyze stock trends and have achieved great success. The time series data is processed and transformed into graphs, which are used as the input of the CNN. Hao and Gao (2020) extracted two types of features on different time scales through the first and second layers of a CNN, and stock trends were predicted based on multi-time-scale features (CNN-LSTM). The proposed hybrid neural network outperforms the baseline models on real datasets. Barra et al. (2020) proposed a new approach for predicting the U.S. stock market trend by encoding time series as Gramian angular field (GAF) images, and a CNN was applied to the GAF images for trend classification (GAF-CNN). Wu et al.
(2021b) proposed a model that uses a CNN to extract features from financial time series and makes predictions based on classification (SSACNN), in which the time series were converted into matrices to avoid data dispersion and reduce useless information. Moreover, the proposed model also referenced leading indicators such as options and futures to improve the performance of stock trend prediction. SSACNN outperforms SVM-based and other CNN-based models in accuracy. Wu et al. (2021c) also designed the SACLSTM model for stock prediction, which considered options, futures, and historical data, and involved a stock sequence array convolution LSTM algorithm. The experiments verified that the neural network framework combining convolution and long short-term memory units achieves better performance in stock prediction. The proposed SACLSTM improves the effectiveness of stock price prediction.
In the analysis of stocks, a proper stock classification approach facilitates matching trading strategies with suitable stocks. In Wu's work (Wu et al. 2022), a system based on fuzzy analysis methods (FOCUS) was presented, which adopted a random trading algorithm with a stop-loss and take-profit mechanism to extract stock features. The features were then quantified using the profitability index and combined with a type-2 fuzzy set to describe them as fuzzy degrees. The proposed FOCUS greatly improves classification performance and profitability. Since the daily limit on increases and decreases is 10% on the Shanghai and Shenzhen stock exchanges, real-time predictions of stock prices will improve the profit. Yadav et al. (2022) proposed two models for different applications: FastRNN focused on faster computation with lower model complexity, while the FASTRNN_CNN_BiLSTM model balanced computation speed with accuracy. The experimental results showed that, compared with the baseline models, the two proposed models achieve better results in both root mean square error and computation time. However, in the field of stock prediction, it is better to consider various external factors in addition to the stock price, as they affect the stability of the model.
As early as 2010, some scholars pointed out that the sentiment analysis (SA) of information on Twitter can be used to predict the rise and fall of stocks (Bollen et al. 2011). Renault, Wu, and others (Renault 2017;Wu et al. 2021a) have verified that online public opinions have a significant impact on the stock market. More and more researchers are applying SA to the stock market prediction. For fundamental analysis, financial news and the annual reports of securities companies were used for the prediction of stock trends (Dang et al. 2018;Hao et al. 2021).
Due to the unsatisfactory performance of single models, many scholars focus on the combination of models. Xu et al. (2020) combined k-means clustering and ensemble learning; the results demonstrated that the combination model obtains the best accuracy. Patel et al. (2015) proposed a two-stage combination model to predict stock indexes. The first stage uses support vector regression (SVR) to predict the technical indicators, which are used as the input of the second stage. The second stage uses artificial neural networks (ANNs), RF, and SVR to predict the close price, forming the SVR-ANNs, SVR-RF, and SVR-SVR models. Experiments showed that the two-stage model is better than the second-stage models alone, and SVR-ANNs obtains the best results.
Since piecewise linear representation (PLR) was introduced to the generation of TPs, the PLR-based structure has been adopted in many models. The integration of PLR and weighted SVM was used to analyze the relationship between the indicators and the price (PLR-WSVM) (Luo and Chen 2013). The improved PLR-WSVM (IPLR-WSVM) proposed in Luo et al. (2017) provided steady profits with acceptable retracement. In the experiment of Yao (2019), three sets of features were selected based on expert experience, PLR-RF was used for training and testing, and the TPs were determined by majority vote. Moreover, a sentiment model of stock market comments was proposed to obtain the final trading points. For convenience, the descriptions, pros, and cons of some representative models for stock trend prediction mentioned above are summarized in Table 1:

• Yadav et al. (2022): FastRNN-based models for fast and accurate price prediction. Limitation: in addition to the stock prices, it is better to consider various external factors, which will affect the stability of the model.
• Wu et al. (2022): a system based on fuzzy analysis methods; a random trading algorithm was adopted to extract features, which are quantified by the profitability index with a type-2 fuzzy set module. Limitation: the selection of some factors and parameters depends on expert experience.
• Wu et al. (2021a): a deep learning model to predict stock trading strength and stock market trends for data annotation; online public opinions were introduced into the prediction model and showed a significant impact on the stock market. Limitation: the analysis of the technical indicators of the stock is also of great significance to the prediction of the trading points.
• Wu et al. (2021b): a model that uses a CNN to extract features from financial time series and makes predictions based on classification.
• Long et al. (2020): a deep neural network model based on transaction records and market information to predict stock price trends; investors were clustered to reduce the dimensions of the transaction records, which also serve as the input of a CNN to mine investment patterns. Limitation: the performance of the model could be improved by using textual and sentiment analysis from financial news and social media.
• Yao (2019): a trading-signal prediction model based on sentiment analysis and model fusion; the TPs were determined by majority vote of PLR-RF models and the sentiment analysis of online opinions. Limitation: the differences between meta-models are not obvious, and the setting of the TPs depends on expert experience.

Although PLR-based models have achieved good performance in the prediction of the stock market, they also have certain shortcomings.
• In Yao's work (Yao 2019), the meta-models are created based only on RF. The differences between meta-models should be as large as possible to improve the generalization and profitability of the model.
• For the model fusion rule, compared with majority vote, the performance of each meta-model for the prediction should be considered.
• The TPs ratio is set to the same value for all stocks, which is not reasonable. The activities of stocks are different, and the trading frequency should increase for stocks with high activity. The weight of each TP obtained by PLR is determined only by the change rate of the close price, which is not suitable for the overall evaluation of the weight.
• In terms of SA, the existing sentiment dictionary is not accurate for the analysis of stock market comments.
To overcome the drawbacks of PLR-based models, a model fusion method is proposed for the prediction of TPs.
• More differences between meta-models are created in our proposed model, in which the inputs and the algorithms are different. Specifically, the inputs include structured information of different granularities, and unstructured information such as the comments of the investors. The algorithms include RF, WSVM, BPNN, GBDT, and LSTM.
• After the prediction of different meta-models, the meta-models are fused based on their performances, including the predicted TPs, the profit brought by the trading strategy, and the accuracy and recall.
• In addition, the generation of the TPs by PLR directly affects the labels of the training set, so the TPs are generated according to the characteristics of the specific stock. The weight of each TP is determined not only by the change rate of the close price but also by local and global factors, which gives a more comprehensive consideration of its importance.
• In order to obtain the sentiment tendency and intensity, a more accurate domain sentiment dictionary is established by the expansion of the seed words.
• Moreover, existing PLR-based models only predicted the trading days, while in our proposed model, fine-grained data is introduced to jointly determine the exact trading moments based on LSTM.
The rest of this paper is organized as follows. A brief introduction to the related work is provided in Sect. 2. The proposed prediction model, MF-MSHD, is discussed in detail in Sect. 3. In Sect. 4, a series of comparative experiments is carried out to verify the effectiveness and profitability of our model. Finally, conclusions and future work are given in Sect. 5.

PLR
PLR replaces the original time series data with a number of straight lines that connect end to end (Keogh et al. 2001). The original data is compressed effectively, and the result reflects the trend of the data directly. Given a time series T = {y_1, y_2, ..., y_l}, the PLR for T can be described by:

T' = {L_1(y_1, ..., y_{t_1}), L_2(y_{t_1}, ..., y_{t_2}), ..., L_k(y_{t_{k-1}}, ..., y_l)},

where L_j is the straight-line segment approximating the corresponding subsequence and t_1 < t_2 < ... < t_{k-1} are the breakpoints. The approximation in PLR can be framed in several ways (Keogh et al. 2001). The most popular one produces the best representation such that the maximum error for each segment does not exceed a user-specified threshold. Linear interpolation and linear regression are used to find the approximation line. The algorithms for PLR can be divided into three types: sliding windows, top-down, and bottom-up (Keogh et al. 2001).
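As an illustration of the top-down variant, the following sketch (a simplified implementation, not the exact routine used in the paper) recursively splits a series at the breakpoint that minimizes the linear-interpolation error until every segment satisfies the error threshold:

```python
def interp_error(y, i, j):
    # maximum vertical distance between the points y[i..j] and the
    # straight line connecting (i, y[i]) and (j, y[j])
    n = j - i
    return max(abs(y[i] + (y[j] - y[i]) * (k - i) / n - y[k])
               for k in range(i, j + 1))

def top_down(y, i, j, max_err):
    # recursively split [i, j] at the best breakpoint until the error bound holds
    if j - i < 2 or interp_error(y, i, j) <= max_err:
        return [i, j]
    best = min(range(i + 1, j),
               key=lambda k: max(interp_error(y, i, k), interp_error(y, k, j)))
    left, right = top_down(y, i, best, max_err), top_down(y, best, j, max_err)
    return left[:-1] + right  # drop the duplicated breakpoint

def plr(y, max_err=0.5):
    # return the breakpoint indices of the piecewise linear representation
    return top_down(y, 0, len(y) - 1, max_err)
```

The returned breakpoints mark the candidate TPs; the bottom-up variant would instead start from single-point segments and merge the pair with the smallest merge cost.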

WSVM
The main idea of SVM is to generate a classification hyperplane that separates two classes of data with the maximum margin. Given a training set {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i ∈ R^n and y_i ∈ {−1, +1} are the training instance and the corresponding label, the decision boundary can be described as ⟨w, x⟩ + b = 0, where w is the normal vector of the hyperplane in the feature space, ⟨·,·⟩ denotes the inner product of two vectors, and b is a bias value. The constraint condition is that the distance from each point to the decision boundary is greater than or equal to 1:

y_i(⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., n.

The decision boundary that satisfies this condition actually constructs two parallel hyperplanes as the interval boundaries to discriminate the class of a sample:

⟨w, x⟩ + b = +1 and ⟨w, x⟩ + b = −1.

The distance between the two interval boundaries is 2/‖w‖. For linearly separable problems, SVM can be transformed into a quadratic convex optimization problem, which is the standard SVM model:

min_{w,b} (1/2)‖w‖² s.t. y_i(⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., n.

A nonlinear SVM can be obtained by applying a linear SVM after mapping the input data to a high-dimensional space using a nonlinear function φ(x_i). Moreover, the standard SVM can be extended to WSVM (Chang and Lin 2011) when each training instance has a different weight μ_i (i = 1, ..., n). The WSVM model can be described by:

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} μ_i ξ_i
s.t. y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n,   (5)

where ξ_i (i = 1, ..., n) are slack variables. The penalty factor C is introduced as a predefined parameter that balances the training accuracy and generalization ability. Introducing the Lagrangian multipliers α_i and β_i into (5), the Lagrangian function is constructed:

L(w, b, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{n} μ_i ξ_i − Σ_{i=1}^{n} α_i [y_i(⟨w, φ(x_i)⟩ + b) − 1 + ξ_i] − Σ_{i=1}^{n} β_i ξ_i.   (6)

Then the partial derivatives of the Lagrangian function with respect to the optimization variables w, b, ξ are set to 0:

∂L/∂w = 0 ⇒ w = Σ_{i=1}^{n} α_i y_i φ(x_i),  ∂L/∂b = 0 ⇒ Σ_{i=1}^{n} α_i y_i = 0,  ∂L/∂ξ_i = 0 ⇒ Cμ_i − α_i − β_i = 0.   (7)

Substituting (7) into the Lagrangian function (6), the dual model of (5) is obtained:

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)
s.t. Σ_{i=1}^{n} α_i y_i = 0, 0 ≤ α_i ≤ Cμ_i, i = 1, ..., n,   (8)

where K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ is the kernel function. The decision function of the binary classification problem can be obtained by:

f(x) = sign( Σ_{x_i ∈ SV} α_i* y_i K(x_i, x) + b* ),

where α* is an optimal solution of the convex programming problem (8), and SV is the set of support vectors; x_i ∈ SV means that the instance x_i is a support vector lying on or within the interval boundaries.
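In practice, the per-instance weights μ_i can be passed to an off-the-shelf SVM solver. Below is a minimal sketch using scikit-learn's `SVC`, where the `sample_weight` argument plays the role of μ_i; the toy data and weight values are purely illustrative, not from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# toy binary data: two clear clusters plus one borderline positive point
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1], [0.5, 0.55]])
y = np.array([-1, -1, 1, 1, 1])

# per-instance weights mu_i: down-weight the borderline point so it
# contributes less to the penalty term C * sum(mu_i * xi_i)
mu = np.array([1.0, 1.0, 1.0, 1.0, 0.1])

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=mu)
```

Internally, `sample_weight` rescales the box constraint to 0 ≤ α_i ≤ C·μ_i, which matches the dual model (8).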

BPNN
BPNN is a multilayer feedforward neural network with supervised learning and the error backpropagation algorithm (Rumelhart et al. 1986). BPNN is composed of an input layer, an output layer, and some hidden layers. Suppose the input of the BPNN is X = [x_1, x_2, ..., x_n]^T and Y = [y_1, y_2, ..., y_q]^T is the output of the network. The relationship between input and output for a BPNN with one hidden layer can be expressed as:

y_k = f( Σ_{j=1}^{m} w_{jk} h( Σ_{i=1}^{n} v_{ij} x_i + b_1 ) + b_2 ), k = 1, 2, ..., q,

where n and q represent the numbers of inputs and outputs in the network, m is the number of neurons in the hidden layer, v_{ij} is the weight from the input layer to the hidden layer, b_1 is the threshold of the hidden layer, w_{jk} is the weight from the hidden layer to the output layer, b_2 is the threshold of the output layer, h(x) is the transfer function adopted by the hidden layer nodes, and f(x) is the transfer function adopted by the output layer nodes. The error function of the network output is:

E = (1/2) Σ_{k=1}^{q} (d_k − y_k)²,

where d_k is the real output.
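The forward pass and the error function above can be sketched in a few lines of NumPy; sigmoid transfer functions are assumed for both h(x) and f(x), and the matrix shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpnn_forward(x, V, b1, W, b2):
    # hidden layer: h(V x + b1); output layer: f(W hidden + b2)
    hidden = sigmoid(V @ x + b1)
    return sigmoid(W @ hidden + b2)

def output_error(y, d):
    # E = 1/2 * sum_k (d_k - y_k)^2
    return 0.5 * np.sum((d - y) ** 2)
```

Training then consists of propagating ∂E/∂w back through these two layers and updating the weights by gradient descent.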

RF
RF was proposed by Breiman (2001). For the training data S_n = {(X_i, Y_i), i = 1, ..., n}, RF is a collection of decision trees {m_n(x; Θ_j, S_n), j = 1, 2, ..., N_tree}. Each meta-classifier m_n(x; Θ_j, S_n) is an unpruned tree constructed by the classification and regression trees (CART) algorithm. Θ_1, ..., Θ_{N_tree} are independent random variables, distributed the same as a generic random variable Θ and independent of S_n.
Suppose the number of trees N_tree > 0. The training set S_i for the i-th decision tree is sampled by bootstrap from the original dataset S_n. The split feature at each internal node is determined from M_try candidate features according to the Gini index, where the M_try candidates are selected randomly from the p features. Splitting stops when the tree has grown to its largest extent. The RF estimate m_{N_tree,n}(x; Θ_1, ..., Θ_{N_tree}, S_n) at the query point x is computed by majority vote over the trees:

m_{N_tree,n}(x; Θ_1, ..., Θ_{N_tree}, S_n) = argmax_y Σ_{j=1}^{N_tree} I( m_n(x; Θ_j, S_n) = y ).
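The bootstrap-and-vote scheme can be sketched in pure Python, with one-dimensional decision stumps standing in for full CART trees (a deliberate simplification for brevity):

```python
import random
from collections import Counter

def train_stump(data):
    # data: list of (x, y) with scalar x and binary y;
    # pick the threshold and orientation minimizing training errors
    xs = sorted(x for x, _ in data)
    best = None
    for t in xs:
        for lo, hi in ((0, 1), (1, 0)):
            err = sum((lo if x <= t else hi) != y for x, y in data)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def random_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    # each tree is trained on a bootstrap resample of the data
    trees = [train_stump([rng.choice(data) for _ in data])
             for _ in range(n_trees)]
    def predict(x):
        # majority vote over the ensemble
        return Counter(tree(x) for tree in trees).most_common(1)[0][0]
    return predict
```

A real RF would additionally restrict each split to M_try randomly chosen features; with scalar inputs that step degenerates, so it is omitted here.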

GBDT
GBDT is an iterative tree algorithm that combines a series of weak prediction models (Friedman 2001). In each iteration, a CART is trained on the residual of the previous trees. For the training data S_n = {(X_i, Y_i), i = 1, 2, ..., n}, X ∈ R^p, Y ∈ R, let L(y, f(x)) be the loss function and m the iteration number. The negative gradient of the training instance (x_i, y_i) is:

r_{im} = −[ ∂L(y_i, f(x_i)) / ∂f(x_i) ]_{f(x) = f_{m−1}(x)}.

A CART is fitted to (x_i, r_{im}) to get the m-th tree, whose leaf node areas are R_{jm} (j = 1, 2, ..., J_m), where J_m is the number of leaf nodes of the regression tree m. The best-fitting negative gradient value for each leaf area is calculated by:

c_{jm} = argmin_c Σ_{x_i ∈ R_{jm}} L( y_i, f_{m−1}(x_i) + c ).

The final function is:

f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J_m} c_{jm} I(x ∈ R_{jm}),

where f_0(x) = argmin_c Σ_{i=1}^{N} L(y_i, c) is the initial function and I is an indicator function.
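For squared loss, the negative gradient r_{im} reduces to the residual y_i − f_{m−1}(x_i), which makes the boosting loop easy to sketch. The toy implementation below uses one-split regression stumps as the weak learners and a shrinkage rate `lr` (both illustrative choices, not the paper's configuration):

```python
def best_stump(xs, rs):
    # one-split regression tree: leaf values are the mean residuals per side
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        cl = sum(left) / len(left) if left else 0.0
        cr = sum(right) / len(right) if right else 0.0
        sse = sum((r - cl) ** 2 for r in left) + sum((r - cr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, cl, cr)
    return best[1:]

def fit_gbdt(xs, ys, n_rounds=50, lr=0.1):
    # squared loss: negative gradient r_i = y_i - f(x_i), i.e. the residual
    f0 = sum(ys) / len(ys)                  # initial constant model f_0
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        t, cl, cr = best_stump(xs, resid)   # fit the m-th tree to residuals
        stumps.append((t, cl, cr))
        preds = [p + lr * (cl if x <= t else cr) for x, p in zip(xs, preds)]
    def predict(x):
        return f0 + lr * sum(cl if x <= t else cr for t, cl, cr in stumps)
    return predict
```

Each round fits a tree to the current residuals and adds a damped copy of it to the ensemble, exactly the f_m = f_{m−1} + Σ_j c_{jm} I(x ∈ R_{jm}) recursion above.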

LSTM
LSTM is a type of recurrent neural network (Gers et al. 2000). A single LSTM cell has a cell state and three gates: the input gate, the forget gate, and the output gate. The input gate decides what new information can be stored in the cell state. The forget gate decides what information will be thrown away from the cell state. The output gate decides what information can be output based on the cell state. The formulas of the LSTM cell are as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∘ tanh(c_t)

where c_t denotes the cell state of the LSTM, x_t and h_t denote the input value and the recurrent information at time step t, W_f, W_i, W_c, W_o and b_f, b_i, b_c, b_o are the corresponding weight matrices and biases, and σ is the sigmoid function. The operator "∘" denotes the pointwise multiplication of two vectors.
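A single forward step of the cell equations above can be written directly in NumPy; the dictionary-of-gates parameterization is just for readability (real implementations fuse the four weight matrices into one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W and b hold parameters for the four gates: forget "f", input "i",
    # candidate cell "c", and output "o"; each acts on [h_prev, x]
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate cell state
    c = f * c_prev + i * c_tilde              # new cell state
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # new hidden state
    return h, c
```

Unrolling `lstm_step` over a window of trading data and attaching a classifier head to the final h_t yields the sequence model used as one of the meta-learners.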

PLR-based model
In 2013, the PLR-WSVM model was first proposed by Luo and Chen (2013), which treated TP prediction as a four-class classification problem. The flowchart of the PLR-based model is shown in Fig. 1. Firstly, the stock history data D is collected, cleaned, and normalized. After that, the dataset D is divided into q overlapping train-test sets according to:

q = (r − r_1) / r_2,

where r represents the size of D, and r_1 and r_2 are the sizes of each train and test set, respectively. r_2 is also the update step. Next, for the i-th train and test set, the input variables are selected and the corresponding labels of TPs are generated by PLR. Subsequently, a classifier is trained on the train set by BPNN, WSVM, or RF. The TPs of the i-th test set are predicted by the trained classifier, then further divided into different trading types and corrected by prior information. After traversing the dataset D, the final predicted trading points are executed according to the trading strategy to obtain the evaluation indexes and profit rate.
The IPLR-WSVM model made a series of improvements to PLR-WSVM (Luo et al. 2017). For example, in Step 4, IPLR-WSVM removed absolute indicators and added relative indicators. In Step 5, the four-class classification problem was simplified into a binary classification problem, in which one class is the TP and the other is the ordinary point. In Step 7, the predicted TPs were divided into buying and selling points. In Step 8, the delay-one-day strategy (DODS) was proposed for trading signal confirmation. In Yao's work (Yao 2019), the PLR-based model was also adjusted. In Step 4, three sets of technical indicators were selected based on expert experience, and sentiment indicators were added to the input variables. In Step 5, four classifiers were trained by RF. The TPs were determined by majority vote in Step 6.
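The overlapping train-test partitioning can be sketched as a walk-forward split, where each window advances by the update step r_2 (an index-based sketch under the assumption that windows slide by exactly r_2):

```python
def walk_forward_splits(r, r1, r2):
    # r: size of dataset D; r1: train window size; r2: test window size,
    # which is also the step by which the window advances
    splits = []
    start = 0
    while start + r1 + r2 <= r:
        train = range(start, start + r1)
        test = range(start + r1, start + r1 + r2)
        splits.append((train, test))
        start += r2
    return splits
```

Each classifier is retrained on its train window and evaluated on the r_2 days that follow, so every test day is predicted by a model that has never seen it.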

The generation of TPs
Top-down algorithm based on the linear interpolation of PLR is used to generate the approximation line of the close prices. The result of PLR is greatly impacted by the TPs ratio. The TPs ratio directly affects the frequency of transactions, which is related to the activity of the stocks. In (Luo et al. 2017), the TPs ratio is given by expert experience, and the value is the same for all the stocks. We propose an automatic selection algorithm of the TPs ratio based on the characteristics of stocks. The cluster of TPs is also generated. As the importance of each TP is different, the weight of TP is set according to local and global factors.

The generation of TPs ratio
The TPs ratio is related to the activity of a stock, and the activity can be checked by the change rate of prices. Figure 2 shows the distributions of price change rates for 600634.SH, 600780.SH, and 000010.SZ. The stocks can be divided into three categories according to their activity (high, middle, and low). It can be seen from Fig. 2 that the distribution of 600634.SH is quite drastic, that of 600780.SH is relatively flat, and the distribution of 000010.SZ lies between the first two.
In order to find the TPs ratio of a specific stock, an automatic selection algorithm of TPs ratio is implemented. Some inputs should be prepared before finding the TPs ratio, the corresponding pseudo-code is shown in Algorithm 1.

Algorithm 1
The selection of the TPs ratio.
Input: the close price sequences of all listed stocks; n is the number of listed stocks; {p_t(i)} is the close price sequence of stock t; M is the length of the change-rate sequence of close prices; γ is the threshold of the change rate; N is the number of change rates that satisfy the threshold condition; δ is the proportion of change rates with high value; corr_γ is the Pearson correlation coefficient.

Output:
For a specific stock t, the TPs ratio is calculated by δ_t based on the threshold γ.
1: for γ = 0.01 : 0.005 : 1 do
2:   for t = 1 : n do
3:     Calculate the change rate of prices: close_rate_t(i) = |p_t(i+1) − p_t(i)| / p_t(i)
4:     Record the length of {close_rate_t(i)} as M
5:     Count the number N of change rates that satisfy close_rate_t(i) > γ
6:     Calculate the proportion of change rates with high value: δ_t = N / M
7:   end for
8:   Sort all stocks and divide them into three categories according to the value of δ_t based on γ
9:   Calculate the Pearson correlation coefficient corr_γ of δ_t for the three categories
10: end for
11: return γ when corr_γ is the lowest
In Algorithm 1, for each listed stock, the corresponding close price sequence is collected first. In line 1, the threshold of the change rate γ goes from 0.01 to 1 with a step size of 0.005. For each γ, we traverse all n listed stocks in line 2. Next, given {p_t(i)} as the close price sequence of stock t, we obtain the change-rate sequence {close_rate_t(i)} based on the formula shown in line 3. The length of {close_rate_t(i)} is recorded as M in line 4. As mentioned earlier, the TPs ratio is inseparable from the change rate of close prices. As shown in line 5, we count the change rates with high value, where the threshold is γ and the number that satisfies the condition is N. Then, for each stock t, the proportion of change rates with high value is calculated as δ_t = N/M in line 6. After traversing the loops, the corresponding proportions are obtained for all listed stocks. In line 8, all stocks are sorted according to δ_t and divided into three categories. The Pearson correlation coefficient corr_γ is calculated for each γ in line 9. The smaller the correlation coefficient among the three categories of stocks, the better the discrimination of γ. Therefore, after traversing γ, we return the γ whose corr_γ is the lowest in line 11. Once the threshold γ is determined, the proportion of high change rates δ_t for stock t is calculated, which is also the corresponding TPs ratio.
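The per-stock computation in lines 3 to 6 of Algorithm 1 amounts to the following sketch (the search over γ and the Pearson-correlation step are omitted for brevity):

```python
def high_change_ratio(prices, gamma):
    # delta_t = N / M: share of daily close-price change rates exceeding gamma
    rates = [abs(prices[i] - prices[i - 1]) / prices[i - 1]
             for i in range(1, len(prices))]
    M = len(rates)
    N = sum(rate > gamma for rate in rates)
    return N / M
```

Running this for every listed stock at a given γ yields the δ_t values that are sorted into the three activity categories.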
In Algorithm 1, the Pearson correlation coefficient is described as follows:

corr(X, Y) = Σ_i (X_i − X̄)(Y_i − Ȳ) / sqrt( Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² ),

where X̄ and Ȳ are the average values of samples X and Y, respectively. The smaller the correlation coefficient of the three types of stocks, the better the discrimination of γ. The TPs ratio determined based on γ reflects the characteristics of a specific stock. Figure 3 shows the TPs generated by Algorithm 1. It can be seen that the number of TPs in Fig. 3a is larger than in Fig. 3b, which means 000010.SZ is more active than 600780.SH in this period. The more TPs, the more trading opportunities.
The trading points are not limited to the TPs; it is still profitable to operate near the TPs. The nearby points are called interval turning points (ITPs) (You 2017). The ITPs ratio is set the same as the TPs ratio. Given {TP_i}_{i=1}^{m} as a sequence of TPs, a trading day t is an ITP of TP_i if:

|p_t − p_{TP_i}| / p_{TP_i} ≤ λ,

where p_{TP_i} is the price of the i-th TP, p_t is the price of the t-th trading day, and the t-th trading day lies between TP_{i−1} and TP_{i+1}. λ represents the threshold of the price change rate. Suppose n is the number of ITPs corresponding to λ; then λ is determined when (n + N)/M = 2δ.
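Given a price series, the TP indices, and a threshold λ, the ITP condition can be sketched as follows (the search for the λ satisfying (n + N)/M = 2δ is omitted; only the membership test is shown):

```python
def interval_turning_points(prices, tp_idx, lam):
    # a trading day t near TP_i is an ITP when |p_t - p_TPi| / p_TPi <= lam,
    # with t restricted to the days between the neighbouring TPs
    itps = set()
    for k, i in enumerate(tp_idx):
        lo = tp_idx[k - 1] + 1 if k > 0 else 0
        hi = tp_idx[k + 1] if k + 1 < len(tp_idx) else len(prices)
        for t in range(lo, hi):
            if t != i and abs(prices[t] - prices[i]) / prices[i] <= lam:
                itps.add(t)
    return sorted(itps)
```

In the full procedure, λ would be increased until the combined count of TPs and ITPs reaches twice the TPs ratio.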

The weights of TPs
TPs have different weights. As mentioned in (Luo et al. 2017), the weight of the i-th TP is set only by the change rate of the close price between adjacent TPs. This measure is not suitable for the overall evaluation of the weights. In our experiment, the weights are calculated based on local and global factors: the local factor measures the relative change between TPs, and the global factor measures the importance to the overall trend.
For the TPs sequence {TP_i}_{i=1}^{m}, the local factor μ_l is calculated not only from the change rate between TPs, but also from the difference between a TP and its neighboring TPs, where the neighborhood covers the data points between TP_i and TP_{i±1}. In formula (26), d_j(TP_i) is the distance from p_{TP_i} to the straight line formed by p_{TP_{i−j}} and p_{TP_{i+j}}, and x_j is the corresponding weight coefficient of d_j(TP_i). As shown in Fig. 4, the corresponding orthogonal distance increases when p_i changes to p'(i), so the difference between TP_i and its neighbors increases; correspondingly, the local factor μ_l is larger.
The local factor μ_l(TP_i) measures the weight of TP_i in a local range. However, μ_l(TP_i) cannot measure the importance of TP_i to the overall trend. Therefore, the global factor μ_g(TP_i) is given as another evaluation of the weight. If p_{TP_i} > p_{TP_{i+1}} and p_{TP_i} > p_{TP_{i−1}}, n_l is the number of TPs on the left of TP_i that satisfy p_{TP_i} > p_{TP_{i−l}}, and n_r is the number of TPs on the right of TP_i that satisfy p_{TP_i} > p_{TP_{i+r}}. If p_{TP_i} < p_{TP_{i+1}} and p_{TP_i} < p_{TP_{i−1}}, n_l is the number of TPs on the left of TP_i that satisfy p_{TP_i} < p_{TP_{i−l}}, and n_r is the number of TPs on the right of TP_i that satisfy p_{TP_i} < p_{TP_{i+r}}. The global factor μ_g(TP_i) is determined by comparing the relationship between TP_i and the other TPs, which fully considers the influence of each TP on the overall trend. It can be seen that the global maximum and minimum values among the TPs both have the largest global factor value, that is, μ_g(TP_x) = m − 1.
The weight of TP_i is comprehensively evaluated by combining the local factor μ_l and the global factor μ_g. Given {ITP_j}_{j=1}^{n} as the ITPs sequence of TP_i, and considering the profit that the ITPs could bring, each ITP is weighted according to the change rate of prices relative to its TP. Figure 5 shows the comparison of the TP weights generated by the old and new algorithms. In Fig. 5a, the weight of the highest TP generated by the old algorithm is only 3.9, although it is important in the whole phase. In Fig. 5b, the weights obtained from the local and global factors are more suitable for the situation. For example, the weight of the highest TP is 8.89 due to the global factor, while the weight of the 300th day is 9.88 according to the local factor. The two results are close because of their importance in different aspects.
The partial magnification of TPs, ITPs, and corresponding weights from the 80th to the 120th day are shown in Fig. 6. The ITPs are around the TPs. The weights of ITPs are also close to the corresponding TPs.

Structured and unstructured heterogeneous data
The stock market contains both structured and unstructured historical information. Technical indicators, such as trading volume and turnover, are structured information. Textual data, such as comments on company announcements, are unstructured information. The granularity of technical indicators also differs: some have a granularity of days, others of tick-by-tick records. This paper uses multi-source heterogeneous data to reflect the market environment.

Technical indicators of daily data
There are many indicators in the stock market; the candidates considered are listed in Table 2. Correlation analysis is used to select the indicators from three aspects (price, volume, and trend), based on ma(i, 50), the relative strength index (RSI), and MACD, respectively. There are five indicators for each type. ma(i, N) represents the moving average of the close price over N days. In terms of volume, RSI reflects the relationship of supply and demand in the market and the trading power. The MACD indicator analyzes the dispersion and aggregation of the moving averages, reflecting the long-term and short-term trends.

Sentiment value of unstructured information
The comments on company announcements directly reflect the enthusiasm of investors, and the analysis of comments is inseparable from the sentiment dictionary. The existing dictionary of Chinese linguistic inquiry and word count (LIWC) used in (Hao et al. 2021) has certain limitations, since some words have specific meanings due to the particularity of the stock market. In Yao's work (Yao 2019), an extended Chinese sentiment dictionary in the field of stock investment (ExSISD) was constructed, with 840 positive words and 1399 negative words. In Yao's experiment, a candidate word was added to the sentiment dictionary when its intensity exceeded a threshold. However, some candidate words turn out to be weakly correlated with the seed words. For example, the similarity between "restraint" and "bull market" is 0.02497, but "restraint" was still added to the positive sentiment dictionary. In order to improve the accuracy of the dictionary, a top-N algorithm is performed based on the seed words, which avoids the addition of neutral words. The contribution of this study includes an improved sentiment dictionary, IExSISD, which is more accurate than ExSISD and more applicable to the stock market than traditional sentiment dictionaries. The corpus of the stock market is selected from financial news and the comments on company announcements. The word vectors are trained by the continuous bag-of-words framework of Word2vec (Mikolov et al. 2013b, a) after data preprocessing. In order to obtain a more comprehensive stock sentiment dictionary, in addition to the corpus of the stock market, this paper uses the word vectors of more than 8 million Chinese words released by Tencent AI Lab (Song et al. 2018) to expand the corpus. The sentiment seed words of the stock market are selected with positive and negative types (Yao 2019).
For each seed word, candidate word i is added to the sentiment dictionary if its similarity ranks in the top 20. The corresponding intensity of candidate word i is calculated from the word vectors:

inten(i) = (1 / n_pos) Σ_{j=1}^{n_pos} cos(i, s_pos^j)  (33)

inten(i) = (1 / n_neg) Σ_{j=1}^{n_neg} cos(i, s_neg^j)  (34)

In formulas (33) and (34), candidate word i belongs to the positive and negative sentiment words, respectively. cos(i, j) denotes the cosine similarity of word vectors i and j; s_pos^j and s_neg^j are the word vectors of the positive and negative seed words; and n_pos and n_neg are the numbers of positive and negative seed words. Table 3 shows the number of sentiment words in IExSISD and ExSISD; the number in IExSISD is reduced. Table 4 shows the prediction accuracy of sentiment inclination on 500 randomly selected comments on stock announcements, of which 230 are positive, 220 negative, and 50 neutral. The HowNet sentiment dictionary (Qi et al. 2019; Dong and Dong 2003), the Baidu AI interface (Baidu 2020), and ExSISD are used as baseline dictionaries. The bold font represents the best result.
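The top-N expansion and the intensity of formulas (33)/(34) can be sketched as follows; this is a minimal illustration assuming the word vectors are already trained and held as numpy arrays (the function names and the tiny 2-d vectors are illustrative, not from the paper):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_dictionary(seed_vecs, candidate_vecs, top_n=20):
    """For each seed word, keep the top-N most similar candidate words.

    seed_vecs / candidate_vecs: dicts mapping words to numpy vectors.
    Returns each selected word with its intensity, i.e. the average
    cosine similarity to all seed words (as in formulas (33)/(34))."""
    selected = set()
    for s_vec in seed_vecs.values():
        ranked = sorted(candidate_vecs,
                        key=lambda w: cosine(candidate_vecs[w], s_vec),
                        reverse=True)
        selected.update(ranked[:top_n])
    n = len(seed_vecs)
    return {w: sum(cosine(candidate_vecs[w], s) for s in seed_vecs.values()) / n
            for w in selected}
```

With `top_n=1` and a single seed, a near-synonym is kept while an unrelated word such as "restraint" is excluded, which is exactly the filtering effect the top-N rule is introduced for.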
It can be seen from Table 4 that IExSISD is superior to the other baseline dictionaries in the classification of comments, with the highest total accuracy of 88.89%. On the one hand, this verifies that texts in the stock field have a unique way of expressing emotions, so traditional sentiment dictionaries cannot analyze sentiment inclination accurately. On the other hand, it confirms the comprehensiveness and accuracy of the IExSISD constructed in this paper. Although IExSISD contains fewer sentiment words than ExSISD, the accuracy of sentiment analysis is improved.
The sentiment values of the comments are calculated based on IExSISD. The sentiment indicators include the positive and negative sentiment ratios and the total sentiment value of each announcement:

pos(i) = card(C_pos(i)) / card({C(i)}),  neg(i) = card(C_neg(i)) / card({C(i)})

where pos(i) and neg(i) are the positive and negative sentiment ratios on the i-th trading day; C_pos(i), C_neg(i), and {C(i)} are the sets of positive, negative, and all comments; and score(c), the sentiment value of each comment, is calculated based on inten(i). Since announcements are issued after the trading day closes and affect the next trading day, {C(i)} contains all comments from the issue of the announcement to the opening of the next trading day. The reading volume read(i) and the number of comments card({C(i)}) are also added to the sentiment indicators. In summary, the selected indicators are listed in Table 5.
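The per-announcement indicators can be computed with a short helper; this is a sketch assuming each comment has already been scored with a signed sentiment value from the IExSISD intensities (positive scores > 0, negative < 0), which is a plausible reading of the unnumbered ratio formulas rather than the paper's exact code:

```python
def sentiment_indicators(comment_scores):
    """comment_scores: list of per-comment sentiment values score(c).

    Returns (positive ratio pos(i), negative ratio neg(i),
    total sentiment value) for one announcement day."""
    n = len(comment_scores)
    if n == 0:
        # No comments between the announcement and the next opening.
        return 0.0, 0.0, 0.0
    pos = sum(1 for s in comment_scores if s > 0) / n
    neg = sum(1 for s in comment_scores if s < 0) / n
    total = sum(comment_scores)
    return pos, neg, total
```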

Data of different granularities
According to the historical data with a granularity of days, the i-th day is predicted to be a trading point. However, the price fluctuates throughout the i-th day, and the daily limit on increase and decrease is 10% on the Shanghai and Shenzhen Stock Exchanges. Higher profit can be obtained if the model also predicts the extremum moment within the i-th trading day. In the experiment, data with a granularity of tick-by-tick is therefore used as the research object: the timeline data and plate information before the predicted TP are analyzed to model the extremum. The indicators are the tick-by-tick transaction data and order data selected from the Level-2 plate information.
The selected plate indicators are listed in Table 6. The prediction of the extremum is treated as a binary classification problem: the extremum versus all other moments. Considering that it is still feasible to trade around the extremum, formulas (38) and (39) classify the extremum and its surrounding prices into one label, where d_h, d_m, and d_l denote the highest price, the median price, and the lowest price of the t-th trading day, respectively.
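The labeling scheme can be sketched as below. Since formulas (38) and (39) are not reproduced here, the width of the "surrounding" band is an assumption (a fraction of the distance between the extremum and the median price d_m); only the idea that a region around the extremum shares label 1 comes from the paper:

```python
def label_ticks(prices, side, band=0.25):
    """Label tick prices for extremum prediction.

    side="buy": the extremum is the lowest price d_l;
    side="sell": the extremum is the highest price d_h.
    Prices within `band` of the way from the extremum toward the
    median price d_m are labeled 1, all others 0."""
    d_h, d_l = max(prices), min(prices)
    d_m = sorted(prices)[len(prices) // 2]
    if side == "buy":
        cut = d_l + band * (d_m - d_l)
        return [1 if p <= cut else 0 for p in prices]
    cut = d_h - band * (d_h - d_m)
    return [1 if p >= cut else 0 for p in prices]
```

Because label 1 covers a region rather than a single point, the strategy can trade as soon as any price inside the band is predicted, which is the property discussed later in the comparison of MF-MSD and MF-MSHD.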

Model fusion
After the generation of TPs and the selection of input variables, a weighted binary classification problem is constructed for each combination of input indicators and learning algorithm; the resulting meta-models are listed in Table 7. For each set of input indicators in the test set, there are four sets of outputs corresponding to the four algorithms. The results include the predicted TPs, the F1 value, the profit brought by the trading strategy based on the prediction, and so on. The F1 value is used as a more comprehensive evaluation of the model:

F1 = 2 · AccTP · RecTP / (AccTP + RecTP)  (40)

where AccTP and RecTP are the accuracy and recall of the model. It should be noted that RecTP is calculated according to the ITPs. At the end of the test set, all holdings or short-sold stocks are cleared at the close price of the last trading day. The profit of the entire investment period is

profit = (b_m − v_m) / v_m  (41)

where b_m and v_m denote the balance money and the total invested money, respectively. Taking the PR indicators as an example, the four meta-models are RF_PR, WSVM_PR, BPNN_PR, and GBDT_PR. After the four binary classifiers are obtained, they are used to predict the TPs in the test set. For a piece of test data, there are four outputs y_RF_PR(t), y_WSVM_PR(t), y_BPNN_PR(t), and y_GBDT_PR(t). The F1 value and the profit of each meta-model are calculated from the TPs it predicts during the train period. The classification of a test data point is then obtained by combining the outputs of the four meta-models according to formulas (42) and (43), where ω_PR(t) is the weight of the t-th trading day, y_i(t) is the predicted result of meta-model i, and F1(i) and profit(i) are the F1 value and the profit of meta-model i during the train period. In the stock market, besides accuracy, investors pay more attention to the profit brought by the model. In the case of y_PR(t) = 1, the t-th test data point is classified as a TP by the PR model. y_PR(t), y_VOL(t), y_TR(t), and y_SEN(t) are the predicted results for the four sets of input indicators based on formulas (42) and (43).
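The first-level fusion for one indicator set can be sketched as a weighted vote over the four meta-model outputs. Since formulas (42) and (43) are not printed here, weighting each meta-model by F1(i) + profit(i) and thresholding the normalized weight at 0.5 are assumptions about their exact form; the paper only states that the weight depends on F1 and profit during the train period:

```python
def first_level_fusion(preds, f1, profit, threshold=0.5):
    """Weighted vote over the meta-model outputs for one indicator set.

    preds:  {model_name: 0/1 prediction y_i(t)}
    f1, profit: per-model F1 value and profit on the train period.
    Returns (fused 0/1 label, weight omega(t))."""
    weights = {m: f1[m] + profit[m] for m in preds}
    total = sum(weights.values())
    omega = sum(weights[m] * preds[m] for m in preds) / total
    return (1 if omega >= threshold else 0), omega
```

The second-level fusion of y_PR(t), y_VOL(t), y_TR(t), and y_SEN(t) into y(t) follows the same pattern, applied over the four first-level models instead of the four algorithms.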
The final classification y(t) of the test data is given by formulas (44) and (45), where i ∈ {PR, VOL, TR, SEN}. The probability that a test data point is a TP increases as ω(t) increases. In our experiment, the division of the TPs and DODS is used before trading (Yao 2019). The predicted results y(t) are divided into buying points (y(t) = 1), selling points (y(t) = 4), and ordinary points (y(t) = 0).
Since the daily limit on increase and decrease is 10% on the Shanghai and Shenzhen Stock Exchanges, after the t-th day is predicted to be a trading point based on the daily technical indicators and sentiment information, we further predict the extremum moment of the t-th trading day in order to obtain more profit. LSTM can grasp the structure of data dynamically (Long et al. 2020; Yao et al. 2018), so we apply LSTM to solve this binary classification problem. The plate indicators listed in Table 6 of the two days before the t-th trading day are analyzed with LSTM, with the corresponding labels obtained by formulas (38) and (39); the plate indicators of the t-th trading day are used as the test set. It should be noted that if the extremum is not predicted by the LSTM model, the close price of the t-th trading day is still used as the transaction price in the trading strategy.
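The gating recurrence that lets LSTM "grasp the structure of data dynamically" can be written out in a few lines of numpy. This is a single-cell forward step showing the mechanism the paper's Keras model uses internally, not the paper's actual network (layer sizes, stacking, and the classification head are omitted):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step on input x with hidden state h and cell state c.

    W: (4n, d) input weights, U: (4n, n) recurrent weights, b: (4n,) bias,
    stacked in the order [input gate, forget gate, output gate, candidate]."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = 1 / (1 + np.exp(-z[:n]))        # input gate: how much new info enters
    f = 1 / (1 + np.exp(-z[n:2*n]))     # forget gate: how much old state survives
    o = 1 / (1 + np.exp(-z[2*n:3*n]))   # output gate: how much state is exposed
    g = np.tanh(z[3*n:])                # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

Running this step over the two days of tick-level plate indicators, then feeding the final h into a sigmoid output, yields the binary extremum/non-extremum prediction described above.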
Structured and unstructured heterogeneous data with different granularities are used to obtain comprehensive information about the stock market, and different algorithms are introduced to find the potential laws. This trading signal prediction model is called MF-MSHD. The framework is shown in Fig. 7, where y_PR(t), y_VOL(t), y_TR(t), and y_SEN(t) are first-level fusion results calculated from the meta-models listed in Table 7 according to formulas (42) and (43), with different weights for different meta-models. The final fusion result y(t) is calculated according to formulas (44) and (45).
Moreover, a position and risk management strategy can regulate the trading behavior. A more flexible dynamic investment strategy is proposed below as an improvement of the regular one (Luo et al. 2017), and the profit in formula (41) is calculated based on it. When y(t) = 1, a purchase is made, where b_s and c denote the balance number of stocks and the transaction cost rate, respectively. If y(t) = 0, the strategy recommends holders to liquidate the stock when the loss exceeds the threshold, where p(TP_i) is the close price of the i-th TP, the nearest TP before the t-th trading day. For the meta-models listed in Table 7, which form the bottom of MF-MSHD, there is no weight information for the TPs, so their profit is calculated with ω = 1.
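One day of this dynamic strategy can be sketched as a state transition on (cash, shares, entry price). The stop-loss threshold of 5% and the all-in position sizing are assumptions used for illustration, since the paper's exact formulas for the buy/sell quantities and the loss threshold are not reproduced here; only the transaction cost rate c and the liquidate-on-excess-loss rule come from the text:

```python
def step_strategy(signal, price, cash, shares, entry_price,
                  cost=0.003, stop_loss=0.05):
    """One trading day. signal: 1 = buy, 4 = sell, 0 = ordinary point.

    Returns the updated (cash, shares, entry_price)."""
    if signal == 1 and cash > 0:
        # Buying point: convert all cash to shares, paying cost rate c.
        bought = cash * (1 - cost) / price
        return 0.0, shares + bought, price
    if signal == 4 and shares > 0:
        # Selling point: liquidate the whole position.
        return cash + shares * price * (1 - cost), 0.0, entry_price
    if signal == 0 and shares > 0 and price < entry_price * (1 - stop_loss):
        # Ordinary point: stop the loss once it exceeds the threshold.
        return cash + shares * price * (1 - cost), 0.0, entry_price
    return cash, shares, entry_price
```

At the end of the test period, any remaining position is cleared at the last close price, and the profit of formula (41) is (b_m − v_m) / v_m over the resulting balance.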
Data standardization is a prerequisite. This paper chooses Z-score standardization:

x̄_i = (x_i − μ) / σ,  with μ = (1/n) Σ_{i=1}^{n} x_i and σ = sqrt((1/n) Σ_{i=1}^{n} (x_i − μ)²)

where x_i is the original value, x̄_i is the standardized value, and n is the number of samples.
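Z-score standardization is a one-liner in numpy (a sketch using the population standard deviation; with scikit-learn the equivalent is `StandardScaler`):

```python
import numpy as np

def z_score(x):
    """Standardize samples: subtract the mean, divide by the std,
    so the result has zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```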

Experimental results
In order to demonstrate the performance of the proposed model, a series of experiments is carried out on 18 stocks selected randomly from the Chinese stock market. For each stock, the daily dataset is divided into a train set and a test set; the time span is from Jul. 31st, 2016 to Jul. 31st, 2018 for the train set and from Aug. 1st, 2018 to Aug. 1st, 2020 for the test set. The corresponding TPs ratio δ is calculated according to Algorithm 1. The randomly selected stocks and δ are listed in Table 8, from which it can be seen that activity differs between stocks: the higher δ is, the higher the frequency of transactions. Through Algorithm 1, the TPs ratio is set according to the characteristics of each stock, which is more in line with actual operating conditions. The experiments implement the algorithms with scikit-learn, TensorFlow, and Keras, all based on Python. The public parameter for all of the models is the transaction cost rate, which is set to c = 0.003. Default settings are used for some parameters; the main parameters are listed in Table 9.
According to stock analysis theories, for the daily data there are three sets of technical indicators: PR, VOL, and TR. The sentiment indicators SEN of the comments are calculated based on IExSISD. The TPs clusters and the corresponding weights are set according to Sect. 3.1. For each group of indicators, four algorithms are used to solve the binary classification problem, and the fusion method introduced in Sect. 3.3 is performed to obtain the predicted TPs. In addition to the previously mentioned AccTP, RecTP, and F1, the predicted TPs ratio (Num) and the fatal error ratio (CerTP) are also calculated to verify the capabilities of the model, and the profit is obtained by the improved trading strategy. Tables 10, 11, 12, and 13 mainly compare the performance of the algorithms on different data. Each algorithm has its advantages. The Num of RF and BPNN each wins on two models, providing more TPs. In terms of AccTP, BPNN wins on three models (PR, VOL, and TR) and WSVM on one (SEN). The RecTP of GBDT and BPNN each wins on two models, which means that enough true TPs have been captured. In terms of CerTP, the performance of every algorithm is good, with values of only a few per mille. As F1 is related to AccTP and RecTP, BPNN wins on three models (TR, VOL, and SEN) and GBDT on one (PR). The profit of each algorithm differs: GBDT wins on two models (PR, TR), RF on one (VOL), and BPNN on one (SEN), which indicates that the nature of the data differs. In the VOL and TR models, more profit is created by our fusion rule: the highest profit of the meta-models in the VOL model is 0.2841, which increases to 0.2894 after fusion, and the profit increases from 0.2255 to 0.2549 in the TR model. The corresponding standard deviations of WSVM are always the lowest, which shows the stability of the WSVM models.
In the PR and SEN models, the highest profit among the meta-models is higher than the fusion result, but the corresponding standard deviations of the meta-models are worse, indicating that the stability of the fusion model is improved. Overall, the fusion results improve stability and profitability comprehensively. Table 14 shows the prediction results of MF-MSD based on the PR, VOL, TR, and SEN models. From Table 14, it can be seen that the four meta-models have their own merits in different aspects, laying a good foundation for the final determination of the TPs. The Num of the VOL model is 0.1311; more trading opportunities are created, which reduces investment risk and improves the stability of profit. The performance of AccTP and RecTP in the three technical models is close: for AccTP, the maximum is 0.4030 and the minimum 0.3974, ensuring the accuracy of the classification. In the SEN model, the AccTP is 0.3488 and the RecTP only 0.1791; the reason is that sentiment indicators are only calculated when announcements occur, which leaves a large number of TPs unpredicted. The AccTP after fusion is the highest, reaching 0.4136, indicating that the fusion strategy brings higher accuracy to the prediction. The CerTP is low in all of the models; moreover, the models can reduce losses through the trading strategy with position and risk management. Since F1 is calculated based on AccTP and RecTP, the value of the SEN model is still the lowest. The profit brought by MF-MSD based on the fusion rule is the highest, reaching 0.4869, which is much higher than each meta-model, and the corresponding standard deviations of MF-MSD are always the best, indicating the stability of the fusion model. MF-MSD combines the advantages of the PR, VOL, TR, and SEN models and has excellent capability in TPs prediction and trading point determination.
To compare the performance of each meta-model and MF-MSD in the prediction of trading signals, the area under the ROC curve (AUC score) is calculated based on the sensitivity and specificity during the test period. Figure 8 shows the average AUC score of the meta-models. All AUC curves tend to be stable, and the values are greater than 0.5, which reflects the stability of each meta-model. The stability and reliability of the meta-models lay the foundation for model fusion. Figure 9 shows the average AUC score of the PR, VOL, TR, and SEN models and MF-MSD. The AUC score of MF-MSD is the best among all models through almost the entire test period, indicating that the confidence of the model is improved.
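The AUC used here can be computed directly from the Mann-Whitney formulation: the probability that a randomly chosen true TP receives a higher score than a randomly chosen non-TP. A minimal numpy sketch (equivalent in result to a library call such as scikit-learn's `roc_auc_score`):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the win rate of positive scores over negative scores,
    counting ties as half a win."""
    y_true = np.asarray(y_true)
    pos = np.asarray(y_score)[y_true == 1]
    neg = np.asarray(y_score)[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring, which is why all curves staying above 0.5 supports the reliability of the meta-models.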
For investors, it is crucial to obtain steady, growing profit. Figure 10 shows the average profit rate of the PR, VOL, TR, and SEN models and MF-MSD during the test period. The curve is upward with small retracements in each model, and the profit rate of MF-MSD is the best among all models through almost the entire test period. Once a retracement occurs, the loss is stopped in time by the improved trading strategy. In order to show the improvement brought by the weights and ITPs, the method in (Luo et al. 2017) is used as the baseline to obtain the four meta-models; MF-noITP is the result after their fusion, with the profit again obtained by the improved trading strategy. The results are listed in Table 15. The meta-models in MF-noITP have their own advantages in different aspects, laying a good foundation for the determination of the TPs: for Num, the PR model provides more trading opportunities; for AccTP, the TR model ensures the accuracy of the classification; and the SEN model has the best RecTP and F1. Comparing Table 15 with Table 14, the Num of each model drops drastically, especially for the VOL model, whose value is 0.0293 versus 0.1311 in MF-MSD, so too many TPs are not predicted. AccTP, RecTP, and F1 also decline, with best values of 0.2906, 0.1663, and 0.1910 in Table 15, versus 0.4136, 0.3118, and 0.3411 in MF-MSD. The profit of MF-noITP is 0.2667, better than its meta-models in Table 15, showing that the fusion rule can well combine the advantages of the meta-models; however, it is still lower than the 0.4869 of MF-MSD. The weaker predictive ability of the model without ITPs and improved weights can be explained as follows: the addition of ITPs and weights increases the TPs and thereby the frequency of transactions, so the profit and its stability rise accordingly.
In order to show the effectiveness of feature selection, all the features in the pool are used as the input of the four algorithms; MF-noFS is the result after the fusion of the algorithms, with the profit obtained by the improved trading strategy. The results, shown in Table 16, are not satisfactory. The performance of the meta-models varies greatly, especially for Num and RecTP. The Num of GBDT is 0.1752, which is acceptable; however, the Num of WSVM and GBDT are only 0.0454 and 0.0247, respectively, so there are almost no transactions throughout the test period. Compared with MF-MSD in Table 14, where the performance of the meta-models is stable, this indicates that the stability of the meta-models is improved by the first-level model fusion. For MF-noFS, although the values of AccTP, RecTP, and F1 are 0.3725, 0.3867, and 0.3679, close to those of Table 14, the profit is only 0.2470, indicating that most of the TPs predicted by MF-noFS are actually ITPs while the important TPs are not accurately predicted. The TP prediction ability of the model without feature selection is insufficient.
Note that the profit rates of the PR model (Table 10), VOL model (Table 11), TR model (Table 12), SEN model (Table 13), MF-MSD (Table 14), MF-noITP (Table 15), and MF-noFS (Table 16) are all calculated based on the improved trading strategy, with weight information and risk control. In order to show the effectiveness of the improved trading strategy, taking MF-MSD as an example, we perform three different trading strategies; the corresponding profit rates are shown in Table 17. The profit of the improved strategy is obtained with both the weights and risk control; Baseline_1 is the trading strategy with weight information only; Baseline_2 is the traditional trading strategy (Luo et al. 2017). As can be seen in Table 17, whether in the meta-models or MF-MSD, the profit obtained by the improved trading strategy is always the highest. On the one hand, since the weights are set according to the TPs predicted by the meta-models, the profit reflects the accuracy of the prediction; on the other hand, the position and risk management strategy reduces the losses caused by errors. Figure 11(a) shows the average profit rate of MF-MSD, MF-noITP, and MF-noFS during the test period; the profit rate of MF-MSD is almost the best through the entire test period. Figure 11(b) shows the average profit rate of the different trading strategies. The overall trend of the three curves is the same, and the profit of the improved trading strategy is always the highest, indicating that it is reasonable and effective to trade with weights and risk control. The plate information of the two days before the predicted trading day is analyzed with LSTM, and the data of the predicted trading day is used as the test set. Table 18 lists the profit of the 18 stocks under MF-MSD and MF-MSHD, with the buy-and-hold strategy (BHS) as the baseline method.
The results in Table 18 show that the profit of five stocks decreases while that of thirteen stocks increases. This should be related to the labeling method: as shown in formulas (38) and (39), label 1 covers a region rather than a fixed value, and in the trading strategy we trade immediately once the trading moment is predicted. Although the profit of five stocks decreases, the overall profit of MF-MSHD is not affected, indicating the accuracy of the TPs predicted by MF-MSD. As a result of introducing data of different granularities, the average profit of the 18 stocks increases from 0.4869 to 0.5172, while that of BHS is only 0.0992. Over the same test period, the increases of the Shanghai Composite Index, Shenzhen Composite Index, and CSI 300 Index are 0.1648, 0.4051, and 0.3397, respectively.
In MF-MSHD, both structured and unstructured data are considered, and more differences between meta-models are created to improve the generalization and stability of the model. The trading day is determined after the fusion of the meta-models, and LSTM is then used to predict the exact trading moment for real-time operation. FastRNNs and FASTRNN_CNN_BiLSTM consider both computation speed and accuracy to make real-time predictions of stock prices (Yadav et al. 2022); although these two models were proposed to predict exact prices, their frameworks can be used in comparison experiments with adjusted input and output. Therefore, after the t-th day is predicted to be the trading point based on the daily technical indicators and sentiment information, LSTM is replaced by FastRNNs and FASTRNN_CNN_BiLSTM to predict the exact trading moment, yielding the models MF-FastRNNs and MF-FASTRNN, respectively. CNN-LSTM proposed by Hao and Gao (2020) also utilizes multi-time-scale feature learning to predict stock trends. GAF-CNN models the time series as images and then uses a CNN for stock trend analysis (Barra et al. 2020). SSACNN and SACLSTM are also CNN-based models which extract features from financial time series and make predictions by classification (Wu et al. 2021b, c). Note that the trends in SSACNN and SACLSTM were divided into three categories, +1, -1, and 0; in the comparison experiments there are only two categories, TP or not, and only historical data is considered. In summary, MF-MSHD is compared with MF-FastRNNs, MF-FASTRNN, CNN-LSTM, GAF-CNN, SSACNN, and SACLSTM. The main parameters of the comparison models are the same as in the corresponding references, and the running time of each model is recorded in seconds. Table 19 shows the average results of 20 repeated experiments on the 18 stocks for MF-MSHD and the comparison models, with the best results marked in bold.
As can be seen from Table 19, the computation efficiencies of MF-MSHD, MF-FastRNNs, MF-FASTRNN, CNN-LSTM, and SSACNN are close to each other, while the computation efficiency of SACLSTM is low and GAF-CNN requires the longest training time. The results of MF-MSHD, MF-FastRNNs, and MF-FASTRNN are the same except for slight differences in Profit and Time, because the three models differ only in the algorithm used to predict the exact trading moment. For convenience, the profit rate of each stock under the different models is also listed in Table 20; for each stock, the result is the average of 20 repeated experiments.
As can be seen in Table 20, the performance of these models differs on each stock. For example, the profit rate of 000010.SZ in CNN-LSTM reaches 2.8167, much higher than the other models, while the performance on 600753.SH is quite different: SSACNN has the best profit rate there with a value of 3.4982, and the same happens on 603789.SH. It is therefore necessary to analyze the stability of each model, which is reflected in the overall average profit rate. Figure 12 shows the average profit rate of the 18 stocks during the test period for MF-MSHD, MF-FastRNNs, MF-FASTRNN, CNN-LSTM, GAF-CNN, SSACNN, and SACLSTM.
In Fig. 12, during the test period, the profit rate of MF-MSHD rises steadily with small retracements and slight fluctuations, the same as MF-FastRNNs and MF-FASTRNN. For CNN-LSTM, although the overall trend is upward, great fluctuations occur. Generally speaking, in high-risk situations investors hope to obtain stable and high profit. Although the profit rate of CNN-LSTM exceeds that of MF-MSHD around the 50th and 200th days, it varies greatly and falls back below MF-MSHD in the following operations. The curves of GAF-CNN, SSACNN, and SACLSTM are more stable than CNN-LSTM, but their overall profit rates lie under the curve of MF-MSHD. Still, compared with the increase of 0.0992 in BHS, the performance of CNN-LSTM, GAF-CNN, SSACNN, and SACLSTM is acceptable. Considering Table 19, although the CNN-based models predict more TPs, their profit rates during the test period are not as good as that of MF-MSHD, indicating that the proposed MF-MSHD predicts important trading points more accurately. Moreover, the differences between the meta-models are obvious, which improves the generalization and stability of MF-MSHD, and the analysis of fine-grained data based on LSTM further increases the profitability of the model with lower computational complexity than FastRNNs and FASTRNN_CNN_BiLSTM. Figures 13 and 14 show the prediction results of the meta-models, MF-MSD, and MF-MSHD on stock 000010.SZ. It can be seen that, due to the fusion strategy, some wrong signals of the meta-models are eliminated and the correct signals are strengthened. MF-MSHD uses the adjusted price in Fig. 14b, and the prices of some trading moments are marked in Fig. 14a and 14b. For example, for the penultimate predicted point, the adjusted price of 4.42 is higher than the close price of 4.3 at which a selling point is predicted. The results show the effectiveness of introducing data of different granularities.

Conclusions
In this paper, MF-MSHD makes improvements in many aspects.
(1) A model fusion method based on multi-source data and different learning algorithms is proposed for TPs prediction. Multi-source data include unstructured and structured information with different granularities. RF, WSVM, BPNN, GBDT, and LSTM are selected as the learning algorithms, and the differences between meta-models are designed to be as large as possible.
(2) In the fusion rule, the predicted TPs are determined based on the performance of the meta-models.
(3) The TPs and ITPs are generated according to the characteristics of each specific stock, with the corresponding weights set according to local and global factors.
(4) The sentiment indicators in the SEN model are analyzed based on IExSISD, a more accurate sentiment dictionary of stock market comments.
(5) Fine-grained data is used to jointly determine the exact transaction moment on the trading day based on an LSTM network.
(6) The improved trading strategy with position and risk management is performed using the weights of the predicted TPs.

Due to the use of multi-source heterogeneous data, MF-MSHD conducts a comprehensive analysis of a variety of information to provide trading decisions. The experimental results show that the meta-models have their own merits in different aspects, laying a good foundation for the final determination of the trading moment, and that the fusion strategy improves both accuracy and profitability. The average profit rate of MF-MSHD is 0.5172, much higher than the 0.0992 of BHS; over the same period, the increases of the Shanghai Composite Index, Shenzhen Composite Index, and CSI 300 Index are 0.1648, 0.4051, and 0.3397, respectively. MF-MSHD shows great profitability in the field of stock investment.
Although the MF-MSHD proposed in this paper has a good performance in the prediction of trading signals, there are still many issues to be further studied.
(1) The experiment analyzes the information separately based on the properties, and determines the TPs based on the fusion strategy. Both data fusion and decision fusion could be introduced to improve the performance of the model. More information at home and abroad also needs to be considered.
(2) The exact trading moment of the trading day needs to be determined with a balance of computational efficiency and accuracy to ensure real-time operations.
(3) Deep learning-based models such as CNNs and graph convolutional networks (GCN) can be used in the construction of the meta-models.

Data availability All the datasets used are publicly available.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent All authors checked the final draft and agreed on the submission.