Penalized logistic regressions with technical indicators predict up and down trends

Correctly predicting up and down trends for stock prices is of immense importance in financial markets. To further improve prediction performance, in this paper we introduce five penalties: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation and minimax concave penalty, to logistic regressions with 19 technical indicators, and propose five penalized logistic regressions to predict up and down trends for stock prices. Firstly, we translate the five penalized logistic log-likelihood functions into five penalized weighted least squares functions and combine them with tenfold cross-validation to compute the solution paths of the parameter estimators. Secondly, we combine the binomial deviance with the cross-validation error as a risk measure to choose an appropriate tuning parameter for the penalty functions, and apply the training set and the coordinate descent algorithm to obtain parameter and probability estimators. Thirdly, we employ the testing set and the chosen optimal thresholds to construct two-class confusion matrices and receiver operating characteristic curves to assess the prediction performance of the five regressions. Finally, we compare the proposed five penalized logistic regressions with logistic regression, support vector machine and artificial neural network, and find that the minimax concave penalty logistic regression performs best in predicting up and down trends for Google's stock prices. Therefore, in this paper we propose five new prediction methods that improve the prediction accuracy of stock returns and can bring economic benefits to investors.


Introduction
The stock market exhibits inherent characteristics such as model uncertainty, parameter instability and noise accumulation, which make stock market prediction complex. Different viewpoints have sprung up in economics and finance. For example, both the efficient market hypothesis and random walk theory assume that the stock market is unpredictable, whereas Dow theory and Murphy (1999) assume that financial markets are predictable. In particular, Murphy (1999) proposed many technical indicators and developed technical analysis methods for financial markets, whereas Elliott et al. (2013) systematically summarized economic forecasting problems, emphasized the challenges of stock price forecasting and provided strategies to improve forecasting performance. In recent years, machine learning methods have been proposed to predict the stock market. For example, Wang and Zhu (2010) developed support vector regression and a two-step kernel learning method for financial time series prediction. Nair et al. (2011) proposed an adaptive artificial neural network (ANN) to predict the second-day closing price of a stock market index. Cavalcante et al. (2016) systematically reviewed progress on artificial intelligence, neural networks and support vector machines (SVMs) in predicting changes in stock price or direction. Zhang et al. (2018) proposed a novel stock price trend prediction system that predicts both stock price movement and its interval of growth (or decline) rate within predefined prediction durations. Wen et al. (2019) introduced a new method to simplify noisy financial time series via sequence reconstruction by leveraging motifs (frequent patterns), and then utilized a convolutional neural network to predict up and down trends for stock prices. Nabipour et al. (2020) applied machine learning and deep learning algorithms to significantly reduce the risk of trend prediction.
Shen and Shafiq (2020) proposed a comprehensive customization of feature engineering and deep learning-based model to predict price trends for China's stock markets.
It is well known that public sentiment is closely linked to financial markets. In recent years, the impact of investor sentiment on stock returns has been investigated. For example, Joshi et al. (2016) predicted future stock movements through news sentiment classification. Li et al. (2017) proposed a long short-term memory neural network combining investor sentiment with market factors to improve prediction performance. Xing et al. (2019) proposed a novel sentiment-aware volatility forecasting model that produces more accurate estimates of the temporal variances of asset returns by capturing the bi-directional interaction between asset price movements and market sentiment. Khan et al. (2020) proposed machine learning methods with sentiment and situational features to predict future stock movements. Li et al. (2021) constructed return distributions for the Shanghai Security Composite Index by adding sentiment-aware variables. In addition, market sentiment perspectives and public sentiment-driven portfolio or asset allocation have also been analyzed. For example, Malandri et al. (2018) discussed how public sentiment affects portfolio management, and subsequent studies investigated the role of market sentiment in asset allocation problems and proposed formalizing public sentiment as market views integrated into modern portfolio theory. Picasso et al. (2019) combined technical analysis with sentiment analysis of news and constructed a portfolio return forecasting model by machine learning.
Predicting up and down trends for stock prices is an important puzzle in the financial field. Even very small improvements in prediction performance can be very profitable. For example, Hu and Jiang (2021) proposed logistic regression with 6 technical indicators to predict up and down trends for Google's stock prices and obtained higher prediction accuracy. In this paper we introduce five penalties: ridge, least absolute shrinkage and selection operator (LASSO), elastic net, smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP), to logistic regressions with 19 technical indicators, and propose five penalized logistic regressions to further improve the prediction performance for stock returns. Firstly, we combine the iteratively reweighted least squares algorithm with tenfold cross-validation, calculate the full solution path of the model parameters and select a specific solution path from it. Secondly, we combine the binomial deviance with the cross-validation error as a risk measure to choose an appropriate tuning parameter λ, and apply the training set and the coordinate descent algorithm to obtain parameter and probability estimators. Thirdly, we employ the testing set and the chosen optimal thresholds to construct two-class confusion matrices and receiver operating characteristic (ROC) curves to assess the prediction performance of the five regressions. Finally, we compare the proposed five penalized logistic regressions with logistic regression, SVM and ANN, and find that the MCP logistic regression performs best in predicting stock returns. We therefore recommend that investors employ the MCP logistic regression to predict up and down trends for stock prices and gain greater economic benefits.
The rest of this paper is organized as follows. In Sect. 2, we establish the five penalized logistic regressions with technical indicators. In Sect. 3, we apply the training set to learn the five penalized logistic regressions and obtain parameter and probability estimators. In Sect. 4, we adopt the testing set to obtain two-class confusion matrices and ROC curves for the five regressions to assess their prediction performances. In Sect. 5, we compare the proposed five prediction methods with logistic regression, SVM and ANN.

Penalized logistic regressions
Let C_t be the closing price of a given stock at the end of the t-th trading day, and let K_t = C_{t+1} − C_t be the stock excess return. We define the direction indicator

Y_t = 1 if K_t > 0 (up trend), Y_t = 0 if K_t ≤ 0 (down trend). (1)

The main goal of this paper is to predict up and down trends for stock prices. In the following we apply a training set D = {x_t, y_t}_{t=1}^n to learn up and down trends for stock prices and to construct a two-category classification rule that may be hidden deep in the raw dataset, where x_t is a sample from the predictor vector X_t, whose distribution is usually unknown. It is well known that logistic regression is a powerful two-category classification method. In this paper we combine logistic regression with the technical analysis developed by Murphy (1999) and propose the following logistic regression with 19 technical indicators:

P(Y_t = 1 | X_t) = exp(β_0 + X_tᵀβ) / (1 + exp(β_0 + X_tᵀβ)), (2)

or equivalently, in terms of the log-odds,

log[P(Y_t = 1 | X_t) / (1 − P(Y_t = 1 | X_t))] = β_0 + X_tᵀβ, (3)

where β_0 is an unknown intercept term, β = (β_1, β_2, ..., β_19)ᵀ is an unknown parameter vector, and X_t = (X_{t,1}, X_{t,2}, ..., X_{t,19})ᵀ is the predictor vector composed of the 19 technical indicators listed in Table 1. To avoid multicollinearity and overfitting, we introduce five penalties for logistic regression to remove technical indicators that are irrelevant to up and down trends for stock prices, and construct five penalized logistic regressions. Let x_t = (x_{t,1}, x_{t,2}, ..., x_{t,19})ᵀ and y_t be the observed samples of X_t and Y_t, respectively. Given the training set {x_t, y_t}_{t=1}^n, we obtain the negative log-likelihood function

l(β_0, β) = −(1/n) Σ_{t=1}^n [ y_t (β_0 + x_tᵀβ) − log(1 + exp(β_0 + x_tᵀβ)) ] (4)

and the penalized negative log-likelihood function

Q(β_0, β; λ, γ) = l(β_0, β) + Σ_{j=1}^{19} p_{λ,γ}(β_j), (5)

where p_{λ,γ}(·) is a penalty function on the coefficients indexed by a tuning parameter λ that controls the trade-off between the loss function and the penalty, and possibly shaped by an additional regularization parameter γ. In this paper we choose the five penalty functions listed in Table 2.
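To make the objective in (4)-(5) concrete, the negative log-likelihood and the three non-ridge penalties can be sketched in Python. This is a minimal illustration under our own function names, not the paper's R implementation; the SCAD and MCP formulas follow their standard piecewise definitions (Fan and Li 2001; Zhang 2010).

```python
import numpy as np

def neg_log_likelihood(beta0, beta, X, y):
    """Average negative log-likelihood of the logistic model (4)."""
    eta = beta0 + X @ beta                  # linear predictor
    # log(1 + exp(eta)) - y*eta, computed stably
    return np.mean(np.logaddexp(0.0, eta) - y * eta)

def lasso_penalty(beta, lam):
    """LASSO: lam * sum |beta_j|."""
    return lam * np.sum(np.abs(beta))

def scad_penalty(beta, lam, gamma=3.7):
    """SCAD penalty applied coordinate-wise."""
    b = np.abs(beta)
    small = lam * b
    mid = (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    big = np.full_like(b, lam**2 * (gamma + 1) / 2)
    return np.sum(np.where(b <= lam, small,
                  np.where(b <= gamma * lam, mid, big)))

def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty applied coordinate-wise."""
    b = np.abs(beta)
    return np.sum(np.where(b <= gamma * lam,
                           lam * b - b**2 / (2 * gamma),
                           gamma * lam**2 / 2))

def penalized_objective(beta0, beta, X, y, penalty, lam, **kw):
    """Objective (5); the intercept beta0 is not penalized."""
    return neg_log_likelihood(beta0, beta, X, y) + penalty(beta, lam, **kw)
```

Note that for large coefficients both SCAD and MCP flatten out (the penalty becomes constant), which is why they shrink large effects less than LASSO does.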

Parameter estimators and probability estimators
The penalized negative log-likelihood function (5) is not differentiable at zero, and the negative log-likelihood function (4) has no closed-form minimizer. Hence, if the current estimates of the parameters are (β̃_0, β̃(m)), we form a quadratic approximation to (4) about the current estimates, which yields the weighted least-squares function

l_Q(β_0, β) = (1/(2n)) Σ_{t=1}^n w_t (z_t − β_0 − x_tᵀβ)² + C(β̃_0, β̃(m))²,

where z_t = β̃_0 + x_tᵀβ̃(m) + (y_t − p̃_t)/(p̃_t(1 − p̃_t)) is the working response, w_t = p̃_t(1 − p̃_t) is the weight, p̃_t is the current estimate of P_t = P(Y_t = 1 | X_t = x_t) evaluated at (β̃_0, β̃(m)), and C(β̃_0, β̃(m))² is a constant. We then replace the negative log-likelihood l(β_0, β) in (5) by the weighted least-squares function l_Q(β_0, β) and run the coordinate descent algorithm to obtain the parameter estimators, where the intercept term β_0 is not penalized. For more details on the coordinate descent algorithm for penalized logistic regressions, refer to Breheny and Huang (2011). Table 3 lists three specific parameter estimators. For j in {1, 2, ..., p}, the coordinate descent algorithm partially optimizes the target function Q(β; λ, γ) with respect to a single parameter β_j with the remaining parameters β_l, l ≠ j, fixed at their most recently updated values, iteratively cycling through all the parameters until convergence or until a maximum number of iterations M is reached; this process is repeated over a grid of values for λ to produce a solution path. Usually we are interested in obtaining the estimator β̂^{λ,γ} not just for a single value of λ ∈ [λ_min, λ_max], but for a range of values extending from a maximum value λ_max, at which all penalized coefficients are 0, down to λ = 0 or to a minimum value λ_min at which the model becomes excessively large or ceases to be identifiable. Thus, by starting at λ_max with β̂(0) = 0 and proceeding toward λ_min, we ensure that the initial values are never far from the solution. For γ, the value γ = 3.7 is commonly taken. Here we tried different values of γ and found that γ = 5 for MCP and γ = 10 for SCAD work better.
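The weights w_t and working responses z_t of the quadratic approximation above can be computed in a few lines; the following Python fragment is an illustrative sketch under our own naming, with a small clipping guard (our addition) against weights collapsing to zero.

```python
import numpy as np

def irls_working_quantities(beta0, beta, X, y, eps=1e-5):
    """Weights w_t and working responses z_t of the quadratic
    approximation l_Q at the current estimates (beta0, beta)."""
    eta = beta0 + X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))      # current probability estimates
    p = np.clip(p, eps, 1 - eps)        # guard against w_t -> 0
    w = p * (1 - p)                     # IRLS weights
    z = eta + (y - p) / w               # working (pseudo) responses
    return w, z
```

At the zero initial values (β̃_0 = 0, β̃(m) = 0) every p̃_t is 1/2, so w_t = 1/4 and z_t = 4y_t − 2.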
Algorithm 1 provides specific pseudocode for applying the coordinate descent algorithm to calculate the parameter estimators of the MCP logistic regression. The coordinate descent algorithms for the parameter estimators of the other four penalized logistic regressions are similar to Algorithm 1; we do not list them here for lack of space.
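As a rough Python counterpart to Algorithm 1, the inner coordinate descent loop for the weighted least-squares surrogate with the MCP penalty can be sketched using the firm-thresholding operator of Breheny and Huang (2011). This sketch assumes the columns of X are standardized so that (1/n) Σ_t w_t x_{t,j}² = 1; all names are ours, and it is not the authors' code.

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def firm(z, lam, gamma):
    """MCP (firm) thresholding operator for a standardized coordinate."""
    if abs(z) <= gamma * lam:
        return soft(z, lam) / (1.0 - 1.0 / gamma)
    return z

def cd_mcp_wls(X, w, z, lam, gamma=3.0, max_iter=500, tol=1e-8):
    """Coordinate descent for the weighted least-squares surrogate
    with an MCP penalty; the intercept beta0 is not penalized."""
    n, p = X.shape
    beta0 = np.sum(w * z) / np.sum(w)   # weighted mean as starting intercept
    beta = np.zeros(p)
    r = z - beta0                       # residuals at beta = 0
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # univariate solution with the other coordinates held fixed
            zj = np.sum(w * X[:, j] * r) / n + beta[j]
            new = firm(zj, lam, gamma)
            if new != beta[j]:
                r -= X[:, j] * (new - beta[j])
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        d0 = np.sum(w * r) / np.sum(w)  # refit the unpenalized intercept
        beta0 += d0
        r -= d0
        if max_change < tol and abs(d0) < tol:
            break
    return beta0, beta
```

Because firm thresholding leaves coordinates with |z_j| > γλ untouched, large coefficients are estimated essentially without bias, which is the practical appeal of MCP over LASSO.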
In this paper we apply the coordinate descent algorithm to the five penalized logistic regressions to obtain the final parameter estimators β̂_0^{λ,γ} and β̂^{λ,γ}, and then compute the probability estimators

P̂_t = exp(β̂_0^{λ,γ} + x_tᵀβ̂^{λ,γ}) / (1 + exp(β̂_0^{λ,γ} + x_tᵀβ̂^{λ,γ})).

(Table 1 lists the 19 technical indicators with their descriptions and formulae; Table 2 lists the five penalty functions, where ENet denotes the elastic net.)

Remark. Compared with the local linear/quadratic approximation algorithms, the coordinate descent algorithm has the following advantages: (1) the optimization over each single parameter has a closed-form solution; (2) each update can be computed very rapidly; (3) the initial values are never far from the solutions, so only a few iterations are required.
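Given fitted coefficients, the probability and class estimators are straightforward to compute; the following is an illustrative Python sketch with our own function names, not the paper's R code.

```python
import numpy as np

def predict_prob(beta0_hat, beta_hat, X):
    """Estimated P(Y_t = 1 | X_t) under the fitted logistic model."""
    eta = beta0_hat + X @ beta_hat
    return 1.0 / (1.0 + np.exp(-eta))

def predict_class(beta0_hat, beta_hat, X, threshold=0.5):
    """Predicted up (1) / down (0) trends at a chosen threshold."""
    return (predict_prob(beta0_hat, beta_hat, X) >= threshold).astype(int)
```

The threshold need not be 0.5; Sect. 4 chooses it from the ROC analysis.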

Two-class prediction performance
A two-class confusion matrix is a contingency table of the true class and the predicted class that describes two-class classification results; see Table 4.
From Table 4 we compute the accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN),

which is the simplest index for evaluating prediction performance. However, it cannot reflect the losses from the two types of errors. Therefore, a ROC curve is introduced to evaluate the prediction performance. Suppose that TPR(c) = P(X_1 < c) is the true positive rate at the threshold c, and FPR(c) = P(X_2 < c) is the false positive rate at the threshold c. By varying the threshold c, we calculate the pairs {(FPR(c), TPR(c))}, i.e., (1 − Specificity, Sensitivity), to draw a ROC curve, where

Sensitivity (true positive rate, TPR) = TP / (TP + FN),
Specificity (1 − false positive rate, 1 − FPR) = TN / (TN + FP).

In Sect. 5 we adopt the R package pROC to draw ROC curves and compute the AUC (the area under the ROC curve, a summary indicator of classification performance). For more details on ROC analysis, refer to Chapter 7 of Hu and Liu (2020).
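The confusion-matrix indices and the AUC can be computed directly; below is a minimal Python sketch with our own function names. The AUC uses the rank (Mann-Whitney) formulation, which agrees with the trapezoidal area under the ROC curve when there are no tied scores.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from predicted classes."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)        # 1 - false positive rate
    return accuracy, sensitivity, specificity

def auc_score(y_true, scores):
    """AUC via the rank formulation (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = np.sum(y_true == 1)
    n0 = len(y_true) - n1
    return (np.sum(ranks[y_true == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of up and down trends.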

Technical indicators and variance inflation factors
The stock market fluctuated greatly from December 2019 onward because of the novel coronavirus pandemic. Therefore we select Google's stock prices from January 2010 to November 2019 as the observation data, with sample size n + N = 2450; we take 80% of the observations as the training set with sample size n = 1960 to learn up and down trends for stock prices, and the remaining 20% as the testing set with sample size N = 490 to predict up and down trends. We apply the R function getSymbols to download the opening price (O_t), highest price (H_t), lowest price (L_t), closing price (C_t), volume (V_t) and adjusted price (A_t) for the Google corporation from Yahoo Finance, and then adopt the R package TTR to calculate the 19 technical indicators: WMA, DEMA, ADX, MACD, CCI, Mo, RSI, ATR, CLV, CMF, CMO, EMV, MFI, ROC, VHF, SAR, TRIX, WPR, SNR. We take Y_t as the response. Among the indicators, TRIX_t, WPR_t and SNR_t have smaller ranges, means and standard deviations. The mean value of the momentum line MO_t, 1.9375, reflects the overall upward trend of Google's stock price. The mean value of RSI_t is 54.1628; its maximum value of 98.7890 is greater than 80 and corresponds to a selling period, whereas its minimum value of 5.5085 is less than 10 and corresponds to a buying period. From the medians and means of the 19 indicators, we find that they are evenly distributed. However, the indicators have different degrees of variation, and the values of some indicators differ greatly. Therefore, to eliminate the influence of scale variations, we standardize the data before modeling. To check whether collinearity exists among the 19 indicators, we introduce the variance inflation factor (VIF). From Table 5 we observe that the VIFs of WMA_t, DEMA_t and SAR_t are far greater than 10, and the VIFs of MO_t, RSI_t, CMO_t, ROC_t and WPR_t are also greater than 10. This indicates that collinearity exists among the 19 indicators.
Thus, it is statistically meaningful to introduce the penalty functions for logistic regression to reduce collinearity and avoid overfitting.
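The VIF diagnostic used in Table 5 can be reproduced with ordinary least squares: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing indicator j on the remaining indicators. The following Python sketch is our own helper, not the paper's code.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # regress column j on the other columns, with an intercept
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out
```

A common rule of thumb, used in this paper, is that VIF > 10 signals problematic collinearity.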

Tuning parameter selection
For the ridge, LASSO or elastic net penalty, variable selection is determined by the tuning parameter λ. To select an appropriate λ, we apply tenfold cross-validation to calculate the full solution path of the model parameters, select a specific solution path from the full solution path, and take the binomial deviance as the risk measure. We then obtain the mean cross-validation error curve and the one-standard-deviation band; see Fig. 1. The parameter estimators for the MCP and SCAD penalized logistic regressions depend on both the tuning parameter λ and the regularization parameter γ.
In this section we combine the binomial deviance with tenfold cross-validation to choose an appropriate tuning parameter λ. Figure 1a, b and c show the binomial deviance curves for ridge, LASSO and elastic net, drawn with the R function cv.glmnet, whereas Fig. 1d and e show the cross-validation error curves for SCAD and MCP, drawn with the R function plot.cv.ncvreg. In Fig. 1, the numbers above each graph indicate the number of selected variables. The left vertical line corresponds to log(λ) at the minimum mean cross-validated error, the right vertical line corresponds to log(λ) at one standard error above the minimum, and values of log(λ) between the two lines give errors within one standard error of the minimum (the "one-standard-error" rule). We often use this rule to select a relatively optimal model. From Fig. 1 we observe that the one-standard-error range for ridge, LASSO and elastic net is 0.0173 to 0.0401, 0.0020 to 0.0154 and 0.0033 to 0.0213, respectively. For MCP and SCAD, however, there is only one vertical line, corresponding to log(λ) at the minimum average error; see Fig. 1d, e. We evaluate the prediction performance at each (λ, γ) value, select the relatively optimal model corresponding to λ = 0.0121 and γ = 5 for MCP and λ = 0.0035 and γ = 10 for SCAD, and obtain the final five penalized regressions. Comparing the five penalized regressions with logistic regression, we find that ridge logistic regression retains all 19 variables without removing any, similarly to logistic regression, whereas the other four penalized logistic regressions select different variables; for details see Table 6.
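The tuning-parameter selection above can be summarized in a short sketch: k-fold cross-validation with the binomial deviance as risk, the error-minimizing λ, and the one-standard-error choice. This is a generic Python illustration with user-supplied fit/predict callbacks, not the internals of cv.glmnet or cv.ncvreg; all names are ours.

```python
import numpy as np

def binomial_deviance(y, p, eps=1e-12):
    """Binomial deviance of probability predictions p."""
    p = np.clip(p, eps, 1 - eps)
    return -2.0 * np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cv_choose_lambda(X, y, lambdas, fit, predict, k=10, seed=0):
    """K-fold CV error curve, the minimizing lambda and the
    one-standard-error lambda. `fit(X, y, lam)` returns a model;
    `predict(model, X)` returns probabilities."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % k          # balanced fold labels
    errs = np.empty((len(lambdas), k))
    for i, lam in enumerate(lambdas):
        for f in range(k):
            tr, te = folds != f, folds == f
            model = fit(X[tr], y[tr], lam)
            errs[i, f] = binomial_deviance(y[te], predict(model, X[te]))
    mean = errs.mean(axis=1)
    se = errs.std(axis=1, ddof=1) / np.sqrt(k)
    i_min = int(np.argmin(mean))
    # largest lambda whose mean error is within one SE of the minimum
    within = np.where(mean <= mean[i_min] + se[i_min])[0]
    i_1se = within[np.argmax(np.asarray(lambdas)[within])]
    return lambdas[i_min], lambdas[i_1se], mean, se
```

The one-standard-error rule deliberately prefers the sparser (more heavily penalized) model among those whose cross-validated error is statistically indistinguishable from the best.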
For the five penalized logistic regressions, we calculate their VIF values; see Table 7. From Table 5, we find that the VIFs of WMA_t, DEMA_t and SAR_t are 58264.2178, 57089.3227 and 289.9360, respectively, whereas the VIFs of MO_t, RSI_t, CMO_t, ROC_t and WPR_t are greater than 10, which indicates strong multicollinearity among these indicators. From Table 7, we observe that the VIFs of the indicators remaining after the LASSO penalty are all less than 10, and similarly after the elastic net, MCP and SCAD penalties.

Fig. 1 The relationships between binomial deviance/cross-validation error and log(λ)

The prediction performance
We take advantage of the training set {x_t, y_t}_{t=1}^{1960} to learn up and down trends for Google's stock price, and apply the testing set {x_t, y_t}_{t=1961}^{2450} and the ROC curve to evaluate the prediction performance. According to the predicted classes and the actual classes on the testing set, we establish the two-class confusion matrix in Table 8, from which we calculate the accuracy, sensitivity and specificity for logistic regression. Similarly, we calculate the accuracy, sensitivity and specificity for the five penalized logistic regressions; their values are listed in Table 9. From Table 9 we observe the following facts: (1) for elastic net and LASSO, accuracy is higher than that of ridge but lower than that of logistic regression; (2) accuracy for MCP is higher than that of SCAD, whereas accuracy for SCAD is higher than that of elastic net and logistic regression. However, accuracy is the simplest index for evaluating prediction and cannot fully reflect the losses of the two kinds of errors. Therefore, in the following we first compute the sensitivity and specificity corresponding to different thresholds for the six methods and then apply them to draw the ROC curves; see Fig. 2. In Fig. 2, the AUCs for logistic regression, ridge, LASSO, elastic net, MCP and SCAD are 0.776, 0.752, 0.757, 0.760, 0.778 and 0.777, respectively. Combined with the accuracies listed in Table 9, we conclude that among the six methods, the MCP logistic regression with technical indicators performs best in terms of accuracy. To further demonstrate the superiority of the MCP logistic regression in predicting stock price trend movements, we compare its prediction performance with those of SVM and ANN; see Table 10.
From Table 10, we observe that among the three methods, MCP performs best in terms of sensitivity, accuracy and AUC. The reason SVM performs worst may be that the Gaussian kernel is a typical local kernel function: it only affects the data points in a small area near the test point, and it has strong learning ability but weak generalization performance. In addition, ANN is unstable, so we take the average of 10 predicted results as the final values, and these are still worse than those of MCP. Obviously, the MCP logistic regression performs best in predicting the up and down trends of stock prices. Therefore, we recommend the MCP logistic regression for predicting stock price trend movements.

Discussion
Methodologically, we introduce five penalty functions into logistic regression with 19 technical indicators and propose five penalized logistic regressions to predict up and down trends for Google's stock prices. These prediction methods not only provide classification probability estimates and class information, but also improve prediction accuracy by shrinking regression coefficients and avoiding multicollinearity and overfitting. Computationally, we combine iteratively reweighted least squares, the coordinate descent algorithm and tenfold cross-validation to obtain parameter and probability estimators for the five penalized logistic regressions. According to the VIF analysis in Table 5, collinearity exists among the different technical indicators; thus it is statistically meaningful to introduce the different penalty functions to reduce collinearity in logistic regression with 19 technical indicators. Therefore, we propose five efficient penalized logistic regressions to predict stock price trend movements. Wen et al. (2019) and Khan et al. (2020) predicted Google stock trend movements with accuracies of 0.636 and 0.641, respectively. From Table 9 we observe that the prediction accuracies of the five penalized logistic regressions are all higher than 0.693. In particular, the prediction accuracies of MCP and SCAD are 0.732 and 0.731, and their AUCs are 0.778 and 0.777, respectively. Obviously, the MCP and SCAD penalized logistic regressions outperform logistic regression in terms of prediction performance. Furthermore, comparing MCP and SCAD with SVM and ANN, we find that the proposed MCP and SCAD penalized logistic regressions perform better than SVM and ANN. Therefore, in this paper we provide new methods to predict stock market trend movements. Moreover, the proposed methods help investors better understand the internal mechanism of stock market trend movements.

Conclusion
Based on Murphy's technical analysis method, we combine technical indicators with five penalties and propose five penalized logistic regressions to predict the up and down trends of Google's stock price. The prediction results show that the MCP logistic regression with technical indicators is superior to logistic regression, the other four penalized logistic regressions, SVM and ANN. Therefore, in this paper we combine technical indicators with the MCP logistic regression and provide a new, effective prediction method to further improve the prediction performance for stock returns. For other stock price trend prediction problems, we can likewise apply statistical charts, data analysis, empirical knowledge and the penalized method to extract important technical indicators that may affect stock price trend movements, establish penalized logistic regressions with different technical indicators to predict up and down trends for stock prices, and apply two-class confusion matrices and ROC curves to assess their prediction performances.
Author Contributions XH provided the basic idea and improved the writing of the manuscript. HJ collected the data, provided the figures and tables, and finished the basic writing. HJ improved the program.

Data availability
The datasets analyzed during the current study are available from Yahoo Finance, uk.finance.yahoo.com.