Forecasting Pakistan's Leading Stock Exchange During COVID-19 Using Machine Learning (ML) Algorithms: Model Development and Validation

During COVID-19, stock markets showed sharp upward and downward fluctuations. Forecasting price action is one of the most challenging problems in this situation, and it is difficult to build an accurate model that integrates economic and COVID-19 variables as inputs for KSE index prediction. To tackle this problem, we apply machine learning (ML) techniques to predict the KSE during COVID-19. The principal aim of this study is to compare the accuracy of combined models with that of individual models in forecasting the Karachi Stock Exchange during COVID-19. The study analyzes the KSE indices from March 1st, 2020, to November 26th, 2021, and seeks the best-fitted model, that is, the one that forecasts most accurately during the pandemic. To select the most suitable machine learning technique, six models (Linear Regression (LR), Artificial Neural Network (ANN), Regression Tree (RT), Random Forests (RF), K-Nearest Neighbors (KNN), and Support Vector Regression (SVR)) are used to forecast the Karachi Stock Exchange during COVID-19. Performance metrics (MAE, MSE, MAPE, and R^2) are applied to measure and compare accuracy. The RF model provided the best performance, with a score of 0.98, versus the other models in predicting the KSE-100 index. Thus, the addition of ML methods improves the exchange indications and the competitiveness of future trading guidelines. These projections can help the government to devise strategies for the KSE-100 stock exchange and to fight the pandemic. The results suggest that the performance of the KSE-100 index can be predicted with machine learning techniques.


Introduction
Today, stock markets play a significant role in determining the economy of a country. The Karachi Stock Exchange 100 Index (KSE-100) is a major stock market index that tracks the performance of the largest companies by market capitalization from each sector of the Pakistani economy. It is designed to measure the performance of 100 companies with the highest market capitalization. The primary objective of the KSE-100 index is to provide a benchmark against which stock price performance can be compared over a period. In particular, the KSE-100 is designed to give investors a sense of how the Pakistani equity market is performing.
Thus, the KSE-100 is similar to other indicators that track various sectors of the country's economic activity, such as the gross national product and the consumer price index. The market value of firms and stock prices may be significantly affected by numerous factors; exchange rate fluctuations are one such factor. The infectious disease caused by the novel coronavirus (2019-nCoV), COVID-19, first emerged in December 2019 in Wuhan City, Hubei Province, China (WHO, 2020). It quickly attracted the world's attention in January 2020 because it spread faster than other kinds of viruses. The economic losses due to COVID-19 have also affected global stock markets, and the contagion effect of the pandemic on stock markets has been observed on almost every continent. Pakistan's stock market is one of the exchanges that have been influenced by COVID-19.
Studies have found that, even in its early stages, COVID-19 had already affected the real economy [1]; the pandemic negatively affected trade, tourism, and transportation and increased the unemployment rate [2], and some studies showed that the global spread of COVID-19 had an impact similar to that of an economic crisis [3]. COVID-19 not only has a great impact on economic and social development but also influences the operation of financial markets. Some studies showed that the impact of COVID-19 was reflected in the returns of financial markets [4][5][6][7][8], the volatility of financial markets [9][10][11][12], and the risk contagion among financial markets [13]. Baig et al. investigated the impact of COVID-19 on the liquidity and volatility of the stock market and found that the increase in confirmed cases and deaths due to COVID-19 significantly aggravated illiquidity and volatility in the market, and that strict closure measures also worsened the liquidity and stability of the market [14].
Rizwan et al. studied banking systemic risk in eight major countries that were seriously affected by COVID-19 and found that the financial systemic risk of each country increased significantly during the pandemic period [15].
It is challenging to develop a model for Karachi Stock Exchange prediction during COVID-19 because of the complexity of, and the interrelationships among, the attributes affecting the outcome. Thus, to ensure modeling accuracy, it is vital to implement a machine learning (ML) model that provides reliable, accurate, functional, and robust modeling. The current research aims to investigate the impact of COVID-19 on the KSE-100 Index in Pakistan by considering the other major factors that influence the performance of stock markets, such as the interest rate and the exchange rate. Most importantly, we consider the measures taken by the government in its early response to the pandemic, such as the closure of business activities and the later smart lockdown that relaxed the restrictions. Several economic measures, such as fiscal stimulus and the easing of monetary policy to mitigate the slowdown in economic activity, are also considered. The Pakistani stock market posted record losses due to shocks to investor sentiment, reaching its lowest intra-day value in six years, a reduction of 28 percent this year (Pakistan Institute of Development Economics, 2020).

Literature Review
Several researchers have identified predictors that are useful for predicting future stock returns. These include (but are not limited to) the dividend yield and dividend-price ratio (Fama and French 1988; Campbell and Shiller 1988), the price-earnings ratio (Campbell and Shiller 1988; Welch and Goyal 2008), the short interest rate (Campbell 1987; Ang and Bekaert 2007), term and default spreads (Campbell 1987; Fama and French 1989), and the consumption-wealth ratio (Lettau and Ludvigson 2001). Besides predictors, forecasting techniques also play an important role in determining forecast accuracy. According to Mallikarjuna and Rao (2019), traditional regression techniques generally outperform others, including artificial intelligence and frequency-domain models, in providing accurate forecasts. In terms of stock volatility, academic researchers have made forecasts with traditional GARCH models using indicators based on the past behavior of stock prices and volatility (Gokcan 2000; Emenogu et al. 2020).
More recent studies have become aware of issues such as parametric assumptions, leverage, asymmetric effects, power transformations, and long memory (e.g., Brooks 2007; Bandi and Reno 2012; Hou 2013). In this paper, we introduce GARCH models for volatility forecasting because we aim to test for the instability of the volatility process, which is primarily built upon those modeling techniques.
In addition, a growing number of papers linking COVID-19 and finance relate to stock markets. The literature in this scope contains works published before the outbreak of the pandemic but suitable for explaining investor behavior during the COVID period, as well as works completed during the pandemic. In the first group, one may find papers addressing contagion [33], spillovers between markets during shocks [34], and the impact of bad news on time-varying betas [35]. In the second group, there are papers on dependencies between global factors and markets [36] and on links between individual stock market reactions and the severity of the outbreak in various countries [37]. Other works relate, for example, to the pricing of stocks during the pandemic. Singh [38] found that, during turbulence, investors become more attentive to corporate fundamentals and ESG factors that support the long-run sustainability of firms. Fundamental aspects of investments were also pointed out by Mirza et al. [39], who found that social entrepreneurship investment funds outperformed their counterparts during the outbreak of the pandemic. In the field of stock pricing and price tendencies, Shehzad et al. [40] found that the pandemic influenced the variance of the US, German, and Italian stock markets more strongly than the global financial crisis did. Against this background, Narayan et al. [41] and Phan & Narayan [42] found positive effects of lockdowns, travel bans, and economic stimulus packages on stock markets, and Sharif et al. [43] found that in the US the pandemic outbreak had a greater effect on geopolitical risk and economic uncertainty than on the stock market itself.
Moreover, researchers have argued that stock markets are always affected by major events (Haque & Sarwar, 2013; Waheed, Wei, Sarwar, & Lv, 2018). As the virus became a global pandemic, it started affecting businesses, which is reflected in world stock markets. Some studies have examined the impact of COVID-19 on stock returns in developed markets (Al-Awadhi, Al-Saifi, Al-Awadhi, & Alhamadi, 2020; Kowalewski & Śpiewanowski, 2020) and reported that the Hang Seng index, the Shanghai stock exchange, and the United States and European stock markets showed negative returns. In March, the United States market was hit by the circuit-breaker mechanism four times in 10 days. Similarly, the United Kingdom stock market index, the FTSE, declined by more than 12%, its worst fall since 1987 (Al-Awadhi et al., 2020). In Pakistan, the first case of COVID-19 was reported on February 26, 2020, and the number of cases had crossed 13,000 by the time this study was conducted. However, the recovery rate is better than in developed countries such as Italy, France, and the United States. The impact of the pandemic on Pakistan's economy depends on the time taken to implement preventive measures and on the intensity of the spread of the disease. According to the Asian Development Bank (ADB), the pandemic could cost the Pakistani economy approximately $16.38 million to $4.95 billion, nearly 1.57% of overall GDP. The report also mentioned that the pandemic could cost more than 946,000 jobs. In this way, a country that had been at the recovery stage for the last 2 years has been badly affected.
This research developed different models with several advantages over conventional statistical analysis approaches. Compared with existing traditional models, predictive ML is more useful for inferring patterns or rules from training data. A statistical regression model relies on statistical distribution assumptions or a postulated functional form, which are chosen arbitrarily and differ from model to model. In contrast, an ML model is extracted by an algorithm from the available data and requires limited user intervention in model development. Moreover, the adopted modeling techniques can be expressed as a computational model that captures complicated relationships and configurations via indexes, algorithms, and data structures. Another advantage is that the models can be visualized in a manner that suits human understanding. Furthermore, the data-mining-based models are flexible in adapting to required changes, as their algorithms are implemented as an automatic method in a computer framework that can be modified in real time after the data sources are updated.

Materials And Methods
The main objective of this study is to develop a predictive model with better forecast accuracy for the KSE during COVID-19. The suggested methodology has four primary aspects: data collection, data preprocessing, ML model comparison, and model performance evaluation and validation. The framework of the machine learning procedure is shown in Fig. 1.

Data Collection and Feature Definition
For accurate KSE-100 index predictions, the relevant variables need to be recorded and considered. The first step in developing a machine learning model is collecting data. In this study, the relationship between COVID-19 and the KSE was studied.

Machine Learning Models Implementation for KSE Prediction
This section briefly describes the ML methods used to compare the accuracy of combined models with that of individual models in forecasting the Karachi Stock Exchange during COVID-19.
Training and testing were also conducted to check the efficiency of the suggested ML algorithms.
Training used 90% of the dataset to fit the proposed model, and testing used the remaining 10%. Five-fold cross-validation was used to ensure the robustness and effectiveness of the suggested prediction models. Six prediction algorithms (Linear Regression (LR), Artificial Neural Network (ANN), Regression Tree (RT), Random Forests (RF), K-Nearest Neighbors (KNN), and Support Vector Regression (SVR)) were used to forecast the KSE-100; they are considered among the most recent and efficient machine learning-based prediction algorithms. These algorithms were selected mainly because they are scalable, accurate, relatively fast, and flexible, and because they provide regularized model formulations that control overfitting. The model features and implementation procedure are presented in the next section.

Quantile Regression (QR)
QR is an extension of linear regression used when the conditions of linear regression are not met. It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used. A linear form for general quantile regression is described by Buchinsky (1998):
$$y_i = x_i'\beta_\theta + u_{\theta i}, \qquad \theta \in (0, 1), \tag{1}$$
where $\beta_\theta$ is a $(k \times 1)$ vector of coefficients, $x_i$ is the column vector corresponding to the transposition of the $i$th row of the matrix of explanatory variables, $y_i$ is the dependent variable observation, and $u_{\theta i}$ is the unknown error term. In the presence of $x_i$, the $\theta$th conditional quantile of $y_i$ can be written as
$$\mathrm{Quant}_\theta(y_i \mid x_i) = x_i'\beta_\theta. \tag{2}$$
As $\theta$ increases continuously, the conditional distribution of $y_i$ given $x_i$ is traced out. The $\theta$th conditional quantile of $u_{\theta i}$, conditional on $x_i$, is assumed to satisfy $\mathrm{Quant}_\theta(u_{\theta i} \mid x_i) = 0$ for several different values of $\theta$, $\theta \in (0, 1)$, resulting in Eq. (2). This allows for parameter heterogeneity across different types of regressors. As a result, the quantile regression estimator solves the minimization problem
$$\min_{\beta} \frac{1}{n} \left\{ \sum_{i:\, y_i \ge x_i'\beta} \theta \, |y_i - x_i'\beta| + \sum_{i:\, y_i < x_i'\beta} (1 - \theta) \, |y_i - x_i'\beta| \right\}. \tag{3}$$
The quantile objective is a weighted sum of the absolute values of the residuals. The weights are symmetric in the median regression case, $\theta = 1/2$, where the minimization problem above reduces to $\min_{\beta} \sum_{i=1}^{n} |y_i - x_i'\beta|$; otherwise, they are asymmetric.
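The asymmetric weighting of residuals described above can be illustrated with a minimal sketch. It assumes an intercept-only model (a single constant is fitted instead of a full coefficient vector), and the function names and the toy data are hypothetical, not taken from the study.

```python
def pinball_loss(y_true, y_pred, theta):
    """Average quantile (check) loss for a quantile level theta in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        residual = y - q
        # Under-prediction is weighted by theta, over-prediction by (1 - theta).
        total += theta * residual if residual >= 0 else (theta - 1.0) * residual
    return total / len(y_true)


def best_constant_quantile(y, theta):
    """Intercept-only fit: pick the sample value that minimizes the loss."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), theta))


sample = [1.0, 2.0, 3.0, 4.0, 100.0]  # toy data with one large outlier
median_fit = best_constant_quantile(sample, 0.5)  # symmetric weights
upper_fit = best_constant_quantile(sample, 0.9)   # asymmetric weights
print(median_fit, upper_fit)
```

With symmetric weights the fitted constant is the sample median, which the outlier does not move; raising the quantile level pulls the fit toward the upper tail.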

Support Vector Machine (SVM)
SVM is a flexible supervised machine learning method aimed at classification, regression, and other purposes such as outlier detection. To achieve these objectives, it constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space. When used for regression analysis, it is termed support vector regression (SVR). SVR relies on four key elements: 1) A kernel function maps lower-dimensional data into a higher-dimensional space. 2) The hyperplane is the line that separates two classes in a general SVM; in SVR, it helps to predict the continuous variable and covers most of the data points. 3) Boundary lines are the two lines on either side of the hyperplane, which create a margin for the data points. 4) Support vectors are the data points nearest to the hyperplane or to the opposite class. Intuitively, the hyperplane with the largest distance to the nearest training data point of any class gives the best separation and is used to minimize the error. In general, SVR seeks to keep as many data points as possible within the boundary lines, and the hyperplane (best-fit line) must contain the maximum number of data points.
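The role of the boundary lines can be sketched with the epsilon-insensitive loss commonly used in SVR. This is an illustration under the assumption that the boundary lines correspond to a tube of half-width `epsilon` around the fitted line; the function names and numbers are hypothetical.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Mean loss that ignores errors falling inside the +/- epsilon tube."""
    penalties = [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]
    return sum(penalties) / len(penalties)


def points_inside_tube(actual, predicted, epsilon):
    """Count the points that lie on or between the two boundary lines."""
    return sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= epsilon)


actual = [10.0, 12.0, 15.0, 20.0]      # toy target values
predicted = [11.0, 12.0, 18.0, 19.0]   # toy predictions from some fitted line
loss = epsilon_insensitive_loss(actual, predicted, epsilon=1.0)
inside = points_inside_tube(actual, predicted, epsilon=1.0)
print(loss, inside)
```

Only the point whose error exceeds the tube width contributes to the loss, which is why a wider tube tolerates more noise at the cost of a coarser fit.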

K-Nearest Neighbors (KNN)
KNN is one of the simplest and easiest-to-implement supervised machine learning methods, and it can be used to solve both classification and regression problems. KNN is robust to noisy data and is more effective when the training data is large. It assumes similarity between the new case and the available cases and assigns the new case to the category it most resembles. The technique stores all the available data and classifies a new data point based on similarity, so when new data appears it can easily be assigned to a well-suited category. In the present study, the Manhattan distance was enhanced using weighting. The weighted Manhattan distance is determined by the following:
$$d(x, y) = \sum_{j=1}^{p} w_j \, |x_j - y_j|,$$
where $x$ and $y$ are two observations with $p$ features and $w_j$ is the weight of feature $j$.
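A minimal sketch of KNN regression with a weighted Manhattan distance follows. The per-feature weights, the choice of averaging the k nearest targets, and all data values are assumptions made for illustration; the weighting scheme used in the study is not specified in the text.

```python
def weighted_manhattan(a, b, weights):
    """Sum of per-feature absolute differences, each scaled by its weight."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))


def knn_predict(train_x, train_y, query, k, weights):
    """Predict the mean target of the k training points closest to the query."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: weighted_manhattan(pair[0], query, weights),
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy data: two features per observation, one numeric target each.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 10.0), (5.0, 5.0)]
train_y = [10.0, 20.0, 30.0, 40.0]
query = (0.0, 1.0)

equal_weights = knn_predict(train_x, train_y, query, k=2, weights=(1.0, 1.0))
down_weighted = knn_predict(train_x, train_y, query, k=2, weights=(1.0, 0.1))
print(equal_weights, down_weighted)
```

Down-weighting the second feature changes which neighbors count as nearest, so the two calls return different predictions from the same data.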

Regression Decision Tree (RDT)
RDT is a predictive modeling technique in machine learning. It uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Decision trees in which the target variable can take continuous values are called regression trees (RT). Decision trees are among the most popular machine learning algorithms because of their intelligibility and simplicity.

Artificial Neural Network (ANN)
ANN is a commonly used ML model for regression. The ANNs considered in this study consist of three node layers: input, hidden, and output. Each node applies a transfer function to the weighted sum of the previous layer's nodes plus a bias. A linear activation function was evaluated and compared with others, such as the ReLU and SoftMax activation functions. Since our study was linear, SoftMax was the best activation function.
An example of an ANN can be seen in Fig. 3, which displays the input (attributes), hidden, and output layers. The output of node $j$ is
$$y_j = f\left(\sum_{i} w_{ij} x_i + b_j\right), \tag{4}$$
where $x_i$ are the outputs of the previous layer's nodes, $w_{ij}$ are the weights, $b_j$ is the bias, and the activation function $f$ was defined as in Eq. (5).

Model Building and Validation (K-fold Cross-Validation)
Datasets were divided into two portions (training and testing sets). This phase is critical for assessing the efficiency of the machine learning procedures used. The adopted algorithms are trained on the training portion of the datasets, and the remaining portion is then used for testing, which is vital to demonstrate the developed model's response to new data being processed for the first time. In the current research, the suggested prediction models were tested for robustness and effectiveness using K-fold cross-validation. In Fig. 7, the training data is divided into five equal subsets, of which one subset represents the validation set and the other four subsets represent the training set. Five-fold cross-validation repeats the process five times, each time using a different subset as the validation set and the rest as training data.
After the iterations are completed, the hit rates of all five iterations are collected and the average of the results is taken. In this study, the results are the hit rate and the trading strategy outcome. If the maximum average result is achieved in a particular test, the input parameters of the machine learning technique are captured and stored as the best-performing parameter set. Various cross-validation tests are performed with different variables to find the best-performing parameters. The best-performing parameters of each machine learning technique, for the average hit rate and profit, are then used once on the testing data.
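The fold rotation and score averaging described above can be sketched as follows. This is a minimal illustration that assumes contiguous, unshuffled folds and uses a hypothetical mean-only predictor scored by mean absolute error; neither choice is stated in the source.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def average_cv_score(xs, ys, fit, score, k=5):
    """Fit on k-1 folds, score on the held-out fold, and average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    """Hypothetical baseline model: always predict the training mean."""
    return sum(ys) / len(ys)


def mae_score(model, xs, ys):
    """Mean absolute error of the constant prediction on the validation fold."""
    return sum(abs(y - model) for y in ys) / len(ys)


xs = list(range(10))
ys = [float(x) for x in xs]  # toy target, not study data
folds = list(k_fold_indices(len(xs), 5))
cv_mae = average_cv_score(xs, ys, fit_mean, mae_score, k=5)
print(len(folds), cv_mae)
```

Repeating this loop for each candidate parameter set and keeping the set with the best average score corresponds to the parameter selection step in the text; the held-out test data is touched only once, afterwards.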

Performance Measurements
In this section, the prediction performance of the studied models is measured through multiple metrics to assess the quality of the learning methods. Four common accuracy measures were used to evaluate the models' performance: the Mean Absolute Error (MAE), the Mean Squared Error (MSE), the Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R^2). The mathematical representation of the four measures is given below:

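$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $n$ is the sample size, $y_i$ is the actual value of the $i$th case, $\hat{y}_i$ is the predicted value of the $i$th case, and $\bar{y}$ is the mean of the actual values.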
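The four measures named in this section can be computed with a few lines of code. The sketch below uses their standard definitions and assumes that no actual value is zero (MAPE is undefined otherwise) and that the actual values are not all equal (R^2 is undefined otherwise); the sample numbers are illustrative only.

```python
def regression_metrics(actual, predicted):
    """Return MAE, MSE, MAPE (in percent), and R^2 for paired values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sum(e * e for e in errors) / ss_total
    return {"MAE": mae, "MSE": mse, "MAPE": mape, "R2": r2}


actual = [100.0, 200.0, 300.0, 400.0]     # toy index levels
predicted = [110.0, 190.0, 310.0, 390.0]  # toy forecasts
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and MSE are expressed in the units of the index (and its square), whereas MAPE and R^2 are scale-free, which makes the latter two easier to compare across series.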

Results and Discussion

Descriptive Statistics of the Dataset
The dataset contained five independent (predictor) variables and one dependent (response) variable. The independent variables were USD(PKR), Pak.10Y, Deaths, Cases, and Recovered, and the dependent variable was the KSE-100 index. The following tables illustrate the results based on the regression model. Descriptive statistics of the KSE-100-related features, such as the range, mean, median, and quantiles, were calculated and are presented in the tables. According to the mean, the average value of the KSE-100 index stayed at 37,008 points during the pandemic. On a daily basis, the average number of positive cases at the beginning of 2020 is 189,166, the average number of deaths is 15.00, and the average number of recoveries is 157,568.

Meeting Assumptions
A normality test is used to check whether the data under consideration is normally distributed. Such tests rely on two numerical measures of shape: the skewness and the excess kurtosis. The data is normally distributed if those measures are near zero. The Jarque-Bera test is likewise based on skewness and kurtosis; hence, the test of normality consists of checking the skewness and kurtosis on which the Jarque-Bera test relies. As shown in Figure 1, the distribution of the KSE-100 index is approximately normal.
A normal Q-Q plot is also used to check whether the data under consideration is normally distributed. The degree of normality can be visualized with a Q-Q plot, which plots the quantiles of a normal distribution against those of the ML model's residuals. If this gives a straight line, the residuals have a normal distribution. Given the somewhat skewed distribution of the dependent variable in Fig. 2, it was likely that the residuals of an ML model fitted to that data would not have a normal distribution. As expected, the Q-Q plot did not show that the residuals from the model were normally distributed. To obtain a normal distribution, the data would need to be transformed. Therefore, to determine whether our dataset is suitable for modeling with ML, we applied the following criteria to the data to be transformed.
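The skewness, excess kurtosis, and Jarque-Bera statistic mentioned above can be computed directly from the sample moments. The sketch below uses the standard formula JB = n/6 * (S^2 + K^2/4) and assumes non-constant input; the data values are illustrative only.

```python
def jarque_bera(values):
    """Return (skewness, excess kurtosis, Jarque-Bera statistic)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3, and 4 (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    statistic = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return skewness, excess_kurtosis, statistic


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # toy symmetric sample
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]   # toy sample with a long right tail
skew_sym, kurt_sym, jb_sym = jarque_bera(symmetric)
skew_right, _, jb_right = jarque_bera(right_skewed)
print(skew_sym, kurt_sym, jb_sym, skew_right, jb_right)
```

Under the null hypothesis of normality the statistic is asymptotically chi-square distributed with two degrees of freedom, so values above roughly 5.99 reject normality at the 5% level in large samples; for samples as small as the toy ones here the approximation is unreliable.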

Statistical Analysis
In this section, the results of the stock forecasting predictions are presented.

Correlation matrix analysis
A correlation matrix among the selected attributes and the KSE-100 was used to identify the pairs with high negative or positive correlation and to evaluate the impact of these features, as shown in Figure 3. The current study carried out the correlation analysis among the variables USD(PKR), KSE-100, Pak-10Y, Deaths, Cases, and Recovered to gauge the strength of their association. The correlation matrix indicates a minor positive relationship between KSE-100 and USD(PKR). There were three pairs of features with correlations above 0.5: Recovered and Cases (correlation coefficient = 0.972), KSE-100 and Cases (correlation coefficient = 0.848), and KSE-100 and Recovered (correlation coefficient = 0.892). Highly correlated features add redundant information and noise to the models. Furthermore, this study found that the predictive ability of the models decreased when the Recovered feature was included. Therefore, taking the feature importance ranking and model performance into consideration, five of eleven independent features (Covid Cases, Covid Recovered, Pakistan 10Y, Covid Deaths, and USD(PKR)) were used as input features in the RF model. The plots compare the predictive performance of the proposed models at K = 5 and K = 10. Overall, the RF model exhibited the best performance, with an R^2 of 0.98, and the KNN model achieved acceptable performance in predicting the KSE-100, with an R^2 of 0.979. Both achieve higher predictive accuracy than the other ML models. As a result, the RF model at 5-fold is the most efficient and proficient model in predicting the KSE-100. On the other hand, the results also indicate that had remaining prediction ability compared to .
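The screening of feature pairs by correlation can be sketched as follows. The code computes the Pearson coefficient and lists the pairs whose absolute correlation exceeds a threshold; the column names and values are toy data chosen for illustration and do not reproduce the coefficients reported above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def high_correlation_pairs(columns, threshold=0.5):
    """List (name_a, name_b, r) for pairs with |r| above the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical columns; real inputs would be the daily series used in the study.
columns = {
    "cases": [1.0, 2.0, 3.0, 4.0, 5.0],
    "recovered": [2.0, 4.0, 6.0, 8.0, 10.5],
    "usd_pkr": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = high_correlation_pairs(columns, threshold=0.5)
print(pairs)
```

A pair flagged this way carries largely overlapping information, which is the usual reason to test whether dropping one of the two features changes model performance.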

Features importance analysis
An improved understanding of the model's features helps investors evaluate trends effectively.
Therefore, a feature importance assessment was performed using the models to determine the importance score of each variable involved in predicting the KSE-100. In the present study, the RF model ranked Covid Cases, Covid Recovered, and Pakistan 10Y as the top features (Fig. 14).
Thus, a feature score plot was produced to provide a relative score for each variable. The rank of feature importance shows the contribution of each factor to the forecast, because a larger value suggests a more significant correlation between the feature and the outcome. Further studies would be helpful to improve the predictive accuracy by including more ML models, such as support vector machines (SVM) and deep neural networks (DNN).

Conclusion
Random Forest (RF) is a well-established supervised learning technique, based on the ensemble learning method, that is used for classification, regression, and other tasks. It works by building multiple decision trees at training time, which controls overfitting, and outputting the mode of the classes (classification) or the average prediction (regression) of the individual trees. To train the decision trees, the training data is split into n bags, each of which is used to train its respective tree. At the final stage, the prediction is obtained by averaging over all trees.
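The bagging-and-averaging idea can be sketched with a deliberately simplified ensemble. The code below bags one-split regression trees (stumps) on a single feature; a full random forest would grow deeper trees and also subsample features at each split, so this is an illustration of the principle rather than the model used in the study. All names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (bag) of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(fit_stump([xs[i] for i in bag], [ys[i] for i in bag]))
    return forest


def predict_forest(forest, x):
    """Regression output: the average of the individual tree predictions."""
    return sum(predict_stump(stump, x) for stump in forest) / len(forest)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 10, 10, 10, 50, 50, 50, 50]  # toy step-shaped target
forest = fit_bagged_stumps(xs, ys)
low, high = predict_forest(forest, 0), predict_forest(forest, 9)
print(low, high)
```

Because each tree sees a different bootstrap sample, individual split points vary, and averaging smooths out that variation.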

10-fold cross-validation. The training portion (80% of the dataset) is used to train the suggested model, and the testing portion (20% of the dataset) is used to conduct the tests. The performance of our proposed model was evaluated through K-fold cross-validation and hold-out validation with accuracy metrics. K-fold cross-validation is important for fine-tuning the parameters of a prediction model. In this study, the proposed system uses 5-fold cross-validation. Cross-validation makes it possible to find the best-performing parameter set for any machine learning technique without overfitting.
Likewise, the contour plot shows the relation between COVID recovered and Pakistan 10Y and KSE-100. COVID recovered increases as KSE-100 increases, as demonstrated in Figure 10. Darker regions indicate higher COVID recovered. These higher response values seem to form a ridge running from the upper middle to the lower part of the graph. Furthermore, Figure 10 shows that the maximum expected KSE-100 index occurs at COVID recovered above 300,000. The lowest values of COVID cases are in the lower left corner of the plot, which corresponds to low values of both Pakistan 10Y and KSE-100. This contour plot illustrates the relationship between Pakistan 10Y and KSE-100 by COVID deaths. Small darker regions indicate fewer deaths (values below zero). These higher response values seem to form a ridge running from the upper middle to the lower right of the graph. The valleys in the lower left of the graph represent values of Pakistan 10Y and KSE-100 that result in 100-150 COVID deaths. The contours are curved because the model contains quadratic terms that are statistically significant. The highest response values are in the upper right corner of the plot, which corresponds to high values of both Pakistan 10Y and KSE-100. The lowest values of COVID cases are in the lower corner of the plot, which corresponds to low values of both Pakistan 10Y and KSE-100. Furthermore, Figure 10 shows that the maximum expected KSE-100 index occurs at COVID cases above 300.

Features' significance was, in descending order, as follows: Covid Cases, Covid Recovered, Pakistan 10Y, Covid Deaths, and USD(PKR). Figure 14 presents the importance ranking of the selected features used in the RF predictive model. Covid Cases was the most important feature in predicting the KSE-100. In addition, Covid Recovered and Pakistan 10Y were the second and third most important features, respectively, while the other features were less important.

Figures

Figure 1. Framework of the machine learning model procedure.
Figure 3. Flowchart.

Table 1
Variable descriptions for the KSE-100 index model.

Data Pre-Processing
Data pre-processing is a vital stage in developing ML models. Data preparation was carried out before each model was developed in order to improve the model's predictions. The process comprises several phases, including noise and outlier cleaning, standardization, normalization, feature selection, and transformation. The data was checked to resolve inconsistent stock codes, and duplicate instances were removed from the dataset. Cleaning and noise removal procedures were also applied to the COVID-19 dataset. In addition, features were aggregated to make the dataset ready for analysis. The datasets use numerical values to represent the attributes. The numerical inputs are in very different ranges; therefore, it is essential to standardize the dataset to a common range to enable faster model training. A z-score was used to transform the data to zero mean and unit standard deviation.
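The z-score standardization can be sketched as below. The sketch assumes the population standard deviation (division by n) and estimates the mean and standard deviation on the training portion only, then reuses them for the test portion; both choices, the function names, and the numbers are assumptions for illustration.

```python
import math


def fit_scaler(train_values):
    """Estimate the mean and population standard deviation from training data."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std


def z_score(values, mean, std):
    """Rescale values to zero mean and unit standard deviation."""
    if std == 0:  # constant column: every value maps to zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy training column
test = [5.0, 11.0]                                # toy held-out values
mean, std = fit_scaler(train)
train_scaled = z_score(train, mean, std)
test_scaled = z_score(test, mean, std)
print(train_scaled, test_scaled)
```

Reusing the training statistics for the held-out data keeps information from the test portion out of the fitted model.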
The COVID-19 data was collected from the World Health Organization (WHO), and the KSE-100 index data was taken from the Karachi Stock Exchange (https://www.psx.com.pk/). The first patient with COVID-19 was reported on February 26, 2020; however, the COVID-19 and KSE data from March 1st, 2020, to November 26th, 2021, were used, omitting official holidays. Variable descriptions for the KSE-100 index model are given in Table 1. All the data was saved as numerical values representing the variables used in the linear regression model (Quantile Regression (QR)).

Table 2
Mathematical illustration for performance metrics

Table 4
... RF, KNN, and SVM) in the KSE-100 predictions. Overall, the results suggest that the RF model performed best on our data set in terms of RMSE, with values of 512.4574 and 720.9236, respectively. The mean accuracy of the RF model was 0.98, which was as good as that of the KNN model. As for the other ...
The proposed system compares six machine learning techniques (LR, ANN, DT, RF, KNN, and SVM). All six techniques are tested with the same data using different cross-validation-based models. Each ML procedure includes three tests with different historical stock prices as training data. Each test is trained with a K-fold cross-validation-based model. Different models are created by selecting different values of K (K = 5 and K = 10), and another model is also