A New Composite Lognormal-Pareto Type II Regression Model to Analyze Household Budget Data via Particle Swarm Optimization

When data exhibit heavy-tailed behavior, traditional regression approaches may be inadequate or inappropriate. In such analyses, composite models, which are built by piecing together two or more weighted distributions at specified threshold(s), are an alternative. When the data contain covariate information, composite regression models can be used. Little work has been done on this topic in the existing literature; the only study is the paper by Gan and Valdez (2018). In this study, a novel Lognormal-Pareto Type II composite regression model was proposed. Particle swarm optimization (PSO) was performed to obtain the parameters of the proposed model. The proposed model was applied to monthly consumption expenditure and its affecting factors using the National Household Budget Survey, conducted annually by the Turkish Statistical Institute. Since the sampling design of the Household Budget Survey is stratified two-stage cluster sampling, the parameters were estimated under weighted data by updating the proposed model and the estimation method, PSO. Additionally, the performance of the proposed regression model was compared with Lognormal and Lomax regression models. The results showed that the proposed model provided a better fit to the data.


Introduction
Researchers have long sought models that fit data well in situations where a single component distribution usually fails due to properties of the data such as heterogeneity, multimodality and heavy tails. Finite mixture models have been widely used when modeling data involving heterogeneous sub-populations with a single distribution is almost impossible. Whether a uni-modal, bi-modal or even multi-modal distribution is used depends on the degree of heterogeneity among the sub-populations [29]. Especially after the publication of the monograph by McLachlan & Basford (1988) [20], researchers working in various fields became interested in the potential usefulness of mixture models for inference and clustering [22]. Finite mixture distributions are obtained by taking a weighted average of a finite number of distributions of the same type with different parameter values, or of completely different distributions [9]. Numerous works addressing finite mixture models have been carried out by many authors. Some of them are: the mixture of two normal distributions by Cohen (1967) [6], the mixture of t-distributions in [25], the mixture of skew normal and skew t-distributions by Lee & McLachlan (2013) [19], and the mixture of generalized hyperbolic distributions by Browne & McNicholas (2015) [4]. For more detailed information, interested readers can refer to the books by, for example, Everitt and Hand (1981) [13], [21] and Frühwirth-Schnatter (2006) [14].
Finite mixture models have been extended by introducing relevant risk covariates, usually associated with the distribution parameters or even the mixing proportions, to allow for the modeling of heterogeneous regression relationships; the result is called a finite mixture regression model [8]. Some of the studies in which finite mixture regression models are used are: a two-component finite mixture Lognormal regression model by Tooze et al. (2002) [33], a two-component finite mixture normal regression model by Yau et al. (2003) [34], and a two-component finite mixture generalized linear mixed effect regression model by Hall and Wang (2005) [16].
Composite models, constructed by joining two (or more) weighted distributions at given threshold value(s), can be interpreted as finite mixture models with mixing weights c and (1 − c) [10]. These models are especially used to model data with fat-tailed behavior. Some of the studies in which composite models are used are: the composite Lognormal-Pareto model by Cooray and Ananda (2005) [7], composite Lognormal-Pareto models by Scollnik (2007) [28], composite exponential-Pareto models by Teodorescu and Vernic (2009) [32], composite Pareto models by Teodorescu and Vernic (2013) [31], and composite Stoppa models by Calderin-Ojeda and Kwok (2016) [5].
Composite regression models, formed by modeling the distribution parameters using regression as in finite mixture regression models, can be used to capture both fat-tailed behavior and heterogeneity in the data. Although there are many studies on composite models in the literature, there is only one study on composite regression models, by Gan and Valdez (2018). Gan and Valdez (2018) [15] proposed two composite regression models consisting of three components to model Singapore auto claims data, capturing both the fat-tailed behavior of the data and the policyholders' heterogeneity.
Composite regression models are a quite new topic, and many studies are needed in this area. In this study, a novel composite regression model was proposed, consisting of two components: the Lognormal and Pareto type II distributions. The proposed model was employed to analyze monthly consumption expenditure and its affecting factors using the National Household Budget Survey, conducted annually by the Turkish Statistical Institute (TurkStat). The monthly consumption expenditure data show heavy-tailed behavior, so a traditional regression approach is not suitable for these data. Particle swarm optimization was adapted to obtain the parameters of the proposed model.
The rest of the paper is organized as follows. Section 2 presents the proposed Lognormal-Pareto Type II regression model along with brief information about composite models. Particle swarm optimization (PSO), which is used to estimate the parameters of the proposed model, is introduced with its steps in Section 3. Section 4 presents the simulation study assessing the performance of PSO. The application of the proposed model to the household budget survey data is given in Section 5. The paper concludes with a brief discussion of the results in Section 6.

Composite Models
Composite models, consisting of two parts with a sub-threshold and an over-threshold distribution, can be more suitable for modeling data that have a skewed distribution with a heavier right tail than single models such as the Gamma, the Lognormal, the Weibull and the Pareto [3].
Assume that the data Y involve two sub-populations, with a light-tailed probability density function f_1(y) and a heavy-tailed probability density function f_2(y), respectively. Then the random variable Y has the following probability density function:

f(y) = c f_1(y),  0 < y ≤ θ,
f(y) = c f_2(y),  θ < y < ∞,        (1)

where θ is the threshold and c is the normalizing constant that makes f(y) integrate to one.
A smooth probability density function can be obtained by imposing the continuity condition f_1(θ−) = f_2(θ+) and the differentiability condition f′_1(θ−) = f′_2(θ+) [7]. Here, the mixing weight is fixed and known a priori. Scollnik (2007) [28] expressed the model given in Eq. 1 as a convex combination of two probability density functions:

f(y) = c f*_1(y),        0 < y ≤ θ,
f(y) = (1 − c) f*_2(y),  θ < y < ∞,        (2)

where 0 ≤ c ≤ 1 and f*_1(y) and f*_2(y) are adequate truncations of f_1(y) and f_2(y), given respectively by

f*_1(y) = f_1(y) / F_1(θ),  0 < y ≤ θ,    and    f*_2(y) = f_2(y) / (1 − F_2(θ)),  θ < y < ∞.        (3)

The mixing weight c is a function of the threshold θ and the parameters of f_1(y) and f_2(y); it varies in the interval [0, 1] and is obtained by imposing the continuity and differentiability conditions at the threshold.
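As a concrete sketch of Eqs. 2–3, the composite density for the Lognormal/Pareto type II pair used later in the paper can be evaluated as follows (an illustrative reconstruction, not the authors' code; the parameter values in the usage are arbitrary):

```python
import math

def lognormal_pdf(y, mu, sigma):
    """Lognormal density with log-scale location mu and shape sigma."""
    return math.exp(-(math.log(y) - mu) ** 2 / (2 * sigma ** 2)) / (
        y * sigma * math.sqrt(2 * math.pi))

def lognormal_cdf(y, mu, sigma):
    """Lognormal distribution function, via the error function."""
    return 0.5 * (1 + math.erf((math.log(y) - mu) / (sigma * math.sqrt(2))))

def pareto2_pdf(y, alpha, lam, theta):
    """Pareto type II density with scale alpha, shape lam, location theta (y > theta)."""
    return (lam / alpha) * (1 + (y - theta) / alpha) ** (-(lam + 1))

def composite_pdf(y, c, mu, sigma, alpha, lam, theta):
    """Composite density in the form of Eqs. 2-3: weighted, truncated components."""
    if y <= theta:
        # lognormal truncated to (0, theta], weight c
        return c * lognormal_pdf(y, mu, sigma) / lognormal_cdf(theta, mu, sigma)
    # the Pareto II located at theta already lives on (theta, inf), weight (1 - c)
    return (1 - c) * pareto2_pdf(y, alpha, lam, theta)
```

Because each component is renormalized on its own side of the threshold, the sub-threshold part integrates to exactly c and the over-threshold part to (1 − c).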

Composite Regression Models
This section involves the probability density functions for composite Lognormal-Pareto Type II regression model along with its log-likelihood function.

Lognormal-Pareto Type II Composite Regression Model
Suppose the first and second components in Eq. 3 are, respectively, the Lognormal distribution with location parameter µ and shape parameter σ > 0, and the Pareto type II distribution with scale parameter α > 0, shape parameter λ > 0 and location parameter θ > 0, and suppose the scale parameters of these distributions are modeled using regression. Then the Composite Lognormal-Pareto Type II regression model is

f(y_i) = c [1 / (y_i σ √(2π))] exp(−(ln y_i − µ_i)² / (2σ²)) / Φ((ln θ − µ_i) / σ),   y_i ∈ (0, θ],
f(y_i) = (1 − c) (λ / α_i) (1 + (y_i − θ) / α_i)^{−(λ+1)},                             y_i ∈ (θ, ∞),        (4)

where µ_i = µ(β, x_i), α_i = α(β, x_i) and Φ(·) is the standard normal distribution function. Note that, since the Pareto type II component is located at θ, its truncation to (θ, ∞) in Eq. 3 coincides with the distribution itself. Here, y = (y_1, y_2, . . . , y_n), i = 1, 2, . . . , n, are the values of the response variable. Let X = (X_1, X_2, . . . , X_p) denote the n × p matrix of values of the explanatory variables, with the first column assumed to be a 1 to accommodate the estimation of an intercept, and let β = (β_1^T, β_2^T) denote the component-specific parameter vectors. The explanatory variables are incorporated through a link function g(·).
In this setting, the scale parameters µ and α are related to the linear predictors through identity and exponential link functions: µ(β, x_i) = β_1^T x_i and α(β, x_i) = exp(β_2^T x_i). The log-likelihood function for the proposed model given in Eq. 4 is

ℓ(β, σ, λ, θ, c) = Σ_{i: y_i ≤ θ} [ ln c − ln Φ((ln θ − µ_i)/σ) − ln(y_i σ √(2π)) − (ln y_i − µ_i)² / (2σ²) ]
                 + Σ_{i: y_i > θ} [ ln(1 − c) + ln λ − ln α_i − (λ + 1) ln(1 + (y_i − θ)/α_i) ]        (5)

Let E_1(Y | Y ≤ θ) and E_2(Y | Y > θ) denote the contributions of the Lognormal distribution and the Pareto type II distribution to the expected value of the response variable, respectively. Under the densities above, they can be written as

E_1(Y | Y ≤ θ) = exp(µ(β, x) + σ²/2) Φ((ln θ − µ(β, x))/σ − σ) / Φ((ln θ − µ(β, x))/σ)        (6a)
E_2(Y | Y > θ) = θ + α(β, x) / (λ − 1),   λ > 1        (6b)

As seen from Eq. 6a, the contribution of the Lognormal distribution to the expected value is proportional to exp(µ(β, x)), the remaining terms in Eq. 6a having little effect on the proportionality. Similarly, in Eq. 6b, the contribution of the Pareto type II distribution to the expected value is proportional to α(β, x), up to the additive threshold θ. Consequently, the effect of the explanatory variables on the response variable Y can be interpreted proportionally.
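The log-likelihood in Eq. 5 can be transcribed directly; the sketch below is illustrative (not the authors' R implementation; the interface and parameter packing are assumptions):

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def composite_loglik(params, X, y):
    """Log-likelihood of the composite Lognormal-Pareto II regression model (Eq. 5).
    params: (beta1, beta2, sigma, lam, theta, c), with beta1/beta2 coefficient lists;
    mu_i = beta1' x_i (identity link), alpha_i = exp(beta2' x_i) (exponential link)."""
    beta1, beta2, sigma, lam, theta, c = params
    ll = 0.0
    for xi, yi in zip(X, y):
        mu = sum(b * v for b, v in zip(beta1, xi))
        alpha = math.exp(sum(b * v for b, v in zip(beta2, xi)))
        if yi <= theta:
            # truncated-lognormal component, weight c
            logf1 = (-(math.log(yi) - mu) ** 2 / (2 * sigma ** 2)
                     - math.log(yi * sigma * math.sqrt(2 * math.pi)))
            trunc = norm_cdf((math.log(theta) - mu) / sigma)
            ll += math.log(c) + logf1 - math.log(trunc)
        else:
            # Pareto II component located at theta, weight (1 - c)
            ll += (math.log(1 - c) + math.log(lam / alpha)
                   - (lam + 1) * math.log(1 + (yi - theta) / alpha))
    return ll
```

This is the objective that PSO maximizes in the sections that follow.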

Particle Swarm Optimization
Heuristic methods such as the Genetic Algorithm, the Differential Evolution Algorithm and PSO, inspired by events in nature, are powerful optimization methods used to solve complex systems. PSO was developed by Eberhart and Kennedy in 1995 [18], inspired by the social behavior of bird flocking and fish schooling. It has been observed that animals moving in herds, often randomly, for needs such as food and safety, reach their goals more easily. When schools of fish, flocks of birds and other social animals were examined, it was seen that these animals interact while searching for food: when one finds food, the others turn toward its location and update their speed accordingly without breaking away from the herd. PSO was therefore developed from this social interaction between birds.
In PSO, individuals are called "particles" (i.e. each bird in swarm) in the swarm. The change of particle position in the search space is based on the social and psychological tendency of individuals to imitate the success of other particles. Therefore, the changes of particles in the group are affected by the experience or knowledge of their neighbors. Hence, the search behavior of particles is affected by the search behaviors of other particles in the group. The result of modeling this social behavior is that the search process causes particles to randomly return to previously successful regions in the search space [23].
PSO starts with a randomly generated set of solutions called a "swarm". Each solution in the swarm is defined as a "particle". Particles (birds) are flown through the multidimensional search space. Each particle has its own position and velocity information that guides its flight, and a fitness value obtained from the fitness function to be optimized (the log-likelihood function in this study). Each particle adjusts its position according to its own and its neighbors' previous experiences. PSO is mainly based on moving the particles in the swarm toward the best-positioned particle of the swarm. This approximation develops randomly and, most of the time, particles end up better positioned by their new movements than by their previous positions; the process continues iteratively until the fitness function is optimized. At each iteration, each particle is updated according to two "best" values: the personal best (pbest), the best fitness obtained by the particle so far, and the global best (gbest), the best solution obtained so far by the whole swarm [27], [30].
The algorithm basically consists of the following steps:
1. The initial swarm is created with randomly generated starting locations and velocities for a prespecified swarm size (K).
2. Fitness values are calculated for each particle.
3. The inertia weight (w), c_1 and c_2 are set.
4. The personal best (p_best) is found for each particle.
5. The global best (g_best) is found among all the particles in the swarm.
6. The velocity (Eq. 7a) and the location (Eq. 7b) are updated at each iteration using the formulas

v_kj^(t+1) = w v_kj^(t) + c_1 r_1,kj (p_best,kj − x_kj^(t)) + c_2 r_2,kj (g_best,j − x_kj^(t))        (7a)
x_kj^(t+1) = x_kj^(t) + v_kj^(t+1)        (7b)

where x_k = (x_k1, x_k2, . . . , x_kd), k = 1, 2, . . . , K, shows the location and v_k = (v_k1, v_k2, . . . , v_kd), k = 1, 2, . . . , K, the velocity of the k-th particle at the t-th iteration in the d-dimensional search space. Here, c_1 and c_2 are acceleration factors used to scale the cognitive and social components respectively, w is the inertia weight, and r_1,kj and r_2,kj are random numbers generated from the Uniform distribution U[0, 1]. p_best,k (personal best) represents the best location experienced by particle k so far and g_best (global best) represents the best location experienced so far by all the particles in the swarm [1], [2], [12], [24].
7. The steps 1-6 are repeated until stopping criterion is satisfied.
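The steps above can be sketched as a minimal, self-contained loop (an illustration, not the authors' implementation; the search bounds, velocity clamp and sphere test function in the usage are assumptions):

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=200, c1=1.49, c2=1.49,
                 w_start=1.4, w_end=0.4, lo=-5.0, hi=5.0, seed=1):
    """Minimal PSO following steps 1-7. Minimizes f; maximizing a
    log-likelihood is equivalent to minimizing its negative."""
    rng = random.Random(seed)
    vmax = (hi - lo) / 2.0  # velocity clamp for numerical stability
    # step 1: random initial locations and velocities
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    # steps 2, 4, 5: initial fitness values, personal bests and global best
    pbest = [xi[:] for xi in x]
    pbest_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda k: pbest_val[k])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for t in range(iters):
        # step 3: linearly decreasing inertia, as in the setting chosen in Section 4
        w = w_start - (w_start - w_end) * t / max(iters - 1, 1)
        for k in range(n_particles):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # step 6: velocity update (Eq. 7a) ...
                v[k][j] = (w * v[k][j]
                           + c1 * r1 * (pbest[k][j] - x[k][j])   # cognitive component
                           + c2 * r2 * (gbest[j] - x[k][j]))     # social component
                v[k][j] = max(-vmax, min(vmax, v[k][j]))
                # ... and location update (Eq. 7b)
                x[k][j] += v[k][j]
            val = f(x[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = x[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = x[k][:], val
    return gbest, gbest_val
```

For example, `pso_minimize(lambda z: sum(t * t for t in z), dim=3)` drives the sphere function close to its minimum at the origin.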
Velocity regulation is very important in the PSO algorithm. In Eq. 7a, the coefficient w, the inertia weight, is used to limit the velocity of the particles. When w > 1, the velocities of the particles increase over time toward the maximum speed and the swarm diverges. On the other hand, small inertia values facilitate local search but weaken the global search capability of PSO [17], [30]. The inertia weight was taken as constant in early studies in the literature. Later, in the paper by Eberhart and Shi (2000) [11], a linearly decreasing value for the inertia weight (w) was proposed.
In this study, PSO is used to estimate the parameters of the model given in Eq. 4. PSO is an efficient algorithm: it is simple to use and does not need score functions or their derivatives. Its computational simplicity and lower elapsed time compared to traditional optimization algorithms make PSO appealing, especially for complex models. Considering that the score functions of the likelihood are extremely complex and solutions are difficult to obtain with traditional methods, PSO was chosen to obtain the model parameter estimates.

Simulation Study
Determining the value of the inertia weight is vital in the PSO algorithm. Therefore, the simulation was carried out as a two-stage process. In the first stage, a simulation study was designed to determine the optimal hyperparameters of the PSO algorithm. In the second stage, another simulation study was designed to obtain the model parameters given in Eq. 4 using PSO with the hyperparameters chosen in stage 1.
In both stages, the number of explanatory variables was taken as 4, the component-specific parameter vectors were taken as β_1 = (0.69, −0.29, 0.41, 0.92) and β_2 = (0.41, 0.83, 0.56, −0.51), and the value of the threshold was specified as 10. The explanatory variable matrix was generated from the normal distribution with mean 2 and standard deviation 0.5. The values of the dependent variable below and above the threshold θ were generated, respectively, from the Lognormal distribution with location parameter µ(β, x_i) = β_1^T x_i and shape parameter σ = 0.1, and from the Pareto type II distribution with scale parameter α(β, x_i) = exp(β_2^T x_i), shape parameter λ = 2 and location parameter θ = 10. Here, the mixing weight was taken as 0.50.
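For illustration, the data-generating step of this design could be sketched as follows (an assumed reconstruction using inverse-cdf sampling, not the authors' R code; the first column of X is the intercept):

```python
import math
import random
from statistics import NormalDist

def simulate_composite(n, beta1, beta2, sigma=0.1, lam=2.0, theta=10.0, c=0.5, seed=42):
    """Draw (X, y): covariates ~ N(2, 0.5) plus an intercept column; with
    probability c the response comes from the lognormal truncated to (0, theta],
    otherwise from the Pareto II located at theta."""
    rng, nd = random.Random(seed), NormalDist()
    X, y = [], []
    for _ in range(n):
        xi = [1.0] + [rng.gauss(2.0, 0.5) for _ in range(len(beta1) - 1)]
        mu = sum(b * v for b, v in zip(beta1, xi))
        alpha = math.exp(sum(b * v for b, v in zip(beta2, xi)))
        if rng.random() < c:
            # inverse-cdf draw from the lognormal truncated to (0, theta];
            # the max() guards against floating-point underflow of F1(theta)
            u = max((1.0 - rng.random()) * nd.cdf((math.log(theta) - mu) / sigma), 1e-300)
            yi = math.exp(mu + sigma * nd.inv_cdf(u))
        else:
            # inverse-cdf draw from the Pareto II: F2(y) = 1 - (1 + (y - theta)/alpha)^(-lam)
            yi = theta + alpha * ((1.0 - rng.random()) ** (-1.0 / lam) - 1.0)
        X.append(xi)
        y.append(yi)
    return X, y
```

With c = 0.5, roughly half of the responses fall below the threshold and half above it, matching the design used here.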
In the first stage of the simulation study, a grid search was implemented to optimize the inertia weight parameter of the PSO algorithm. The fixed inertia weight values were taken as (0.1, 0.3, 0.5, 0.7, 0.9, 1.1). In addition, a dynamic inertia weight decreasing linearly from 1.4 to 0.4 was considered. The other hyperparameters of the PSO algorithm used in the first stage were as follows: • Sample size was taken as 300.
• Swarm size was taken as 20.
• The coefficients c 1 and c 2 were taken equal to 1.49.
Each scenario was repeated 500 times. The maximum number of iterations of the PSO algorithm was taken as 20, and the simulation was stopped when the PSO algorithm reached the maximum iteration number. For each scenario, the Akaike Information Criterion (AIC) and the total elapsed time were calculated. The scenarios for the different inertia weight values were compared according to their AIC values.
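For reference, the two criteria used for model comparison throughout the paper can be computed as follows (generic helpers, not tied to the authors' code; k is the number of estimated parameters, n the number of observations, and loglik the maximized log-likelihood):

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2*loglik (smaller is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k*ln(n) - 2*loglik (smaller is better)."""
    return k * math.log(n) - 2 * loglik
```

A BIC gap above 10 between two models, as used in the comparisons below, is conventionally read as very strong evidence for the lower-BIC model.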
The PSO algorithm detailed in Section 3 and the entire data generation process were implemented in R ver. 4.0.0 [26] with self-written code. The results of the first stage of the simulation study are given in Table 1.
[ Table 1 about here] As seen from Table 1, the total elapsed times for the different scenarios were similar. When the AIC values were compared, the dynamic inertia weight decreasing linearly from 1.4 to 0.4 had the lowest AIC value. Therefore, the PSO algorithm with the dynamic inertia weight was used to estimate the model parameters in Eq. 4 in the second stage.
In the second stage of the simulation study, the scenarios given below were run to obtain parameter estimates of the model in Eq. 4 by the maximum likelihood method.
• Sample size was taken as 300, 500 and 1000.
• Swarm size was taken as 20 and 40.
• The coefficients c 1 and c 2 were taken equal to 1.49.
• As a result of the first stage of the simulation, the inertia weight was decreased linearly from 1.4 to 0.4.
Each scenario was repeated 1500 times. The maximum number of iterations of the PSO algorithm was taken as 20, and the simulation was stopped when the PSO algorithm reached the maximum iteration number. For each scenario, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the mean of the total absolute bias and the total elapsed time were calculated. The results of the proposed model obtained for the different sample sizes and swarm sizes were compared. Using the simulated data, the parameters of the Lognormal regression and Lomax regression models were also estimated, and their AIC and BIC values were compared with those of the proposed Lognormal-Pareto Type II composite regression model. The results are given in Table 2, Table 3 and Table 4.
The parameter estimates and their MSE values for sample sizes 300, 500 and 1000 are given in Table 2, Table 3 and Table 4, respectively, for swarm sizes 20 and 40. The AIC and BIC values were close to each other for each sample size. The results indicated that the AIC, BIC and total absolute bias values decreased as the swarm size increased for each sample size, while the total elapsed time increased with the swarm size. For every sample size, swarm size 40 had the lowest AIC and BIC values. As the sample size increased, the MSE of the parameter estimates decreased; it can therefore be concluded that the parameter estimates are consistent.
The parameters of the Lognormal regression and Lomax regression models were also estimated using the simulated data. The probability density functions and mean structures of these models are given below.

Lognormal regression model:

f(y_i) = [1 / (y_i σ √(2π))] exp(−(ln y_i − µ(γ, x_i))² / (2σ²)),   µ(γ, x_i) = γ^T x_i,

where σ > 0 is the shape parameter, µ is the scale parameter and γ^T = (γ_0, γ_1, γ_2, γ_3) is the parameter vector.

Lomax regression model:

f(y_i) = (λ / α(γ, x_i)) (1 + y_i / α(γ, x_i))^{−(λ+1)},   α(γ, x_i) = exp(γ^T x_i),

where λ > 0 is the shape parameter and α > 0 is the scale parameter.
The BIC values of the Lognormal regression model for sample sizes 300, 500 and 1000 were 2418.2109, 4037.6981 and 8001.3424, respectively. The BIC values of the Lomax regression model for the same sample sizes were 4045.0708, 6744.2876 and 13234.6622, respectively. Comparing these results with those of the proposed Lognormal-Pareto Type II regression model, the difference in BIC between the Lognormal regression model and the proposed model, ΔBIC = BIC_LogN − BIC_Prop, and the difference between the Lomax regression model and the proposed model, ΔBIC = BIC_Lomax − BIC_Prop, were greater than 10 for all sample sizes (ΔBIC ≫ 10). Hence, it was concluded that the proposed model provided a better fit to the data than the Lognormal and Lomax regression models.

Household Budget Data
The proposed Lognormal-Pareto Type II regression model was deployed to analyze monthly consumption expenditure and related factors using the 2018 National Household Budget Survey conducted by the Turkish Statistical Institute (TurkStat). The household budget survey was first conducted in 1964. Since 2004, it has been carried out annually to reveal the consumption patterns and income levels of individuals and households by socio-economic group, rural and urban areas, and regions of Turkey. In this study, the survey conducted between January 1st and December 31st, 2018 on 1296 households was used. The sampling design of the Household Budget Survey was the stratified two-stage cluster sampling method; therefore, parameter estimates were obtained using sampling weights. The micro data of the 2018 Household Budget Survey consist of three data sets: Household, Individual and Consumption Expenditure. These three data sets were linked by one-to-one matching on the "Household ID", common to all data sets, in order to select the variables to be used in the analysis. After the matching process, the data were filtered to individuals living alone in the household, since it is more practical to evaluate the relationship between monthly consumption expenditure and related factors for an individual.
The explanatory variables used in the analysis, including continuous variables and categorical variables with their reference levels, were given below. The explanatory variables were thought to be correlated with the dependent variable, monthly consumption expenditure. The categorical variables were coded using the leave-one-out method relative to the reference level.
As mentioned before, the parameters were estimated under weighted data by taking the sampling method into consideration. Besides, parameter estimates under unweighted data were also obtained and compared with the weighted results. Similar to the simulation study, parameter estimates were also obtained under weighted and unweighted data using Lognormal and Lomax regression models.
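The paper does not spell out the weighted estimation formula; a standard choice (an assumption here, not necessarily the authors' exact update) is the pseudo-log-likelihood in which each observation's log-density is multiplied by its sampling weight, with PSO then maximizing this weighted sum:

```python
def weighted_loglik(logf, params, X, y, weights):
    """Survey-weighted pseudo-log-likelihood: sum of w_i * log f(y_i | x_i; params).
    `logf` is any per-observation log-density, e.g. one branch of Eq. 4."""
    return sum(w * logf(params, xi, yi) for w, xi, yi in zip(weights, X, y))
```

With all weights equal to 1 this reduces to the ordinary log-likelihood, so the same PSO code can serve both the weighted and unweighted analyses.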
To provide ease of operation in the analysis, estimates were obtained by dividing the monthly consumption expenditure by 100. The histograms of the weighted and unweighted monthly consumption expenditure are given in Figure 1.
[ Figure 1 about here] Summary statistics calculated under weighted and unweighted data for the variables to be used in the regression model were given in Table 5.
[ Table 5 about here] When the results in Table 5 were examined, the summary statistics obtained for the weighted and unweighted data were generally similar, with some differences. As the data were obtained by stratified two-stage cluster sampling, the summary statistics for the weighted data were interpreted. The average age of the participants was 55.97 with standard deviation 18.47; 42.7% of them were male, 39.0% had primary education and 94.1% were single. More than half of the participants were working (57.3%), and about two-thirds owned property. 6.2% had the habit of buying daily newspapers. Only 10% had private insurance and one third had a smoking habit. In addition, 11.7% had a cable TV subscription and 5.1% were playing games of chance. The average monthly consumption expenditure of the participants was 28.33 with standard deviation 23.35, varying between 1.09 and 247.98. The average annual income of individuals was 2832.84 Turkish Lira (TL) with standard deviation 2334.91, varying between 109 and 24798.19. 60% of the participants had a savings habit, 39% had a credit card and 77.5% had the habit of online shopping.
The parameter estimates for the Lognormal-Pareto Type II composite regression model, obtained for the weighted and unweighted data using the variables considered related to average monthly consumption expenditure, are given in Table 6. The histograms of the monthly consumption expenditure for the weighted and unweighted data were very similar, as seen in Figure 1, and the summary statistics obtained under the weighted and unweighted data were likewise similar. However, when the parameter estimates in Table 6 were examined, the results for the weighted and unweighted data were very different from each other. In particular, the estimate of the threshold value (θ) for the weighted data was around 9, while the estimate for the unweighted data was around 40. Since the sampling method was stratified two-stage cluster sampling, the parameters should be estimated under weighted data. Examining the models generated under weighted data, the model with swarm size 40 had the lowest AIC and BIC values and gave the better results. The reason for obtaining parameter estimates for the weighted and unweighted data separately was to show that they can be quite different, even though both distributions appear very similar in Figure 1.
Similar to the simulation study, parameter estimates were also obtained using the Lognormal and Lomax regression models. For the Lognormal regression model, the AIC_w and BIC_w values for the weighted data were 80190799.46 and 80191062.23, respectively. For the Lomax regression model, the AIC_w and BIC_w values for the weighted data were 54178799.27 and 54179062.04, respectively. There was a significant difference between the BIC values of the Lognormal-Pareto Type II composite regression model and those of the Lognormal and Lomax regression models. This result confirmed that the model most compatible with the data was the proposed Lognormal-Pareto Type II composite regression model.
The parameter estimates under the weighted data are presented in Table 6. The threshold value θ for the Lognormal-Pareto Type II composite regression model with swarm size 40 was estimated as 9.8795. Households whose monthly consumption expenditure was equal to or less than 9.8795 TL were called the "low expenditure class", and those above 9.8795 TL the "high expenditure class". This classification is used in the following comments.
[ Table 6 about here] When the parameter estimates for the weighted data with swarm size 40 in Table 6 were examined, the age variable had an increasing effect on the average monthly consumption expenditure in the low expenditure class, but a decreasing effect in the high expenditure class.
For the low expenditure class, the average monthly consumption expenditure of women was exp(−0.0134) times that of men, i.e. slightly lower. For the high expenditure class, women spent exp(0.0129) times as much as men.
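Because the covariates act on the expected expenditure through exp(·), each reported coefficient translates into a multiplicative factor relative to the reference level; for instance, for the gender coefficients quoted above:

```python
import math

# gender coefficients quoted in the text (reference level: men)
low_class_coef, high_class_coef = -0.0134, 0.0129

# multiplicative effect on expected expenditure relative to the reference level
low_factor = math.exp(low_class_coef)    # about 0.987: roughly 1.3% lower
high_factor = math.exp(high_class_coef)  # about 1.013: roughly 1.3% higher
```

The same conversion applies to every exp(·) factor reported in the remaining comparisons.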
For the low expenditure class, the average monthly consumption expenditure of people with primary or secondary education was, respectively, exp(0.1024) and exp(0.055) times that of those who had not completed any school, while for people with a bachelor's degree it was exp(−0.0534) times that level, i.e. lower. For the high expenditure class, the average monthly consumption expenditure increased as the level of education increased.
For both expenditure classes, the average monthly consumption expenditure was lower for married individuals compared to single ones.
For the low expenditure class, the average monthly consumption expenditure of participants who had the habit of dining out was exp(0.1321) times that of individuals without this habit. For the high expenditure class, those who dined out spent exp(0.2792) times as much as those who did not.
For both expenditure classes, those who had a savings habit had a higher amount of expenditure than those who did not.
For both expenditure classes, the average monthly consumption expenditure was higher for individuals who were working compared to those who were not working.
The annual income variable had an increasing effect on the average monthly consumption expenditure for both expenditure classes.
The average monthly consumption expenditure of participants in the low and high expenditure classes who owned property was, respectively, exp(0.0999) and exp(0.3266) times that of individuals who did not own any property.
For both expenditure classes, the average monthly consumption expenditure was higher for individuals who had habit of online shopping compared to those who did not have this habit.
For both expenditure classes, those who had a smoking habit had a higher amount of expenditure than those who did not smoke.
For the low expenditure class, the average monthly consumption expenditure of participants who had the habit of buying a daily newspaper was exp(0.0339) times that of individuals without this habit. Similarly, in the high expenditure class, the habit of buying a daily newspaper had an increasing effect {exp(0.1732)} on average monthly consumption expenditure.
For both expenditure classes, the average monthly consumption expenditure was higher for individuals who had Cable TV subscription compared to those who did not.
Those who played games of chance in the low expenditure class had a lower amount of expenditure than those who did not. On the other hand, those who played games of chance in the high expenditure class spent exp(0.2336) times as much as those who did not have this habit.
For the low expenditure class, the average monthly consumption expenditure of participants who had a credit card was exp(0.0412) times that of individuals without one. Similarly, for the high expenditure class, having a credit card had an increasing effect {exp(0.1267)} on average monthly consumption expenditure.
Those who had private health insurance in the low expenditure class had a lower amount of expenditure than those who did not, while those with private health insurance in the high expenditure class spent exp(0.2429) times as much as those without it.
Summary statistics calculated for the low and high expenditure classes are given in Table 7.

Conclusion
The main purpose of this study was to propose a novel composite regression model for heavy-tailed data. The proposed model consists of two components, the Lognormal and Pareto type II distributions. In the literature, the only study on composite regression models was the paper by Gan and Valdez (2018) [15], who proposed Gamma-Pareto and Pareto-Type I Gumbel spliced regression models to estimate auto insurance claims. The model proposed in this study is different from the models proposed by Gan and Valdez. Consequently, this study provides an important contribution to the existing literature.
The PSO algorithm is capable of dealing with a large number of parameters, is a powerful tool for solving complex systems, and has a quite short elapsed time. In this study, the PSO algorithm was used to obtain the parameter estimates of the proposed model, and it was implemented in two stages. The first stage was designed to find the optimal value for the inertia weight, one of the hyperparameters of the PSO algorithm. According to the results of the first stage of the simulation study, the dynamic inertia weight decreasing linearly from 1.4 to 0.4 had the lowest AIC value compared to the fixed inertia values, so the dynamically decreasing inertia weight was used in the second stage of the simulation. The second stage was performed to obtain the parameter estimates of the proposed model using the PSO algorithm. As expected, the MSE values of the parameter estimates decreased as the sample size increased for each swarm size. In addition, the proposed composite Lognormal-Pareto type II regression model was compared with the Lognormal regression and Lomax regression models according to the AIC and BIC values. The results of the simulation study showed strong support in favor of the proposed model.
After that, monthly consumption expenditure and its affecting factors were modeled via the proposed composite Lognormal-Pareto type II regression model using the National Household Budget Survey conducted annually by the Turkish Statistical Institute (TurkStat). The explanatory variables in the model were age, gender, level of education, marital and employment status, whether or not the individual had private health insurance, habits of smoking, buying a daily newspaper, playing games of chance, eating out, online shopping and saving, property ownership, and having a cable TV subscription and a credit card. Setting up a regression model that fits the data well was challenging: as seen in Figure 1, monthly consumption expenditure showed heavy-tailed behavior, so standard regression models were insufficient to capture the relationship between monthly consumption expenditure and its related variables. The results for the Household Budget Survey data showed that the proposed regression model provided a more favorable fit than the Lognormal and Lomax regression models.
Additionally, the sampling design of the Household Budget Survey was stratified two-stage cluster sampling. Accordingly, it was another challenge to obtain parameter estimates under weighted data. To handle this challenge, the model equation and the PSO algorithm were updated to take the sampling weights into account. Finally, the parameters were estimated for the weighted data, and the effect of the explanatory variables on average monthly consumption expenditure was explained in detail.
Acknowledgements I would like to thank the Turkish Statistical Institute (TurkStat) in Turkey for their valuable contributions in providing the data for this research. I would also like to thank Dr. Ayten Yigiter for valuable comments to help improve the presentation of this paper.

Declarations
Funding: Not applicable
Conflicts of interest/Competing interests: Not applicable
Availability of data and material: The data were provided by the Turkish Statistical Institute (TurkStat) with restricted use; I am not allowed to share them with third parties.
Code availability: Not applicable