A new composite lognormal-Pareto type II regression model to analyze household budget data via particle swarm optimization

When data exhibit heavy-tailed behavior, traditional regression approaches may be inadequate or inappropriate. In such analyses, composite models, which are built by piecing together two or more weighted distributions at specified threshold(s), are alternative models. When the data contain covariate information, composite regression models can be used. There is little work on this topic in the existing literature; the only study is the paper by Gan and Valdez (N Am Actuar J 22(4):554–573, 2018). In this study, a novel composite Lognormal-Pareto type II regression model is proposed. Particle swarm optimization (PSO) is performed to obtain the parameters of the proposed model. The proposed model is applied to model monthly consumption expenditure and its affecting factors. The data are obtained from the National Household Budget Survey, which is conducted annually by the Turkish Statistical Institute (TurkStat). Since the sampling design of the Household Budget Survey is stratified two-stage cluster sampling, the parameters are estimated under weighted data by adapting the proposed model and PSO accordingly. Additionally, the performance of the proposed regression model is compared with Lognormal, Lomax, Gamma, and Gamma-Pareto type II regression models. The results demonstrate that the proposed model provides an improved fit to the data.


Introduction
Researchers often seek models that fit data well when a single component distribution does not provide a good fit due to certain properties of the data, such as heterogeneity, multimodality, or heavy-tailedness. Finite mixture models have been widely used for modeling data involving heterogeneous sub-populations, since modeling such data with a single distribution often yields an inadequate fit to the whole data. Whether a unimodal, bimodal, or even multimodal distribution is used depends on the degree of heterogeneity among the sub-populations (Shen 2011). Especially after the publication of the monograph by McLachlan and Basford (1988), researchers working in various fields noticed the potential usefulness of mixture models for inference and clustering (McLachlan and Peel 2000b). Finite mixture distributions are obtained by taking a weighted average of a finite number of distributions of the same type with different parameter values, or of completely different distributions (van Dijk 2009). Numerous special-topic reviews of finite mixture models have been carried out in the literature. Some of these studies are: the mixture of two normal distributions by Cohen (1967), the mixture of t-distributions by Peel and McLachlan (2000), the mixture of skew normal and skew t-distributions by Lee and McLachlan (2013), and the mixture of generalized hyperbolic distributions by Browne and McNicholas (2015). For detailed information, interested readers can also refer to the books by, for example, Everitt and Hand (1981), McLachlan and Peel (2000a) and Frühwirth-Schnatter (2006).
Finite mixture models have been extended by introducing relevant covariates, usually associated with the distribution parameters or even the mixing proportions, to allow the modeling of heterogeneous regression relationships; this is called a finite mixture regression model (Dietz and Böhning 1995). Some studies using finite mixture regression models are: a two-component finite mixture lognormal regression model by Tooze et al. (2002), a two-component finite mixture normal regression model by Yau et al. (2003), and a two-component finite mixture generalized linear mixed effect regression model by Hall and Wang (2005).
Composite models constructed by joining the two (or more) weighted distributions at a given threshold value(s) can be interpreted as a finite mixture model with mixing weights c and (1 − c) (Dominicy and Sinner 2017). These models are especially used to model data with fat-tailed behavior. Some of the studies where composite models used are: composite Lognormal-Pareto model by Cooray and Ananda (2005), composite Lognormal-Pareto models by Scollnik (2007), composite exponential-Pareto models by Teodorescu and Vernic (2009), composite Pareto models by Teodorescu and Vernic (2013), and composite Stoppa models by Calderín-Ojeda and Kwok (2016).
Composite regression models, formed by modeling the distribution parameters using regression as in finite mixture regression models, can be used to capture both fat-tailed behavior and heterogeneity in the data. Although there are many studies on composite models in the literature, there is only one study on composite regression models, by Gan and Valdez (2018). Gan and Valdez (2018) proposed two composite regression models consisting of three components to model Singapore auto claims data, capturing the fat-tailed behavior and zero-inflation of the data as well as the policyholders' heterogeneity.
Composite regression models are quite a new topic, and many studies are needed in this area. In this study, a novel composite regression model is proposed, consisting of two components: the Lognormal and Pareto type II distributions. The proposed model is employed to analyze monthly consumption expenditure and its affecting factors. The data were obtained from the National Household Budget Survey, conducted annually by the Turkish Statistical Institute (TurkStat). Monthly consumption expenditure data show heavy-tailed behavior; therefore, a traditional regression approach is not suitable for these data. In the proposed model, particle swarm optimization is adapted to obtain the model parameters.
The rest of the paper is organized as follows. Section 2 presents the proposed Lognormal-Pareto type II regression model along with brief information about composite models. Particle swarm optimization (PSO), which is used to estimate the parameters of the proposed model, is introduced with its steps in Sect. 3. Section 4 presents a simulation study to assess the performance of PSO. The application of the proposed model to household budget survey data is given in Sect. 5. A brief discussion of the results of this study is given in Sect. 6.

Composite models
Composite models consisting of two parts, a sub-threshold and an over-threshold distribution, can be more suitable to model data which are skewed with a heavier right tail than classic models, including the gamma, the lognormal, the Weibull, or the Pareto (Bølviken 2014).
Assume that the random variable Y involves two subpopulations with a light-tailed probability density function f1(y) and a heavy-tailed probability density function f2(y), respectively. Its probability density function can then be expressed as

f(y) = r f1(y) for 0 < y ≤ θ, and f(y) = (1 − r) f2(y) for θ < y < ∞, (1)

where θ is the threshold and r is a mixing weight chosen so that f(y) integrates to one. A smooth probability density function can be obtained by imposing the continuity condition f1(θ−) = f2(θ+) and the differentiability condition f1'(θ−) = f2'(θ+) at the threshold (Cooray and Ananda 2005).
Scollnik (2007) expressed the model given in Eq. 1 as a convex combination of two probability density functions:

f(y) = c f1*(y) + (1 − c) f2*(y), (2)

where 0 ≤ c ≤ 1 and f1*(y) and f2*(y) are adequate truncations of f1(y) and f2(y), given respectively by

f1*(y) = f1(y) / F1(θ) for 0 < y ≤ θ, and f2*(y) = f2(y) / (1 − F2(θ)) for θ < y < ∞, (3)

with F1 and F2 the corresponding cumulative distribution functions. The mixing weight c, a function of the threshold θ and the parameters of f1(y) and f2(y) varying in the interval [0, 1], is obtained by imposing the continuity and differentiability conditions at the threshold.
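To make the convex-combination form concrete, the following sketch (in Python rather than the paper's R, with illustrative parameter values of our own choosing) evaluates Eq. 2 for the lognormal-Pareto type II pair and checks numerically that the composite density integrates to one:

```python
import numpy as np
from scipy import stats, integrate

def composite_pdf(y, c, theta, mu, sigma, lam, alpha):
    """Composite lognormal-Pareto type II density in the Eq. 2 form."""
    y = np.asarray(y, dtype=float)
    # f1*: lognormal density right-truncated at theta
    f1 = stats.lognorm.pdf(y, s=sigma, scale=np.exp(mu))
    F1_theta = stats.lognorm.cdf(theta, s=sigma, scale=np.exp(mu))
    below = c * f1 / F1_theta
    # f2*: Pareto type II with location theta; its support already starts
    # at theta, so 1 - F2(theta) = 1 and no extra truncation is needed.
    z = np.maximum(y - theta, 0.0) + alpha  # clamp to avoid NaN below theta
    above = (1 - c) * lam * alpha**lam / z**(lam + 1)
    return np.where(y <= theta, below, above)

# Illustrative parameters: c = 0.5, theta = 10, mu = 2, sigma = 0.3,
# lam = 2, alpha = 5.  Each piece should carry mass c and 1 - c.
args = (0.5, 10.0, 2.0, 0.3, 2.0, 5.0)
mass_below = integrate.quad(composite_pdf, 0.0, 10.0, args=args)[0]
mass_above = integrate.quad(composite_pdf, 10.0, np.inf, args=args)[0]
```

`mass_below` and `mass_above` come out as c and 1 − c, so the composite density integrates to one for any mixing weight in [0, 1].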

Composite regression models
This section introduces the probability density function of the composite Lognormal-Pareto type II regression model along with its log-likelihood function.

Lognormal-Pareto type II composite regression model
Suppose the first component in Eq. 3 is the Lognormal distribution with location parameter μ and shape parameter σ > 0, and the second component in Eq. 3 is the Pareto type II distribution with scale parameter α > 0, shape parameter λ > 0 and location parameter θ > 0. The location parameter of the Lognormal distribution and the scale parameter of the Pareto type II distribution are modeled using regression. Then, the composite Lognormal-Pareto type II regression model is

f(y_i) = c φ((ln y_i − μ_i)/σ) / (σ y_i Φ((ln θ − μ_i)/σ)) for 0 < y_i ≤ θ, and
f(y_i) = (1 − c) λ α_i^λ / (y_i − θ + α_i)^(λ+1) for θ < y_i < ∞, (4)

where φ(·) and Φ(·) denote the standard normal density and distribution functions. Here, c is the mixing weight and y = (y_i, i = 1, 2, ..., n) are the values of the response variable. Let X = (X_ij, i = 1, 2, ..., n, j = 0, 1, ..., p) denote the n × (p + 1) matrix of the values of the p explanatory variables, with the first column assumed to be 1 to accommodate the estimation of an intercept, and let X_i = (1, X_i1, X_i2, ..., X_ip) be the vector of explanatory variable values for the ith observation. β = (β1ᵀ, β2ᵀ) denotes the component-specific parameter vectors. The explanatory variables are incorporated through the link function g(·). In this setting, the location parameter μ and the scale parameter α are related to the linear predictors through the identity and exponential link functions as

μ_i(β1, X_i) = β1ᵀ X_i and α_i(β2, X_i) = exp(β2ᵀ X_i).

Let E(Y | Y ≤ θ) and E(Y | Y > θ) denote, respectively, the contributions of the Lognormal distribution and the Pareto type II distribution to the expected value of the response variable. For the probability density function given in Eq. 4, these can be obtained as

E(Y | Y ≤ θ) = exp(μ* + σ²/2) Φ((ln θ − μ* − σ²)/σ) / Φ((ln θ − μ*)/σ), (5a)
E(Y | Y > θ) = θ + α*/(λ − 1) for λ > 1, (5b)

where μ* represents the mean of μ_i(β1, X_i) and α* represents the mean of α_i(β2, X_i).
As seen from the right-hand side of Eq. 5a, the contribution of the Lognormal distribution to the expected value is approximately proportional to exp(μ*). The term Φ((ln θ − μ* − σ²)/σ) / Φ((ln θ − μ*)/σ) is the ratio of two probabilities, and the value of this ratio is approximately equal to 1, so it does not affect the proportionality. As seen from the right-hand side of Eq. 5b, the Pareto type II distribution contributes an additive term driven by α* to the expected value. Consequently, the effect of the explanatory variables on the response variable Y can be interpreted proportionally for the Lognormal regression part and additively for the Pareto type II regression part.
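The two conditional means behind Eqs. 5a-5b can be verified numerically. The sketch below (ours, in Python with arbitrary parameter values) compares the closed-form truncated-lognormal and shifted Pareto type II means against direct numerical integration of the component densities:

```python
import numpy as np
from scipy import stats, integrate

# Illustrative values (ours): mu* = 2, sigma = 0.3, theta = 10,
# alpha* = 5, lam = 2.
mu, sigma, theta, alpha, lam = 2.0, 0.3, 10.0, 5.0, 2.0

# Eq. 5a: truncated-lognormal mean, closed form vs. numerical integration
num = stats.norm.cdf((np.log(theta) - mu - sigma**2) / sigma)
den = stats.norm.cdf((np.log(theta) - mu) / sigma)
mean_low = np.exp(mu + sigma**2 / 2) * num / den

F1_theta = stats.lognorm.cdf(theta, s=sigma, scale=np.exp(mu))
mean_low_num = integrate.quad(
    lambda y: y * stats.lognorm.pdf(y, s=sigma, scale=np.exp(mu)) / F1_theta,
    0.0, theta)[0]

# Eq. 5b: shifted Pareto type II mean theta + alpha/(lam - 1), for lam > 1
mean_high = theta + alpha / (lam - 1)
mean_high_num = integrate.quad(
    lambda y: y * lam * alpha**lam / (y - theta + alpha)**(lam + 1),
    theta, np.inf)[0]
```

Both closed-form values agree with the numerical integrals, illustrating the decomposition of E(Y) into the two component contributions.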
Since the parameter estimates of the proposed model given in Eq. 4 are obtained by the maximum likelihood method, the log-likelihood function given the observed data is

ℓ(β, σ, λ, θ, c) = Σ_{i: y_i ≤ θ} [ ln c + ln φ((ln y_i − μ_i)/σ) − ln(σ y_i) − ln Φ((ln θ − μ_i)/σ) ]
+ Σ_{i: y_i > θ} [ ln(1 − c) + ln λ + λ ln α_i − (λ + 1) ln(y_i − θ + α_i) ]. (6)
It should be noted that it is very difficult to solve the score functions, obtained by differentiating the log-likelihood of the proposed model, to find the parameter estimates. In such cases, heuristic algorithms such as the genetic algorithm and the PSO algorithm are widely used for parameter estimation in complex problems. In this study, PSO is performed to obtain the parameters of the proposed model, with the log-likelihood function given in Eq. 6 used as the fitness function in the PSO algorithm.
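A direct implementation of the negative of the log-likelihood in Eq. 6, suitable as a PSO fitness function, might look as follows. This is a sketch in Python rather than the paper's R; the packing of all parameters into a single vector and the infinite penalty for infeasible particles are our own conventions:

```python
import numpy as np
from scipy import stats

def neg_loglik(params, y, X):
    """Negative log-likelihood of Eq. 6; 'params' packs beta1, beta2,
    sigma, lam, theta and c into one vector (our convention)."""
    p = X.shape[1]
    beta1, beta2 = params[:p], params[p:2 * p]
    sigma, lam, theta, c = params[2 * p:]
    if sigma <= 0 or lam <= 0 or theta <= 0 or not 0 < c < 1:
        return np.inf                       # penalize infeasible particles
    mu = X @ beta1                          # identity link
    alpha = np.exp(X @ beta2)               # exponential link
    low = y <= theta
    # Lognormal branch, right-truncated at theta
    ll = np.sum(np.log(c)
                + stats.norm.logpdf((np.log(y[low]) - mu[low]) / sigma)
                - np.log(sigma * y[low])
                - stats.norm.logcdf((np.log(theta) - mu[low]) / sigma))
    # Pareto type II branch with location theta
    a = alpha[~low]
    ll += np.sum(np.log(1 - c) + np.log(lam) + lam * np.log(a)
                 - (lam + 1) * np.log(y[~low] - theta + a))
    return -ll
```

Minimizing this function over the particle positions is equivalent to maximizing Eq. 6.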

Particle swarm optimization
Heuristic methods such as the genetic algorithm, the differential evolution algorithm and the PSO algorithm, inspired by events in nature, are powerful optimization methods used to solve complex problems. PSO, developed by Kennedy and Eberhart (1995), was inspired by the social behavior of bird flocking and fish schooling. It has been observed that the actions (e.g., searching for food and finding safe locations) of animals moving in herds, often randomly, enable them to reach their goals more easily. When schools of fish, flocks of birds and other social animals were examined, it was seen that they interact while searching for food: when one finds food, the others turn toward the location of the food and update their speed accordingly without breaking away from the herd. PSO was therefore developed by modeling this social interaction.
In PSO, an individual is called a "particle" (e.g., each bird in swarm) in the swarm. The change of particle position in the search space (i.e., parameter space) is based on the social and psychological tendency of individuals to imitate the success of other particles. Therefore, the changes of particles in the group are affected by the experience or knowledge of their neighbors. Hence, the search behavior of particles is affected by the search behaviors of other particles in the group. The result of modeling this social behavior is that the search process causes particles to randomly return to previously successful regions in the search space (Özsaglam and Çunkaş 2008).
PSO starts with a set of possible solutions called a "swarm", randomly generated from the parameter space at the beginning of the algorithm. Each solution in the swarm is defined as a "particle". Particles (birds) are flown through the multidimensional search space. Each particle has its own position and velocity information that guides its flight, and a fitness value obtained from the fitness function (the log-likelihood function in this study) to be optimized. Each particle adjusts its position according to its own and its neighbors' previous experiences. PSO is mainly based on moving the particles in the swarm toward the best-positioned particle of the swarm. This process develops stochastically: particles reach better positions through their new movements than their previous positions, and this continues iteratively until the fitness function is optimized. At each iteration, each particle is updated according to two "best" values: the personal best (pbest), the best fitness obtained by the particle so far, and the global best (gbest), the best solution obtained so far by the whole swarm (Saravanan 2006; Shi and Eberhart 1999).
Let X_k(t) = (x_k1, x_k2, ..., x_kd) denote the location and V_k(t) = (v_k1, v_k2, ..., v_kd) the velocity of the kth particle (k = 1, 2, ..., K) at the tth iteration in the d-dimensional search space, with j = 1, 2, ..., d indexing the positions. For the (t + 1)th iteration, the velocity and location of the kth particle at the jth position are updated by the following formulas:

v_kj(t + 1) = v_kj(t) + c1 r1,kj (p_best,kj − x_kj(t)) + c2 r2,kj (g_best,j − x_kj(t)), (7a)
x_kj(t + 1) = x_kj(t) + v_kj(t + 1). (7b)

Here, c1 and c2 are accelerating factors used to scale the cognitive and social components, respectively, and r1,kj and r2,kj are random numbers generated from the standard uniform distribution U[0, 1]. p_best,k (personal best) represents the best location the kth particle has experienced so far and g_best (global best) represents the best location experienced so far among all the particles in the swarm (Acitas et al. 2019; Amoshahy et al. 2016; Engelbrecht 2013; Parsopoulos et al. 2002). Velocity regulation is very important in the PSO algorithm, and there are many methods in the literature to limit the velocity. One of them is the inertia weight w, introduced by Shi and Eberhart (1998) to limit the velocity of the particles. The velocity formula obtained by adding the inertia weight to Eq. 7a is

v_kj(t + 1) = w v_kj(t) + c1 r1,kj (p_best,kj − x_kj(t)) + c2 r2,kj (g_best,j − x_kj(t)). (8)

When the inertia weight w > 1, the velocities of the particles increase over time toward the maximum speed and the swarm diverges. On the other hand, small inertia values facilitate local search but weaken the global search capability of PSO (Karaboga 2017; Shi and Eberhart 1999). The inertia weight was taken as a constant in the early studies published in the literature; later, Eberhart and Shi (2000) proposed a linearly decreasing value for the inertia weight.
The algorithm consists of the following steps:
1. The initial swarm is created with randomly generated starting locations and velocities for a prespecified swarm size K.
2. The fitness value is calculated for each particle.
3. The inertia weight w and the accelerating factors c1 and c2 are set.
4. The personal best (p_best) is found for each particle.
5. The global best (g_best) is found among all the particles in the swarm.
6. The velocity and the location of the particles are updated at each iteration using Eq. 8 and Eq. 7b, respectively.
7. Steps 2-6 are repeated until the stopping criterion is satisfied.
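The steps above can be sketched as a minimal PSO with a constant inertia weight (Eq. 8 for the velocity, Eq. 7b for the position). All names, bounds and default values below are illustrative, not the paper's:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=100, w=0.5,
        c1=1.49, c2=1.49, lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))      # step 1: locations
    v = rng.uniform(-1.0, 1.0, (n_particles, dim))   # step 1: velocities
    f = np.apply_along_axis(fitness, 1, x)           # step 2: fitness
    pbest, pbest_f = x.copy(), f.copy()              # step 4: personal bests
    gbest = pbest[np.argmin(pbest_f)].copy()         # step 5: global best
    for _ in range(iters):                           # step 7: iterate
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = (w * v + c1 * r1 * (pbest - x)           # Eq. 8 (velocity)
             + c2 * r2 * (gbest - x))
        x = x + v                                    # Eq. 7b (location)
        f = np.apply_along_axis(fitness, 1, x)       # step 2 again
        better = f < pbest_f                         # step 4 update
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[np.argmin(pbest_f)].copy()     # step 5 update
    return gbest, float(pbest_f.min())

# Usage: minimize the 3-dimensional sphere function (optimum at the origin).
best_x, best_f = pso(lambda z: float(np.sum(z**2)), dim=3)
```

Here the sphere function stands in for the objective; in the paper's setting, `fitness` would be the negative of the log-likelihood in Eq. 6.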
In this study, PSO is used to estimate the parameters of the model given in Eq. 4. PSO is an efficient algorithm in that it is simple to use and needs neither score functions nor their derivatives. Its computational simplicity and lower elapsed time compared to traditional optimization algorithms make PSO appealing, especially for complex models. Since the score functions of the likelihood are extremely complex and difficult to solve using traditional methods, PSO is used to obtain the model parameter estimates.

Simulation study
Determining the value of the inertia weight is vital in the PSO algorithm. Therefore, the simulation was carried out as a two-stage process. In the first stage, a simulation study is designed to determine the optimal hyper-parameters of the PSO algorithm. In the second stage, another simulation study is designed to estimate the model parameters given in Eq. 4 using PSO with the hyper-parameters chosen in stage 1.
In both stages, the number of explanatory variables is taken as 4, the component-specific parameter vectors are taken as β1 = (0.69, −0.29, 0.41, 0.92) and β2 = (0.41, 0.83, 0.56, −0.51), and the threshold is specified as θ = 10. The explanatory variable matrix is generated from a normal distribution with mean 2 and standard deviation 0.5. The values of the dependent variable below and above the threshold θ are generated, respectively, from a Lognormal distribution with location parameter μ_i(β1, X_i) = β1ᵀ X_i and shape parameter σ = 0.1, and from a Pareto type II distribution with scale parameter α_i(β2, X_i) = exp(β2ᵀ X_i), shape parameter λ = 2 and location parameter θ = 10. The mixing weight is taken as c = 0.50.
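The data-generating process of the simulation study can be sketched as follows (in Python rather than R; we assume the first column of the design matrix is an intercept, and we use inverse-CDF sampling for the truncated lognormal and shifted Pareto type II components):

```python
import numpy as np
from scipy import stats

def simulate(n, beta1, beta2, sigma=0.1, lam=2.0, theta=10.0, c=0.5, seed=1):
    rng = np.random.default_rng(seed)
    p = len(beta1)
    # Intercept plus p-1 covariates drawn from N(2, 0.5) (our reading of
    # the design described above).
    X = np.column_stack([np.ones(n), rng.normal(2.0, 0.5, (n, p - 1))])
    mu = X @ beta1                 # identity link for the lognormal part
    alpha = np.exp(X @ beta2)      # exponential link for the Pareto part
    tail = rng.random(n) >= c      # tail component drawn with prob. 1 - c
    u = rng.random(n)
    # Lognormal right-truncated at theta, via the inverse-CDF method
    F_theta = stats.norm.cdf((np.log(theta) - mu) / sigma)
    y_low = np.exp(mu + sigma * stats.norm.ppf(u * F_theta))
    # Pareto type II with location theta, via the inverse-CDF method
    y_high = theta + alpha * ((1.0 - u) ** (-1.0 / lam) - 1.0)
    return X, np.where(tail, y_high, y_low)

beta1 = np.array([0.69, -0.29, 0.41, 0.92])
beta2 = np.array([0.41, 0.83, 0.56, -0.51])
X, y = simulate(1000, beta1, beta2)
```

By construction, roughly a fraction c of the responses falls at or below the threshold and the remainder above it.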
In the first stage of the simulation study, a grid search is implemented to optimize the inertia weight, the swarm size, the maximum number of iterations, and the accelerating coefficients c1 and c2 of the PSO algorithm. The inertia weight values are taken fixed as (0.1, 0.3, 0.5, 0.7, 0.9, 1.1); in addition, an inertia weight decreasing linearly from 1.4 to 0.4 is considered. The other hyper-parameters of the PSO algorithm used in the first stage are given below:
• Sample size is taken as 300, 500 and 1000.
• Swarm size is taken as 20 and 40.
• The accelerating coefficients are taken equal to c1 = c2 = 1.49 and c1 = c2 = 2.00.
• The maximum number of iterations is taken as 100 and 250.
Each scenario is repeated 500 times, and each run is stopped when the PSO algorithm reaches the maximum number of iterations. For each scenario, the Akaike Information Criterion (AIC), total absolute bias, and total elapsed time are calculated. The scenarios for different inertia weight values are compared according to their AIC values.
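For reference, the two comparison criteria have simple definitions; a minimal sketch (AIC with k estimated parameters, and total absolute bias summed over all model parameters):

```python
import numpy as np

def aic(loglik, k):
    """Akaike Information Criterion: AIC = 2k - 2*loglik."""
    return 2 * k - 2 * loglik

def total_abs_bias(estimates, true_values):
    """Sum of |estimate - true value| over all model parameters."""
    est = np.asarray(estimates, dtype=float)
    tru = np.asarray(true_values, dtype=float)
    return float(np.sum(np.abs(est - tru)))
```

In the simulation, both quantities are averaged over the 500 replications of each scenario.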
The PSO algorithm is described in greater detail in Sect. 3, and all data generation processes are carried out using R ver. 4.0.0 (R Core Team 2020) with self-written code. The results of the first stage of the simulation study are given in Tables 1 and 2. As seen in Tables 1 and 2, the total elapsed times for different inertia weight values are similar while the other hyper-parameters (sample size, swarm size, maximum number of iterations, c1 and c2) are unchanged.
When the sample size is 300 or 500, the swarm size is 20 or 40, the maximum number of iterations is 100 or 250, and the accelerating coefficients are c1 = c2 = 1.49, the inertia weight value with the lowest AIC value is found to be 0.5. This inertia weight value also has the lowest total absolute bias value across these simulation settings.
In Table 1, the total absolute bias values of each inertia weight value are generally decreasing when the maximum number of iterations increase from 100 to 250 under the corresponding sample and swarm size. The total absolute bias values of each inertia weight value are decreasing when the swarm size increases from 20 to 40 under the corresponding sample and maximum number of iterations.
As seen in Table 2, when the AIC values are compared, the inertia weight value 0.1 has the lowest AIC value in most of the simulation settings. The inertia weight value with the minimum AIC value is found to be 0.3 when the sample size is 300 or 500, the swarm size is 40, and the maximum number of iterations is 250.
The total absolute bias values for each inertia weight value generally increase as the maximum number of iterations increases from 100 to 250 when the swarm size is 20, but generally decrease as the maximum number of iterations increases from 100 to 250 when the swarm size is 40.
According to the results in Tables 1 and 2, there is a slight decrease in AIC values when the swarm size is increased from 20 to 40 or the maximum number of iterations is increased from 100 to 250. Therefore, the swarm size and the maximum number of iterations are selected as 40 and 250, respectively, for the second stage of the simulation study.
Comparing the results of Tables 1 and 2 under the same sample size, the AIC values corresponding to c 1 = c 2 = 1.49 are better when the swarm size is 40 and the maximum number of iterations is 250.
According to the above results, the hyper-parameters used to obtain the parameter estimates of the model in Eq. 4 in the second stage of the simulation study are determined as:
• Swarm size is taken as 40.
• The accelerating coefficients are taken equal to c1 = c2 = 1.49.
• The maximum number of iterations is taken as 250.
• The inertia weight value is taken as 0.5.
In addition to the parameters given above, the sample size is taken as N = 300, 500 and 1000 and the mixing weight is taken as c = 0.30, 0.50 and 0.70. The results are given in Tables 3, 4 and 5.
The parameter estimates and their MSE values for sample sizes 300, 500 and 1000 under the mixing weights c = 0.30, 0.50 and 0.70 are given in Tables 3, 4 and 5, respectively. The results indicate that the total absolute bias values decrease as the sample size increases. Under each mixing weight value, the MSEs of the parameters also generally decrease as the sample size increases; the only exception is the increase observed in the MSE of the threshold parameter when the sample size changes from 300 to 500 under one of the mixing weight settings.

The 'VGAM' package in R ver. 4.0.0 (R Core Team 2020) is used to estimate the parameters of the Lognormal, Lomax, and Gamma regression models on the simulated data. The probability density functions of these models, with their mean structures, are given below.

Lognormal Regression Model:

f(y_i) = (1 / (y_i σ √(2π))) exp(−(ln y_i − μ_i)² / (2σ²)), μ_i = γᵀ X_i,

where σ > 0 is the shape parameter, μ is the location parameter, X = (X_ij, i = 1, 2, ..., n, j = 0, 1, 2, 3) denotes the matrix of explanatory variables, with the first column assumed to be 1 to accommodate the estimation of an intercept, and γᵀ = (γ0, γ1, γ2, γ3) is the parameter vector.

Lomax Regression Model:

f(y_i) = λ α_i^λ / (y_i + α_i)^(λ+1), α_i = exp(γᵀ X_i),

where λ > 0 is the shape parameter, α > 0 is the scale parameter, and X and γ are defined as above.
Gamma Regression Model:

f(y_i) = y_i^(ν−1) exp(−y_i/λ_i) / (Γ(ν) λ_i^ν), λ_i = exp(γᵀ X_i),

where ν > 0 is the shape parameter, λ > 0 is the scale parameter, and X = (X_ij, i = 1, 2, ..., n, j = 0, 1, 2, 3) and γᵀ = (γ0, γ1, γ2, γ3) are defined as above. As seen in Table 6, the Lognormal regression model has a lower BIC value than the Lomax and Gamma regression models under the same sample size and mixing weight. Comparing the Lognormal, Lomax, and Gamma regression models with the proposed Lognormal-Pareto type II regression model, the BIC differences BIC_Δ = BIC_LogN − BIC_Prop, BIC_Δ = BIC_Lomax − BIC_Prop, and BIC_Δ = BIC_Gamma − BIC_Prop are greater than 10 for all sample sizes (BIC_Δ ≫ 10). Hence, it is concluded that the proposed model provides a better fit to the data than the Lognormal, Lomax, and Gamma regression models.
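The BIC comparison can be sketched as follows (BIC = k ln n − 2ℓ; treating a difference greater than 10 as very strong evidence for the smaller-BIC model is a common convention, which is the one applied here):

```python
import numpy as np

def bic(loglik, k, n):
    """Bayesian Information Criterion: BIC = k*ln(n) - 2*loglik."""
    return k * np.log(n) - 2 * loglik

def strongly_prefers_proposed(bic_other, bic_proposed, margin=10.0):
    """True when BIC_other - BIC_proposed exceeds the conventional
    'very strong evidence' margin of 10."""
    return (bic_other - bic_proposed) > margin
```

Applied to each competing model in turn, this reproduces the BIC_Δ > 10 comparisons reported above.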

Household budget data
The proposed model is applied to analyze monthly consumption expenditure and its affecting factors. The data are obtained from the 2018 National Household Budget Survey, conducted annually by the Turkish Statistical Institute (TurkStat). The household budget survey was first conducted in 1964. Since 2004, it has been carried out annually to reveal the consumption patterns and income levels of individuals and households by socio-economic group, rural and urban areas, and regions of Turkey. In this study, the survey conducted between January 1 and December 31, 2018 on 1296 households is used. The sampling design of the Household Budget Survey is stratified two-stage cluster sampling; therefore, parameter estimates are obtained by considering the sampling weights. The microdata of the 2018 Household Budget Survey consist of three data sets: Household, Individual and Consumption Expenditure. These three data sets are linked with one-to-one matching on the "Household ID", which is common to all data sets, in order to select the variables to be used in the analysis. After this matching process, the data are filtered to individuals who live alone, since it is more practical to evaluate the relationship between monthly consumption expenditure and related factors at the individual level.
The explanatory variables used in the analysis including continuous variables and categorical variables with reference levels are given below: The explanatory variables are thought to be correlated with the dependent variable, which is monthly consumption expenditure. The categorical variables are coded using the leave-one-out method relative to the reference level.
As mentioned before, the parameters are estimated under weighted data by taking the sampling method into consideration. Besides, parameter estimates under unweighted data are also obtained and compared with the weighted results. Similar to the simulation study, parameter estimates are also obtained under weighted and unweighted data using Lognormal, Lomax and Gamma regression models. In addition, a Gamma-Pareto type II composite regression model is also applied and compared with the proposed model.
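Estimation under weighted data amounts to a survey-weighted (pseudo-)log-likelihood in which each observation's log-density contribution is multiplied by its sampling weight. A sketch for a lognormal regression component (in Python; the composite case weights both branches of Eq. 4 in the same way, and PSO maximizes the weighted objective exactly as before):

```python
import numpy as np
from scipy import stats

def weighted_loglik_lognormal(y, X, gamma, sigma, w):
    """Survey-weighted lognormal log-likelihood: each observation's
    log-density is multiplied by its sampling weight w_i."""
    mu = X @ gamma
    logpdf = stats.norm.logpdf((np.log(y) - mu) / sigma) - np.log(sigma * y)
    return float(np.sum(w * logpdf))
```

With all weights equal to one, this reduces to the ordinary (unweighted) log-likelihood, which is how the unweighted estimates are obtained for comparison.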
To provide ease of operation in the analysis, estimates are obtained after dividing the monthly consumption expenditure by 100. The histograms of the monthly consumption expenditure for weighted and unweighted data are given in Fig. 1. Summary statistics calculated under weighted and unweighted data for the variables used in the regression model are given in Table 7.
When the results in Table 7 are examined, it can be seen that, despite some differences, the summary statistics obtained for weighted and unweighted data are generally similar. Since the data come from stratified two-stage cluster sampling, the summary statistics for the weighted data are interpreted. The average age of the participants is 55.97 with standard deviation 18.47; 42.7% of them are male, 39.0% have primary education and 94.1% are single. More than half of the participants are not working (57.3%), and about two-thirds own their property. 6.2% have the habit of buying daily newspapers. Only 10% have private insurance and one-third have a smoking habit. In addition, 11.7% have a cable TV subscription and 5.1% play games of chance. The average monthly consumption expenditure of the participants is 2832.84 with standard deviation 2334.91, varying between 109 and 24,798.19. The average annual income of the individuals is 33,822.48 Turkish Lira (TL) with standard deviation 31,546.51, varying between 1800 and 393,316. 60% of the participants have a savings habit, 39% have a credit card and 77.5% have the habit of online shopping.
The pairwise correlations among the explanatory variables are investigated. According to the results of the correlation analysis, it is observed that no pair of explanatory variables is highly correlated. Consequently, it is accepted that there is no problem of multi-collinearity in the Lognormal-Pareto type II composite regression model.
Before obtaining the parameter estimates, hyper-parameter optimization is performed for the PSO algorithm as in the first stage of the simulation study. According to the first-stage simulation results, more consistent estimates are obtained when the swarm size is 40 and the maximum number of iterations is 250; hence, these values are used. A grid search is implemented for the remaining hyper-parameters:
• The inertia weight values are taken fixed as (0.1, 0.3, 0.5, 0.7, 0.9, 1.1).
• In addition, an inertia weight decreasing linearly from 1.4 to 0.4 is considered.
• The accelerating coefficients are taken equal to c1 = c2 = 1.49 and c1 = c2 = 2.00.
The results of the hyper-parameter optimization are given in Table 8. As seen in Table 8, when the accelerating coefficients are taken as c1 = c2 = 1.49, the inertia weight value with the minimum AIC value is found to be 0.7. Similarly, the inertia weight value with the minimum AIC value is also found to be 0.7 when the accelerating coefficients are taken as c1 = c2 = 2.00. Comparing the results for the different accelerating coefficients with the swarm size 40 and the maximum number of iterations 250, the AIC values corresponding to c1 = c2 = 1.49 are better. Consequently, the hyper-parameters to be used to obtain the parameter estimates of the proposed model are determined as:
• Swarm size is taken as 40.
• The accelerating coefficients are taken equal to c1 = c2 = 1.49.
• The maximum number of iterations is taken as 250.
• The inertia weight value is taken as 0.7.
After the hyper-parameters of the PSO algorithm are decided, the PSO algorithm is run 10 times with different random seeds. The solution set with the lowest AIC value is taken as the best result.
The parameter estimates for the Lognormal-Pareto type II composite regression model, obtained for weighted and unweighted data using variables considered related to the average monthly consumption expenditure, are given in Table 9. As seen in Fig. 1, the histograms of the monthly consumption expenditure for weighted and unweighted data are very similar, and the summary statistics obtained under weighted and unweighted data are also similar. However, when the parameter estimates given in Table 9 are examined, the results for weighted and unweighted data differ considerably. In particular, the estimate of the threshold value θ is around 30 for the weighted data, while it is around 63 for the unweighted data. Since the sampling design is stratified two-stage cluster sampling, the parameters should be estimated under weighted data. The reason for obtaining parameter estimates for weighted and unweighted data separately is to show that they can be quite different, even though the two distributions appear quite similar in Fig. 1.
Similar to the simulation study, parameter estimates are also obtained using the Lognormal, Lomax, and Gamma regression models. Additionally, the Lognormal-Pareto type II composite regression model is compared to a Gamma-Pareto type II composite regression model. Suppose the first component in Eq. 3 is the Gamma distribution with scale parameter k > 0 and shape parameter ν > 0, and the second component in Eq. 3 is the Pareto type II distribution with scale parameter α > 0, shape parameter λ > 0 and location parameter θ > 0. The probability density function of the Gamma-Pareto type II composite regression model is

f(y_i) = c y_i^(ν−1) exp(−y_i/k_i) / (k_i^ν γ(ν, θ/k_i)) for 0 < y_i ≤ θ, and
f(y_i) = (1 − c) λ α_i^λ / (y_i − θ + α_i)^(λ+1) for θ < y_i < ∞.

Here, c is the mixing weight, γ(ν, θ/k_i(β1, X_i)) represents the lower incomplete gamma function, and y = (y_i, i = 1, 2, ..., n) are the values of the response variable. The matrix X, the vectors X_i and the component-specific parameter vectors β = (β1ᵀ, β2ᵀ) are defined as in Eq. 4, and the explanatory variables are incorporated through the link function g(·). In this setting, the scale parameters k and α are related to the linear predictors through the inverse and exponential link functions as 1/k_i(β1, X_i) = β1ᵀ X_i and α_i(β2, X_i) = exp(β2ᵀ X_i).

The parameter estimates under weighted data are presented in Table 9. The threshold value θ for the Lognormal-Pareto type II composite regression model, with swarm size 40, maximum number of iterations 250 and c1 = c2 = 1.49, is estimated as 30.626. Households whose monthly consumption expenditure is equal to or less than 30.626 (i.e., 3062.60 TL, since the expenditure is divided by 100) will be called the "low expenditure class"; those above will be called the "high expenditure class". This classification is used in the following comments.
When the parameter estimates for the weighted data in Table 9 are examined, it is seen that age has a small decreasing multiplicative effect (0.9996 times per year) on the amount of average monthly consumption expenditure in the low expenditure class. Similarly, it has a decreasing effect (0.9971 times per year) in the high expenditure class.
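Since the effects enter through an exponential link, per-year factors such as these compound multiplicatively over longer spans, as the following small illustration shows (the ten-year horizon is chosen only for illustration):

```python
# Under a log/exponential link, a per-year factor of 0.9996 compounds
# multiplicatively: over ten years of age the expected expenditure is
# scaled by 0.9996 ** 10, roughly a 0.4% decrease.
effect_per_year = 0.9996
ten_year_effect = effect_per_year ** 10
print(round(ten_year_effect, 4))
```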
For the low expenditure class, the average monthly consumption expenditure of women is 1.0659 times that of men; for the high expenditure class, women spend 1.2376 times as much as men.
For the low expenditure class, the average monthly consumption expenditure of people with primary education, secondary education, and a bachelor's degree is 1.1298, 1.097, and 1.114 times that of those who did not complete any school, respectively. For the high expenditure class, the average monthly consumption expenditure increases as the level of education increases.
For both expenditure classes, the average monthly consumption expenditure is lower for married individuals than for single ones (0.9211 times for the low expenditure class; 0.9481 times for the high expenditure class).
For the low expenditure class, the average monthly consumption expenditure is higher for individuals who have the habit of buying a daily newspaper, smoking, or shopping online, while for the high expenditure class it is lower for individuals with the same habits.
For both expenditure classes, the average monthly consumption expenditure is higher for individuals who have a savings habit, dine out regularly, are employed, own their property, have cable TV, play games of chance, have a credit card, and have private health insurance.
Summary statistics calculated for low and high expenditure classes are given in Table 11.

Conclusion
The main purpose of this study is to propose a novel composite regression model for heavy-tailed data. The proposed model consists of two components, a Lognormal and a Pareto type II distribution. By taking a different approach from the study of Gan and Valdez (2018), the proposed model makes an important contribution to the existing literature.
Since the PSO algorithm can handle a large number of parameters, is a powerful tool for solving complex systems, and has a short elapsed time, it is used in this study to obtain the parameter estimates of the proposed model. The algorithm is implemented in two stages. The first stage is designed to find the optimal hyper-parameters of the PSO algorithm. The second stage obtains the parameter estimates of the proposed model using the PSO algorithm with the best hyper-parameter values from the first stage. In addition, the proposed composite Lognormal-Pareto type II regression model is compared with Lognormal, Lomax, and Gamma regression models using BIC values. The results of the simulation study show strong support in favor of the proposed model.
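The second-stage procedure can be sketched as a standard PSO minimizer of an objective such as a negative log-likelihood. The hyper-parameters below mirror those reported in the text (swarm size 40, 250 iterations, c1 = c2 = 1.49); the inertia weight w = 0.72 and the update rule details are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def pso(objective, dim, bounds, swarm=40, iters=250, c1=1.49, c2=1.49, w=0.72, seed=0):
    """Minimize `objective` over [lo, hi]^dim with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (swarm, dim))            # particle positions
    v = np.zeros((swarm, dim))                       # particle velocities
    pbest = x.copy()                                 # personal best positions
    pbest_val = np.apply_along_axis(objective, 1, x)
    g = pbest[pbest_val.argmin()].copy()             # global best position
    for _ in range(iters):
        r1 = rng.random((swarm, dim))
        r2 = rng.random((swarm, dim))
        # Velocity update: inertia + cognitive pull + social pull.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.apply_along_axis(objective, 1, x)
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()
```

For instance, minimizing the sphere function shifted to (1, 1, 1) recovers a point near that optimum with these settings.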
To demonstrate applicability, monthly consumption expenditure and its affecting factors are modeled via the proposed composite Lognormal-Pareto type II regression model, using data from the National Household Budget Survey conducted by the Turkish Statistical Institute (TurkStat). Setting up a regression model that fits these data well is challenging: as can be seen in Fig. 1, monthly consumption expenditure exhibits heavy-tailed behavior, so standard regression models are deficient in capturing the relationship between monthly consumption expenditure and its related variables. The results for the Household Budget Survey data show that the proposed regression model provides a more favorable fit than the Lognormal, Lomax, Gamma, and Gamma-Pareto type II regression models.
Additionally, the sampling design of the Household Budget Survey is stratified two-stage cluster sampling. Accordingly, obtaining parameter estimates under weighted data is a further challenge. To address it, both the model equation and the PSO algorithm are updated to take the sampling weights into account. Finally, the parameters are estimated for the weighted data, and the effect of the explanatory variables on average monthly consumption expenditure is explained in detail.
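The way sampling weights enter the estimation can be sketched as a weighted pseudo-log-likelihood, in which each observation's log-density contribution is multiplied by its survey weight. The Lognormal density below is a placeholder for illustration only; the paper's actual objective is the composite-model likelihood.

```python
import numpy as np
from scipy import stats

def weighted_nll(params, y, w):
    """Survey-weighted negative log-likelihood (pseudo-likelihood).

    Each observation's log-density is scaled by its sampling weight w_i;
    a Lognormal(mu, sigma) density stands in for the composite model here.
    """
    mu, sigma = params
    logpdf = stats.lognorm.logpdf(y, s=sigma, scale=np.exp(mu))
    return -np.sum(w * logpdf)
```

With all weights equal to one this reduces to the ordinary negative log-likelihood, which is the sense in which the unweighted analysis is a special case of the weighted one.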