Analysis of the Mixed Effects Regression Model for Clustered Count Response Data


 Background

The Poisson regression model is useful for analyzing count data, but when the observations are correlated the Poisson estimates are biased. Moreover, when over-dispersion and heterogeneity are present, imposing the Poisson model underestimates the standard errors and overstates the significance of the regression parameters. The objective of this paper was therefore to develop a test statistic for modelling and predicting clustered count response data, evaluated on both simulated and application data.
Methods

This paper concentrated on a clustered count data model that takes heterogeneity into account. Accordingly, we developed a score test based on the multilevel Poisson model for testing heterogeneity against the standard Poisson regression model. For the model application, we used the EDHS children's data. The proposed model was thus evaluated using both simulation and application data.
Results

Simulation results showed that the proposed score test has high power and can be used to control for heterogeneity between groups. Oromia, Amhara, and SNNPR are among the regions with the highest child mortality rates (Table 1). The results indicated that women married at a mean age of 16 years and gave birth to their first child at a mean age of 18 years and 8 months. Table 1 showed that 81% of all child deaths were recorded in rural areas. Of the children's families, 78% were illiterate, and 75% of children did not have access to latrines and drinking water. Rivers and open water sources were the most common sources of drinking water, comprising 79% of the total water supply. From these findings, it is possible to conclude that most child mortality is due to scarcity of water.
Conclusion

The power estimates indicated that the proposed method performed better than the existing models. All covariates and dummy explanatory variables had a significant effect on child deaths. The multilevel Poisson model results indicated that there is high variability in child deaths among regions. This work therefore suggests that the random-effects model provides a simple and robust means of predicting count response data.

One of the classical assumptions of the regression model is that the error follows a normal distribution with mean zero and constant variance. When the distribution of a continuous response variable is asymmetric, a simple transformation of the response can often produce normally distributed errors and so satisfy the normality assumption. If the response variable of interest is discrete, however, a simple transformation of the skewed distribution cannot produce normally distributed errors. In addition, nested data are common in the social and health sciences, so the assumption of independent observations is violated, which results in incorrect standard errors and inefficient estimates [17].
Fitting separate regressions for each group does not allow examination of which group characteristics may be important in explaining the outcome. Classical regression methods, including multiple linear regression, logistic regression, and generalized linear modeling, assume independence of the observations. However, the dependence in clustered data can lead to imprecise coefficient estimates, which affects the statistical significance of risk factors. Testing homogeneity among clustered data therefore requires adjusting for the effects of covariates [18]. Ignoring such inter-cluster correlations may result in overlooking the importance of cluster effects and calls into question the validity of traditional statistical techniques used for studying data relationships; consequently, such an analysis has lower statistical power.
Data with a multilevel structure arise in public health, health services research, the behavioral sciences, and medicine. For clustered count data the observations are correlated, and the main approaches for correlated count data are conditional models, random-effects models, and generalized estimating equations (GEEs) [19]. Conditional models are convenient for particular cases, such as data with small group sizes or with an order structure [20]. Multilevel regression models incorporate cluster-specific random effects that account for the interdependence of observations. This paper focused on the multilevel Poisson regression approach, in which the distribution of the response variable is modeled conditionally on a group-specific parameter that is itself a random variable [21], [22]. This paper developed a score test for heterogeneity of variance among groups and estimated the regression parameters and the cluster-specific random effects. The power of the proposed score test was evaluated using simulation and application data.

Objectives
 To develop a score test for testing heterogeneity of groups for discrete distribution.
 To evaluate the efficiency of the score test and compare it with an alternative model by using simulation and application data.
 To assess the various problems of the GLM for clustered discrete response data.

Significance of the Study
 To highlight the appropriate models for clustered count response data.
 Accounting for clustering information provides correct standard errors, confidence intervals, and significance tests [23].

Multilevel Poisson Regression Model
Suppose $Y_{ij}$ denotes the outcome variable for case $j$ of cluster $i$, with $i = 1, 2, \dots, k$ and $j = 1, 2, \dots, n_i$, and let $'$ denote differentiation with respect to the model parameter. The mixed-effects model treats at least one regression coefficient as random. In this model the mean $\mu_{ij}$ and the variance $\sigma_{ij}^2$ are conditional on $\alpha_i$. To examine the model under the null hypothesis, we assumed that all of the variances of, and the correlations among, the random effects are zero in a generalized linear mixed model. The parameter $\alpha_i$ can therefore be written as $\alpha_i = \alpha + D^{1/2} v_i$ [24].

Score Test of the Multilevel Poisson Model
With the model defined in (1) and (3), consider the conditional probability distribution function of $Y_{ij}$ given $\alpha_i$. Using L'Hôpital's rule, following [25] and [26], the score function for testing the hypothesis of homogeneity is obtained, and the score statistic follows from the first and second partial derivatives of the log-likelihood with respect to $\alpha_i$. Under the homogeneity hypothesis, the second and fourth central moments of the outcome variable, $\mu_2$ and $\mu_4$, can be written as functions of the second and fourth cumulants $\kappa_2$ and $\kappa_4$ [28]: $\mu_4 = \kappa_4 + 3\kappa_2^2$, which for the Poisson distribution with mean $\mu$ gives $\mu_4 = \mu + 3\mu^2$, since $\kappa_2 = \mu_2 = \sigma^2 = \mu$ [29]. The variance of the score function can then be derived from the simplified form of the Fisher information matrix.
The parameters in the score test statistic, given in equation (1), are replaced by their ML estimates obtained from the Poisson regression model under the null hypothesis [30], which yields the reduced score test statistic $H_{PD}$. In large samples, the proposed test statistic follows a chi-square distribution with one degree of freedom. The maximum likelihood estimate of $\beta$ can be obtained iteratively by Fisher's scoring method, using the score equations for the regression parameters.
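The displayed form of the score statistic is not legible in this text, so the following is only a minimal sketch under stated assumptions: a log-link Poisson model with a single random intercept, for which the score for $D$ at $D = 0$ takes the standard form $\tfrac{1}{2}\sum_i [R_i^2 - m_i]$, with $R_i = \sum_j (y_{ij} - \hat\mu_{ij})$ and $m_i = \sum_j \hat\mu_{ij}$. Its variance uses the Poisson fourth central moment $\mu + 3\mu^2$ and is corrected for the estimation of $\beta$. The function name, the toy data, and the chosen values of $\alpha$ and $D$ are illustrative and do not come from the paper.

```python
import math
import numpy as np

def score_test_homogeneity(y, X, mu, groups):
    """Score test of D = 0 (no random intercept) in a log-link Poisson model.

    y: counts; X: design matrix; mu: fitted means from the ordinary Poisson
    fit under the null; groups: cluster label of each observation.
    Returns (statistic, p_value) using a chi-square(1) reference distribution.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    _, idx = np.unique(groups, return_inverse=True)
    k = int(idx.max()) + 1
    resid_sum = np.bincount(idx, weights=y - mu, minlength=k)  # R_i
    m = np.bincount(idx, weights=mu, minlength=k)              # m_i
    score = 0.5 * np.sum(resid_sum**2 - m)
    # Var(R_i^2) = (m_i + 3 m_i^2) - m_i^2 for a centred Poisson total
    info_dd = 0.25 * np.sum(m + 2.0 * m**2)
    info_db = 0.5 * (X.T @ mu)             # Cov(score for D, score for beta)
    info_bb = X.T @ (X * mu[:, None])      # Fisher information for beta
    var = info_dd - info_db @ np.linalg.solve(info_bb, info_db)
    stat = score**2 / var
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square(1)
    return stat, p_value

# Illustrative intercept-only example (alpha = 0.5 is a hypothetical value).
rng = np.random.default_rng(1)
k, n = 50, 10
groups = np.repeat(np.arange(k), n)
X = np.ones((k * n, 1))

def demo(D):
    alpha_i = 0.5 + math.sqrt(D) * rng.standard_normal(k)
    y = rng.poisson(np.exp(alpha_i[groups]))
    mu_hat = np.full(k * n, y.mean())      # null MLE for an intercept-only model
    return score_test_homogeneity(y, X, mu_hat, groups)

stat_hom, p_hom = demo(0.0)   # homogeneous clusters
stat_het, p_het = demo(0.3)   # heterogeneous clusters
print(stat_hom, p_hom)
print(stat_het, p_het)
```

For a model with covariates, `mu` would be the fitted means from the ordinary Poisson regression rather than the overall sample mean.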

Method of Estimation for the Regression parameters
The score equations for the regression parameters $\beta_s$, $s = 1, 2, \dots, p$, are obtained by setting the derivative of the log-likelihood with respect to each $\beta_s$ to zero.
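As an illustration of the Fisher scoring iteration mentioned above, the sketch below fits an ordinary log-link Poisson regression, which is the null model needed for the score test. It assumes the canonical log link, so that the score vector is $X^\top (y - \mu)$ and the Fisher information is $X^\top \mathrm{diag}(\mu) X$. The function name and the toy data are hypothetical.

```python
import numpy as np

def fit_poisson_fisher_scoring(X, y, tol=1e-10, max_iter=100):
    """Maximum likelihood fit of a log-link Poisson regression by Fisher scoring."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)               # score vector U(beta)
        info = X.T @ (X * mu[:, None])       # Fisher information I(beta)
        step = np.linalg.solve(info, score)  # beta_new = beta + I^{-1} U
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical toy data: intercept 0.5 and slope 0.3 are illustrative values.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x))
beta_hat = fit_poisson_fisher_scoring(X, y)
print(beta_hat)
```

For the canonical link, Fisher scoring coincides with Newton's method, which is why the observed and expected information are not distinguished here.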

Alternative Tests for Coefficient of Regression Parameters
To test the significance of a regression coefficient $\beta$, the hypotheses are $H_0\colon \beta = 0$ versus $H_1\colon \beta \neq 0$, and the likelihood ratio test (LRT) statistic is used to test the null hypothesis (6).
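Since the displayed statistic is not legible here, the following sketch assumes the usual form of the likelihood ratio statistic, $2(\ell_1 - \ell_0)$, referred to a chi-square distribution with one degree of freedom when a single coefficient is tested. The data and function names are hypothetical; a binary covariate is used so that both fits have closed-form means.

```python
import math
import numpy as np

def poisson_loglik(y, mu):
    """Poisson log-likelihood for counts y with fitted means mu (mu > 0)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_factorials = np.array([math.lgamma(v + 1.0) for v in y])
    return float(np.sum(y * np.log(mu) - mu - log_factorials))

def lrt_one_df(loglik_null, loglik_full):
    """LRT statistic 2(l1 - l0) and its chi-square(1) upper-tail p-value."""
    stat = 2.0 * (loglik_full - loglik_null)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical counts for two groups defined by a binary covariate x.
y = np.array([0, 1, 0, 2, 1, 0, 3, 4, 2, 5, 3, 4])
x = np.array([0] * 6 + [1] * 6)

# Under H0 (beta = 0) the fitted mean is the overall mean;
# under H1 the fitted mean is the group-specific mean.
mu_null = np.full(y.shape, y.mean())
mu_full = np.where(x == 1, y[x == 1].mean(), y[x == 0].mean())

lrt_stat, lrt_p = lrt_one_df(poisson_loglik(y, mu_null), poisson_loglik(y, mu_full))
print(lrt_stat, lrt_p)
```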

Model Selection
Information criteria, namely the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), were used instead of the likelihood ratio test to select the model that best fits the data.

Akaike Information Criteria
AIC is the most common means of identifying the model that best fits the data when two or more models are compared. It weighs goodness of fit against model complexity in a way similar to the coefficient of multiple determination ($R^2$); however, it is penalized by the number of parameters in the model. Unlike $R^2$, the preferred model is the one with the minimum AIC value. It is given by $\mathrm{AIC} = -2\ell + 2k$, where $\ell$ is the log-likelihood of the model being compared with the other models and $k$ is the number of parameters in the model, including the intercept [32].

Bayesian Information Criteria
Unlike the Akaike information criterion, the Bayesian information criterion takes into account the size of the data under consideration. It is given by $\mathrm{BIC} = -2\ell + k\log(n)$, where $\ell$ is the log-likelihood of the model being compared with the other models, $n$ is the sample size, and $k$ is the number of parameters in the model, including the intercept.
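The two criteria can be computed directly from a fitted model's log-likelihood. The short sketch below does so for two candidate models; the log-likelihood values, parameter counts, and sample size are hypothetical numbers chosen only to show the comparison, not results from this study.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n)."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical fits: (log-likelihood, number of parameters incl. intercept).
candidates = {
    "Poisson": (-1520.4, 3),
    "multilevel Poisson": (-1498.7, 4),
}
n_obs = 1000  # hypothetical sample size

criteria = {
    name: {"AIC": aic(ll, k), "BIC": bic(ll, k, n_obs)}
    for name, (ll, k) in candidates.items()
}
for name, values in criteria.items():
    print(name, values)

# The preferred model is the one with the smallest value of the criterion.
best_by_aic = min(criteria, key=lambda name: criteria[name]["AIC"])
best_by_bic = min(criteria, key=lambda name: criteria[name]["BIC"])
```

Because the BIC penalty grows with $\log(n)$, it penalizes the extra parameter more heavily than AIC once $n$ exceeds about 7.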

Results
Simulation Study
In this section, a simulation study is conducted to compare the proposed and existing models. To simulate correlated data, the $v_i$'s are independently and identically distributed standard normal variables, with mean zero and unit variance. The $\alpha_i$'s are therefore independently and identically distributed with mean $\alpha$ and variance $D$. For each set of generated data, a multilevel Poisson model is fitted to calculate the score test and the existing tests, followed by the powers of the score tests.
Results from the simulation study are presented in Table 2. Simulation results for the model selection criteria and the power of the test are given in Figures 1 and 2.
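The data-generating step described above can be sketched as follows. This is an illustration under assumptions: a log link, a single standard normal covariate, and values of $\alpha$ and $\beta$ that are hypothetical; only the cluster layout ($k = 50$, $n = 10$) and $D = 0.05$ echo settings mentioned in the text.

```python
import numpy as np

def simulate_clustered_counts(k, n, alpha, beta, D, rng):
    """Simulate counts for k clusters of size n with a random intercept.

    alpha_i = alpha + sqrt(D) * v_i with v_i ~ N(0, 1), so alpha_i has
    mean alpha and variance D; y_ij | alpha_i ~ Poisson(exp(alpha_i + beta * x_ij)).
    """
    v = rng.standard_normal(k)
    alpha_i = alpha + np.sqrt(D) * v
    groups = np.repeat(np.arange(k), n)   # cluster label of each observation
    x = rng.standard_normal(k * n)        # hypothetical covariate
    mu = np.exp(alpha_i[groups] + beta * x)
    y = rng.poisson(mu)
    return y, x, groups, alpha_i

rng = np.random.default_rng(2024)
y, x, groups, alpha_i = simulate_clustered_counts(
    k=50, n=10, alpha=0.2, beta=0.3, D=0.05, rng=rng
)
print(y[:10], alpha_i[:3])
```

Setting `D=0.0` gives identical intercepts in every cluster, which corresponds to the homogeneity hypothesis.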

Application Study
This paper takes into account the random parameters of the Poisson model. To illustrate the proposed model, we used the Ethiopian Demographic and Health Survey children's data.

Discussion
Oromia, Amhara, and SNNPR are among the regions with the highest child mortality rates (Table 1), whereas child mortality rates were significantly lower in the Afar and Harari regional states. The results indicated that women married at a mean age of 16 years and gave birth to their first child at a mean age of 18 years and 8 months. In addition, Table 1 showed that 81% of all child deaths were recorded in rural areas. Of the children's families, 78% were illiterate, and 75% of children did not have access to latrines and drinking water. Rivers and open water sources were the most common sources of drinking water, comprising 79% of the total water supply. From these findings, it is possible to conclude that most child mortality is due to scarcity of water.
The results in Table 2 involved an increase in sample size, despite differences in performance among the information criteria.
In Table 3, the simulation results showed that the performance of the model depends on the sample size and the number of clusters.
Table 2 also shows that the power increases as the error probability α increases. For large groups and a small variance of the group effect (k = 50, n = 10, D = 0.05), the power increases quickly and approaches 1. For small groups, the power increases slowly as the standard deviation of the group effect increases. For large groups (k = 50, n = 50, D = 0.15 and α = 0.1), the power also increases slowly as the standard deviation of the group effect increases; however, when D increases from 0.05 to 0.15, the power decreases. In general, the power increases as the number of groups increases. The proposed score test is therefore useful for examining the heterogeneity of group effects and for fixing the number of observations and groups, owing to its high power.
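To make the notion of empirical power concrete, the following sketch estimates the rejection rate of a score test for homogeneity by Monte Carlo. It is a simplified illustration and not the simulation design of this study: it assumes an intercept-only log-link model, for which the score statistic reduces to a closed form, and it uses the chi-square(1) critical values 3.841 (level 0.05) and 2.706 (level 0.10). The function names, the number of replications, and the seed are hypothetical.

```python
import numpy as np

# Upper critical values of the chi-square distribution with 1 degree of freedom.
CHI2_1_CRIT = {0.05: 3.841, 0.10: 2.706}

def score_stat_intercept_only(y):
    """Score statistic for D = 0 when the null model is one common Poisson mean.

    y has shape (k, n): k clusters with n observations each.
    """
    k, n = y.shape
    mu_hat = y.mean()                  # null maximum likelihood estimate
    m = n * mu_hat                     # fitted total per cluster
    r = y.sum(axis=1) - m              # cluster-level residual sums R_i
    score = 0.5 * np.sum(r**2 - m)
    var = 0.5 * k * m**2               # variance after adjusting for mu_hat
    return score**2 / var

def empirical_power(k, n, D, level=0.05, alpha=0.0, reps=500, seed=0):
    """Share of simulated data sets in which homogeneity is rejected."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        alpha_i = alpha + np.sqrt(D) * rng.standard_normal(k)
        y = rng.poisson(np.exp(alpha_i)[:, None], size=(k, n))
        rejections += int(score_stat_intercept_only(y) > CHI2_1_CRIT[level])
    return rejections / reps

power_by_D = {D: empirical_power(k=50, n=10, D=D) for D in (0.0, 0.05, 0.15)}
for D, power in power_by_D.items():
    print(D, power)
```

With `D=0.0` the returned value estimates the size of the test rather than its power, which gives a quick check that the rejection rate stays near the nominal level.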
To illustrate the proposed method for fitting a multilevel Poisson regression model, we considered the 2005 Ethiopian Demographic and Health Survey data on child deaths, for which 25,420 women aged 15 to 49 years were interviewed. This paper considered the number of deaths of children aged under 18 years that each mother has experienced in her lifetime. The response variable ranges from zero to eighteen, with a mean of 1.2 and a variance of 1.7.
The estimates of the random and regression parameters for the two models are given in Table 4. They showed that the predicted probabilities of the proposed method were closer to the probabilities of the true values. These results therefore showed that the proposed method is superior to the existing method.
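Because a Poisson variable has variance equal to its mean, the reported summary statistics allow a quick informal check for over-dispersion. The sketch below computes only the variance-to-mean ratio from the two reported values; it is not a formal test, and the variable names are illustrative.

```python
# Reported summary statistics of the number of child deaths per mother.
mean_deaths = 1.2
variance_deaths = 1.7

# Under a Poisson model the variance-to-mean ratio would be 1.
dispersion_ratio = variance_deaths / mean_deaths
print(round(dispersion_ratio, 2))  # 1.42
```

A ratio above 1 is consistent with the over-dispersion and between-cluster heterogeneity that motivate the multilevel model.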

Conclusion
This paper used simulation and application data to illustrate the proposed method; the power of the multilevel Poisson regression model is presented in Table 2. The results revealed that the proposed score test is preferable to the existing model. The application results revealed great variability in child mortality among regions, and the predicted probabilities showed that the proposed model is better than the standard model.
The simulation and application results showed that, when the random-effects parameter is considered for clustered count data, the proposed method gives accurate and valid results, whereas the Poisson regression model does not handle heterogeneous data. For a fixed sample size, the power increases as the regression and random-effects parameters increase, whereas for fixed regression coefficients and random-effects parameters, the power increases as the sample size increases. The simulation results showed that the power is smaller for small values of the random-effect parameter and increases as the random effect increases. Hence, the score test is appropriate for modelling the number of child deaths, and the results showed significant variation in child deaths among regions (Table 4). The information criteria and the predicted probability results in Tables 5 and 6 revealed that the proposed model is better than the standard Poisson model.
Permission to undertake the study was obtained from the Central Statistical Agency of Ethiopia.

Consent for publication
Not Applicable