Multilevel Zero-inflated Censored Beta Regression Modeling for Proportions and Rate 1 Data with Extra-zeros

25 Objectives: Zero-inflated proportion or rate data nested in clusters due to the sampling structure 26 can be found in many disciplines. Sometimes, the rate response may not be observed for some 27 study units because of some limitations (false negative) like failure in recording data and the 28 zeros are observed instead of the actual value of the rate/proportions (low incidence). In this 29 study, we proposed a multilevel zero-inflated censored Beta regression model that can address 30 zero-inflation rate data with low incidence. 31 Methods: We assumed that the random effects are independent and normally distributed. The 32 performance of the proposed approach was evaluated by application on a three level real data set 33 and a simulation study. We applied the proposed model to analyze brucellosis diagnosis rate data 34 and investigate the effects of climatic and geographical position. For comparison, we also 35 applied the standard zero-inflated censored Beta regression model that does not account for 36 correlation. Results: Results showed the proposed model performed better than zero-inflated censored Beta based on AIC criterion. Height (p-value <0.0001), temperature (p-value <0.0001) and 39 precipitation (p-value = 0.0006) significantly affected brucellosis rates. While, precipitation in 40 ZICBETA model was not statistically significant (p-value =0.385). Simulation study also 41 showed that the estimations obtained by maximum likelihood approach had reasonable in terms 42 of mean square error. 43 Conclusions: The results showed that the proposed method can capture the correlations in the 44 real data set and yields accurate parameter estimates. 45


Introduction 48
Beta regression model is an extension of generalized linear models (GLM) that provides a 49 standard framework for analyzing ratio, rate and proportion data. The main assumption of this 50 model is that there is a continuous, percentage-scaled dependent variable that ranges between 0 51 to1 and can be characterized by the Beta distribution [1][2][3][4]. Beta distribution characterizes 52 unimodal as well as bimodal densities with varying severity of skewness. This interesting fact 53 provides incredible flexibility for Beta Regression in modeling dependent variables for which 54 normalizing transformations are impossible. In Beta regression, both parameters of the 55 dependent variable including mean and precision (the scaling factor related to variance) are 56 assumed to be associated with explanatory variables [5]. 57 To date, several modeling approaches of the Beta regression have been developed. For example, 58 a Bayesian approach has been utilized by [6], [7] and [8] to associate the mean and the precision 59 parameters with explanatory variables. A semiparametric Beta regression model has been applied 60 by [9] using penalized splines. Zimprich [10] and Figueroa et al [11] extended Beta regression to 61 longitudinal data analysis through adding a random effect in the mean and precision parameters 62 respectively. Another extension of the Beta regression was used by [12] to analyze time series 63 rate data with correlated errors using Bayesian approach. Other extensions of the Beta regression 64 includes spatial data analysis [13] and spatial analysis of structured additive regression model 65 [14]. In the last recent paper in the Beta regression content [15], the authors focused on random 66 effect models to study spatially correlated rate data to account for the spatial correlation of data 67 in the model. They implemented Bayesian approach for parameter estimation in a two level Beta 68 regression for the correlated data. 69 There are many situations for which proportion data may contain zeros. In this case, standard 70 form of the Beta regression may fail to model this data. One remedy is to use mixture modeling 71 (a mixture of a Beta distribution on (0, 1) and the Bernoulli distribution which gives non-72 negative probabilities to 0) to capture the probability mass at 0 [4]. This situation is called zero-73 inflated Beta distribution [16][17][18]. "Inflated Beta Regression incorporates the existing Beta 74 distribution with degenerate distributions to model the extreme values, thereby allowing for 75 complete modeling of the entire continuous percentage space" [5]. 76 Often due to the hierarchical design of studies or data collection processes, there are found an 77 intra-class correlation in the data within a cluster. This leads to a correlated data structure 78 (clustered) that is commonly encountered in biomedicine. This characteristic is inherent in health 79 surveys with individuals are nested in clusters, regions or provinces. It is also observed for the 80 observations obtained from the same subject [19]. Ignoring these within-subject (cluster) 81 We apply the proposed model to analyze brucellosis diagnosis rate data and investigate the 105 effects of climatic and geographical position. For comparison, we also apply the standard zero-106 inflated censored Beta regression model that does not account for correlation. We show 107 MLZICBETA is better than zero-inflated censored Beta (ZICBETA).The rest of the paper is 108 organized as follows. In Section 2, we present a brief review of the standard Beta regression and 109 its zero-inflated version and then a multilevel zero-inflated censored Beta regression 110 incorporating random effects to account for data dependency. The details of the proposed 111 models, the estimation of the parameters, and their variance estimators are also provided. In 112 Section 3, we report our data analysis for the brucellosis diagnosis rate data. The paper concludes 113 with a discussion in Section 4. 114

ZICBETA distribution 116
Let w denote the proportion/rate/ratio response. The probability density function of a random 117 variable of W with a Beta distribution can be written as follows: 118 Now, let us consider the response variable Y as a mixture of a degenerated distribution at zero 126 with probability of p and a censored Beta distribution with mean λ with probability 1 -p. The 127 ZICBETA distribution of y can be written: 128  ( 1 ) 1 Therefore, it can be shown that the mean and variance of Y are given by 137 Then the loglikelihood for a sample of n observations censored at c (recorded as zero) and 141 subject to excess zeros is 142 where the covariates ijk z and ijk x appearing in the two logistic components are not necessarily 157 the same. The vectors i w and i u denote the province-specific random effects, whereas ij s and ij v 158 are the random variations at the cluster level. The above equations can be written in vector form 159 The log-likelihood can be maximized by a numerical method like Expectation Maximization 172 (EM) algorithm. We considered the penalized log-likelihood as l = l 1 + l 2, where l 1 stands for the 173 log-likelihood function when the random effects are assumed conditionally fixed and l 2 stands 174 for the -log-likelihood of the random effects. Treating the random effects as parameters leads to 175 the minus l 2 to be viewed as a penalty function for the random effects: 176 Estimating process continues through maximizing l1 by considering variance components as 183 fixed values and updating their values obtained from optimization of l2 using restricted 184 maximum likelihood (REML). 185 Using the penalized log-likelihood, the complete data log-likelihood (l c ) can be constructed as l c 186 = l ξ +l η with 187 where u ijk =1 (a latent variable) when y ijk is zero and u ijk =0 if y ijk is drawn from the censored 193 Beta distribution variable. Therefore, parameter estimation can be performed through EM Brucellosis is a prevalent infectious zoonotic disease that can be transmitted from animals to 207 human directly by contacting with the Brucella carriers or indirectly by contacting with 208 consumption of the unpasteurized dairy products of infected animals (1-3). Symptoms in infected 209 people include fever, sweating and depression as well as arthralgia, and tiredness for weeks or 210 even months (1, 4). Brucellosis can affect public health and economic of a country adversely due 211 to significant human burdens and widespread problems (1, 2). It has been reported that there are 212 more than 500,000 new cases annually worldwide (6, 7). While brucellosis has been eliminated 213 in several industrial countries, it has remained as an important public health menace in Iran with 214 a high annual incidence rate especially in western and northwestern parts of the country (2, 3, 8). There were zeros in the rates (37.4%). Here, we encountered with a mixture of zeros and a 223 continuous variable. 224 Table 1 shows the summary statistics of brucellosis rates by 30 provinces. The majority of the 225 patients were male (58.5%) and aged between 1 and 100 with mean and standard deviation of 226 38.1 and 17.82 years. About 36.5% of diagnosis had a job connected to livestock and 74.5% of 227 them had a history of using non-pasteurized dairy products. In this cross-sectional study, because 228 of the structure of the data, months were nested within the cities and cities were nested within 229 clusters (provinces). We tried to identify the climatic factors affecting the brucellosis rate using a 230 MLZICBETA regression model. The potential covariates influencing the brucellosis rate were 231 height of the location, monthly mean temperature (degree of Celsius) and mean monthly 232 precipitation (mm). Table 2 shows the results of MLZICBETA as well as zero-inflated Beta 233 regression models (ZICBETA). We also applied multi-level count regression models including 234 zero-inflated Poisson, zero-inflated negative binomial, zero-inflated generalized Poisson with 235 populations as offset (results were not shown). According to the AIC, our proposed model 236 outperformed other models. 237 As can be seen, the variance components of random effects were significant in provinces and 238 cities for both parts of the model (zero and Beta parts). Therefore, it can be concluded that the 239 correlations between months in cities and cities that are belong to a province were significant. 240 After adjusting for the province and cities clustering effects, covariates height (p-value <0.0001), 241 temperature (p-value <0.0001) and precipitation (p-value = 0.0006) significantly affect 242 brucellosis cases rates. These results were in concordance with results of other studies [27]. 243 While, precipitation in ZICBETA model was not statistically significant (p-value =0.385). As the 244 results shown (Table 2) the MLZICBETA had a better AIC compared with the ZICBETA model. 245 According to the results, brucellosis diagnosis rate increases as the temperature increases. This 246 positive relationship was also observed for height. So, cities that are paced in higher locations are 247 associated with greater brucellosis diagnosis rate. In addition, brucellosis diagnosis rate was 248 significantly lower in the area with higher amount of precipitation. 249 According to the findings, it has been shown that with increasing precipitation amount, the 250 diagnosis rate of brucellosis is reduced. So, in the months and areas with low precipitation the 251 diagnosis rate of brucellosis appears to be more. It seems that in the area with low amount of 252 precipitation the impoverishment of pastures leads to lack of access to nourishing forage with 253 sufficient quality for livestock and because this disease is a viral infection, whenever the 254 livestock becomes weak due to low protein consumption, the disease incidence increases. In fact, 255 this disease is indirectly related to drought and low precipitation. In recent years, the occurrence 256 of drought phenomenon in semi-arid regions of Iran has caused impoverishment of pasture and 257 deterioration of its quality which has led to protein-energy malnutrition of livestock and 258 consequently the susceptibility of livestock losing in animal husbandries has been increased. Due

Discussion 287
In this study we proposed a multilevel ZICBETA regression model to analyze hierarchical 288 proportion/rate/ratio data containing extra zeros. 289 This method provides an insight into the source of zeros as well as the apparent heterogeneity. At 290 the same time, it accommodates the within-cluster and within-individual correlations that are 291 introduced by the data structure. Application to the brucellosis diagnosis rate data showed the 292 usefulness of this approach in capturing correlation between study units. In the presence of zeros, 293 the MLZICBETA regression model enables us to draw sensible and valid results and conclusions 294 about significant factors that affect the rate of brucellosis diagnosis and distinguishing those 295 cities with zero rates. 296 With appropriate modifications, the proposed model can be easily extended to random 297 coefficients setting. It can also be developed for the case of Zero-One-inflated Beta regression 298 model. It is also possible to consider the precision parameter to vary according to the explanatory 299 variables. This can be done using a logarithm link to associate τ and potential covariates ( 300 log( ) T X τ γ = ) with or without random effects. Considering the effects of covariates with an 301 unknown functional structure can be handled using splines which can be another interesting 302 extension of the proposed model. Other extensions include cross-random effects, and non-303 parametric specification for the distribution of random effects in the two or three parts of the 304 The data was registry based and innocent. Not applicable. 319 Consent for publication 320 Not Applicable. 321 Availability of data and material 322 The datasets used and/or analyzed during the current study are available from the corresponding 323 author on reasonable request. 324

Competing interests 325
The authors declare no conflict of interest. 326 Funding 327 This study was partially founded by Hamadan University of medical Science.

329
Authors' contributions 330 LT and OH conceived the research topic. LT, OH, HD and MS explored that idea, performed the 331 statistical analysis and drafted the manuscript. GM provided the data and participated in and 332 drafted the manuscript. All authors read and approved the final manuscript.

334
Acknowledgement 335 We would like to appreciate the Vice-chancellor of Education for technical support and the Vice-336 chancellor of Research and Technology of Hamadan University of Medical Sciences for their 337 approval and support of this work. 338 339 340 Appendix A 341 The Maximization algorithm 342 The maximization was performed by the following two sets of Newton-Raphson equations, in 343 view of the orthogonal decomposition of l C , 344 345 346