Our study demonstrated that the quality of the included papers was unsatisfactory, a result in line with prior systematic reviews[7, 10, 12, 13], even though several guidelines on the PS method have been published in recent years. Many articles omitted essential details and adopted inappropriate methods, which can lead to misleading interpretations of treatment effects in published articles. We therefore focus our discussion on the following aspects.
For the variables in PS models, all included articles used demographic and clinical variables to construct the PS, which is good practice. A total of 112 articles (78.3%) clearly listed the variables, but not all variables were incorporated into the final PS models. Omitting variables implicitly assumes that the potential bias from them is negligible. More attention should also be paid to the number of variables: PS models with few variables, or with many variables but a small sample size, usually do not produce unbiased causal inference[14]. On the other hand, only 15 articles (10.5%) reported how variables were selected to construct the PS. The most typical approach was the non-parsimonious one that includes all available variables, which could be deleterious if it includes variables not associated with the prognosis[15]. When the justification for the variables included in the PS model is not clearly stated, important variables may have been left out of the analysis, and the reproducibility of the results is limited. In addition, some studies[16, 17] showed that adequate professional knowledge and practical clinical experience were the key factors in determining which variables, interactions, and/or higher-order terms to include. The variables eventually incorporated into PS models should be associated with the outcome[4, 18], but some reports on the quality of PS methods ignored this aspect[3, 13].
Another concern was the choice of model for estimating the PS. T. L. Zakrison and colleagues reported that most researchers adopt logistic regression[13]. In other fields, however, boosted regression trees and neural networks have been proposed for constructing PS models, and these data mining approaches have spread rapidly with the development of statistical software in recent years[19]. One of their advantages is that they can automatically detect nonlinear terms and incorporate them into the PS estimation model. Although logistic regression is adequate in many circumstances, it may tempt researchers to select the ‘optimal’ model that yields the expected results[4, 20]. Therefore, we encourage researchers to employ statistical methods based on data mining algorithms, because these methods could yield more precise estimates of treatment effect under conditions of both non-additivity and non-linearity[21–23].
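As a point of reference for the estimation step, the following minimal sketch fits a main-effects logistic regression PS model, the approach most researchers were reported to use, on simulated data. It is an illustration under stated assumptions rather than the method of any included article: the two covariates (`age`, `stage`), the simulated treatment mechanism, and the gradient-ascent settings are all hypothetical, and only the Python standard library is used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(beta, x):
    # beta[0] is the intercept; beta[1:] are the covariate coefficients.
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def fit_logistic(X, t, lr=0.5, n_iter=500):
    """Fit P(T = 1 | X) by batch gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, ti in zip(X, t):
            err = ti - sigmoid(linear_predictor(beta, x))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    # Estimated probability of receiving treatment for every subject.
    return [sigmoid(linear_predictor(beta, x)) for x in X]


# Hypothetical data: two standardized covariates drive treatment assignment.
random.seed(0)
X, t = [], []
for _ in range(200):
    age, stage = random.gauss(0, 1), random.gauss(0, 1)
    p_treat = sigmoid(-0.5 + 1.0 * age + 0.5 * stage)
    X.append([age, stage])
    t.append(1 if random.random() < p_treat else 0)

beta = fit_logistic(X, t)
ps = propensity_scores(X, beta)
```

A model of this kind contains only the terms the analyst specifies, so any interaction or higher-order term has to be added by hand, which is the limitation that motivates the data mining alternatives discussed above.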
For propensity score methods, PS matching (PSM) was the most popular, similar to other reports[3, 13], so a focused analysis of PSM methodology was undertaken. Only 34 articles (26.6%) gave a sufficient description of the matching algorithm and caliper width. A caliper width of 0.2 of the SD of the logit of the propensity score is considered an ideal choice because this value eliminates bias as much as possible[7, 24], and this recommendation was the one most often followed in the included articles (50%). When matches are hard to find, a looser caliper might be acceptable to avoid loss of sample size[18]. The reporting rate of matching ratios in our study was lower than in other studies[3], but some articles used ratios such as 1:n and n:1, which could maximize the use of the included participants and thereby improve the precision of the estimates and the generalizability of the results[6]. However, we found that few reporting guidelines for PS methods recommend discussing the effects of incomplete matching, especially for subjects in the treatment group who are excluded after matching. This gap could lead to a waste of valuable information and less reliable results. Other PS methods were also used. Monte Carlo simulation studies have indicated that matching and weighting eliminate systematic differences between treatment groups better than stratification and covariate adjustment[20, 25]. In our study, far more articles used matching and weighting than stratification and covariate adjustment. Meanwhile, we found the Toolkit for Weighting and Analysis of Nonequivalent Groups method, which has not appeared in other reports on the quality of PS reporting. It differs from other methods in that the algorithm (not the user) determines the most appropriate model for the propensity score[26].
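The caliper rule can be made concrete with a short sketch of greedy 1:1 nearest-neighbour matching without replacement on the logit of the PS. Several choices here are assumptions made for illustration and are not taken from the included articles: the greedy matching order, the use of the pooled SD (the square root of the mean of the two group variances) as "the SD of the logit", and the hypothetical PS values.

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def caliper_match(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour matching on the logit of the PS.

    Returns (pairs, width): pairs is a list of (treated_index, control_index)
    tuples, width is the caliper on the logit scale. Treated subjects with no
    available control inside the caliper are left unmatched.
    """
    lt = [logit(p) for p in ps_treated]
    lc = [logit(p) for p in ps_control]
    # Assumption: pooled SD of the logit PS across the two groups.
    pooled_sd = math.sqrt((statistics.variance(lt) + statistics.variance(lc)) / 2.0)
    width = caliper_sd * pooled_sd
    available = list(range(len(lc)))  # controls not yet used
    pairs = []
    for i, x in enumerate(lt):
        if not available:
            break
        j = min(available, key=lambda k: abs(lc[k] - x))
        if abs(lc[j] - x) <= width:
            pairs.append((i, j))
            available.remove(j)  # matching without replacement
    return pairs, width


# Hypothetical propensity scores.
ps_treated = [0.30, 0.55, 0.80, 0.95]
ps_control = [0.10, 0.28, 0.52, 0.60, 0.78]
pairs, width = caliper_match(ps_treated, ps_control)
unmatched = [i for i in range(len(ps_treated)) if i not in {a for a, _ in pairs}]
```

In this toy example the treated subject with a PS of 0.95 has no control within the caliper and is dropped, which is the kind of incomplete matching that ought to be reported and discussed.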
Regarding covariate balance, 28.7% of the articles in our study did not report the comparability of measured baseline covariates. Xiaoxin Yao and colleagues found that 21.9% of cancer studies and 15.6% of cancer surgical studies did not report it[3], and another review[15] showed that 20% of the 97 surgical studies analyzed included an assessment of covariate balance using standardized differences. In the present study, the reporting of balance checks had not improved: besides the 41 articles (28.7%) that did not report comparability, most articles (120, 82.0%) used significance tests instead of standardized differences to check covariate balance. We do not recommend significance tests because they are sensitive to sample size[6, 20] and might miss imbalance because of low statistical power. Authors should be encouraged to report the comparability of measured baseline covariates with appropriate methods. We suggest using standardized differences to check the balance of baseline covariates between treatment groups in PSM, because this measure is not confounded by other factors[4]. A standardized difference of less than 0.1, taken to indicate negligible imbalance, is accepted by most experts[13]. Therefore, the use of standardized differences should be encouraged[27]. If balance is not achieved, the process can be repeated by adding more variables or interaction terms to the existing model.
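For readers unfamiliar with the measure, the sketch below computes standardized differences for a continuous and a binary covariate and applies the 0.1 threshold. The formulas are the commonly used ones (difference in means or proportions divided by the square root of the average of the two group variances). The covariate names and values are hypothetical.

```python
import math
import statistics


def smd_continuous(x_treated, x_control):
    """Standardized difference in means for a continuous covariate."""
    pooled_sd = math.sqrt(
        (statistics.variance(x_treated) + statistics.variance(x_control)) / 2.0
    )
    return (statistics.mean(x_treated) - statistics.mean(x_control)) / pooled_sd


def smd_binary(x_treated, x_control):
    """Standardized difference in proportions for a 0/1 covariate."""
    p_t = sum(x_treated) / len(x_treated)
    p_c = sum(x_control) / len(x_control)
    pooled_sd = math.sqrt((p_t * (1.0 - p_t) + p_c * (1.0 - p_c)) / 2.0)
    return (p_t - p_c) / pooled_sd


def is_balanced(smd, threshold=0.1):
    # An absolute standardized difference below 0.1 is treated as negligible.
    return abs(smd) < threshold


# Hypothetical matched sample.
age_treated = [62, 58, 71, 65, 60]
age_control = [61, 59, 70, 66, 63]
male_treated = [1, 0, 1, 1, 0, 1, 0, 1]
male_control = [1, 0, 0, 1, 0, 1, 0, 1]

d_age = smd_continuous(age_treated, age_control)
d_male = smd_binary(male_treated, male_control)
balance_report = {"age": is_balanced(d_age), "male": is_balanced(d_male)}
```

Unlike a p-value, neither quantity shrinks or grows simply because the matched sample is smaller or larger, which is why it is preferred for comparing balance before and after matching.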
Furthermore, a caveat of the PS method is that it can only account for measured confounders. Bias caused by unmeasured confounding variables could still prevent authors from obtaining an accurate estimate of the treatment effect. The best way to address the effects of unmeasured covariates is to perform sensitivity analyses or subgroup analyses. In our study, 5 articles (31.5%) reported using these methods, which is consistent with other reports on PS methods[28].
This article is the first evaluation of the reporting quality of PS methods in published articles in this field, and we hope that it will be of value to researchers who want to apply PS methods to medical research on gastric cancer. Nevertheless, our study has limitations. First, we did not discuss the details of covariate adjustment and weighting because only a small proportion of the included articles used them, and readers can refer elsewhere for these methods[29, 30]. Second, advanced software has made it easier for researchers to use the PS method, which might lead them to omit an explicit description in their articles. Third, our study reflects only the field of gastric cancer, and articles from other fields were excluded.