Reporting quality evaluation of propensity score methods in published articles in the gastric cancer field: a systematic review


Background The inability to reproduce the principal results of published studies is often due to poor reporting quality. The propensity score (PS) method has been increasingly employed to balance confounders in observational research. A few studies have shown poor reporting quality of the PS method in some medical fields, which can lead to misleading interpretation of results and affect clinicians' treatment decisions. No study has yet examined the reporting quality of published articles that applied the PS method in the gastric cancer field. The aim of this study was to assess the reporting quality of the PS method in published articles in the gastric cancer field and to provide recommendations for investigators who intend to conduct and report PS analyses.
Methods Published articles that applied the PS method in the gastric cancer field were searched in PubMed from inception to July 2019. Two reviewers independently extracted information and evaluated the reporting quality of the PS method in the included articles.
Results A total of 143 eligible articles were identified according to the inclusion and exclusion criteria. These articles were published from 2007 to 2019, and their number generally increased over time. 112 articles (78.3%) clearly listed the variables, and 15 articles (10.5%) justified the selection of variables for the PS models. 34 articles (23.8%) reported interactions between variables or subgroup analyses. Propensity score matching (PSM) was the most used method (124 articles, 86.7%), followed by weighting (8 articles, 5.6%), stratification (4 articles, 2.8%) and regression adjustment (3 articles, 2.1%); 4 articles (2.8%) used more than one method. Among articles using PSM, 34 (26.6%) sufficiently described the matching algorithm and caliper width, and 32 (25%) used standardized differences to check balance; the reporting of replacement was poor (30 articles, 21%). 10 articles (7.9%) utilized all subjects, and 121 articles (94.5%) did not discuss the influence of incomplete matching.
Conclusions There were methodological deficiencies in the reporting and conduct of the PS method in published articles in the gastric cancer field. Researchers should report more details of the PS method so that authors, journal editors and peer reviewers can judge the reliability and authenticity of the results.


Background
The reproduction of primary results depends on high-quality reporting in medical research papers. Inadequate reporting of propensity score methods in medical research has caused growing concern. Adequate description of the crucial components of an analysis could relieve these concerns about irreproducibility.
The propensity score (PS) method was first introduced by Rosenbaum and Rubin in 1983 [1]. As a widely used statistical method, the PS can control for confounding factors by conditioning on each participant's probability of receiving the treatment. It has been reported to approximate a randomized design better than multivariable regression, because the PS excludes patients who have no comparable counterparts in the other group [1] and is not limited by the total sample size [2]. The PS also allows researchers to better understand the potential impact of medical interventions and complements the findings of observational studies. Thus, the PS is a practical tool for assessing treatment effects in studies where many biases could exist. However, despite a growing number of publications using the PS method, the quality of PS reporting has not improved as desired [3,4]. Prior literature showed that most researchers did not report enough details regarding the balance of covariates between the treatment groups and the choice of covariates included in PS models [5,6,7].
Poor reporting quality may directly lead clinicians to choose suboptimal treatments that delay patient recovery [3], and inadequate reporting often makes it difficult for other researchers to judge the appropriateness of the reported analyses and to reproduce published results [4]. Authors should describe the essential details adequately so that readers and other researchers can validate the findings, which is considered imperative for high-quality research.
The incidence of gastric cancer ranks fifth among common cancers, with higher rates in East Asian countries [8,9]. We found that the PS has been used in many studies of gastric cancer. Despite the widespread use of the PS in the gastric cancer field, its reporting quality has not been evaluated, and proper reporting guidelines for the PS are lacking.
The aim of this study was to assess the reporting quality of the PS method in published articles in the gastric cancer field and to provide recommendations for investigators who intend to conduct and report PS analyses.

Search strategy
A search strategy was designed and run in PubMed to identify published articles using the PS in the field of gastric cancer. The search strategy for PubMed is outlined in Figure 1. The search covered the period from inception to July 8, 2019, and the language was limited to English. The specific search procedure for PubMed is shown in Figure 2.

Criteria for literature inclusion and exclusion
Criteria for literature inclusion: 1) the title, abstract or key words described gastric cancer; 2) the title, abstract or key words described the PS method; 3) the article was published in English; 4) the study subjects were human.
Criteria for literature exclusion: 1) non-observational studies, such as systematic reviews, meta-analyses, randomized controlled trials (RCTs), quasi-randomized trials, other interventional studies, case series analyses, case reports, meeting abstracts and guidelines; 2) studies not related to gastric cancer; 3) the full text was not available; 4) the study subjects were animals.

Data extraction
The reviewers were trained by professionals before data extraction, and the data from the included articles were then independently extracted by two of the authors. Discrepancies were resolved by discussion or by consulting a third author. The items used to extract information were adapted from previous literature [2,3,10]. The general characteristics of the included articles comprised year of publication, journal name, region of origin of the first author, authors' affiliations, participation of a statistician or epidemiologist as an author (identified from the authors' affiliations or the acknowledgements), international cooperation, indexing of the journal in the Science Citation Index (SCI), impact factor (IF), number of citations, number of pages, number of authors, funding support, the method used to determine the sample size and the number of patients enrolled.

Evaluation of report quality of PS methods of included papers
The following items on the reporting of the PS method were recorded.

2.4.1
For the assessment of variables in the PS method, we extracted information about the variables, including the justification for the variables chosen, the number of variables in the PS model, and the inclusion of interaction or polynomial terms in the PS model.

2.4.2
Regarding construction of the PS, we recorded whether the type of regression model used to estimate the PS was reported.
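To make this item concrete, the following minimal Python sketch estimates a PS with logistic regression, one common choice of model. It is only an illustration: the function names, the toy covariates (a standardized age and a binary stage indicator) and the plain gradient-ascent fitting are assumptions introduced here and are not taken from any included article.

```python
import math

def predict_ps(beta, row):
    """Return the estimated probability of treatment for one covariate row."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity_model(X, treated, lr=0.1, n_iter=5000):
    """Fit a logistic regression of treatment on covariates by gradient ascent.

    Returns the coefficients [intercept, b1, ..., bk].
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, t in zip(X, treated):
            err = t - predict_ps(beta, row)  # residual on the probability scale
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * row[j]
        # Step along the average gradient of the log-likelihood.
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: [standardized age, advanced stage (0/1)] per patient.
X = [[-1.2, 0], [-0.8, 1], [-0.5, 0], [-0.1, 1], [0.0, 0],
     [0.3, 1], [0.6, 0], [0.9, 1], [1.1, 0], [1.4, 1]]
treated = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

beta = fit_propensity_model(X, treated)
ps = [predict_ps(beta, row) for row in X]  # one PS per patient
```

In practice a statistical package would be used for the fit; the point of the sketch is that a report should state the model type and the covariates entered, since both determine the resulting scores.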

2.4.3
The type of PS method was also extracted. There are four main methods of PS analysis: PS matching, PS stratification, PS adjustment and PS weighting.

2.4.4
The comparability of baseline characteristics in PS analyses was extracted.

2.4.5
For propensity score matching (PSM), we recorded the matching ratio (1:1, 1:n, n:1, etc.), the matching algorithm and distance metric, the balance check, whether matching was performed with replacement, the proportion of the sample that was matched, and whether the influence of incomplete matching was discussed.
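The PSM items above can be illustrated with a minimal Python sketch of greedy 1:1 nearest-neighbour matching without replacement on the logit of the PS. The function names and toy scores are hypothetical, and the caliper of 0.2 pooled standard deviations of the logit is an assumed default chosen for illustration; included articles may have used other algorithms, ratios or calipers.

```python
import math
import statistics

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def greedy_caliper_match(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns a list of (treated_index, control_index) pairs. Treated
    subjects with no control inside the caliper stay unmatched.
    """
    lt = [logit(p) for p in ps_treated]
    lc = [logit(p) for p in ps_control]
    pooled_sd = math.sqrt((statistics.variance(lt) + statistics.variance(lc)) / 2.0)
    caliper = caliper_sd * pooled_sd
    available = list(range(len(lc)))  # controls not yet used
    pairs = []
    for i, value in enumerate(lt):
        if not available:
            break
        j = min(available, key=lambda c: abs(lc[c] - value))
        if abs(lc[j] - value) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # without replacement
    return pairs

# Hypothetical propensity scores.
ps_treated = [0.30, 0.55, 0.80, 0.95]
ps_control = [0.10, 0.28, 0.33, 0.52, 0.60, 0.78]

pairs = greedy_caliper_match(ps_treated, ps_control)
matched_fraction = len(pairs) / len(ps_treated)
```

With these toy scores the last treated subject (PS 0.95) has no control within the caliper and is dropped, so only three of four treated subjects are matched. This is the kind of incomplete matching whose influence the extraction form asks about.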

2.4.6
For weighting and stratification, the type of weighting and the number and definition of strata were extracted.
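As a brief illustration of what these two items describe, the sketch below computes inverse probability of treatment weights and assigns subjects to PS strata. The function names and data are hypothetical, the weights shown target the average treatment effect, and the use of five equal-sized strata (quintiles) is an assumption; an article could legitimately report a different weight type or stratum definition, which is why both need to be stated.

```python
def iptw_weights(ps, treated):
    """Weights 1/PS for treated subjects and 1/(1 - PS) for untreated subjects."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for p, t in zip(ps, treated)]

def assign_strata(ps, n_strata=5):
    """Assign each subject to one of n_strata equal-sized strata by PS rank."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(ps)
    return strata

# Hypothetical propensity scores and treatment indicators.
ps = [0.15, 0.80, 0.35, 0.60, 0.25, 0.90, 0.45, 0.70, 0.55, 0.05]
treated = [0, 1, 0, 1, 1, 1, 0, 0, 1, 0]

weights = iptw_weights(ps, treated)
strata = assign_strata(ps)  # 0 = lowest PS stratum, 4 = highest
```

A treated subject with a low PS (here 0.25) receives a large weight (4.0), which shows why extreme scores deserve attention when weighting is reported.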

2.4.7
Information on whether the articles reported how potential sources of bias were addressed was extracted. Some authors recommend performing sensitivity analyses and subgroup analyses to determine how susceptible the data are to bias from factors unmeasured by the investigators.
If this information could be found in the Methods or Results section, we considered the item reported. We applied the extracted items to assess the sufficiency of reporting of the PS.

Data analysis
Categorical variables describing article characteristics and reporting were summarized as frequencies and percentages. Continuous variables were summarized as range, mean, median and interquartile range (Q1 to Q3). The degree of agreement between reviewers was examined using the Kappa coefficient. All statistical analyses were conducted with SPSS 19.0.
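For readers unfamiliar with the Kappa coefficient, the following minimal Python sketch computes Cohen's kappa for two raters judging the same items. It is illustrative only: the analyses in this study were run in SPSS, and the function name and the toy ratings (1 = item reported, 0 = not reported) are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgements by two reviewers on ten articles.
reviewer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
reviewer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

kappa = cohen_kappa(reviewer_1, reviewer_2)  # observed 0.9, expected 0.5, kappa 0.8
```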

Literature search
A total of 325 articles using PS methods in the gastric cancer field were identified. After screening of titles and abstracts, 182 papers were excluded. Ultimately, this procedure yielded 143 eligible articles published from 2007 to 2019. The degree of agreement between data extractors was acceptable (Kappa coefficient = 0.86, P < 0.01).

General characteristics of selected articles
The primary features of these articles are outlined in Table 1.
In the past five years, the number of published articles has grown rapidly, especially in 2018, with 41 articles (see Figure 3). However, because the search ended in the middle of 2019, the number of articles for that year is likely underestimated. 38 articles (26.6%) were from Japan, 37 (25.9%) from China, 33 (23.1%) from South Korea and 9 (6.3%) from the USA. A statistician or epidemiologist participated as an author in 32 articles (22.4%). Moreover, 140 articles (98.0%) did not explain how the sample size was determined.

Reporting quality of PS methods
The characteristics of the PS method in the included articles are shown in Table 2.

3.3.1.
128 articles (89.5%) did not justify the selection of covariates for the PS models. All articles used demographic and clinical variables to perform PS analyses. 112 articles (78.3%) clearly listed the covariates in the PS models; the number of covariates ranged from 3 to 37, with a median of 7. 34 articles (23.8%) reported interactions between variables or subgroup analyses.

3.3.2.
103 articles (72.0%) reported how the PS model was estimated, and all of them used logistic regression to construct the PS model. Probit regression, discriminant analysis, regression trees and other methods based on data mining algorithms were not used in the included articles.
No article used matching with replacement. 10 articles (7.9%) utilized all subjects, and 58 articles (45.3%) used more than 50% of the sample. 121 articles (94.5%) did not discuss the influence of incomplete matching. The details of PSM reporting are shown in Table 3.

3.3.6.
For PS weighting, 4 articles (50%) used inverse probability of treatment weighting, and one article [11] used the Toolkit for Weighting and Analysis of Nonequivalent Groups (TWANG) method. For PS stratification, all articles using this method reported their strata and how they were defined.

3.3.7.
Potential sources of bias were addressed with sensitivity analyses and subgroup analyses, which were reported in 15 articles (10.5%) and 30 articles (21%), respectively.

Discussion
Our study demonstrated that the reporting quality of the included papers was unsatisfactory. This result is in line with prior systematic reviews [7,10,12,13], even though some guidelines on the PS method have been published in recent years. Many articles omitted essential details or adopted inappropriate methods, which can lead to misleading interpretation of treatment effects in published articles. We therefore focus our discussion on the following aspects.
For the variables in PS models, all included articles used demographic and clinical variables to construct the PS, which is good practice. 112 articles (78.3%) clearly listed the variables, but not all variables were incorporated into the final PS models. Omitting variables implicitly assumes that the potential bias from them is negligible. The number of variables also deserves attention: PS models with few variables, and PS models with many variables but a small sample size, usually do not produce unbiased causal inference [14]. On the other hand, only 15 articles (10.5%) reported how variables were selected to construct the PS. The non-parsimonious approach that includes all variables is the most typical way of choosing variables, but it can be deleterious if the included variables are not associated with the prognosis [15]. When the justification for the variables included in the PS model is not clearly stated, important variables may have been left out of the analysis, and the reproducibility of the results is limited. In addition, some studies [16,17] showed that adequate professional knowledge and practical clinical experience are key to determining the included variables, interactions and/or higher-order terms. The variables eventually incorporated into PS models should be associated with the outcome [4,18], but some assessments of PS reporting quality ignored this aspect [3,13].
Another concern was the choice of model for estimating the PS. As T. L. Zakrison and colleagues reported, most researchers adopt logistic regression [13]. In other fields, however, boosted regression trees and neural networks have been proposed for constructing PS models. These approaches, based on data mining algorithms, have spread rapidly with the development of statistical software in recent years [19]; one of their advantages is that they can automatically detect nonlinear terms and incorporate them into the PS estimation model. Although logistic regression is adequate in many circumstances, it may tempt researchers to select the 'optimal' model that yields the expected results [4,20]. Therefore, we encourage researchers to consider statistical methods based on data mining algorithms, because these methods could yield more precise estimates of the treatment effect under both non-additivity and non-linearity [21,22,23].
Among propensity score methods, PS matching was the most popular, similar to other reports [3,13], so we undertook a focused analysis of PSM methodology. Only 34 articles (26.6%) sufficiently described the matching algorithm and caliper width. A caliper width of 0.2 of the standard deviation (SD) of the logit of the propensity score is considered an ideal choice because it can eliminate as much bias as possible [7,24], and this recommendation was followed in 50% of the included articles in our study. When matches are hard to find, a looser caliper might be acceptable to avoid loss of sample size [18]. For matching ratios, our reporting rate was lower than in other studies [3], but some articles in our study used ratios such as 1:n and n:1, which can maximize use of the included participants and thereby improve the precision of estimation and the generalizability of results [6]. However, we found that few reporting guidelines for the PS method suggest discussing the effects of incomplete matching, especially regarding subjects in the treatment group who are excluded after matching; this gap could waste valuable information and make results less reliable. Other PS methods were also used. A Monte Carlo simulation study indicated that matching and weighting eliminate systematic differences between treatment groups better than stratification and covariate adjustment [20,25]. In our study, far more articles used matching and weighting than stratification and covariate adjustment. We also found one use of the Toolkit for Weighting and Analysis of Nonequivalent Groups method, which has not appeared in other assessments of PS reporting quality; it differs from other methods in that the algorithm, not the user, determines the most appropriate model for the propensity score [26].
Regarding covariate balance, Xiaoxin Yao and colleagues found that 21.9% of cancer studies and 15.6% of cancer surgical studies did not report the comparability of measured baseline covariates [3], and another review [15] showed that 20% of 97 surgical studies assessed covariate balance using standardized differences. In the present study, the reporting of balance checking had not improved: 41 articles (28.7%) did not report the comparability of measured baseline covariates, and most articles (120, 82.0%) used significance tests instead of standardized differences to check covariate balance. We do not recommend significance tests, because they are sensitive to sample size [6,20] and may miss imbalance owing to low statistical power. Authors should be encouraged to report the comparability of measured baseline covariates with appropriate methods. We suggest using standardized differences to check the balance of measured baseline data between treatment groups in PSM, because they are not influenced by other factors [4]. A standardized difference of less than 0.1 is accepted by most experts as indicating negligible imbalance [13]. Therefore, the use of standardized differences should be encouraged [27]. If balance is not achieved, the process can be repeated by adding more variables or interactions to the existing model. Furthermore, a caveat of the PS method is that it can only account for measured confounders; bias from unmeasured confounding variables can prevent authors from obtaining an accurate estimate of the treatment effect. A practical way to assess the effects of unmeasured covariates is to perform sensitivity analyses or subgroup analyses. In our study, 45 articles (31.5%) reported using these methods, which is consistent with other reports on PS methods [28].
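The balance check recommended above can be sketched in a few lines of Python. This is a minimal illustration for a continuous covariate, using the common definition of the standardized difference (difference in means divided by the square root of the average of the two group variances); the function name and the toy ages are hypothetical, and the 0.1 threshold is the cutoff discussed in the text.

```python
import math
import statistics

def standardized_difference(treated_values, control_values):
    """Absolute standardized mean difference for a continuous covariate."""
    mean_diff = statistics.mean(treated_values) - statistics.mean(control_values)
    pooled_sd = math.sqrt(
        (statistics.variance(treated_values) + statistics.variance(control_values)) / 2.0
    )
    return abs(mean_diff) / pooled_sd

# Hypothetical ages (years) in the two groups after matching.
age_treated = [60, 62, 64, 66, 68]
age_control = [58, 60, 62, 64, 66]

d = standardized_difference(age_treated, age_control)
balanced = d < 0.1  # False here: a 2-year gap with SD of about 3.2 gives d of about 0.63
```

Unlike a significance test, this value does not shrink or grow with the number of matched subjects, which is why it is preferred for judging balance.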
This article is the first evaluation of the reporting quality of the PS method in published articles in this field, and we hope that our study will be of value to researchers who want to apply PS methods to medical research on gastric cancer. Nevertheless, our study has limitations.
Firstly, we did not discuss the details of covariate adjustment and weighting, because only a small proportion of the included articles used them; readers can refer elsewhere for these methods [29,30].
Secondly, some advanced software has made it easier for researchers to use the PS method, which might lead them to omit an explicit description in their articles. Thirdly, our study reflects only the field of gastric cancer; articles from other fields were excluded.

Conclusion
Although many studies in the gastric cancer literature have used propensity score methods, there were flaws in the reporting and use of the PS method, and the reporting quality was suboptimal.
It is time to take measures to improve the reporting quality of the PS method in published articles. We therefore propose that authors adopt rational reporting guidelines for PS methods to promote transparency and consistency.

Figure 1
PubMed search strategy.

Figure 2
Flow Diagram of Articles Identified.

Figure 3
Publication Trends in Gastric Cancer Reporting Use of Propensity Score Analysis.
