Reproducibility assessment
Overall, rerunning Gerstung et al.'s analysis was challenging and required a considerable time commitment, sufficient expertise in biomedicine, statistics, and R programming, and even some background knowledge in systems engineering. In the end, our efforts to reproduce their research project were only partly successful.
1. Accessibility As we noted, the GitHub repository provided a folder containing anonymized data, a Supplementary Note with detailed descriptions of the statistical methods and code used, together with other relevant materials (Dockerfile, README, etc.) organized in a readily comprehensible way.
However, two different versions of the Supplementary Note were found: one accompanying the main paper on Nature Genetics' website (version: Wed Sep 7 14:26:11 2016, see https://www.nature.com/articles/ng.3756) and another available in the repository, which was incomplete (version: Tue Dec 15 17:54:15 2015). The more recent version was used in the present study. After running through the provided R script, we noticed that the data from The Cancer Genome Atlas (TCGA), used in the original paper for external validation, were not provided; thus, the corresponding results could not be reproduced. A data dictionary was also attached, yet it was found to be incomplete (see Online Resource 1), and one needed to refer to additional papers (Schlenk et al. 2004, 2010, 2016) before making sense of the Supplementary Note. For instance, the Note mentioned the abbreviations CIR, MUD, and RD without explaining their meanings, which are in fact short for Cumulative Incidence of Relapse, Matched Unrelated Donors, and Refractory Acute Myeloid Leukemia, respectively.
The published results were originally obtained using R (v.3.1.2); however, the current versions of many packages used in the analysis no longer support R (v.3.1.2). Of note, the latest available R (v.4.1.1) did not support a few packages either, such as graph or hilbertVis. As a result, the common route of executing install.packages() was not sufficient, and additional searching was needed to avoid installation errors. Some computationally intensive tasks requiring parallelization, such as leave-one-out cross-validation, were performed by Gerstung et al. in a Load Sharing Facility (LSF) environment. Without access to such a platform, one must modify the code considerably to proceed, and such modifications are prone to unexpected errors.
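For example, graph and hilbertVis are distributed through Bioconductor rather than CRAN, and superseded CRAN releases must be pulled from the archive. The following is a minimal sketch of the kind of workaround we mean; the package and version names in the second part are placeholders, not a verified recipe for the original environment.

```r
## Bioconductor packages (listed there as "graph" and "HilbertVis") cannot be
## installed with install.packages() from CRAN; they need the Bioconductor installer.
install.packages("BiocManager")
BiocManager::install(c("graph", "HilbertVis"))

## Packages whose current releases no longer support an old R version can
## sometimes be installed as archived versions ("somePkg"/"1.0" are placeholders).
install.packages("remotes")
remotes::install_version("somePkg", version = "1.0")
```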
We initially regarded the Dockerfile as an opportunity to re-establish an identical computing environment with R (v.3.1.2) and the corresponding packages. However, our attempts were unsuccessful: the given Dockerfile assumed the existence of a Docker-based R (v.3.1.2) and only described how the author-customized R packages (i.e., mg14 and CoxHD) could be built on top of it. Today, this is insufficient to build a reproducible environment due to compatibility issues. Specifically, the operating system (i.e., Debian Wheezy) specified in the official Dockerfile (from the Rocker project, see https://www.rocker-project.org), upon which R (v.3.1.2) is to be built, is now too old to support essential R packages for this analysis (Fig. 1 depicts the layered structure of an R computing environment).
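To illustrate the missing layers, a self-contained build would have to pin the base image, and hence the operating system and R version, itself, roughly along the lines of the sketch below. The base-image tag and GitHub paths are our assumptions, not the authors' original Dockerfile, and in practice such a build still runs into the same compatibility problem, since current package versions no longer support R (v.3.1.2).

```dockerfile
# Hypothetical sketch only; the rocker tag and repository paths are assumptions.
FROM rocker/r-ver:3.1.2

# System libraries that many R packages compile against (illustrative list).
RUN apt-get update && apt-get install -y libxml2-dev libcurl4-openssl-dev libssl-dev

# The author-customized packages mentioned in the repository.
RUN R -e "install.packages('devtools', repos = 'https://cran.r-project.org')"
RUN R -e "devtools::install_github('mg14/mg14')"
RUN R -e "devtools::install_github('mg14/CoxHD/CoxHD')"
```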
2. Clarity Generally, the source code followed a consistent style and used relative paths, allowing data to be read into R on different devices without manual alteration. Comments in the code, despite a few minor inaccuracies and unnecessary alternative commands, were very helpful for understanding the code. Documentation of the user-defined packages and functions was also easily acquired.
Still, as the research project itself is very complex (more than 5000 lines of code in total), the source code was not clear enough to us. Inconsistencies were identified throughout the source code, which inevitably impaired its legibility, such as different names created for the same concept (e.g., Time_Diag_TPL and TPL_date, Cir and Rel, kmPrs and kmPrd); different concepts sharing the same name (e.g., CIR could mean either the cumulative incidence of relapse or the Kaplan-Meier survival estimate for relapse); incorrectly called variables (e.g., the variable TPL_efs was not found in the R data frames but appeared in the code); wrongly created variables that did not serve the purposes suggested by their assignment commands; and unclear data frames created without further explanation (see details in Online Resource 2). In addition to Gerstung's Supplementary Note, a folder named Code was provided, containing 14 R scripts and an R data file, yet an accompanying README file was missing.
3. Code execution Simply rerunning the code as provided to reproduce the results was impossible. The first coding error arose from a command line (i.e., dataList$Genetics = dataList$Genetics + 0) in the 8th code chunk of the Supplementary Note (p. 13), which introduced undue factor variables into the list dataList$Genetics and stopped one from moving forward (an illustration follows below). More errors occurred throughout the Note, aborting execution within each section. As a result, anyone inexperienced in R would find it challenging to debug and proceed. Furthermore, sections 3.6.5.1–3.6.5.9, 3.6.6.4, 3.6.6.6–3.6.6.7, 3.6.7.2–3.6.7.3, 4.4.1.3, 5.4.2.0.4, 5.4.2.0.5, 5.4.2.1.1, 5.4.2.2.1, and 5.4.2.3.1 contained parallel computations that were originally executed on an LSF platform different from our environments. Overall, we were able to tailor only parts of the R script, as the computations are very intensive and many custom functions remained unclear to us. Even so, our alterations were liable to unknown mistakes and might exacerbate the irreproducibility. We report the errors and our modifications in Online Resource 2.
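As an illustration of the first error, arithmetic on factor columns in R does not coerce them to numeric values, so a "+ 0" idiom breaks when a data frame contains factors. The following toy example (not the authors' data; the workaround is our assumption about the intended 0/1 coding) shows the problem and one possible fix.

```r
## Minimal illustration (toy data, not the authors' dataList object):
## arithmetic on a factor column does not coerce it to numeric values.
genetics <- data.frame(mutation = factor(c(0, 1, 1)))
genetics$mutation + 0
#> Warning: '+' not meaningful for factors; the result is NA

## One possible workaround (an assumption about the intended 0/1 coding,
## not the authors' fix): coerce factor columns explicitly first.
genetics[] <- lapply(genetics, function(x)
  if (is.factor(x)) as.numeric(as.character(x)) else x)
genetics$mutation + 0   # now returns numeric 0/1 values
```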
4. Implementation of the methods described The multistage KBA algorithm, as implemented in the provided code, reflected the ideas described in the paper. Still, it is noteworthy that the random-effect Cox models, the building blocks of the multistage KBA, assumed that the parameters within each predictor group followed a normal distribution, which computationally amounts to ridge regularization (Therneau et al., 2003). This simplified the computations and was realized by specifying the ridge regression penalty term in coxph() inside the user-defined function CoxRFX() (to learn this, we used the function debug() to step through the execution of CoxRFX()). It should be noted that, on the one hand, normally distributed random effects are a strong assumption; on the other, Gerstung et al. claimed, in the Supplementary Note and in their function help file, that the parameter estimation was done via an Expectation-Maximization algorithm, as suggested by a simulation study of Perperoglou (2014), which in fact favored a Restricted Maximum Likelihood-type method for handling random effects. Therefore, we could not fully understand how CoxRFX() fits the random-effect Cox model.
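The equivalence between the normal random-effect assumption and ridge regularization can be illustrated with a plain coxph() call; the sketch below uses the lung dataset shipped with the survival package as a stand-in for a predictor group and is not the authors' CoxRFX() implementation.

```r
library(survival)

## A normal prior with variance sigma^2 on a group of coefficients is
## computationally equivalent to a ridge penalty with theta = 1 / sigma^2.
fit <- coxph(Surv(time, status) ~ ridge(age, ph.ecog, theta = 1), data = lung)
summary(fit)

## CoxRFX() additionally re-estimates the group-wise variances (i.e., theta)
## iteratively; stepping through it interactively shows where the ridge term
## enters the coxph() call:
# library(CoxHD)
# debug(CoxRFX)   # then run CoxRFX(...) and step line by line with 'n'
```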
5. Matching of outputs We could partly execute the R script after the necessary modifications. Although the random seed was fixed, deviations from the published results appeared at various locations; most of them were negligible, with numbers differing only in the last few decimal places or figures showing minor alterations. Notably, our two rerun attempts produced the same warnings and errors at the same locations, and fewer numerical discrepancies were observed between the two sets of results thus obtained.
We compared the outputs from the rerun on a personal computer with the published results in Online Resource 2.
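For the numerical outputs, a tolerance-based comparison separates differences confined to the last few decimal places from substantive deviations; a small sketch of this kind of check (the values and tolerance are placeholders, not actual outputs) is:

```r
## Illustrative comparison of a rerun value against a published one.
published <- c(0.4321098, 0.1234567)
rerun     <- c(0.4321097, 0.1234569)

all.equal(published, rerun, tolerance = 1e-6)  # TRUE: difference is negligible
identical(published, rerun)                    # FALSE: not bitwise identical
```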
6. Further notes regarding the published results
After the rerun of the analysis, we established the full picture of prognostic trajectories among the 1540 patients in the KBA database (Fig. 2); of those, only 995 patients were considered eligible for allograft in first complete remission (CR1).
Gerstung et al. presented the global treatment efficacy of different therapeutic strategies by plotting the reduction in three-year mortality after diagnosis obtained by comparing allograft in CR1 with salvage allograft after relapse (y-axis) against the three-year mortality under standard chemotherapy only (x-axis), shown in our Fig. 3a (adapted from Fig. 5a in their original paper). We learned from this reproducibility study that the asserted causal efficacy was calculated by applying the KBA to the 995 eligible patients. This naturally led to the question of whether confounding effects were handled appropriately, for two reasons. Firstly, the KBA was constructed using 1540 patients, of whom more than one third were in fact ineligible for allograft. For illustrative purposes, we rebuilt the KBA after restricting it to the 995 patients and summarized the changes in our Fig. 3 and Table 2. Notably, among those who were predicted (by the original KBA) to have a lower three-year mortality (<40% if receiving standard chemotherapy), the estimated benefits from a timely allograft in CR1 disappeared according to the rebuilt KBA (Fig. 3). This change was also captured in the average survival rates of the Favorable group in Table 2.
Secondly, the KBA was developed for predictive purposes: the algorithm was trained and assessed empirically on its predictive performance, without a priori clinical information. It was therefore ill-suited to address population-level treatment efficacy, which is, in essence, a causal question in which the theoretical causal structure and confounding effects should be taken into account; this has a profound impact on each step of the modeling process. This topic is beyond the scope of our study and has been elaborated in full detail by Shmueli (2010).