Implementation
SumStatsRehab is written in Python 3 and uses several native Linux executables. The key functions of SumStatsRehab are assessment, validation and restoration (Fig. 1). These functions can be applied to the chromosome, base pair position, rsID, effect allele, other allele, allele frequency, standard error, beta, and p-value columns. Each category of data in the input GWAS summary statistics file is assessed, validated, and restored independently. SumStatsRehab accepts GWAS summary statistics files with single nucleotide polymorphisms (SNPs) that reference human genome build 36, build 37, or build 38, and can output restored summary statistics files in reference builds 37 and 38. SumStatsRehab uses a .json header file to correctly read and interpret the columns in the input summary statistics file.
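To illustrate the idea of a header file that tells a reader how to interpret columns, the following is a minimal sketch. The JSON layout shown (standard field name mapped to a zero-based column index), the field names, and the function `read_with_header` are assumptions made for this example; the actual schema used by SumStatsRehab is not specified in this section.

```python
import csv
import io
import json

# Hypothetical header layout: standard field name -> zero-based column index.
HEADER_JSON = '{"rsID": 0, "Chr": 1, "BP": 2, "EA": 3, "OA": 4, "pval": 5}'


def read_with_header(text, header_json, delimiter="\t"):
    """Yield one dict per data row, keyed by the standard field names."""
    columns = json.loads(header_json)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the file's own (arbitrarily named) header row
    for row in reader:
        # Columns missing from a short row are reported as empty strings.
        yield {
            name: row[index] if index < len(row) else ""
            for name, index in columns.items()
        }


example = "SNP\tCHR\tPOS\tA1\tA2\tP\nrs123\t1\t1000\tA\tG\t0.05\n"
rows = list(read_with_header(example, HEADER_JSON))
```

With such a mapping, files that use different column names or orders can be read into the same internal representation before assessment.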
Assessment and validation of Summary Statistics Files
SumStatsRehab can be used to identify any invalid SNPs in a GWAS summary statistic file; invalid SNPs are those which are missing any core data such as variant ID. This enables users to determine the number and cause of missing or invalid SNPs (Fig. 2).
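A per-column validity check of this kind can be sketched as follows. The specific rules (for example, accepting only `rs`-prefixed IDs, chromosomes 1 to 22, X, Y and MT, and p-values in [0, 1]) and the names `CHECKS`, `invalid_fields` and `count_invalid` are assumptions chosen for illustration, not the rules SumStatsRehab actually applies.

```python
import math
import re

RSID_RE = re.compile(r"rs[0-9]+")
ALLELE_RE = re.compile(r"[ACGT]+")
POSITION_RE = re.compile(r"[1-9][0-9]*")
CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def _to_float(value):
    """Return a finite float, or None if the text is not a usable number."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def _in_unit_interval(value):
    number = _to_float(value)
    return number is not None and 0.0 <= number <= 1.0


# Assumed validity rule for each field; each check receives the raw text.
CHECKS = {
    "rsID": lambda v: RSID_RE.fullmatch(v) is not None,
    "Chr": lambda v: v in CHROMOSOMES,
    "BP": lambda v: POSITION_RE.fullmatch(v) is not None,
    "EA": lambda v: ALLELE_RE.fullmatch(v.upper()) is not None,
    "OA": lambda v: ALLELE_RE.fullmatch(v.upper()) is not None,
    "EAF": _in_unit_interval,
    "beta": lambda v: _to_float(v) is not None,
    "SE": lambda v: (_to_float(v) or 0.0) > 0.0,
    "pval": _in_unit_interval,
}


def invalid_fields(row):
    """Return the names of the fields in `row` that fail their check."""
    return [name for name, check in CHECKS.items()
            if not check(str(row.get(name, "")))]


def count_invalid(rows):
    """Count, per field, how many rows have an invalid value."""
    counts = {}
    for row in rows:
        for name in invalid_fields(row):
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Because every field is checked independently, the per-field counts can be reported separately, which is what allows the cause of each invalid entry to be identified.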
To demonstrate this command, we tested it on an example GWAS summary statistics file (GWAS blood pressure [10]). As shown, SumStatsRehab identified that less than 1% of entries for genome-wide significant SNPs were missing (Fig. 2A), and that the majority of missing entries were rsIDs (Fig. 2B). The plots produced by the diagnostic tools show the number of invalid SNPs by significance level, indicating the potential impact of the incomplete data columns on downstream calculations. The results of this diagnostic are used internally to guide and optimize restoration.
Restoration of Summary Statistics Files
SumStatsRehab only attempts to restore entries identified as invalid, with one exception. When either the base pair position or chromosome is invalid, SumStatsRehab restores both by looking up the rsID associated with that entry, and overwriting the chr and base pair position entries.
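The exception described above, where chromosome and base pair position are restored together from the rsID, can be sketched as a simple lookup. The in-memory dictionary `dbsnp_by_rsid` and the function `restore_position` are hypothetical stand-ins for the preprocessed dbSNP data and are used here only to show the logic.

```python
def restore_position(row, dbsnp_by_rsid, chr_valid, bp_valid):
    """Return a copy of `row`; if Chr or BP is invalid, overwrite both.

    `dbsnp_by_rsid` maps an rsID to a (chromosome, position) pair.
    If the rsID is not found, the row is returned unchanged.
    """
    restored = dict(row)
    if chr_valid and bp_valid:
        return restored  # nothing to restore
    hit = dbsnp_by_rsid.get(row.get("rsID"))
    if hit is not None:
        # Both fields are overwritten, even if only one was invalid,
        # so that the pair stays mutually consistent.
        restored["Chr"], restored["BP"] = hit
    return restored


lookup = {"rs123": ("1", "1000")}
fixed = restore_position({"rsID": "rs123", "Chr": "1", "BP": "NA"},
                         lookup, chr_valid=True, bp_valid=False)
```

Overwriting the pair as a unit avoids combining a chromosome from the input file with a position from the reference.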
The extent of restoration possible depends on the inputs to SumStatsRehab. If only the summary statistics file is input, SumStatsRehab can restore p-values, betas and standard errors, provided that two of these three values are present. The additional input of a dbSNP file in the target human genome reference build is optimal for restoration. SumStatsRehab preprocesses the dbSNP file, organizing it by rsID, chromosome, base pair position, alleles, ref/alt, and frequencies associated with each SNP, sorted by chromosome and base pair position, and by rsID. If the target build and the GWAS summary statistics file build are different, a third input, the ‘chain file’, is needed for liftover from the summary statistics file build to the target build. With these inputs, SumStatsRehab is able to restore GWAS data files, including EAF, missing t-statistics, rsIDs or chromosome numbers and base pair positions, effect allele (EA) and other allele (OA).
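The relationship that makes two of the three values sufficient can be sketched as follows. This example assumes a two-sided test under the normal approximation, z = beta / SE and p = 2(1 - Φ(|z|)); this is a common convention but is an assumption here, and the function names are hypothetical. Note that a p-value carries no information about the direction of effect, so only the magnitude of beta can be recovered from SE and p.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_from_beta_se(beta, se):
    """Two-sided p-value from beta and its standard error (se > 0)."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def _abs_z_from_p(p):
    """Magnitude of the z-score for a two-sided p-value, 0 < p <= 1."""
    return abs(_STD_NORMAL.inv_cdf(p / 2.0))


def se_from_beta_p(beta, p):
    """Standard error from beta and p; undefined when p == 1 (z == 0)."""
    return abs(beta) / _abs_z_from_p(p)


def abs_beta_from_se_p(se, p):
    """Magnitude of beta from SE and p; the sign cannot be recovered."""
    return _abs_z_from_p(p) * se


p = p_from_beta_se(0.1, 0.05)  # z = 2
```

Extremely small p-values can underflow to zero in floating point, in which case the inverse step is not defined; a real implementation would need to handle that case explicitly.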
Preparation of test case files
To assess the utility of our tool and the extent of restoration it can achieve, we chose publicly available and complete summary statistics files from three different GWAS as test cases: 1) blood pressure, 2) C-reactive protein, 3) allergies [10–12]. These files were preprocessed by removing one specific column of data per file at a time: rsID, chromosome number, base pair position, effect or other allele, allele frequency, p-value, beta, and standard error. This generated 9 test files per GWAS, for a total of 27 test files. We ran each file through both SumStatsRehab and the only current alternative, MungeSumstats.
In order to be run through MungeSumstats, test files required an extra round of extensive preprocessing. For the blood pressure test files, all columns were renamed in accordance with the MungeSumstats documentation [7]. For the GWAS allergies test files, all fields containing ‘NA’ had to be replaced with a placeholder dot, and all rows with any non-numeric value in BP fields had to be removed. In both cases, the necessary preprocessing required manual deletion of SNPs for which the missing or invalid data could be restored, to allow MungeSumstats to proceed with restoration for the remainder of the test files. Additionally, all non-traditionally formatted rsIDs, e.g. “esv3584976”, were removed to prevent the automatic failure of the program.
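The generation of one test file per removed column can be sketched as below. This example blanks the chosen column with a placeholder rather than dropping it from the table; that choice, the `NA` placeholder, and the function name `make_test_files` are assumptions made for illustration.

```python
import csv
import io


def make_test_files(text, columns, delimiter="\t", placeholder="NA"):
    """Return {column: table text} with that one column masked in each."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    unknown = [c for c in columns if c not in fieldnames]
    if unknown:
        raise ValueError(f"columns not in file: {unknown}")

    outputs = {}
    for column in columns:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                                delimiter=delimiter, lineterminator="\n")
        writer.writeheader()
        for row in rows:
            # Copy the row and mask only the column under test.
            writer.writerow(dict(row, **{column: placeholder}))
        outputs[column] = buffer.getvalue()
    return outputs


table = "rsID\tChr\tBP\nrs1\t1\t100\nrs2\t2\t200\n"
files = make_test_files(table, ["rsID", "BP"])
```

Keeping the unmasked original alongside each masked file is what allows restored values to be compared against the true values afterwards.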
Assessment of SumStatsRehab and comparison with MungeSumstats
To assess the restoration performed by both tools, two different metrics were used. For qualitative attributes, accuracy was assessed in terms of concordance with the original summary statistics file. This was used for the chromosome, base pair position, effect allele, and other allele columns. For quantitative attributes, we calculated the difference between the predicted values and the original, masked values using Formula 1, which yields an accuracy score between 0 and 1. This was used to calculate the relative accuracy for the allele frequency, beta, standard error, and p-value columns, in order to account for floating point arithmetic and rounding errors.
Formula 1. Here x_o and x_r are the original and restored values, and k is a column-specific error term (fudge factor): k = 2 for the allele frequency column, k = 6 for the beta column, k = 4 for the standard error column, and k = 3 for the p-value column.

The overall accuracy for each column was calculated as the average of the accuracy metrics for each entry. These results were then used to assess and compare the restoration process of both SumStatsRehab and MungeSumstats.
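The averaging step can be sketched as follows for the qualitative columns, where the per-entry score is simple concordance (1 for a match, 0 otherwise). The function names are hypothetical, and the whitespace stripping before comparison is an assumption. The per-entry score for quantitative columns (Formula 1) is not reproduced here; any function returning a value between 0 and 1 could be passed as `score`.

```python
def concordance(original, restored):
    """Per-entry score for qualitative fields: 1.0 on exact match, else 0.0."""
    return 1.0 if str(original).strip() == str(restored).strip() else 0.0


def column_accuracy(originals, restoreds, score=concordance):
    """Average the per-entry scores over one column."""
    if len(originals) != len(restoreds):
        raise ValueError("columns must have the same number of entries")
    if not originals:
        raise ValueError("cannot score an empty column")
    scores = [score(o, r) for o, r in zip(originals, restoreds)]
    return sum(scores) / len(scores)


chromosome_accuracy = column_accuracy(["1", "2", "X", "7"],
                                      ["1", "2", "X", "8"])
```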
We did not use any accuracy metrics to evaluate restoration of rsIDs, as rsID restoration is dependent on the publication timeframe of the GWAS. For earlier GWAS, rsID names do not correspond well to more current dbSNP databases; differences in rsID may therefore reflect differences in dbSNP versions rather than differences in accuracy of restoration.