Genetic Risk Score for Predicting Schizophrenia Using Human Chromosomal-Scale Length Variation

Abstract:
Studies indicate that schizophrenia has a genetic component; however, it cannot be isolated to a single gene. We aimed to determine how well one could predict that a person will develop schizophrenia based on their germline genetics. We compared 1129 people from the UK Biobank dataset who had a diagnosis of schizophrenia to an equal number of age-matched people drawn from the general UK Biobank population. For each person, we constructed a profile consisting of numbers, each characterizing the length of a segment of a chromosome. We tested several machine learning algorithms to determine which was most effective in predicting schizophrenia and whether prediction improves when the chromosomes are broken into smaller chunks. We found that the stacked ensemble performed best, with an area under the receiver operating characteristic curve (AUC) of 0.545 (95% CI 0.539-0.550). We noted an increase in the AUC when the chromosomes were broken into smaller chunks for analysis. Using SHAP values, we identified the X chromosome as the most important contributor to the predictive model. We conclude that germline chromosomal-scale length variation data could provide an effective genetic risk score for schizophrenia that performs better than chance.


Introduction:
Schizophrenia is a highly heritable, complex psychiatric disorder [1,2]. Genome-wide association studies have identified over one hundred genetic loci that contribute to its heritability [2-6]. However, these loci still account for less than half of the genetic risk for schizophrenia [3]. Environmental exposure to chemicals appears to play almost no role in the development of schizophrenia, but different forms of trauma experienced during development do appear to be a risk factor [7]. Twin studies have consistently shown a significant genetic contribution to schizophrenia, and many twin studies find that the environmental contribution to schizophrenia exists but that genetic effects provide significant liability to schizophrenia [8].

Genetic risk scores [9-11] have been developed for many different forms of disease, including breast cancer [12], coronary artery disease [13], and stroke [14]. Polygenic risk scores based on SNPs clearly can predict schizophrenia. One study measured an odds ratio of about 8 (95% CI 4-14) for the highest decile compared to the lowest decile [15]. A second study found that polygenic risk scores for schizophrenia (and bipolar disorder) are also associated with creativity [16]. A review of polygenic risk scores for schizophrenia highlighted the difficulty these studies had finding a consistent diagnosis of schizophrenia [17].

Copy number variations (CNVs) and copy-neutral loss of heterozygosity (CN-LOH) have been implicated in significant clonal selection [18]. We have previously shown that chromosome-scale length variation (CSLV) is a powerful tool to predict phenotypes from a person's genome [19]. The CSLV values are averages of copy number variation (CNV) measured at each SNP location. Simply using every single CNV value introduces a dimensionality problem, as our dataset has only roughly 488,000 individuals while the total number of CNV values is 764,257 across the 22 autosomes, with an additional 18,857 CNV values for the X chromosome. This means there are likely diminishing returns from using more splits unless they can be offset with more data. This method is particularly appealing for genetic risk scores because it includes epistatic effects that might be missed with conventional genome-wide association studies, which use logistic regression, a linear combination of SNP scores. By attempting to still utilize every CNV value, this model aims to demonstrate that there are likely global CNV interactions that may be missed by conventional genetic risk scores.
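To make the dimensionality argument concrete, the following minimal sketch compares the raw CNV feature count with the number of averaged CSLV features. The counts are the ones quoted above. The function names are hypothetical, and the sketch assumes that the 22 autosomes and the X chromosome (23 chromosomes in total) are each cut into the same number of parts.

```python
# Counts quoted in the text.
N_INDIVIDUALS = 488_000      # approximate number of people in the dataset
N_CNV_AUTOSOMES = 764_257    # CNV values across the 22 autosomes
N_CNV_X = 18_857             # CNV values on the X chromosome

# Assumption for illustration: 22 autosomes plus X, all split the same way.
N_CHROMOSOMES = 23


def raw_feature_count() -> int:
    """Number of features if every CNV value were used directly."""
    return N_CNV_AUTOSOMES + N_CNV_X


def cslv_feature_count(splits_per_chromosome: int) -> int:
    """Number of features after averaging within each chromosome split."""
    return N_CHROMOSOMES * splits_per_chromosome


if __name__ == "__main__":
    raw = raw_feature_count()
    print(f"raw CNV features: {raw} ({raw / N_INDIVIDUALS:.2f} per individual)")
    for k in (1, 4, 8):
        print(f"{k} split(s) per chromosome: {cslv_feature_count(k)} features")
```

Under these assumptions the raw representation has more features than individuals, while the averaged representation keeps the feature count several orders of magnitude below the sample size.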

The purpose of this paper is to evaluate how well a genetic risk score based on

First, we downloaded the "l2r" files from the UK Biobank. Each chromosome has a separate "l2r" file. Each "l2r" file contained 488,377 columns and a variable number of rows. Each column represented a unique patient in the dataset, who can be identified with an encoded represent the log base 2 ratio of intensity relative to the expected two copies measured at the SNP location.
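As a small worked illustration of the log base 2 ratio, the sketch below converts an l2r value back to an intensity ratio and to an implied copy number. The function names are hypothetical, and treating intensity as strictly proportional to copy number is an idealizing assumption; real array intensities are noisy and need not be linear in copy number.

```python
import math

EXPECTED_COPIES = 2  # diploid baseline, as stated in the text


def l2r_to_ratio(l2r: float) -> float:
    """Invert the log2 ratio: observed intensity / expected intensity."""
    return 2.0 ** l2r


def ratio_to_l2r(ratio: float) -> float:
    """Forward direction: log base 2 of the intensity ratio."""
    return math.log2(ratio)


def implied_copies(l2r: float) -> float:
    """Implied copy number, ASSUMING intensity is proportional to copy number."""
    return EXPECTED_COPIES * l2r_to_ratio(l2r)


if __name__ == "__main__":
    for value in (-1.0, 0.0, ratio_to_l2r(1.5)):
        print(f"l2r={value:+.3f} -> implied copies={implied_copies(value):.2f}")
```

Under this idealization, an l2r of 0 corresponds to the expected two copies, -1 to a single copy, and log2(1.5) to three copies.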

After downloading the "l2r" data from the UK Biobank, we computed the mean l2r value for different portions of each chromosome for each patient in the dataset. We created three different datasets, which we refer to as "splits". We split each chromosome into either 1, 4, or 8 nominally equal parts. Then, we computed the actual length for each person's chromosome split

Using the CSLV-Schizophrenia dataset, we selected all people who had a diagnosis of schizophrenia and labelled them in the dataset. We constructed an age-matched control group of the same size that had an age profile identical to that of the schizophrenia group. The age-matched control group was selected from all those in the UK Biobank dataset having no indication of schizophrenia. Since only a small fraction of the people in the UK Biobank had a schizophrenia diagnosis, we could rerun the analysis with a different age-matched control group many times to build up statistics.

We used the H2O machine learning package in R [22,23]. We created 100 machine learning models that were trained to classify a person in the dataset, consisting of those who had
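The split-averaging step described earlier in this section (the mean l2r value over 1, 4, or 8 nominally equal parts of a chromosome) can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The function names are hypothetical. It assumes that one person's l2r values for one chromosome are given as a list ordered by genomic position, that "nominally equal" means equal by number of SNP rows, and that there are no missing values.

```python
from statistics import fmean
from typing import List, Sequence


def split_bounds(n_rows: int, n_splits: int) -> List[range]:
    """Cut row indices 0..n_rows-1 into n_splits contiguous, near-equal parts."""
    if not 1 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 1 and n_rows")
    # Integer floor division keeps the edges strictly increasing, so no part is empty.
    edges = [i * n_rows // n_splits for i in range(n_splits + 1)]
    return [range(edges[i], edges[i + 1]) for i in range(n_splits)]


def split_means(l2r_column: Sequence[float], n_splits: int) -> List[float]:
    """Mean l2r within each split for one person and one chromosome."""
    return [
        fmean(l2r_column[part.start:part.stop])
        for part in split_bounds(len(l2r_column), n_splits)
    ]


if __name__ == "__main__":
    # Toy l2r values for one person on one chromosome (made up for illustration).
    column = [0.0, 0.2, 0.4, 0.6, -0.2, -0.4, 0.0, 0.0]
    for k in (1, 4, 8):
        print(k, split_means(column, k))
```

Applying `split_means` to every person and chromosome would give one row of features per person, with one value per chromosome split. Splitting by base-pair position rather than by row count is an equally plausible reading of the text and would change only `split_bounds`.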

XGBoost is a refinement to the general Gradient Boosting Machine algorithm [30]. Ensembles algorithms. We found that the stacked ensemble models consistently performed best. As Figure 1 shows, we found a slight difference between algorithms and their performance, but all algorithms could predict schizophrenia significantly better than chance (AUC=0.50). This finding

In all models, increasing splits improves model performance for the same runtime. Figure 3 demonstrates the difference of all models for the 1-split, 4-split, and 8-split datasets. We tested whether finer splits of the dataset provided significantly improved AUCs. As shown in Table 1,

In order to understand how our models came to their conclusions, we created several plots to explain them using H2O's "explainability" framework. The first is a variable importance heatmap across the generated models, which is shown in Figure 4. Our analysis here indicated that chromosome X was one of the highest contributing variables in predicting schizophrenia, especially in tree models such as GBM and XGBoost. We then confirmed this with a Shapley Additive exPlanation (SHAP) plot in Figure 5. This plot also indicates that chromosome X was the leading factor in our leading model for predicting schizophrenia.

report corresponding estimates of hg38 coordinates in Table 3.

higher importance of that variable. In most of the models we find that the CSLV values were mostly centered around split 50, 1, 9, 42, 13, 58, and 6. This is consistent with Figure 4.

We wanted to ensure these results were not due to inherent sex differences. We trained 50 models using the 64-split chromosome X dataset, which were not only age-matched with the controls but also sex-matched. Twenty-five of the AutoML models were trained with the actual data with correctly labeled disease states. The other 25 AutoML models were trained with the schizophrenia diagnosis randomly shuffled. The results are shown in Table 4. Here we can see that a portion of the previous performance is most likely due to CSLV differences inherent between males and females (Supplemental D). However, a portion of the prediction is still statistically better than random guessing.

These results indicate that germline genetic variation contributes at least to some degree to the onset of schizophrenia in individuals. Our results indicate that genetic structural variation across the global chromosomal scope is sufficient to predict, better than guessing, whether or not

Conclusion:
We were able to create machine learning models for prediction of schizophrenia in patients.

Consent for publication:
Not applicable.

Availability of data and materials:
The datasets analyzed during the current study are available from UK Biobank at https://www.ukbiobank.ac.uk/

Competing interests:
The authors declare that they have no competing interests.