UNSUPERVISED METHODS IDENTIFY FUNCTIONALLY RELEVANT GENETIC MODULES:
Binary transformation of the genetic information contained in the VCF files produced a 1046 × 112 matrix in which each row identifies a variant and each column identifies a patient. A first exploratory analysis of this matrix, which contains the whole genetic information, was performed through dimensionality reduction with Principal Component Analysis (PCA) (figure 1). In the 2D plot each dot represents a patient. The first two principal components (PC1 and PC2) explain 22% of the overall variance of the dataset (supplementary, table S3). The main contributors to PC1 are a group of SNPs all located in the MAPT genomic region (supplementary materials, table S2), which previous works have defined as haplotype-specific30,39. The second principal component involves a more heterogeneous group of SNPs, of which rs11670823 in the NOTCH3 gene is the major contributor (supplementary, table S3).
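As a rough illustration of this exploratory step, the sketch below runs PCA on a small binary variant-by-patient matrix through NumPy's singular value decomposition. The toy matrix, its size, the random seed and the function name `pca_scores` are invented for illustration; the software and settings used for the actual analysis are not stated in this section.

```python
import numpy as np

def pca_scores(matrix, n_components=2):
    """PCA via SVD. `matrix` is variants x patients, as described in the text."""
    X = np.asarray(matrix, dtype=float).T        # patients as rows, variants as features
    X = X - X.mean(axis=0)                       # centre each variant
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)        # fraction of variance per component
    scores = U[:, :n_components] * S[:n_components]   # patient coordinates (the dots in the plot)
    loadings = Vt[:n_components].T               # contribution of each variant to each component
    return scores, loadings, explained[:n_components]

# Synthetic toy data (hypothetical): 20 variants x 12 patients
rng = np.random.default_rng(0)
toy = rng.integers(0, 2, size=(20, 12))
toy[:5, :6] = 1    # a block of variants carried together by the first six patients
toy[:5, 6:] = 0
scores, loadings, explained = pca_scores(toy)
```

Sorting the variants by the absolute value of their loadings on a component gives the kind of "main contributors" ranking referred to above.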
We labelled each sample according to the disease affecting the patient, the genotype (sporadic, p.Glu200Lys, p.Val210Ile) and a label marking a possible batch effect due to different sequencing runs. None of these labels matched the identified clusters (supplementary materials, figure S1). Based on the loadings and score values of the PCA, we then focused on the main genetic sources of variation in the dataset. We identified the two main MAPT haplotypes and the corresponding groups (H1, H2 and H1/H2) according to two coding SNPs, rs1052553 and rs1800547 30,39, which are in linkage disequilibrium (LD) with rs11575896 (the first contributor to PC1, table S2). Labelling the dataset according to MAPT haplotype perfectly matches the clusters in the PCA plot, as shown in figure 1. Our population is in Hardy-Weinberg (HW) equilibrium for the tested SNPs. The distribution along the y-axis reflects specific SNP patterns associated with NOTCH3 haplotypes (supplementary, figure S2).40 Interestingly, the SNP rs11670823 (c.3837+21T>A), which is the major contributor to PC2, is in LD with three NOTCH3 haplotype-defining SNPs (rs1044009, rs104423702 and rs4809030)40, and deviates from HW equilibrium (p = 0.03) in the complete cohort (sAD p = 0.085, E200K p = 0.276, V210I p = 0.420).
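The HW checks mentioned above can be illustrated with a chi-square goodness-of-fit test on genotype counts. This is a minimal sketch under the assumption of a biallelic SNP and a 1-degree-of-freedom chi-square test; the text does not say which HW test was actually used, and the genotype counts in the example are hypothetical. For small groups an exact test is usually preferred.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 d.f.) for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)              # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))     # survival function of chi-square with 1 d.f.
    return chi2, p_value

# Hypothetical counts: an excess of homozygotes relative to HW expectations
chi2, p_value = hwe_chi_square(40, 20, 40)
```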
SUPERVISED METHODS RECOGNIZE CLINICAL PHENOTYPES WITH HIGH ACCURACY:
Supervised classifiers were used for automatic recognition of genetic patterns among the 1046 variants identified in this dataset. In the 112 × 1046 matrix, each sample was assigned a label corresponding to the disease (class: “CJD” or “AD”). Classification was perfect, with 100% accuracy (ratio of correctly predicted observations to total observations) on the test set, and was based on the two disease-causing mutations p.Val210Ile and p.Glu200Lys (figure 2).
To test for the presence of additional recurrent genetic patterns that could characterize a phenotypically homogeneous group and possibly act as modifiers, we removed from the classifier input only the two rows of the 1046-row matrix corresponding to the disease-causing mutations. As expected, accuracy decreased in both the training set and the test set, but interestingly the classifier still distinguished the two diseases with good accuracy (training = 0.97, test = 0.78) (table 1).
Table 1: Classification metrics. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations (TruePositive / (TruePositive + FalsePositive)). Recall is the ratio of correctly predicted positive observations to all observations in the actual class (TruePositive / (TruePositive + FalseNegative)). The F1 score is the harmonic mean of precision and recall (F1 = 2 × (Recall × Precision) / (Recall + Precision)). Support indicates the number of samples in each class.
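The metric definitions in the caption can be written out directly. The following is a minimal sketch on hypothetical labels; the function name `class_metrics` and the toy predictions are invented for illustration and do not reproduce the values of Table 1.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1 and support for one class, following the caption's definitions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = sum(t == positive for t in y_true)   # number of true samples in the class
    return {"precision": precision, "recall": recall, "f1": f1, "support": support}

def accuracy(y_true, y_pred):
    """Ratio of correctly predicted observations to total observations."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical test-set labels and predictions
y_true = ["AD", "AD", "AD", "CJD", "CJD", "CJD", "CJD"]
y_pred = ["AD", "AD", "CJD", "CJD", "CJD", "CJD", "AD"]
ad_metrics = class_metrics(y_true, y_pred, "AD")
overall_accuracy = accuracy(y_true, y_pred)
```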

The classification is based on eight variants involving six different genes (figure 3). All of these variants are reported in common databases and genomic search engines such as VarSome41, OMIM42, ClinVar43 or HGMD44, and their consequence was assessed as known disease-causing variant, risk factor, variant of uncertain significance (VUS) or benign according to the ACMG guidelines for the interpretation of sequence variants45. Five variants are predicted to be benign and are intronic or synonymous; the remaining three are classified as variants of uncertain significance and are missense or located in 3’ UTR regions.
STATISTICAL ANALYSIS OF VARIANT FREQUENCY:
For each of the 1046 variants detected, the allele frequency was calculated separately in the sAD and in the gCJD group. The latter was further divided according to the presence of the p.Glu200Lys or p.Val210Ile mutation. Each allele frequency was then compared with that reported in the gnomAD database28 for the European (non-Finnish) population. Differences between observed and expected allele frequencies were tested for statistical significance with Fisher’s exact test and Benjamini-Hochberg correction for multiple testing. Table 2 summarizes the number of each type of variant per group and shows the average number of variants per patient in the different classes (gene list reported in supplementary, table S4).
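The testing procedure described above can be sketched with the standard library alone. The example assumes a 2 × 2 table of alternate and reference allele counts (cohort versus reference population) and a two-sided Fisher's exact test; the allele counts are hypothetical, and the function names are invented for illustration. For reference counts on the scale of a real population database, a library routine such as `scipy.stats.fisher_exact` would be more practical than this direct enumeration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                 # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts (alt, ref) in a cohort of 46 patients vs a reference population
cohort_vs_reference = [(12, 80, 500, 9500), (5, 87, 540, 9460), (30, 62, 3100, 6900)]
raw_p = [fisher_exact_two_sided(*counts) for counts in cohort_vs_reference]
adjusted_p = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in adjusted_p]
```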
Table 2
Summary of the results of the statistical analysis of each variant detected in our targeted sequencing panel. Rows identify disease groups, with the number of patients in parentheses. The first column shows the average number of variants carried per patient in each disease group. The second column shows the overall number of different variants detected in at least one patient of each group. The third column indicates the variants annotated as missense, splice variants or 3’ or 5’ UTR in each disease group. The last column contains the number of variants with p < 0.05 after Fisher’s exact test and Benjamini-Hochberg correction, regardless of their annotation.
DISEASE GROUP | AVERAGE NUMBER OF SNVs PER PATIENT | UNIQUE SNVs PER DISEASE GROUP | UNIQUE NON-SYNONYMOUS SNVs PER DISEASE GROUP | UNIQUE SNVs WITH p<0.05 PER DISEASE GROUP
AD (46) | 145.05 | 654 | 27 | 72
CJD (66) | 134.87 | 768 | 11 | 33
E200K (26) | 138.73 | 483 | 14 | 52
V210I (40) | 135.73 | 645 | 27 | 75
PATHWAY ANALYSIS AND PROTEIN-PROTEIN INTERACTION NETWORK:
To gain functional insight into the consequences of the alterations in allele frequencies, genes harbouring at least one variant with p < 0.05 were used as input for pathway analysis with the GO database (figure 4) and for a protein-protein interaction (PPI) network. Since some of the affected pathways are shared among the conditions considered, results are reported as differences in pairwise comparisons between groups. Comparison of sAD vs gCJD in the PPI network shows a clear centrality of the interactions of APP, PSEN2 and APOA1 in the AD but not in the CJD group (PPIn tables and figure in supplementary). Functional analysis of the same pairwise comparison points to a significant (p < 0.05) enrichment in the sAD group (compared with gCJD) of GO terms involving regulation of the apoptotic signaling pathway, supramolecular fiber organization, and antigen processing and presentation of exogenous peptide antigen. Interestingly, in the CJD group we found an enrichment of GO terms involving ER responses to stress, protein folding, regulation of mRNA maturation and splicing, and regulation of catabolic processes.
We then investigated whether functional differences within the gCJD group could provide further understanding of the different penetrance of the two mutations. In the pairwise comparison between V210I and E200K, we found an enrichment of GO terms referring to proteasome-mediated catabolic processes and to antigen processing and presentation only in the V210I group. In the PPIn results for the comparison V210I vs E200K, APOA1 and MAPT, together with DCTN1, represented hubs of the network, highlighting a similarity between the enriched modules in the networks of the low-penetrance p.Val210Ile mutation and of sAD, with numerous interactions and shared nearest neighbours involved in the enriched pathways.
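The notion of a network hub used above can be illustrated with degree centrality, one simple way to rank nodes by their number of interaction partners. This is only a sketch: the centrality measure used for the actual PPI analysis is not specified in this section, and the node names and edges below are placeholders, not real interactions.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:                               # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    return {node: (len(nb) / (n - 1) if n > 1 else 0.0)
            for node, nb in neighbours.items()}

# Placeholder network (hypothetical nodes and edges)
edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("g1", "g2")]
centrality = degree_centrality(edges)
top_node = max(centrality, key=centrality.get)
```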
In the E200K group compared with the V210I group, we found a significantly altered regulation of mRNA and splicing, reflected in the PPI network by the abundant presence of members of the heterogeneous nuclear ribonucleoprotein (hnRNP) gene family as nearest neighbours of the input genes, in addition to an alteration of actin filament organization.
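GO term enrichment of a gene list is commonly assessed with an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation under the assumption that this kind of test underlies the enrichment results; the tool actually used is not named in this section, and the gene counts in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_term, n_input, n_overlap):
    """P(X >= n_overlap) when n_input genes are drawn from a background of
    n_background genes, of which n_term are annotated with the GO term."""
    upper = min(n_term, n_input)
    total = comb(n_background, n_input)
    tail = sum(comb(n_term, k) * comb(n_background - n_term, n_input - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical counts: 8 of 30 input genes carry a term annotated on 40 of 1000 background genes
p_enrichment = enrichment_p_value(1000, 40, 30, 8)
```

When many GO terms are tested, the resulting p-values would normally be corrected for multiple testing before applying a significance threshold.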