Challenge format
The CAGI6 ID-challenge was divided into two tasks. First, teams were asked to predict each patient's clinical phenotype across seven broad phenotypic traits. Second, teams were asked to predict one or more causal variants from data derived from customized gene-panel sequencing of 415 pediatric patients.
The phenotypic traits for each patient are based directly on information provided by clinicians. Each patient can have one or more phenotypic traits.
Sequencing data were provided as VCF files by the Laboratory of Molecular Genetics of Neurodevelopmental Disorders, University Hospital of Padua (Italy), each linked to a specific patient. The VCF files contain exons and flanking intronic regions of 74 genes sequenced with the Ion Torrent PGM platform and processed with the Ion Torrent Suite v5.0 software, as described in Aspromonte et al. (2019). Further information on sequence data processing is available in the VCF files (e.g., genotype quality, coverage, and called genotype). The variants were not filtered, so the VCF files may contain sequencing errors that should be excluded by evaluating sequencing or genotype quality parameters.
The description of the seven disease phenotypes, the 74 gene identifiers, the captured gene regions in browser extensible data (BED) format, a submission template, and a submission validation script were also provided, and are described in Aspromonte et al. (2023) in this CAGI6 Special Issue.
Training dataset for CAGI 6 ID-Challenge
Before the CAGI6 challenge, the eight participating teams also had access to a training dataset (sequencing data and patients' clinical information from the CAGI5 ID-panel). This published dataset includes the phenotypes and variants of 150 pediatric patients with NDDs. Predictors also had access to the workflow for variant filtering, interpretation, and classification (Aspromonte et al., 2019).
Phenotype prediction assessment
The evaluation of predictive ability in this assessment concentrated primarily on the performance of the submissions across the different disease phenotypes. The problem can be treated as a two-class prediction problem (Fawcett, 2006), an assessment approach that has proven successful across different domains. It simplifies the assessment by enabling a comparison of the performance of different methods for each individual phenotype, in contrast to evaluating them on the entire predicted class matrix (415 x 7), which consists of one prediction per patient and phenotype.
Predicted disease classes for each submission were assessed against the clinical phenotypes given in the Padua NDD lab dataset (ground truth), using the procedure described below. If a predictor did not provide a probability value and left an asterisk in the template file, the entry was treated as a probability of zero during the evaluation.
The assessment starts, separately for each phenotype column, by converting the submitted probability values into binary classes: positive (1) or negative (0). We determined the threshold probability value that maximized the Matthews correlation coefficient (MCC) for that particular phenotype. Subsequently, we compared all probability values for each phenotype with the corresponding threshold and assigned a value of 0 or 1 depending on whether the probability was below or above the threshold, respectively.
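As an illustrative sketch, this thresholding step can be implemented as below (the function name and the candidate-cutoff strategy are ours, not taken from the assessment pipeline):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def binarize_by_best_mcc(y_true, y_prob):
    """Find the cutoff maximizing the MCC for one phenotype, then map the
    submitted probabilities to binary classes (1 above, 0 below)."""
    best_t, best_mcc = 0.5, -1.0
    for t in np.unique(y_prob):  # candidate cutoffs taken from the submitted values
        mcc = matthews_corrcoef(y_true, (y_prob >= t).astype(int))
        if mcc > best_mcc:
            best_t, best_mcc = t, mcc
    return (y_prob >= best_t).astype(int), best_t
```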
Furthermore, we employed various performance measures to evaluate the predictions for each phenotype. Sensitivity and specificity were used to assess a model's ability to detect positive cases and to discriminate between positive and negative classes. The MCC, accuracy (ACC), and F1 measures were employed to evaluate negative and positive predictions simultaneously. Notably, the MCC has been shown to be less influenced by imbalanced datasets (Vihinen, 2012), which is relevant here, as some phenotypes in this challenge are severely imbalanced.
To visualize the trade-off between the true positive rate and the false positive rate, we generated receiver operating characteristic (ROC) curves by comparing the predicted probability values against the ground-truth labels for each phenotype. To ensure robustness, we used 1000 bootstrap iterations to produce the ROC and precision-recall (PR) curves and to calculate the corresponding AUC values (Schmidt et al., 2014). This resampling technique allows us to assess the stability and reliability of the performance estimates, especially for imbalanced datasets, and enhances the statistical validity of our results.
To compare results among predictors, we computed z-scores at the phenotype level on the ROC AUC values obtained from the bootstrap iterations. This standardization enables a meaningful comparison across predictors.
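A minimal sketch of the bootstrap and the subsequent standardization, assuming patient-level resampling with replacement (the iteration count comes from the text; all names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def bootstrap_aucs(y_true, y_prob, n_iter=1000):
    """Resample patients with replacement and recompute ROC/PR AUCs,
    yielding a distribution of performance estimates."""
    roc_aucs, pr_aucs = [], []
    n = len(y_true)
    for _ in range(n_iter):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:  # both classes needed to score
            continue
        roc_aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
        pr_aucs.append(average_precision_score(y_true[idx], y_prob[idx]))
    return np.array(roc_aucs), np.array(pr_aucs)

def auc_zscore(mean_auc, all_mean_aucs):
    """Standardize one predictor's mean ROC AUC against all predictors
    on the same phenotype."""
    return (mean_auc - np.mean(all_mean_aucs)) / np.std(all_mean_aucs)
```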
Variant prediction assessment
As part of this challenge, predictors were also evaluated on their capacity to establish causal associations between individual patients and the single nucleotide variants (SNVs) present in the provided VCF files. This can be treated as a multi-label classification problem: each patient can have one or more variants selected in the Padua NDD lab dataset, and predictors were allowed to identify zero or more variants per patient. The Padua NDD lab classified the selected variants into three classes according to their association with the disease. If a patient's phenotype aligns with a Mendelian disorder, variants labeled as pathogenic are considered the direct cause of the patient's phenotypic manifestations and are classified as "disease causing (DC)". Variants deemed likely pathogenic (LP), or "putative", need additional examination before being classified as causative. Uncommon or novel variants that are predicted to be pathogenic and affect genes associated with a higher risk of autism were categorized as potential contributing factors (FC), as they alone are insufficient to trigger the disease.
For the assessment we used a confusion matrix and metrics derived from it, such as accuracy and recall. True positives (TP) are cases where the predictor assigned a variant to a patient that matches one of the variants associated with that patient. False positives (FP) occur when the predictor incorrectly assigns a variant to a patient; false negatives (FN) are cases where the predictor fails to identify a variant that should have been associated with a patient. True negatives (TN) are not evaluated here, as predictors only output variants that should be associated with a patient. Consequently, the accuracy metric is calculated as the ratio of correctly predicted variants over all available variants.
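Under these definitions, a sketch of the per-submission scoring could look as follows (the patient-to-variant-set mappings and names are our assumptions; `truth` is assumed to have an entry, possibly empty, for every patient):

```python
def variant_metrics(predicted, truth):
    """Score one submission: `predicted` and `truth` map each patient ID
    to the set of variants assigned/annotated for that patient."""
    tp = fp = fn = 0
    for patient, true_vars in truth.items():
        pred_vars = predicted.get(patient, set())
        tp += len(pred_vars & true_vars)   # correctly assigned variants
        fp += len(pred_vars - true_vars)   # incorrectly assigned variants
        fn += len(true_vars - pred_vars)   # causal variants missed
    # TP / (TP + FN): the "accuracy" as defined in the text, i.e. the
    # fraction of all selected variants that were correctly predicted
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": recall, "precision": precision}
```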
Prediction methods
Eight teams submitted predictions, using a total of 30 models (Table 1). The methods used by each team are summarized in Table 2 and described in detail below. Two of the participating teams, team 6 and team 7, took part in the ID-challenge in the previous CAGI edition and were evaluated in the CAGI5 assessment (Carraro et al., 2019).
Table 1
List of participating teams in the CAGI6 ID-challenge.
| ID | Submission ID | PI | N° models |
|---|---|---|---|
| Group 1 | SID#1 | Anonymous | 6 |
| Group 2 | SID#2 | Yang Shen | 6 |
| Group 3 | SID#3 | Maggie Haitian Wang | 2 |
| Group 4 | SID#4 | Rita Casadio | 1 |
| Group 5 | SID#5 | Robert Hoehndorf | 5 |
| Group 6 | SID#6 | Olivier Lichtarge | 6 |
| Group 7 | SID#7 | Sean D. Mooney | 2 |
| Group 8 | SID#8 | Rajgopal Srinivasan | 2 |
Table 2
Features of prediction methods developed by the predictors for the ID-challenge in CAGI6.
| ID | SID | CAGI5 training | Variant annotation | Low-quality filter | Frequency filter | Variant effects | Inheritance | Gene-phenotype association | Mathematical model |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.1 | + | EVIDENCE, VEP | // | gnomAD < 5% | OMIM, ClinVar, UniProt; EVIDENCE uses 3Cnet, SpliceAI, REVEL | + | Patient-gene matrix scores, HPO | PRS |
| | 1.2 | | | | | | | | Random forest |
| | 1.3 | | | | | | | | Random forest |
| | 1.4 | | | | | | | | Scoring method |
| | 1.5 | | | | | | | | // |
| | 1.6 | | | | | | | | // |
| 2 | 2.1–2.6 | + | // | // | + | DNABERT | // | // | GCN |
| 3 | 3.1–3.2 | + | VEP, REVEL | // | Absent or MAF < 5% in 1000 Genomes; Hom exclusion; present in 1/415 patients | ClinVar, Phenolyzer, REVEL | // | GWAS | PRS |
| 4 | 4.1 | // | VEP, SNPs&GO | // | gnomAD < 1% | SIFT, PolyPhen, SNPs&GO, LoF variants | // | HPO, PhenPath | Correlation |
| 5 | 5.1–5.5 | + | VEP, CADD | + | // | ESM, CADD | // | Buckets, ESM from training data | ML (ESM) |
| 6 | 6.1 | + | EA | // | MAF in gnomAD | Inheritance pattern, EA score, MAF | + | ClinVar, HPO, DisGeNet, GeneCards, PubMed | Correlation |
| | 6.2 | | | | | Inheritance pattern, EA score, MAF | | // | // |
| | 6.3 | | | | | Inheritance pattern, EA score, MAF | | ClinVar, HPO, DisGeNet, GeneCards, PubMed | Correlation |
| | 6.4 | | | | | EA, MAF | | Training data | PRS |
| | 6.5 | | | | | EA, MAF | | // | // |
| | 6.6 | | | | | EA, MAF | | Training data | PRS |
| 7 | 7.1–7.2 | + | ANNOVAR | // | AF ≤ 1% | LINSIGHT, MutPred2, REVEL | // | DISEASES | RFC |
| 8 | 8.1, 8.6 | + | In-house tool VPR | + | + | In-house tool VPR | + | ClinVar, MEDLINE | Correlation |
Abbreviations: SID: Submission ID; VEP: Ensembl Variant Effect Predictor; CADD: Combined Annotation Dependent Depletion; MAF: Minor Allele Frequency; Hom: homozygous; LoF: Loss of Function; HPO: Human Phenotype Ontology; PRS: Polygenic Risk Score; ESM: Evolutionary Scale Modeling; VPR: Variant Prioritization; EA: Evolutionary Action; ML: Machine Learning; GCN: Graph Convolutional Network; RFC: Random Forest Classifier. Mathematical models: Naive Bayes, Correlation, Bayesian, Bayesian network, Trait specific.
Group 1–6 models
We performed variant annotation, prioritization, and analysis on over 100,000 single nucleotide variants (SNVs) across the 415 Variant Call Format (VCF) files. We used the EVIDENCE software (Seo et al., 2020) and followed the ACMG guidelines for classification. Rare variants (gnomAD frequency < 5%) were annotated with the Ensembl Variant Effect Predictor (VEP) and prioritized based on gene function, inheritance pattern, and relevance in disease databases (OMIM, ClinVar, and UniProt). Common SNVs (gnomAD frequency > 5%) were filtered out. For phenotype matching, patient-gene matrix scores and HPO terms were used. Six different models were used for prediction and analysis.
Model 1 utilized a random forest approach with a polygenic risk scoring concept, considering multiple variants contributing to a single disease. Model 2 involved a random forest model with hyperparameter tuning to optimize performance. Model 3 focused on variants affecting canonical transcripts, training random forest models on their pathogenicity scores. Model 4 used prior probabilities and odds calculated from pathogenicity scores to compute posterior probabilities. Model 5 implemented a simple scoring method based on an in-house database, assigning weight scores by variant-phenotype relevance. Model 6 identified an optimal threshold for pathogenicity scores to improve machine learning model training.
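For Model 4, the prior/odds combination presumably follows Bayes' rule on the odds scale; a sketch under that assumption:

```python
def posterior_probability(prior, likelihood_odds):
    """Combine a prior probability with an odds factor derived from
    pathogenicity scores (Bayes' rule on the odds scale)."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_odds
    return post_odds / (1.0 + post_odds)

# e.g. a 10% prior combined with a 5x odds factor gives a ~0.36 posterior
print(round(posterior_probability(0.10, 5.0), 2))  # 0.36
```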
Each model was applied to the 415 VCF files, and the respective approaches and findings were summarized for the seven phenotypes. We explored and optimized prediction methods for variant analysis and classification, considering factors such as polygenic risk, hyperparameter tuning, transcript selection, prior probabilities, the in-house database, and threshold optimization.
Group 2–6 models
Based on the provided BED file and the hg19 genome sequence, we extracted the DNA sequences of the 74 genes. The mutated sequences for each patient were then derived from the VCF files. To obtain sequence-wise representations of the genes, we used DNABERT (Ji et al., 2021) with k-mer tokenization, ultimately settling on k = 6. Next, we constructed a graph representation for each sample, with the genes as nodes. The adjacency matrix was defined by a 6-channel gene-gene interaction (GGI) network, as applied in Karimi et al. (2020). We then employed a graph convolutional network (GCN) (Kipf & Welling, 2017) for message passing and a learnable weighted summation layer for the graph-wise representations. Finally, we used a multilayer perceptron (MLP) with a 7-dimensional sigmoid layer to predict the labels of the target diseases. The labeled samples provided were randomly divided into training and test sets at a 7:3 ratio. We tuned the models on test performance and developed six versions with distinct configurations. For each version we identified the top 20 variants by attention weight as candidate variants. We tried various combinations as ensemble models and determined the causal variants by voting on their frequency among the candidates. The results of the top six ensemble models were submitted.
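Purely as an illustration, a minimal numpy sketch of the forward pass described above; the dimensions, random weights, and single-channel 0/1 adjacency are placeholders (the actual model used trained parameters, DNABERT embeddings, and the 6-channel GGI network):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, emb_dim, n_phenotypes = 74, 768, 7  # DNABERT-base emits 768-d embeddings

# X: per-gene sequence embeddings for one patient (placeholder for DNABERT output)
X = rng.normal(size=(n_genes, emb_dim)).astype(np.float32)
# A: one channel of the gene-gene interaction network (random 0/1 placeholder)
A = (rng.random((n_genes, n_genes)) < 0.05).astype(np.float32)
A = np.maximum(A, A.T)  # symmetrize

def gcn_layer(X, A, W):
    """One GCN message-passing step: symmetric normalization, linear map, ReLU."""
    A_hat = A + np.eye(len(A), dtype=A.dtype)        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

W1 = rng.normal(scale=0.02, size=(emb_dim, 128)).astype(np.float32)
H = gcn_layer(X, A, W1)                              # (74, 128) gene states

# learnable weighted summation -> graph-wise representation
attn = rng.normal(size=(128,)).astype(np.float32)
weights = np.exp(H @ attn); weights /= weights.sum() # softmax over genes
graph_repr = weights @ H                             # (128,)

# MLP head with a 7-dimensional sigmoid output for the disease labels
W2 = rng.normal(scale=0.02, size=(128, n_phenotypes)).astype(np.float32)
probs = 1.0 / (1.0 + np.exp(-(graph_repr @ W2)))     # per-phenotype probabilities
```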
Group 3–2 models
The Ensembl Variant Effect Predictor (VEP) tool (McLaren et al., 2016) and precalculated REVEL scores (Ioannidis et al., 2016) were used to annotate the raw VCF files. For variants reported with multiple REVEL scores, we selected the highest one. Variants were filtered on four criteria: (1) absent or with MAF < 5% in 1000 Genomes Project data; (2) not homozygous for the reference allele; (3) present in only one sample; (4) protein-altering. The ClinVar database (Landrum et al., 2014), the phenotype-specific Phenolyzer score (Yang et al., 2015), and the REVEL score were used to prioritize variants. Variants reported as pathogenic/likely pathogenic in ClinVar, or the top-ranked variant combining the Phenolyzer and REVEL rankings, were identified as putative causative variants. Polygenic risk scores (PRS) were used to build the prediction model. To construct the PRS for each of the seven neurodevelopmental presentations, previously associated variants were extracted by searching the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/), the IEU Open GWAS platform (https://gwas.mrcieu.ac.uk/), and the GWAS Atlas (atlas.ctglab.nl). ASD and epilepsy, which have existing GWAS summary statistics, were eligible for this analysis; the childhood IQ GWAS summary statistics were used for intellectual disability. Allele effect sizes were also derived from the training dataset by logistic regression for each phenotype. For each phenotypic trait, we trained the logistic regression model on the training dataset and made predictions for the test dataset. The two submissions correspond to constructing the PRS using existing GWAS summary statistics or using effect size estimates from the training dataset.
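A hedged sketch of the PRS construction and the training-set-derived effect sizes (per-variant logistic regressions are our assumption about how the effect sizes were estimated, and all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def polygenic_risk_score(dosages, effect_sizes):
    """PRS = sum over associated variants of (allele dosage x effect size).
    `dosages` is an (n_patients, n_variants) matrix of 0/1/2 allele counts."""
    return dosages @ effect_sizes

def fit_effect_sizes(train_dosages, train_labels):
    """Re-estimate effect sizes on the training set (submission 2); for
    submission 1 they would come from published GWAS summary statistics."""
    coefs = []
    for j in range(train_dosages.shape[1]):  # one regression per variant
        lr = LogisticRegression().fit(train_dosages[:, [j]], train_labels)
        coefs.append(lr.coef_[0, 0])
    return np.array(coefs)
```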
Group 4–1 model
The method implemented by the Bologna Biocomputing Group consists of three steps: i) variant annotation, ii) selection of putative causative variants, and iii) phenotype association. Variant annotation was performed with the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016). We retained the annotations from SIFT (Ng & Henikoff, 2003) and PolyPhen (Adzhubei et al., 2013) and the allele frequency reported in gnomAD. The putative effect of missense variants was predicted with SNPs&GO (Manfredi et al., 2022). After filtering out variants with an allele frequency greater than 1%, we retained the following as putative causative variants: i) missense variants predicted as damaging by all three adopted methods (SIFT, PolyPhen, SNPs&GO); ii) stop-gain variants altering more than 75% of the wild-type protein sequence; iii) frameshift variants. Phenotypes associated with genes containing putative causative variants were retrieved from the Human Phenotype Ontology (HPO) (Köhler et al., 2019), manually curated data, and PhenPath (Babbi et al., 2019). Six genes were excluded because they are associated with all seven phenotypes of interest. The gene-phenotype associations were used to predict the phenotypic effect of the putative causative variants, and we consequently assigned each patient a binary score (unaffected/affected) for each possible phenotype.
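The selection rules in step ii) translate directly into a filter; a sketch with illustrative annotation field names (not the group's actual data structures):

```python
def is_putative_causative(v):
    """Apply the retention rules described above to one annotated variant.
    `v` is a dict of annotations; field names are hypothetical."""
    if v["gnomad_af"] is not None and v["gnomad_af"] > 0.01:
        return False  # discard common variants (allele frequency > 1%)
    if v["consequence"] == "missense":
        # unanimity rule: all three predictors must call the variant damaging
        return (v["sift"] == "damaging" and v["polyphen"] == "damaging"
                and v["snps_and_go"] == "damaging")
    if v["consequence"] == "stop_gained":
        # keep only if the truncation removes > 75% of the wild-type protein
        return v["truncated_fraction"] > 0.75
    return v["consequence"] == "frameshift"
```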
Group 5–5 models
To predict the causal variant(s) and clinical phenotypes from the patients' genomic data, we first preprocessed the data, applied quality control to the variants generated from the sequencing data, and filtered out low-depth genotypes. Next, we annotated the variants with VEP (McLaren et al., 2016) and precalculated CADD scores (Rentzsch et al., 2019). For the prediction of clinical phenotypes, we used two different methods: buckets and Evolutionary Scale Modeling (ESM) protein embeddings. The "buckets" or "genomic bins" representation groups variants by genomic location, assigning them to regions of a vector of predefined size. We used a one-dimensional binary vector to represent regions, where a one indicates the presence of a variant in that region. We trained separate neural networks for each representation and each phenotype using leave-one-out cross-validation.
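A minimal sketch of the "genomic bins" encoding, assuming equal-width buckets over the captured region (the bucket count and region handling are placeholders):

```python
import numpy as np

def bucket_vector(variant_positions, region_start, region_end, n_buckets=1024):
    """Binary 'genomic bins' encoding: split the captured region into
    fixed-size buckets and set a 1 wherever the patient carries a variant."""
    vec = np.zeros(n_buckets, dtype=np.int8)
    span = region_end - region_start
    for pos in variant_positions:  # positions assumed within the region
        idx = min((pos - region_start) * n_buckets // span, n_buckets - 1)
        vec[idx] = 1
    return vec
```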
For causal variant prediction, we averaged the ESM protein representations corresponding to a single gene. We then used binary vectors with ClinVar classifications and CADD scores. We trained a model using variants reported as causative, putative, or contributing factors as the positive set; all other variants were treated as negative instances. We used the Synthetic Minority Oversampling Technique (SMOTE) to handle the highly imbalanced training data. We reported the top three predicted variants with the highest prediction scores.
Group 6–6 models
Our team developed six automated scoring systems for this challenge. All models used the Evolutionary Action (EA) method (Katsonis & Lichtarge, 2014) to estimate the pathogenic effect of each variant. Models 1, 2, and 3 scored each sample for each trait, based on the predicted causal effects of variants and the known associations of genes with traits. We predicted causal variant effects using gene inheritance patterns (dominant de novo, autosomal recessive, and X-linked male causality), the EA score, and the Minor Allele Frequency (MAF), estimated from gnomAD (Chen et al., 2022), the test data, and the training data (VCF_CAGI5). Known gene-trait associations were obtained from ClinVar (Landrum et al., 2018), DisGeNet (Piñero et al., 2017), the Human Phenotype Ontology (HPO) (Köhler et al., 2019), GeneCards (Stelzer et al., 2016), and PubMed data mining. The three models differed in their EA thresholds and in the source of known gene-trait associations.
Models 4, 5, and 6 computed polygenic association scores using genetic variants grouped by gene, Evolutionary Action, and the MAF of the training set. These models did not use any known gene-phenotype associations. Model 5 was trained separately for male and female patients. Model 6 excluded training samples without trait associations. For all six models, the variants with the strongest contributions to a sample's final score were predicted as causal.
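As an illustration only, one way such a polygenic association score could aggregate per-variant evidence; the weighting scheme below is our guess, not the team's exact formula:

```python
def polygenic_association_score(patient_variants, gene_weights):
    """Illustrative aggregation: weight each variant by its Evolutionary
    Action score (0-100) and down-weight common alleles; `gene_weights`
    would come from training-set trait associations."""
    score, contributions = 0.0, {}
    for v in patient_variants:  # dicts with variant_id, gene, EA score, MAF
        contrib = (gene_weights.get(v["gene"], 0.0)
                   * (v["ea"] / 100.0) * (1.0 - v["maf"]))
        contributions[v["variant_id"]] = contrib
        score += contrib
    # the variant contributing most to the final score is reported as causal
    top = max(contributions, key=contributions.get) if contributions else None
    return score, top
```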
Group 7–2 models
Our approach integrated variant pathogenicity prediction with gene-disease association inference to assign weights to each variant found in a patient’s gene panel, indicative of their ability to cause a given phenotype. These combined prediction scores were then used as features in supervised, phenotype-specific random forest models to make patient-level predictions, as well as more directly to identify likely causal variants. This preliminary approach explored the feasibility of building generalizable, end-to-end data-driven methods for clinical genome interpretation.
First, we annotated the challenge dataset using ANNOVAR (Wang et al., 2010) and grouped variants into seven categories based on these annotations: exonic (missense only), intronic, ncRNA_intronic, UTR3, ncRNA_exonic, UTR5, and splicing. Other classes of variants were excluded. For each gene in the panel, we defined seven variant-level features as the maximum pathogenicity prediction score within each of these categories. For missense variants in exonic regions, we included only rare variants (allele frequency ≤ 1% in gnomAD) and used MutPred2 (Pejaver et al., 2020) for Submission 1 and REVEL (Ioannidis et al., 2016) for Submission 2. For variants in non-exonic regions, we used LINSIGHT (Huang et al., 2017). To infer gene-disease associations, we used literature-based scores from the DISEASES database (Pletscher-Frankild et al., 2015). Specifically, we downloaded the DISEASES Z-scores and min-max normalized them (over the full database) to construct a gene-phenotype matrix. We then multiplied the variant-level features by these gene-level features to construct the final patient-level feature matrix, containing combined scores for the seven categories of variants for all genes in the panel.
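A sketch of the patient-level feature construction, assuming simple dict/list inputs (all names are ours, not the team's):

```python
import numpy as np

CATEGORIES = ["exonic", "intronic", "ncRNA_intronic", "UTR3",
              "ncRNA_exonic", "UTR5", "splicing"]

def patient_features(variants, gene_disease, genes, phenotype):
    """Per-gene, per-category max pathogenicity score, multiplied by the
    min-max normalized DISEASES gene-phenotype score, then flattened."""
    feats = np.zeros((len(genes), len(CATEGORIES)))
    for v in variants:  # dicts with gene, category, pathogenicity score
        i, j = genes.index(v["gene"]), CATEGORIES.index(v["category"])
        feats[i, j] = max(feats[i, j], v["score"])
    gene_scores = np.array([gene_disease.get((g, phenotype), 0.0) for g in genes])
    return (feats * gene_scores[:, None]).ravel()  # combined variant x gene scores
```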
To train the phenotype prediction models, we used the ID-panel challenge dataset from CAGI5 as a training set and applied the same pre-processing and feature extraction steps as above. Individual random forest models were trained for each phenotype, using 80% of the samples for training and 20% for validation. During training, a grid search was applied to identify optimal parameters for each phenotype. For the primary task of the challenge, the output scores of these models were taken as the probability of a phenotype for a given patient. For the secondary task of predicting disease-causing variants for each phenotype, we selected only ultra-rare exonic variants (gnomAD allele frequency ≤ 2 × 10⁻⁵) and ranked them by the combined variant- and gene-level scores to obtain the most likely causal variant.
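A minimal scikit-learn sketch of this per-phenotype training setup (the grid values are illustrative, not the team's actual search space):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def fit_phenotype_model(X, y):
    """One random forest per phenotype: 80/20 split with a grid search
    over forest hyperparameters."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
    search.fit(X_tr, y_tr)
    print("validation accuracy:", search.score(X_val, y_val))
    # predict_proba on the best estimator yields the phenotype probability
    return search.best_estimator_
```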
Group 8–2 models
The prediction method combined ranking of sample variants with phenotype-gene correlations. Briefly, impactful variants were shortlisted through multiple filters. First, each variant was assigned a gene-independent score using VPR (Variant Prioritization), an in-house tool based on MAF (Minor Allele Frequency), evolutionary conservation, in silico deleteriousness predictions, and reported disease associations. Cut-off scores for variants in each gene were derived from the percentile ranking of similarly scored variants in ClinVar data; only variants scoring above the 60th percentile for the same gene in ClinVar were retained. Additional quality filtering retained only variants classified as 'high' or 'medium' quality (Damiati et al., 2016). Next, phenotype-gene linking was done by querying MEDLINE abstracts and articles (Rao et al., 2020). Finally, for each sample, variants from the filtered subset were ranked by novelty or rarity in the cohort, quality, and variant score, and grouped by gene. Gene scores were then computed as the average of the two highest-scoring variants (recessive model) or the highest-scoring variant (dominant model). The probability of a phenotype or phenotypes being linked to a sample was based on the ranked gene scores.
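The gene-scoring rule reduces to a small function; a sketch (function and parameter names are ours):

```python
def gene_score(variant_scores, model="dominant"):
    """Score one gene from its variant scores: the highest-scoring variant
    under a dominant model, or the mean of the two highest under a
    recessive model."""
    ranked = sorted(variant_scores, reverse=True)
    if model == "recessive" and len(ranked) >= 2:
        return (ranked[0] + ranked[1]) / 2.0
    return ranked[0] if ranked else 0.0
```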