CAGI6 ID-Challenge: Assessment of phenotype and variant predictions in 415 children with Neurodevelopmental Disorders (NDDs)

In the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID-challenge offering the opportunity to develop computational methods for predicting patients' phenotypes and causal variants. Eight research teams with 30 models had access to phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions with onset in infancy. In this study, we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes, based on the clinical features described for each patient, and causal variants. Finally, we asked participants to develop a method to find possible new genetic causes for patients without a genetic diagnosis. As in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (causative, putative pathogenic, and contributing factors) were provided. Given the overall clinical presentation of our cohort, we released the variant data and phenotypic traits of the 150 patients from the CAGI5 ID-challenge as training and validation data for the development of prediction methods.


Introduction
Neurodevelopmental disorders (NDDs) are a class of disorders that affect brain development and function, characterized by significant genetic and clinical variability. Children with NDDs exhibit cognitive, behavioral, and motor developmental delays.
NDDs include conditions such as autism spectrum disorder (ASD), intellectual disability (ID), attention deficit hyperactivity disorder, epilepsy, and motor disorders (Morris-Rosendahl & Crocq, 2020; Parenti et al., 2020). Multiple NDDs co-occur with brain size abnormalities, such as microcephaly and macrocephaly (Ritchie & Lizarraga, 2023). A combination of two or more of these disorders is frequently reported in patients as comorbidities, which share common functional pathways (Parenti et al., 2020).
The etiology of NDDs is associated with various genetic alterations, including chromosomal rearrangements, copy number variations, small insertions or deletions, and point mutations. Currently, the most common molecular diagnostic practice involves the use of different next-generation sequencing (NGS) approaches, such as targeted gene panels, whole exome sequencing (WES), and whole genome sequencing (WGS). Computational approaches have become crucial for the analysis of data generated by these technologies, enabling the prediction of a patient's phenotype from their genotype and the identification of causal variants among millions of others. However, due to the genetic and clinical complexity of NDDs, a considerable number of children still lack a molecular diagnosis. Deciphering and analyzing the enormous amount of data produced by WES or WGS, including genes not yet associated with the disease, is a challenge (Sun et al., 2015). Cost-effective gene panels have been widely introduced in routine clinical genetic diagnostics, where the analysis of genetic data is limited to the selected genes. However, even in this case, many patients will have one or more novel variants that have never been detected before. The classification of novel DNA variants is a difficult and incompletely solved problem. Nevertheless, for the progress of precision medicine, clinicians and scientists must be able to draw conclusions regarding disease susceptibility and drug effectiveness from genetic information. The interpretation of genetic data stands as a significant challenge in the practical application of precision medicine. Similar to the ID-challenge in CAGI5, which involved genetic data obtained from a panel of 74 genes applied to a cohort of 150 pediatric patients, CAGI6 expanded the cohort to 415 patients. Predictors had two primary tasks: (a) predict the phenotypes and (b) predict one or more causal variants that explain the patient's disease phenotype.
More broadly, the challenge aimed to encourage the development of prediction methods that are as accurate as possible and usable in future clinical practice, starting from the phenotype to identify its genetic cause or vice versa. This is especially important for complex and heterogeneous disorders.
The assessment was performed considering the clinical notes specific to each patient, collected from geneticists, and the candidate variants (pathogenic and potentially pathogenic) identified by the Padua NDD lab through targeted gene-panel analysis and submitted to the CAGI Special Issue in Human Genetics. Additionally, considering the amount of genetic data and the remaining undiagnosed cases, the assessors took into account the variants predicted by the various groups to assess whether some overlooked variants may indeed play a key role in the patient's phenotype (new findings).

Challenge format
The CAGI6 ID-challenge was divided into two tasks. First, teams were asked to predict each patient's clinical phenotype across seven broad phenotypic traits. Second, teams were asked to predict one or more causal variant(s) based on data derived from the customized gene-panel sequencing of 415 pediatric patients.
The phenotypic traits for each patient are based directly on information provided by clinicians. Each patient can have one or more phenotypic traits.
Sequencing data were provided as VCF files by the Laboratory of Molecular Genetics of Neurodevelopmental Disorders, University Hospital of Padua (Italy), and were linked to a specific patient. The VCF files contain exons and flanking intronic regions of 74 genes, sequenced with the Ion Torrent PGM platform and processed with the Ion Torrent Suite v5.0 software, as previously described. Further information on sequence data processing is available in the VCF files (e.g., genotype quality, coverage, or called genotype). The variants had not been filtered; thus, the VCF files may contain sequencing errors, which should be excluded by evaluating sequencing and genotype quality parameters.
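As an illustration of the quality filtering suggested above, the sketch below reads per-sample genotype fields from a VCF data line and applies minimal cutoffs. The FORMAT keys (GT:GQ:DP) follow the VCF specification, but the thresholds (GQ >= 20, DP >= 10) are illustrative assumptions, not the challenge's official values.

```python
# Minimal sketch: drop VCF records whose genotype quality (GQ) or
# read depth (DP) falls below assumed thresholds.

def sample_fields(format_col, sample_col):
    """Map FORMAT keys (e.g. 'GT:GQ:DP') to the sample's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))

def passes_quality(vcf_line, min_gq=20, min_dp=10):
    """True if a VCF data line meets the (assumed) GQ and DP cutoffs."""
    cols = vcf_line.rstrip("\n").split("\t")
    fields = sample_fields(cols[8], cols[9])  # FORMAT column, first sample
    return (int(fields.get("GQ", 0)) >= min_gq
            and int(fields.get("DP", 0)) >= min_dp)

good = "1\t155205634\t.\tG\tA\t50\tPASS\t.\tGT:GQ:DP\t0/1:99:154"
bad = "1\t155205700\t.\tC\tT\t10\tPASS\t.\tGT:GQ:DP\t0/1:5:3"
print(passes_quality(good), passes_quality(bad))  # True False
```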
The descriptions of the seven disease phenotypes, the 74 gene identifiers, the gene captured regions in browser extensible data (BED) format, a submission template, and a submission validation script were also provided, as reported in Aspromonte et al. (Aspromonte MC et al., 2023) in this CAGI6 Special Issue.
Training dataset for CAGI6 ID-Challenge
Before the CAGI6 challenge, the eight participating teams also had access to a training dataset (sequencing data and patients' clinical information from the CAGI5 ID-panel). This dataset has been published and includes the phenotypes and variants of 150 pediatric patients with NDDs. The predictors also had access to the workflow for variant filtering, interpretation, and classification (Aspromonte et al., 2019).

Phenotype prediction Assessment
The evaluation of predictive ability in this assessment primarily concentrated on examining the performance of the various submissions across the different disease phenotypes. This problem can be framed as a two-class prediction problem (Fawcett, 2006), an assessment approach that has proven successful across different domains. It offers the advantage of simplifying the assessment process by enabling a comparison of the performance of different methods for each individual phenotype, in contrast to evaluating them on the entire predicted class matrix (415 x 7), which consists of one prediction for each patient and phenotype.
Predicted disease classes for each submission were assessed against the clinical phenotypes given in the Padua NDD lab dataset (ground truth), using the procedure described below. If the predictors failed to offer a probability value and left an asterisk in the template file, it was regarded as a probability of zero during the evaluation.
The assessment starts, separately for each phenotype column, with the conversion of the submitted probability values into binary classes: positive (1) or negative (0). We determined the threshold probability value that maximized the Matthews correlation coefficient (MCC) for that particular phenotype. Subsequently, we compared all probability values for each phenotype with the corresponding threshold and assigned a value of 0 or 1 depending on whether the probability was lower or higher than the threshold, respectively. Furthermore, we employed various performance measures to evaluate the predictions for each phenotype. Sensitivity and specificity were used to assess a model's ability to detect positive cases and discriminate between positive and negative classes. The MCC, accuracy (ACC), and F1 measures were employed to evaluate both negative and positive predictions simultaneously. Notably, the MCC has been demonstrated to be less influenced by imbalanced datasets (Vihinen, 2012), which is relevant in this challenge, where some phenotypes are completely imbalanced.
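The threshold selection described above can be sketched as follows: every distinct submitted probability is tried as a candidate threshold, and the one maximizing the MCC for that phenotype is kept. This is a generic re-implementation of the stated procedure, not the assessors' actual code.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_threshold(probs, labels):
    """Scan the submitted probabilities as candidate thresholds and keep
    the one maximizing the MCC for this phenotype."""
    best_t, best_m = 0.5, -1.0
    for t in sorted(set(probs)):
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        m = mcc(tp, tn, fp, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m

print(best_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # (0.8, 1.0)
```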
To visualize the trade-off between the true positive rate and false positive rate, we generated receiver operating characteristic (ROC) curves by comparing the experimental and predicted probability values for each phenotype. To ensure robustness, we employed 1000 iterations of bootstrap to produce the ROC and precision-recall (PR) curves and calculate the corresponding AUC values (Schmidt et al., 2014). This resampling technique allows us to assess the stability and reliability of the performance estimates, especially for unbalanced datasets, and enhances the statistical validity of our results.
To compare results among predictors, we computed z-scores at the phenotype level from the ROC AUC values obtained in the bootstrap iterations. This statistical measure standardizes the values, enabling a meaningful comparison across predictors.
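A minimal sketch of this pipeline, under the assumption of a rank-based AUC and simple resampling with replacement (the assessors' exact implementation is not specified):

```python
import random

def roc_auc(scores, labels):
    """Rank-based ROC AUC: probability that a positive outscores a
    negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_aucs(scores, labels, n_iter=1000, seed=42):
    """Resample patients with replacement and recompute the AUC each time,
    skipping resamples that lack one of the two classes."""
    rng = random.Random(seed)
    n, aucs = len(scores), []
    while len(aucs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            aucs.append(roc_auc([scores[i] for i in idx], ys))
    return aucs

def z_scores(values):
    """Standardize per-predictor AUC values for cross-predictor ranking."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]
```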

Variants prediction assessment
As part of this challenge, predictors were also evaluated on their capacity to establish causal associations between individual patients and the single nucleotide variants (SNVs) present in the provided VCF files. We can treat this as a multi-label classification problem, as each patient can have one or more variants selected by the Padua NDD lab, and predictors were also allowed to identify zero or more variants per patient. The Padua NDD lab classified the selected variants into three classes, depending on their association with the disease. If a patient's phenotype aligns with a Mendelian disorder, variants labeled as pathogenic are considered the direct cause of the patient's phenotypic manifestations and are thus classified as "disease causing" (DC). On the other hand, variants deemed likely pathogenic (LP), or "putative", need additional examination before being classified as causative. Uncommon or new variants that are predicted to be pathogenic and affect genes associated with a higher risk of autism were categorized as potential contributing factors (CF), as they alone are insufficient to trigger the disease. For the assessment, we used a confusion matrix and the metrics that can be extracted from it, such as accuracy and recall. True positives (TP) are cases where the predictor assigned a variant to a patient that matches one of the variants associated with that patient. False positives (FP) occur when the predictor incorrectly assigns a variant to a patient; false negatives (FN), on the other hand, represent cases where the predictor fails to identify a variant that should have been associated with a patient. True negatives (TN) are not evaluated, as predictors only output variants that should be associated with the patient. Because of this, the accuracy metric is calculated as the ratio of correctly predicted variants over all available variants.
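The counting scheme above can be sketched with per-patient variant sets. Since TN are undefined, the "accuracy" below is taken as TP over all variants in play (TP + FP + FN); this denominator is one reading of "all available variants" and is an interpretation, not the assessors' exact code.

```python
def variant_metrics(predicted, truth):
    """predicted / truth: dicts mapping patient ID -> set of variant IDs.
    TN are not counted; recall and an accuracy without TN are returned."""
    tp = fp = fn = 0
    for patient, true_vars in truth.items():
        pred_vars = predicted.get(patient, set())
        tp += len(pred_vars & true_vars)   # matched variants
        fp += len(pred_vars - true_vars)   # wrongly assigned variants
        fn += len(true_vars - pred_vars)   # missed lab-assigned variants
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "accuracy": tp / (tp + fp + fn) if tp + fp + fn else 0.0}

truth = {"P1": {"varA"}, "P2": {"varB", "varC"}}
pred = {"P1": {"varA", "varX"}, "P2": {"varB"}}
print(variant_metrics(pred, truth))
```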

Prediction methods
Eight teams submitted predictions, using a total of 30 models (Table 1). The methods used by each team are summarized in Table 2 and described in detail below. Two of the participating teams, team 6 and team 7, participated in the ID-Challenge during the previous CAGI edition and were evaluated in the CAGI5 assessment (Carraro et al., 2019).

Group 1 − 6 models
Variant annotation, prioritization, and analysis were performed on over 100,000 single nucleotide variants (SNVs) across the 415 Variant Call Format (VCF) files. We used the software EVIDENCE (Seo et al., 2020) and followed the ACMG guidelines for the classification process. Rare variants (gnomAD frequency < 5%) were annotated using the Ensembl Variant Effect Predictor (VEP) and prioritized based on gene function, inheritance pattern, and relevance in disease databases (OMIM, ClinVar, and UniProt). Common SNVs (gnomAD frequency > 5%) were filtered out. For phenotype matching, patient-gene matrix scores and HPO terms were utilized. Six different models for prediction and analysis were used.
Model 1 utilized a random forest approach with a polygenic risk scoring concept, considering multiple variants contributing to a single disease. Model 2 involved a random forest model with hyperparameter tuning, modifying options to optimize the model's performance. Model 3 focused on variants affecting canonical transcripts, training random forest models with their pathogenicity scores. Model 4 used prior probabilities and odds calculated from pathogenicity scores to compute posterior probabilities. Model 5 implemented a simple scoring method based on an in-house database, assigning weight scores based on variant-phenotype relevance. Model 6 identified an optimal threshold for pathogenicity scores to improve machine learning model training.
Each model was applied to the 415 VCF files, and their respective approaches and findings were summarized for the seven phenotypes. We explored and optimized prediction methods for variant analysis and classification, considering the influence of different factors such as polygenic risk, hyperparameter tuning, transcript selection, prior probabilities, the in-house database, and threshold optimization.

Group 2-6 models
Based on the provided BED file and the hg19 genome sequence, we extracted the DNA sequences of the 74 genes. The mutated sequences for each patient were then derived from the VCF files. To obtain sequence-wise representations of the genes, we utilized DNABERT (Ji et al., 2021) with k-mer tokenization, ultimately settling on k = 6. Next, we constructed a graph representation for each sample, where the genes served as nodes. The adjacency matrix was defined by a 6-channel gene-gene interaction (GGI) network, as applied in (Karimi et al., 2020). We then employed a graph convolutional network (GCN) (Kipf & Welling, 2017) for message passing and a learnable weighted summation layer for the graph-wise representations. Finally, we used a multilayer perceptron (MLP) with a 7-dimensional sigmoid layer to predict the labels of the target diseases. The provided labeled samples were randomly divided into training and test sets at a 7:3 ratio. We tuned the models by evaluating test performance and developed six versions with distinct configurations. For each version, we identified the top-20 variants based on the attention weights as candidate variants. We tried various combinations as ensemble models and determined the causal variants by voting based on their frequency among the candidates. The results of the top six ensemble models were submitted.
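The k-mer tokenization mentioned above is the standard input format for DNABERT: a DNA sequence is split into overlapping k-mers with stride 1. A minimal sketch, with k = 6 following the group's reported choice:

```python
def kmer_tokenize(seq, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (stride 1),
    the tokenization DNABERT expects."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGCGTAC"))  # ['ATGCGT', 'TGCGTA', 'GCGTAC']
```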

Group 3 − 2 models
Six genes were excluded as they are associated with all seven phenotypes of interest. The gene-phenotype associations were used to predict the phenotypic effect of the putative causative variants, and we consequently assigned each patient a binary score (unaffected/affected) for each possible phenotype.

Group 5-5 models
To predict the causal variant(s) and clinical phenotypes from the genomic data of patients, we first preprocessed the data, applied a quality control analysis to the variants generated from the sequencing data, and filtered out low-depth genotypes. Next, we annotated the variants with VEP (McLaren et al., 2016) and precalculated CADD scores (Rentzsch et al., 2019). For the prediction of clinical phenotypes, we used two different methods: buckets and Evolutionary Scale Modeling (ESM) protein embeddings. The "buckets" or "genomic bins" representation groups variants by genomic location, assigning them to regions of a vector with a predefined size. We used a one-dimensional binary vector to represent regions, where "one" indicates the presence of a variant in that region. We trained different neural networks for each representation and for each phenotype separately, using leave-one-out cross-validation.
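The "genomic bins" representation can be sketched as follows. The equal-width binning and the bin count are assumptions for illustration; the group's exact binning scheme is not specified.

```python
def bucket_vector(variant_positions, region_start, region_end, n_bins):
    """One-dimensional binary 'genomic bins' vector: entry i is 1 if any
    variant falls into the i-th equal-width bin of the region."""
    width = (region_end - region_start) / n_bins
    vec = [0] * n_bins
    for pos in variant_positions:
        if region_start <= pos < region_end:
            # clamp to the last bin to guard against floating-point edge cases
            vec[min(int((pos - region_start) / width), n_bins - 1)] = 1
    return vec

print(bucket_vector([5, 95], 0, 100, 10))  # bins 0 and 9 are set
```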
For the causal variant prediction, we averaged the ESM protein representations corresponding to a single gene. We then used binary vectors with ClinVar classifications and CADD scores. We trained a model using, as the positive set, variants reported as causative, putative, or contributing factors; the rest were treated as negative instances. We used the Synthetic Minority Oversampling Technique (SMOTE) to handle the highly imbalanced training data. We reported the top three predicted variants with the highest prediction scores.
Group 6 − 6 models
Our team developed six automated scoring systems for the needs of this challenge. All models used the Evolutionary Action method. Models 4, 5, and 6 computed polygenic association scores using genetic variants grouped by gene, Evolutionary Action score, and MAF in the training set. We did not use any known gene-phenotype associations for these models. Model 5 was trained for male and female patients separately. Model 6 excluded training samples without trait associations. For all six models, the variants with the strongest contributions to a sample's final score were predicted as causal.

Group 7 − 2 models
Our approach integrated variant pathogenicity prediction with gene-disease association inference to assign weights to each variant found in a patient's gene panel, indicative of their ability to cause a given phenotype. These combined prediction scores were then used as features in supervised, phenotype-specific random forest models to make patient-level predictions, as well as more directly to identify likely causal variants. This preliminary approach explored the feasibility of building generalizable, end-to-end data-driven methods for clinical genome interpretation.
First, we annotated the challenge data set using ANNOVAR (Wang et al., 2010) and grouped variants into seven categories based on these annotations: exonic (missense only), intronic, ncRNA_intronic, UTR3, ncRNA_exonic, UTR5, and splicing. Specifically, we downloaded DISEASES Z-scores and min-max normalized them (over the full database) to construct a gene-phenotype matrix. We then multiplied the variant-level features with these gene-level features to construct the final patient-level feature matrix, containing combined scores for the seven categories of variants for all genes in the panel.
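The normalization and feature combination described above can be sketched as follows. The dictionary shapes are illustrative assumptions; the group's actual matrices cover all panel genes and seven variant categories.

```python
def min_max(values):
    """Min-max normalize a list of scores (e.g. DISEASES z-scores over
    the full database) into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def combined_scores(variant_features, gene_weights):
    """Multiply a patient's per-gene variant-level feature by the
    normalized gene-phenotype weight to get the patient-level feature."""
    return {g: variant_features[g] * gene_weights.get(g, 0.0)
            for g in variant_features}

# Hypothetical toy example: two panel genes, one phenotype column.
print(combined_scores({"SCN1A": 0.8, "MECP2": 0.5},
                      {"SCN1A": 1.0, "MECP2": 0.5}))
```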
To train phenotype prediction models, we used the ID panel challenge data set from CAGI 5 as a training set and applied the same pre-processing and feature extraction steps as above. Individual random forest models were trained for each phenotype, using 80% of the samples for training and 20% for validation. During the training process, a grid search was applied to identify optimal parameters for each phenotype. For the primary task of the challenge, the output scores from these models were taken as the probability of a phenotype for a given patient. For the secondary task of predicting disease-causing variants for each phenotype, we selected only ultra-rare exonic variants (gnomAD allele frequency ≤ 2 × 10⁻⁵) and ranked them based on the combined variant- and gene-level scores to obtain the most likely causal variant.
Group 8 − 2 models
The prediction method combined ranking of sample variants with phenotype-gene correlations. Briefly, impactful variants were shortlisted through multiple filters. First, each variant was assigned a gene-independent score using VPR (Variant Prioritization), an in-house tool based on MAF (minor allele frequency), evolutionary conservation, in silico deleteriousness predictions, and reported disease associations. Cutoff scores for variants in each gene were derived from the percentile ranking of similarly scored variants in ClinVar data; only variants with scores above the 60th percentile for the same gene in ClinVar were retained. Additional quality filtering was performed to retain only variants classified as 'high' or 'medium' quality (Damiati et al., 2016). Next, phenotype-gene linking was done by querying MEDLINE abstracts and articles (Rao et al., 2020). Finally, for each sample, variants from the filtered subset were ranked by novelty or rarity in the cohort, quality, and variant score, and grouped by gene. Gene scores were then computed as the average of the two highest-scoring variants (recessive model) or the highest-scoring variant (dominant model). The probability of a phenotype or phenotypes being linked to a sample was based on the ranked gene scores.
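The gene-scoring step can be sketched as follows; the fallback to the single score for one-variant genes under the recessive model is an assumption, as the text does not cover that case.

```python
def gene_score(variant_scores, model="dominant"):
    """Per-gene score from a gene's ranked variant scores: the top score
    under a dominant model, the mean of the top two under a recessive
    model (a single-variant gene falls back to its only score)."""
    ranked = sorted(variant_scores, reverse=True)
    if model == "recessive" and len(ranked) >= 2:
        return (ranked[0] + ranked[1]) / 2
    return ranked[0]

print(gene_score([0.9, 0.7, 0.2]))               # dominant: top score
print(gene_score([0.9, 0.7, 0.2], "recessive"))  # mean of top two
```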

Results
The CAGI6 Intellectual Disability (ID) panel challenge was designed to test: 1) the ability of computational methods to predict comorbid phenotypes from targeted gene-panel data, 2) the accuracy of computational methods in predicting causal variants from a set of real genetic data, and 3) the effectiveness of bioinformatics algorithms in improving gene-panel data analysis for clinical practice. Additionally, groups could identify variants that were not selected by the Padua NDD Lab and use them to genetically diagnose individuals who had not received a diagnosis.

Overview of the Phenotype Prediction Assessment
The first part of the ID-challenge in CAGI6 was to predict, for each patient, the phenotype among the seven assigned clinical features. Figure 1 shows the number of groups that accurately predicted a patient's phenotype when it was actually present (true positives). All groups correctly predicted ID in 338 (49.4%) out of 352 patients, while 7 groups accurately predicted 90.8% of patients with ID. A small number of patients (N = 45) showed microcephaly as a phenotype.
Eight groups correctly identified the presence of this trait in 55% of individuals, and at least 7 groups correctly identified microcephaly in 93.3% of the patients.
The overall submission performance was assessed using the AUC for each phenotype, the maximum MCC values (Fig. 2A and B), and ROC curves (Fig. 3).
Looking at the different phenotypic traits ( Fig. 2A and B), the ID phenotype was the easiest to match for SID#8.6 and SID#8.1, followed by SID#2.4 and SID#5.3. We notice that SID#8.6 accurately predicted the phenotype for 255 out of 352 patients, displaying a positive correlation with the available clinical data and achieving an F1-score of 0.85.
The second most prevalent trait in our cohort is ASD, reported by clinicians in 202 out of 415 pediatric patients. In contrast to the previous CAGI5 (Carraro et al., 2019), we noticed some differences in the phenotype predictions. In detail, all phenotypes were predicted by at least one team (Fig. 1). The prediction of the epilepsy phenotype in CAGI6 exhibited superior performance, as evidenced by a mean MCC value of 0.09, nearly twice the previous value. Notably, SID#1 attained the highest results, achieving an AUC of 0.58 and an MCC of 0.12, with an F1 score of 0.38 for SID#4 (Fig. 2A and B). These performances, considering all submissions and groups, stand out as the best.
No improvement was observed for microcephaly and macrocephaly, but it is important to note that the CAGI5 dataset included only 18 patients with microcephaly and 12 with macrocephaly; in the larger CAGI6 cohort, more than double the patients were reported with these specific phenotypes (Table 3). Nevertheless, certain submissions accurately predicted these phenotypes, such as SID#6.5 for microcephaly, achieving an AUC of 0.64 and a recall of 0.67, and SID#1.5 for macrocephaly, achieving an AUC of 0.65 and a recall of 0.3 (Fig. 3). Note (Table 3): each variant is specific to each patient, and one patient can be associated with more than one phenotype. The percentage for each variant class was calculated by considering the number of variants detected and classified as DC, P, or CF out of the total number of cases exhibiting that specific phenotypic trait. Abbreviations: ID, intellectual disability; ASD, autism spectrum disorder. We reported that 71 patients out of 415 were affected by hypotonia. The ataxia phenotype was observed in a group of 30 patients (Table 3), while 285 patients did not exhibit any signs of ataxia. SID#5.3 was the highest-performing model, with a z-score of 2.48, an AUC of 0.66, and an F1 score of 0.23; this result is consistent with the previous assessment. However, fewer submissions achieved an AUC above 0.65 compared with the previous evaluation. Considering the average z-score based on the ROC AUC for each phenotype, the overall submission ranking for each phenotype is reported in Table 4 (overall z-score, mean z-score, and ranking among phenotypes by each submission; the best values for each category are highlighted in bold). We then considered a subset of patients, accounting for 52.9% of the total cohort (Table 3).
Consequently, our analysis encompassed a broader range of patients with ID, with predictions from at least one predictor in all groups accurately identifying 129 patients (85.4%). Likewise, for patients diagnosed with ASD, the predictive coverage extended to 69 patients (81.1%) (Fig. 4).
Considering this smaller subset, there were changes in the overall ranking as determined by the z-score. SID#1 achieved the top three positions, while SID#6 moved to rank 8. Overall, the inclusion of this subset led to a 2.9% improvement in the AUC scores when considering all submissions and phenotypes (Table 5).

Variants Prediction Assessment
The second part of the CAGI6 challenge was to predict the variants associated with each patient's phenotype. Figure 5 shows the variants predicted by each submission that fall into one of the three groups considered by the laboratory, namely causative, putative, and contributing factors (CF). The first three bars of the plot show the total number of variants detected in the entire cohort with the corresponding classification.
Considering the most clinically significant category, causative variants, SID#8.1 and SID#8.6 predicted the majority of them (54 out of 60), followed by four other groups (6, 4, 3, 7). SID#8, SID#3, and SID#7 correctly predicted the highest numbers of putative and contributing factor variants. Variants classified as contributing factors may not fully meet the main criteria for classifying pathogenic variants, but they were identified in genes associated with autism susceptibility. Compared with the CAGI5 challenge, a major improvement can be seen, with the coverage of causative variant predictions rising from 64% to 90% for the respective best models. Putative and contributing factor variants also improved substantially, from 66% to 79% and from 69% to 76%, respectively.
In Fig. 6, we present the proportion of each mutation class predicted by the different groups. All causative variants were predicted by at least two groups (violet), and a small number of them (3 out of 60) were predicted by all groups (green). Likewise, all putative variants were predicted by at least one group, except for the synonymous variant p.Asn839Asn in CNTNAP2, which was not prioritized by any group. The majority of the causative variants were predicted by at least 4 or 5 groups, whereas the putative variants were predicted by 3 or 4 groups. Contributing factor variants were overall very sparsely predicted: 36% of them were predicted by only one group (29 variants) or not predicted at all (8 variants). No putative or contributing factor variant was predicted unanimously across all groups.
Based on the data presented in Table 6, SID#8.6 emerges as the most proficient model at capturing a wide range of mutations, exhibiting a recall of 82%. However, its accuracy of 58% is comparatively lower, implying a significant number of false positives. On the other hand, for accurate prediction, submission SID#6.2 surpasses all other models with an accuracy of 72.4%, although it exhibits a lower recall of 35%. This outcome can be attributed to the findings depicted in Fig. 5, which highlight the good performance of SID#6 in predicting causative variants but its limitations in identifying putative and contributing variants. During the assessment of the ID-challenge, it was surprising to observe that certain variants, which we defined as "difficult to predict", were identified by multiple groups (Supplementary Table S1). Among these variants we can recognize three categories: variants with sequencing parameters indicating possible technical errors, variants with discordant pathogenicity predictions among computational tools, and variants affecting intronic sequences far from the canonical intron-exon junctions.
As described in (Aspromonte et al., 2019), the initial step of our analysis involved filtering variants based on sequencing parameters and quality. We report two variants that were confirmed as causative after Sanger validation and segregation analysis: p.(Arg504Gln) in GRIN2A and p.(Pro1585SerfsTer38) in SHANK2. In particular, the variant in GRIN2A was confirmed to be a somatic mosaicism (predicted by SID#1, 3, 7, and 8), while the variant in SHANK2 is a frameshift deletion with sequencing parameters that gave the impression of a sequencing error (predicted by SID#4 and SID#8).
To prioritize rare missense variants, one important step is to assess the variant's effect with computational methods.
Our workflow includes a filtering step based on the consensus of the pathogenicity scores obtained from 12 different tools and the CADD score (> 25). Three of the novel missense variants we identified, in the PTCHD1, GATAD2B, and ASH1L genes, did not pass these filtering criteria. However, we demonstrated their relevance to the disease through segregation analysis, X-inactivation, and in silico evaluation (Aspromonte MC et al., 2023).
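The prioritization step described above can be sketched as a simple predicate. The paper specifies only "consensus of 12 tools" and CADD > 25; the majority cutoff (at least 7 of 12 tools agreeing) is an assumed reading of "consensus", not the lab's documented rule.

```python
def passes_prioritization(tool_deleterious_calls, cadd,
                          min_agreeing=7, cadd_cutoff=25):
    """tool_deleterious_calls: one boolean per pathogenicity tool
    (12 in the workflow). Keep the variant only if enough tools agree
    it is deleterious AND the CADD score exceeds the cutoff."""
    return (sum(tool_deleterious_calls) >= min_agreeing
            and cadd > cadd_cutoff)

print(passes_prioritization([True] * 8 + [False] * 4, 30))  # True
```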

Re-evaluation and classification of predicted variants
One of the objectives of CAGI6 was to detect variants that might not have been identified in the Padua NDD lab's variant analysis but could still play a role in the patients' phenotypes. The Padua NDD lab revised more than 8000 variants, including 3016 exonic variants, 4520 intronic variants, 7 splicing variants, and 137 variants in the 5'/3'-UTRs (untranslated regions), that were indicated by the predictors as associated with at least one of the patient's specific phenotypes. Numerous variants were excluded mainly because they are highly prevalent in our cohort or in the general population (gnomAD).
Additionally, some of them were disregarded due to being classified as sequencing errors (Supplementary Fig. S1). Focusing on rare variants, some of them were reconsidered for Sanger validation, in silico or functional analysis.
In particular, for patient UNIPD_0215, the SID#1 and SID#8 teams selected the synonymous variant c.240G > A (p.Leu80Leu) in the AP1S2 gene. Patient UNIPD_0215 was referred for gene-panel analysis at the Padua NDD lab with suspected Smith-Magenis syndrome. This girl presented with developmental delay, ASD, severe ID, ataxia, and some dysmorphisms (e.g., synophrys, large mouth). In addition, clinicians reported oppositional behavior and poor impulse control. Another important feature was an altered MRI with a mega cisterna magna and periventricular ischemic dilatation of the tetraventricular system. These characteristics closely match the syndrome associated with AP1S2, known as Pettigrew syndrome (MIM #304340), a severe X-linked condition that predominantly affects males, although female carriers are generally asymptomatic. Upon further analysis with Human Splicing Finder, we found that this variant can alter splicing mechanisms, and we therefore re-classified it as potentially pathogenic (Aspromonte MC et al., 2023). However, further analyses, such as segregation and transcript analysis, are necessary to establish its pathogenic role.

Discussion
In this study, we have reported the assessment of the new ID-challenge created for the sixth edition of the CAGI (Critical Assessment of Genome Interpretation). The NDDs Laboratory of Padua and the Biocomputing UP group from the University of Padua previously collaborated for the CAGI5 edition as providers and assessors of the ID-challenge. The main substantial difference between the two editions lies in the sequencing data and clinical information provided for a larger cohort (N = 415) of pediatric patients with NDDs, and in the number of teams and submissions that participated in this challenge (N° teams = 8; N° submissions = 30). Two of the teams (SID#6 and SID#7) also actively participated in the CAGI5 edition of the ID-challenge. Before starting the CAGI6 ID-challenge, the predictors had access to the 74-gene panel data and clinical descriptions of the CAGI5 cohort (N = 150) to train their prediction methods and find the best strategies. The predictors were asked to use the sequencing data of the 74 genes to predict each patient's phenotype and to indicate which variants inferred the phenotype. The CAGI6 participants employed various methods for phenotype prediction, highlighting the diverse approaches to building disease classifiers. The majority of the methods created a gene-phenotype association matrix using known associations derived from disease databases (e.g., OMIM, HPO, GWAS) or using tools that combine several sources to associate disease terms with genes (e.g., Phenolyzer, DisGenet, PhenPath, DISEASE). Some groups curated the gene-disease list using text mining on PubMed, and others built the gene-disease matrix from the CAGI5 training dataset. To build the classifier, some methods explored the correlation between the prioritized variants and the gene-disease matrix, while others adopted a range of machine learning approaches such as graph convolutional networks, random forests, and convolutional neural networks.
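As a minimal illustration of the matrix-based strategy described above, a patient's phenotypes can be scored by counting how many of their prioritized genes are associated with each phenotype term. This is a sketch only; the gene and phenotype names below are hypothetical examples, not data from the challenge or any team's actual method.

```python
def phenotype_scores(patient_genes, gene_pheno):
    """For each phenotype term, count how many of the patient's
    prioritized genes are associated with it in a gene-phenotype map
    (e.g., one built from OMIM/HPO associations)."""
    scores = {}
    for gene in patient_genes:
        for pheno in gene_pheno.get(gene, ()):
            scores[pheno] = scores.get(pheno, 0) + 1
    return scores

# Toy gene-phenotype map with hypothetical associations:
gene_pheno = {
    "SCN1A": {"epilepsy", "ID"},
    "AP1S2": {"ID", "hypotonia"},
}
scores = phenotype_scores(["SCN1A", "AP1S2"], gene_pheno)
```

A real classifier would then threshold or normalize such counts, or feed them to a learned model, rather than using them directly.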
Additionally, many groups used a polygenic risk score (PRS) to predict the patients' phenotypes, using either GWAS data or data derived from the training dataset. Interestingly, two of the methods that used PRS reached good performance overall or in the prediction of specific phenotypes.
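In its simplest additive form, a PRS is a weighted sum of risk-allele dosages. The sketch below illustrates the idea only; the variant IDs and weights are hypothetical, and this is not any team's actual implementation.

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS: sum of GWAS effect sizes (betas, hypothetical
    here) weighted by the number of risk alleles (0, 1, or 2) the
    patient carries. Variants without a known weight are skipped."""
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)
```

The resulting score is typically compared against a threshold, or a reference distribution, to call a phenotype as present or absent.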
In variant prediction, different groups adopted various strategies, including a standard approach that involves variant annotation and filtering based on quality and frequency scores. In the phenotype prediction assessment, we applied a method similar to CAGI5, with 1000 bootstrap iterations per phenotype. Each phenotype's probability was binarized to 0 or 1 based on a threshold optimizing the Matthews correlation coefficient (MCC). Sensitivity, specificity, and various performance measures (MCC, ACC, F1) were used. ROC curves were generated using bootstrap for stability and relative AUC calculation. The z-score at the phenotype level allowed comparison across predictors. This approach addresses imbalanced datasets and ensures the robustness and statistical validity of the results.
ID and ASD were the most common phenotypes, followed by epilepsy, hypotonia, macro/microcephaly, and ataxia. Despite the difficulties in predicting phenotypes, a good result was achieved for rarer phenotypes such as microcephaly and hypotonia. The majority of patients exhibiting these clinical features were predicted by at least 6, 7, or 8 groups (Fig. 1). This means that more than half of the groups correctly predicted these phenotypes.
For the variant assessment, we used a confusion matrix and metrics such as accuracy and recall. We assigned a True Positive if the predictor matched the patient's variant, a False Positive for an incorrect variant assignment, and a False Negative for a missed variant prediction. Finally, accuracy was calculated as the ratio of correctly predicted variants. A few teams stood out in variant prediction. SID#8 distinguished itself in the prediction of variants, which we divided into three classes (causative, potentially pathogenic, and contributing factors). Other groups also achieved good results (SID#6, SID#4, SID#3, SID#7). Some of them considered quality parameters or variant frequency for filtering and variant prioritization. Many of these groups used data from CAGI5 as training for the new challenge. For example, SID#7, which achieved good predictions for both causative and putative variants, applied a method that excluded variants with a frequency greater than 1%. SID#3 also implemented a method that paid attention to parameters such as the frequency and functional impact of variants (Fig. 5). On the contrary, SID#1, SID#2, and SID#5 seem to have had some difficulties in predicting the three classes of variants. Their methods do not place much importance on variant frequency in the general population or on sequencing quality to filter the hundreds of variants present in each VCF. This resulted in the selection of many variants that could be classified as sequencing errors. However, among the variants predicted by these latter groups, some were considered "difficult to predict". Nonetheless, to infer the phenotype of the genotyped patients, some groups (SID#1, SID#3, SID#6) used polygenic risk scoring that considered multiple variants contributing to a single disease. With this approach, both rare and common variants from GWAS studies on specific phenotypic traits were used to infer predictions.
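A frequency/quality filter of the kind described for SID#7 and SID#3, followed by the TP/FP/FN scoring used in our assessment, might be sketched as follows. The field names ("af", "qual") and variant IDs are illustrative assumptions, not the teams' actual schema.

```python
def filter_rare(variants, max_af=0.01, min_qual=30.0):
    """Keep variants below a population allele-frequency cutoff
    (e.g., 1% in gnomAD) and above a sequencing-quality threshold.
    The 'af'/'qual' field names are illustrative, not a real schema."""
    return [v for v in variants if v["af"] <= max_af and v["qual"] >= min_qual]

def score_predictions(predicted, truth):
    """Confusion counts for one patient: TP if a predicted variant
    matches the lab's variant, FP if it does not, FN for a variant
    the predictor missed. No TN is counted, since the space of
    correctly unreported variants is not enumerable here."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "recall": recall}
```

Skipping a filter like `filter_rare` is what leads to the behavior seen in some submissions: hundreds of common or low-quality calls per VCF surviving into the final prediction.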
Additionally, we noticed that SID#8, along with SID#1, indicated the variant in AP1S2 as a causative variant. The Padua laboratory, which initially did not consider this variant, reclassified it as potentially pathogenic on the basis of splicing prediction and phenotype consistency.
Our assessment demonstrates that computational methods still have many limitations for efficient phenotype prediction in patients affected by heterogeneous disorders like NDDs. However, the situation appears to be very different for variants: some methods seem to be highly effective and can be improved for application in clinical practice and for the analysis of the enormous amount of data generated by WES or WGS. Furthermore, these results confirm that the strategy of the Padua NDD laboratory in identifying pathogenic variants has been very efficient, and that both the targeted gene panel and the data analysis workflow have yielded excellent results.

Funding
The CAGI experiment is supported by the National Institutes of Health (United States) award U24HG007346. This study is partially supported by the PhasAGE project (to SCET and EL), funded by the European Union's Horizon 2020 research and innovation programme Twinning (GA 952334). The work of Y.S. has been funded by the National Institutes of Health (United States) award R35GM124952.

Competing Interests
The authors declare no competing interests.
Ethics approval and consent to participate

Figure caption: Predicted variant distribution. The "Experimental" category is the number of variants identified and classified by the Padua NDD lab. Each bar represents the number and types of variants predicted by each submission. NDD, neurodevelopmental disorder.