Creation of training, validation, and test sets of variants
The January 2020 variant summary report was downloaded from ClinVar and used as the primary basis for the training and validation sets. The version of the OMIM database from January 14th, 2020 was downloaded as well. The ClinVar dataset was filtered to identify germline variants with criteria provided and no conflicts in interpretation of pathogenicity (one star or higher). From that set, we selected the following variant types: single nucleotide variants, deletions, duplications, insertions, indels, and microsatellites. We further selected only the variants that were annotated as benign, likely benign, benign/likely benign, pathogenic, likely pathogenic, or pathogenic/likely pathogenic. In order to interpret pathogenic variants as dominant or recessive and select for only Mendelian variants, we further selected only the entries that cited an OMIM phenotype identifier. We used our downloaded version of the OMIM database to map these phenotype identifiers to patterns of inheritance. Most phenotypes in OMIM have only one mode of inheritance (even if the gene has multiple modes of inheritance). Among the cases in which the variant was mapped to a phenotype with both dominant and recessive inheritance annotated, we removed that variant from the set. Additionally, we removed any variants in which the associated OMIM identifier given in ClinVar was deemed to be incorrect. For example, we found entries in which the OMIM identifier pointed to a different gene than the variant was on. We also excluded OMIM terms that were annotated as ‘nondiseases’, ‘susceptibility’, or ‘putative’ (brackets, braces, or question marks). We also included any variants found in gnomAD v2.1.1 in the homozygous state in at least two individuals as benign variants.
To select down to the non-splicing protein-altering variants within this set, we utilized Annovar43. We worked with the GRCh37 coordinates for each variant and employed the Gencode V33 Basic annotation of the human genome (lifted over to GRCh37 coordinates)44. We used Annovar’s annotate_variation.pl and coding_change.pl scripts to identify the protein-sequence changes caused by each variant on each isoform of each gene they affect. This also provided us with information on variants near splice sites and we chose to remove all variants within 2bp of canonical splice sites. Most protein-altering variants were then seen to affect more than one transcript of the affected gene. We selected a single transcript for each variant as follows: if the variant affects the canonical transcript of the gene (according to gnomAD’s definition of canonical transcripts), then use the canonical transcript; if none of the affected transcripts are the canonical one, then use whichever has the highest expression across tissue types in GTEx V7 (using the median of samples as the representative for each tissue type); if multiple genes are affected by this variant, pick the gene whose canonical transcript is affected; if multiple genes have their canonical transcripts affected by this variant, pick the gene whose average expression is highest across tissue types in GTEx V7. In this way, we select down to a single amino acid sequence and how it is altered for each variant.
Next, we collected several numerical annotations for each variant, which are served as structured information to the MAVERICK model. For each variant, we collected the allele frequency and number of times seen as a homozygote among controls in gnomAD v2.1.18; the gnomAD constraint information for the canonical transcript of the gene (regardless of whether the variant affected the canonical transcript or not) in the form of the probability that transcript is loss-of-function intolerant (pLI), probability that transcript falls into distribution of recessive genes (pRec), the probability that transcript falls into distribution of unconstrained genes (pNull), Z-score for missense variants in gene, and Z-score for loss-of-function variants in gene8; the pext score from gnomAD26; the local constraint score (CCR) for the affected residue30; the gene damage index (GDI) score for the associated gene28; the RVIS score for the associated gene29; and the GERP + + score for the nucleotide harboring the variant27. For “residue-level” scores (CCR, pext, and GERP) on indels that span multiple residues, we used the maximum score within the affected span. These sources of structured information were chosen in an effort to supply useful information to the MAVERICK model while minimizing the risk of propagating circularity. As such, these scores were selected because they are based on observations that should be relatively uniform in quality across all genes, regardless of how well-studied a gene is.
The final annotation that we created was the evolutionary conservation track for each gene transcript. This approach was modeled after the procedure used to generate input for NetSurf-P245. For each protein-coding transcript in the Gencode V33 Basic annotation of the GRCh37 genome, MMSeqs2 was used to generate multiple sequence alignments against the August 2018 version of Uniclust9025,46. This was a two-step process, first “mmseqs search” was run with “num-iterations” set to 2 and “max-seqs” set to 2000. Second, “mmseqs results2msa” was run with the default parameters using the output of the first step. These multiple sequence alignments were then run through HHSuite’s hhmake utility47 using default settings except with the parameter “-M” set to “first”. Finally, these HHM profiles were parsed into compressed NumPy arrays for easy loading as input to the model.
The protein-altering variants that passed the filtering procedure detailed above were annotated with the appropriate residue-level, transcript-level, and gene-level structured information. The amino acid sequences of the reference and altered protein were saved for each variant as well. This yielded 126,739 variants. One thousand of those variants were randomly selected to serve as the validation set. This contained 778 benign, 108 pathogenic or likely pathogenic dominant, and 114 pathogenic or likely pathogenic recessive variants. The remaining variants composed the training set, which had 99,380 benign, 13,112 pathogenic or likely pathogenic dominant, and 13,247 pathogenic or likely pathogenic recessive variants. The validation set was used for hyperparameter tuning and model selection and as a result, MAVERICK’s performance on that set is slightly better than would be expected in general. Therefore, we did not utilize it for primary evaluations.
In order to construct the known and novel genes test sets, we downloaded the January 2021 variant summary report from ClinVar and repeated the above procedure, but at the end removed the 126,739 variants that were already in the training and validation sets. There were 17,942 variants that passed these filters. These could have been newly added to ClinVar, upgraded from zero star to a higher rating, or had an OMIM phenotype term associated with them. The variant set was then split into the 16,012 that fell on genes that had at least one pathogenic variant in the training set (the known genes set) and the 1,930 that fell on genes that had no pathogenic variants in the training set (the novel genes set). The known genes set contained 2,917 benign, 6,085 dominant, and 7,010 recessive variants. The novel genes set contained 1,234 benign, 183 dominant, and 513 recessive variants.
MAVERICK architecture
MAVERICK is an ensemble of eight models spanning two distinct architectures. This combination was used in order to encourage a diversity in the types of evidence used by the ensemble members and thereby to increase the likelihood of training a well-calibrated final model. The ensemble is also more accurate overall than any of its individual members.
Architecture 1 uses a tandem stack of transformers to process the conservation information and then runs the outputs of those transformers, along with the structured information through a classification head which returns the likelihood that each variant is benign, dominant, or recessive. This architecture is depicted in Supplementary Fig. 1. Specifically, for each variant, the evolutionary conservation track for the referent version of the protein is taken and the 100 amino acids upstream and downstream of the site of variation (so, 201 amino acid span in total) is extracted. This 201 amino acid span of the referent protein sequence along with its conservation is the first input. The variation is then applied to this referent protein sequence. In the case of missense variants, this is simply changing the particular amino acid, while leaving the conservation information unchanged. For deletions, this involves marking that particular residues are omitted. For insertions, this involves adding in new amino acids that have no conservation information. For frameshifts, the amino acids are changed all the way until a stop codon is reached, with the conservation information left unchanged. Similarly for stop-gains, the residues after that point are marked as omitted while the conservation track remains unchanged. This altered protein sequence with conservation information is the second input to the model. The final input to the model is the structured information. The referent and altered protein sequences with conservation are run through the tandem stack of six transformer layers. The stacks that process each input have shared weights. The outputs of each stack are then sliced to only use the center token (which is the site of variation) and passed through a dense layer. This produces a fixed-length embedding for the referent and the altered sequence inputs. The embedding of the referent sequence is then subtracted from the embedding of the altered sequence. This difference in their embeddings is then concatenated with the embedding of the altered sequence along with the structured information. The concatenated data is then passed through a three-layer classification head to produce the final output.
There are three models in the ensemble based on architecture 1. One of the models was trained with default class weights, while the other two used class weights to add more importance to the dominant and recessive classes. Specifically, they gave recessive variants a weight of 7, dominant variants a weight of 2, and kept benign variants at a weight of 1. This means that, during training, each recessive variant contributes 7 times as much toward the loss function as a benign variant, while a dominant variant contributes twice as much as a benign variant. This puts more emphasis on classifying recessive and dominant variants correctly in the loss function. Models based on architecture 1 each have 473,539 trainable parameters.
Architecture 2 is depicted in Supplementary Fig. 2. It also takes three inputs, but does not use the referent protein sequence with conservation. Instead, it takes as its third input a 201 amino acid span of the altered protein sequence, centered on the site of variation (no conservation information). This protein sequence is run through ProtTrans’ ProtT5-XL-BFD as a feature extractor. ProtT5-XL-BFD is a protein language model trained to predict masked amino acid residues from millions of protein sequences. As a result, its representations of protein sequences have information about each residue’s secondary structure and solvent accessibility, as well as protein-level features like subcellular localization. The alternate protein sequence is converted to a dense embedding by ProtT5-XL-BFD and this is then passed through a bidirectional long short-term memory (LSTM) layer, the final internal state of which is used as a fixed-length representation of the sequence. The alternate protein sequence with conservation information is processed exactly as in architecture 1, through a six-layer stack of transformers, then has its center token sliced out and passed through a dense layer to generate a fixed-length representation of that input. The fixed-length representations of the altered protein sequence and the altered protein sequence with conservation information are then concatenated together along with the structured information. The concatenated data is then passed through a three-layer classification head to produce the final output.
There are five models in the ensemble based on architecture 2. Three of the models were trained with default class weights. The fourth model gave dominant variants a weight of 2 and recessive variants a weight of 3. The fifth model gave dominant variants a weight of 2 and recessive variants a weight of 7. Models based on architecture 2 each have 723,651 trainable parameters.
All models were trained with a batch size of 128 for 20 epochs using SGD as the optimizer. The learning rate and momentum were cycled by a “OneCyclePolicy”48 with a maximum learning rate of 0.1, an initial learning rate of 0.001, and an initial momentum of 0.85. The binary cross-entropy of the validation set was used to select the best models for the ensemble.
Comparison to MAPPIN, ALoFT, and PrimateAI
Predictions for all protein-altering SNVs by MAPPIN for hg19 were downloaded from https://doi.org/10.6084/m9.figshare.4639789. Predictions for all premature stop variants by ALoFT for hg19 were downloaded from https://aloft.gersteinlab.org. Predictions for all missense SNVs by PrimateAI for hg19 were downloaded from https://basespace.illumina.com/s/yYGFdGih1rXL.
MAVERICK was run on the known genes and novel genes sets to predict the scores for those variant sets. The subset of variant types appropriate for each tool (missense and stop-gain for MAPPIN, missense for PrimateAI, and stop-gain and frameshift for ALoFT) were then selected and the scores for those variants were extracted from the pre-calculated lists above. We were not able to get the command line version of ALoFT to install, limiting us to only assess stop-gain SNVs for it. For the comparison with PrimateAI, the dominant and recessive scores from MAVERICK were summed to create an overall pathogenicity score.
Spike-in analyses
Ninety-eight patients with rare, Mendelian diseases were chosen from whom whole exome data was available that had yielded a genetic diagnosis. We re-processed the exome datasets according to GATK4 best practices and called variants49. Briefly, samples were aligned to GRCh37 using BWA50, duplicate reads were marked with Picard’s MarkDuplicates utility, base quality score recalibration was performed with GATK 4.0. GATK’s HaplotypeCaller was then run to generate gVCF files for each sample, which were then resolved into variant calls with GATK’s GenotypeGVCFs utility. We then manually removed the variant call (or pair of calls) from each sample that had been identified as the causal pathogenic variant in order to make these patients pseudo-controls. We chose this method because initial tests using samples from the 1000Genomes cohort proved to be too easy since almost all of their variants are present in gnomAD, which MAVERICK takes as a strong indication that the variants are not dominant pathogenic, thus providing an unfair advantage. We further filtered the samples to only include high-quality variant calls using GATK’s CNN filter at default settings as well as requiring a read depth of 20 for each variant and that for heterozygous variants there be at least half as many reads supporting the alternate allele as there are supporting the referent allele. All remaining protein-altering variants were then scored with MAVERICK. To create a rank ordering of the variants in each sample, we created a ‘final score’. For heterozygous variants, MAVERICK’s predicted dominant score was used as this ‘final score’. For homozygous variants, MAVERICK’s predicted recessive score was used as this ‘final score’. Additionally, compound heterozygous pairs were generated by taking the harmonic mean of the recessive score for each pair of heterozygous variants on each gene. This set was filtered so that two variants could not be a compound heterozygous pair if HaplotypeCaller had determined they were part of the same haplotype. All variants and compound heterozygous pairs of variants for each sample were then sorted according to this ‘final score’.
For the spike-in analyses, dominant variants from the test sets were placed into each sample to determine where their dominant score would rank among the ‘final scores’ for all of that sample’s variants. Similarly, recessive variants were ranked in the same way but using their recessive score. We recognize that this means all spiked-in recessive variants were tested as if they were seen as homozygotes. This may appear to be an advantage, but it actually is not. Since the compound heterozygous pairs are scored according to their harmonic mean, the score for a pair of identical heterozygous variants would be numerically equal to that of a homozygous variant. So we believe that testing the recessive variants as homozygotes does not provide any unfair advantage and makes the results simpler to interpret.
Simulation of patient phenotypes
In order to simulate patient phenotypes for each variant in the validation, known genes, and novel genes sets, we exploited the fact that each variant was associated with an OMIM phenotype term due to the manner in which the training and test sets were created. We then used the HPO annotation (downloaded June 21st, 2021) to find the HPO terms associated with each OMIM phenotype. If there were more than five HPO terms associated with any OMIM phenotype, five were randomly selected.
Scoring phenotypes with GADO, HiPhive, and Phenix
Scoring phenotypes with HiPhive and Phenix was accomplished using the Exomiser REST Prioritiser version 12.1.0. Exomiser data version 2003 (from March of 2020) was used so that the performance of HiPhive and Phenix could be accurately assessed on known and novel genes without the passage of time giving them unfair knowledge of the novel disease genes. The list of up to five HPO terms was passed to the REST prioritiser for each variant, which returns scores between 0 and 1 for every gene. For genes not in Exomiser’s annotation set (and therefore without a score), we assigned a phenotype score of 0.5.
Scoring phenotypes with GADO required first converting the set of HPO terms for the OMIM phenotype into the lowest parent term on the HPO graph that was scored by the GADO method. These are referred to as the ‘significant’ HPO terms. We selected a maximum of five ‘significant’ HPO terms for each OMIM phenotype. Next, we downloaded the GADO prediction matrix of Z scores from https://molgenis26.gcc.rug.nl/downloads/genenetwork/v2.1/genenetwork_gene_pathway_scores.zip. As described in the GADO paper, known associations between genes and HPO terms were then set to a value of 3. To convert the data from the Z-score range to a more useful range for our purposes, we applied a sigmoid function to compress the scores to a range of 0 to 1. The distribution of values in this matrix was centered on 0.5 and any genes without entries in this matrix were also assigned phenotype scores of 0.5. To compute gene-phenotype scores for sets of ‘significant’ HPO terms, we took the arithmetic mean of the values of the individual gene-phenotype scores from this matrix.
To combine the phenotype scores from GADO, HiPhive, or Phenix with the MAVERICK score, we took the arithmetic mean of the appropriate MAVERICK score (the ‘final score’ described above) and the phenotype score for that variant’s gene according to each of these tools.
Comparison to Exomiser
To test how MAVERICK prioritizes variants in real patients, we used the Genesis database to select 18 patients with inherited neuropathies recently found to be caused by variants on novel disease genes. In this real-life setting, pseudonymized phenotype information including patient history, clinical examination results, and nerve conduction studies were retrieved from the database of the rare disease clinical research network (RDCRN) for each individual. Between one and 12 HPO terms were assigned per individual, depending on the amount and specificity of available data. This assignment was done by a trained neurologist experienced in the diagnostic procedures of neuromuscular diseases (author MFD). Supplementary Table 5 lists the HPO terms assigned to each individual, along with the final rankings of that individual’s causal variant(s) by MAVERICK and Exomiser each with and without incorporation of phenotypic information.
The whole exome sequencing data from the 18 patients were processed according to GATK best practices as described above for the samples used for the spike-in analyses. Slightly more relaxed quality filtering was used with a depth requirement of only 15 reads and only a quarter as many reads being required to support an alternate allele as support the referent allele. The CNN filter was still used at default settings.
The VCF file for each patient and their associated HPO terms were then submitted to Exomiser running data version 2003. The VCF files were run through MAVERICK and the accompanying HPO terms were scored by HiPhive as described in the section above. The Exomiser scores at the gene level were collected for the autosomal dominant and autosomal recessive variants. The scores were sorted by the ‘EXOMISER_GENE_COMBINED_SCORE’ column to calculate the rank of the causal variant when leveraging phenotype information and sorted by the ‘EXOMISER_GENE_VARIANT_SCORE’ to calculate performance when ignoring phenotype information. Exomiser creates these gene-level scores by summarizing each gene as the most likely pathogenic variant or pair of variants in it. The fact that Exomiser is operating at the gene level does give it an advantage in this comparison, but we decided to run it this way since that is the way most people use Exomiser’s output. MAVERICK performance was calculated by ranking variants (with or without the influence of phenotype information) according to a ‘final score’ calculated as described for the spike-in analyses.
GENESIS
GENESIS is a web-based genomic data management platform designed to facilitate ‘matchmaking’ among physicians whose rare monogenic disease patients carry identical variants or who carry pathogenic variants on the same gene. It has thus far been involved in the discovery of over 70 novel disease genes. We have built MAVERICK into GENESIS’s annotation engine so that all applicable variants are given a MAVERICK score when they are brought into the database. Prioritization of variants in a patient by MAVERICK score is now part of the default sorting scheme. GENESIS also provides numerous orthogonal approaches to prioritize variants, including the incorporation of inheritance information and robust options for filtering variants by their call quality.
Patient cohort
A cohort of 644 patients with genetic diagnoses for rare monogenic diseases was downloaded from GENESIS. Variants had been called on these patients through a variety of methods including but not limited to GATK and Freebayes. Not all samples had all the same quality checking measures for their variants. Variants were filtered using GATK’s Convolutional Neural Network (CNN) filter when available at default settings as well as requiring a read depth of 20 for each variant and that for heterozygous variants there be at least half as many reads supporting the alternate allele as there are supporting the referent allele. Variants seen in more than 1% of the overall population in either gnomAD v2 or v3 were removed. Variants seen in more than 0.5% of samples within GENESIS were also removed. Variants called with high quality in unaffected samples in GENESIS were also blacklisted out in this analysis if they appeared with the same zygosity or in the same compound heterozygous pair. Variants were filtered to only include those that matched the observed inheritance pattern of the causal variant. GENESIS contains high-level descriptions of the phenotype for most patients which generally takes the form of an ORPHAnet code for a large family of rare disorders. We used these codes to select 1–3 HPO terms for each disease. These were generally very vague with the most common assignment being “Peripheral Neuropathy” (HP:0009830). The HPO codes were scored using Phenix to generate a gene phenotype score which was then averaged with the MAVERICK score of each variant on each gene. Ranking was performed as described for the spike-in analyses.
Code availability
The code for MAVERICK is provided under an MIT open-source license at our Github: https://github.com/ZuchnerLab/Maverick. A Colab notebook version of MAVERICK that accepts a VCF file aligned to GRCh37 as input and returns scores for all applicable variants is linked at the Github. Python notebooks and CoLabs demonstrating the creation of the training and test datasets, as well as to replicate the training process are also given there.
Data availability
Pre-computed scores for all missense and nonsense SNVs in Gencode Basic V33 are available for download at our Github. The scores were computed using GRCh37 and a version lifted over to GRCh38 is also provided.