Most monogenic disorders are caused by mutations altering protein folding free energy

Revealing the molecular effect that pathogenic missense mutations have on the corresponding protein is crucial for developing therapeutic solutions. This is especially important for monogenic diseases since, for most of them, no treatment is available, while treatment should typically be provided in the early developmental stages. This requires fast, targeted drug development at a low cost. Here, we report a database of monogenic disorders (MOGEDO), which includes 768 proteins, the corresponding 2559 pathogenic and 1763 benign mutations, along with the functional classification of the corresponding proteins. Using the database and various computational tools that predict folding free energy change (ΔΔG), we demonstrate that, on average, 70% of pathogenic cases result in decreased protein stability. Such a large fraction indicates that one should aim at in-silico screening for small molecules that stabilize the structure of the mutant protein. We emphasize that knowledge of ΔΔG is essential because one wants to develop stabilizers that compensate for ΔΔG without making the protein over-stable, since an over-stable protein may be dysfunctional. We demonstrate that, using ΔΔG and the predicted solvent exposure of the mutation site, one can develop a predictive method that distinguishes pathogenic from benign mutations with a success rate even better than that of some of the leading pathogenicity predictors. Furthermore, hydrophobic-hydrophobic mutations show a stronger correlation between folding free energy change and pathogenicity than other mutation types. Also, mutations in which Cys, Gly, Arg, Trp or Tyr is replaced by any other amino acid are more likely to be pathogenic. To facilitate further detection of pathogenic mutations, each wild-type amino acid in the 768 proteins mentioned above was mutated to the other 19 residues (14,847,817 mutations), the ΔΔG was calculated with SAAFEC-SEQ, and 5,506,051 mutations were predicted to be pathogenic.


Introduction
The advent of next-generation sequencing has transformed the approach to investigating and diagnosing human genetic disorders. Next-generation sequencing, in conjunction with bioinformatics pipelines, has enabled the efficient analysis of a large number of DNA variants found throughout entire genomes.
However, this wealth of data presents challenges when identifying the individual molecular effects contributing to diseases. In Mendelian diseases, accurately pinpointing one or two pathogenic mutations is crucial amidst the numerous harmless variants that naturally occur in the human genome. This requires distinguishing true positives (actual disease-causing mutations) from false positives (benign mutations). On average, an individual's exonic genome region contains approximately 20,000 variations (Gilissen et al. 2012; Tennessen et al. 2012). Among these, only a small fraction of non-synonymous changes in DNA sequences could be directly involved in or associated with disorders. Thus, identifying and prioritizing mutations based on their pathogenic potential is of significant value for precision medicine and diagnostics.
This prompted the development of numerous in-silico methods to predict the pathogenicity of amino acid mutations (Ng and Pejaver et al. 2020). Most of these methods use amino acid features to assess the likelihood of a mutation being pathogenic. These features include evolutionary conservation and the physical and chemical properties of amino acids, and some methods also include structural information such as the B-factor. While these methods have been successful in distinguishing between benign and pathogenic mutations, none of them provides a comprehensive understanding of the functional consequence of a mutation that could assist the development of therapeutic solutions. This is not crucial for complex diseases, which are caused by multiple mutations in multiple genes; there, mutations can be mapped onto affected pathways or interaction networks, thus pinpointing the effect causing the disease.
In the case of monogenic disorders, the pathogenicity is caused by a single mutation in a single gene, and revealing the molecular effect of the mutation is crucial for the development of treatment. Furthermore, complex diseases affect many individuals, and thus a variety of therapeutic solutions have been developed, while monogenic diseases are typically rare, and for many of them there is no therapeutic solution. This motivated us to focus on monogenic disorders and to investigate the dominant molecular effect causing such disorders.
Proteins perform their function by adopting a particular 3D structure and binding mode, and any deviation from the native structural features may alter the protein's function. This prompted another set of investigations and developments to link pathogenic mutations with changes in protein folding and binding free energies. Although limited, attempts have been made to understand both the effect of mutations on protein stability and their potential impact on the interaction of the protein with other macromolecules. Recent research conducted an extensive analysis of mutations, considering experimental measurements of changes in protein folding and binding free energy. The study revealed a strong correlation between the probability of a mutation being pathogenic and its impact on protein folding or binding free energy (Pearson correlation coefficient, PCC, up to 0.7) (Peng and Alexov 2016).
Another investigation used computational stability predictors on a dataset consisting of 3338 pathogenic and 10170 benign mutations from ClinVar and gnomAD v2.1. This study showed that these predictors could provide valuable insights into the potential pathogenicity of mutations, with an area under the curve value of 0.66 (Gerasimavicius et al. 2020). In a recent study, Aledo et al. developed a fitness-stability model based on the Arrhenius law to investigate the impact of mutations on protein stability. The model was validated through the extensive mutational analysis of 14,094 proteins and demonstrated that most mutations destabilize protein structure (Aledo and Aledo 2023). Furthermore, they observed a positive correlation between the extent of destabilization caused by amino acid substitutions and their potential for causing diseases (Aledo and Aledo 2023). With regard to the interaction of proteins with other biomolecules such as DNA, RNA, and other proteins, research has demonstrated that mutations can significantly hinder the ability of a protein to bind and interact with these molecules effectively. This can have implications for the development of various diseases, including cancer, cystic fibrosis, and cardiovascular disorders, among others (Sahni et al. 2015; Jackson et al. 2018; Jemimah and Gromiha 2020; Zaucha et al. 2021). The proportion of mutations stabilizing proteins is relatively low, comprising less than one-third of the available data. It is worth noting that, while increased thermostability of proteins can be advantageous in preventing thermal inactivation and conformational changes at higher reaction temperatures, certain pathogenic mutations may also result in protein stabilization. The H101Q mutation in the chloride intracellular channel 2 (CLIC2) protein is an example of this observation (Witham et al. 2011; Takano et al. 2012). This variant has demonstrated a higher stability level than the wild-type protein. Functional investigations on the H101Q variant have revealed its ability to enhance, rather than hinder, ryanodine receptor (RyR) intracellular channel activity, resulting in prolonged channel openings and potential amplification of calcium signals that depend on RyR channel functionality (Takano et al. 2012). The R111G and A140V variants of the methyl-CpG binding protein 2 (MeCP2) are mutations that increase protein stability but significantly decrease DNA binding (Yang et al. 2016). It is important to note that these studies did not differentiate between monogenic and polygenic disorders. Their results clearly indicate a correlation between the stability of proteins and their potential for causing diseases, but it remains an open question to what extent these findings can be generalized to monogenic diseases. Considering this, in our earlier work, we reported a database of monogenic disorders consisting of two datasets: Dataset 1 consists of 686 proteins with 1934 pathogenic and 1405 benign mutations (only mutations that are classified as pathogenic or benign), and Dataset 2 consists of 768 proteins with 2559 pathogenic and 1763 benign mutations (Dataset 2 also includes the likely benign and likely pathogenic cases) (Pandey et al. 2023). These datasets allowed us to explore further the relationship between missense mutations and their impact on protein stability and pathogenicity in monogenic disorders. Our study revealed a strong association between changes in the folding free energy and the potential pathogenicity of mutations. By averaging the folding free energy values from various folding energy predictors for the two datasets, we obtained an area under the curve value of 0.71 and a Matthews correlation coefficient value of 0.32 (Pandey et al. 2023).
The change in protein stability due to a mutation is directly related to the solvent exposure of the mutation site. Mutations occurring in the core of the protein are expected to affect protein stability more than mutations at the protein's surface. In a recent study, the authors explored the association between changes in solvent accessibility and the pathogenicity of human protein variations (Savojardo et al. 2021). Utilizing an in silico approach, they predicted the changes in solvent accessibility (∆SASA) for a large data set of residues undergoing variations in 12,494 human protein sequences lacking three-dimensional structures. Overall, 69385 OMIM-related single residue variations were considered, of which 39436 were neutral and 29949 were disease-related. They found that the majority of the disease-causing mutations occur primarily in buried positions (67 %), compared to the neutral ones (64.3 %), which tend to occur in the exposed region (Savojardo et al. 2021). We want to emphasize again that these findings were obtained without differentiating between monogenic and polygenic disorders.
Moreover, the impact of missense mutations on protein function extends beyond changes in stability alone. The consequences of missense mutations on protein function are multi-faceted and, apart from stability changes, can involve disruptions of protein-protein and protein-DNA/RNA interactions, active site changes, and altered conformational dynamics of the protein. These disruptions can ultimately lead to abnormal protein function and contribute to the development of diseases. Therefore, it is important to consider how a missense mutation affects a protein's functionality beyond its stability. Accurately quantifying the impact of missense mutations on stability is relatively straightforward through the estimation of the folding free energy change. By contrast, assessing other properties, such as protein interaction with other macromolecules, active site changes, and alterations in conformational dynamics, is more challenging and complex. The complexity arises from the incomplete characterization of all interacting partners and active site residues for each protein. Considering this, in this study, we specifically examine changes in folding free energy as a measure of pathogenicity. The goals of the current work are (a) to enrich the MOGEDO database of monogenic disorders with functional annotation; (b) to determine if the leading folding free energy change predictors can discriminate between pathogenic and benign mutations; (c) to understand whether the solvent-accessible surface area of mutation sites can be used as a measure of pathogenicity; and (d) to compare the performance of a simple machine-learning-based predictor that uses physics-based quantities, i.e., folding free energy change and SASA, as features with the leading pathogenicity predictors. We also show how the association between a change in folding free energy and the pathogenicity of a mutation varies among different protein families and with the chemical nature of the mutations. This is important for drug design, in order to develop drugs that bind to mutants and restore the wild-type folding free energy. Moreover, the insights acquired from investigating monogenic diseases can also be employed to elucidate the complex interplay between genes and environmental factors involved in the development and advancement of polygenic disorders.

Database of Monogenic Disorders (MOGEDO)
To conduct our study on the correlation between changes in protein folding free energy and the likelihood of a mutation being pathogenic, we have used the database of monogenic disorders (Pandey et al. 2023).
For the development of the database, the list of genes was obtained from OMIM (Amberger et al. 2015), where only genes associated with monogenic diseases were selected, and cancer-related diseases were excluded. A total of 3108 genes were chosen to collect missense mutations. The mutations for each gene were retrieved from ClinVar (Landrum et al. 2018). Only mutations annotated as benign, pathogenic, likely benign/benign, and likely pathogenic/pathogenic were considered for the dataset. If the proportion of benign and pathogenic mutations in a particular gene was greater than 10 %, the mutations associated with that gene were included in Dataset 1. Similarly, if the combined percentage of likely benign and likely pathogenic mutations for a specific gene exceeded 10 %, those mutations were included in Dataset 2. Subsequently, around 14,000 variations were selected for further analysis. The dataset was additionally filtered based on population frequency data obtained from the Ensembl genome database (Cunningham et al. 2022). Variants classified as benign with a population frequency below 0.01 and variants classified as pathogenic with a population frequency above 0.01 were excluded. Additionally, if there were discrepancies in clinical significance between the data sources, those particular mutations were excluded from the dataset. As a final result, Dataset 1 consists of 686 proteins and 1934 pathogenic and 1405 benign mutations, and Dataset 2 consists of 768 proteins and 2559 pathogenic and 1763 benign mutations. The MOGEDO database includes information such as the RefSeq accession ID of the corresponding proteins, allele ID, OMIM gene name, Ensembl gene ID, OMIM phenotypes, and experimental conditions. The database was further enriched to provide the functional classification of the proteins listed. The proteins were classified into 16 distinct groups: enzyme, transport/translocation/cargo protein, transcription regulation, structural support, scaffold protein, receptor protein, signaling protein, regulatory protein, DNA binding, motor protein, secretory proteins, adhesion protein, chaperones, membrane protein, RNA binding and antigen-antibody complexes. Any remaining unclassified proteins were grouped under miscellaneous. The MOGEDO database is available for download from http://compbio.clemson.edu/lab/downloads/.
For the calculation of folding free energy change as a result of mutation, the protein sequences corresponding to the genes were obtained from the NCBI RefSeq sequence database. It is important to note that not all mutations in a given gene were present in the same isoform. To ensure consistency, we selected one specific isoform for all mutations; however, if a mutation was observed only in a particular isoform, then that specific isoform was used for calculating the change in folding free energy caused by the mutation.
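The population-frequency filter described in this section can be sketched as follows. This is a minimal illustration and not the pipeline used to build MOGEDO: the record layout (dictionaries with `label` and `frequency` keys) and the function names are hypothetical, and keeping variants that lack frequency data is an assumption, since the text does not state how such cases were handled.

```python
FREQ_CUTOFF = 0.01  # population-frequency cut-off quoted in the text


def keep_variant(label, frequency):
    """Return False for variants that the frequency filter excludes.

    label: "benign" or "pathogenic"; frequency: float, or None if unknown.
    """
    if frequency is None:
        return True  # assumption: variants without frequency data are kept
    if label == "benign" and frequency < FREQ_CUTOFF:
        return False  # rare variants annotated as benign are excluded
    if label == "pathogenic" and frequency > FREQ_CUTOFF:
        return False  # common variants annotated as pathogenic are excluded
    return True


def filter_variants(variants):
    """variants: list of dicts with hypothetical keys 'label' and 'frequency'."""
    return [v for v in variants if keep_variant(v["label"], v["frequency"])]
```

A variant whose frequency equals the cut-off exactly is kept under this reading, because the text excludes only values strictly below or above 0.01.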

Folding free energy calculation
To estimate the impact of mutations on folding free energy, we utilized three

Solvent accessible surface area calculation
The relative solvent accessible surface area (RSA) of each residue undergoing mutation was calculated using NetSurfP-2.0 (Schantz Klausen et al. 2019), a sequence-based method for the computation of SASA. The approach utilizes a model composed of convolutional and long short-term memory networks that have been trained using resolved three-dimensional protein structures.

Regression Model Development
For predicting whether a mutation is pathogenic or benign, we trained a regression model using the sklearn library in Python. The change in folding free energy and the RSA values were used as input features. In order to construct a reliable and robust regression model, an equal number of benign and pathogenic mutations were taken, and the dataset was split into a training set (80 %) and a test set (20 %). The training set was used to train the model and evaluate its performance, while the test set was used to validate its predictions. To build a robust model, fivefold cross-validation was performed 100 times using different C values; in logistic regression, C is a critical hyper-parameter that controls the regularization strength of the algorithm to prevent overfitting. The final model was trained on the complete training and validation set using the C value for which the best average score (AUC score) was obtained. Multiple regression models were trained independently for both datasets using folding free energy change predictions from each predictor and RSA values.
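The training protocol described above can be sketched with scikit-learn as follows. This is an illustrative reading, not the authors' script: it interprets the protocol as repeated stratified fivefold cross-validation over a small grid of C values, and the grid, the random seed, the function names and the synthetic (ΔΔG, RSA) data are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     cross_val_score, train_test_split)


def select_c_and_fit(X, y, c_grid=(0.01, 0.1, 1.0, 10.0), n_repeats=100, seed=0):
    """Pick C by repeated fivefold CV (mean AUC), refit, and score the test set."""
    # 80/20 split, stratified so both classes stay equally represented.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats,
                                 random_state=seed)
    mean_auc = {}
    for c in c_grid:  # hypothetical grid; the values used in the paper are not stated
        model = LogisticRegression(C=c, max_iter=1000)
        scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="roc_auc")
        mean_auc[c] = scores.mean()
    best_c = max(mean_auc, key=mean_auc.get)
    # Final model: refit on the complete training set with the selected C.
    final = LogisticRegression(C=best_c, max_iter=1000).fit(X_tr, y_tr)
    test_auc = roc_auc_score(y_te, final.predict_proba(X_te)[:, 1])
    return best_c, final, test_auc


def make_toy_data(n_per_class=200, seed=0):
    """Synthetic (ddG, RSA) feature pairs; these are NOT MOGEDO values."""
    rng = np.random.default_rng(seed)
    benign = np.column_stack([rng.normal(-0.3, 0.6, n_per_class),
                              rng.uniform(0.2, 1.0, n_per_class)])
    pathogenic = np.column_stack([rng.normal(-1.2, 0.6, n_per_class),
                                  rng.uniform(0.0, 0.6, n_per_class)])
    X = np.vstack([benign, pathogenic])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. Calling `select_c_and_fit(*make_toy_data(), n_repeats=2)` runs the whole procedure quickly on the toy data.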

Methods for predicting pathogenicity of amino acid mutations
Altogether, we employed four different methods to predict the pathogenicity of amino acid mutations for the monogenic disorder database. In the following section, we will provide a brief overview of these methods. While numerous prediction methods are available, the chosen approaches were selected for their popularity, user-friendly interfaces, and ease of installation.

Receiver operating characteristics (ROC)
To quantify the association between folding free energy and the classification of a mutation as benign or pathogenic, we assessed the change in folding free energy caused by each mutation. We assigned four categories to entries in the monogenic database: true positive for pathogenic mutations correctly classified as such, true negative for benign mutations accurately classified as benign, false positive for benign mutations wrongly labeled as pathogenic, and false negative for pathogenic mutations mistakenly identified as benign. ROC curve analysis was conducted by adjusting cutoff values, with an area under the curve calculated accordingly. The change in folding free energy computed using SAAFEC-SEQ (Li et

Sampling and Assessment of Predictions
To compare how the change in folding free energy performs as a predictor of pathogenicity relative to other commonly used methods, we evaluated performance using the following measures: True Positive Rate (TPR), False Positive Rate (FPR), False Negative Rate (FNR) and accuracy. Since most of the leading pathogenicity predictors give their result mainly in the form of a classification (benign or pathogenic), we used a threshold ΔΔG to classify the mutations as pathogenic or benign for all the folding free energy change predictors and RSA. The ΔΔG cut-off at which the maximum Matthews correlation coefficient was observed in our dataset was selected as the threshold for each predictor.
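The threshold selection described above can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, every observed ΔΔG value is scanned as a candidate cut-off (the actual scan grid is not stated), and a mutation is called pathogenic when its ΔΔG is at or below the cut-off, which assumes the sign convention in which negative values are destabilizing.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def confusion(ddg, labels, cutoff):
    """Count TP, TN, FP, FN when ddG <= cutoff is called pathogenic.

    labels: 1 (or True) for pathogenic, 0 (or False) for benign.
    """
    tp = tn = fp = fn = 0
    for value, is_pathogenic in zip(ddg, labels):
        called_pathogenic = value <= cutoff  # assumed sign convention
        if called_pathogenic and is_pathogenic:
            tp += 1
        elif called_pathogenic:
            fp += 1
        elif is_pathogenic:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def best_cutoff(ddg, labels):
    """Scan each observed ddG value as a cut-off and return the one with the best MCC."""
    candidates = sorted(set(ddg))
    return max(candidates, key=lambda c: mcc(*confusion(ddg, labels, c)))
```

The same scan yields the points of a ROC curve if TPR and FPR are recorded at each candidate cut-off instead of the MCC.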
For performance evaluation, we considered an equal number of benign and pathogenic mutations in each sample so that both classes are equally represented (sample size = N (benign) + N (pathogenic), where "N" is the minimum of 50 % of the benign or pathogenic mutations). We calculated TPR, FPR, FNR and accuracy for each sample and repeated the analysis 100 times to obtain the average and standard deviation. We then report the averaged TPR, FPR, FNR and accuracy.
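The balanced resampling procedure described above can be sketched as follows. The function names and the input layout (two lists of boolean "called pathogenic" flags, one for truly benign and one for truly pathogenic mutations) are hypothetical, and sampling without replacement within each round is an assumption.

```python
import random
import statistics


def rates(called, actual):
    """Return (TPR, FPR, FNR, accuracy) for boolean calls and boolean labels."""
    tp = sum(1 for c, a in zip(called, actual) if c and a)
    fp = sum(1 for c, a in zip(called, actual) if c and not a)
    fn = sum(1 for c, a in zip(called, actual) if not c and a)
    tn = sum(1 for c, a in zip(called, actual) if not c and not a)
    return tp / (tp + fn), fp / (fp + tn), fn / (tp + fn), (tp + tn) / len(actual)


def balanced_evaluation(calls_benign, calls_pathogenic, n_rounds=100, seed=0):
    """Average the four rates over repeated balanced samples.

    N is half the size of the smaller class, as described in the text.
    """
    rng = random.Random(seed)
    n = min(len(calls_benign), len(calls_pathogenic)) // 2
    rounds = []
    for _ in range(n_rounds):
        benign_sample = rng.sample(calls_benign, n)          # without replacement
        pathogenic_sample = rng.sample(calls_pathogenic, n)
        called = benign_sample + pathogenic_sample
        actual = [False] * n + [True] * n
        rounds.append(rates(called, actual))
    means = tuple(statistics.mean(column) for column in zip(*rounds))
    stdevs = tuple(statistics.stdev(column) for column in zip(*rounds))
    return means, stdevs
```

Each returned tuple is ordered as (TPR, FPR, FNR, accuracy), so `means[3]` and `stdevs[3]` give the averaged accuracy and its standard deviation.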

Database of Monogenic Disorders (MOGEDO)
In our earlier work, we compiled a comprehensive database of monogenic disorders, which contains information on genetic mutations associated with various monogenic diseases. The database includes information on the types of mutations, their locations within the genes, and the corresponding clinical phenotypes. Here, we provide a comprehensive overview of the database in terms of the different types of amino acid changes, based on the chemical nature of the amino acid. The results are provided in Supplementary Figures S1a-c. Considering the properties of amino acids, we see more cases of small-to-small followed by polar-to-polar, large-to-large, and hydrophobic-to-hydrophobic mutations (Supplementary Figure S1c,f and Supplementary Table 1). The analysis of amino acid changes in our database of monogenic disorders revealed certain patterns based on the chemical nature of the amino acids. For instance, among pathogenic mutations, there is a higher prevalence of mutations from hydrophobic to polar amino acids and from polar to hydrophobic amino acids. Interestingly, mutations from negatively charged to positively charged amino acids are more likely to be pathogenic, whereas the reverse change is less likely to be pathogenic.

Change in folding free energy and pathogenicity
In our earlier work, we showed that there is a significant correlation between the changes in folding free energy and pathogenic mutations. Considering all predictors, the average folding free energy change yielded an area under the curve value of 0.71 for MOGEDO datasets. However, by combining predictions ). Nevertheless, no improvement in AUC was observed for any of the predictors in either dataset (Supplementary Figure S2). This is not surprising since only some mutations result in stabilization (Table 1). On the contrary, we see a significant decrease in the area under the curve when predictions from individual methods were used. A negligible effect on the AUC (Dataset 1: 0.70) was seen when the average change in folding free energy was used to calculate the AUC.
Here, we present the distribution of the change in folding free energy using various predictors for the MOGEDO database. It is important to note that a negative value for the change in folding free energy indicates a destabilizing mutation, whereas a positive value indicates a stabilizing mutation. The results clearly show that most mutations are destabilizing based on all predictors (Figure 1 & Supplementary Figure S3). This observation holds for the entire dataset and when benign and pathogenic mutations are analyzed separately. There is considerable overlap between the folding free energy profiles of benign and pathogenic mutations. However, on average, pathogenic mutations tend to have a more significant negative change in folding free energy than benign mutations. This suggests that changes in folding free energy can be used as an indicator of pathogenicity. Pathogenic mutations tend to make the corresponding protein less stable; in contrast, benign mutations have less impact on protein stability.
In the literature, a cut-off value of 1.0 kcal/mol or 2 kcal/mol has been commonly used to distinguish between mutations strongly affecting protein stability and those that do not. A similar cut-off is typically used to suggest "hot spots" by mutating the wild-type residue to Ala. Presumably, a mutation in a hot spot will dramatically affect both protein structure and function and is likely to be pathogenic. However, the usage of either 1 kcal/mol or 2 kcal/mol is typically not justified. Our results indicate that the cut-off value differs among predictors and is not a universally applicable threshold. The optimal Matthews correlation coefficient (MCC) value for distinguishing benign from pathogenic mutations using SAAFEC-SEQ was achieved with a cut-off value of 1.1 kcal/mol. For I-mutant 2.0 and INPS-SEQ, the best MCC values were obtained with cut-off values of 1.7 kcal/mol and 0.5 kcal/mol, respectively. When averaging the change in folding free energy across all predictors, the best MCC was observed at a cut-off value of 0.7 kcal/mol. Table 1 shows the count of stabilizing and destabilizing mutations in the monogenic disorder database using different predictors and cut-off values commonly found in the literature, as well as the average cut-off value (average of the ΔΔG cut-off values for the three predictors) based on the optimal MCC value. It can be seen that destabilizing mutations significantly outnumber stabilizing mutations across all predictors and at all cut-off values.

Folding free energy change as a measure of pathogenicity based on the chemical nature of amino acid mutations
In the earlier section, we have shown that changes in folding free energy can be used as a potential indicator of pathogenicity. In this section, we wanted to understand how the change in folding free energy as a measure of pathogenicity varies based on the chemical nature of amino acid mutations. Since different amino acid mutations can have varying effects on protein stability, and the methods to estimate the change in folding free energy have been trained on a diverse set of mutations, it is plausible that the ability of folding free energy change to discriminate pathogenic from benign mutations may differ for different types of amino acid changes. To investigate this, we categorized the amino acid mutations in the monogenic disorder database into different groups based on their chemical nature: hydrophobic-hydrophobic, hydrophobic-polar, polar-polar, polar-hydrophobic, small-small, small-large, large-large, large-small, aliphatic-aliphatic, aliphatic-aromatic, aromatic-aromatic, aromatic-aliphatic, positive-positive, positive-negative, negative-negative, and negative-positive. The count of amino acid mutations in each category is given in Table S1. We only considered those categories for analysis where the number of cases exceeds 100. The ROC plot for the change in folding free energy as a measure of pathogenicity based on the chemical nature of amino acid mutations is shown in Figures 2a,b for Dataset 1 and Supplementary Figures S4a,b for Dataset 2. It is worth noting that the ability of the folding free energy change to predict pathogenicity can vary depending on the specific chemical characteristics of amino acid mutations. Consistently, the average of the folding free energy predicted by SAAFEC-SEQ and INPS-SEQ gives the best AUC for all categories, followed by INPS-SEQ alone. I-mutant 2.0 is the weakest at differentiating between benign and pathogenic mutations in all categories.
For Dataset 1, the best AUC was obtained for the hydrophobic-hydrophobic category, with a value of 0.84 both for the average of the free energy changes calculated with SAAFEC-SEQ and INPS-SEQ and for INPS-SEQ alone, followed by small-large and large-large mutations, for which the value is 0.83 using the average from SAAFEC-SEQ and INPS-SEQ. The worst performance is obtained for polar-hydrophobic mutations, followed by small-small mutations. Also, the AUC obtained for these categories is poorer than the AUC of the whole Dataset 1. A similar trend is observed for Dataset 2, where the best AUC is obtained for the hydrophobic-hydrophobic category, followed by small-large mutations.

Folding free energy change as a measure of pathogenicity based on functional category
In addition to examining the chemical characteristics of amino acid mutations, it is also important to consider the functional category of the proteins when predicting their pathogenicity. To investigate how the folding free energy change as an indicator of pathogenicity varies among different functional categories, we conducted an analysis considering different types of proteins. Both datasets are dominated by the enzyme functional category, followed by transport/translocation/cargo protein and transcriptional regulation. The underrepresented categories in the datasets include antigen-antibody, RNA binding and membrane protein (Table S2). It should be noted that we only considered those categories for further analysis where the total number of cases is greater than 100. The ROC plots in Figures 3a,b and Supplementary Figures S4a,b illustrate the performance of the change in folding free energy as a predictor of pathogenicity for the specific functional categories of proteins. The analysis revealed that the ability of the folding free energy change to discriminate pathogenic from benign mutations varies among functional categories. Similar to what was observed in the earlier section, the average of the folding free energy predicted by SAAFEC-SEQ and INPS-SEQ, and INPS-SEQ alone, give the best AUC. Again, the performance of I-mutant 2.0 is poor for all categories. SAAFEC-SEQ and I-mutant 2.0 performed best for receptor proteins, with AUC values of 0.78 and 0.62, respectively, while INPS-SEQ performed best for scaffold proteins, with an AUC value of 0.80. In the case of transport/translocation/cargo proteins, we see a significant decrease in the correlation between folding free energy change and the pathogenicity of a mutation.

Relative Surface Area (RSA): a measure of pathogenicity
Another critical factor that has been associated with the pathogenicity of single amino acid variants is the relative surface area of the mutated residue (Savojardo et al. 2021). However, this property has rarely been included in the physicochemical characteristics adopted to describe the residues undergoing variations (Chen and Zhou 2005; Martelli et al. 2016b; Savojardo et al. 2019). To understand the correlation between the relative surface area of mutated residues and pathogenicity, we followed the same protocol as for the change in folding free energy. The analysis demonstrated that relative surface area strongly correlates with pathogenicity, with an area under the curve value of 0.78 for both Dataset 1 and Dataset 2 (Supplementary Figure S6). The best MCC was obtained at a cut-off of 0.35. These findings suggest that the relative surface area of mutated residues can serve as a useful predictor for the pathogenicity of monogenic disorders, and that buried residues are more frequently associated with pathogenicity than solvent-exposed residues.

Regression model training and testing
The regression model was trained on the datasets of monogenic disorders, and the results are presented in Table 2. The best performance in distinguishing pathogenic from benign mutations was achieved by training the regression model on the average change in folding free energy predicted by SAAFEC-SEQ and INPS-SEQ, along with RSA as a feature. This model showed an AUC value of 0.84 and an MCC value of 0.55 for both the training set and the test set of Dataset 1, followed by the model trained using only the change in folding free energy predicted by INPS-SEQ and RSA. Similar results were obtained for Dataset 2 as well. Hence, it can be concluded that combining the change in folding free energy with RSA yields better performance for discriminating between pathogenic and benign mutations across both datasets.
The performance of the change in folding free energy as a predictor of pathogenicity was compared to other leading pathogenicity predictors, including the regression model trained on the monogenic disorder database. For the folding free energy change predictors and RSA, the cut-off at which the maximum MCC was obtained was used as the threshold to distinguish pathogenic from benign mutations. For Dataset 1, the best accuracy was obtained for PolyPhen using the classifier model trained on the HumVar dataset (0.86 ± 0.01), followed by the PolyPhen classifier model trained on HumDiv (average accuracy: 0.83 ± 0.01) (Table 3). The RSA method and the regression models trained in this work show accuracy comparable to that of the leading pathogenicity predictors PhD-SNP and SIFT 4G; however, the true positive and false negative rates of SIFT 4G are notably better than those of all other methods.
An improvement is seen in accuracy, TPR and FNR when the change in folding free energy and RSA are used together to predict the pathogenicity of a mutation. The same is true for Dataset 2.

Profiling pathogenic mutations through folding free energy change estimated using SAAFEC-SEQ
Having shown in the earlier section that folding free energy change can be used as a measure of pathogenicity, we further used SAAFEC-SEQ to profile pathogenic mutations by mutating every site of a protein to the other 19 amino acids. This was done for all the proteins in the MOGEDO database.
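The exhaustive scan can be sketched as follows. The enumeration of all single substitutions follows directly from the description; the predictor `toy_ddg` is a hypothetical stand-in and does not reproduce SAAFEC-SEQ, and the default cut-off of -1.1 kcal/mol, together with the convention that values at or below it are flagged, is our reading of the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def enumerate_substitutions(sequence):
    """Yield (wild_type, position, mutant) for every single-residue substitution."""
    for pos, wt in enumerate(sequence, start=1):
        for mut in AMINO_ACIDS:
            if mut != wt:
                yield wt, pos, mut

def flag_pathogenic(sequence, predict_ddg, cutoff=-1.1):
    """Return substitutions whose predicted ΔΔG (kcal/mol) is at or below the cut-off.

    predict_ddg is any callable (sequence, position, mutant) -> float.
    """
    return [
        (wt, pos, mut)
        for wt, pos, mut in enumerate_substitutions(sequence)
        if predict_ddg(sequence, pos, mut) <= cutoff
    ]

def toy_ddg(sequence, pos, mut):
    """Hypothetical placeholder predictor: treats any change at Cys or Trp as destabilizing."""
    return -2.0 if sequence[pos - 1] in "CW" else 0.0

flagged = flag_pathogenic("ACW", toy_ddg)
print(len(list(enumerate_substitutions("ACW"))), len(flagged))
```

A sequence of length N yields 19 × N candidate substitutions, which is how the total number of scanned mutations grows across the 768 proteins.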
Pathogenic mutations were identified using the cut-off value of -

Discussion
Understanding the molecular mechanisms of diseases is crucial for developing effective diagnostic tools and therapeutic interventions. Genetic disorders are frequently caused by missense mutations in specific proteins, leading to malfunctioning proteins and subsequent disease phenotypes. Since the human population carries a plethora of genetic variations, it is essential to distinguish between pathogenic and benign mutations to accurately diagnose and treat disorders. Although plenty of methods are available in the literature to predict the pathogenicity of mutations accurately, most of them do not provide a comprehensive understanding of the functional implications of these mutations. Mutations can have complex functional consequences; for instance, they can affect the protein folding process and the binding of proteins to other macromolecules such as other proteins, DNA and RNA. Understanding the functional effect of a mutation is therefore essential for the drug-discovery process. In the current work, we aimed to assess the pathogenicity of a mutation using a thermodynamic approach, i.e., the change in folding free energy. We focused on disorders caused by a mutation in a single gene, i.e., monogenic disorders, as these often have clearer genotype-phenotype correlations than polygenic disorders.
Here, we first reported MOGEDO, the database of monogenic disorders, which contains information about pathogenic and benign mutations in various proteins involved in monogenic disorders, and second, analyzed the functional consequences of mutations by assessing their impact on protein folding. In addition, we provide a comprehensive overview of the monogenic disorder database with reference to amino acid mutations, the chemical nature of the mutation site, how the change in folding free energy as a measure of pathogenicity varies across different classes of proteins, and how it compares with the existing leading pathogenicity predictors (Align GVGD, PhD-SNP, PolyPhen, and SIFT), including the RSA method and regression models built using the change in folding free energy and RSA as features.
A notable observation in the monogenic disorder database is that mutations in which Cys, Gly, Arg, Trp or Tyr is replaced by any other amino acid are more likely to be pathogenic. This could be due to the essential roles that these amino acids play in protein structure and function. For instance, Cys can act as a catalytic residue and is also involved in forming disulfide bridges, which are essential for stabilizing protein structure; Gly is often found in structural regions, and its substitution can disrupt protein conformation; Arg and Tyr are frequently found in protein binding sites, and mutations of these residues can affect protein-protein interactions; and Trp is often involved in hydrophobic core formation and protein stability. Concerning the chemical nature of amino acid mutations, mutations from negatively charged to positively charged amino acids are more likely to be pathogenic, whereas the reverse substitutions are less likely to be pathogenic. Further, hydrophobic-polar, polar-hydrophobic, small-large and large-small mutations are also more likely to be pathogenic. This analysis was prompted by the observation that mutations involving a drastic change in the chemical nature of amino acids can have severe consequences for protein structure and function.
It is important to note that the impact of amino acid substitutions on protein structure and function can vary depending on several factors. These factors include the specific protein being studied, the location of the amino acid residue within the protein structure, and the type of mutation that occurs. It is therefore crucial to consider these factors when studying the pathogenicity of a mutation and assessing its linkage to protein destabilization. Accordingly, we evaluated the correlation between the change in folding free energy and pathogenicity across different classes of amino acid mutations, based on their chemical properties, and across different classes of proteins. Our analysis revealed that the correlation is not consistent for all classes or across all predictors of ΔΔG. The strongest correlation was found for hydrophobic-hydrophobic mutations, using the average of the folding free energy changes predicted by SAAFEC-SEQ and INPS-SEQ. On the other hand, a weak correlation was observed for polar-hydrophobic and small-small mutations, which was worse than the performance on the overall dataset. When considering the functional category of proteins, receptor proteins exhibited slightly stronger correlations between changes in folding free energy and pathogenic mutations with all the predictors. However, for transport/translocation/cargo proteins, the performance of the change in folding free energy as a predictor of pathogenicity decreased. Because these proteins primarily work by binding to other proteins or macromolecules, mutations are more likely to affect their binding affinities than their overall folding stability.
Besides folding free energy change, we also explored RSA as a measure of pathogenicity and observed that it shows a strong correlation with pathogenicity, with an AUC of 0.78. This suggests that a change in the relative solvent accessibility of a protein residue can serve as an effective indicator of its pathogenicity. Further, the regression model developed using folding free energy change and RSA as features demonstrated accuracy comparable to leading pathogenicity predictors such as PhD-SNP and SIFT 4G, and better than that of some pathogenicity predictors (Align GVGD). While PolyPhen achieved the highest accuracy overall, SIFT 4G outperformed all other methods in correctly identifying positive cases and minimizing false negatives. In summary, our research provides evidence to support the use of physics-based measurements, such as changes in folding free energy and solvent accessibility, as reliable indicators of pathogenicity. These metrics can effectively assist in identifying variants that may have pathological implications. Further, this approach provides a more comprehensive understanding of the effect of a mutation on protein function, as it also reveals the underlying cause of its pathogenicity and the magnitude of its impact on protein stability. This information is essential from the drug discovery perspective, as it can aid in developing targeted therapies. For example, when designing a small molecule to target a specific mutated protein, understanding the folding stability of the protein can be crucial. This information helps determine whether the mutation is likely to stabilize or destabilize the protein structure, which can guide the design process. Furthermore, the magnitude with which the mutation affects the protein's stability can inform decisions on the optimal drug design strategy. Suppose one aims to develop a small molecule for a destabilized mutant protein and restore its

sequence-based methods: SAAFEC-SEQ (Li et al. 2021), I-mutant 2.0 (Capriotti et al. 2005), and INPS-SEQ (Savojardo et al. 2016). We focused only on sequence-based methods because there is a sufficient number of cases in the monogenic disorder dataset where the protein structure remains unknown. Additionally, we selected these methods because of their widespread use, accessibility, and user-friendly nature. Below, we briefly describe the methods used to estimate the change in folding free energy.
SAAFEC-SEQ (Li et al. 2021): SAAFEC-SEQ is a machine learning method based on gradient-boosting decision trees, which incorporates physicochemical properties, sequence features, and evolutionary information to estimate the impact of amino acid mutations on folding free energy. This method requires amino acid sequences as input for prediction.
I-mutant 2.0 (Capriotti et al. 2005): I-mutant 2.0 is implemented as both a sequence-based and a structure-based method and uses support vector machine algorithms to predict changes in folding free energy resulting from mutations.
INPS-MD (Savojardo et al. 2016): The INPS-MD method has been developed and implemented as both a sequence-based and a structure-based method. It employs machine learning techniques based on support vector regression.
from SAAFEC-SEQ and INPS-SEQ predictors to calculate the average folding free energy change, an improved AUC of 0.77 was observed. A few studies have shown absolute ΔΔG to be a better indicator of pathogenicity, as it treats stabilization and destabilization equally (Casadio et al. 2011; Martelli et al. 2016a; Gerasimavicius et al. 2020

Figure 3
Henikoff 2003; Mathe et al. 2006; Tavtigian et al. 2006; Chun and Fay 2009; Davydov et al. 2010; Schwarz et al. 2010; Reva et al. 2011; Choi et al. 2012; Adzhubei et al. 2013; Shihab et al. 2013; Carter et al. 2013; Choi and Chan 2015; Vaser et al. 2016; Ioannidis et al. 2016; Jagadeesh et al. 2016; Raimondi et al. 2017; Kim et al. 2017;
Align GVGD (Tavtigian et al. 2006) is a widely used method for predicting the pathogenicity of amino acid mutations. It combines sequence alignment with Grantham Variation and Grantham Deviation scores to classify mutations as deleterious or benign.
SIFT 4G (Vaser et al. 2016) is a widely used tool for predicting the pathogenicity of amino acid mutations in protein sequences. It uses a combination of sequence conservation and protein structure information to predict the impact of amino acid mutations on protein function.
PolyPhen (Adzhubei et al. 2013) is a computational algorithm that predicts the potential impact of amino acid substitutions on protein function. It uses sequence alignments and 3D structural features of proteins to assess the potential impact of amino acid substitutions.
SIFT for Dataset 1 and S1d-f for Dataset 2.
In both datasets, most of the mutations are from Arg (Dataset 1: 553; Dataset 2: 763) to other amino acids, followed by mutations from Gly (Dataset 1: 302; Dataset 2: 385), Ala (Dataset 1: 266; Dataset 2: 340), Leu (Dataset 1: 224; Dataset 2: 277) and Ser (Dataset 1: 224; Dataset 2: 270). It is worth mentioning that the mutations in both datasets are most often made to Arg (Dataset 1: 346; Dataset 2: 434), Ser (Dataset 1: 303; Dataset 2: 369), Val (Dataset 1: 256; Dataset 2: 310), Pro (Dataset 1: 239; Dataset 2: 287) and Leu (Dataset 1: 217; Dataset 2: 294). In contrast, in the datasets commonly used to train models that estimate changes in folding free energy, protein-protein binding free energy, and protein-DNA binding free energy, the majority of the mutations are made to alanine. This is because alanine scanning is widely recognized as one of the most popular methods for examining how mutations lead to alterations. Interestingly, mutation of Cys, Gly, Arg, Trp or Tyr to any other amino acid is likely to result in pathogenicity.
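The per-residue counts reported above can be obtained with a simple tally. The sketch below assumes mutations are written in one-letter notation such as `R132H` (wild-type residue, position, mutant residue); this input format, the function name, and the example mutations are illustrative assumptions and are not taken from MOGEDO.

```python
from collections import Counter

def residue_counts(mutations):
    """Count wild-type and mutant residues in strings such as 'R132H'."""
    wild_type = Counter(m[0] for m in mutations)   # first character: wild-type residue
    mutant = Counter(m[-1] for m in mutations)     # last character: mutant residue
    return wild_type, mutant

# Toy input only.
toy = ["R132H", "R175H", "G12D", "A4V", "L858R"]
wt_counts, mut_counts = residue_counts(toy)
print(wt_counts.most_common(1), mut_counts.most_common(1))
```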

Table 1. Total number of stabilizing and destabilizing mutations in the Monogenic Disorder Database.

Table 2. Performance of the regression model on training and test sets for Dataset 1 and Dataset 2.

Table 3. Performance comparison of the change in folding free energy method, the RSA method and the regression model with leading pathogenicity predictors on MOGEDO.
.edu/lab/downloads/). This indicates that a significant proportion of amino acid mutations have the potential to cause disease by destabilizing the corresponding protein, and therapeutic solutions should focus on the development of small molecule stabilizers (Fersht et al. 1993; Durairaj et al. 2022).
1.1 kcal/mol. Out of 14,847,817 mutations performed, 5,506,051 mutations are predicted to be pathogenic (the results of the predictions are available