Data acquisition
KVarPredDB contains information on ten keratin genes reported to be related to genodermatoses. All information was integrated and manually curated from Interfil, NCBI-dbSNP and NCBI-PubMed (records before 30 October 2020) (Figure 1). In particular, pathogenic missense variants were extracted and integrated from Interfil. Related references in NCBI-PubMed, dated from March 2017, were searched with the keywords “mutation” and “keratin”. Meanwhile, variants of uncertain significance (VUS) were mainly extracted and curated from NCBI-dbSNP. All variant information is presented to users in two easily accessible forms, i.e. text and lollipop diagram [28].
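
A keyword search of this kind could be expressed as an NCBI E-utilities ESearch request. The sketch below is only an illustration: the source does not state that the search was scripted, and the function name `build_pubmed_query`, the `retmax` value and the choice of publication date (`pdat`) as the date field are assumptions. It only builds the request URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords, mindate, maxdate, retmax=100):
    """Return an ESearch URL for PubMed records matching all keywords in a date window."""
    params = {
        "db": "pubmed",
        "term": " AND ".join(keywords),  # require every keyword
        "datetype": "pdat",              # assumption: filter on publication date
        "mindate": mindate,              # YYYY, YYYY/MM or YYYY/MM/DD
        "maxdate": maxdate,
        "retmax": retmax,                # assumption: illustrative page size
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Keywords and date window taken from the text above.
url = build_pubmed_query(["mutation", "keratin"], "2017/03", "2020/10/30")
print(url)
```

Records returned by such a query would still need the manual curation described above before entering the database.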
We also integrated and listed all types of keratin gene-associated diseases, each with a hyperlink to Online Mendelian Inheritance in Man (OMIM), and classified the pathogenic missense variants by disease type. KVarPredDB displays complete keratin information, including gene and protein information such as the keratin coding sequences obtained from Ensembl and UniProt. KVarPredDB also records the ethnic or population information of the patients in whom the reported pathogenic missense variants were identified.
Owing to the particularly long fibrous structure of keratin, crystal structures were available for only part of the protein, i.e. the K1/K10-2B domain (4ZRY) [5], the K5/K14-2B domain (3TNU) [21] and the K1/K10-1B domain (6EC0) [22], which were retrieved from the RCSB Protein Data Bank.
Pathogenicity prediction
The pathogenicity of each missense variant was analysed in two parts, i.e. by the approach of our previous computational studies [19, 20] and by molecular docking methods based on resolved crystal structures.
KVarPredDB provides detailed information, including changes in physico-chemical characteristics, inter- and intra-chain interactions, evolutionary conservation and heptad repeat location, to help understand the stability and assembly competence of the keratin coiled-coil heterodimer upon missense variation (Figure 2). This basic information can be used to determine the structural and functional impact of variants on the keratin coiled-coil heterodimer.
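
As an illustration of what "heptad repeat location" means, the sketch below assigns a residue to one of the seven heptad positions (a to g) within an uninterrupted coiled-coil segment. It is a simplified sketch, not the procedure used by KVarPredDB: the function names, the segment start and the starting register are hypothetical, and heptad discontinuities are ignored.

```python
HEPTAD = "abcdefg"


def heptad_position(residue, segment_start, start_register="a"):
    """Map a residue number to its heptad position in an uninterrupted segment.

    segment_start is the first residue of the segment and start_register is
    the heptad position assigned to that residue (both hypothetical inputs).
    """
    if residue < segment_start:
        raise ValueError("residue lies before the segment start")
    offset = HEPTAD.index(start_register)
    return HEPTAD[(residue - segment_start + offset) % 7]


def is_core_position(position):
    """Positions a and d typically form the hydrophobic core of a coiled coil."""
    return position in ("a", "d")


# Hypothetical segment starting at residue 100 in register "a".
for res in (100, 103, 105, 107):
    pos = heptad_position(res, segment_start=100)
    print(res, pos, "core" if is_core_position(pos) else "non-core")
```

A variant at an a or d position is more likely to disturb the packing of the heterodimer interface than one at a solvent-exposed position, which is why the heptad location is reported alongside the other features.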
In addition, molecular docking simulations were adopted to predict stability changes and to obtain binding energies for comparing the wild-type protein with the mutated one. The structure was first relaxed using the Rosetta Relax application [29] in order to find the most energetically favourable conformation of the protein. A Monte Carlo (MC) algorithm generates conformational changes; the energy of the new conformation is calculated and compared with the energy before the change, and if the energy is lower, the change is accepted. We ran 50 repeats of the Relax application and chose the best-scoring structure as input.
We took the relaxed structure as the wild type and then used the Rosetta Backrub application to generate a structural model of each mutant. The Backrub application attempts to capture small conformational changes within the protein. The protein backbone is first divided into multiple fragments; each fragment rotates around the connecting axis while the rest of the protein is held fixed. Rotational moves have six internal backbone degrees of freedom: φ, ψ and the N-Cα-C bond angle at each pivot. Side-chain repacking and energy minimization of all torsion angles follow. We repeated this process for 1000 trials and selected the lowest-energy structure. The resulting conformations were scored with the Rosetta scoring function and accepted or rejected according to the Metropolis criterion with a kT of 0.6. For each missense variant, at least 50 models were generated. To predict the change in stability of the mutant protein induced by a missense variant, the models were further screened with the ddG method in RosettaScripts, with the chain number set to 2. ddG refers to the binding energy and gives the difference in Rosetta energy between the wild-type and mutant protein structures. The Rosetta energy function Talaris2014 was used.
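
The Metropolis acceptance rule mentioned above can be sketched as follows. This is a generic illustration of the criterion, not Rosetta's implementation; the function names are hypothetical, and only the kT value of 0.6 is taken from the text.

```python
import math
import random


def acceptance_probability(e_old, e_new, kT=0.6):
    """Probability of accepting a move from energy e_old to e_new."""
    delta = e_new - e_old
    if delta <= 0:
        return 1.0  # downhill (or equal-energy) moves are always accepted
    return math.exp(-delta / kT)  # uphill moves are accepted with Boltzmann weight


def metropolis_accept(e_old, e_new, kT=0.6, rng=random):
    """Draw a uniform random number and apply the Metropolis criterion."""
    return rng.random() < acceptance_probability(e_old, e_new, kT)


# Hypothetical energies in Rosetta energy units.
print(acceptance_probability(-250.0, -251.5))  # lower energy
print(acceptance_probability(-250.0, -249.4))  # higher energy by 0.6
print(metropolis_accept(-250.0, -249.4))
```

Occasionally accepting uphill moves lets the sampling escape shallow local minima, while a small kT keeps the search close to low-energy conformations.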
Finally, we visualized the differences between the wild-type and mutant protein structures with box plots and structural alignment.
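
The quantities behind such a box plot can be sketched with the Python standard library. The scores below are hypothetical placeholders rather than results from this study, the function names are illustrative, and the sign convention (mutant minus wild type) is an assumption.

```python
from statistics import mean, quantiles


def five_number_summary(scores):
    """Return the values a box plot draws: min, quartiles and max."""
    q1, q2, q3 = quantiles(scores, n=4)  # default "exclusive" method
    return {"min": min(scores), "q1": q1, "median": q2, "q3": q3, "max": max(scores)}


def mean_difference(wild_type, mutant):
    """Mean mutant score minus mean wild-type score (assumed sign convention)."""
    return mean(mutant) - mean(wild_type)


# Hypothetical per-model scores in Rosetta energy units.
wild_type_scores = [-31.2, -30.8, -30.5, -30.1, -29.9, -29.6, -29.0]
mutant_scores = [-27.4, -27.0, -26.6, -26.1, -25.9, -25.3, -24.8]

print(five_number_summary(wild_type_scores))
print(five_number_summary(mutant_scores))
print(mean_difference(wild_type_scores, mutant_scores))
```

Under this convention, a positive difference would indicate that the mutant models score less favourably than the wild-type models.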
Database construction
A user-friendly web interface was developed with Java 1.8. All data were stored in MySQL (version 5.6). The tables included diseases, pathogenic and uncertain-significance missense variants, references, proteins, amino acid physico-chemical properties and molecular docking results. The back end used a three-tier model consisting of a client display layer, a business logic layer and a data layer, which offers good flexibility, scalability and shareability. It was built with the Spring Boot (version 2.1.6), Spring MVC and MyBatis (version 2.0.1) frameworks. The front end was implemented using Bootstrap (version 3.3.7) and jQuery. The pages used Ajax, a web development technique for creating interactive, fast and dynamic web applications that can update a page without reloading it entirely.
A lollipop plot (also known as a stick or needle plot) generated by the MutationMapper visualization tool [28, 30] displays the distribution of all missense variants along a linearized keratin protein and its domains (genome build GRCh37/hg19). An embedded module from RCSB supported by the NGL viewer (ngl.js) was employed to display molecular graphics [31]. Embedded Tomcat was used as the server. The website design is responsive. The database is routinely updated and, as such, the quantity and accuracy of the analyses will continue to increase as more case reports and determined crystal structures are added.
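
The data underlying a lollipop plot are simply variant counts per residue position. The sketch below shows one way to derive them from protein-level missense notation; it is not MutationMapper's code, the function name and regular expression are illustrative, and the variant strings are placeholders rather than entries from the database.

```python
import re
from collections import Counter

# Matches three-letter missense notation such as "p.Ala10Val".
MISSENSE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def lollipop_counts(variants):
    """Count variants per residue position; each (position, count) pair is one lollipop."""
    counts = Counter()
    for variant in variants:
        match = MISSENSE.match(variant)
        if match is None:
            raise ValueError(f"not a missense variant in p. notation: {variant!r}")
        counts[int(match.group(2))] += 1
    return sorted(counts.items())


# Placeholder variants for illustration only.
example = ["p.Ala10Val", "p.Ala10Thr", "p.Gly42Ser"]
for position, count in lollipop_counts(example):
    print(position, count)
```

In the plot, the position gives the horizontal location of each stick along the linearized protein, and the count gives its height.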
Accession Numbers
Atomic coordinates and structure factors for the reported crystal structures were retrieved from the Protein Data Bank under accession numbers 3TNU, 4ZRY and 6EC0.
Keratin gene variants were searched in NCBI-dbSNP under accession numbers 3848, 3849, 3852, 3853, 3854, 3857, 3858, 3861, 3868 and 3872.