Data Selection
NDRG1 gene SNPs were obtained from the NCBI dbSNP database (https://www.ncbi.nlm.nih.gov/snp/?term=NDRG1). The primary sequences of the proteins encoded by the gene were taken from the UniProt database (https://www.uniprot.org/uniprot/Q92597) under accession code Q92597. The three available FASTA sequences were used for analysis.
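As an illustration of how the downloaded FASTA records can be prepared for the prediction servers, the following minimal sketch parses FASTA text into a header-to-sequence mapping. The function name `parse_fasta` and the toy records are hypothetical placeholders; they are not the actual Q92597 sequences and are not part of the original workflow.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {header: sequence}.

    Header lines start with '>'; sequence lines that follow a header
    are concatenated until the next header. Blank lines are ignored.
    """
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Toy input only (placeholder residues, NOT real NDRG1 isoform sequences).
toy_fasta = ">iso1 toy\nMSRE\nLQ\n>iso2 toy\nMKV\n"
toy_records = parse_fasta(toy_fasta)
```

Each parsed sequence can then be pasted or uploaded to the web tools described in the following sections.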
Data analysis in bioinformatics
Bioinformatics tools were used to determine the effects of the NDRG1 gene SNPs on the NDRG1 protein. These effects were evaluated on the following platforms:
Prediction of deleterious nsSNPs
PredictSNP1.0 (http://loschmidt.chemi.muni.cz/predictsnp1/) (Bendl 2014) was used to predict the effect of each SNP on protein function and structure. This resource is a consensus classifier that reports nine predictions: those of eight top-performing tools (SIFT, PolyPhen-1, PolyPhen-2, MAPP, PhD-SNP, SNAP, PANTHER, and nsSNPAnalyzer) plus the PredictSNP consensus itself. SIFT (Sorting Intolerant from Tolerant) predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of the amino acids (Ng 2003). It takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions at each position of that sequence. The PolyPhen-1 (Polymorphism Phenotyping) tool uses a specialized set of empirical rules to predict the possible impact of amino acid substitutions, while PolyPhen-2 (Polymorphism Phenotyping v2) predicts the potential effect of an amino acid substitution on the structure and function of a human protein using multiple sequence alignment and structural information. MAPP (Multivariate Analysis of Protein Polymorphism) analyzes the physicochemical variation present in each column of a protein sequence alignment and predicts the impact of amino acid substitutions on protein function (Stone 2005). PhD-SNP (Predictor of Human Deleterious Single Nucleotide Polymorphisms) is a support vector machine (SVM)-based predictor used to classify nsSNPs as disease-related or neutral (Capriotti 2006). SNAP (Screening for Non-Acceptable Polymorphisms) is a neural network-based method that predicts the functional effects of nsSNPs using in silico-derived protein information (Bromberg 2007). PANTHER (Protein Analysis Through Evolutionary Relationships) estimates the probability that a given nsSNP will have a functional effect on the protein using position-specific evolutionary preservation (Tang 2016). nsSNPAnalyzer uses a machine learning method called random forest to predict whether an nsSNP has a phenotypic effect, based on multiple sequence alignment and 3D structure information (Bao 2005). Finally, PredictSNP1.0 displays the confidence scores generated by each tool and the consensus prediction as percentages, using their observed precision values to simplify comparisons (Bendl 2014).
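The idea of combining individual tool outputs into one consensus call can be sketched as a confidence-weighted vote. This is an illustrative simplification under stated assumptions, not the published PredictSNP formula: the function name `consensus_vote`, the label strings, and the example confidences are all hypothetical.

```python
from typing import Dict, Tuple


def consensus_vote(predictions: Dict[str, Tuple[str, float]]) -> Tuple[str, float]:
    """Confidence-weighted majority vote over per-tool predictions.

    predictions maps a tool name to (label, confidence), where label is
    "deleterious" or "neutral" and confidence lies in [0, 1].
    Returns (consensus_label, score) with score in [-1, 1]; positive
    scores favour "deleterious", negative scores favour "neutral".
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for label, confidence in predictions.values():
        if label not in ("deleterious", "neutral"):
            raise ValueError(f"unknown label: {label}")
        sign = 1.0 if label == "deleterious" else -1.0
        weighted_sum += sign * confidence
        total_weight += confidence
    if total_weight == 0:
        raise ValueError("no predictions with non-zero confidence")
    score = weighted_sum / total_weight
    return ("deleterious" if score > 0 else "neutral"), score


# Hypothetical tool outputs for a single variant (illustrative values only).
example = {
    "SIFT": ("deleterious", 0.8),
    "MAPP": ("neutral", 0.6),
    "SNAP": ("deleterious", 0.7),
}
example_label, example_score = consensus_vote(example)
```

In this sketch a tool that is more confident contributes more to the final call, which mirrors the motivation for reporting per-tool confidences on a common percentage scale.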
Prediction of change in protein stability
The effect of amino acid changes on protein stability was predicted using I-Mutant2.0 (http://folding.biofold.org/cgi-bin/i-mutant2.0), a support vector machine (SVM)-based web server for the automatic prediction of protein stability changes upon single-site mutations, such as those arising from single nucleotide polymorphisms (SNPs). It reports the predicted free energy change (DDG) and the sign of the prediction (increased or decreased stability). The DDG value is calculated as the unfolding Gibbs free energy of the mutated protein minus the unfolding Gibbs free energy of the wild-type protein, in kcal/mol (Capriotti 2005).
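The DDG definition above can be written out as a short sketch. The function names and the numeric inputs are hypothetical and serve only to show the sign convention (negative DDG means decreased stability); I-Mutant2.0 itself predicts DDG from sequence or structure rather than from measured free energies.

```python
def ddg(dg_unfold_mutant: float, dg_unfold_wildtype: float) -> float:
    """DDG in kcal/mol: mutant unfolding free energy minus wild-type."""
    return dg_unfold_mutant - dg_unfold_wildtype


def stability_sign(ddg_value: float) -> str:
    """Map a DDG value to the direction of the stability change."""
    if ddg_value < 0:
        return "decreased"
    if ddg_value > 0:
        return "increased"
    return "unchanged"


# Illustrative free energies (kcal/mol), not measured NDRG1 values.
example_ddg = ddg(dg_unfold_mutant=3.0, dg_unfold_wildtype=5.0)
example_sign = stability_sign(example_ddg)
```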
Stability was also assessed with the MUpro tool (http://mupro.proteomics.ics.uci.edu/). This server is based on two machine learning methods: support vector machines and neural networks. Both were trained on a large mutation dataset and showed accuracy above 84%. A confidence score < 0 indicates that the mutation decreases the stability of the protein, while a confidence score > 0 indicates that the mutation increases the stability of the protein (Elkhattabi 2019).
PROVEAN (Protein Variation Effect Analyzer) was also used. This tool predicts whether an amino acid change in a protein sequence affects the function of the protein. To maximize the separation of deleterious and neutral variants across all four classes of human protein variants, the default score threshold is currently set at -2.5 for the binary classification (i.e., deleterious vs. neutral). To increase detection sensitivity (finding more deleterious variants, including those with lower confidence), a higher score threshold can also be used (Elkhattabi 2019).
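The binary PROVEAN classification can be sketched as a simple threshold rule. This is a minimal illustration that assumes a variant is called deleterious when its score is at or below the threshold; the function name and the example scores are hypothetical.

```python
def classify_provean(score: float, threshold: float = -2.5) -> str:
    """Binary call from a PROVEAN score.

    Scores at or below the threshold are called "deleterious",
    scores above it "neutral". Raising the threshold (e.g. to -1.5)
    labels more variants deleterious, i.e. increases sensitivity.
    """
    return "deleterious" if score <= threshold else "neutral"


# Illustrative scores only, not results for actual NDRG1 variants.
default_call = classify_provean(-2.0)                   # default cutoff
sensitive_call = classify_provean(-2.0, threshold=-1.5)  # relaxed cutoff
```

The same variant changes class when the cutoff is relaxed, which is the sensitivity trade-off described in the paragraph above.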
Prediction of linkage disequilibrium
Alleles at some pairs of SNPs along a haplotype are inherited together more often than expected by chance, a phenomenon called linkage disequilibrium, in which the allele present at one locus conveys information about the allele present at another locus. The Haploview software computes linkage disequilibrium statistics between pairs of SNPs (Barrett 2005, Jr 2021). Elevated levels of linkage disequilibrium between two SNPs indicate that the statistical information carried by these polymorphisms is redundant, so they are not suitable as independent markers for association analysis (Jr 2021). Haploview is used to (1) analyze HapMap data and choose target SNPs, (2) assess the quality of disease genotype data, (3) test for association, and (4) evaluate a region for tracking an association (Jr 2021).
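For reference, the standard pairwise linkage disequilibrium measures (D, D' and r²) can be computed from haplotype and allele frequencies as in the sketch below. It assumes phased haplotype frequencies are already known; tools such as Haploview estimate them from genotype data first. The function name and the input frequencies are hypothetical.

```python
from typing import Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float, float]:
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1)
          together with allele B (locus 2).
    p_a, p_b: frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b
    # D' normalises D by its maximum attainable magnitude given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = (d * d) / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete LD: allele A always occurs with allele B (illustrative values).
complete = ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5)
# Linkage equilibrium: the haplotype frequency equals the product.
independent = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)
```

An r² close to 1 for a SNP pair corresponds to the redundancy described above, where genotyping one SNP adds little information beyond the other.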
Protein-protein interaction analysis
STRING (https://string-db.org/cgi/about) is a database of known and predicted protein-protein interactions. Interactions include direct (physical) and indirect (functional) associations; they originate from computational predictions, from the transfer of knowledge between organisms, and from interactions aggregated from other (primary) databases.
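An interaction network for a protein of interest can also be requested programmatically. The sketch below only builds a query URL and assumes the layout of the STRING REST API (`/api/<format>/network` with `identifiers`, `species` and `required_score` parameters), which should be checked against the current STRING documentation. The query gene, the human taxon identifier 9606 and the score cutoff of 400 are illustrative assumptions, not settings reported in this study.

```python
from urllib.parse import urlencode


def string_network_url(identifier: str,
                       species: int = 9606,
                       required_score: int = 400,
                       output_format: str = "tsv") -> str:
    """Build a STRING network query URL (assumed REST layout).

    identifier: protein or gene name to query.
    species: NCBI taxonomy identifier (9606 is Homo sapiens).
    required_score: minimum combined score on a 0-1000 scale.
    """
    params = {
        "identifiers": identifier,
        "species": species,
        "required_score": required_score,
    }
    return f"https://string-db.org/api/{output_format}/network?{urlencode(params)}"


# Builds the URL only; no network request is made here.
ndrg1_url = string_network_url("NDRG1")
```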
Probability analysis of the impact of nsSNPs on the function of the studied protein
We employed the SNPs3D resource (http://www.SNPs3D.org), a tool that has three main modules. The first module identifies which genes are candidates for involvement in a specific disease. The second module provides information on the relationships between sets of candidate genes. The third module analyzes the likely impact of non-synonymous SNPs on protein function. Disease/candidate gene relationships and gene-gene relationships are derived from the literature using simple but effective text profiles. SNP/protein function relationships are derived by two methods, one using principles of protein structure and stability, the other based on sequence conservation. Entries for each gene include a series of links to other data such as expression profiles, pathway context, mouse knockout information, and papers. Gene-gene interactions are presented in an interactive graphical interface, providing quick access to the underlying information as well as convenient navigation through the network.