Objective
Specific Objectives
This study has four objectives:
1. To sequence the genomes of patients suffering from chronic kidney disease on the Illumina MiSeq platform.
2. To explore possible correlations between the plasma and urinary kidney profiles and selected biochemical parameters, as well as associations between known phenotypic SNPs and amino acid exchanges.
3. To investigate the existence of deleterious amino acid exchanges.
4. To authenticate these exchanges on a phenotypic basis by reconstructing the variant genes harboring these mutations, paving the way for the future development of animal models for ex vivo studies.
Genomic DNA library preparation and sequencing
• High-quality gDNA was purified from whole lymphocytes as described elsewhere. The library was constructed using the Nextera Rapid Capture kit (Illumina).
• 50 ng of gDNA was used with transposase-based chemistry to create exome libraries.
• Library QC was performed on a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA) to assess the integrity and size of library fragments.
• 2 × 75 paired-end cycles were performed (+1 cycle added to each forward and reverse read to allow for phasing/pre-phasing).
Bioinformatics analyses
The flow chart for the bioinformatics analysis has been published by our team elsewhere.
The successive steps used in the bioinformatics analysis are summarized below.
- Read Trimming and Clipping of Adapters
Reads are trimmed from the 3' end until they have a phred score of at least 30. Illumina sequencing adapters are removed from the reads, and all reads are required to have a length of at least 50 bp. Trimming and clipping are performed using Trimmomatic (Bolger, Lohse, and Usadel 2014) [18]. For each sample readset, this step reports the number of raw paired reads obtained from the sequencer, the number of paired reads remaining after the trimming step, and the percentage of surviving paired reads relative to raw paired reads.
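As an illustration, the trimming step could be invoked as follows; file names are hypothetical, and the quality, adapter, and length settings mirror the parameters stated above:

    import subprocess

    # Paired-end trimming with Trimmomatic (hypothetical file names).
    subprocess.run([
        "java", "-jar", "trimmomatic-0.39.jar", "PE", "-phred33",
        "sample_R1.fastq.gz", "sample_R2.fastq.gz",   # raw paired reads
        "R1.paired.fq.gz", "R1.unpaired.fq.gz",       # surviving / orphaned forward reads
        "R2.paired.fq.gz", "R2.unpaired.fq.gz",       # surviving / orphaned reverse reads
        "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10",         # clip Illumina sequencing adapters
        "TRAILING:30",                                # trim 3' bases with phred < 30
        "MINLEN:50",                                  # discard reads shorter than 50 bp
    ], check=True)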
- Read Alignment to a Reference Genome
The filtered reads are aligned to a reference genome. The genome used in this analysis is the Homo sapiens assembly hg19. Each readset is aligned using BWA (Li and Durbin 2010) [19], which creates a Binary Alignment Map (BAM) file. All readset BAM files from the same sample are then merged into a single global BAM file using Picard.
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database (target), such as the human reference genome. It is based on the Burrows-Wheeler Transform (BWT). BWA is designed for short queries up to ~200 bp with a low error rate (< 3%). It performs gapped alignment, supports paired-end reads, and is one of the fastest short-read alignment algorithms to date, while also visiting suboptimal hits.
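A minimal sketch of this step, assuming BWA's short-read aln/sampe workflow followed by a Picard merge (all file names hypothetical; hg19.fa must first be indexed with bwa index):

    import subprocess

    # Align each mate with BWA's backtrack algorithm, then pair and convert to a
    # coordinate-sorted BAM file.
    for r in ["1", "2"]:
        subprocess.run(f"bwa aln hg19.fa R{r}.paired.fq.gz > R{r}.sai",
                       shell=True, check=True)
    subprocess.run(
        "bwa sampe hg19.fa R1.sai R2.sai R1.paired.fq.gz R2.paired.fq.gz"
        " | samtools view -bS - | samtools sort -o readset1.bam -",
        shell=True, check=True)

    # Merge all readset BAM files of the same sample into one global BAM with Picard.
    subprocess.run([
        "java", "-jar", "picard.jar", "MergeSamFiles",
        "I=readset1.bam", "I=readset2.bam", "O=sample.bam",
    ], check=True)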
- Read Alignment Improvement
Because no aligner is perfect, the alignment step remains sensitive to the aligner parameters as well as to technical and biological variation. To increase the quality and speed of subsequent variant calling, we performed a series of procedures to improve the alignment: realigning reads around short insertions and deletions, fixing possible read-mate discrepancies due to realignment, marking duplicated reads, and recalibrating read base qualities.
- Realigning Short Insertions and Deletions (INDELs)
The realignment is done using GATK (DePristo et al. 2011) [20]. INDELs in reads (especially near the ends) can trick mappers into mis-aligning with mismatches. The artifacts resulting from these mismatches can harm base quality recalibration and variant detection, so realignment around INDELs improves the accuracy of several downstream steps.
Insertion and deletion realignment is performed on regions where the aligner prefers multiple base mismatches over INDELs because they appear less costly to the algorithm. Such regions introduce false-positive variant calls, which can be filtered out by realigning those regions properly. Realignment occurs mainly in three region types: (1) known sites (e.g., dbSNP, 1000 Genomes); (2) INDELs seen in the original alignments (in CIGAR strings); and (3) sites where the evidence suggests a hidden INDEL.
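A sketch of this step using the GATK 3-style realignment tools (RealignerTargetCreator followed by IndelRealigner); file names, including the known-INDELs resource, are hypothetical:

    import subprocess

    # 1) Identify intervals that need realignment around known and suspected INDELs.
    subprocess.run([
        "java", "-jar", "GenomeAnalysisTK.jar", "-T", "RealignerTargetCreator",
        "-R", "hg19.fa", "-I", "sample.bam",
        "-known", "Mills_and_1000G_gold_standard.indels.hg19.vcf",
        "-o", "realign.intervals",
    ], check=True)

    # 2) Perform the local realignment over those intervals.
    subprocess.run([
        "java", "-jar", "GenomeAnalysisTK.jar", "-T", "IndelRealigner",
        "-R", "hg19.fa", "-I", "sample.bam",
        "-targetIntervals", "realign.intervals",
        "-o", "sample.realigned.bam",
    ], check=True)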
- Fixing Read Mates
Once local regions are realigned, the mate coordinates of the aligned reads need to be recalculated, since the reads may now be placed at positions that differ from their original alignment. The read mate positions are fixed using Picard, and the resulting file is subsequently used to mark duplicates.
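A minimal sketch of the mate-fixing step, assuming Picard's FixMateInformation tool (file names hypothetical):

    import subprocess

    # Recompute mate coordinates after local realignment and re-sort the output.
    subprocess.run([
        "java", "-jar", "picard.jar", "FixMateInformation",
        "I=sample.realigned.bam", "O=sample.fixmate.bam",
        "SORT_ORDER=coordinate",
    ], check=True)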
- Marking Duplicates
Exome sequencing involves several PCR amplification steps during library construction that can induce coverage bias, and this bias can generate high levels of false positives and false negatives during the variant calling steps (SNV, SV and CNV). Removing or marking read duplicates using Picard is therefore an important step of the sequencing analysis. Aligned reads are duplicates if they have the same 5' alignment positions (for both mates in the case of paired-end reads). All but the best pair (based on alignment score) are marked as duplicates in the BAM file. Duplicate reads are excluded from subsequent analyses.
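An illustrative invocation of Picard MarkDuplicates (file names hypothetical):

    import subprocess

    # Mark (rather than remove) duplicates; downstream tools honor the duplicate flag.
    subprocess.run([
        "java", "-jar", "picard.jar", "MarkDuplicates",
        "I=sample.fixmate.bam", "O=sample.dedup.bam",
        "M=duplicate_metrics.txt",        # per-library duplication statistics
        "REMOVE_DUPLICATES=false",        # keep the reads, set the 0x400 SAM flag
    ], check=True)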
- Base Quality Recalibration
This step recalibrates the base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the BAM output are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. For example, a quality score of 20 corresponds to a 1% error probability, i.e., a 99% probability that the call at that position is correct. The recalibration process is accomplished by analyzing the covariation among several features of a base, for example: (1) the reported quality score, (2) the position within the read, and (3) the preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine. The recalibration is performed using GATK (DePristo et al. 2011) [20].
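A sketch of the two-pass recalibration with GATK 3-style tools (BaseRecalibrator followed by PrintReads); file names and the known-sites resource are hypothetical:

    import subprocess

    # 1) Model covariation of base qualities, excluding known variant sites.
    subprocess.run([
        "java", "-jar", "GenomeAnalysisTK.jar", "-T", "BaseRecalibrator",
        "-R", "hg19.fa", "-I", "sample.dedup.bam",
        "-knownSites", "dbsnp_138.hg19.vcf",
        "-o", "recal.table",
    ], check=True)

    # 2) Write a BAM file with recalibrated base quality scores.
    subprocess.run([
        "java", "-jar", "GenomeAnalysisTK.jar", "-T", "PrintReads",
        "-R", "hg19.fa", "-I", "sample.dedup.bam",
        "-BQSR", "recal.table",
        "-o", "sample.recal.bam",
    ], check=True)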
- Sequencing and Alignment Metrics per Sample
The aim of this step is to provide general summary statistics for each sample; sample readsets are merged for clarity.
- SNV Annotations
An in-house database identifies regions in which reads are confidently mapped to the reference genome. Low-mappability regions generally correspond to RepeatMasker regions, and their extent decreases substantially with increasing read length. A position is flagged as HC (coverage too high), LC (low coverage), MQ (mean mapping quality too low, < 20), or ND (unmappable region, no data at the position).
The VCF files are also annotated for:
- dbSNP using the SnpSift software (Cingolani, Patel, et al. 2012) [21].
- Variant effects (predicted variant effects on gene, such as amino acid changes) using the SnpEff software (Cingolani, Platts, et al. 2012) [21].
SnpSift (http://snpeff.sourceforge.net/SnpSift.html) is a toolbox used to filter and localize missense variants, i.e., coding-region variants for which the variation generates a codon modification. This leads to two types of variant metrics tables: codon changes and amino acid changes induced by SNVs.
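As an illustration, the effect annotation and missense filtering could be run as follows (jar locations and file names hypothetical; the filter expression assumes SnpEff's ANN annotation format):

    import subprocess

    # Predict variant effects (codon and amino acid changes) with SnpEff.
    subprocess.run("java -jar snpEff.jar hg19 sample.vcf > sample.ann.vcf",
                   shell=True, check=True)

    # Annotate against dbSNP with SnpSift.
    subprocess.run(
        "java -jar SnpSift.jar annotate dbsnp.vcf sample.ann.vcf"
        " > sample.ann.dbsnp.vcf",
        shell=True, check=True)

    # Keep only missense variants (codon-changing SNVs) with SnpSift filter.
    subprocess.run(
        "java -jar SnpSift.jar filter \"ANN[*].EFFECT has 'missense_variant'\""
        " sample.ann.dbsnp.vcf > missense.vcf",
        shell=True, check=True)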
- Variants Metrics
The variant metrics and statistics have been generated using the SnpEff software (Cingolani, Platts, et al. 2012) [21].
Non-synonymous mutations in humans
We queried two reference data sets of natural human variants. The first set of natural variants was from the 1000 Genomes Project (1kG), and the second from gnomAD. The 1kG data comprise human DNA variants compiled from 1,092 individuals from diverse populations across the world. Only variants with a frequency of occurrence of > 1% were included. To avoid the presence of unstable elements, major histocompatibility complex and immunoglobulin transcripts were removed from the pool of transcripts. Of interest here were the variants that mapped to protein-coding regions as synonymous (sSNPs) and non-synonymous (nsSNPs) variants. These were obtained from Ensembl [22, 40] v67, as described in de Beer et al. [1].
The other source of natural variants was gnomAD, which contains data from various disease-specific and population genetic studies. We used v2.0, which contains variants from over 140,000 individuals. The data were downloaded from the Broad Institute's website. As with the 1kG data, the genome coordinates had to be converted to the corresponding sSNPs and nsSNPs in the relevant protein-coding regions to obtain the amino acid changes involved.
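A minimal sketch of the > 1% allele-frequency filter, reading the AF tag from a VCF INFO field (file names hypothetical; a production pipeline would typically use a dedicated VCF parser):

    import gzip

    # Keep variants whose allele frequency exceeds 1%.
    with gzip.open("variants.vcf.gz", "rt") as vcf, open("common.vcf", "w") as out:
        for line in vcf:
            if line.startswith("#"):          # keep header lines as-is
                out.write(line)
                continue
            info = line.rstrip("\n").split("\t")[7]
            tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
            # The first AF value covers the first ALT allele at multi-allelic sites.
            af = float(tags.get("AF", "0").split(",")[0])
            if af > 0.01:                     # frequency of occurrence > 1%
                out.write(line)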
Non-synonymous mutations in CKD patients
As noted in the bioinformatics analysis section, the representative transcript for each protein sequence obtained from the VCF files was assigned an Ensembl ID using the mapping tool provided by UniProt. We derived nsSNPs for each CKD protein from the VCF files using the SnpEff software, and the metrics were converted into two tables, one representing codon changes and the other amino acid exchanges, with each record on the Y-axis symbolized by a single-letter amino acid code. Because the major histocompatibility complex and immunoglobulins are inherently unstable, they were removed from the pool of transcripts. The data set of amino acid exchanges was later converted into amino acid pair score matrices, i.e., exchange frequencies among amino acids, as sketched below.
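A minimal sketch of how such an exchange-frequency matrix can be built from (reference, variant) residue pairs; the pairs shown are hypothetical placeholders, not study data:

    from collections import Counter

    AA = "ACDEFGHIKLMNPQRSTVWY"               # the 20 single-letter amino acid codes

    # Hypothetical (reference, variant) residue pairs parsed from the SnpEff
    # amino-acid-change table.
    exchanges = [("A", "V"), ("R", "H"), ("A", "V"), ("G", "E")]

    counts = Counter(exchanges)
    # 20 x 20 exchange-frequency matrix: rows = reference residue,
    # columns = variant residue.
    matrix = [[counts.get((ref, alt), 0) for alt in AA] for ref in AA]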
Statistical Analysis
We calculated the Pearson correlation coefficient (rP) to examine linear relationships between the variables and the Spearman correlation coefficient (rS) to examine monotonic relationships over sequence numbers. We interpreted the obtained correlations as a strong relationship if 0.7 < |r| ≤ 1, a moderate relationship if 0.4 < |r| ≤ 0.7, and a weak or absent relationship if 0 ≤ |r| ≤ 0.4. We analyzed the differences in mutation numbers between the patients and the 1kG dataset on the one hand, and between the patients and the gnomAD dataset on the other, using Fisher's exact test. We applied Bonferroni correction to the p-values when examining multiple comparisons. All analysis results and graphics were obtained using R (R Core Team, 2021) [23].
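Although all analyses were performed in R, the logic of the correlation tests, the threshold interpretation, Fisher's exact test, and the Bonferroni correction can be sketched in Python with SciPy (the data shown are illustrative only, not study results):

    import numpy as np
    from scipy import stats

    # Illustrative data only (hypothetical measurements).
    x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.7])
    y = np.array([0.9, 2.0, 2.9, 4.1, 5.2, 6.1])

    r_p, p_p = stats.pearsonr(x, y)    # linear relationship
    r_s, p_s = stats.spearmanr(x, y)   # monotonic relationship

    def strength(r):
        # Interpretation thresholds used in this study.
        r = abs(r)
        if r > 0.7:
            return "strong"
        if r > 0.4:
            return "moderate"
        return "weak/absent"

    # Fisher's exact test on a 2x2 table of mutation counts
    # (hypothetical counts: [mutated, non-mutated] for patients vs. 1kG).
    odds, p_fisher = stats.fisher_exact([[120, 880], [95, 905]])

    # Bonferroni correction for m comparisons.
    m = 3
    p_adjusted = min(p_fisher * m, 1.0)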
Resources: R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.