Eutherian comparative genomic analysis protocol

The eutherian genomics momentum greatly advanced biological and medical sciences. Yet, future revisions and updates of eutherian genomic sequence data sets were expected, due to potential genomic sequence errors and incompleteness of genomic sequences. The eutherian comparative genomic analysis protocol was established as guidance in protection against potential genomic sequence errors in public eutherian genomic sequence assemblies. The protocol revised, updated and published 14 major eutherian gene data sets, including 2615 complete coding sequences deposited in the European Nucleotide Archive as curated third party data gene data sets under accession numbers: FR734011-FR734074, HF564658-HF564785, HF564786-HF564815, HG328835-HG329089, HG426065-HG426183, HG931734-HG931849, LM644135-LM644234, LN874312-LN874522, LT548096-LT548244, LT631550-LT631670, LT962964-LT963174, LT990249-LT990597, LR130242-LR130508 and LR760818-LR761312. The protocol included 3 original genomics and protein molecular evolution tests: tests of reliability of public eutherian genomic sequences using genomic sequence redundancies, tests of contiguity of public eutherian genomic sequences using multiple pairwise genomic sequence alignments and tests of protein molecular evolution using relative synonymous codon usage statistics.


Introduction
The eutherian genomics momentum greatly advanced biomedical research. For example, one major aim of the initial sequencing and analysis of the human genome was to update and revise human genes, as well as to uncover potential new drugs, drug targets and molecular markers in medical diagnostics. However, future revisions and updates of eutherian genomic sequence data sets were expected, due to potential genomic sequence errors and incompleteness of reference genomic sequences. Specifically, the potential genomic sequence errors included Sanger DNA sequencing method errors (artefactual nucleotide deletions, insertions and substitutions) and analytical and bioinformatical errors (erroneous gene annotations, genomic sequence misassemblies). Thus, the eutherian comparative genomic analysis protocol RRID:SCR_014401 was established as guidance in protection against potential genomic sequence errors in public eutherian genomic sequence assemblies [1-15].

Eutherian comparative genomic analysis protocol
The eutherian comparative genomic analysis protocol RRID:SCR_014401 integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into one framework of eutherian gene descriptions. The protocol included 3 original genomics and protein molecular evolution tests: tests of reliability of public eutherian genomic sequences using genomic sequence redundancies, tests of contiguity of public eutherian genomic sequences using multiple pairwise genomic sequence alignments and tests of protein molecular evolution using relative synonymous codon usage statistics.

Gene annotations
The eutherian gene annotations included gene identifications in public genomic sequence assemblies, analyses of gene features, tests of reliability of public eutherian genomic sequences and tests of contiguity of public eutherian genomic sequences.
1.1. All analyses and manipulations of nucleotide and protein sequences used sequence alignment editor BioEdit.
1.2. The eutherian reference genomic sequence data sets were accessible in National Center for Biotechnology Information's (NCBI) GenBank, as well as in Ensembl genome browser.
1.3. The identifications of potential coding sequences used public eutherian reference genomic sequence assemblies and NCBI's BLAST program including BLAST Genomes and Ensembl genome browser's BLAST or BLAT programs.
1.4. The analyses of gene features used potential coding sequences and direct evidence of eutherian gene annotations accessible in NCBI's nr, est_human, est_mouse and est_others databases.
1.5. The tests of reliability of public eutherian genomic sequences analysed potential coding sequences using good laboratory practice in the Sanger DNA sequencing method. The first test steps analysed nucleotide sequence coverages of potential coding sequences using NCBI's BLAST program and processed Sanger DNA sequencing reads or traces accessible in NCBI's Trace Archive. The second test steps discriminated complete coding sequences from putative coding sequences. Specifically, the tests described potential coding sequences as complete coding sequences only if consensus trace nucleotide sequence coverages were available for every nucleotide. Alternatively, if consensus trace nucleotide sequence coverages were not available for every nucleotide, the potential coding sequences were described as putative coding sequences that were not used in analyses. For example, the good laboratory practice in the Sanger DNA sequencing method required that the minimal consensus trace nucleotide sequence coverage included 2 identical trace nucleotide sequences.
1.6. The tests of contiguity of public eutherian genomic sequences included multiple pairwise genomic sequence alignments. The tests used public eutherian reference genomic sequences encoding complete coding sequences and mVISTA's program AVID. In eutherian genomic sequences, the tests analysed translated exon numbers, as well as their chimerisms and relative orders and orientations. The tests of contiguity of public eutherian genomic sequences did not use masking of transposable elements in public eutherian reference genomic sequence assemblies.
1.7. The curated eutherian gene collections were deposited in European Nucleotide Archive as third party data gene data sets. The revised and updated eutherian gene classifications and nomenclatures used guidelines of human gene nomenclature and guidelines of mouse gene nomenclature.
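
As an illustration only, the second test step of 1.5 (discrimination of complete and putative coding sequences) could be sketched as below. The sketch is not part of the protocol, which used NCBI's BLAST program and Trace Archive. It assumes that trace alignments were already reduced to ungapped placements on the potential coding sequence; the function name, the (offset, sequence) input format and the constant name are hypothetical.

```python
# Minimal consensus coverage stated in step 1.5: 2 identical trace sequences.
MIN_IDENTICAL_TRACES = 2


def classify_coding_sequence(cds, traces, min_identical=MIN_IDENTICAL_TRACES):
    """Describe a potential coding sequence as 'complete' or 'putative'.

    cds    -- potential coding sequence (string of nucleotides)
    traces -- list of (offset, sequence) pairs; a hypothetical, ungapped
              placement of each trace on the coding sequence (0-based offset)
    """
    # Count, for every nucleotide, the traces that carry the same base.
    support = [0] * len(cds)
    for offset, read in traces:
        for i, base in enumerate(read):
            pos = offset + i
            if 0 <= pos < len(cds) and base.upper() == cds[pos].upper():
                support[pos] += 1
    # Complete only if every nucleotide reaches the minimal coverage.
    covered = bool(cds) and all(n >= min_identical for n in support)
    return "complete" if covered else "putative"
```

In this sketch, a single nucleotide supported by fewer than 2 agreeing traces is enough to describe the whole potential coding sequence as putative, which mirrors the "every nucleotide" criterion of step 1.5.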

Phylogenetic analysis
The phylogenetic analysis included protein and nucleotide sequence alignments, calculations of phylogenetic trees and calculations of pairwise nucleotide sequence identities.
2.1. The complete coding sequences were translated using BioEdit and then aligned at amino acid level using ClustalW in protein amino acid sequence alignments. The protein amino acid sequence alignments were manually corrected, and nucleotide sequence alignments were prepared accordingly using BioEdit.
2.2. The calculations of phylogenetic trees used nucleotide sequence alignments and MEGA program.
2.3. Using nucleotide sequence alignments, the pairwise nucleotide sequence identities of eutherian complete coding sequences were calculated using BioEdit. The statistical analyses using Microsoft Office Excel statistical functions included calculations of average pairwise nucleotide sequence identities (ā) and their average absolute deviations (ā_ad), as well as largest (a_max) and smallest (a_min) pairwise nucleotide sequence identities.
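
The statistics of step 2.3 could be reproduced outside BioEdit and Excel with a short sketch such as the following. It is an illustration under stated assumptions, not the protocol's implementation: pairwise identity is assumed here to be the fraction of identical nucleotides over alignment columns where neither sequence has a gap, which may differ from BioEdit's exact definition, and the function and key names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(a, b):
    """Fraction of identical nucleotides over columns without gaps (assumed definition)."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("sequences share no ungapped alignment columns")
    return sum(x == y for x, y in pairs) / len(pairs)


def identity_summary(alignment):
    """Summarise pairwise identities of an alignment given as {name: aligned sequence}."""
    identities = [
        pairwise_identity(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    ]
    average = mean(identities)
    return {
        "a_mean": average,                                    # average identity
        "a_ad": mean(abs(v - average) for v in identities),   # average absolute deviation
        "a_max": max(identities),                             # largest identity
        "a_min": min(identities),                             # smallest identity
    }
```

The four values correspond to what the Excel functions AVERAGE, AVEDEV, MAX and MIN would return for the same list of pairwise identities.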

Protein molecular evolution analysis
The protein molecular evolution analysis included analyses of protein amino acid sequence features and tests of protein molecular evolution using relative synonymous codon usage statistics.
3.1. The protein amino acid sequence features were annotated manually, including analyses of common cysteine amino acid residue patterns among eutherian major protein clusters.

3.2. The tests of protein molecular evolution using relative synonymous codon usage statistics integrated patterns of nucleotide sequence similarities with protein primary structures. Using nucleotide sequence alignments, MEGA calculated relative synonymous codon usage statistics as ratios between observed and expected amino acid codon counts (R = Counts / Expected counts). The amino acid codons with R ≤ 0.7 were described as not preferable amino acid codons. In reference protein amino acid sequences, the tests described invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions that did not include amino acid codons with R ≤ 0.7) and compensatory amino acid sites (variant alignment positions that included amino acid codons with R ≤ 0.7).
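
The test logic of step 3.2 could be sketched as follows. This is a hedged illustration, not the MEGA implementation: expected counts are assumed to be the total count of an amino acid divided by the number of its synonymous codons in the standard genetic code, counts are pooled over all aligned sequences, and "invariant" is assumed to mean invariant at the amino acid level. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """R = Counts / Expected counts for every codon of an observed amino acid."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips gaps and ambiguous codons
                counts[codon] += 1
    family_total, family_size = Counter(), Counter()
    for codon, aa in CODON_TABLE.items():
        family_total[aa] += counts[codon]
        family_size[aa] += 1
    return {
        codon: counts[codon] * family_size[aa] / family_total[aa]
        for codon, aa in CODON_TABLE.items()
        if family_total[aa]
    }


def classify_sites(sequences, threshold=0.7):
    """Label each codon column as invariant, forward or compensatory."""
    r = rscu(sequences)
    labels = []
    for k in range(min(len(s) for s in sequences) // 3):
        codons = [s[3 * k:3 * k + 3].upper() for s in sequences]
        codons = [c for c in codons if c in CODON_TABLE]
        residues = {CODON_TABLE[c] for c in codons}
        if len(residues) <= 1:
            labels.append("invariant")
        elif any(r[c] <= threshold for c in codons):
            labels.append("compensatory")  # variant site with a not preferable codon
        else:
            labels.append("forward")       # variant site without such codons
    return labels
```

Both functions take a list of aligned, in-frame coding sequences of equal length; `classify_sites` returns one label per codon column.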
