Human reference protein sequences
Human protein sequences are well studied and for the purposes of this study, are assumed to be accurate. The Homo sapiens reference proteome (UP000005640) from Uniprot (downloaded in Dec. 2019) was used for the reference protein sequences and we selected the canonical isoform for each human gene. A total of 20,596 human protein sequences were extracted.
Orthologous sequences from primate proteomes
Potential orthology relationships between human proteins and proteins from ten primates (Pan Troglodytes, Gorilla gorilla gorilla, Macaca mulatta, Macaca fascicularis, Chlorocebus Sabaeus, Papio Anubis, Pongo abelii, Nomascus leucogenys, Callithrix jacchus, Otolemur garnettii) were predicted using a simple protocol based on BLASTP [34] searches in the Uniprot reference proteome database. For each human protein sequence, the best hit in each of the ten primates was considered as an orthologous protein if it has an e-value <10-50 and sequence identity >80%. Table 1 shows details of the primate proteomes used and the number of predicted orthologous sequences.
Orthologous sequences were also identified using the same protocol in the proteomes corresponding to the ten primates in the RefSeq database (release 98). In this case, protein isoforms of the same gene are not grouped in the same database entry, we retained all protein sequences identified in the BLASTP searches with the same thresholds as above, i.e. an e-value <10-50 and sequence identity >80%.
Multiple Alignments and Detection of Sequence Errors
For each of the 20,596 human proteins, a multiple sequence alignment (MSA) of the primate orthologous sequences was constructed using the MAFFT program [35]. Each of MSA was analyzed to detect potential insertions, deletions and mismatches in the orthologous sequences using the SIBIS program [36]. SIBIS uses a Bayesian framework combined with Dirichlet mixture models and visual inspection to identify inconsistent sequence segments representing potential sequence errors. The predicted errors are classified into seven categories: N-terminal deletion, N-terminal extension, C-terminal deletion, C-terminal extension, internal deletion, internal insertion and mismatched segment.
For this study, we focused on potential sequence mismatches, where a mismatch was defined as a sequence segment that is at least 20 amino acids long with less than 50% sequence identity between the human reference and the primate sequences. In addition, to exclude false positive mismatch errors due to wrong ortholog detection or MSA errors, mismatching sequences in primates that were significantly longer or shorter than the human sequence (primate sequence length three times longer or shorter than human) were excluded.
Phylogenetic Trees
Phylogenetic relations between organisms were determined using the LifeMap tool [37] by extracting a subtree based on the list of organisms of interest. The subtree was visualized using iTOL [38] generate the tree figures.
Intron-exon maps and genomic sequences
To perform more in-depth analysis of mismatch errors, Uniprot protein sequences were mapped to their transcript IDs in the Ensembl database (release 99) using the Uniprot Retrieve/ID mapping tool. For each transcript, the CDS and genomic sequence were retrieved, as well as intron and exon positions on the genomic sequence.
Mismatch characterization
For each potential mismatch error detected by SIBIS, the mismatch sequence segment was located on the exon/intron map, CDS and genomic region. To identify the potential causes of the mismatch error, nine features were defined and classified into three categories (Gene misprediction, False positive, Undetermined) as follows:
Evidence of a Gene prediction error involves five possible features:
- a low quality genomic sequence, containing ‘N’ characters indicating undetermined or ambiguous nucleotides (IUPAC nomenclature) probably caused by genome sequencing errors or assembly gaps,
- short introns in the primate sequence (<30 nucleotides),
- a mismatch coded by a single exon in human compared to >3 exons in the primate sequence,
- non-canonical splice sites (not GT/AG) in the human mismatch sequence,
- the presence of a small nucleotide insertion in the primate exon sequence causing a frameshift.
Evidence that the detected mismatch is in fact a False positive (i.e. the mismatch is due to inaccuracies in the quality control protocol) involves three possible features:
- the existence of alternative human isoform in the Ensembl database that does not generate a mismatch at that position,
- MSA alignment errors,
- mismatches in a sequence region with repeats.
The mismatch was characterized as Undetermined if one feature was detected, namely that the same mismatch is identified in four or more primates in the MSA, or a mixture of Gene misprediction and False positive evidence was identified.
Mismatch error correction protocol
For the mismatch errors classified as gene mispredictions, a simple protocol was implemented to try to improve the quality of the protein sequence. Given a primate protein sequence with a mismatch error compared to the human sequence, the error correction protocol involves four steps (outlined in Fig. 2) as follows:
- Using the human sequence segment as a query, perform a TBLASTN search in the primate genomic region corresponding to the transcript.
- If a hit is found with an e-value <0.001 and sequence identity >80%, it is inserted into the primate transcript as a new exon.
- Verify the integrity of the new transcript, by checking for overlapping or inverted exons, frameshifts and stop codons in the inserted sequence.
- Create a new CDS and protein sequence.
To estimate the quality of the new protein sequence, the modified primate sequence was realigned with the human reference protein. Two scores were used to evaluate the quality of the primate sequences:
- Percent sequence identity I was defined as I = i/N×100, where i is the number of identical amino acids in the alignment and N is the total number of non-gap positions in the alignment.
- Percent sequence coverage C was defined as C = N/Ntot×100, where N is the number of non-gap positions in the alignment and Ntotis the total length of the alignment.
Implementation
The analyses were performed using an Sqlite3 database, in-house bash and Python (version 2.7) scripts, and numpy (numpy.org), pandas (pandas.pydata.org) and matplotlib (matplotlib.org) Python libraries. The Biopython (biopython.org) package was used to perform pairwise alignments during the mismatch characterization step. To retrieve large amounts of data from the Ensembl database, the grequests (pypi.org) package was used to perform asynchronous application programming interface (API) requests. The Json (www.json.org) library was used to process Ensembl API output.