A MAKER pipeline for prediction of protein-coding genes in parasitic worm genomes

Once a genome has been sequenced, a challenging task is to predict protein-coding genes in the genome assembly. If several genomes of related species have been sequenced, it is desirable that the pipeline is scalable to multiple species, and will give consistent results across species. Here we describe a consistent, scalable and automated computational pipeline, based on the MAKER software, for prediction of protein-coding genes in parasitic worm (nematode and platyhelminth) genome assemblies. This protocol can be used to generate a relatively accurate and complete gene set, for a broad range of nematode and platyhelminth species. Furthermore, it does not require RNAseq data and does not require training using a curated set of genes of known structure.


Introduction
Once a genome has been sequenced, a challenging task is to predict protein-coding genes in the genome assembly. If several genomes of related species have been sequenced, it is desirable that the pipeline is scalable to multiple species, and will give consistent results across species. Here we describe a consistent, scalable and automated computational pipeline, based on the MAKER software, for prediction of protein-coding genes in parasitic worm (nematode and platyhelminth) genome assemblies.

In our pipeline, gene predictions are generated using MAKER version 2.2.28 (Holt _et al_ 2011). Our MAKER annotation pipeline consists of four steps, taking into account evidence from multiple sources (Figure 1). First, repetitive elements in the genome are identified and masked using RepeatMasker (www.repeatmasker.org) by scanning scaffolds for matches to repeats from a repeat library generated using RepeatModeler (www.repeatmasker.org/RepeatModeler.html). Second, _ab initio_ gene models to be used as evidence within MAKER are generated using Augustus 2.5.5 (Stanke _et al_ 2006), GeneMark-ES 2.3a (self-trained) (Ter-Hovhannisyan _et al_ 2008), and SNAP 2013-02-16 (Korf 2004). Further gene models to use as MAKER input are generated using two comparative algorithms: genBlastG (She _et al_ 2011), using comparisons to _C. elegans_ gene models from WormBase (Yook _et al_ 2012), and RATT (Otto _et al_ 2011), transferring gene models from the taxonomically nearest published 'reference' genome from the list: _Haemonchus contortus_ for clade V parasites; _Ascaris suum_ for Ascaridomorpha; _Brugia malayi_ (and _Onchocerca volvulus_) for Spiruromorpha; _Trichuris muris_ for clade I; _Strongyloides ratti_ for clade IV; _Hymenolepis microstoma_ for cestodes, except _Echinococcus multilocularis_ for _Taenia_ species; _Schistosoma mansoni_ for trematodes. Third, species-specific ESTs and cDNAs from INSDC (Cochrane _et al_ 2016), and proteins from related species (see below), are aligned against the genome using BLASTN and BLASTX (Altschul _et al_ 1997), respectively, and these alignments are further refined with respect to splice sites using exonerate (Slater & Birney 2005). Last, the EST and protein homology alignments, comparative gene models, and _ab initio_ gene predictions are integrated and filtered by MAKER to produce a gene set for the species, with just one transcript for each gene.

The four-step MAKER pipeline is run three consecutive times (Figure 1). The first run is performed using the est2genome option with species-specific ESTs and cDNAs, and the protein2genome option with nematode protein sequences from UniProt's UniRef90 clusters (UniProt Consortium, 2015). For this first MAKER run, Augustus and SNAP are trained using CEGMA (Parra _et al_ 2007) gene models for KOGs, as well as 'nematode orthologous groups' (NOGs) (Martin & Mitreva 2018), 'trematode orthologous groups' (TROGs), or 'cestode orthologous groups' (CEOGs) as appropriate. Gene models obtained from the first MAKER run are used to train SNAP, and MAKER is run a second time, using the same nematode proteins as in the first run. Gene models from the second run are then used to train Augustus. Using the trained versions of SNAP and Augustus, MAKER is run a third time, using a taxonomically broader protein set that includes proteins from metazoans with complete proteomes from UniProt and proteins from helminths from GeneDB (Logan-Klumpler _et al_ 2012). The resulting MAKER gene set is then filtered to remove less reliable gene models.

This protocol can be used to generate a relatively accurate and complete gene set, for a broad range of nematode and platyhelminth species. Unlike gene-finding approaches that require extensive use of RNAseq data (for example, use of Augustus in the _Strongyloides_ project; Hunt _et al_ 2016), this protocol does not require RNAseq data, so it is suitable when RNAseq data are not available. Furthermore, unlike many gene-finding approaches, it does not require training using a curated set of genes of known structure (e.g. confirmed using cDNAs or RNAseq), as it uses conserved genes from KOGs/NOGs/TROGs/CEOGs for an initial round of training, and then its own MAKER gene models for later rounds of training.
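
The order of the three MAKER runs, and which gene models train SNAP and Augustus before each run, can be easier to follow when written out as data. The Python sketch below is only an illustrative summary of the schedule described above, not part of the pipeline itself: the names `MakerRun`, `build_schedule` and `describe` are hypothetical, and the entry showing Augustus still using its initial training during the second run is an inference from the text rather than something the protocol states explicitly.

```python
from dataclasses import dataclass
from typing import List

# Initial training set named in the protocol (conserved genes).
INITIAL_TRAINING = "conserved genes (CEGMA KOGs plus NOGs/TROGs/CEOGs)"


@dataclass(frozen=True)
class MakerRun:
    """One pass of the four-step MAKER pipeline (hypothetical record type)."""
    number: int
    snap_trained_on: str      # gene models used to train SNAP before this run
    augustus_trained_on: str  # gene models used to train Augustus before this run
    proteins: str             # protein evidence aligned during this run


def build_schedule() -> List[MakerRun]:
    """Return the three consecutive runs in the order described in the text."""
    nematode = "nematode proteins (UniRef90)"
    broad = "metazoan complete proteomes (UniProt) + helminth proteins (GeneDB)"
    return [
        # Run 1: both predictors start from the conserved-gene training set.
        MakerRun(1, INITIAL_TRAINING, INITIAL_TRAINING, nematode),
        # Run 2: SNAP is retrained on run 1 models; Augustus is assumed
        # (our inference) to keep its initial training for this run.
        MakerRun(2, "run 1 gene models", INITIAL_TRAINING, nematode),
        # Run 3: Augustus is retrained on run 2 models; broader protein set.
        MakerRun(3, "run 1 gene models", "run 2 gene models", broad),
    ]


def describe(schedule: List[MakerRun]) -> List[str]:
    """Format each run as a one-line, human-readable summary."""
    return [
        f"run {r.number}: SNAP <- {r.snap_trained_on}; "
        f"Augustus <- {r.augustus_trained_on}; proteins = {r.proteins}"
        for r in schedule
    ]


if __name__ == "__main__":
    for line in describe(build_schedule()):
        print(line)
```

Keeping the schedule as plain records makes it straightforward to check, for any run, which earlier gene set each predictor depends on before launching jobs on the cluster.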

Equipment
Computer cluster.

Troubleshooting
It is likely that in a highly fragmented genome assembly, many genes will be split into multiple partial gene predictions. To correct for the effect of assembly fragmentation on the gene count, the gene count in your assembly can be normalised by dividing the total proteome length (the summed length, in amino acids, of all predicted proteins) by the mean protein length for _C. elegans_ (409.82 amino acids). This should give a more accurate estimate of the true gene count in your species.
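
The normalisation above is a single division, and a minimal Python sketch of it follows. The function names and the small FASTA reader are illustrative assumptions, not part of the protocol; the reader also assumes that a trailing `*` on a sequence line marks a stop codon and should not be counted as an amino acid. Only the constant 409.82 comes from the text.

```python
from typing import Iterable, List

# Mean C. elegans protein length in amino acids, as given in the protocol.
C_ELEGANS_MEAN_PROTEIN_LENGTH = 409.82


def protein_lengths_from_fasta(lines: Iterable[str]) -> List[int]:
    """Return the length of every protein in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before first header
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            # Assumption: a trailing '*' is a stop-codon marker, not a residue.
            current += len(line.rstrip("*"))
    if current is not None:
        lengths.append(current)
    return lengths


def normalised_gene_count(
    lengths: Iterable[int],
    mean_length: float = C_ELEGANS_MEAN_PROTEIN_LENGTH,
) -> float:
    """Total proteome length divided by the reference mean protein length."""
    return sum(lengths) / mean_length


if __name__ == "__main__":
    # Toy proteome: two short fragments that may come from one split gene.
    fasta = [">frag1", "MKTLLV", "AAG*", ">frag2", "MSSPQ"]
    lengths = protein_lengths_from_fasta(fasta)
    print(lengths, normalised_gene_count(lengths))
```

For a real gene set, the lines of the predicted-protein FASTA file would be passed in place of the toy list, for example by iterating over an open file handle.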

Anticipated Results
The output from the protocol is a set of protein-coding gene predictions for one or more genome assemblies of interest.

References

Figures
Figure 1. Flowchart of protocol: four-step MAKER pipeline.