Genomic and transcriptomic data production for helminths

Protocols for genomic and transcriptomic data production, assembly and quality control.


Introduction
A multi-step protocol for preparation of draft genomes, specifically applicable to Nematoda. This includes initial library preparation, genome and transcriptome assembly, assembly QC/contamination screening and gene prediction. Both 454 and Illumina sequencing platforms are included. DNA is sheared into 3kb fragments, blunt ended and ligated to the SOLiD Mate-Pair Cap Adapter (ABI).

2.
Ligated DNA is size fractionated to 2-4 kb fractions and then purified.

3.
Circularization reactions are set up using 1 μg of the extracted fraction and 1.

2.
The ligated DNA is size-fractionated and the 6.5-10 kb fraction is purified using the Qiagen Gel Extraction Kit.
3. 300 ng of size selected DNA is circularized using 10U of Cre Recombinase. Linear (or non-circularized and nicked) library fragments are removed.

4.
The circularized library fragments are fragmented targeting a mean insert size of 300 bp.

5.
The fragmented DNA is blunt ended using the DNA Terminator End Repair Kit (Lucigen), processed through an adenylylation reaction (NEB) and Illumina's TruSeq adaptors are ligated.

7.
The final 300-500 bp library fragments are selected with a dual SPRI reaction.
Genomes sequenced on the Roche/454 platform are assembled from a combination of fragment reads, 3 kb paired-end reads and 8 kb paired-end reads generated to meet the coverage criteria of 15x, 15x and 3x respectively, with a target of 30x coverage for the final assembly. Genomes sequenced on the Illumina platform had overlapping fragment reads, 3 kb and 8 kb paired-end reads and are sequenced to a depth of 45x, 45x, and 10x, respectively.

B Genome assembly
Assemblies are generated using the assembly workflows outlined in Fig

1.
A combination of 3kb, 8kb and fragment 454 reads are subject to adapter removal, quality trimming and length filtering using a combination of the Flexbar 1 and Trimmomatic 2 tools (Fig. 1).

2.
Contaminant screening is done using the Bowtie2 9 aligner and a local contaminant database containing ribosomal RNA, bacteria and host sequence.

3.
Cleaned reads are then assembled using the Newbler assembler 3 before being scaffolded with an in-house tool CIGA which links contigs based on cDNA evidence.

4.
The resulting assembly is improved using another local tool named Pygap that uses Illumina short paired end sequences to help fill gaps between scaffolded contigs.

5.
Finally the L_RNA_scaffolde 4 used 454 cDNA data to further improve scaffolding.

Illumina data
Assemblies constructed from 3kb, 8kb and fragment Illumina sequences followed this methodology ( Fig. 1 panel m2) 1. 3kb, 8kb and fragment Illumina sequences are subject to the adapter removal, quality trimming, length filtering and contamination screening process described for the 454 data above.

2.
The cleaned reads are then assembled using the AllPaths-LG assembler 5 before being improved using the Pygap and L_RNA_scaffolder tools as described above. Reference guided, assisted assembly Finally we have a protocol for a reference guided, assisted assembly approach (Fig. 1 panel m3) 1.
Illumina 3kb paired end sequence data are subject to the same 'cleaning' procedure described above and is then mapped against the reference assembly using the bwa aligner 6 with default parameters.

2.
Samtools mpileup is run on the alignment along with vcfutils.pl varFilter using suggested argument settings to identify the differences between the new input reads and the reference backbone.

3.
We then subtract out the reference from the mapped region by omitting all SNP loci where the alternate allele frequency is 1.

4.
We then used FastaAlternateReferenceMaker method of GATK ( 6. BLAT 8 is then used to compare the contigs created by Velvet to the contigs created by alignment to the reference and all Velvet contigs greater than 500 bp that mapped less than 50% of their length (and at >80% identity) to an existing contig are added to the assembly.
C Assembly QC / Contamination screening All assemblies are screened, to remove for contamination, before annotation.

1.
Adaptor sequencs and contaminants are identified by compared contigs to a database of vectors, bacterial and host contaminants using Megablast.

2.
High-scoring segment pairs (HSPs) with E-value <0.01 and length >100 bp are picked. The final alignment length is from the first base of the first HSP to the last base of the last HSP.

3.
The contig is removed if the identity is greater than 75% and the coverage is greater than 40% of the contig, or the contig is less than 2000 bp.

4.
Any contigs which are on the border of the requirements and longer in length are manually reviewed as an extra measure against true genome contigs being removed.

D Transcriptome sequencing and assembly
Assembled RNAseq ata are used alongside EST data in the MAKER stage of gene prediction.

1.
Transcriptome (RNAseq) libraries are generated with the Illumina TrueSeq stranded protocol following the manufacturer's guidelines.

2.
Raw reads are cleaned using an in-house tool that trims adaptor, quality trims and applies a length filter using Flexbar 1 and Trimmomatic 2 .

4.
The cleaned, filtered RNAseq reads are assembled de novo with Trinity 11 , using both left and right cleaned paired reads.

5.
The output is filtered for the longest representative open reading frame, resulting in a 'best candidates' file.

6.
Transcripts are merged using CD-HIT 12 with 98% coverage and identity.

7.
The assembled contigs are assessed for quality by aligning (with TopHat2 10 ) back to reference assembly to establish the percentage of reference aligned to by the reads and the percentage of reads that aligned to the reference.

E Gene prediction
Gene prediction is run on assemblies as follows: 1.

3.
Repeats and predicted RNAs are then masked using RepeatMasker.

2.
These decision-making steps are followed to define the set of high confidence genes: a) Genes are screened for overlaps (<10% overlap is allowed). f) If no hit is recorded the gene is retained if it had ≥ 55% identity to the genes database from KEGG 21 , and and a bitscore of ≥35.

Additional curation of gene sets
Depending on the nature of the final gene set in relation to the assembly quality some gene sets underwent an additional manual review of short genes lacking definitive evidence. After the high confidence gene selection steps described above, shorter single and double exon genes and genes annotated as hypothetical (with no KEGG nor InterPro homologies) are further scrutinized. A manual review of the Annotation Edit Distance (AED, from MAKER) is considered in combination with the QI scores (all provided by MAKER), enabling analysts to make a more informed decision about whether to keep or discard each such gene.
Anticipated Results raw sequence data, genomic and/or transcriptomic assembly and a high confidence gene set.

Figure 1
The McDonnell Genome Institute Genome assembly pipelines.
12 Figure 2 The McDonnell Genome Institute Gene-finding pipeline.