Zebrafish retinal mRNA RNA-seq data processing

This protocol details the step-by-step procedures followed to process zebrafish retinal mRNA sequencing data generated by the in the


1.
Raw sequencing data from mRNA SMARTSeq2 libraries (100bp paired-end [PE]) was converted from bcl to FASTQ format in BaseSpace per sample.

2.
All FASTQ reads were assessed for quality control using FASTQC and FastQ Screen. FastQ Screen is highly recommended as it assesses the library for sequence origin, ensuring the data match expectations.

3.
Sequencing adapters (Illumina) and low quality read bases were trimmed using Trim Galore, removing reads with a quality Phred score under 6.
Quality of mapping was carried out, summarizing the number of mapped, multimapped, unmapped, as well as mapped to exons, intergenic, or intragenic regions using RNA-SeQC/picard metrics/Qualimap. 10. Summary reports of all metrics were created using MultiQC (v1.8) 5 .
11. The R computing environment (>3.5) and Bioconductor packages, DESeq2 (v1.28.0) was used for statistical modelling of the count data, carrying out pairwise comparisons of optic fissure (OF) tissue and dorsal retina (DR) tissue at each specific developmental time point using WALD testing to generate p values. Metadata of the samples should include; sample id, condition, timepoint. Per sample count data was imported using tximport package, creating a single count dataframe. Low counts were filtered and removed (basemean <10) before modelling. Pairwise comparisons were carried out using specific contrasts called in the DESeq2 command. This ensured the complete dataset was preprocessed and consistent for all comparisons. For the time course analysis, DESeq2 was used with a likelihood-ratio test (LRT) to generate significance values on the complete data set using the interaction of time with tissue origin factors (time:tissue) .
12. The data was explored using several graphical plots using ggplot2 in R 11 , including the MA plot -this shows the log2 fold change, per feature, plotted against the mean of normalized counts, for all the samples. An overview of the level of similarity was created using a sample-to-sample distance heatmap, using hierarchical clustering. Principle component analysis (PCA plot) was used to show the samples in a 2D space, separated by the first two principle components, key to identification of outliers and batch effects. All plots can be saved as resolution independent SVG images. 13. Differentially expressed genes (DEGs) were defined in this dataset as those having an absolute log2 fold change >= 1 and an FDR <= 0.05.
14. (optional) Genes can be further annotated using the biomaRt package 12 ; providing common gene name, chromosomal location, biotype and strand specificity as well as associated protein identifiers and homologues in a number of species. 15. Gene ontology over representation analysis was carried out using GOseq, which has the benefit of considering any length bias in the data. Alternative tools include GOrilla, DAVID and Enrichr (accessible through web interfaces and APIs).
16. Hierarchical clustering of the identified DEGs can be used to identify clades/groups of differentially expressed genes, either up or down regulated in samples represented together colour coded in the heatmap output. The rows of the heatmap (representing an individual gene/transcript) are reordered according to the clustering result, putting similar observations close to one another. Clustering used the rlog transformed assay data from the DESeq2 object.
17. For inter sample comparisons of specific gene(s) expression, transcripts per million (TPM) values were calculated 13 and appropriate plots created.