3’ RNA sequencing for robust and low-cost gene expression proling

3’seq-RNA Proling (3’SRP) approach is based on multiplexing samples and molecular indexing mRNA in order to drive genome-wide transcriptional proling at reasonable cost in comparison to standard RNA-sequencing. The protocol is performed according to the 3′-digital gene expression (3′-DGE) approach developed by the Broad institute. The libraries are prepared from small amounts of total RNA where the mRNA poly(A) tails are tagged with universal adapters, well-specic barcodes and unique molecular identiers (UMIs). We have improved the fragmentation step by implementing tagmentation based on the activity of a bead-linked transposome. This technique allows sample multiplexing on 96-well plates. Libraries are then sequenced using standard procedures, e.g. on Hiseq2500 or NovaSeq 6000 SP Flow Cells. We have developed a snakemake pipeline including every analysis step from raw fastq de-multiplexing to functional annotation of the differentially expressed genes, producing a complete HTML report for end-user.


Introduction
For the past decade, RNA sequencing has progressively become the gold-standard approach in genomewide transcriptional pro ling. Experimental design has to balance between su cient numbers of sample measurements and economical cost. Hence, compromising between the number of tested conditions and the number of replicates tends to limit statistical power. Increasing costs are mainly due to limited throughput in library preparation and higher read depth, the latter being required in particular for transcript reconstruction and allele-speci c expression analysis. In this context, improving the throughput of library preparation while optimizing sequencing depth would allow accurate analysis of large sample groups at reasonable cost.
We have implemented in our core facility the 3′-digital gene expression (3′-DGE) approach developed by the Broad institute 1,2 , based on multiplexing samples and molecular indexing of mRNA molecules. We have in particular improved the fragmentation step by implementing tagmentation based on the activity of a bead-linked transposome. This technique is named 3'SRP for 3' sequencing RNA Pro ling (Fig.1).
3'SRP relies on transcript barcoding with unique molecule identi ers (UMI). As reads reporting the same gene and UMI are ltered, this strategy removes the duplicate reads, which result from multiple counts of the same DNA fragment after the PCR ampli cation step. Another advantage of 3'SRP is that only 5 million reads are required for transcriptome pro ling, while full-length RNA sequencing (RNAseq) usually requires 50 million reads.
This technique allows sample multiplexing on 96-well plates. Libraries are then sequenced using standard procedures, e.g. on Hiseq2500 or NovaSeq 6000 SP Flow Cells. High multiplexing considerably reduces the cost of sequencing per sample.
Regarding data analysis, standard RNAseq tools can be used with some adjustments related to UMI counting. To facilitate data processing in service-oriented settings, we have modi ed the de-multiplexing step to obtain one fastq and one bam le per sample. This allows us to handle several projects on the same run. This le splitting step also reduces the computation time by using the parallelization power of a HPC cluster. We have also adapted the code to allow for large projects split over several runs. We have developed a snakemake 8 pipeline including every analysis step from raw fastq de-multiplexing to functional annotation of the differentially expressed genes, producing a complete HTML report for enduser.  A crucial point is the speci cation of the number of biological replicates to ensure su cient statistical power. The optimal number of replicates depends mainly on the magnitude of differential expression (effect size) and the complexity of experimental design (number of covariates). However, as described in Alpern et al. article 9 , shifting from 20 to 5 replicates greatly reduces the ability to detect true differentially expressed genes (DEGs) (decrease statistical power). The low cost per sample is a great advantage for increasing the number of samples in an analysis and allows a high performance to detect DEGs.

Reagents
In order to avoid batch effects, RNA extraction series have to be randomized according to biological conditions. A randomization of the samples on micro-well plates is also necessary to avoid rows/columns batch effects; we have developed an algorithm to randomize samples.

Sample quali cation
Libraries are prepared from 10ng of total RNA per sample, diluted in 4µl of RNase free water. RNA concentration and purity are measured on NanoDrop station (ThermoFisher scienti c); OD 260/230 ratios should be around 1.8 -2.0 to ensure that concentration is properly assessed. Quality of RNA samples is controlled on 2100 Bioanalyzer system (Agilent) with RNA 6000 Nano Kit (ref 5067-1511, Agilent).
The nal concentration must be greater than 2nM in 10µl (for HiSeq sequencing) or greater than 15nM in 10µl (for NovaSeq sequencing) to be sequenced. Raw fastq pairs used for analysis matched the following criteria: -the 16 bases of the rst read (forward reads) correspond to 6 bases for a designed well-speci c barcode and 10 bases for a unique molecular identi er (UMI).
-the second read (reverse reads), 104 bases for NovaSeq (58 bases for HiSeq runs), corresponds to the captured poly(A) RNAs sequence.
Bioinformatics analysis (Fig.1) can be divided in three parts: 1) primary analysis which includes demultiplexing, alignment and counting steps; 2) secondary analysis that performs the differential analysis; 3) tertiary analysis which is focused on the functional annotation and the representation of differentially expressed genes.

Primary Analysis
The primary analysis (Fig.3) consist in generating the expression matrix containing the raw counts for each sample and for each gene, from raw sequencing data les.
-Illumina basecall les are transformed into paired-end fastq les with Illumina bcl2fastq converter.
-Samples are demultiplexed according to their respective barcodes described in a sample sheet to create a single-end fastq le per sample.
-A step of read trimming is performed in order to remove any poly A tail, which can occur when the transposase cuts a DNA fragment in close proximity to the poly A tail. By using cutadapt on these fragments, all the bases following the poly A tails (reverse sequences of the UMI, sample barcode and illumina adapter) are also removed.
-The cDNA sequences from each sample are then aligned on the reference transcriptome and the mitochondrial genomic reference sequence with BWA 5 aligner. Since a reference transcriptome is used, there is no need for RNA aligner (such as STAR or bowtie).
-Expression pro les are generated by parsing the alignment les (bam) then counting the number of unique UMIs associated with each gene for each sample.
-Reads are removed if one of the following cases occur: * the UMI contains 'N's in its sequence.
* there are more than 3 mismatches in the alignment.
* the sequence aligns with the same score on transcripts of more that one gene.
-Finally, the raw expression matrix is produced, which contains the values of abundance for every gene in all samples.

Secondary Analysis
The secondary analysis consists in identifying the differentially expressed genes between conditions from the raw expression matrix.
The pipeline is mainly written in the R programming language.
-Samples and genes are ltered out of the raw expression matrix according to the parameters speci ed in the con guration le: * Samples are removed if they have less than --minReads reads assigned (default 200k) or if they have less than --minGenes genes detected (i.e. number of genes with at least 1 count) (default 5k).
* Genes are removed if not detected (counts = 0) in at least the number of samples forming the smallest condition.
-Counts are then normalized by DESeq2 6 method.
-Differentially expressed genes are found by DESeq2. By default, genes with an adjusted p-value < 0.05 and a log2 fold-change > 0.58 (i.e. fold-change > 1.5) are de ned as differentially expressed.

Tertiary Analysis
Using ClusterPro ler 10 and StringDB 8 tools provides some hints on the interpretation of the differentially expressed genes.
-Enrichment tests are performed on Gene Ontology annotations and KEGG pathways.
-GSEA analysis -StringDB networks are shown if the number of differentially expressed genes is below 50.

Running the 3'SRP pipeline
The pipeline is entirely implemented as a Snakemake work ow. Scripts used in the pipeline are mainly written in the python and R programming language.

Requirements
Installing git and conda are mandatory before running the pipeline.

Input les
The pipeline takes as input a samplesheet describing samples, one or multiple pairs of fastq les and eventually a le listing the comparisons to test.
--The samplesheet : A samplesheet describing the samples is necessary to run the pipeline. This le has to be tab delimited (without header) containing six columns: well, index, name, project, condition, species.
--The fastq les : The The script used to create the con guration le is make_srp_con g.py in the SCRIPTS folder. The help can be visualized with: ----$ python SCRIPTS/make_srp_con g.py -h ---- The mandatory options are -s for the samplesheet, either -f for a le listing the fastq paths or -i for an illumina directory and -r for the directory containing the genome les for the different assemblies.
It is recommended to specify a working directory where the les will be output with option -w to keep the srp-pipeline git clone clean. The program outputs a json object on stdout which can be redirected to a le.
Test the launch with a dry run: ----$ snakemake --con g conf="con g.json" -rpn ----All rules and commands that will be run should be displayed. Otherwise errors are returned.
Note on samplesheet: -The le should be tab delimited without trailing or leading spaces.
-Only characters a-zA-Z-_ are allowed.
-There should be no empty lines.
-There should be no header.
Note on fastq les list: -The le should be tab delimited without trailing or leading spaces.
-Fastq les listed can be .fastq or .fastq.gz -There should be no empty lines.
-There should be no header.
Note on comparison le: -The project name and conditions must match one or more samples in the samplesheet.
-There should be no empty lines.
-There should be no header.
-Same condition can be speci ed in column 2 and 3 to perform only the rst part of the secondary analysis (all but comparisons). The project has to appear only once in the le.
-The rst condition column is the test and the second is the control.
Note on the conda environment: -If the environment for this pipeline has already been created the environment for this pipeline, just "conda activate srp".
Note on snakemake launch: -Specify the number of jobs with -j <N>.
-Even if multiple jobs are not speci ed, two scripts in the pipeline are still threaded. .Thus it is advised not to run the pipeline on a real dataset without an HPC.
-The path to the log output les must exist ($ mkdir ./logs). As demonstrated in the article by Xiong et al. 3 there is good concordance of outcomes in terms of differential expressed genes and biological interpretation between these two methods, 3'SRP being slightly less sensitive on gene detection (15%). However, this technique shows a good balance between bene ts in costs and shortcomings.
We have carried out more than 60 projects on different types of samples. These projects represent more than 6000 samples analyzed. Different types of samples were tested, with RNA extracted from cell culture, primary cells, cell sorting, or biopsies. On average, about 300 million reads are generated by a Hiseq rapid run multiplexing 96 samples and about 1 billion on a SP NovaSeq run multiplexing 2*96 samples. From a total of 27k human genes referenced in RefSeq, a maximum of about 16k expressed genes are detected per sample. This plateau is reached with 5 million reads (Fig.4). Our experience based on these projects has allowed us to optimise library preparation and improve data outputs. Each aspect is detailed below.
Library preparation -Size library quali cation: The expected library size recommended by Illumina is between 350 and 800 pb. In this protocol, we recommend a library size greater than 400 pb to avoid sequencing the polyA tails. Nevertheless, a trimming of these polyA tails is applied in the primary analysis.
The common size obtained is ~550-650 pb.
-Quantity of library injected: The quantity of library injected is ~9-10 pM on HiSeq 2500 and ~380 pM on NovaSeq 6000. However, this quantity may vary depending of the sequencer used.

Bioinformatics analysis
The snakemake pipeline is part of our registry and is based on FAIR practices, taking advantage of virtual environments (conda, docker) and continuous integration (jenkins). These best practices allow for pipeline deployment in multiple environments for reproducibility and scalability issues.
The gitlab repository provides a short example dataset, which is quick and easy to use. This small dataset mimes 3 conditions with 2 replicates each aligned on hg19 reference.
More details are available on the gitlab wiki.
Output les & folders: The main output folder corresponds to the -w speci ed for the creation of the con guration le or to a default folder named "RESULTS". One folder per project speci ed in the samplesheet is created.
For each analysis step, a folder with the name of the tool or stage is created containing corresponding result les.
The CUTADAPT folder contains the single-end demultiplexed fastq les (one per sample).
The FASTQC and MULTIQC folders hold the results les of these tools applied on the fastq les.
The ALIGNMENT folder contains the aligned reads on the reference transcriptome bam les, as well as their indexes bai les.
The EXPRESSION folder contains the result les of the counting step. A naming convention, in four parts, has been de ned for the different matrices in this folder.
Pre x name is the name of the project. The second part is either "all" (multiple alignments) or "unq" ( align on one gene). The third part of the name de nes the information to be found i.e counts matrices, summary... The fourth part of the name, only found in count matrices les can either be "total" or "umi": -"total" is the number of total reads of the counts without taking into account the UMIs. Counts can be arti cially increased during PCR ampli cation.
-"umi" is the number of unique molecules of RNA, this taking into account the UMIs for each read. If two reads of the same sample map to the same gene and have the same UMI, it will only be counted for 1.
The DE folder contains the result les of the secondary analysis (i.e. the differential expression analysis) as well as the tertiary analysis (i.e. the functional annotation analysis). The folder is composed of the following les: -ltered, normalized and log-transformed matrices.

Report
The pipeline generates a HTML report using JavaScript/jinja/python technologies. This report, intended for end-users, displays project summary, raw and processed data quality controls and differential analysis results (PCA, differential expressed genes, sample clustering...) The rst part of the report presents the primary analysis: Samples and QC summary, visualization of samples variability. Graphs display the number of total sequences, the read distribution and the number of gene detected (Fig.5A). The HTML report embeds the MultiQC 4 report. All FastQC reports are summarized in one report, containing quality scores, GC and adapter content, overrepresented sequences. The samples variability is shown by a reduction dimension graph (PCA) and a sample clustering ( Fig.5B,C).