The Systematic Analysis of lncRNA Expression in Ebola-infected Human Macrophages


Ebola virus is a dangerous pathogen that causes severe headache, muscle pain, hemorrhagic fever and multi-organ failure. The mechanism of Ebola-human interaction at the molecular level remains largely unexplored. Recent research has shown that long non-coding RNAs (lncRNAs) are important in viral pathogenesis. This study carried out a comprehensive identification of lncRNAs in monocyte-derived macrophages (MDMs) treated with Ebola virus (EBOV), Reston virus (RESTV) or LPS. The number of known lncRNAs identified was 555 in EBOV, 476 in RESTV and 550 in LPS, and the number of novel lncRNAs was 142 in EBOV, 127 in RESTV and 136 in LPS. Differentially expressed lncRNAs and the neighbourhood genes within about 100 kb of each lncRNA were also identified for analysis. The results of the study may help to further understand the immune response.


Introduction
Viral infections are a deadly threat to the worldwide community. Although many viral infections can be treated with advanced therapies, some viral diseases still lack an effective treatment. Ebola virus belongs to this group, with no effective and adequate treatment. It causes a deadly infection, Ebola Virus Disease (EVD), which is fatal in many patients. A major outbreak of Ebola was reported in 2013-2016 [1].
Zaire ebolavirus (EBOV) and Reston ebolavirus (RESTV) differ in their pathogenesis although they belong to the same filovirus family. EBOV causes devastating effects in humans, such as dysregulation of the immune response and induction of a cytokine storm. RESTV is considered nonpathogenic because no cases of disease in humans have been reported. The differences in pathogenicity between EBOV and RESTV are not thoroughly understood. The current study uses an RNA-Seq dataset (PRJNA328248) to identify known, novel and differentially expressed lncRNAs. In this dataset, primary human monocyte-derived macrophages (MDMs) were treated with EBOV or RESTV, with lipopolysaccharide (LPS) as a control [2], [3], to understand the host immune response.
Long non-coding RNAs (lncRNAs) are regulatory molecules that control various biological processes. However, the functions and characteristics of the identified lncRNAs remain largely unexplored. LncRNAs are transcripts longer than 200 nt that lack coding potential and are ubiquitously expressed in mammalian systems. Like coding mRNAs, lncRNAs undergo post-transcriptional modifications such as capping and polyadenylation. LncRNAs are important regulators of various biological processes in cells and organ systems, such as regulating protein complexes and trafficking genes and chromosomes to their specific locations. Many reports have shown that cellular lncRNAs that are differentially expressed in virus-infected cells are involved in the immune response [4]-[7], and some of them favour viral replication by inhibiting the immune response [8].
LncRNAs are classified into different categories based on their genomic location and proximity to the respective protein-coding genes, including exon sense-overlapping, intron sense-overlapping, bidirectional and intergenic lncRNAs [9]. Recent findings indicate that lncRNAs are novel players in the antiviral immune response [10]. Thus, the study aims to i) identify the lncRNAs, ii) differentiate the known and novel lncRNAs, and iii) identify the differentially expressed lncRNAs together with their neighborhood genes within a distance of 100 kb.
RNA sequencing has advantages over microarray systems for capturing RNA expression, such as novel transcript identification and the analysis of allele-specific expression and splice junctions. This study analyzed Ebola-infected human macrophages to identify known and novel lncRNAs and differentially expressed lncRNAs. Studies of many animal viruses, such as Epstein-Barr virus [11], herpes virus [12], Marek's disease virus [13], severe acute respiratory syndrome coronavirus (SARS-CoV) [4] and human immunodeficiency virus [14], have revealed that cellular lncRNAs are expressed during infection and favour virus replication.

Data Collection and Pre-processing
Publicly available data from the Sequence Read Archive database with the accession number PRJNA328248 were used for the study. The fastq files were downloaded directly from the European Nucleotide Archive browser (https://www.ebi.ac.uk/ena/browser/home), and the quality of the reads was checked using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) [15]. Reads were aligned with HISAT2, a splice-aware aligner based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index (http://www.ccb.jhu.edu/software/hisat/) [16]. The output of HISAT2 (SAM format) was converted into a BAM file. The sorted BAM file served as the input for transcript assembly using StringTie, which produces accurate and complete reconstructions of genes and estimates the expression levels of the transcripts [17].

Identification of Novel lncRNA
The merged assembled transcripts from StringTie were used to identify novel lncRNAs. First, transcripts longer than 200 nucleotides with strand information were selected, and this filtered subset was compared with the hg38 annotation file using Gffcompare. Transcripts with the class codes "i", "u" and "x", which represent non-coding regions, were retained for subsequent steps [18]. Transcripts with an Open Reading Frame (ORF) identified by TransDecoder were discarded. The remaining transcripts were checked for coding potential using CPAT (Coding Potential Assessment Tool) and PLEK (Predictor of Long Non-coding RNAs and messenger RNAs based on an improved k-mer scheme, https://sourceforge.net/projects/plek/files/) [19]. CPAT is an alignment-free logistic regression model for identifying the non-coding regions of transcripts [20]. It performs better in terms of sensitivity, specificity and accuracy than other non-coding region prediction tools such as CPC and PhyloCSF. Transcripts with a score of less than zero (predicted non-coding) were retained for further analysis. The transcripts were then processed using standalone BLASTX against the Swiss-Prot database to check for false positives. Transcripts with an alignment E-value > 10^-5 were removed. The outputs of BLASTX were subjected to BLASTN against the LNCipedia and NONCODE databases to obtain the novel lncRNAs.
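The first filtering step above (length, strand and Gffcompare class code) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the `Transcript` record and the function names are hypothetical, and the transcripts are assumed to have been parsed already from the Gffcompare output.

```python
from dataclasses import dataclass

# Class codes retained in the text as representing non-coding regions.
NONCODING_CLASS_CODES = {"i", "u", "x"}
MIN_LENGTH_NT = 200  # transcripts must be longer than this


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one assembled transcript."""
    transcript_id: str
    length: int       # transcript length in nucleotides
    strand: str       # "+", "-" or "." (unknown strand)
    class_code: str   # Gffcompare class code


def is_candidate_lncrna(t: Transcript) -> bool:
    """Keep transcripts > 200 nt, with known strand and class code i/u/x."""
    return (
        t.length > MIN_LENGTH_NT
        and t.strand in {"+", "-"}
        and t.class_code in NONCODING_CLASS_CODES
    )


def filter_candidates(transcripts):
    """Return the transcripts that pass the initial lncRNA filter."""
    return [t for t in transcripts if is_candidate_lncrna(t)]
```

The coding-potential and BLAST steps would then run on the output of `filter_candidates`; they are left out here because they depend on external tools.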

GC Content Analysis
EMBOSS geecee is an online tool used to calculate the G+C content of nucleic acid sequence(s). It sums the number of G and C bases and reports the result to a file as fractions in the interval 0.0 to 1.0 [21].
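The calculation itself is simple enough to show directly. The sketch below is not the geecee implementation; it assumes the fraction is taken over the full sequence length and that input is case-insensitive, and the function name `gc_fraction` is illustrative.

```python
def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence.

    Assumption: the denominator is the full sequence length, so any
    ambiguous bases (e.g. N) count towards the length but not the G+C sum.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq)
```

For example, `gc_fraction("ATGC")` gives 0.5, since two of the four bases are G or C.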

Differential Gene Expression Analyses
The BAM files were used as input to the Subread package to generate the gene expression count matrix [22]. The count matrix was processed with DESeq2 to identify the differentially expressed genes of EBOV, RESTV and LPS treated cells [23]. Genes meeting a log fold change (logFC) threshold of ±1.5 with an adjusted p-value < 0.05 were considered significant.
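The significance rule can be written as a small predicate applied to each row of the DESeq2 results. The sketch below rests on assumptions not stated in the text: the fold-change cutoff is treated as inclusive (|logFC| >= 1.5), and rows with a missing adjusted p-value are treated as not significant. The function name and defaults are illustrative.

```python
import math
from typing import Optional


def is_significant(log_fc: float,
                   padj: Optional[float],
                   lfc_cutoff: float = 1.5,
                   padj_cutoff: float = 0.05) -> bool:
    """Apply the logFC and adjusted p-value thresholds to one result row.

    Assumptions: the fold-change cutoff is inclusive, and a missing
    adjusted p-value (None or NaN) means the row is not significant.
    """
    if padj is None or math.isnan(padj):
        return False
    return abs(log_fc) >= lfc_cutoff and padj < padj_cutoff
```

Both up-regulated (logFC >= 1.5) and down-regulated (logFC <= -1.5) transcripts pass the same test because the absolute value is compared.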

DE-lncRNA target prediction and functional annotations
To better understand the functional role of the DE-lncRNAs, nearby genes within 100 kb upstream and downstream of each DE-lncRNA were identified for further investigation. The nearby genes were extracted using BEDOPS [24] and BEDTools [25].
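The neighbourhood search amounts to an interval-overlap query on a window extended by 100 kb on each side of the lncRNA. The sketch below shows the logic in plain Python rather than with BEDOPS or BEDTools; it assumes BED-style half-open coordinates (0-based start, exclusive end) and ignores strand, and the `Interval` type and `neighbours` function are hypothetical names.

```python
from typing import List, NamedTuple

WINDOW_BP = 100_000  # 100 kb upstream and downstream


class Interval(NamedTuple):
    """Hypothetical BED-style feature: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int
    name: str


def neighbours(lncrna: Interval,
               genes: List[Interval],
               window: int = WINDOW_BP) -> List[str]:
    """Return names of genes overlapping the lncRNA extended by `window` bp."""
    lo = max(0, lncrna.start - window)  # do not run off the chromosome start
    hi = lncrna.end + window
    return [
        g.name
        for g in genes
        if g.chrom == lncrna.chrom and g.start < hi and g.end > lo
    ]
```

A gene is reported if any part of it falls inside the extended window on the same chromosome; a gene that only touches the window boundary is not counted under half-open coordinates.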

Gene Enrichment Analysis
Gene enrichment analysis helps to identify genes and proteins of interest among those generated by high-throughput studies. WebGestalt (WEB-based GEne SeT AnaLysis Toolkit) is a widely used tool for gene enrichment analysis. Significant terms were selected using a p-value < 0.05.

Data Collection and Quality Checking
The publicly available dataset PRJNA328248 was downloaded from NCBI SRA. The quality of the reads was checked with FastQC, and reads with a Phred score above 20 were retained for the analysis.

Identification of Novel lncRNA
The protocol to identify the lncRNAs was designed based on previous studies [26]-[30]. After quality checking, each sample was mapped to the human reference genome hg38 using HISAT2. Samples with an overall alignment rate greater than 75 per cent were taken forward, as listed in Supplementary file S1. The HISAT2 alignment files of each sample were then processed for transcript assembly using StringTie. Subsequently, Gffcompare was used to identify the non-coding transcripts, and lncRNAs were identified using TransDecoder, CPAT, BLASTX and BLASTN. LncRNAs without strand information were removed. Previous research shows that lncRNAs regulate coding genes located nearby in the upstream and downstream regions [26], [27]. The current study adopted the pipeline suggested in that research and identified 555 known and 142 novel lncRNAs in Ebola.

GC Content Analysis
The GC content of the lncRNAs was calculated with the EMBOSS geecee tool. The GC content of novel lncRNAs ranges from 0.31 to 0.78 (Fig 3) and that of known lncRNAs from 0.29 to 0.8 (Fig 4).

Differential Expression Analysis of lncRNA
The gene expression count matrix file was generated using the Subread package. Differentially expressed lncRNAs were identified using the R package DESeq2 for EBOV, LPS and RESTV. The differentially expressed transcripts are shown in Fig 5 and Table 1. The results showed that EBOV and LPS induced the immune response more strongly than RESTV. The counts of differentially expressed lncRNA transcripts are given in Supplementary file S8 and Table 1.

Gene Enrichment Analysis
The gene enrichment analysis of the neighborhood genes of the identified DE-lncRNAs was performed in WebGestalt. The significant GO terms among the highly enriched terms for EBOV, LPS and RESTV cells are listed in Table 2 and Supplementary file S9.

Discussion
LncRNA research has become a fascinating field within biological research areas such as cancer, genetic disorders and infectious diseases. High-throughput sequencing and bioinformatics methods make it possible for researchers to uncover the functions and characteristics of lncRNAs in many species. LncRNAs play an important role in regulating coding genes located about 10 to 100 kb upstream and downstream [31], [32]. They also play a vital role in the regulation and control of multiple biological processes.
Different classes of lncRNAs have been reported to induce cytokine production during viral infections. The literature indicates that lncRNAs are involved in host-virus interactions, for example by activating pathogen recognition receptors, through epigenetic modulation, and by controlling transcriptional and post-transcriptional processes [33].
In the current study, whole-transcriptome analysis of EBOV, LPS and RESTV treated MDM cells was performed. High-throughput techniques combined with a bioinformatics approach allow scientists to uncover the roles and characteristics of lncRNAs. The known and novel lncRNAs expressed in EBOV, LPS and RESTV cells were identified. Cellular lncRNAs that are actively expressed during viral infections may help to promote virus replication and suppress the antiviral immune response [34]. In total, 1581 known and 405 novel lncRNAs were identified. Further characterization of the identified novel lncRNAs may reveal their functional role in the activation of the immune response.
In total, 1278 DE-lncRNAs were identified in EBOV, LPS and RESTV. Some of these lncRNAs overlapped between different time points and may play an important role in the immune response. The neighborhood genes of each DE-lncRNA within 100 kb upstream and downstream were identified. Reported research on other viruses, covering the enriched gene ontology terms of nearby genes and their associated functions, supports the findings of the current study of Ebola.
The important enrichment terms for EBOV: i) GO:0061676: importin-α proteins are hijacked by the Ebola VP24 protein to block STAT-mediated IFN-α/β and IFN-γ synthesis. Further, importin α7 is involved in the formation of inclusion bodies and is potentially involved in pathogenesis (37)(38)(39). ii) GO:0032036, myosin heavy chain binding. EBOV is taken up by macropinocytosis, which is initiated by an external stimulus that activates receptor tyrosine kinases. Several regulators are involved in this process, for example Arp2/3 and myosin [35]. iii) GO:0046875, ephrin receptor binding, which is known to be involved in cell-to-cell interactions [36].
LPS enriched terms: i) GO:0005229, intracellular calcium-activated chloride channel activity. LPS from many bacteria has been reported to induce calcium and chloride signalling. ii) GO:0001614, purinergic nucleotide receptor activity. LPS causes dysregulated ATP release that interferes with the autocrine purinergic signalling mechanism important for the antimicrobial host defence process [37].
RESTV enriched terms: i) GO:0015038: glutathione disulfide proteins are involved in the immune and inflammatory responses to infection [38]. ii) Oxidoreductase activity, which has been reported in many viral infections [39]. iii) GO:0008276, protein methyltransferase activity, which is essential for epigenetic regulation such as the methylation of histone and non-histone proteins [40]. The identified lncRNAs may regulate the genes behind these enrichment terms. Further studies are required to understand the lncRNA regulation of the neighborhood genes.

Conclusion
Research on lncRNA expression and characterization in Ebola virus infection is limited, and the topic needs further investigation. The study identified the novel and known lncRNAs in Ebola, LPS and RESTV treated MDM cells. The results suggest that RESTV lacks immune activation in comparison with EBOV and LPS. The DE-lncRNAs are reported along with their neighbourhood genes within 100 kb upstream and downstream. The information from this research helps to understand the function of the immune responses. Future directions are to study the functional and structural characteristics of the novel lncRNAs. Further investigation is required to understand the role of the DE-lncRNAs and their neighbourhood genes.

Figure 1
The workflow: (i) lncRNA identification; (ii) differentially expressed lncRNAs; (iii) known and novel lncRNAs.

Figure 2
Identified known and novel lncRNA counts in EBOV, LPS and RESTV treated cells

The bar chart depicts differentially expressed lncRNA transcripts in EBOV, LPS and RESTV treated cells at 6, 24 and 48 hrs.