Characterizing The Landscape Of Viral Expression In Cancer By Deep Learning

About 15% of human cancer cases are attributed to viral infections. To date, virus expression in tumor tissues has been mostly studied by aligning tumor RNA sequencing reads to databases of known viruses. To allow identification of divergent viruses and rapid characterization of the tumor virome, we developed viRNAtrap, an alignment-free pipeline to identify viral reads and assemble viral contigs. We apply viRNAtrap, which is based on a deep learning model trained to discriminate viral RNAseq reads, to 14 cancer types from The Cancer Genome Atlas (TCGA). We find that expression of exogenous cancer viruses is associated with better overall survival. In contrast, expression of human endogenous viruses is associated with worse overall survival. Using viRNAtrap, we uncover expression of unexpected and divergent viruses that have not previously been implicated in cancer. The viRNAtrap pipeline provides a way forward to study viral infections associated with different clinical conditions.


I. INTRODUCTION
Viral infections have a causal role in approximately 15% of all cancer cases worldwide [1]. Viruses linked to cancer are generally divided into direct carcinogens, which drive oncogenic transformation through viral oncogene expression, and indirect carcinogens, which may lead to cancer through mutagenesis associated with infection and inflammation. To date, seven viruses have been classified as direct carcinogenic agents in humans [2]. Among these, the high-risk subtypes of human papillomavirus (HPV) are the causative agents of approximately 5% of human cancers. Chronic hepatitis B virus (HBV) or hepatitis C virus (HCV) infections are associated with most hepatocellular carcinoma cases. More recently, advances in sequencing technologies have contributed to a better appreciation of the high burden of viral infections in cancer, exemplified by the Kaposi's sarcoma herpesvirus and the Merkel cell polyomavirus, which were discovered through nucleic acid subtraction and cause Kaposi's sarcoma and Merkel cell carcinoma, respectively [2]. The discovery of oncogenic viruses, starting with the Rous sarcoma virus [3], has been critical for understanding mechanisms driving cancer evolution and for improving cancer prevention and intervention strategies.
This motivated us to develop the viRNAtrap framework, to our knowledge the first method that employs a deep learning model to accurately distinguish viral reads in short-read RNA sequencing data and uses the model scores to assemble viral contigs. Deep learning has previously been used to identify viruses from DNA sequencing data (ref). In this study, we use a convolutional neural network (CNN) to extract features from short RNA reads and distinguish viral reads from human reads. We apply viRNAtrap to 14 cancer types from TCGA (selected based on potential viral relevance to oncogenesis) to characterize the landscape of viral infections in the human cancer transcriptome. We demonstrate the ability of viRNAtrap to identify different types of viruses that are expressed in tumors by constructing three viral databases and comparing viRNAtrap findings to sequences in those databases. We first evaluate known cancer-associated viruses that are expressed in different tumor types. Then, we curate a database of potentially functional human endogenous retroviruses (HERVs) and analyze expression patterns of different HERVs across human cancers, finding that HERV expression is associated with poor survival rates. Finally, we employ viRNAtrap to identify divergent viruses that are expressed in tumor tissues. Notably, we identify Redondoviridae members that are expressed in head and neck carcinomas, a Siphoviridae member that is expressed in 10% of high grade serous ovarian cancers, and a Betairidovirinae member that is expressed in more than 25% of endometrial cancer samples.

A. viRNAtrap Architecture
The viRNAtrap framework is composed of two main components. The first is a deep learning model trained to accurately distinguish viral from human RNA-sequencing reads. The second assembles the predicted viral reads into contigs. The trained neural network is composed of one 1D-convolutional layer and three fully connected layers, one of which is the final output layer. The RNA sequences were one-hot encoded into vectors that were given as input to the model. The learning rate was set to 0.005. The convolutional layer uses 64 filters with ReLU as the activation function and is followed by one pooling layer for feature extraction. The global features extracted by the convolutional layer are passed to the three fully connected layers, and the output layer makes the prediction using a sigmoid activation function.
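The architecture described above can be illustrated with a minimal forward-pass sketch in plain Python. It is not the viRNAtrap implementation: the weights are random and untrained, and the kernel width (8), the hidden layer sizes (32 and 16), the use of global max pooling as the pooling layer, ReLU in the hidden fully connected layers, and all function names are assumptions introduced for illustration. Only the one-hot input, the single 1D convolution with 64 ReLU filters, the three fully connected layers, and the sigmoid output follow the text. Training (with the stated learning rate of 0.005) is omitted.

```python
import math
import random

BASES = "ACGT"

def one_hot(read):
    """Encode a read as len(read) rows of 4 values; bases outside ACGT (e.g. N) map to all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in read.upper()]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv1d_relu(x, kernels, biases):
    """Valid 1D convolution along the read followed by ReLU; returns positions x filters."""
    width = len(kernels[0])
    feature_map = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias + sum(
                w * v
                for k_row, x_row in zip(kernel, window)
                for w, v in zip(k_row, x_row)
            )
            row.append(relu(total))
        feature_map.append(row)
    return feature_map

def global_max_pool(feature_map):
    """Keep the strongest response of each filter across all positions (assumed pooling type)."""
    return [max(column) for column in zip(*feature_map)]

def dense(inputs, weights, biases, activation):
    """Fully connected layer; weights has one row per output unit."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def build_model(seed=0, n_filters=64, kernel_size=8, hidden=(32, 16)):
    """Untrained weights. 64 filters follows the text; kernel_size and hidden are assumptions."""
    rng = random.Random(seed)
    sizes = [n_filters, *hidden, 1]  # two hidden layers + one output layer = three dense layers
    return {
        "kernels": [random_matrix(kernel_size, 4, rng) for _ in range(n_filters)],
        "conv_bias": [0.0] * n_filters,
        "dense": [
            (random_matrix(n_out, n_in, rng), [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])
        ],
    }

def predict(model, read):
    """Return a score in (0, 1) from the sigmoid output unit."""
    features = global_max_pool(conv1d_relu(one_hot(read), model["kernels"], model["conv_bias"]))
    last = len(model["dense"]) - 1
    for i, (weights, biases) in enumerate(model["dense"]):
        features = dense(features, weights, biases, sigmoid if i == last else relu)
    return features[0]

model = build_model()
score = predict(model, "ACGT" * 12)  # a 48-base read
```

With untrained weights the score carries no biological meaning; the sketch only shows how a one-hot encoded read flows through the convolution, pooling, and fully connected layers to a single sigmoid output.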
We compared viRNAtrap on our test set with available methods that identify bacterial viral reads from DNA sequences, which we reimplemented (ref); all of these methods use longer reads, ranging from 150 bp to 500 bp. In this comparison, viRNAtrap outperforms the available state-of-the-art methods that were developed to identify bacterial viral reads and human reads from DNA sequences, namely DeepViFi [4], ViraMiner [5], and DeepVirFinder [6], as well as an off-the-shelf Seq2Seq model.

B. Data Preparation
We collected human and viral sequencing data. Coding sequences of viruses of humans and other placental mammals were downloaded from the Virus Variation Resource [7]. Human transcripts for hg19 were downloaded from NCBI Human Genome Resources [8]. These sequences were divided into 48 bp segments, which is the RNAseq read length for almost all tumor types in TCGA; only the few tumor types added to TCGA most recently used longer reads. To balance the positive and negative data, we slid the 48 bp window in steps of 48 bp over human transcripts and in steps of 2 bp over viral sequences. The segments were then randomly split (keeping all segments of each transcript together) into balanced train and test sets (n=8,800,000 and n=2,558,044, respectively). The training dataset was further separated into a training set (n=8,000,000) and a validation set (n=800,000).

[2] N. A. Krump, J. You, "Molecular mechanisms of viral oncogenesis in