Cohort design and sample collection
The cohort in this study included 295 plasma samples in total. Except for 65 previously published samples (GSE142987: 35 liver cancer patients and 30 healthy donors)[21], we sequenced the total cfRNAs (>50nt) in 230 additional plasma samples (54 colorectal cancer, 37 stomach cancer, 27 liver cancer, 35 lung cancer, 31 esophageal cancer and 46 HDs).
Samples were obtained between October 2018 to January 2020 from 6 clinical centers in China: Peking University First Hospital (PKU, Beijing), Peking Union Medical College Hospital (PUMCH, Beijing), Department of Epidemiology Navy Medical University (ShH-1, Shanghai), Eastern Hepatobiliary Surgery Hospital (ShH-2, Shanghai), National Center for Liver Cancer (ShH-3, Shanghai) and Southwest Hospital (SWU, Chongqing). The study was approved by the local institutional research ethics committees. Informed consent was obtained from all patients.
Peripheral whole blood samples were collected from participants before therapies using EDTA-coated vacutainer tubes. The tubes were inverted 8-10 times to mix blood with anticoagulant. Plasma was separated by a standard clinical blood centrifugation protocol within 2 hours after collection. All plasma samples were aliquoted and stored at -80°C before cfRNA extraction.
cfRNA-seq library preparation
Cell free RNAs (cfRNAs) were extracted from 1 mL of plasma using the Plasma/Serum Circulating RNA and Exosomal Purification kit (Norgen). Purification was based on the use of Norgen’s proprietary resin as the separation matrix. This kit is able to extract all sizes of circulating cfRNAs. The concentration of extracted cfRNAs was assessed using the Qubit RNA assay (Life Technologies).
Total cfRNA library (>50nt) was prepared with the SMARTer® Stranded Total RNA-Seq Kit–Pico. This kit removes ribosomal cDNAs after reverse transcription using a CRISPR/DASH method. We used Recombinant DNase I (TAKARA) to digest circulating DNAs. ERCC RNA Spike-In Control Mixes (Ambion) was added to the samples before library preparation, 1μL per library at an appropriate concentration. RNA Clean and Concentrator-5 kit (Zymo) was used to obtain purified total RNA. More than 20 million reads of total cfRNA were sequenced on an Illumina platform for each library.
Potential contaminations in RNA extraction and library preparation were evaluated using two types of negative controls. 2 RNA samples were extracted from E. coli DH5α strain, using same kit for plasma cfRNA extraction. RNA-seq libraries of E. coli RNA samples, together with human brain RNA provided by SMARTer® Stranded Total RNA-Seq Kit, were constructed using same protocol for cfRNA library preparation.
Data processing
For RNA sequencing data, adapters and low-quality sequences in raw sequencing data were trimmed using cutadapt[22] (version 2.3). GC oligos introduced in reverse transcription were also trimmed off, and reads shorter than 30 nt were discarded. We used STAR[23] (version 2.5.3a_modified) for sequence mapping. The trimmed reads were mapped to ERCC’s spike-in sequences, vector sequences in NCBI’s UniVec database, and human rRNA sequences in refseq sequentially.
The remaining reads was mapped to the hg38 genome index built with the GENCODE[24] v27 annotation. Circular RNA annotation was downloaded from circBase[25]. Upstream 150bp and downstream 150bp sequences around the back spliced sites of circRNAs were concatenated to generate junction sequences, and circRNA sequences shorter than 100 bp were discarded. Reads unaligned to hg38 were mapped to circRNA junctions. Duplicates in the aligned reads were removed using Picard Tools MarkDuplicates (version 2.20.0). An aligned read pair was assigned to an RNA type if at least one of the mates overlapped with the corresponding genomic regions. In this way, the aligned reads were sequentially assigned to lncRNA, mRNA, snoRNA, snRNA, srpRNA and Y RNA with HTSeq[26] package according to the GENCODE v27 annotation.
Count matrix for human genes was constructed using featureCounts[27] v1.6.2 with the GENCODE v27 annotation. To avoid the impact of potential DNA contamination, only intron-spanning reads were considered.
Quality control
We filtered cfRNA-seq samples with multiple quality control criterions (Figure S1): 1) raw reads > 10 million; 2) clean reads (reads remained after trimming low quality and adaptor sequences) > 5 million; 3) aligned reads after duplicate removal (aligned to the human genome, hg38, and circRNA junctions) > 0.5 million; 4) For the clean reads, fraction of spike-in reads < 0.5 and ratio of rRNA reads < 0.5; 5) For genome aligned reads, ratio of mRNA and lncRNA reads > 0.2, ratio of unclassified reads < 0.3, and number of intron-spanning read pairs (defined as a read pair with a CIGAR string in which at least one mate contains “N” in the BAM files) > 100,000.
Differential analysis and functional enrichment analysis
We used quasi-likelihood method in edgeR[28] package to identify differentially expressed genes and genera with significantly abundance alteration (|log2(fold-change)| > 1 and FDR < 0.05). We used this method to identify differential genes between cancer patients and healthy donors, as well as genes specific to one cancer type. For cancer type specific genes, previously reported gender related genes[29] were excluded. KEGG pathway enrichment analysis of deregulated genes/RNAs were carried out using clusterProfiler[30].
Data normalization
The count matrix of gene expression was normalized using the trimmed mean of M-values (TMM) method in edgeR. ANOVA was performed among different sample groups (HD and five cancer types) on discovery set using the quasi-likelihood method in edgeR, and the 25% most insignificant genes that stably expressed among different groups were considered as empirical control genes. The TMM normalized expression matrix was adjusted by the RUVg function in the RUVSeq[31] package based on the identified control features.
Microbe data analysis
Unmapped reads (cleaned reads failed to aligned to human genome or circRNA junctions) were processed independently using a k-mer based pipeline and an alignment-based pipeline. In first pipeline, unmapped reads were classified using kraken2[32] with its standard database, which contains bacteria, archaea, virus and human sequences. In the alignment-based pipeline, using SortMeRNA[33] (version 4.3.3), unmapped reads were annotated as either rRNA or non rRNA. rRNA reads were mapped to Silva database with bowtie2[34]. Non rRNA reads were aligned to virus genome curated in kraken2’s standard database. In both pipelines, counts at genus level were used for downstream analysis.
The same preprocessing and downstream analysis pipeline were applied to negative control samples (E. coli RNA-seq data was aligned to its reference genome NZ_CP025520.1 with bowtie2, instead of map to human rRNA, human genome and circRNA junctions). For reads coverage analysis of Lawsonella clevelandensis and HBV, reads unmapped to human sequences were mapped to their reference genomes (NZ_CP012390.1 and NC_003977.2, respectively).
Potential contaminations in genera detected by both kraken2 pipeline and bowtie2 pipeline (have at least 3 reads in at least 3 samples) were filtered prior to downstream analysis. We remove bacteria genera detected in at least 1 control samples (at least 3 reads), and virus genera detected in at least 1 E. coli control samples (at least 3 reads). Genera present in a previously reported common lab contamination list[35], or genera that contain species with CPM > 10 in a published human skin microbiome dataset[36] were removed. Virus genera that contains species with non-human eukaryotic host according to virushostdb[37] were also excluded.
The genera with altered abundance were identified using edgeR. For cancer type specific microbes, genera with a prevalence lower than 20% in all sample groups were excluded. Counts at genus level were also normalized with TMM and RUVg, as we did for human gene expression.
Classification performance evaluation based on independent validation
We split the samples into a discovery set and an independent validation set in a 1:1 manner, stratified by cancer type, age, and gender. Esophageal cancer samples were all retained in the discovery set due to its small sample size.
In the discovery set, we use rank sum test to select 100 most significant features, and the more computationally intensive SURF method was applied to select 10 most important features. We used the ranksums functions in scipy[38] for rank sum test. SURF was implemented in python package skrebate[39]. To mitigate the impact of within-class heterogeneity, feature selection was embedded in the shuffle-split cross-validation process. The discovery set was randomly split into training and test sets in an 80%-20% manner; feature selection was performed on the training set and the model was evaluated on the test set. This procedure was repeated 100 times. This strategy was applied separately for gene expression and microbe abundance data, and 5 up regulated and 5 down regulated features were selected in each cross validation run.
Finally, 10 most frequently selected features were used for evaluate the binary classification performance on the validation set. A balanced version of random forest (implemented in python package imblearn[40]) was used to handle the class-imbalance problem explicitly. AUROC and its confidence interval were calculated using R package pROC[41].
For multiclass classification, we apply the same strategy in one vs. rest manner, that is selecting 10 up regulated features in each cross validation run, and finally used features selected with recurrence higher than 10 (in 100 cross-validation runs) for multiclass classification. In the final multi-classification model, 129, 117, and 145 features were selected from human gene expression, microbe genera abundance in kraken2 pipeline and bowtie2 pipeline, respectively.
Data availability
In addition to 35 datasets for liver cancer and 30 datasets for healthy donors we published previously (GSE142987)21, 230 raw FASTQ files supporting the findings of this study are available in the Gene Expression Omnibus (GEO: GSE174302).
For editors and reviewers:
The data in GSE174302 can be downloaded from the GEO with a secure token: cxmxycqelxctbst.