Data source:
We extracted circRNA sequences from the circRNADb database[22] and other lncRNA sequences (lincRNA, antisense, processed transcript, sense intronic, and sense overlapping) from the GENCODE database[23]. After removing sequences shorter than 200 nucleotides, we obtained 31939 circRNAs and 19722 other lncRNAs. The circRNA sequences were regarded as positive samples. We randomly divided the dataset into a training set (75%), a validation set (10%), and a test set (15%).
Instances extraction by sliding window:
An RNA sequence was regarded as a bag, and instances were extracted from it. For each full-length sequence, we connected the head (5' end) and tail (3' end) of the sequence, set the window size and the sliding step, and slid the window from the head. At each step, the subsequence contained in the window was extracted as an instance, until the window moved past the tail of the sequence (illustrated in Fig. 1). For a sequence of a given length, the number of instances can be calculated by the following formula.
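As a concrete illustration, the circular sliding-window extraction can be sketched as below. The helper name, the toy sequence, and the exact stopping rule (emit a window at each step until the window start passes the tail) are our assumptions; the window size 70 and step 5 are those reported in the Results.

```python
import math

def extract_instances(seq, window=70, step=5):
    """Slide a fixed-size window over a circularized sequence:
    the 3' tail is joined back to the 5' head, so windows that run
    past the tail wrap around.  One instance is emitted per step
    until the window start has moved past the tail."""
    L = len(seq)
    doubled = seq + seq                    # emulate the head-tail junction
    instances = []
    start = 0
    while start < L:
        instances.append(doubled[start:start + window])
        start += step
    return instances

seq = "ACGT" * 60                          # toy 240-nt sequence
inst = extract_instances(seq)
# under this reading, the instance count is ceil(len(seq) / step)
print(len(inst) == math.ceil(len(seq) / 5))
```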
Network architecture:
The network structure is shown in Figure 2. We employed the encoder structure of the seq2seq model[24] as the instance feature extractor. An embedding layer[25] was used to represent the bases (15 symbols (A, T, G, C, N, H, B, D, V, R, M, S, W, Y, K) → 4 representative dimensions). The encoder used a bi-directional RNN structure, which gave equal attention to the head and the tail of each instance, and its output was a context vector[26] representing the feature of the instance. Subsequently, through the MIL layer, the features of all instances were scored and aggregated jointly to determine the type of the bag[20, 21, 27].
Attention mechanism as the MIL pooling
Following previous work on pooling layer structures, we selected the attention-based pooling structure, which exhibited better aggregation and representation capacity[21]. It was assumed that the features extracted by the encoder were H = {h_1, ..., h_K} and their corresponding attention weights were a_1, ..., a_K, which could be formulated as follows:

z = Σ_{k=1}^{K} a_k h_k,  with  a_k = exp{w^T tanh(V h_k^T)} / Σ_{j=1}^{K} exp{w^T tanh(V h_j^T)}

where w ∈ R^{L×1} and V ∈ R^{L×M}. The attention-based structure allowed the network to discover the similarity between different instances and gave it better representability. After the encoder features were weighted by the attention scores, the probability of determination was output via a sigmoid neuron through a fully connected layer.
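The pooling step can be sketched numerically as follows. This is a minimal reading of the attention formulation of [21]: the dimensions (K, M, L) and the random weight values are arbitrary placeholders, not the trained network.

```python
import math, random

random.seed(0)
K, M, L = 5, 8, 4        # instances, feature dim, attention hidden dim

H = [[random.gauss(0, 1) for _ in range(M)] for _ in range(K)]  # encoder features h_k
V = [[random.gauss(0, 1) for _ in range(M)] for _ in range(L)]  # L x M weight matrix
w = [random.gauss(0, 1) for _ in range(L)]                      # L x 1 weight vector

def raw_score(h):
    """w^T tanh(V h) for one instance feature h."""
    hidden = [math.tanh(sum(V[i][j] * h[j] for j in range(M))) for i in range(L)]
    return sum(w[i] * hidden[i] for i in range(L))

scores = [raw_score(h) for h in H]
mx = max(scores)                                  # shift for numerical stability
exps = [math.exp(s - mx) for s in scores]
a = [e / sum(exps) for e in exps]                 # softmax -> attention weights a_k
z = [sum(a[k] * H[k][j] for k in range(K)) for j in range(M)]  # bag feature z

print(abs(sum(a) - 1.0) < 1e-9, len(z) == M)
```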
Handling of handwritten numbers dataset
The handwritten numbers dataset was used to verify the representational power of the attention score. Each digit image (size = 28×28) served as an instance, and a bag contained more than 16 instances. Each image was treated as a sequence of 28 characters, each with a representation dimension of 28, for feeding into the network (Circ-ATTEN-MIL; the embedding layer in the encoder block was removed for this task) (Fig. 3). A bag was positive when it contained a determining number (two modes were set: the determining number is 0; the determining numbers are 0, 1, and 3).
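The image-to-sequence treatment can be sketched as below; the helper name is ours, and the blank image is a placeholder for an actual digit.

```python
def image_to_sequence(pixels, size=28):
    """Treat a flat size x size image as a sequence of `size` steps,
    each step being a `size`-dimensional row vector, so the same
    recurrent encoder can consume it (no embedding layer needed)."""
    assert len(pixels) == size * size
    return [pixels[r * size:(r + 1) * size] for r in range(size)]

image = [0.0] * (28 * 28)                 # placeholder blank digit image
steps = image_to_sequence(image)
print(len(steps), len(steps[0]))          # 28 28
```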
Fusion model:
The ‘weighted feature’ (the penultimate layer) of Circ-ATTEN-MIL was extracted as the sequence feature defined by the model. The other features were calculated using the RCM feature and conservation feature extraction methods of CircDeep. Combining these three types of features (sequence feature: 100; RCM feature: 40; conservation feature: 23), a four-layer MLP (multi-layer perceptron) network (163-80-20-1; the output layer is a sigmoid-activated neuron) was constructed as the fusion model.
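The fusion network's shape bookkeeping can be sketched as follows. The hidden activation is not stated in the text, so ReLU is assumed here, and the weights and feature values are random placeholders rather than trained values.

```python
import math, random

random.seed(1)

def linear(x, n_out):
    """Random-weight linear layer, for shape bookkeeping only."""
    n_in = len(x)
    W = [[random.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

relu = lambda v: [max(0.0, u) for u in v]

# the three feature blocks: 100 + 40 + 23 = 163-dim fused input
seq_feat  = [random.random() for _ in range(100)]  # Circ-ATTEN-MIL weighted feature
rcm_feat  = [random.random() for _ in range(40)]   # RCM features (as in CircDeep)
cons_feat = [random.random() for _ in range(23)]   # conservation features (as in CircDeep)
x = seq_feat + rcm_feat + cons_feat

h1 = relu(linear(x, 80))                           # 163 -> 80
h2 = relu(linear(h1, 20))                          # 80  -> 20
p  = 1 / (1 + math.exp(-linear(h2, 1)[0]))         # 20  -> 1, sigmoid output

print(len(x) == 163, 0.0 < p < 1.0)
```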
Evaluation criteria:
We evaluated the model performance by classification accuracy, sensitivity, specificity, precision, MCC (Matthews correlation coefficient), and F1 score, formulated as follows (TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
F1 = 2 × Precision × Sensitivity / (Precision + Sensitivity)
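These criteria can be computed from the confusion counts as sketched below; the toy counts are illustrative only.

```python
import math

def evaluate(tp, tn, fp, fn):
    """Standard binary-classification criteria from confusion counts."""
    acc  = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                    # sensitivity (recall)
    spec = tn / (tn + fp)                    # specificity
    prec = tp / (tp + fp)                    # precision
    mcc  = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1   = 2 * prec * sens / (prec + sens)
    return {"ACC": acc, "Sens": sens, "Spec": spec,
            "Prec": prec, "MCC": mcc, "F1": f1}

m = evaluate(tp=90, tn=80, fp=10, fn=20)     # toy confusion counts
print(round(m["ACC"], 2))                     # 0.85
```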
Extraction of high attention sequence splices
Because the attention score was applied to the encoder features of each instance, we assigned the same score to the sequence of that instance, collapsed the weighted sequences according to the inverse of the sliding-window rule (Fig. 4), and extracted the sequence fragments whose attention scores (after scaling to between 0 and 1) exceeded 0.6 and whose lengths exceeded 7; these served as the high attention sequence splices.
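One possible implementation of this extraction rule (min-max scale the per-base scores, then keep contiguous runs with score > 0.6 and length > 7) is sketched below; the function name and toy inputs are ours.

```python
def high_attention_splices(seq, scores, score_thr=0.6, min_len=8):
    """Min-max scale per-base attention scores, then keep contiguous
    runs above score_thr whose length exceeds 7 nt."""
    lo, hi = min(scores), max(scores)
    scaled = [(s - lo) / (hi - lo) for s in scores]
    splices, run = [], []
    for base, s in zip(seq, scaled):
        if s > score_thr:
            run.append(base)
        else:
            if len(run) >= min_len:
                splices.append("".join(run))
            run = []
    if len(run) >= min_len:
        splices.append("".join(run))
    return splices

seq = "A" * 30
scores = [0.1] * 10 + [0.9] * 12 + [0.1] * 8      # one 12-nt high-attention run
print(high_attention_splices(seq, scores))         # ['AAAAAAAAAAAA']
```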
Motif enrichment:
MEME software[28] was utilized to perform the motif enrichment task. In the MEME environment, classic mode was selected to enrich motifs of length 6 to 50 in the RNA sequences (command: meme RNA.fasta -rna -nostatus -mod zoops -minw 6 -maxw 50 -objfun classic -markov_order 0).
Result 1: Dataset description
The sequence length distributions and base proportions of circRNAs and other lncRNAs (in the training set) were very similar (Fig. 5), which illustrates that these simple features are comparable between the two types of sequences, and that a model fed with raw sequences could hardly accomplish the identification task using such simple features alone.
Result 2: Model architecture
In instance extraction, the window size was set to 70 and the sliding step to 5 (Fig. 1). The encoder block consisted of one embedding_15_4 layer and two bi-directional LSTM_4_150 layers. The final-step outputs of both directions were concatenated and passed through an FCN_300_100 layer to obtain the instance feature (C_100). The attention block accepted the C_100 feature of each instance as the key value; after an FCN_100_30 and an FCN_30_1 layer, the dimension for each instance was reduced to 1 (the attention value). A softmax layer was utilized to normalize the attention values across instances, yielding the normalized attention scores. Finally, the classifier block accepted all instances' weighted C_100 features and, through a fully connected layer and a sigmoid neuron, output the identification probabilities (Fig. 2).
Result 3: Model training and identification evaluation
We used the binary cross-entropy loss function to calculate the loss and trained the models with the Adam optimization algorithm (learning rate = 0.0002; betas = (0.9, 0.999); weight decay = 10e-5). Balancing accuracy against overfitting, we chose the model trained at the 70th epoch as the final model and plotted the ROC curves (Fig. 6). The trained model showed strong identification power (train AUC = 0.99; validation AUC = 0.97; test AUC = 0.97). Subsequently, multiple evaluation criteria were employed to test the model (Table 1), and these metrics also validated that the model is highly robust.
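The stated optimization setup maps onto PyTorch as in the fragment below. This is a configuration sketch only: a stand-in module replaces the actual Circ-ATTEN-MIL network, and the training loop is omitted.

```python
import torch

model = torch.nn.Linear(100, 1)          # stand-in for the Circ-ATTEN-MIL network
criterion = torch.nn.BCELoss()           # binary cross-entropy loss
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.0002,                           # learning rate from the text
    betas=(0.9, 0.999),
    weight_decay=10e-5,                  # as written in the text
)
```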
Table 1. Evaluation of the classification task

             Accuracy  Sensitivity  Specificity  Precision  MCC     F1
Train        0.9552    0.9662       0.9547       0.9713     0.9194  0.9687
Validation   0.9333    0.9485       0.9092       0.9433     0.8291  0.9459
Test         0.9284    0.9396       0.9039       0.9393     0.8435  0.9394
Result 4: Comparison with other algorithms
Our model was compared with the ACNN-BLSTM model in CircDeep[13], which takes the sequence as input without features from the secondary structure or the conservation score. In Circ-ATTEN-MIL, the input was the full-length raw sequence, while in ACNN-BLSTM the input was the padded triplet sequence (each base triplet was transformed into a 40-dimensional vector by word2vec, and the input length was padded to 8000). The comparison showed that our final model was better under all three metrics (Table 2). Finally, we incorporated the RCM and conservation features used in the CircDeep model to build a fusion model (Methods), which further improved the discriminative power of the final model.
Table 2. The comparison results
                   ACC     MCC     F1 Score
PredcircRNA        0.8056  0.6113  0.8108
ACNN-BLSTM         0.8942  0.7756  0.9149
Circ-ATTEN-MIL     0.9284  0.8435  0.9394
CircDeep (fusion)  0.9327  0.8536  0.9304
Fusion model       0.9434  0.8796  0.9546
Result 5: Attention score employed for identifying determining factors
To verify the representational power of the attention score, we used the handwritten numbers dataset to visualize the known determining factors against the produced attention scores. Two models (encoder block: 2 LSTM_28_10, FCN_10_10; MIL block: FCN_10_5, FCN_5_1) were trained in this part, one (model 1) with 0 and the other (model 2) with 0, 1, and 3 as the determining numbers (a bag containing determining-number instances was treated as a positive sample). Training was stopped after the accuracy exceeded 0.90 (around 10 epochs). We visualized the attention scores with the matched instances and discovered that the attention score identifies well whether the bag contains a single determinant, multiple identical determinants, or multiple different determinants (Fig. 7). Statistics on determining-number identification showed a very low percentage of false identifications, and although there was a certain unrecognized rate, the identified numbers had a very high confidence level (>99%).
Result 6: Motif enrichment from high attention sequences
The high attention sequences were extracted from all correctly identified circRNA transcripts. Most of them were between 8 and 40 nt in length, and the count of attention sequences per transcript was around 4 (Fig. 8), which validates our initial assumption that the meaningful features are sparse. All high attention sequences were used for motif enrichment, and multiple validated motifs were obtained (Table 3).
Table 3. Motifs enriched from the high attention sequences