Table 1. Summary of existing methods for predicting m5C sites in RNA.
| Methods | Datasets^a | Algorithms | Webserver availability | Evaluation strategy | Features | Species |
|---|---|---|---|---|---|---|
| iRNA-m5C [26] | 120 m5C + 120 non-m5C; 97 m5C + 97 non-m5C; 6289 m5C + 6289 non-m5C; 211 m5C + 211 non-m5C | RF | Yes | (1) Jackknife test; (2) independent test | PseKNC; MNBE; KNFC; NV | H. sapiens; M. musculus; A. thaliana; S. cerevisiae |
| RNAm5Cfinder [25] | All m5C sites recorded in GSE90963; GSE93749; GSE83432 | RF | Yes | (1) 5-fold cross-validation; (2) independent test | MNBE | H. sapiens; M. musculus |
| PEA-m5C [24] | DatasetCV (1196:11960); DatasetHT (100:100); DatasetT1 (79:79); DatasetT2 (73:73) | RF | Yes | (1) 10-fold cross-validation; (2) independent test | PseDNC; KNFC; MNBE | A. thaliana |
| RNAm5CPred [23] | Met935 (127:808); Met240 (120:120); Met1900 (475:1425); Test1157 (157:1000) | SVM | Yes | (1) Jackknife test; (2) 10-fold cross-validation; (3) independent test | KNF; KSNPF; PseDNC | H. sapiens |
| pM5CS-Comp-mRMR [22] | 120 m5C and 120 non-m5C | SVM | No | Jackknife test | DNC; TNC; Tetra-NC | H. sapiens |
| M5C-HPCR [21] | Met1320 (120:1200)^b; Met1900 (475:1425) | Ensemble of SVM | No | Jackknife test | PseDNC | H. sapiens |
| iRNAm5C-PseDNC [20] | Met1900 (475:1425) | RF | Yes | Jackknife test | PseDNC | H. sapiens |
| m5C-PseDNC [19] | Met1320 (120:1200)^b | SVM | No | Jackknife test | PseDNC | H. sapiens |
a The numbers in parentheses give the ratio of m5C to non-m5C sites in each dataset.
b Although the ratio of m5C to non-m5C sites is 120:1200, the final model was built on a balanced dataset of 120 m5C and 120 non-m5C sites.
Table 2. Results of feature selection for H. sapiens.
| Feature Subset | KS | BC | Sn (%) | Sp (%) | Pre (%) | Acc (%) | Mcc | F1-score | AUC |
|---|---|---|---|---|---|---|---|---|---|
| PSNP | 1 | 32 | 81.5 | 81.0 | 81.1 | 81.3 | 0.625 | 0.813 | 0.879 |
| 5SPSDP | 0.25 | 8 | 82.5 | 77.5 | 78.6 | 80.0 | 0.601 | 0.805 | 0.862 |
| CPD | 8 | 0.25 | 81.0 | 76.5 | 77.5 | 78.8 | 0.576 | 0.792 | 0.850 |
| 5SNPF | 0.25 | 4 | 73.5 | 79.5 | 78.2 | 76.5 | 0.531 | 0.758 | 0.802 |
| PseDNC | 1 | 4096 | 74.0 | 73.0 | 73.3 | 73.5 | 0.470 | 0.736 | 0.790 |
| 4NF | 0.125 | 0.5 | 55.5 | 88.5 | 82.8 | 72.0 | 0.466 | 0.665 | 0.783 |
| PSNP+4NF | 1 | 16 | 83.5 | 80.0 | 80.7 | 81.8 | 0.635 | 0.821 | 0.893 |
| PSNP+5SNPF | 1 | 32 | 81.0 | 80.5 | 80.6 | 80.8 | 0.615 | 0.808 | 0.882 |
| PSNP+5SPSDP | 2 | 32 | 82.5 | 81.0 | 81.3 | 81.8 | 0.635 | 0.819 | 0.885 |
| PSNP+PseDNC | 0.5 | 0.25 | 87.5 | 70.5 | 74.8 | 79.0 | 0.589 | 0.806 | 0.854 |
| PSNP+CPD | 8 | 0.25 | 82.0 | 75.5 | 77.0 | 78.8 | 0.576 | 0.794 | 0.850 |
| PSNP+4NF+5SNPF | 1 | 16 | 85.5 | 79.5 | 80.7 | 82.5 | 0.651 | 0.830 | 0.897 |
| PSNP+4NF+5SPSDP | 1 | 8 | 82.5 | 82.5 | 82.5 | 82.5 | 0.650 | 0.825 | 0.895 |
| PSNP+4NF+CPD | 8 | 0.25 | 82.0 | 75.5 | 77.0 | 78.8 | 0.576 | 0.794 | 0.850 |
| PSNP+4NF+PseDNC | 1 | 16 | 80.5 | 78.5 | 78.9 | 79.5 | 0.590 | 0.797 | 0.873 |
| PSNP+4NF+5SNPF+5SPSDP | 1 | 8 | 81.5 | 80.5 | 80.7 | 81.0 | 0.620 | 0.811 | 0.896 |
| PSNP+4NF+5SNPF+CPD | 64 | 16 | 84.0 | 73.0 | 75.7 | 78.5 | 0.573 | 0.796 | 0.854 |
| PSNP+4NF+5SNPF+PseDNC | 1 | 16 | 85.5 | 80.0 | 81.0 | 82.8 | 0.656 | 0.832 | 0.899 |
| PSNP+4NF+5SNPF+PseDNC+CPD | 64 | 16 | 84.0 | 73.0 | 75.7 | 78.5 | 0.573 | 0.796 | 0.854 |
| PSNP+4NF+5SNPF+PseDNC+5SPSDP | 1 | 8 | 81.5 | 81.5 | 81.5 | 81.5 | 0.630 | 0.815 | 0.897 |
Table 3. Results of feature selection for M. musculus.
| Feature Subset | KS | BC | Sn (%) | Sp (%) | Pre (%) | Acc (%) | Mcc | F1-score | AUC |
|---|---|---|---|---|---|---|---|---|---|
| CPD | 4 | 1 | 74.0 | 73.1 | 73.3 | 73.6 | 0.461 | 0.737 | 0.812 |
| 1SPSDP | 16 | 8192 | 73.0 | 72.6 | 72.7 | 72.8 | 0.456 | 0.728 | 0.803 |
| PSNP | 8 | 8192 | 75.2 | 69.1 | 70.9 | 72.2 | 0.444 | 0.730 | 0.794 |
| 4NF | 0.125 | 0.5 | 68.0 | 66.1 | 66.8 | 67.1 | 0.341 | 0.674 | 0.730 |
| PseDNC | 0.5 | 256 | 65.1 | 66.2 | 65.8 | 65.7 | 0.313 | 0.655 | 0.715 |
| 1SNPF | 1 | 8 | 65.5 | 64.2 | 64.7 | 64.9 | 0.298 | 0.652 | 0.702 |
| CPD+1SPSDP | 4 | 1 | 74.1 | 73.1 | 73.3 | 73.6 | 0.472 | 0.737 | 0.813 |
| CPD+PSNP | 4 | 1 | 74.2 | 73.0 | 73.3 | 73.6 | 0.472 | 0.738 | 0.813 |
| CPD+4NF | 32 | 256 | 75.1 | 72.7 | 73.3 | 73.9 | 0.478 | 0.742 | 0.815 |
| CPD+PseDNC | 64 | 4096 | 75.4 | 72.4 | 73.2 | 73.9 | 0.478 | 0.743 | 0.813 |
| CPD+1SNPF | 8 | 2 | 74.8 | 72.3 | 73.0 | 73.6 | 0.471 | 0.739 | 0.811 |
| CPD+4NF+1SNPF | 32 | 256 | 75.3 | 72.7 | 73.4 | 74.0 | 0.480 | 0.743 | 0.816 |
| CPD+4NF+PSNP | 32 | 256 | 75.2 | 72.7 | 73.3 | 73.9 | 0.479 | 0.742 | 0.815 |
| CPD+4NF+1SPSDP | 64 | 4096 | 75.7 | 72.8 | 73.6 | 74.3 | 0.486 | 0.746 | 0.822 |
| CPD+4NF+PseDNC | 32 | 256 | 75.4 | 72.7 | 73.4 | 74.0 | 0.48 | 0.744 | 0.816 |
| CPD+4NF+1SPSDP+1SNPF | 64 | 2048 | 76.0 | 72.9 | 73.8 | 74.5 | 0.490 | 0.749 | 0.822 |
| CPD+4NF+1SPSDP+PSNP | 64 | 4096 | 75.7 | 72.8 | 73.6 | 74.2 | 0.485 | 0.746 | 0.822 |
| CPD+4NF+1SPSDP+PseDNC | 64 | 4096 | 75.7 | 72.8 | 73.6 | 74.2 | 0.485 | 0.746 | 0.822 |
Table 4. Results of feature selection for A. thaliana.
| Feature Subset | KS | BC | Sn (%) | Sp (%) | Pre (%) | Acc (%) | Mcc | F1-score | AUC |
|---|---|---|---|---|---|---|---|---|---|
| PseDNC | 0.125 | 0.25 | 59.4 | 80.6 | 75.4 | 70.0 | 0.410 | 0.665 | 0.760 |
| 4NF | 0.125 | 1 | 62.3 | 76.3 | 72.4 | 69.3 | 0.389 | 0.670 | 0.759 |
| CPD | 16 | 16 | 61.1 | 78.4 | 73.9 | 69.8 | 0.401 | 0.669 | 0.753 |
| 1SNPF | 0.25 | 0.125 | 57.7 | 81.0 | 75.2 | 69.4 | 0.398 | 0.653 | 0.753 |
| PSNP | 0.5 | 32 | 55.8 | 78.1 | 71.8 | 66.9 | 0.347 | 0.628 | 0.724 |
| 3SPSDP | 0.0625 | 1 | 58.2 | 72.4 | 67.8 | 65.3 | 0.309 | 0.626 | 0.694 |
| PseDNC+1SNPF | 0.25 | 0.25 | 61.5 | 78.8 | 74.4 | 70.1 | 0.409 | 0.673 | 0.759 |
| PseDNC+PSNP | 1 | 64 | 60.0 | 80.6 | 75.7 | 70.5 | 0.419 | 0.672 | 0.769 |
| PseDNC+3SPSDP | 0.25 | 2 | 63.4 | 78.7 | 74.9 | 71.1 | 0.426 | 0.686 | 0.773 |
| PseDNC+4NF | 0.25 | 1 | 61.0 | 79.7 | 75.0 | 70.3 | 0.414 | 0.673 | 0.763 |
| PseDNC+CPD | 16 | 16 | 61.0 | 78.7 | 74.1 | 69.7 | 0.404 | 0.669 | 0.753 |
| PseDNC+3SPSDP+4NF | 0.25 | 1 | 65.1 | 77.3 | 74.2 | 71.2 | 0.427 | 0.693 | 0.777 |
| PseDNC+3SPSDP+1SNPF | 0.25 | 0.5 | 65.2 | 77.3 | 74.1 | 71.2 | 0.428 | 0.694 | 0.776 |
| PseDNC+3SPSDP+PSNP | 0.25 | 1 | 64.1 | 76.8 | 73.4 | 70.4 | 0.412 | 0.684 | 0.768 |
| PseDNC+3SPSDP+CPD | 16 | 16 | 61.0 | 78.8 | 74.2 | 69.9 | 0.404 | 0.670 | 0.753 |
| PseDNC+3SPSDP+4NF+1SNPF | 0.5 | 2 | 64.9 | 78.1 | 74.7 | 71.5 | 0.433 | 0.695 | 0.779 |
| PseDNC+3SPSDP+4NF+PSNP | 0.5 | 2 | 63.5 | 78.5 | 74.7 | 71.0 | 0.424 | 0.686 | 0.772 |
| PseDNC+3SPSDP+4NF+CPD | 16 | 16 | 61.0 | 78.8 | 74.2 | 69.9 | 0.404 | 0.670 | 0.755 |
| PseDNC+3SPSDP+4NF+1SNPF+PSNP | 0.25 | 0.5 | 68.1 | 75.5 | 73.5 | 71.8 | 0.437 | 0.707 | 0.782 |
| PseDNC+3SPSDP+4NF+1SNPF+CPD | 16 | 16 | 61.1 | 78.9 | 74.3 | 70.0 | 0.406 | 0.670 | 0.756 |
| PseDNC+3SPSDP+4NF+1SNPF+PSNP+CPD | 16 | 16 | 61.1 | 78.9 | 74.3 | 70.0 | 0.406 | 0.670 | 0.756 |
Table 5. Comparison of different classifiers based on cross-validation results on the training datasets of the three species.
| Species | Classifiers | Sn (%) | Sp (%) | Pre (%) | Acc (%) | Mcc | F1-score | AUROC |
|---|---|---|---|---|---|---|---|---|
| H. sapiens | SVM | 85.5 | 80.0 | 81.0 | 82.8 | 0.656 | 0.832 | 0.899 |
| | XGBoost | 82.5 | 79.5 | 80.1 | 81.0 | 0.620 | 0.813 | 0.879 |
| | RF | 77.5 | 77.0 | 77.1 | 77.3 | 0.550 | 0.773 | 0.849 |
| | KNN | 84.5 | 72.5 | 75.5 | 78.5 | 0.574 | 0.797 | 0.850 |
| | AdaBoost | 79.5 | 73.5 | 75.0 | 76.5 | 0.530 | 0.772 | 0.860 |
| | DT | 68.0 | 65.0 | 66.1 | 66.5 | 0.330 | 0.670 | 0.678 |
| | LR | 62.0 | 61.5 | 61.7 | 61.8 | 0.235 | 0.618 | 0.617 |
| M. musculus | SVM | 75.7 | 72.8 | 73.6 | 74.3 | 0.486 | 0.746 | 0.822 |
| | XGBoost | 76.1 | 73.6 | 74.3 | 74.9 | 0.498 | 0.752 | 0.823 |
| | RF | 75.9 | 71.6 | 72.8 | 73.7 | 0.476 | 0.743 | 0.814 |
| | KNN | 67.3 | 67.5 | 67.5 | 67.4 | 0.349 | 0.674 | 0.729 |
| | AdaBoost | 74.2 | 72.6 | 73.0 | 73.4 | 0.468 | 0.736 | 0.812 |
| | DT | 62.6 | 62.3 | 62.4 | 62.5 | 0.250 | 0.630 | 0.615 |
| | LR | 73.3 | 73.2 | 73.2 | 73.2 | 0.465 | 0.733 | 0.811 |
| A. thaliana | SVM | 68.1 | 75.5 | 73.5 | 71.8 | 0.437 | 0.707 | 0.782 |
| | XGBoost | 65.1 | 76.3 | 73.3 | 70.7 | 0.417 | 0.690 | 0.770 |
| | RF | 66.1 | 76.8 | 74.1 | 71.5 | 0.432 | 0.699 | 0.778 |
| | KNN | 58.0 | 78.6 | 73.1 | 68.3 | 0.375 | 0.647 | 0.734 |
| | AdaBoost | 65.2 | 74.2 | 71.6 | 69.7 | 0.395 | 0.683 | 0.756 |
| | DT | 59.5 | 60.0 | 59.8 | 59.8 | 0.200 | 0.600 | 0.587 |
| | LR | 64.4 | 69.8 | 68.1 | 67.1 | 0.342 | 0.662 | 0.730 |
Table 6. Comparison with existing methods on the independent test sets.
| Species | Model | Sn (%) | Sp (%) | Pre (%) | Acc (%) | Mcc | F1-score | AUROC^a |
|---|---|---|---|---|---|---|---|---|
| H. sapiens | RNAm5Cfinder | 37.7 | 88.4 | 76.5 | 63.1 | 0.303 | 0.505 | 0.635 |
| | iRNA-m5C | 42.1 | 46.4 | 43.9 | 44.2 | -0.116 | 0.429 | --- |
| | iRNAm5C-PseDNC | 4.35 | 97.1 | 60.1 | 50.7 | 0.039 | 0.081 | --- |
| | RNAm5CPred | 71.0 | 66.7 | 68.1 | 68.9 | 0.377 | 0.695 | 0.772 |
| | our method | 75.4 | 79.7 | 78.8 | 77.5 | 0.551 | 0.77 | 0.858 |
| M. musculus | RNAm5Cfinder | 38.6 | 78.9 | 64.5 | 58.8 | 0.191 | 0.483 | 0.593 |
| | iRNA-m5C | 0.61 | 99.8 | 75.1 | 50.2 | 0.032 | 0.012 | --- |
| | our method | 67.9 | 74.9 | 73.0 | 71.4 | 0.429 | 0.704 | 0.775 |
| A. thaliana | iRNA-m5C | 72.4 | 75.6 | 73.5 | 74.1 | 0.481 | 0.729 | --- |
| | PEA-m5C | 43.2 | 45.4 | 43.8 | 44.3 | -0.114 | 0.454 | --- |
| | our method | 75.5 | 76.1 | 76.0 | 75.8 | 0.516 | 0.757 | 0.836 |
a Predicted scores were not available for iRNA-m5C, iRNAm5C-PseDNC and PEA-m5C, so AUROC values could not be computed for these methods.
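The metric columns in Tables 2 to 6 appear to follow the standard confusion-matrix definitions, although the formulas are not restated alongside the tables. The sketch below computes them from raw counts. The counts used in the example (TP = 52, FN = 17, TN = 55, FP = 14) are not given in the source; they are an assumption inferred from the H_test set size (69 positive and 69 negative samples, Table 7) and the Sn and Sp reported for "our method" on H. sapiens. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, Pre, Acc (in percent) and MCC, F1 from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)                      # sensitivity (recall)
    sp = 100.0 * tn / (tn + fp)                      # specificity
    pre = 100.0 * tp / (tp + fp)                     # precision
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 by convention here.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2.0 * tp / (2.0 * tp + fp + fn)
    return {"Sn": sn, "Sp": sp, "Pre": pre, "Acc": acc, "MCC": mcc, "F1": f1}


# Assumed counts (not stated in the source), inferred from 69 + 69 test samples
# and the reported Sn = 75.4 % and Sp = 79.7 %.
metrics = binary_metrics(tp=52, fn=17, tn=55, fp=14)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed counts, the remaining values round to the figures reported in the table (Pre 78.8, Acc 77.5, MCC 0.551, F1-score 0.770), which suggests the reported columns are mutually consistent.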
Table 7. Summary of the datasets.
| Dataset^a | Length (bp) | Positive subset | Negative subset | Total |
|---|---|---|---|---|
| H_train | 41 | 200 | 200 | 400 |
| H_test | 41 | 69 | 69 | 138 |
| M_train | 41 | 4563 | 4563 | 9126 |
| M_test | 41 | 1000 | 1000 | 2000 |
| A_train | 41 | 5289 | 5289 | 10578 |
| A_test | 41 | 1000 | 1000 | 2000 |
a ‘H’ represents H. sapiens, ‘M’ represents M. musculus and ‘A’ represents A. thaliana.
Table 8. Three types of physicochemical properties of dinucleotides in RNA
| Dinucleotide | Free energy | Hydrophilicity | Stacking energy |
|---|---|---|---|
| GG | -3.260 | 0.170 | -11.100 |
| GA | -2.350 | 0.100 | -14.200 |
| GC | -3.420 | 0.260 | -16.900 |
| GU | -2.240 | 0.270 | -13.800 |
| AG | -2.080 | 0.080 | -14.000 |
| AA | -0.930 | 0.040 | -13.700 |
| AC | -2.240 | 0.140 | -13.800 |
| AU | -1.100 | 0.140 | -15.400 |
| CG | -2.360 | 0.350 | -15.600 |
| CA | -2.110 | 0.210 | -14.400 |
| CC | -3.260 | 0.490 | -11.100 |
| CU | -2.080 | 0.520 | -14.000 |
| UG | -2.110 | 0.340 | -14.400 |
| UA | -1.330 | 0.210 | -16.000 |
| UC | -2.350 | 0.480 | -14.200 |
| UU | -0.930 | 0.440 | -13.200 |
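As a rough illustration of how the values in Table 8 could be attached to a sequence, the sketch below maps each overlapping dinucleotide of an RNA string to its three property values. This is only a lookup step; how encodings such as PseDNC combine or normalize these values is not described in this section, so nothing here should be read as the authors' feature extraction. The names `DINUC_PROPS` and `dinucleotide_properties` are hypothetical.

```python
# Values transcribed from Table 8:
# dinucleotide -> (free energy, hydrophilicity, stacking energy).
# The UG free energy is read as -2.110.
DINUC_PROPS = {
    "GG": (-3.260, 0.170, -11.100), "GA": (-2.350, 0.100, -14.200),
    "GC": (-3.420, 0.260, -16.900), "GU": (-2.240, 0.270, -13.800),
    "AG": (-2.080, 0.080, -14.000), "AA": (-0.930, 0.040, -13.700),
    "AC": (-2.240, 0.140, -13.800), "AU": (-1.100, 0.140, -15.400),
    "CG": (-2.360, 0.350, -15.600), "CA": (-2.110, 0.210, -14.400),
    "CC": (-3.260, 0.490, -11.100), "CU": (-2.080, 0.520, -14.000),
    "UG": (-2.110, 0.340, -14.400), "UA": (-1.330, 0.210, -16.000),
    "UC": (-2.350, 0.480, -14.200), "UU": (-0.930, 0.440, -13.200),
}


def dinucleotide_properties(seq):
    """Return one (free energy, hydrophilicity, stacking energy) tuple
    per overlapping dinucleotide of an RNA sequence."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return [DINUC_PROPS[seq[i:i + 2]] for i in range(len(seq) - 1)]


# "GGAUC" contains the dinucleotides GG, GA, AU and UC.
example = dinucleotide_properties("GGAUC")
for props in example:
    print(props)
```

A 41-nucleotide sample, the length listed in Table 7, would yield 40 such tuples, one per adjacent pair of nucleotides.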