We developed the process illustrated in Fig. 1. The training pipeline (Fig. 1, A) uses RNA-seq read pairs from both parents, as well as a reference genome or transcriptome from each parent. Every read pair is aligned to both parental references. Features are extracted from the alignments by a Python script that parses the aligner output. A machine-learning model is trained to predict the parent-of-origin per read pair based on the alignment features. The prediction pipeline (Fig. 1, B) would apply a similar pipeline to read pairs from the hybrid offspring. In this case, the unknown parent-of-origin per read pair must be predicted. To assess model accuracy in hybrids, we simulated hybrid data by combining real RNA-seq data from both parents in equal quantities. Possible limitations of the simulation are addressed in Discussion.
Fig. 1 Process for classifying RNA-seq by machine learning. Data from parents P1 and P2 are represented by blue and yellow, respectively, while data from their hybrid offspring are green. A) Training uses many RNA-seq read pairs from both parents, P1 and P2. For compactness, only one read pair is shown here, and it is from parent P2. The read pair is aligned to both parent references. Features are extracted from the best alignment to each reference. A binary classifier, such as the random forest shown here, is trained to examine the features for each read pair and assign it to the correct parent. B) The trained classifier can be used to classify RNA-seq read pairs from a hybrid. Classifying a read pair as P2, as shown, implies its RNA is likely transcribed from the P2 haplotype in the hybrid genome. The classified reads can be binned by gene and quantified to detect allele-specific gene expression in the hybrid.
We searched INSDC databases (45) to find pairs of organisms each having (1) a reference genome assembly, and (2) a reference transcriptome assembly, and (3) at least one RNA-seq dataset. We filtered for pairs of organisms known to hybridize and belonging to the same genus. We filtered further for organism pairs having RNA-seq datasets that were comparable by read length and sample collection methods. In some cases, after low map rates were seen, different RNA-seq data was sought out. The search yielded data for: (1) two species of the flowering plant genus Arabidopsis; (2) two species of the cultivated plant genus Brassica; (3) two strains of the mouse species Mus musculus; and (4) two species of the equine genus Equus.
We selected mapping software packages based on the software’s ability to generate SAM/BAM files (46) that include the required and optional fields required for our alignment feature extraction process. For mapping to reference transcripts, we selected the aligners Bowtie2 (6,7) and STAR (8,9), which we configured for mapping to RNA references lacking introns. For mapping to reference genomes, we selected HiSat2 (18,19), which incorporates Bowtie2, and STAR using its default configuration for mapping to DNA containing introns. For an additional comparison, the alignment-free mapping software Salmon was included for transcript mapping, though its outputs did not satisfy the criteria for alignment feature extraction.
For each genus, RNA-seq read pairs from two parental organisms were combined in equal quantities. Reads were mapped to both parental references, using either both transcriptomes or both genomes. For machine-learning approaches, fifty-three features were selected to represent the pairwise sequence alignment generated by one aligner of one RNA-seq read pair to two parental references, where both references were either transcriptomes or genomes. The features are listed in Table 1. A portion of the reads were used to train a random forest model (47). A separate portion of the reads were used to quantify the predictive performance of the trained model. The models’ performance was compared to these other approaches: comparing alignment scores or relying on the mapper directly.
Table 1 Alignment Features used for Machine Learning
Feature type
|
Extraction
|
Technical notes
|
A)
AS: Alignment Score
ED: Edit Distance
MM: Mismatch count
HQMM: HQ mismatch count
GO: Gap Open count
GE: Gap Extend count
INS: Insertion count
HQINS: HQ insertion count
DELS: Deletion count
HQDEL: HQ deletion count
|
Taken from:
- P1 R1,
- P1 R2,
- P2 R1,
- P2 R2,
10 feature types,
40 features total.
|
High-quality (HQ) means that the base call quality score is the maximal value. The HQ requirement was applied to the one base involved in a mismatch or insertion, and to the two surrounding bases for deletion. INS or DEL refer to an extra or missing base in the read, respectively. GO is the number of separate indels, and GE is the number of bases in indels.
|
B)
AS diff
ED diff
MM diff
HQMM diff
GO diff
GE diff
INS diff
DELS diff
HQINS diff
HQDEL diff
MAT diff
|
Subtract
(P1 R1 + P1 R2)
from
(P2 R1 + P2 R2)
|
Each difference represents the sum over the read pair alignments to parent 2 minus the equivalent sum for parent 1.
MAT is the matched base count. See A) for other feature types.
|
C)
Span diff
|
Subtract P1 span
from P2 span.
|
Span is the length of the read pair alignment along the reference.
|
D)
Parent choice
|
Compare P1 to P2
|
Use -1 or +1 to indicate whether P1 or P2 had the greater alignment score, respectively, or 0 if tied.
|
After an aligner, such as Bowtie2, aligned an RNA-seq read pair to both parental references, the one best alignment per reference was selected based on flags in the aligner output. Then, 53 features were extracted for machine learning. A) These 10 features, taken directly from the aligner output, describe the alignment of one read to one reference. Four sets of 10 we extracted to represent reads R1 and R2 aligned to references of parents P1 and P2. B, C, D) These features were derived from combinations of those in part A.
Results with genus Arabidopsis
This experiment used public data from A. lyrata and A. halleri, two species of the flowering plant genus Arabidopsis. The transcriptome results are characterized in Table 2. As shown in column A, we achieved 83% accuracy by running Bowtie2 on the concatenation of parental references. As shown in column B, we achieved 81% accuracy by running Bowtie2 twice, once on each parent reference, then choosing the one alignment having the higher alignment score, which is one of the criteria used by aligners internally. As shown in column C, we achieved 96% accuracy by applying the random forest post-process to the alignment generated for column B. Out of these three approaches, the random forest achieved the highest performance by all measures including accuracy, F1-score, and MCC. Columns D, E, and F show the same experiment repeated using the STAR aligner configured for transcript alignments. Again, the highest performance was achieved with random forest. For comparison, column G characterizes the performance of an alignment-free approach.
Table 2 Classification Performance with Reference Transcripts of genus Arabidopsis
Arabidopsis RNA
|
A
Bowtie2
|
B
Bo_AS
|
C
Bo_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
G
Salmon
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref:
Ties:
|
82.8%
84.7%
80.9%
81.6%
83.1%
0.656
-
-
51.1%
0%
|
81.3%
71.3%
91.2%
89.0%
79.2%
0.638
-
-
40.1%
13.6%
|
96.0%
92.1%
99.9%
99.9%
95.8%
0.923
99.7%
99.7%
46.1%
0%
|
80.9%
81.6%
80.2%
80.5%
81.1%
0.618
-
-
50.5%
0%
|
80.9%
69.9%
92.0%
89.7%
78.6%
0.634
-
-
49.8%
14.7%
|
92.3%
92.0%
92.5%
92.5%
92.3%
0.846
98.4%
98.4%
46.1%
0%
|
75.5%
76.9%
74.0%
74.8%
75.8%
0.510
-
-
51.0%
0%
|
Performance metrics for parent-of-origin classification in Arabidopsis. In all seven approaches, RNA-seq read pairs were assigned to either of two reference transcriptomes. Whether used with Bowtie2 or STAR, the random forest method demonstrated superior performance. For the sake of directional statistics like sensitivity, species A. lyrata and A. halleri were designated as the negative and positive classes, respectively. A) Parent chosen by the Bowtie2 aligner. B) Parent chosen by comparing Bowtie2 alignment scores. C) Parent chosen by the random forest classifier using Bowtie2 alignment features. D, E, F) Similar to columns A, B, C, but using the STAR aligner, configured to avoid splicing. G) Parent chosen by the alignment-free software, Salmon.
To evaluate alternate machine learning architectures, the Bowtie2 experiment was repeated with three other architectures: a gradient boosting classifier, a support vector machine, and a multi-layer perceptron. The results, described in Table S1, did not exceed those of Table 2. To evaluate whether more trees would help the random forest, the experiment was repeated after increasing the number of trees within the random forest. The results, described in Table S2, did not exceed those of Table 2. Therefore, the default random forest model was used for the remaining experiments.
Whereas Table 2 showed results with the transcriptome references, Table 3 shows the results using the genome references and splice-aware alignment. The highest performance, as measured by accuracy, F1, or MCC, was achieved with the machine-learning post-process (columns C and F). The overall highest accuracy on Arabidopsis data was achieved with HiSat2 plus random forest, an in this case, the accuracy climbed from 67% with the aligner alone to 96% with machine learning.
Table 3 Classification Performance with Reference Genomes of genus Arabidopsis
Arabidopsis DNA
|
A
HiSat2
|
B
Hi_AS
|
C
Hi_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref:
Ties:
|
66.8%
38.3%
95.4%
89.2%
53.6%
0.410
-
-
28.7%
0%
|
82.8%
72.0%
93.5%
91.8%
80.7%
0.671
-
-
39.2%
13.2%
|
96.2%
93.2%
99.1%
99.0%
96.0%
0.925
99.7%
99.7%
47.1%
0%
|
67.5%
40.5%
94.5%
88.0%
55.4%
0.415
-
-
30.0%
0%
|
81.4%
69.8%
93.1%
91.0%
79.0%
0.647
-
-
38.4%
14.7%
|
92.6%
93.8%
91.5%
91.7%
92.7%
0.853
98.5%
98.5%
51.2%
0%
|
Performance metrics for parent-of-origin classification in Arabidopsis. In all six approaches, RNA-seq read pairs were assigned to either of two reference genomes. Whether used with HiSat2 or STAR, the random forest led to superior accuracy, F1, and MCC. For the sake of directional statistics like sensitivity, species A. lyrata and A. halleri were designated as the negative and positive classes, respectively. A) Parent chosen by the HiSat2 aligner. B) Parent chosen by comparing HiSat2 alignment scores. C) Parent chosen by the random forest classifier using HiSat2 alignment features. D, E, F) Similar to columns A, B, and C but using the STAR aligner, configured for splicing.
Mild class imbalance is seen in Table 2. For example, in column C, sensitivity < specificity and precision > recall. (Recall and sensitivity are the same in binary classification). Also, the “Pos Pref” statistic shows that 46% of reads were assigned to the positive class, though the sample contained 50% from each class. These statistics reveal a bias for choosing the negative class (A. lyrata) and a tendency to mistakenly assign A. halleri reads to the A. lyrata parent. Investigation showed the mistakes were non-random, with a few transcripts attracting large portions of the mistaken alignments. Class bias is seen again in Table 3. The two aligners had a positive preference of 29% and 30%, while the random forest post-processor brought these more in balance with values of 47% and 51%. Thus, the trend in class bias seen with genomes was opposite of that seen with transcriptomes. Possible causes and mitigations for the bias will be addressed in the Discussion.
Results with genus Brassica
This experiment used public data from B. oleracea and B. rapa, two species in genus Brassica which includes many cabbage-like cultivars consumed by humans. The transcriptome and genome results are characterized in Tables 4 and 5 respectively. The aligner-plus-machine-learning combination showed the best performance in every trial, as measured by accuracy, F1, or MCC.
Table 4 Classification Performance with Reference Transcripts of genus Brassica
Brassica RNA
|
A
Bowtie2
|
B
Bo_AS
|
C
Bo_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
G
Salmon
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref:
Ties:
|
88.7%
92.3%
85.1%
86.1%
89.1%
0.777
-
-
52.0%
0%
|
91.5%
94.2%
88.9%
89.4%
91.7%
0.832
-
-
52.6%
3.2%
|
93.6%
92.5%
94.6%
94.5%
93.5%
0.872
98.3%
98.2%
49.0%
0%
|
90.2%
93.4%
87.1%
87.8%
90.5%
0.806
-
-
51.8%
0%
|
92.2%
94.6%
89.9%
90.3%
92.4%
0.846
-
-
52.4%
5.0%
|
93.8%
93.7%
93.9%
93.9%
93.8%
0.876
98.4%
98.4%
49.9%
0%
|
84.6%
88.9%
80.3%
81.9%
85.2%
0.694
-
-
52.5%
0%
|
Performance metrics for transcript-based parent-of-origin classification in Brassica. Whether used with Bowtie2 or STAR, the random forest improved the accuracy, F1, and MCC. For directional statistics, species B. rapa and B. oleracea were considered the negative and positive classes, respectively. A, B, C) Using the Bowtie2 aligner, a parent was chosen by the aligner, or by comparing alignment scores, or by the random forest, respectively. D, E, F) Similar to A, B, and C but using the STAR aligner, configured to avoid splicing. G) Parent chosen by the alignment-free software, Salmon.
Table 5 Classification Performance with Reference Genomes of genus Brassica
Brassica DNA
|
A
HiSat2
|
B
Hi_AS
|
C
Hi_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref:
Ties:
|
91.8%
95.1%
88.6%
89.3%
92.1%
0.839
-
-
53.3%
0%
|
92.9%
95.6%
90.3%
90.8%
93.1%
0.860
-
-
52.7%
3.4%
|
95.1%
94.7%
95.6%
95.5%
95.1%
0.903
99.0%
98.9%
49.5%
0%
|
93.0%
96.0%
90.1%
90.6%
93.2%
0.862
-
-
53.0%
0%
|
92.3%
94.6%
90.0%
90.4%
92.5%
0.847
-
-
52.3%
5.6%
|
93.9%
93.2%
94.7%
94.6%
93.9%
0.879
98.4%
98.3%
49.2%
0%
|
Performance metrics for genome-based parent-of-origin classification in Brassica. Whether used with HiSat2 or STAR, the random forest improved the accuracy, F1, and MCC. For directional statistics, species B. rapa and B. oleracea were considered the negative and positive classes, respectively. A, B, C) Parent chosen by the HiSat2 aligner, or by comparing HiSat2 alignment scores, or by the random forest using HiSat2 alignment features, respectively. D, E, F) Similar to columns A, B, and C but using the STAR aligner, configured for splicing.
The overall highest accuracy with Brassica was achieved with HiSat2 plus random forest, and in this case, the accuracy climbed from 89% with the aligner alone to 94% with machine learning. Each mapping software package showed better performance with the Brassica data (Tables 4 and 5), than with the Arabidopsis data (Tables 2 and 3). Consequently, the performance differences between approaches were smaller. Nevertheless, the random forest gains seem smaller in Brassica than in Arabidopsis. A possible explanation is suggested by the smaller numbers of alignment score ties in Brassica (3%-6%) than in Arabidopsis (13%-15%). It may be that the random forest has the most effect when the aligners generate many equally good (or equally bad) alignments between the two parental references.
Mild class imbalance is seen in the Brassica results, though the positive preference was always close to 50%. The alignment-based methods (columns A, B, D, and E) showed the most imbalance between sensitivity and specificity, and these imbalances were reduced by the machine-learning post-process (columns C and F).
Results with genus Mus
This experiment used public data from Mus musculus, the mouse species that serves as a model organism for mammalian genetics. The B6 and D2 strains are inbred strains of the same species, developed and maintained in the laboratory. Parent-of-origin classification accuracy was low, barely exceeding the 50% expected by random guessing. Accuracy did not exceed 57% by any method tested. These results suggest that these within-species hybrids did not present enough genomic variants for parent-of-origin classification. For this reason, the results are shown in supplement, Tables S3 and S4, and the procedure is not recommended for crosses between very similar genotypes. Nevertheless, even on this dataset, the random forest post-processor added value, achieving 56%-57% accuracy, compared to 53%-56% achieved by the aligners directly. As shown in the supplemental tables, the random forest predictions showed positive class bias (toward D2) using transcript references, but negative class bias using genome references.
Results with genus Equus
This experiment used public data from the genus Equus. The equine species of horse, E. caballus, and donkey, E. asinus, can hybridize to yield a mule or hinny, depending on which parent is male or female (48). Results are shown in Tables 6 and 7. The machine-learning method provided the best performance using transcriptomes or genomes. These results indicate that our method is suitable for interspecies hybrids of animals as well as plants.
Table 6 Classification Performance with Reference Transcripts of genus Equus
Equus RNA
|
A
Bowtie2
|
B
Bo_AS
|
C
Bo_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
G
Salmon
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref
Ties:
|
71.9%
78.2%
65.6%
69.5%
73.6%
0.441
-
-
56.3%
0%
|
76.6%
74.9%
78.3%
77.6%
76.2%
0.533
-
-
48.3%
37.3%
|
81.7%
91.0%
72.4%
76.7%
83.2%
0.645
92.0%
91.5%
59.3%
0%
|
71.2%
79.0%
63.3%
68.3%
73.3%
0.428
-
-
57.8%
0%
|
76.6%
75.1%
78.1%
77.4%
76.2%
0.532
-
-
48.5%
40.1%
|
85.2%
79.1%
91.4%
90.2%
84.3%
0.710
93.3%
93.0%
43.9%
0%
|
63.7%
74.8%
52.5%
61.2%
67.3%
0.281
-
-
61.2%
0%
|
By many measures including accuracy, F1, and MCC, the random forest performance surpassed that of the other methods tested. For directional statistics, donkey and horse were considered the negative and positive classes, respectively. A, B, C) Using the Bowtie2 aligner, a parent was chosen by the aligner, or by comparing alignment scores, or by the random forest, respectively. D, E, F) Similar to A, B, and C but using the STAR aligner, configured to avoid splicing. G) Parent chosen by the alignment-free software, Salmon.
Table 7 Classification Performance with Reference Genomes of genus Equus
Equus DNA
|
A
HiSat2
|
B
Hi_AS
|
C
Hi_RF
|
D
STAR
|
E
St_AS
|
F
St_RF
|
Accuracy:
Sensitivity:
Specificity:
Precision:
F1-score:
MCC:
AUPRC:
AUROC:
Pos Pref
Ties:
|
73.1%
72.4%
73.7%
73.3%
72.9%
0.461
-
-
49.4%
0%
|
78.9%
78.2%
79.5%
79.2%
78.7%
0.577
-
-
49.4%
33.6%
|
84.3%
91.0%
77.5%
80.2%
85.3%
0.692
93.9%
93.7%
56.7%
0%
|
71.6%
71.1%
72.2%
71.9%
71.5%
0.432
-
-
49.4%
0%
|
78.3%
77.6%
79.0%
78.7%
78.1%
0.566
-
-
49.3%
36.9%
|
86.2%
81.5%
90.8%
89.8%
85.5%
0.726
94.1%
94.0%
45.4%
0%
|
Table 7 Legend: By many measures including accuracy, F1, and MCC, the random forest performance surpassed that of the other methods tested. For directional statistics, donkey and horse were considered the negative and positive classes, respectively. A, B, C) Parent chosen by the HiSat2 aligner, or by comparing HiSat2 alignment scores, or by the random forest using HiSat2 alignment features, respectively. D, E, F) Similar to columns A, B, and C but using the STAR aligner, configured for splicing.
The overall highest accuracy with Equus was achieved with STAR DNA plus random forest, and in this case, the accuracy climbed from 72% with the aligner alone to 86% with machine learning. The random forest post-process introduced mild levels of class bias on the Equus data, though the direction varied. Positive preference increased when the random forest was used with Bowtie2 or HiSat2 but decreased when used with STAR. This indicates that, at least on these data, the bias is a computational artifact.
Results Summary and Generalization
The results on the three interspecies datasets are compared in Fig. 2. For each of the Arabidopsis, Brassica, and Equus datasets, the approach that achieved the highest accuracy is compared to the other approaches tested with the same aligner. On each dataset, the random forest accuracy surpassed that from relying on the alignment scores or the aligner by itself. On Arabidopsis and Brassica, the maximum accuracy was achieved using HiSat2 alignments to the genome reference, plus the random forest. On Equus, the maximum accuracy was achieved using STAR alignments to the genome reference, plus the random forest. Since no one approach worked best on all three datasets, it may be advantageous to experiment with several aligners, as we have done here, when using other datasets.
Fig. 2 Comparison of three models by accuracy. Using the genera Arabidopsis and Brassica, the parent-of-origin read classification accuracy was maximized using HiSat2 alignments to genomes followed by random forest. Using the genus Equus, accuracy was maximized using STAR alignments to genomes followed by random forest. The vertical axis ranges from 50% (guessing) to 100% (perfect).
Unlike many machine learning models, random forest models can be interrogated for indications of which features were most important. Fig. 3 illustrates the top five features reported for the three most performant models in this study. (In Supplement, Table S5 shows the top five features for all twelve models presented in this paper.) The different rankings in this comparison indicate that the feature importance varied with the dataset and the aligner. However, the features in the intersection of these three lists are Parent (a number indicating the parent with higher alignment score) and MAT diff (the P2-P1 difference in matched bases). Also, there was at least one HQ feature in each list. Our HQ features count problems (mismatches, insertions, deletions) that involve positions in reads assigned the maximal quality score by the sequencing instrument software. Notably, the match counts and quality scores are not reflected in the alignment scores generated by the mappers used here. Thus, it seems the random forest models learned to rely on complementary features.
Fig. 3 Alignment Features Ranked by Importance. Top five alignment features, ranked by importance, used by the random forest model. The three columns here correspond to those in Figure 2. The figure indicates that different models relied on different features. Labels like “P2 R1” refer to the alignment of read 1 to parent 2. Labels with “HQ” count only those events that involve a maximal base call quality score in the read. Labels with “diff” include the difference between parent 2 and 1 alignments. Labels with “MM” and “MAT” refer to mismatched or matched base counts, respectively. Labels with “INS” or “DEL” refer to bases inserted or deleted in the read, respectively. The label “PARENT” indicates a feature that was 1 if P2 had the better combined alignment score, or -1 if P1 had the better score, or 0 if they were tied.
In all the experiments shown so far, each model was trained on one set of read pairs and tested on another. However, each test set was carved from the same RNA-seq runs as its cognate training set. In an experiment on actual hybrids, the training RNA-seq from the parents and would come from different sequencing libraries and sequencing runs compared to the test RNA-seq from the hybrid. We wondered how well our models would perform on RNA-seq runs other than those on which it was trained. In other words, could the model generalize.
We performed a test with the model trained on Equus. We downloaded two additional parental RNA-seq runs from the same Equus project. We selected runs for testing that were generated from different individuals, i.e., a different horse and a different donkey, compared t the runs used for training. These secondary runs were aligned to the parental transcriptomes with Bowtie2. The random forest model that was trained on the primary runs was used without retraining to make predictions on the secondary runs. The model achieved 82.5% accuracy, 0.659 MCC, and 91.4% AUROC on the secondary runs. The full results are shown in Table S6. Overall, the model performed similarly whether evaluated with the primary or secondary runs. Thus, the model was able to generalize and maintain accuracy with RNA-seq runs other than those on which it was trained.