From miRBase Release 22.1, 44 885 mature miRNA sequences were obtained together with their hairpin structures. Alignment of the stem-loop structures by the BLAST search available on Phytozome 13 confirmed the occurrence of only 11 919 mature miRNAs in the genome Linum usitatissimum v1.0 (Phytozome 13, 2022, https://phytozome-next.jgi.doe.gov/). The distribution of the detected microRNAs according to their percentage of positives with the query sequence is presented in Fig. 1. Only microRNAs with a percentage of positives equal to or higher than 80% (441 mature microRNAs) were selected for further analysis.
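The 80% selection step can be illustrated with a minimal Python sketch. It assumes that the percentage of positives is the number of positive-scoring positions divided by the alignment length, as BLAST reports it; the `Hit` record, the function names and the three example hits are hypothetical and are not taken from the study data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit of a stem-loop query against the genome (hypothetical record)."""
    mirna: str       # miRBase-style name, used here for illustration only
    positives: int   # number of positive-scoring positions in the alignment
    align_len: int   # total alignment length


def percent_positives(hit: Hit) -> float:
    """Percentage of positives, assumed to be positives / alignment length."""
    return 100.0 * hit.positives / hit.align_len


def select_hits(hits, threshold=80.0):
    """Keep hits whose percentage of positives is at least the threshold."""
    return [h for h in hits if percent_positives(h) >= threshold]


# Illustrative values only.
hits = [
    Hit("lus-miR156a", 88, 88),   # 100%
    Hit("mdm-miR160b", 72, 90),   # 80%, kept because the threshold is inclusive
    Hit("osa-miR5523", 40, 80),   # 50%, discarded
]
selected = select_hits(hits)
print([h.mirna for h in selected])
```

The comparison is inclusive (`>=`) because the text selects values equal to or higher than 80%.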
The distribution of 34 microRNA families (miR156, miR157, miR159, miR160, miR162, miR164, miR166, miR167, miR168, miR169, miR171, miR172, miR319, miR390, miR393, miR394, miR395, miR396, miR397, miR398, miR399, miR408, miR530, miR828, miR2916, miR3533, miR4426, miR4995, miR5219, miR5288, miR5523, miR8005, miR11602, miR11604) within the new dataset of 441 mature microRNAs is shown in Fig. 2. The most frequent families were miR156 (72), miR160 (50), miR171 (43) and miR167 (41).
Fig. 3 presents the distribution of 47 origins within the new dataset of 441 mature microRNAs. The most represented species is Linum usitatissimum L. – Lus (124), followed by Malus domestica Borkh. – Mdm (31), Manihot esculenta Crantz – Mes (30), Populus trichocarpa Torr. & A.Gray ex. Hook – Ptc (26) and Solanum tuberosum L. – Stu (26).
Aau - Acacia auriculiformis A.Cunn. ex Benth., Aly - Arabidopsis lyrata L., Ama - Avicennia marina Forssk., Aqc - Aquilegia caerulea E. James, Ath - Arabidopsis thaliana L., Bcy - Bruguiera cylindrica L., Bgy - Bruguiera gymnorhiza L., Bna - Brassica napus L., Bra - Brassica rapa L., Bta - Bos taurus L., Cas - Camelina sativa L., Cca - Cynara cardunculus L., Cme - Cucumis melo L., Cpa - Carica papaya L., Csi - Citrus sinensis L., Ctr - Citrus trifoliata L., Dpr - Digitalis purpurea L., Eun - Eugenia uniflora L., Fve - Fragaria vesca L., Ghr - Gossypium hirsutum L., Gma - Glycine max L., Gra - Gossypium raimondii Ulbr., Han - Helianthus annuus L., Hbr - Hevea brasiliensis Willd. ex A. Juss., Hsa - Homo sapiens L., Lja - Lotus japonicus L., Lus - Linum usitatissimum L., Mdm - Malus domestica Borkh., Mes - Manihot esculenta Crantz, Mtr - Medicago truncatula Gaertn., Nta - Nicotiana tabacum L., Osa - Oryza sativa L., Pab - Picea abies L., Pde - Pinus densata Mast., Peu - Populus euphratica Oliv., Pla - Paeonia lactiflora Pall., Ppe - Prunus persica L., Ptc - Populus trichocarpa Torr. & A.Gray ex. Hook, Rco - Ricinus communis L., Sbi - Sorghum bicolor L., Sly - Solanum lycopersicum L., Ssl - Salvia sclarea L., Stu - Solanum tuberosum L., Vca - Vriesea carinata Wawra, Vun - Vigna unguiculata L., Vvi - Vitis vinifera L., Zma - Zea mays L.
The resulting matrix of microRNA families and their origins for the 441 mature microRNAs detected in the genome Linum usitatissimum v1.0, with averaged percentages of positives for their hairpin structures, is presented in Fig. 4. Red squares represent the origins of the 34 microRNA families found in the genome Linum usitatissimum v1.0. For example, the red square for Acacia auriculiformis A.Cunn. ex Benth. - Aau and miR160 indicates that the stem-loop structure of the mature microRNA Aau-miR160 was found in the genome Linum usitatissimum v1.0.
The most widespread miRNA family among the identified species was miR156, which was found in 22 species, followed by miR167 (19 species) and miR171 (18). The species represented in the most microRNA families was Linum usitatissimum L. (23 microRNA families), followed by Manihot esculenta Crantz (23), Glycine max L. (10) and Populus trichocarpa Torr. & A.Gray ex. Hook (10).
The average values for origins and microRNA families represent the average over all detected mature microRNAs belonging to a specific microRNA origin or family, respectively. The highest average was reached by the species Linum usitatissimum L. (100%), followed by Populus euphratica Oliv. (91%) and Cynara cardunculus L. (89%). An average of 100% was reached by the microRNA families miR168, miR397, miR398, miR530 and miR828.
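The averaging by origin and by family can be sketched as follows. The sketch assumes that each mature microRNA carries a miRBase-style name from which the three-letter origin code and the family number can be parsed; the regular expression, the function names and the four example records are illustrative assumptions, not study data.

```python
import re
from collections import defaultdict

# Assumed name pattern: <origin code>-miR<family number><optional suffix>
NAME = re.compile(r"^([A-Za-z]+)-miR(\d+)", re.IGNORECASE)


def split_name(mirna: str):
    """Return (origin, family), e.g. 'lus-miR156a' -> ('Lus', 'miR156')."""
    match = NAME.match(mirna)
    if match is None:
        raise ValueError(f"unexpected microRNA name: {mirna}")
    return match.group(1).capitalize(), "miR" + match.group(2)


def averages(records):
    """Average the percentage of positives per origin and per family."""
    by_origin, by_family = defaultdict(list), defaultdict(list)
    for name, pct in records:
        origin, family = split_name(name)
        by_origin[origin].append(pct)
        by_family[family].append(pct)

    def mean(values):
        return sum(values) / len(values)

    return (
        {origin: mean(v) for origin, v in by_origin.items()},
        {family: mean(v) for family, v in by_family.items()},
    )


# Illustrative values only.
records = [
    ("lus-miR156a", 100.0),
    ("lus-miR168b", 100.0),
    ("peu-miR156c", 90.0),
    ("peu-miR171a", 92.0),
]
by_origin, by_family = averages(records)
print(by_origin)
print(by_family)
```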
The obtained transcriptomes L. usitatissimum v1.0, Linum usitatissimum 1, Linum usitatissimum 2 and project GENOLIN are described in Table 2.
Table 2
Description of obtained transcriptomic data
Name of transcriptome | Contigs | The smallest contig | The longest contig | Average size |
L. usitatissimum v1.0 | 43 484 | 150 bp | 14 619 bp | 1 200 bp |
Linum usitatissimum 1 | 73 195 | 100 bp | 4 448 bp | 329 bp |
Linum usitatissimum 2 | 78 323 | 100 bp | 14 687 bp | 633 bp |
Project GENOLIN | 59 626 | 40 bp | 6 523 bp | 482 bp |
Bp - Base pairs |
The gene sequences were aligned with the transcriptomic data to verify their occurrence in the various Linum usitatissimum L. transcriptomes (L. usitatissimum v1.0, Linum usitatissimum 1, Linum usitatissimum 2 and project GENOLIN). The alignment revealed differences in RNA sequencing quality, that is, in the distribution and number of aligned contigs within the same plant genome; nevertheless, all gene sequences were successfully aligned with each transcriptome. The highest average number of aligned contigs was reached by the transcriptome L. usitatissimum v1.0 (32), followed by Linum usitatissimum 2 (22), Linum usitatissimum 1 (12) and project GENOLIN (9). The average numbers of aligned contigs with query coverage equal to or higher than 50% were 4 for L. usitatissimum v1.0, 1 for Linum usitatissimum 1, 1 for Linum usitatissimum 2 and 0 for project GENOLIN.
To increase the variability of mRNA sequences, aligned FASTA sequences were obtained for contigs that reached a query coverage equal to or higher than 50%. All chosen gene sequences were found in all transcriptomes, although not all with at least 50% query coverage (the exceptions were the sequences of SDH - AF352734.1 and CYP79D4 - AY599896.1). The highest numbers of contigs were found for the sequences of cytochrome P450 monooxygenase CYP71E - MK172858.1, dirigent proteins DIR 3 - KM433755.1 and DIR 6 - KM433752.1, and uridine glycosyltransferase UGT74S1 - JX011632.1 and JN088324.1; however, the number of aligned sequences with at least 50% query coverage was much smaller in all cases. The results are presented in Table 3.
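The 50% query coverage criterion can be sketched in Python as follows. The sketch assumes that query coverage is the percentage of query positions covered by at least one aligned segment, with overlapping segments counted once; the function name, the contig names and the coordinates are hypothetical.

```python
def query_coverage(segments, query_len):
    """Percentage of the query covered by aligned segments.

    segments: (start, end) positions on the query, 1-based and inclusive.
    Overlapping or nested segments are counted only once.
    """
    covered = 0
    last_end = 0  # last query position already counted
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip positions counted before
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_len


# Illustrative alignments against a hypothetical 1 000 bp gene sequence.
contigs = {
    "contig_A": [(1, 300), (250, 600)],  # 600 covered positions -> 60%
    "contig_B": [(100, 200)],            # 101 covered positions -> 10.1%
}
kept = [
    name for name, segments in contigs.items()
    if query_coverage(segments, 1000) >= 50.0
]
print(kept)
```

Only contigs passing the inclusive threshold would be carried forward as aligned FASTA sequences.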
Table 3
Number of aligned contigs with gene sequences of selected enzyme within various Linum usitatissimum L. transcriptomes
Name of enzyme | Accession | L. usitatissimum v1.0 | Linum usitatissimum 1 | Linum usitatissimum 2 | Project GENOLIN | Number of all aligned contigs | Aligned sequences with at least 50% query coverage |
Lignans |
DIR 1 | KM433751.1 | 31 | 16 | 39 | 13 | 99 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
DIR 2 | KM433754.1 | 31 | 11 | 36 | 9 | 87 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
DIR 3 | KM433755.1 | 46 | 18 | 52 | 12 | 128 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
DIR 4 | KM433756.1 | 12 | 2 | 7 | 2 | 23 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
DIR 5 | KM433753.1 | 35 | 15 | 33 | 7 | 90 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
DIR 6 | KM433752.1 | 46 | 25 | 50 | 8 | 129 | |
≥ 50% query coverage | 6 | 1 | 1 | 0 | | 8 |
PLR 1 | AJ849359.1 | 12 | 7 | 14 | 6 | 39 | |
≥ 50% query coverage | 5 | 3 | 3 | 3 | | 14 |
PLR 2 | EU029951.1 | 20 | 11 | 21 | 12 | 64 | |
≥ 50% query coverage | 5 | 3 | 3 | 3 | | 14 |
UGT74S1 | JX011632.1 | 53 | 21 | 35 | 11 | 120 | |
≥ 50% query coverage | 8 | 1 | 3 | 0 | | 12 |
UGT74S1 | JN088324.1 | 53 | 23 | 38 | 11 | 125 | |
≥ 50% query coverage | 3 | 1 | 1 | 0 | | 5 |
SDH | AF352734.1 | 21 | 3 | 6 | 4 | 34 | |
≥ 50% query coverage | 0 | 0 | 0 | 0 | | 0 |
SDH | AF352735.1 | 37 | 2 | 7 | 5 | 51 | |
≥ 50% query coverage | 2 | 0 | 0 | 0 | | 2 |
Cyanogenic glycosides |
CYP79D1 | AF140613.1 | 40 | 16 | 21 | 10 | 87 | |
≥ 50% query coverage | 6 | 2 | 3 | 0 | | 11 |
CYP79D1 | AY834391.1 | 39 | 15 | 22 | 10 | 86 | |
≥ 50% query coverage | 6 | 2 | 3 | 0 | | 11 |
CYP79D2 | AF140614.1 | 27 | 14 | 11 | 6 | 58 | |
≥ 50% query coverage | 6 | 2 | 3 | 0 | | 11 |
CYP79D2 | AY834390.1 | 27 | 15 | 12 | 6 | 60 | |
≥ 50% query coverage | 6 | 2 | 3 | 0 | | 11 |
CYP79D3 | AY599895.1 | 8 | 2 | 5 | 2 | 17 | |
≥ 50% query coverage | 1 | 0 | 0 | 0 | | 1 |
CYP79D4 | AY599896.1 | 8 | 1 | 4 | 2 | 15 | |
≥ 50% query coverage | 0 | 0 | 0 | 0 | | 0 |
CYP71E | MK172858.1 | 72 | 24 | 37 | 29 | 162 | |
≥ 50% query coverage | 2 | 1 | 1 | 1 | | 5 |
CYP71E | AY217351.1 | 34 | 7 | 11 | 17 | 69 | |
≥ 50% query coverage | 2 | 0 | 1 | 0 | | 3 |
UGT85K4 | JF727883.1 | 23 | 5 | 5 | 7 | 40 | |
≥ 50% query coverage | 3 | 1 | 1 | 0 | | 5 |
UGT85K5 | JF727884.1 | 32 | 8 | 11 | 10 | 61 | |
≥ 50% query coverage | 3 | 0 | 0 | 0 | | 3 |
Average for transcriptome | 32 | 12 | 22 | 9 | | |
Average for ≥ 50% coverage | 4 | 1 | 1 | 0 | | |
DIR - Dirigent protein, PLR 1 - (-)-pinoresinol-(-)-lariciresinol reductase 1, PLR 2 - (+)-pinoresinol-(+)-lariciresinol reductase 2, UGT74S - Uridine glycosyltransferase UGT74S, SDH - (-)-secoisolariciresinol dehydrogenase, CYP - Cytochrome P450 monooxygenase, UGT85K - Acetone cyanohydrin β-glucosyltransferase UGT85K
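The row totals and per-transcriptome averages of Table 3 can be recomputed with a short Python sketch. The counts are copied from the rows of all aligned contigs in Table 3; the variable names are illustrative, and rounding to the nearest integer is an assumption about how the reported averages were obtained.

```python
# Counts of all aligned contigs per accession, copied from Table 3.
# Column order: L. usitatissimum v1.0, Linum usitatissimum 1,
# Linum usitatissimum 2, Project GENOLIN.
ALL_CONTIGS = {
    "KM433751.1": (31, 16, 39, 13),  # DIR 1
    "KM433754.1": (31, 11, 36, 9),   # DIR 2
    "KM433755.1": (46, 18, 52, 12),  # DIR 3
    "KM433756.1": (12, 2, 7, 2),     # DIR 4
    "KM433753.1": (35, 15, 33, 7),   # DIR 5
    "KM433752.1": (46, 25, 50, 8),   # DIR 6
    "AJ849359.1": (12, 7, 14, 6),    # PLR 1
    "EU029951.1": (20, 11, 21, 12),  # PLR 2
    "JX011632.1": (53, 21, 35, 11),  # UGT74S1
    "JN088324.1": (53, 23, 38, 11),  # UGT74S1
    "AF352734.1": (21, 3, 6, 4),     # SDH
    "AF352735.1": (37, 2, 7, 5),     # SDH
    "AF140613.1": (40, 16, 21, 10),  # CYP79D1
    "AY834391.1": (39, 15, 22, 10),  # CYP79D1
    "AF140614.1": (27, 14, 11, 6),   # CYP79D2
    "AY834390.1": (27, 15, 12, 6),   # CYP79D2
    "AY599895.1": (8, 2, 5, 2),      # CYP79D3
    "AY599896.1": (8, 1, 4, 2),      # CYP79D4
    "MK172858.1": (72, 24, 37, 29),  # CYP71E
    "AY217351.1": (34, 7, 11, 17),   # CYP71E
    "JF727883.1": (23, 5, 5, 7),     # UGT85K4
    "JF727884.1": (32, 8, 11, 10),   # UGT85K5
}

# Row totals: number of all aligned contigs per gene sequence.
totals = {acc: sum(counts) for acc, counts in ALL_CONTIGS.items()}

# Column averages: mean number of aligned contigs per transcriptome,
# rounded to the nearest integer (assumed rounding rule).
columns = list(zip(*ALL_CONTIGS.values()))
averages = [round(sum(col) / len(col)) for col in columns]

print(totals["MK172858.1"])  # 162
print(averages)              # [32, 12, 22, 9]
```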
For better description and annotation of the transcriptomic data, contigs that reached more than 50% query coverage were sorted by their origin and name/number and assigned to NCBI records in Table 4. The names of some contigs are abbreviated with an ellipsis; for example, contig linum_usitatissimum-20100629:013755 is abbreviated as “…:013755”, medp_linus_20101112|9111 as “…|9111” and genolin_c1251 as “…c1251”. The sets of contigs differ within the same enzyme family for uridine glycosyltransferase UGT74S, secoisolariciresinol dehydrogenase (SDH), cytochrome P450 monooxygenase CYP71E and acetone cyanohydrin β-glucosyltransferase UGT85K. Identical contigs were found within the enzyme families dirigent protein (DIR), pinoresinol-lariciresinol reductase (PLR) and cytochrome P450 monooxygenase CYP79D.
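The abbreviation rule for contig names can be expressed as a small Python function. The rule is inferred from the three examples given above (keep the part from the last ":" or "|" onward, or drop the "genolin_" prefix); the function name is hypothetical, and identifiers that match neither pattern are returned unchanged.

```python
def abbreviate(contig: str) -> str:
    """Shorten a contig name in the style used in Table 4 (inferred rule)."""
    for sep in (":", "|"):
        if sep in contig:
            # Keep the separator and everything after its last occurrence.
            return "…" + sep + contig.rsplit(sep, 1)[1]
    if contig.startswith("genolin_"):
        return "…" + contig[len("genolin_"):]
    return contig  # e.g. Lus identifiers are not abbreviated


for name in (
    "linum_usitatissimum-20100629:013755",
    "medp_linus_20101112|9111",
    "genolin_c1251",
    "Lus10017538",
):
    print(name, "->", abbreviate(name))
```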
Table 4
Alignment of NCBI records with contigs that reached more than 50% query coverage coming from various transcriptomic data
Name of enzyme family | Name of enzyme | Accession | L. usitatissimum v1.0 | Linum usitatissimum 1 | Linum usitatissimum 2 | Project GENOLIN |
Lignans |
Dirigent protein (DIR) | DIR 1 | KM433751.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
DIR 2 | KM433754.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
DIR 3 | KM433755.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
DIR 4 | KM433756.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
DIR 5 | KM433753.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
DIR 6 | KM433752.1 | Lus10017538 Lus10017539 Lus10024714 Lus10024715 Lus10028749 Lus10032331 | …:013755 | …|9111 | - |
Pinoresinol-lariciresinol reductase (PLR) | PLR 1 | AJ849359.1 | Lus10007599 Lus10010403 Lus10012143 Lus10012145 Lus10012147 | …:002028 …:003965 …:007509 | …|1482 …|2895 …|15565 | …c1251 …c2534 …c3993 |
PLR 2 | EU029951.1 | Lus10007599 Lus10010403 Lus10012143 Lus10012145 Lus10012147 | …:002028 …:003965 …:007509 | …|1482 …|2895 …|15565 | …c1251 …c2534 …c3993 |
Uridine glycosyltransferase UGT74S | UGT74S1 | JX011632.1 | Lus10006351 Lus10006352 Lus10006353 Lus10006721 Lus10008742 Lus10014148 Lus10017825 Lus10024486 | …:007634 | …|1633 …|1734 …|3550 | - |
UGT74S1 | JN088324.1 | Lus10006353 Lus10014148 Lus10017825 | …:007634 | …|3550 | - |
(-)-secoisolariciresinol dehydrogenase (SDH) | SDH | AF352734.1 | - | - | - | - |
SDH | AF352735.1 | Lus10016997 Lus10021320 | - | - | - |
Cyanogenic glycosides |
Cytochrome P450 monooxygenase CYP79D | CYP79D1 | AF140613.1 | Lus10023143 Lus10023144 Lus10026341 Lus10031145 Lus10031151 Lus10031726 | …:000584 …:003269 | …|434 …|4579 …|9159 | - |
CYP79D1 | AY834391.1 | Lus10023143 Lus10023144 Lus10026341 Lus10031145 Lus10031151 Lus10031726 | …:000584 …:003269 | …|434 …|4579 …|9159 | - |
CYP79D2 | AF140614.1 | Lus10023143 Lus10023144 Lus10026341 Lus10031145 Lus10031151 Lus10031726 | …:000584 …:003269 | …|434 …|4579 …|9159 | - |
CYP79D2 | AY834390.1 | Lus10023143 Lus10023144 Lus10026341 Lus10031145 Lus10031151 Lus10031726 | …:000584 …:003269 | …|434 …|4579 …|9159 | - |
CYP79D3 | AY599895.1 | Lus10031145 | - | - | - |
CYP79D4 | AY599896.1 | - | - | - | - |
Cytochrome P450 monooxygenase CYP71E | CYP71E | MK172858.1 | Lus10011499 Lus10023140 | …:002191 | …|9048 | …c2430 |
CYP71E | AY217351.1 | Lus10011499 Lus10023140 | - | …|9048 | - |
Acetone cyanohydrin β-glucosyltransferase UGT85K | UGT85K4 | JF727883.1 | Lus10000632 Lus10025741 Lus10035903 | …:001112 | …|32 | - |
UGT85K5 | JF727884.1 | Lus10025741 Lus10031388 Lus10035903 | - | - | - |
DIR - Dirigent protein, PLR 1 - (-)-pinoresinol-(-)-lariciresinol reductase 1, PLR 2 - (+)-pinoresinol-(+)-lariciresinol reductase 2, UGT74S - Uridine glycosyltransferase UGT74S, SDH - (-)-secoisolariciresinol dehydrogenase, CYP - Cytochrome P450 monooxygenase, UGT85K - Acetone cyanohydrin β-glucosyltransferase UGT85K
Verification of the annotation of enzyme families in the contigs of the transcriptomic data, performed with the algorithm Finding genes by keyword within the transcriptome L. usitatissimum v1.0, showed that many contigs are not annotated. Because very similar keywords for a single enzyme family returned different results, either the annotations or the keyword discrimination settings of the algorithm are not refined enough to allow searches with the same expected output. The number of hits does not correspond to the number of matches with our results.
A full match was observed for the enzyme families dirigent protein (DIR), but only with the keyword “dirigent” (6/6), pinoresinol-lariciresinol reductase (PLR) with the keywords “pinoresinol” and “lariciresinol” (5/5), and acetone cyanohydrin β-glucosyltransferase UGT85K with the keywords “UGT85” and “UGT” (4/4). The results are presented in Table 5.
Table 5
Verification and demonstration of annotation of enzyme families in the transcriptome L. usitatissimum v1.0 based on the algorithm Finding genes by keyword available on Phytozome 13.
Enzyme family | Used keyword | Hits | Match with our results |
Lignans |
Dirigent protein (DIR) | Dirigent | 44 | 6/6 |
Dirigent protein | 32 | 2/6 |
DIR | 228 | 0/6 |
Pinoresinol-lariciresinol reductase (PLR) | Pinoresinol | 8 | 5/5 |
Lariciresinol | 8 | 5/5 |
Pinoresinol-lariciresinol reductase 1 | 37603 | 0/5 |
Pinoresinol-lariciresinol reductase 2 | 43471 | 0/5 |
PLR | 2 | 0/5 |
Uridine glycosyltransferase UGT74S | Uridine glycosyltransferase UGT74S | 554 | 0/8 |
UGT74S | 0 | 0/8 |
UGT74 | 1 | 0/8 |
UGT | 67 | 0/8 |
Uridine glycosyltransferase | 554 | 0/8 |
Glycosyltransferase | 423 | 0/8 |
Uridine | 170 | 0/8 |
UDP glycosyltransferase | 136 | 3/8 |
UDP | 578 | 0/8 |
(-)-secoisolariciresinol dehydrogenase (SDH) | Secoisolariciresinol dehydrogenase SDH | 1591 | 0/2 |
Secoisolariciresinol dehydrogenase | 1589 | 0/2 |
Secoisolariciresinol | 0 | 0/2 |
Dehydrogenase | 1589 | 0/2 |
SDH | 17 | 0/2 |
Cyanogenic glycosides |
Cytochrome P450 monooxygenase CYP79D | Cytochrome P450 monooxygenase CYP79D | 911 | 0/6 |
Cytochrome P450 monooxygenase CYP79 | 911 | 1/6 |
CYP79D | 0 | 0/6 |
CYP79 | 6 | 6/6 |
Cytochrome P450 monooxygenase | 12 | 0/6 |
Cytochrome | 812 | 0/6 |
P450 | 490 | 0/6 |
Monooxygenase | 464 | 0/6 |
Cytochrome P450 monooxygenase CYP71E | Cytochrome P450 monooxygenase CYP71E | 911 | 0/2 |
Cytochrome P450 monooxygenase CYP71 | 911 | 0/2 |
CYP71E | 0 | 0/2 |
CYP71 | 1 | 0/2 |
Cytochrome P450 monooxygenase | 12 | 0/2 |
Cytochrome | 812 | 0/2 |
P450 | 490 | 0/2 |
Monooxygenase | 464 | 0/2 |
Acetone cyanohydrin β-glucosyltransferase UGT85K | Acetone cyanohydrin β-glucosyltransferase UGT85K | 7 | 0/4 |
β-glucosyltransferase UGT85K | 0 | 0/4 |
UGT85K | 0 | 0/4 |
UGT85 | 35 | 4/4 |
UGT | 67 | 4/4 |
Glucosyltransferase | 390 | 1/4 |
β-glucosyltransferase | 0 | 0/4 |
Cyanohydrin | 7 | 0/4 |
Acetone | 0 | 0/4 |
DIR - Dirigent protein, PLR - pinoresinol-lariciresinol reductase, UGT74S - Uridine glycosyltransferase UGT74S, SDH - (-)-secoisolariciresinol dehydrogenase, CYP - Cytochrome P450 monooxygenase, UGT85K - Acetone cyanohydrin β-glucosyltransferase UGT85K
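The "Match with our results" column of Table 5 can be read as the number of our contigs that also appear among the keyword hits, over the number of our contigs. The Python sketch below assumes this interpretation. The PLR contig identifiers are taken from Table 4, whereas the keyword hit list is hypothetical: the three `other_hit_*` entries are placeholders that only bring the list to the 8 hits reported for the keyword "Pinoresinol".

```python
def match_ratio(keyword_hits, our_contigs):
    """Return 'matched/total' for our contigs found among the keyword hits."""
    ours = set(our_contigs)
    matched = ours & set(keyword_hits)
    return f"{len(matched)}/{len(ours)}"


# PLR contigs of L. usitatissimum v1.0 listed in Table 4.
plr_contigs = [
    "Lus10007599", "Lus10010403", "Lus10012143",
    "Lus10012145", "Lus10012147",
]

# Hypothetical keyword search result: the five contigs above plus
# three placeholder entries standing in for unrelated hits.
pinoresinol_hits = plr_contigs + ["other_hit_1", "other_hit_2", "other_hit_3"]

print(match_ratio(pinoresinol_hits, plr_contigs))  # 5/5
print(match_ratio([], plr_contigs))                # 0/5
```

Under this reading, a large number of hits does not imply a high match ratio, which is consistent with keywords that return hundreds of hits but match none of the contigs.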
The input data for the prediction were the gene sequences (Table 1) together with the aligned sequences of contigs from Table 4 and the dataset of 441 selected mature microRNAs (Figs. 1, 2, 3 and 4). The results show that only the microRNA family miR160 was predicted for dirigent proteins (DIR1-DIR6). For both PLRs ((-)-pinoresinol-(-)-lariciresinol reductase 1 and (+)-pinoresinol-(+)-lariciresinol reductase 2), the microRNA families miR159, miR164, miR166, miR167, miR171, miR395, miR399 and miR5219 were predicted in common. For uridine glycosyltransferase UGT74S1, the microRNA families miR156, miR157, miR159, miR164, miR167, miR319 and miR395 were found. Within the enzyme family secoisolariciresinol dehydrogenase (SDH), the microRNA families miR172, miR396 and miR5523 were identified. In total, 15 microRNA families (miR156, miR157, miR159, miR160, miR164, miR166, miR167, miR171, miR172, miR319, miR395, miR396, miR399, miR5219 and miR5523) were predicted for the biosynthetic pathway of lignans, of which the most active seem to be miR159, miR164, miR167 and miR395, regulating the enzyme families PLR and UGT74S.
Within the first key enzyme family of cyanogenic glycosides, cytochrome P450 monooxygenase CYP79D, the microRNA families miR160, miR171, miR319, miR2916 and miR11602 were predicted. For cytochrome P450 monooxygenase CYP71E, the microRNA families miR168, miR171, miR319 and miR396 were found. The enzyme family acetone cyanohydrin β-glucosyltransferase UGT85K is probably regulated by the microRNA families miR159, miR160, miR393 and miR5219. In total, 10 microRNA families (miR159, miR160, miR168, miR171, miR319, miR393, miR396, miR2916, miR5219 and miR11602) were identified within the biosynthetic pathway of cyanogenic glycosides. The most active seem to be miR160, which can regulate the enzyme families CYP79D and UGT85K, and miR171 together with miR319, which regulate the enzyme families CYP79D and CYP71E.
Of all 19 identified microRNA families, 6 were predicted for both biosynthetic pathways: miR159 (regulating PLR, UGT74S and UGT85K), miR160 (regulating DIR, CYP79D and UGT85K), miR171 (regulating PLR, CYP79D and CYP71E), miR319 (regulating UGT74S, CYP79D and CYP71E), miR396 (regulating SDH and CYP71E) and miR5219 (regulating PLR and UGT85K). The results are presented in Fig. 5 as a matrix in which red squares represent a match.
DIR - Dirigent protein, PLR 1 - (-)-pinoresinol-(-)-lariciresinol reductase 1, PLR 2 - (+)-pinoresinol-(+)-lariciresinol reductase 2, UGT74S - Uridine glycosyltransferase UGT74S, SDH - (-)-secoisolariciresinol dehydrogenase, CYP - Cytochrome P450 monooxygenase, UGT85K - Acetone cyanohydrin β-glucosyltransferase UGT85K
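The microRNA families shared by both pathways can be derived from the per-enzyme predictions with set operations. The Python sketch below uses the family lists stated in the text for each enzyme family; the dictionary layout and the names `LIGNANS`, `CYANOGENIC` and `families` are illustrative choices.

```python
# Predicted microRNA families per enzyme family, as listed in the text.
LIGNANS = {
    "DIR": {"miR160"},
    "PLR": {"miR159", "miR164", "miR166", "miR167",
            "miR171", "miR395", "miR399", "miR5219"},
    "UGT74S": {"miR156", "miR157", "miR159", "miR164",
               "miR167", "miR319", "miR395"},
    "SDH": {"miR172", "miR396", "miR5523"},
}
CYANOGENIC = {
    "CYP79D": {"miR160", "miR171", "miR319", "miR2916", "miR11602"},
    "CYP71E": {"miR168", "miR171", "miR319", "miR396"},
    "UGT85K": {"miR159", "miR160", "miR393", "miR5219"},
}


def families(pathway):
    """Union of the microRNA families predicted for all enzymes of a pathway."""
    return set().union(*pathway.values())


def family_number(name):
    """Numeric part of a family name, used for sorting (miR159 -> 159)."""
    return int(name[3:])


shared = sorted(families(LIGNANS) & families(CYANOGENIC), key=family_number)
total = len(families(LIGNANS) | families(CYANOGENIC))

print(shared)  # ['miR159', 'miR160', 'miR171', 'miR319', 'miR396', 'miR5219']
print(total)   # 19
```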
Based on the abovementioned results, an original scheme of microRNA participation in the biosynthetic pathways of lignans and cyanogenic glycosides, including metabolites and enzymes, was designed (Fig. 6).
DIR - Dirigent protein, PLR 1 - (-)-pinoresinol-(-)-lariciresinol reductase 1, PLR 2 - (+)-pinoresinol-(+)-lariciresinol reductase 2, UGT74S1 - Uridine glycosyltransferase UGT74S1, SDH - (-)-secoisolariciresinol dehydrogenase, CYP - Cytochrome P450 monooxygenase, UGT85K - Acetone cyanohydrin β-glucosyltransferase UGT85K