1. HML-9 elements identification, localization, and actual distribution in hg38
According to the BLAT results of LTR14C-HERVK14C-LTR14C in GRCh38/hg38, we characterized a total of 23 HERV-K HML-9 provirus elements. Each HML-9 element is named according to the genomic locus of integration (Table 1). It has been identified that the average length of the provirus is 5473 bp. Among them, 6 elements are longer than 70% of the full length of the reference, 9 elements are between 40–70% in length, and the remaining 8 sequences are between 17.11% and 34.26% of the reference sequence in length. For solo LTRs, 47 solo LTR elements of HERV-K HML-9 were characterized in total. Of these, 44 solo LTRs (93.62%) are longer than 90% of the representative reference LTR14C in length. The nucleotide sequence of each element was shown in Supplementary dataset 1–2. The whole HML-9 element distribution was displayed based on Ensembl (www.ensembl.org) (Fig. 1A).
Table 1
HML-9 provirus distribution.
Number
|
Locus
|
Chromosome
|
Strand
|
Position start
|
Position end
|
Length (bp)
|
Match + mismatch(bp)/full length(bp)
|
Range
|
Qgap(bp)/match + mismatch + Qgap(bp)
|
Insertion or deletion
|
Intergenic/intron/exon
|
Gene including the region
|
1.
|
16p12.3
|
chr16
|
-
|
19393581
|
19402152
|
8572
|
96.00%
|
(90%-100%)
|
1.01%
|
NA
|
exon_intron
|
AC130456.2
|
2.
|
2p12
|
chr2
|
+
|
82022660
|
82031279
|
8620
|
95.91%
|
(90%-100%)
|
1.13%
|
NA
|
intergenic
|
NA
|
3.
|
15q21.1
|
chr15
|
-
|
45234477
|
45243073
|
8597
|
95.34%
|
(90%-100%)
|
1.85%
|
NA
|
exon_intron
|
AC051619.4
|
4.
|
8p11.1
|
chr8
|
-
|
43694016
|
43702583
|
8568
|
95.10%
|
(90%-100%)
|
2.14%
|
NA
|
intergenic
|
NA
|
5.
|
13q31.1
|
chr13
|
+
|
84869526
|
84877320
|
7795
|
86.84%
|
(80%-90%)
|
6.67%
|
NA
|
exon_intron
|
AL445588.1
|
6.
|
4q33
|
chr4
|
-
|
170126345
|
170133883
|
7539
|
70.03%
|
(70%-80%)
|
0.79%
|
Insertion
|
intergenic
|
NA
|
7.
|
6p12.3
|
chr6
|
+
|
48873675
|
48879725
|
6051
|
64.84%
|
(60%-70%)
|
34.48%
|
Deletion
|
intergenic
|
NA
|
8.
|
Yp11.2
|
chrY
|
-
|
9273707
|
9279611
|
5905
|
59.83%
|
(50%-60%)
|
39.23%
|
Deletion
|
intergenic
|
NA
|
9.
|
8q24.3
|
chr8
|
+
|
145019974
|
145032719
|
12746
|
57.06%
|
(50%-60%)
|
0.79%
|
Insertion
|
intergenic
|
NA
|
10.
|
Yq11.223
|
chrY
|
+
|
21580120
|
21585551
|
5432
|
57.04%
|
(50%-60%)
|
38.30%
|
Deletion
|
exon_intron
|
TTTY13
|
11.
|
19q13.2
|
chr19
|
+
|
40954172
|
40959178
|
5007
|
56.66%
|
(50%-60%)
|
41.93%
|
Deletion
|
intergenic
|
NA
|
12.
|
Yp11.2
|
chrY
|
-
|
8121821
|
8126768
|
4948
|
54.57%
|
(50%-60%)
|
44.92%
|
Deletion
|
intergenic
|
NA
|
13.
|
Yp11.2
|
chrY
|
+
|
8996062
|
9000755
|
4694
|
50.80%
|
(50%-60%)
|
41.95%
|
Deletion
|
intergenic
|
NA
|
14.
|
Yq11.222
|
chrY
|
-
|
18622534
|
18626952
|
4419
|
47.33%
|
(40%-50%)
|
52.23%
|
Deletion
|
intergenic
|
NA
|
15.
|
Yq11.223
|
chrY
|
-
|
21845475
|
21850069
|
4595
|
43.18%
|
(40%-50%)
|
49.78%
|
Deletion
/Insertion
|
exonic_intergenic
|
AC024236.1
|
16.
|
21q21.1
|
chr21
|
-
|
18563368
|
18566735
|
3368
|
34.26%
|
(30%-40%)
|
8.19%
|
NA
|
exon_intron
|
MIR548XHG
|
17.
|
5q33.3
|
chr5
|
-
|
156660448
|
156663815
|
3368
|
34.14%
|
(30%-40%)
|
8.50%
|
NA
|
intron
|
SGCD
|
18.
|
1q22
|
chr1
|
-
|
155629408
|
155632775
|
3368
|
33.75%
|
(30%-40%)
|
9.56%
|
NA
|
intron
|
AL353807.5
|
19.
|
7q36.1
|
chr7
|
-
|
150561277
|
150563994
|
2718
|
27.92%
|
(20%-30%)
|
10.27%
|
Deletion
|
intergenic
|
NA
|
20.
|
8q21.13
|
chr8
|
+
|
78652302
|
78654820
|
2519
|
26.60%
|
(20%-30%)
|
0.30%
|
NA
|
intron
|
AC068700.2
|
21.
|
10q24.2
|
chr10
|
-
|
99822511
|
99825532
|
3022
|
25.36%
|
(20%-30%)
|
24.65%
|
Deletion/
Insertion
|
intron
|
ABCC2
|
22.
|
12q13.11
|
chr12
|
+
|
48509228
|
48511681
|
2454
|
18.44%
|
(10%-20%)
|
33.18%
|
Deletion/
Insertion
|
intergenic
|
NA
|
23.
|
Yq11.222
|
chrY
|
-
|
17669948
|
17671523
|
1576
|
17.11%
|
(10%-20%)
|
12.69%
|
Deletion
|
intergenic
|
NA
|
Next, the expected number of integration of HML-9 elements per chromosome was predicted and compared with the number of actually detected sites to assess whether HML-9 existed randomly in the human genome. The results showed that the number of HML-9 integration events observed is always inconsistent with the expected amounts (Fig. 1B-C). For the proviral elements, the number of HML-9 insertions in chromosomes 8, 13, 15, 16, 19, 21, and Y were higher than expected. In particular, the provirus elements in the Y chromosome were significantly higher than the predicted elements by the Chi-square test (p < 0.05). In chromosomes 1, 2, 4, 5, 6, 7, 10, and 12, the actual numbers identified are lower than expected (Fig. 1B). Especially, there is no provirus integration in chromosomes 3, 9, 11, 14, 17, 18, 20, 22, and X at all. With respect to the solo LTR elements, the number of HML-9 solo LTRs in chromosomes 2, 3, 14, 15, 18, 21, X, and Y were higher than expected. In chromosomes 1, 4, 5, 6, 7, 8, 10,11, 12, 13, 17, and 20 the actual numbers identified are lower than expected. Especially, there is no provirus integration in chromosomes 9, 16, 19, and 22 at all (Fig. 1C). Analysis revealed that HML-9 provirus and solo LTR integration displayed a non-randomly integration way among human chromosomes.
Further, all 23 identified provirus elements and 47 solo LTRs were analyzed to determine their locations in intergenic regions, intron, or exon (Table 1–2). The results show that 13 provirus elements are located in intergenic regions, accounting for 56.52%. 4 provirus elements are located in introns, accounting for 17.39%. 6 provirus elements are located in both introns and exons, accounting for 26.09% (Table 1). With respect to solo LTRs, 28 solo LTRs are located in intergenic regions, accounting for 59.57%. The remaining 19 solo LTRs are located in introns, accounting for 40.43% (Table 2). The results displayed an apparent insertion preference into intergenic regions and introns.
Table 2
HML-9 Solo LTR tracks distribution.
Number
|
Locus
|
Chromosome
|
Strand
|
Position start
|
Position end
|
Length (bp)
|
Percentage of LTR14C in length
|
Match + mismatch/full length
|
Range
|
Qgap(bp)/match + mismatch + Qgap(bp)
|
Insertion or deletion
|
Intergenic/intron/exon
|
Gene including the region
|
1
|
14q21.1
|
chr14
|
+
|
38011040
|
38012012
|
973
|
101.36%
|
6.91%
|
(0%-10%)
|
35.61%
|
Deletion, Insertion
|
intron
|
TTC6
|
2
|
Xq21.32
|
chrX
|
-
|
93273183
|
93274197
|
1015
|
100.85%
|
6.88%
|
(0%-10%)
|
2.47%
|
Insertion
|
intergenic
|
NA
|
3
|
2q31.1
|
chr2
|
+
|
180236847
|
180237437
|
591
|
100.34%
|
6.84%
|
(0%-10%)
|
0.34%
|
NA
|
intergenic
|
NA
|
4
|
18p11.31
|
chr18
|
-
|
4527618
|
4528209
|
592
|
100.17%
|
6.83%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
5
|
2q11.2
|
chr2
|
+
|
97964920
|
97965508
|
589
|
100.00%
|
6.82%
|
(0%-10%)
|
0.00%
|
NA
|
intron
|
TMEM131
|
6
|
15q14
|
chr15
|
+
|
39011033
|
39011621
|
589
|
100.00%
|
6.82%
|
(0%-10%)
|
0.00%
|
NA
|
intron
|
LINC02694
|
7
|
2p12
|
chr2
|
-
|
81304430
|
81305068
|
639
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
8
|
2q32.3
|
chr2
|
-
|
194256159
|
194256746
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
9
|
3q26.1
|
chr3
|
-
|
163283189
|
163283777
|
589
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intron
|
LINC01192
|
10
|
4q26
|
chr4
|
+
|
116980222
|
116980809
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.34%
|
NA
|
intergenic
|
NA
|
11
|
4p15.31
|
chr4
|
+
|
19556097
|
19556684
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.17%
|
NA
|
intron
|
AC024230.1
|
12
|
7p21.2
|
chr7
|
+
|
14509240
|
14509827
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intron
|
DGKB
|
13
|
8q11.21
|
chr8
|
-
|
51178592
|
51179179
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
14
|
11q12.3
|
chr11
|
-
|
62185237
|
62185824
|
588
|
99.83%
|
6.81%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
15
|
2q21.3
|
chr2
|
+
|
135521883
|
135522470
|
588
|
99.66%
|
6.80%
|
(0%-10%)
|
0.17%
|
NA
|
intron
|
ZRANB3
|
16
|
3p12.2
|
chr3
|
-
|
81329902
|
81330488
|
587
|
99.66%
|
6.80%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
17
|
3p12.1
|
chr3
|
-
|
83618409
|
83618995
|
587
|
99.66%
|
6.80%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
18
|
5q21.3
|
chr5
|
-
|
105998962
|
105999549
|
588
|
99.66%
|
6.80%
|
(0%-10%)
|
0.34%
|
NA
|
intergenic
|
NA
|
19
|
10p12.31
|
chr10
|
-
|
18856645
|
18857233
|
589
|
99.66%
|
6.80%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
20
|
Yq11.23
|
chrY
|
-
|
25974734
|
25975320
|
587
|
99.66%
|
6.80%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
21
|
6q14.1
|
chr6
|
-
|
82297755
|
82298498
|
744
|
99.49%
|
6.78%
|
(0%-10%)
|
0.34%
|
Insertion
|
intergenic
|
NA
|
22
|
Xp22.2
|
chrX
|
+
|
11033746
|
11034330
|
585
|
99.32%
|
6.77%
|
(0%-10%)
|
0.51%
|
NA
|
intron
|
AC073529.1
|
23
|
2p12
|
chr2
|
+
|
77807602
|
77808185
|
584
|
99.15%
|
6.76%
|
(0%-10%)
|
0.68%
|
NA
|
intron
|
AC012494.1
|
24
|
1q23.3
|
chr1
|
+
|
162419359
|
162419942
|
584
|
98.98%
|
6.75%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
25
|
Xq27.2
|
chrX
|
-
|
142767872
|
142768454
|
583
|
98.98%
|
6.75%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
26
|
2q31.1
|
chr2
|
+
|
171365032
|
171365617
|
586
|
98.64%
|
6.73%
|
(0%-10%)
|
1.19%
|
NA
|
intron
|
METTL8
|
27
|
5q13.3
|
chr5
|
+
|
75859521
|
75860102
|
582
|
98.64%
|
6.73%
|
(0%-10%)
|
0.17%
|
NA
|
intergenic
|
NA
|
28
|
Xq27.3
|
chrX
|
-
|
144791258
|
144791846
|
589
|
98.47%
|
6.71%
|
(0%-10%)
|
1.37%
|
NA
|
intergenic
|
NA
|
29
|
12q12
|
chr12
|
+
|
38144469
|
38145052
|
584
|
98.30%
|
6.70%
|
(0%-10%)
|
1.03%
|
NA
|
intergenic
|
NA
|
30
|
15q21.3
|
chr15
|
-
|
54594796
|
54595373
|
578
|
98.13%
|
6.69%
|
(0%-10%)
|
2.04%
|
NA
|
intron
|
UNC13C
|
31
|
21q11.2
|
chr21
|
-
|
14080466
|
14081052
|
587
|
97.79%
|
6.67%
|
(0%-10%)
|
2.05%
|
NA
|
intron
|
AP001347.1
|
32
|
4q28.2
|
chr4
|
+
|
129080872
|
129081454
|
583
|
97.61%
|
6.66%
|
(0%-10%)
|
2.22%
|
NA
|
intron
|
SCLT1
|
33
|
3q25.2
|
chr3
|
-
|
154944330
|
154944911
|
582
|
97.44%
|
6.64%
|
(0%-10%)
|
2.56%
|
NA
|
intergenic
|
NA
|
34
|
11q24.2
|
chr11
|
+
|
124270705
|
124271275
|
571
|
96.93%
|
6.61%
|
(0%-10%)
|
0.00%
|
NA
|
intergenic
|
NA
|
35
|
2q14.3
|
chr2
|
-
|
125024208
|
125024792
|
585
|
96.76%
|
6.60%
|
(0%-10%)
|
3.07%
|
NA
|
intergenic
|
NA
|
36
|
6q27
|
chr6
|
+
|
169084226
|
169084808
|
583
|
96.76%
|
6.60%
|
(0%-10%)
|
3.07%
|
NA
|
intergenic
|
NA
|
37
|
13q13.3
|
chr13
|
-
|
38319721
|
38320300
|
580
|
96.76%
|
6.60%
|
(0%-10%)
|
3.24%
|
NA
|
intron
|
LINC00571
|
38
|
7q35
|
chr7
|
+
|
143472173
|
143472744
|
572
|
96.08%
|
6.55%
|
(0%-10%)
|
3.75%
|
NA
|
intron
|
EPHA1-AS1
|
39
|
14q21.3
|
chr14
|
+
|
48011215
|
48011780
|
566
|
95.91%
|
6.54%
|
(0%-10%)
|
0.53%
|
NA
|
intergenic
|
NA
|
40
|
3p21.31
|
chr3
|
+
|
44534488
|
44535059
|
572
|
95.74%
|
6.53%
|
(0%-10%)
|
3.77%
|
NA
|
intergenic
|
NA
|
41
|
2q22.1
|
chr2
|
-
|
138860917
|
138861512
|
596
|
95.06%
|
6.48%
|
(0%-10%)
|
3.79%
|
NA
|
intergenic
|
NA
|
42
|
12p13.32
|
chr12
|
-
|
4720007
|
4720593
|
587
|
94.89%
|
6.47%
|
(0%-10%)
|
4.95%
|
NA
|
intron
|
AC005833.1
|
43
|
3p14.2
|
chr3
|
-
|
59469489
|
59470030
|
542
|
91.82%
|
6.26%
|
(0%-10%)
|
3.23%
|
NA
|
intron
|
AC126121.3
|
44
|
1q24.2
|
chr1
|
+
|
168457190
|
168457732
|
543
|
90.80%
|
6.19%
|
(0%-10%)
|
0.37%
|
NA
|
intron
|
AL023755.1
|
45
|
20p13
|
chr20
|
-
|
2809052
|
2809886
|
835
|
88.42%
|
6.03%
|
(0%-10%)
|
0.38%
|
Insertion
|
intergenic
|
NA
|
46
|
18q21.33
|
chr18
|
+
|
63648105
|
63648555
|
451
|
76.49%
|
5.22%
|
(0%-10%)
|
0.22%
|
NA
|
intron
|
SERPINB11
|
47
|
17q22
|
chr17
|
+
|
52961655
|
52962071
|
417
|
70.87%
|
4.83%
|
(0%-10%)
|
0.24%
|
NA
|
intergenic
|
NA
|
2. Structural Characterization
To define the structural characteristics of HML-9 elements, the 23 proviruses were further analyzed by comparing them with the reference LTR14C-HERVK14C-LTR14C. According to the annotation information summarized in Dfam database (https://www.dfam.org/family/DF0000189/features), the complete HML-9 showed a typical proviral structure, containing 4 open reading frames (ORFs) and 2 flanking LTRs. In detail, the 5’ LTR located from nucleotide 1 to 587, the gag gene located from nucleotide 758 to 2548, the pro gene located from nucleotide 2548 to 3435, the pol gene located from nucleotides 3411 to 6060, the env gene located from nucleotides 5975 to 8020, and 3’ LTR located from nucleotide 8022 to 8608. We aligned the 23 HML-9 proviral sequences and annotated the position of the single retroviral component and deletions to describe the structure of each HML-9 provirus element (Fig. 2). Thereinto, the 16p12.3, 2p12, 15q21.1, 8p11.1, 13q31.1 and 4q33 are longer than 70% of complete reference in length. Further, the integrity of 6 separate regions (5’ LTR, gag, pro, pol, env, and 3’ LTR) was summarized in Table 3.
Table 3
The integrity of 6 separate regions relative to the corresponding sections of reference.
Number
|
Locus
|
Provirus Regions
|
5’LTR
|
gag
|
pro
|
pol
|
env
|
3’LTR
|
1
|
16p12.3
|
chr16 19393581 19402152
|
100.00%
|
99.83%
|
99.89%
|
99.39%
|
99.17%
|
99.66%
|
2
|
2p12
|
chr2 82022660 82031279
|
98.98%
|
99.72%
|
99.89%
|
99.43%
|
99.90%
|
99.15%
|
3
|
15q21.1
|
chr15 45234477 45243073
|
99.83%
|
99.27%
|
100.00%
|
99.66%
|
99.56%
|
99.83%
|
4
|
8p11.1
|
chr8 43694016 43702583
|
99.83%
|
99.44%
|
98.31%
|
99.39%
|
99.66%
|
99.83%
|
5
|
13q31.1
|
chr13 84869526 84877320
|
35.78%
|
99.55%
|
52.20%
|
99.77%
|
99.80%
|
99.32%
|
6
|
4q33
|
chr4 170126345 170133883
|
99.66%
|
99.89%
|
99.77%
|
99.70%
|
13.93%
|
0.00%
|
7
|
6p12.3
|
chr6 48873675 48879725
|
99.15%
|
92.79%
|
65.84%
|
6.13%
|
99.85%
|
99.66%
|
8
|
Yp11.2
|
chrY 9273707 9279611
|
88.42%
|
90.73%
|
63.81%
|
6.36%
|
98.83%
|
99.49%
|
9
|
8q24.3
|
chr8 145019974 145032719
|
0.00%
|
0.00%
|
0.79%
|
99.77%
|
99.90%
|
95.23%
|
10
|
Yq11.223
|
chrY 21580120 21585551
|
88.25%
|
91.74%
|
64.83%
|
6.25%
|
99.36%
|
15.84%
|
11
|
19q13.2
|
chr19 40954172 40959178
|
98.98%
|
98.94%
|
64.26%
|
6.17%
|
67.40%
|
77.51%
|
12
|
Yp11.2
|
chrY 8121821 8126768
|
99.66%
|
76.49%
|
14.09%
|
6.40%
|
99.17%
|
98.81%
|
13
|
Yp11.2
|
chrY 8996062 9000755
|
0.00%
|
80.23%
|
64.37%
|
6.47%
|
99.80%
|
95.55%
|
14
|
Yq11.222
|
chrY 18622534 18626952
|
96.76%
|
75.21%
|
0.00%
|
0.00%
|
84.75%
|
98.47%
|
15
|
Yq11.223
|
chrY 21845475 21850069
|
0.00%
|
70.85%
|
64.60%
|
6.40%
|
99.32%
|
99.15%
|
16
|
21q21.1
|
chr21 18563368 18566735
|
0.00%
|
0.00%
|
95.26%
|
96.12%
|
0.00%
|
0.00%
|
17
|
5q33.3
|
chr5 156660448 156663815
|
0.00%
|
0.00%
|
95.26%
|
96.12%
|
0.00%
|
0.00%
|
18
|
1q22
|
chr1 155629408 155632775
|
0.00%
|
0.00%
|
95.26%
|
96.12%
|
0.00%
|
0.00%
|
19
|
7q36.1
|
chr7 150561277 150563994
|
0.00%
|
0.00%
|
0.00%
|
16.91%
|
99.02%
|
54.51%
|
20
|
8q21.13
|
chr8 78652302 78654820
|
0.00%
|
0.00%
|
0.00%
|
0.00%
|
87.34%
|
100.00%
|
21
|
10q24.2
|
chr10 99822511 99825532
|
0.00%
|
0.00%
|
52.29%
|
95.43%
|
0.00%
|
0.00%
|
22
|
12q13.11
|
chr12 48509228 48511681
|
0.00%
|
0.00%
|
88.72%
|
63.52%
|
0.00%
|
0.00%
|
23
|
Yq11.222
|
chrY 17669948 17671523
|
52.81%
|
62.09%
|
0.00%
|
0.00%
|
0.00%
|
0.00%
|
It can be noted that many of the proviruses should be retrotransposed pseudogenes. Although these genes have been inactivated, they still can provide us with information about the evolutionary history of the gene family or the organism.
3. Phylogenetic analyses
To characterize the phylogenetic relationship of the HML-9 group, 5 proviral sequences longer than 80% of the HML-9 reference were screened to generate ML trees together with all representatives of Dfam HERV-K groups (HML-1 to 10) and exogenous Betaretroviruses. The results showed that the 5 identified proviruses all clustered with the Dfam HML-9 reference by a 100% bootstrap support (Fig. 3A). Subsequently, phylogenetic trees of a total of 44 identified solo LTRs longer than 90% of the LTR14C were constructed together with the LTR reference (Fig. 3B). Next, 4 ML trees of sub-regions whose lengths are longer than 90% of the corresponding section of the reference sequence were constructed, including 10 gag elements, 8 pro elements, 11 pol elements, and 13 env elements, respectively (Fig. 3C-E). These phylogenetic groups of different regions of HML-9 all clustered together and were clearly separated from the other HERV-K groups (HML1-8, 10). Thereinto, two distinct clusters in the pro and pol group were observed, respectively. They were statistically supported by 100% bootstrap values and were named type I and type II. The results showed that 21q21.1 (chr21:18563368–18566735), 1q22 (chr1:155629408–155632775), and 5q33.3 (chr5:156660448–156663815) were divided into type I; 15q21.1 (chr15:45234477–45243073), 16p12.3 (chr16:19393581–19402152), 4q33 (chr4: 170126345–170133883), 8p11.1 (chr8:43694016–43702583), and 2p12(chr2: 82022660–82031279) were divided into type II. Type II sequences included the Dfam HML-9 reference, whereas type I elements showed a more divergent relationship relative to the group references, which suggested these sequences were amplified at least 2 times after the initial genome integration.
4. Estimated time of integration
The HML-9 proviral age in the human genome was estimated based on the LTRs, gag, pro, pol, and env regions, respectively. Each region whose length exceeds 90% of the corresponding section of reference was selected to calculate the integration time. For gag, pro, pol, and env regions, the ancestral sequence of each region has been generated via Mega 7, following the ML method based on the multiple alignments of all elements. The T value (integration time) has been estimated by the relation T = D/0.2, where 0.2 represents the human genome neutral mutation rate expressed in substitutions/nucleotide/million years. To LTRs, as the 5' and 3' LTRs of the same provirus are identical at the time of integration and accumulate random substitutions in an independent manner [46], the T value was estimated by the relation T = D/0.2/2. For each region of a provirus, we provided details on the period of proviruses formation in Table 4. Overall, the time for evaluation based on the four regions (gag, pro, pol, and env) is larger than that for evaluation based on LTR. The majority of HML-9 elements (gag, pro, pol, and env) found in the human genome have been integrated between 37.5 and 151.5 million years ago (mya). The average time of integration was 76 mya. Interestingly, we found that the average integration time of type I was around 100 (99.92) mya, while the average integration time of type II is about 70 (68.85) mya. The integration time of type I was about 30 mya earlier than that of type II. However, the LTRs have been integrated between 17.5 and 48.5 million years ago (mya).
Table 4
Estimated time of HML-9 elements integration
locus
|
Provirus regions
|
Divergence from Consensus sequence
|
Mean Divergences
|
T = D/0.2
|
Age/ million years (gene vs consensus)
|
Divergence between 2 LTRs
|
T = D/0.2/2
|
Age/million years (LTR vs LTR)
|
gag
|
pro
|
pol
|
env
|
16p12.3
|
chr16 19393581 19402152
|
0.059
|
0.158
|
0.206
|
0.089
|
0.128
|
0.64
|
64.00
|
0.082
|
0.20500
|
20.50000
|
2p12
|
chr2 82022660 82031279
|
0.061
|
0.182
|
0.204
|
0.101
|
0.137
|
0.685
|
68.50
|
0.070
|
0.17500
|
17.50000
|
15q21.1
|
chr15 45234477 45243073
|
0.051
|
0.126
|
0.206
|
0.099
|
0.121
|
0.6025
|
60.25
|
0.080
|
0.20000
|
20.00000
|
8p11.1
|
chr8 43694016 43702583
|
0.091
|
0.177
|
0.231
|
0.121
|
0.155
|
0.775
|
77.50
|
0.107
|
0.26750
|
26.75000
|
13q31.1
|
chr13 84869526 84877320
|
0.054
|
NA
|
0.208
|
0.103
|
0.122
|
0.608333333
|
60.83
|
NA
|
NA
|
NA
|
4q33
|
chr4 170126345 170133883
|
0.058
|
0.172
|
0.214
|
NA
|
0.148
|
0.74
|
74.00
|
NA
|
NA
|
NA
|
6p12.3
|
chr6 48873675 48879725
|
0.063
|
NA
|
NA
|
0.106
|
0.085
|
0.4225
|
42.25
|
0.110
|
0.27500
|
27.50000
|
Yp11.2
|
chrY 9273707 9279611
|
0.114
|
NA
|
NA
|
0.139
|
0.127
|
0.6325
|
63.25
|
0.141
|
0.35250
|
35.25000
|
8q24.3
|
chr8 145019974 145032719
|
NA
|
NA
|
0.513
|
0.093
|
0.303
|
1.515
|
151.50
|
NA
|
NA
|
NA
|
Yq11.223
|
chrY 21580120 21585551
|
0.105
|
NA
|
NA
|
0.140
|
0.123
|
0.6125
|
61.25
|
NA
|
NA
|
NA
|
19q13.2
|
chr19 40954172 40959178
|
0.075
|
NA
|
NA
|
|
0.075
|
0.375
|
37.50
|
0.097
|
0.24250
|
24.25000
|
Yp11.2
|
chrY 8121821 8126768
|
NA
|
NA
|
NA
|
0.125
|
0.125
|
0.625
|
62.50
|
0.157
|
0.39250
|
39.25000
|
Yp11.2
|
chrY 8996062 9000755
|
NA
|
NA
|
NA
|
0.133
|
0.133
|
0.665
|
66.50
|
NA
|
NA
|
NA
|
Yq11.222
|
chrY:18622534–18626952
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
NA
|
0.194
|
0.48500
|
48.50000
|
Yq11.223
|
chrY 21845475 21850069
|
NA
|
NA
|
NA
|
0.143
|
0.143
|
0.715
|
71.50
|
NA
|
NA
|
NA
|
21q21.1
|
chr21 18563368 18566735
|
NA
|
0.266
|
0.190
|
NA
|
0.228
|
1.14
|
114.00
|
NA
|
NA
|
NA
|
5q33.3
|
chr5 156660448 156663815
|
NA
|
0.215
|
0.160
|
NA
|
0.188
|
0.9375
|
93.75
|
NA
|
NA
|
NA
|
1q22
|
chr1 155629408 155632775
|
NA
|
0.212
|
0.156
|
NA
|
0.184
|
0.92
|
92.00
|
NA
|
NA
|
NA
|
7q36.1
|
chr7 150561277 150563994
|
NA
|
NA
|
NA
|
0.197
|
0.197
|
0.985
|
98.50
|
NA
|
NA
|
NA
|
10q24.2
|
chr10 99822511 99825532
|
NA
|
NA
|
0.169
|
NA
|
0.169
|
0.845
|
84.50
|
NA
|
NA
|
NA
|
Additionally, the integration time of HML-9 elements in chimpanzees was also estimated. The results based on 4 internal regions (gag, pro, pol, and env) show they range from 22.5 mya to 117.5 mya (the average time of integration was 45.33 mya), which is later than that in humans (data not published).
5. Function prediction of cis-regulatory regions and enrichment analysis
The GREAT analysis can provide a prediction of potentially regulated genes based on spatial proximity. The results describing the associations between each solo LTR and its putatively regulated gene(s) were shown in Supplementary Table 1. There are a total of 69 genes that have been predicted. Among these, 5 solo LTRs are not associated with any genes, 15 solo LTRs are associated with 1 gene, and 27 solo LTRs are associated with 2 genes (Fig. 4A). The absolute distances to the transcription start site (TSS) of 3 genes are less than 5 kb, 13 genes are between 5 and 50 kb, 34 genes are between 50 and 500 kb, and 19 genes are more than 500 kb (Fig. 4B-C).
To analyze the biological classification of key genes related to solo LTRs, GO Slim summaries were performed. The summary of biological processes (BP) revealed that these genes were mainly enriched in biological regulation, metabolic process, multicellular organismal process, response to stimulus, cell communication, developmental process, localization, and cellular component organization (Fig. 4D). The GO Slim summary of cellular component (CC) showed that these genes significantly enriched in membrane and nucleus, and GO Slim summary of the molecular function (MF) showed that these genes significantly enriched in protein binding and ion binding (Fig. 4E-F).
Further, these potential regulatory genes are all annotated to the selected functional categories and used for enrichment analysis. According to the false discovery rate (FDR) value, the TOP 10 significant GO terms for biological process included the regulation of endothelial cell chemotaxis, regulation of natural killer cell-mediated immunity, positive regulation of synapse assembly, natural killer cell-mediated immunity, regulation of synapse assembly, positive regulation of chemotaxis, synapse assembly, regulation of synapse organization, regulation of synapse structure or activity, and synapse organization (Fig. 5A). The bar chart shows the enrichment ratio of the results. When top results are chosen to be returned and the FDR for the categories is ≤ 0.05, the colors of the bars are in a darker shade than when the FDR exceeds 0.05 (Fig. 5A). The volcano plot in Fig. 5B shows the log2 of the FDR versus the enrichment ratio for all the functional categories in the database, highlighting the degree by which the significant categories stand out from the background. The size and color of the dot are proportional to the number of overlapping (for ORA). The significantly enriched categories are labeled, and the labels are positioned automatically by a force field-based algorithm at startup. The enrichment results for cellular component and molecular function are illustrated in Fig. 5C-F. It must be noted that these results are entirely predictive and that future research is required to confirm any of the implied associations between the solo LTRs and the nearby genes.
In addition to solo LTRs, the prediction of putatively regulated genes by proviral LTRs was also performed by GREAT. Enrichment analysis was carried out as described for solo LTR. The results describing the associations between each proviral LTR and its putatively regulated gene(s) were shown in Supplementary Table 2. There are a total of 36 genes that have been predicted. Among these, 4 proviral LTRs are not associated with any genes, 6 proviral LTRs are associated with 1 gene, and 15 proviral LTRs are associated with 2 genes (Supplementary Fig. 1A). Unexpectedly, the 4 proviral LTRs associated with none genes belong to two pairs of 5’ and 3’ LTRs of the same provirus, 2p12 (chr2 82022660 82031279) and Yp11.2 (chrY 8121821 8126768), respectively. Especially, 2p12 (chr2 82022660 82031279) is a rather complete provirus. The absolute distances to the TSS of 0 genes are less than 5 kb, 13 genes are between 5 and 50 kb, 15 genes are between 50 and 500 kb, and 8 genes are more than 500 kb (Supplementary Fig. 1B-C). The biological classification of key genes related to provirus LTRs by GO Slim summaries was shown in Supplementary Fig. 1D-F). The enrichment results for biological process, cellular component, and molecular function were illustrated in Supplementary Fig. 2–4.
6. In silico examination of the conserved transcription factor binding sites
HML-9 exhibiting specific base insertions may influence the complexity of LTR transcriptional regulation [51]. A complete view of the putative transcription factor binding sites within the HML-9 LTR were shown in Fig. 6A. A total of 22 human transcription binding sites were predicted, including 19 transcription factors: EHF, SOX10, FOS, FOSL1, FOSL2, JUNB, JUND, ETV4, KLF1, KLF5, KLF7, ZNF263, THAP1, SP4, RBPJ, HAND2, MAZ, NEUROG2, and NEUROD1. The motifs are marked on the sense strand and antisense strand of the consensus sequence.
7. PBS type of HML-9 sequences
Traditionally HERVs have been named per the tRNA that binds their RT enzyme and PBS [52]. Thus, HERV-K is named after lysine-tRNA. In the analyzed 5 proviral and consensus sequences of HML-9 elements, the PBS was located approximately from nucleotide 3 to 20 in nucleotides downstream the 5’LTR. To summarize the general variation of the PBS sequence within the HML-9 group we generated a logo in which the letter height is proportional to the nucleotides/amino acid conservation at each position (Fig. 6B-C). The result showed the TGG starting nucleotides were the most conserved among the 18 bases analyzed which had a tryptophan (W) PBS type, followed by arginine (R), confirming the relatively low taxonomic value of this feature.