A cis-Tether Terminator, Linc-GmSTT1, Regulates Transcription Termination via the Linc-GmSTT1-intermolecular Interactome in Soybean

Soybean β-conglycinin α-subunit is an important allergen that adversely affects the nutritional and processing qualities of soya products. Although inheritance of the α-subunit and the molecular basis of α-null mutations have been studied intensively, the molecular mechanism that regulates α-subunit expression remains unclear. Here, we demonstrate that a long intergenic non-coding RNA, acting as a soybean cis-tether terminator1 (designated Linc-GmSTT1), regulates β-conglycinin α-subunit expression. Linc-GmSTT1 was mapped in physical proximity to the α-subunit gene CG-α-1 and shown to be a crucial element of the convergent alpha-transcription termination unit (alpha-TTU). Through transcriptional readthrough, Linc-GmSTT1 and CG-α-1 are co-transcribed, and Linc-GmSTT1 subsequently exerts its Cgy-2-locus-specific (confirmed α-normal) regulatory function via the Linc-GmSTT1 intermolecular interactome. This work provides a unique model in which a lincRNA regulates the effective transcriptional termination of a proximal protein-coding gene, which might be a crucial process protecting that gene from the silencing machinery in plants.

Identification and characterization of Linc-GmSTT1. The α-subunit gene expression is controlled at the transcriptional and post-transcriptional levels 6,9. To identify long non-coding RNAs (lncRNAs) associated with the α-null trait, we used RNA sequencing (RNA-Seq) to assess the genome-wide lncRNA expression profiles in DN47 compared with its α-null NIL. The lincRNA transcript MSTRG128686 (designated Linc-GmSTT1) was located in the intergenic non-coding region between the α-subunit genes CG-α-1 and CG-α-2 in the Cgy-2 locus (AB604030) of DN47 (Fig. 1A, B). Linc-GmSTT1 was located 27 bp downstream of the 3′ untranslated region (UTR) of CG-α-1 in DN47 (Fig. 1A, B), but was not detected in NIL (Fig. 1D). Linc-GmSTT1 cis-regulated the differential expression of seven coding genes (Fig. 1A, Table S1). To obtain the full-length cDNA corresponding to MSTRG128686, we performed 5′ and 3′ RACE. The cDNA from developing seeds of DN47 (20 days after flowering; DAF) was amplified using MSTRG128686-specific primers and sequenced; the product (1212 bp) was designated Linc-GmSTT1 (Fig. S1). It contained 84 bp of the 5′ UTR, two exons, one intron, and 81 bp of the 3′ UTR (Fig. 1C). Quantitative real-time PCR (qRT-PCR) results agreed with the RNA-Seq data, showing that Linc-GmSTT1 expression was significantly lower in NIL than in DN47 (Fig. 1E). Thus, the presence or absence of Linc-GmSTT1 was associated with the α-subunit normal or null phenotype.
Identification of Linc-GmSTT1 in DN47 led us to compare the intergenic non-coding region of the Cgy-2 locus between DN47 and NIL (Fig. 1B). In addition to the coding-region IR of the alpha-IR locus reported previously 5, we detected two intergenic terminal inverted repeat (ITIR) units (Fig. 1B) that were identical in DN47 and NIL. The first unit, located immediately downstream of the 3′ UTR-end of the CG-α-1 flanking region, was a 27-bp ITIR1(L) fragment. The corresponding inverted repeat sequence of ITIR1(L), designated ITIR1(R), was embedded in the 3′ UTR of CG-α-2 (Fig. 1B, red arrowheads). The second ITIR was designated ITIR2(L)/ITIR2(R). ITIR2(L) was a 364-bp sequence located upstream of Linc-GmSTT1 and comprised 84 bp within the 5′ UTR region and 280 bp extending into exon 1 of Linc-GmSTT1, whereas ITIR2(R) was located in reverse orientation in proximity to the 3′ UTR-end of CG-α-2 (Fig. 1B, green arrowheads). These ITIR units might produce a double-stranded RNA molecule potentially involved in the silencing regulatory pathways.
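To illustrate what it means for a left and a right ITIR fragment to form an inverted repeat pair, the following minimal Python sketch tests whether one sequence is the reverse complement of another. The function names and the toy sequences are hypothetical and are not the actual ITIR1 or ITIR2 sequences; the sketch assumes a perfect, gap-free repeat, which real ITIR units need not satisfy.

```python
# Translation table for DNA complementation (upper- and lower-case bases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_inverted_repeat_pair(left: str, right: str) -> bool:
    """True if `right` is exactly the reverse complement of `left`.

    Assumption: a perfect inverted repeat with no mismatches or gaps.
    """
    return reverse_complement(left).upper() == right.upper()


# Toy example (hypothetical sequences, not taken from the Cgy-2 locus).
toy_left = "GATTACAGG"
toy_right = reverse_complement(toy_left)
print(toy_left, toy_right, is_inverted_repeat_pair(toy_left, toy_right))
```

When transcribed on the same RNA molecule, such a pair can base-pair with itself, which is the basis for the double-stranded RNA proposed above.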
Although the coding sequences of the two α-subunit genes were identical in DN47 and NIL, we observed striking differences in the intergenic non-coding region from nt 38681395 to 38682596. We named this region HVR (hypervariable region; Fig. 1B). The HVR co-segregated with absence of the α-subunit.
Convergent transcription termination of CG-α-1 and Linc-GmSTT1 genes. The CG-α-1 gene was closely associated with the lincRNA because the Linc-GmSTT1 gene was located only 27 bp from the 3′ UTR-end of CG-α-1 (Fig. 1A). Closely spaced genes are particularly prone to inducing occlusion and interference in transcription, especially when expressed simultaneously 28. Normal expression of CG-α-1 and Linc-GmSTT1 in DN47 appeared to result from a balance between two factors: (1) transcriptional interference and (2) initiation of transcription of the downstream Linc-GmSTT1. We speculated that co-transcription of Linc-GmSTT1 and CG-α-1 resulted in nascent transcripts from the adjacent sequences including CG-α-1 and Linc-GmSTT1, whereby the CG-α-1/Linc-GmSTT1 cryptic transcripts might be the first step required for initiation and termination of Linc-GmSTT1 transcription as well as for regulating transcription termination of the Cgy-2 locus.
To assess the possibility of co-transcription, we searched for evidence of transcriptional readthrough of CG-α-1 into the Linc-GmSTT1 gene using RT-PCR amplification followed by sequencing. A CG-α-1-specific forward primer (F1) located within the 3′ UTR of CG-α-1 was designed; in addition, eight reverse primers were designed: five (R1-R5) within the alpha-TTU region, R6 and R7 within the intergenic region downstream of alpha-TTU, and R8 within the 3′ UTR of CG-α-2 (Fig. 2A, Table S2). PCR products were amplified using the F0+R0 and F1+R1-R5 primer pairs, but not the F1+R6-R8 primer pairs (Fig. 2B). The amplified sequences were consistent with the target fragment sequences (Data S1). These data indicated that transcription of CG-α-1 continued through the intergenic region and extended into Linc-GmSTT1. Therefore, we speculated that, by sharing a common alpha-TTU with CG-α-1, Linc-GmSTT1 might effectively complete both initiation and termination, simultaneously contributing to regulation of transcription termination of the Cgy-2 locus.
CRISPR/Cas9-mediated targeted mutagenesis of Linc-GmSTT1. To determine whether mutation of Linc-GmSTT1 was responsible for the α-null mutant phenotype, we used CRISPR/Cas9 technology to knock out the Linc-GmSTT1 gene in the 'DongNong 50' (DN50) background (α-subunit normal) (Fig. 3). Genomic DNA was extracted from 25 independent transgenic T0 seeds and used to amplify the Linc-GmSTT1 gene region by PCR (Fig. 3B), followed by DNA sequencing to validate the targeted Linc-GmSTT1 gene disruption in the transgenic T0 plants. Nineteen T0 transgenic events contained a CRISPR/Cas9-edited large fragment deletion (Fig. 3C). One example (event 11, Linc-GmSTT1-13a) is illustrated in Fig. 3B, D-F. Event 11 represented a homozygous 551 bp deletion in the Linc-GmSTT1 gene (Fig. 3B, D).
Identification of Linc-GmSTT1-RNA and Linc-GmSTT1-protein interactions. Using ChIRP-Seq, we detected 3073 Linc-GmSTT1-RNA-interacting peaks (Fig. S3B), annotated as 2372 genes (Dataset S2). GO and KEGG analyses showed that the most abundant Linc-GmSTT1-RNA interacting genes (LRGs) were involved in the biological process "response to cadmium ion" (60 genes, Fig. S3D Left) and that the majority of LRGs were enriched in the spliceosome pathway (46 genes, Fig. S3D Right). These data indicate that Linc-GmSTT1 uses numerous RNA-RNA intermolecular interactions to achieve its regulatory function.
Intriguingly, we found that Linc-GmSTT1 specifically bound to its own genomic sequence in the Cgy-2 locus in DN47 but not in NIL (Fig. 5A). Furthermore, we identified three Matrix motifs located within the genomic sequence of Linc-GmSTT1. The first motif contained GTTGG, the second contained ATAATTG, and the third contained TACAGT (Fig. 5B). Comparing the sequences with those in NIL, we found that these motifs were novel putative Linc-GmSTT1-specific end-terminating cis-regulatory Matrix motifs that were detected in the cultivars Williams 82 and DN47, but not in the corresponding HVR region in NIL (Fig. 5B, C).

Discussion
Emerging data indicate that lincRNAs can function as tethers. In theory, a lincRNA has an intrinsic cis-regulatory capacity because it is able to function while tethered to its own locus 29; by remaining tethered to the site of transcription, it can uniquely direct allelic regulation 30-32. Our data indicate that the α-subunit-associated Linc-GmSTT1 may recognize targets by DNA-RNA recognition, and that Linc-GmSTT1 acts by binding to its own genomic sequence, functioning as a tether for Cgy-2-locus-specific and allelic control (Fig. 6B). The discovery that Linc-GmSTT1 is a cis-tether terminator involved in transcription termination was unexpected because very few spontaneous lincRNA terminators of transcription have been described in plants.
Recently, a study using full-length cDNA datasets from humans and mice showed that lincRNAs predominantly originate from the vicinity of protein-coding genes, and that transcription of certain lincRNAs depends on the same promoter regions as the nearby protein-coding genes 11,28. Being positioned close to their target protein-coding genes, lincRNAs might depend on the same promoter regions to regulate expression of the protein-coding genes, which might be a common lincRNA-mediated regulatory mechanism in higher eukaryotes. In the present study, we demonstrated that Linc-GmSTT1 depends on the same transcriptional termination region as CG-α-1. Transcription of CG-α-1 continued through the intergenic region and the entire Linc-GmSTT1, resulting in nascent transcripts from adjacent sequences that included both CG-α-1 and Linc-GmSTT1 (Fig. 2). Moreover, the majority of the 2372 LRGs were classified as spliceosome pathway genes (Fig. S3D-right). We speculate that chimeric CG-α-1/Linc-GmSTT1 cryptic transcripts first require precise splicing to release the mature CG-α-1 mRNA and Linc-GmSTT1 transcript, which might explain the predominant classification of LRGs as spliceosome pathway genes. How the splicing of a single co-transcription unit yields a Linc-GmSTT1 transcript distinct from that of the protein-coding gene CG-α-1 needs further research.
LincRNAs interact with numerous DNAs, RNAs and proteins for accurate transcriptional regulation 33-42. Our data indicate that Linc-GmSTT1 has the unique capacity to interact simultaneously with multiple DNA, RNA and protein molecules (Fig. 6B, C, D), and suggest that Linc-GmSTT1 acts at nearly every level of transcriptional regulation. These results provide strong evidence that there is no single mechanism by which the α-subunit is strictly and effectively transcribed. However, we have not yet established how the functions of Linc-GmSTT1 are coordinated to interact with a specific interactome of DNA, RNA and protein sequences and thereby regulate the expression of the α-subunit at different transcriptional levels. This issue will require further study.
LincRNA-mediated convergent transcription termination in the intergenic region is a new model (Fig. 6) for regulating the expression of a soybean seed storage protein subunit. Based on the data presented here, we propose that a convergent terminator system of tail-to-tail α-subunit genes regulates the expression of the α-subunit at two levels: (1) co-transcription of CG-α-1 and Linc-GmSTT1 at the transcriptional level (Fig. 6A), and (2) post-transcriptional regulation of the Cgy-2 locus, specifically mediated by Linc-GmSTT1 (Fig. 6B, C, D), which is a prerequisite for the normal expression of the α-subunit (Fig. 6E). Both mutation (Fig. 6F) and knockout (Fig. 6G) of Linc-GmSTT1 result in inefficient termination of the α-subunit CG-α-1 gene (Fig. 6H) and induce post-transcriptional α-subunit gene silencing (Fig. 6I). Proper transcription termination of the α-subunit gene might be a crucial process in protecting it from the silencing machinery. The possibility of intergenic lincRNA-mediated regulation in other similar tail-to-tail gene pairs is yet to be examined.

Methods
Plant material and growth conditions. The α-null type NIL used in this study was developed by four generations of backcrossing a line harboring cgy-2 (confirmed α-null) from RiB with DN47, followed by five generations of selfing to generate a BC4F5 NIL population (Fig. S4). We previously used this population to investigate α-null-related transcription-level changes 8. Standard farming practices were used to grow the BC4F5 NIL plants in a randomized block design at the Northeast Agricultural University Experimental Station, China. Pod samples were collected during the seed development stage at 20 DAF (Fig. S4C) during the summer of 2018. SDS-PAGE and western blot analyses confirmed that the α-null phenotype was stably inherited in NIL (Fig. S4D). The BC4F5 seeds harvested in 2018 were used for the ChIRP analyses. 'DongNong 50' (DN50), a soybean cultivar that shows high transformation efficiency, was used for CRISPR/Cas9 analysis.
Phenotype screening for the α-subunit-null mutation in the NIL using SDS-PAGE analysis. The absence of the α-subunit of β-conglycinin was confirmed in the collected NIL seed samples by analyzing the subunit composition of seed proteins by SDS-PAGE (Supplementary Fig. 1D).
Library checking. A Qubit® RNA Assay Kit was used to measure the RNA concentrations of the prepared libraries, after which samples were diluted to 1 ng/μL. The insert sizes were evaluated using an Agilent Bioanalyzer 2100 system (Agilent Technologies), and appropriate inserts were quantified using a TaqMan fluorescence probe and a StepOne Plus Real-Time PCR System (Applied Biosystems) (valid library concentration > 10 nM).
Library clustering and sequencing. A cBot cluster-generation system with a TruSeq PE Cluster Kit (version 4) cBot-HS (Illumina, San Diego, CA, USA) was used to complete the clustering of the index-coded samples. The libraries were sequenced on an Illumina platform after clustering to generate 150-bp paired-end reads.
Data quality control. Perl scripts were used to process the raw data to ensure suitable data quality for subsequent analyses. The reference genome and the annotation files were downloaded from the ENSEMBL database (http://www.ensembl.org/index.html). The genome index was built using Bowtie2 (version 2.2.3). Clean sequence data were mapped to the reference genome using TopHat (version 2.0.12), which was also used to recognize exon-exon junctions by separating the mapped reads and remapping them to the reference genome. TopHat uses Bowtie2 for mapping, which improves the accuracy and speed of the analysis.
Quantification of gene expression levels. Read counts for each gene in every sample were determined using HTSeq (version 0.6.0), after which the number of reads per kilobase per million mapped reads (RPKM) was computed with the following equation to approximate the gene expression levels in each sample:
$$\mathrm{RPKM} = \frac{10^{9} \times R}{N \times L}$$
where R represents the number of reads for a particular gene in a specific sample, N denotes the total number of mapped reads in a specific sample, and L is the length of a particular gene (in base pairs).
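As a worked illustration of the RPKM definition, with R, N and L as defined above, the calculation can be sketched in Python as follows. The function name and the example numbers are hypothetical and are not values from this study; the sketch uses the standard definition with gene length in base pairs.

```python
def rpkm(read_count: float, total_mapped_reads: float, gene_length_bp: float) -> float:
    """Reads per kilobase per million mapped reads.

    RPKM = 1e9 * R / (N * L), where R is the read count for the gene,
    N the total number of mapped reads in the sample, and L the gene
    length in base pairs.
    """
    if total_mapped_reads <= 0 or gene_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * read_count / (total_mapped_reads * gene_length_bp)


# Hypothetical example: 100 reads on a 1000-bp gene in a sample
# with 1,000,000 mapped reads.
print(rpkm(100, 1_000_000, 1000))
```

Dividing by both N and L makes values comparable across samples with different sequencing depths and across genes of different lengths.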
Analysis of differentially expressed genes. The DESeq (version 1.16) program was used to analyze differentially expressed genes (DEGs) between DN47 and NIL in accordance with a negative binomial distribution model. A P-value was assigned to each gene, and the Benjamini-Hochberg method was used to control the false discovery rate. Genes with |log2 ratio| ≥ 1 and q ≤ 0.05 were recognized as DEGs.
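The thresholding step described above can be sketched as follows. This is a plain-Python illustration of the Benjamini-Hochberg adjustment followed by the |log2 ratio| ≥ 1 and q ≤ 0.05 filter; it is not the DESeq implementation, and the function names and example gene records are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


def call_degs(genes, log2_cutoff=1.0, q_cutoff=0.05):
    """Select DEGs from (gene_id, log2_ratio, p_value) records."""
    q_values = benjamini_hochberg([p for _, _, p in genes])
    return [
        gene_id
        for (gene_id, log2_ratio, _), q in zip(genes, q_values)
        if abs(log2_ratio) >= log2_cutoff and q <= q_cutoff
    ]


# Hypothetical records: (gene_id, log2 ratio, raw P-value).
example = [("g1", 2.0, 0.01), ("g2", 0.5, 0.04), ("g3", -1.5, 0.03), ("g4", 3.0, 0.5)]
print(call_degs(example))
```

In this toy input only g1 passes both thresholds: g2 fails the fold-change cutoff, and g3 and g4 exceed the q cutoff after adjustment.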
Quantitative real-time PCR validation. Total RNA was reverse-transcribed into cDNA using SuperScript III Reverse Transcriptase (Invitrogen, Grand Island, NY, USA) following the manufacturer's instructions. A 2× PCR Master Mix and an Applied Biosystems ViiA 7 Real-Time PCR System were used for qRT-PCR analysis with incubation for 10 min at 95°C, followed by 40 cycles of 60°C for 1 min and 95°C for 10 s. The 2^(−ΔΔCt) method was used to calculate the relative mRNA and lincRNA expression levels, which were normalized to GAPDH as an endogenous reference transcript. The data shown represent the means of three repetitions.
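The relative-expression calculation named above can be sketched as follows, assuming the standard ΔΔCt formulation in which the target Ct is first normalized to the reference transcript (here GAPDH) within each sample and then compared with a calibrator sample. The function name and the Ct values are hypothetical and are not measurements from this study.

```python
def relative_expression(ct_target_sample: float, ct_ref_sample: float,
                        ct_target_calibrator: float, ct_ref_calibrator: float) -> float:
    """Relative expression by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference), computed per sample
    ddCt = dCt(sample) - dCt(calibrator)
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values: target amplifies 4 cycles later (relative to the
# reference) in the test sample than in the calibrator.
print(relative_expression(28.0, 18.0, 24.0, 18.0))
```

A value below 1 indicates lower expression in the test sample than in the calibrator; the calculation assumes roughly 100% amplification efficiency for both primer pairs.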
5′ and 3′ RACE of MSTRG128686. The 5′ RACE PCR amplification was performed based on the Invitrogen 5′ RACE system manual. For cDNA first-strand synthesis, 5 μL total RNA was mixed with 1 μL random primer and incubated at 70°C for 5 min, followed by placement in an ice bath for 2 min. Then, 2.0 μL of 5× first-strand buffer, 0.5 μL of 10 mM dNTPs, 0.25 μL RNase inhibitor, and 0.25 μL reverse transcriptase were added. The mixture was made to 10.0 μL total volume and incubated at 42°C for 60 min followed by 72°C for 10 min. For 5′ RACE with a nested PCR reaction system (end C method), reverse transcription used the specific primers RC583-RT1/RC583-RT2 to amplify the cDNA, and after the RNase H and TdT treatment we performed nested PCR (see the following section). For 5′ and 3′ RACE of rare cDNAs, the temperature parameters for PCR were: 3 min at 95°C followed by 33 cycles at 94°C for 30 s and 68°C for 30 s; after a 7-min final extension at 72°C, the PCR was repeated.
The 3′ RACE amplification was also conducted using nested PCR, using the 3′ adaptor as the reverse primer, cDNA as the template, and the same conditions and cycle parameters as for 5′ RACE, except that the annealing temperature was 58°C for 30 s. The PCR products were separated on 1.0% (w/w) agarose/ethidium bromide gels in 1× TBE buffer containing 90 mM Tris-borate and 2 mM EDTA (pH 8.0 at 22°C). We used a 1 kb DNA ladder as a DNA size marker.
RT-PCR. Total RNA was extracted from DN47 and NIL seeds at 20 days after flowering using TRIzol reagent (Invitrogen) followed by treatment with RNase-free DNase I (Invitrogen) to eliminate genomic DNA. Treated RNA was used for RT-PCR. The RT-PCR amplification of the convergent transcription readthrough of CG-α-1/Linc-GmSTT1 transcripts was conducted using the primer pairs listed in Supplementary Table 2. The PCR products were cloned directly into pCRII using a TOPO TA cloning kit (Invitrogen) and subsequently sequenced.
The Cas9/sgRNA expression vectors in pCBSG015(Basta) were introduced into Agrobacterium tumefaciens strain EHA105 by electroporation. Embryo cotyledonary nodes from DN50 seeds germinated for 5 days were placed in a petri dish containing 50 mL Agrobacterium suspension. About 150 explants were treated for 2 h, and were then left at room temperature for 30-60 min for infection. After infection, the Agrobacterium liquid was discarded, and the explants were transferred to the co-cultivation medium and incubated in the dark at 23°C for 3 days. After co-cultivation, the embryos were transferred to the shoot-induction medium, cultured at 25°C for 7 days, then placed on selection medium containing glufosinate. After culture for 3 weeks, the glufosinate-resistant shoots were transferred to shoot-elongation medium containing glufosinate and cultured in the light for 6-9 weeks. The regenerated elongated seedlings were transferred to rooting medium at 25°C and cultured under light (5000 lux) until rooting.
For each transformed plant, to validate the CRISPR/Cas9-mediated gene disruption, genomic DNA was extracted from the leaves using the CTAB method. The target Linc-GmSTT1 gene fragment was amplified by PCR using the primer pair 5′-CTTCAACTGTCTGCTTAGCTAATTT-3′ and 5′-CCTTTGCCTTCCATAAGGAATTGT-3′. The PCR products were then sequenced to verify the successful editing of the gene. Only transformed plants in which the target gene was edited successfully were used in the subsequent tests.
Crosslinking and chromatin preparation. One gram of frozen tissue was sliced and resuspended in 1 volume PBS, crosslinked in 1% (v/v) formaldehyde for 10 min, then quenched for 5 min with 0.125 M glycine, and collected by centrifugation at 2000 ×g for 5 min. Nuclei were lysed (100 mg/mL in nuclear lysis buffer: 50 mM Tris [pH 7.0], 1% [w/v] SDS, 10 mM EDTA, with DTT and PMSF added just before use) on ice for 10 min, and sonicated utilizing a Bioruptor until most chromatin was solubilized and the DNA was within the size range of 100-500 bp. Chromatin preparations were snap-frozen in liquid nitrogen and stored at −80°C until use.
Hybridization and washing. Chromatin was diluted in two volumes of hybridization buffer (1% [w/v] SDS, 750 mM NaCl, 1 mM EDTA, 15% [v/v] formamide, 50 mM Tris [pH 7.0], with DTT and PMSF added just before use). Probes (100 pmol) were added to 3 mL diluted chromatin and mixed by end-to-end shaking at 37°C for 4 h. Streptavidin-magnetic C1 beads were rinsed three times in nuclear lysis buffer, then 100 µL of washed beads was added per 100 pmol probes, and the mixture was incubated with mixing at 37°C for 1 h. Beads:biotin-probes:RNA:chromatin adducts were captured using magnets (Invitrogen) and rinsed five times with 1 mL wash buffer (0.5% [w/v] SDS, 2× SSC, with DTT and PMSF added just before use). At the last wash, the beads were resuspended. Aliquots of 300 μL were removed for isolation of protein, RNA, and DNA. All tubes were placed on a DynaMag-2 magnetic strip and the wash buffer was removed. After brief centrifugation, tubes were placed on a magnet strip and the last remnants of wash buffer were removed using a fine 10 μL pipette tip.
ChIRP protein elution and MS analysis. Beads were resuspended in 3× original volume of DNase buffer (0.1% NP-40 and 100 mM NaCl). Protein was eluted with a cocktail of 100 µg/mL RNase A (Sigma-Aldrich), 0.1 U/µL RNase H (Epicenter), and 100 U/mL DNase I (Invitrogen) at 37°C for 30 min. Protein eluent was supplemented with 0.2 volume of 5× SDS loading buffer, boiled for 5 min, and separated on a NuPAGE 4%-12% (w/w) Bis-Tris gel, followed by silver staining to identify differential bands. The whole gel lane was excised, trypsinized, reduced, alkylated, and further trypsinized at 37°C overnight. The resulting peptides were extracted, concentrated, and HPLC-purified. The peptides separated by liquid-phase chromatography were ionized through a nanoESI source and then passed through an LTQ Orbitrap Velos tandem mass spectrometer (Thermo Fisher Scientific, San Jose, CA, USA) with data-dependent acquisition (DDA) mode detection. Proteins were identified by matching the experimental MS/MS data against theoretical MS/MS data from a database. Raw MS data were converted into a peak list and then used to search for matches in the database with strict filtering and quality control to produce possible protein identifications. The final protein identification list was used for functional annotation analysis using the GO and KEGG databases.
ChIRP DNA elution and high-throughput sequencing. Beads were resuspended in 3× original volume of DNA elution buffer (1% [w/v] SDS, 50 mM NaHCO 3 , and 200 mM NaCl), including DNA INPUT, and DNA was eluted with 100 µg/mL RNase A (Sigma-Aldrich) and 0.1 unit/µL RNase H (Epicenter). Elution was performed two times [for 1 h] at 37°C with end-to-end shaking, and both eluates were combined.
Formaldehyde crosslinks in the chromatin were reversed at 65°C overnight, and the samples were then treated with 0.2 U/µL of proteinase K at 55°C for 60 min. DNA was then extracted with an equal volume of phenol:chloroform:isoamyl alcohol (Invitrogen) and precipitated with ethanol at −80°C overnight. Using a DNA library preparation protocol, eluted DNA was amplified into sequencing libraries based on the manufacturer's instructions (KAPA). To create 151 nt paired-end reads, the recovered libraries were sequenced on an Illumina NextSeq 500 platform (ABLife Inc., Wuhan, China). The raw reads were aligned to the Glycine max reference genome using Bowtie2 (version 2.2.9). The uniquely mapped reads were subjected to peak calling with MACS (version 1.4.2) using default parameters.
ChIRP RNA elution and high-throughput sequencing. Beads were resuspended in 95 μL RNA PK buffer (10 mM Tris-Cl [pH 7.0], 100 mM NaCl, 0.5% [w/v] SDS, and 1 mM EDTA), then 5 μL of proteinase K was added and the mixture was incubated at 50°C for 45 min with end-to-end shaking. For RNA INPUT samples (10 μL), 85 μL RNA PK buffer was added. All tubes were centrifuged briefly and heated at 95°C for 10 min, and then RNA was extracted with TRIzol:chloroform. Eluted RNA was amplified into sequencing libraries via an RNA library preparation protocol based on the manufacturer's instructions (KAPA). To create 151 nt paired-end reads, the recovered libraries were sequenced on an Illumina NextSeq 500 platform (ABLife Inc., Wuhan, China). The raw reads were aligned to the Glycine max reference genome using Bowtie2 (version 2.2.9). The uniquely mapped reads were subjected to peak calling with MACS (version 1.4.2) using default parameters.

Figure 1
Identification and characterization of Linc-GmSTT1 in the intergenic non-coding region of the Cgy-2 locus (AB604030) of soybean β-conglycinin. A, Schematic presentation of the novel lincRNA transcript MSTRG128686 (termed Linc-GmSTT1) identified between the two α-subunit genes CG-α-1 and CG-α-2, and its cis-regulated differentially expressed genes in NIL and DN47 (Table S1). B, Structural and comparative analysis showing that the alpha transcription termination unit (designated alpha-TTU) is the critical region differing between NIL and DN47, and that Linc-GmSTT1, embedded within alpha-TTU, is its core. The orange square indicates the alpha-TTU structure. The purple line represents the hypervariable region (HVR). Red and green arrowheads represent the left and right portions of intergenic terminal inverted repeat 1 [ITIR1(L) and ITIR1(R)] and repeat 2 [ITIR2(L) and ITIR2(R)]. C, Genomic structure of the Linc-GmSTT1 gene. Dark red bars = exons, single line = spliced intron, purple rectangles = non-translated regions; the numbers indicate the number of nucleotides. D, Genomic tracks display the differential expression of Linc-GmSTT1 detected by RNA-Seq, with unique reads of Linc-GmSTT1 detected in DN47 but not in NIL. E, qRT-PCR validation of Linc-GmSTT1 differential expression between NIL and DN47. Significant differences were observed between NIL and DN47 (**, p < 0.01).

Figure 2
Detection of CG-α-1/Linc-GmSTT1 readthrough transcripts generated by convergent transcription termination of CG-α-1 (Glyma.20g148300) and Linc-GmSTT1. A, Schematic presentation of the Cgy-2 locus (AB604030), including CG-α-1, Linc-GmSTT1 and CG-α-2. The positions of the primers used are indicated by arrows. Readthrough transcription of CG-α-1 toward the Linc-GmSTT1 gene is indicated by an arrowhead at the bottom. B, RT-PCR amplification of CG-α-1/Linc-GmSTT1 readthrough transcripts. Genomic DNA-free RNA samples isolated from DN47 were used as templates. Nine sets of primers are indicated at the top of the figure, and the sizes of DNA markers in kilobase pairs (kb) are shown at the left of the figure.
proteins as A0A0R0I621 (Glyma.09G092600). D, Validation of ChIRP-MS-detected Linc-GmSTT1-binding protein A0A0R0I621 pull-down by Linc-GmSTT1 in vitro. E, Screening of interacting proteins with A0A0R0I621 (Glyma.09G092600) in a yeast two-hybrid assay.

Figure 5
Linc-GmSTT1 is a Cgy-2 allele-specific tether. A, ChIRP-DNA-Seq revealed Linc-GmSTT1 binding to its own genomic sequence in the Cgy-2 locus in DN47 but not in NIL. B, Sequence logos of Linc-GmSTT1 binding motifs in the alpha-TTU region in DN47 detected by ChIRP-DNA-Seq. C, Alignment of the different distribution patterns of Linc-GmSTT1 binding motif sites in the alpha-TTU region of the Cgy-2 locus in DN47 and NIL. DN47 contained six motifs, which were detected in the cultivars 'Williams 82' and DN47, but not in the corresponding HVR region in NIL.