Genome-wide identifying of G-quadruplex structures directly by whole-genome resequencing


 We present a convenient genome-wide DNA G-quadruplex (G4) profiling method that identifies G4 structures from ordinary whole-genome resequencing data by seizing the slight fluctuation of sequencing quality. We identified 736,689 G4 structures within human genome, in which 44.9% of all predicted canonical G4-froming sequences were contained. We observed that some of the single nucleotide variations (SNVs) influenced the formation of G4 structures, including homozygous SNVs and heterozygous SNVs. Due to SNVs contain individual differences, the given approach is available to identify and characterize G4s genome-wide for specific individuals.


Introduction
G-quadruplex (G4) structure is an alternative conformation of the double helical structure of B-form DNA arisen in guanine-rich sequences 1 . G4s are also discovered to arise in single-stranded RNA sequences 2, 3 . One four-stranded G4 is the stacks of two or more planar G-tetrads, and one G-tetrad forms through Hoogsteen hydrogen-bonding of four guanines 4 . G4 structures are considered to be knot-like structures and have been showed stable formed under near-physiological conditions in vitro 5,6 . G4 structures are stabilized by cations with stabilizing preference of monovalent cations according to the following order: K + > Na + > Li + 7 . The formation of G4 structures in genome can in uence chromatin architecture and many fundamental biology processes, such as DNA replication and gene regulation [8][9][10] . Identifying G4s genome-wide or transcriptome-wide will accelerate the study on the formation, stability and function of these structures.
In the last decade, several methods have been developed to identify G4 in whole genome or whole transcriptome, including computational sequence analyses and experiments. By searching simple consensus sequence 11,12 or accommodate structural variants [13][14][15] , computational tools have helped to identify potential G4s in given genome 16,17 . Computational predictions employ algorithms derived from experimental data on a small number of sequences, and require a thorough experimental validation.
Using chromatin immunoprecipitation followed by high-throughput sequencing (G4 ChIP-seq), DNA G4s can be detected and mapped to genome 18,19 . Antibodies against known G4-binding proteins are used to infer G4s in these approaches. To be noticed, G4 ChIP-seq depends on speci c antibodies and potential biases may be introduced 20,21 . The presence of folded G4 in DNA can stall a DNA polymerase 22,23 . Coupling with high-throughput sequencing (G4-seq), more than 700,000 DNA G4 sites were identi ed based on genome-wide DNA polymerase-stop assay 24 . In G4-seq, each DNA template is sequenced twice.
The initial sequencing run uses the monovalent cation Na + that do not stabilize G4s to ensure accurate sequencing, and the second sequencing run is under G4-stabilizing conditions (in K + or with addition of pyridostatin). By mismatch quanti cation based on the results of the rst sequencing run, the folded G4s are detected genome-wide in vitro. Similarly, RNA G4 structures were detected in high-throughput using reverse transcriptase stalling (rG4-seq) 25 . G4-seq is a fast and powerful approach to identify G4 sites in genome, which included various non-canonical G4 structures that are di cult to predict. Optimized version of G4-seq has been used to generate whole genome G4 maps for 12 species 26 . However, G4-seq depends on the accommodation of sequencing buffer, which is unreachable to most researchers. Two sequencing runs also mean double cost. If G4 structures can be directly pro led from standard highthroughput sequencing data, the procedure will be more friendly and the cost will be greatly decreased.
Here, we present G4-miner, a genome-wide approach to directly identify G4 structures from standard sequencing data. G4-miner inspects sequencing quality in whole genome, seizes unexpected uctuations in local, and ascribes some of the quality uctuations to the unstable formation of G4. We sequenced DNA from primary human B lymphocytes under standard Illumina sequencing condition. Although the standard Illumina sequencing buffers do not cause strong G4 stabilization 24,27 , a slight and inconstant drop of sequencing quality was recorded from the beginning of the G4 structure (Fig. 1). We mined G4 sites in human genome, appraised the results with the results of computational prediction and G4-seq, and evaluated the in uence of single nucleotide variations (SNVs) in G4 formation using the seized G4 sites. The given method provides a friendly and convenient way to identify G4 structure genome-wide.

Results
Principle and process of G4-miner In G4-miner, genomic DNAs are sequenced only one time using standard Illumina sequencing protocol and standard buffers. The standard sequencing buffers has been optimized and do not cause strong G4 stabilization 24,27 . However, destabilized G4 structures still in uence the sequencing reaction. Instead of altering sequencing readout, a slight and inconstant drop of sequencing quality was recorded from the beginning of the G4 structure ( Fig. 1). In a standard sequencing experiment, DNA fragments in clusters are sequenced and the phred quality scores are calculated for each read (cluster). The unstably formed G4 causes unexpected uctuations of sequencing quality, but do not cause mismatch at most time.
G4-miner inspects sequencing quality in whole genome, seizes unexpected uctuations in local, and ascribes some of the quality uctuations to the unstable formation of G4. Median quality scores of bases mapping to each locus in the genome are calculated after reads mapping. For each site, we seize the quality scores of the next 75 nt, recorded the number of loci whose median quality score signi cantly dropped in comparing with that of the speci c site. G4 sites are mined after setting the thresholds of the number of loci and the quality score dropped. Figure 1 A schematic of the G4-miner method. In a standard sequencing experiment, DNA fragments in clusters are sequenced and the phred quality scores are calculated for each read (cluster). The unstable formation of G4 causes unexpected uctuations of sequencing quality, but do not cause mismatch.
Reads mapping is performed to calculate the median quality score of bases mapping to each locus in the genome. For each site in the genome, the number of loci in the next 75 nt whose median quality score signi cantly dropped in comparing with that of the speci c site is calculated. G4 sequences in the whole genome can be mined after setting the threshold of the number of loci and the quality score dropped.

Validation by known sequences
We validated our approach by inspecting the sequencing quality scores of four known control sequences: two containing stable G4 structures (src and myc), one conforming requirement of canonical G4 but has been demonstrated to prevent G4 formation (a-repeat) 28 and a guanine-rich strand that cannot form a G4 (g-rich) 29 (Supplementary Table 1). Phred quality scores 30 were used to quantify the sequencing quality.
We calculated the median quality scores for each locus in the template/genome to seize and represent the inconstant sequencing quality uctuation. Although the readouts do not alter in all control sequences, the median quality score was slightly uctuated down from or before the beginning of the G4 structure in positive controls (Fig. 2a, b). For src and myc, median quality score dropped no less than 4 at 55 loci and 25 loci among the 75 bases (half of the read length) from beginning of G4 structure. In contrast, sequencing of the negative controls observed less than 6 loci ( Fig. 2c, d). The inspection of quality scores and readouts along positive control revealed that sequencing accuracy slightly reduced after the G4 start sites but the readouts did not alter ( Fig. 1). At the same time, the reduction of quality scores was not observed in all reads which covered G4 structure and the sequence downstream. Some of the reads had constant high quality scores. The observations suggested that using standard Illumina sequencing buffers, the formation of G4s is not stable and the polymerase stalling is not occurred in most template molecules. Sequencing quality score of a read was generated by comprehensively evaluating the accuracy of synthesis reactions in tens of thousands of molecules in a cluster. Under G4-destabilizing condition (standard Illumina sequencing buffers), the quadruplex structures were unstably formed in only a small portion of template molecules in a cluster. The pause of polymerase stalling occurred in these molecules cause phase shift signals, which interfered the real signals and induced the drop of quality score.
Whole genome seizing the quality uctuation Next, we explored whether quality uctuation could be speci cally seized genome-wide. A set of 75nt (half of the read length and longer than 99% PG4s, Supplementary Fig. 1) segments which contain about 178,606 predicted G-quadruplexes (PG4s) and downstream sequences at 3' end were cut from the plus strand of the human genome. The rest regions of plus stand which do not contain PG4s were cut into 75nt segments. The analysis based on about one million reads (about 45× of the plus strand) generated by standard pipeline showed more signi cant quality uctuations in the segments containing PG4s than the rest ones. Over 40% PG4 positive segments contained over 15 loci whose quality scores decrease no less than 4 in comparing with the loci before PG4s, and only 4.05% non-PG4 segments passed the criteria (Supplementary Table 2). Considering that the guanine content is greater than 28% in over 99% PG4 positive segments ( Supplementary Fig. 2), we ltered the non-PG4 segments with low guanine content (G% < 28%), and less than 1% of the non-PG4 segments remained under the same criteria (Supplementary  Table 3). Although the sequence quality for most non-PG4s were stable with no observable decrease, the remained 1% non-PG4 segments were found to have signi cant quality uctuation (~370,000 segments). The number of these non-PG4 segments showing the considerable decreasing amplitude in considerable number of loci, was much greater than the number of predicted PG4s (178,606), which is consistent with the previous approach in G4-stablizing conditions 24 , indicating that the human genome contains more G4s than predicted 11 .
We applied G4-miner to pro le DNA G4s structures in the human genome by resequencing primary B lymphocytes (GM12878, Online Methods), using the Illumina Hiseq platform under standard condition. To identify reliable G4 sequences, we inspected the quality scores of 75 bases at 3' end of a speci c site and screened median quality scores of each site in both strands of the human genome. We set individual threshold for the two strands of the parallel runs to ensure the false positive rate similar and smaller than 1% (Supplementary Table 4). The in uence evaluation of sequencing depth revealed that PG4 detection ratio got plateau when sequencing depth was greater than 20 folds for one stand ( Supplementary Fig. 3). Therefore, two duplicates of the experiment were performed, at least 1.88 billion reads were generated with an average coverage of 45× for each strand of human genome (Supplementary Table 5). We set the threshold of 15 loci quality score decreased 4 (15,4) for plus stand of experiment 1, (15,3) for the minus stand of experiment 1, (12,4) for the minus stand of experiment 2 and (11,4) for the minus stand of experiment 2, respectively. Any segment of the genome passed these thresholds is termed mined G4 sequence (MG4). Within both strands of the human genome, we identi ed 1,054,941 MG4s and 936,545 MG4s in the two parallel experiments (Supplementary Table 6). About 52% and 49% of all 356,298 predicted canonical quadruplexes were also detected by G4-miner and present in MG4s (Supplementary  Table 6). Moreover, 736,689 of the total number of MG4s (70% for experiment 1 and 78% for experiment 2) were detected in both experiments (Fig. 3a, b). The high overlap between parallel experiments indicated that G4-miner is stable in quadruplex assigning. Comparing to the result of the G4-seq, which combined polymerase stop assays with next-generation sequencing 24 , over 75% and 82% of PG4s detected by G4-miner are also detected by adding K + and PDS to sequencing buffers, separately (Fig. 3c, d). More PG4s are detected under G4-stabilizing conditions (K + or PDS) than under standard sequencing condition, which con rmed to the existing knowledge that G4 structure are thermodynamically stable in the presence of K + and PDS.
Structural analysis and genome browser of mined G4 sequences We noticed that only 17% of the MG4s were predicted as classical described G4s. In recent, noncanonial G4s structures such as long loops, bulges and two quartets have been identi ed by structural biological and biophysical studies [31][32][33] . Therefore, we hierarchically assigned the MG4 to ve categories to elucidate the structural features (Fig. 4a). Canonical G4s, long loops, bulges and two quartets accounted for 17%, 6%, 23% and 34% of total MG4s based on the threshold of sequencing quality. The distribution of MG4 categories is highly consistent with the result of the rG4-seq 25 which exploited reverse transcriptase stalling reactions containing K + with G4-stabilizing ligand pyridostatin (PDS) (Supplementary Fig. 4). Canonical G4 sequences with short loops were more easily identi ed than sequences with longer loops (Fig. 4b), which is keeping with their relative thermodynamic stability 32,33 .
The MG4s in 'other' category do not contain known G4 structures, but also cause slight and inconstant drop of sequencing quality. The sequences analysis revealed that most of them contained other DNA secondary structures, such as hairpin structure, poly-adenine/thymine and i-motif (Supplementary Table  7). To ensure delity, this category was removed for the following analyses. Our results revealed that the G4-miner is able to mine and identify G4 sequences in whole genome from ordinary sequencing data. We quanti ed the OQs in exons, introns, promoters, untranslated regions (UTRs) and splicing junctions ( Table 1). MG4s was found to be densest in 5' UTRs and splicing junctions, where G4s functioned in posttranscriptional regulation 34,35 . The density of G4s in genomic regions is consistent with the result by adding K + or PDS to sequencing buffers 24 . Visual inspection of genes rich in PG4s revealed that G4miner is effective in mining G4 regions (Fig. 4c).  Fig. 5a, b). For heterozygous SNVs, we classi ed the sequencing reads into two groups based allele type to identify MG4s separately. Of all heterozygous SNVs, 7.17% (186,904) of them revealed to having different in uence on the formation of G-quadruplexes in plus stand, and 6.28% (163,666) for the minus stand (Fig. 5c, Table 2). All those homozygous and heterozygous SNVs associated G4s are individual speci c G4s which might contain individual differences in chromatin architecture and many fundamental biology processes, such as DNA replication and gene regulation. Our results show that G4-miner can characterize G4s for speci c individuals.

Discussion
Many techniques have been developed to detect and map G4s including chemical biology and genomic technologies. However, to date most of these techniques' protocols are complicated by relying on G4stablizing conditions or G4 antibodies. To address this conundrum, we have established a convenient, high-throughput approach for identifying DNA G4 structures from ordinary next-generation sequencing data directly. Our results con rm that G4 structures are unstably formed in the standard sequencing buffer, which cause sequencing quality uctuations. We have conveniently identi ed canonical G4s and numerous non-canonical structures in whole genome by inspecting the quality uctuations.
For one thing, to discover the stability of the G4-miner, we carried out two parallel experiments, and 736,689 MG4s (70% for experiment 1 and 78% for experiment 2) were detected in both experiments. The high overlap between them indicated that G4-miner is stable in quadruplex assigning. For another, to verify the validity of the method, we compared G4-miner with the G4-seq. As a result, over 75% and 82% of PG4s also detected by G4-miner, and it con rmed that G4s are stable in the presence of K + and PDS. The different overlaps between G4-stabilizing conditions and G4-miner indicated that G4-miner is valid in detecting G4s. Besides, the quality uctuations seized by G4-miner are not only G4s, but also many unidenti ed regions. The principle of G4-miner can be used to identify other structures that would in uence sequencing quality, and describes genomic maps of other secondary structures by their sequence characterization through quality uctuations.
One key role of our method is to reveal the effect of SNVs on G4 formation. After inspecting SNVs in the whole human genome, we observed that the presence of some SNVs determines the formation of G4 structure. The homozygous SNV causes the formation of G4 structure or prevents the formation of G4 structure. For some heterozygous SNVs, the two alleles have different in uence on the formation of G4 structure, as the allele G caused the quality uctuations, the allele A did not. The SNVs shows signi cant individual differences, the speci c relation between SNVs and G4s might contain individual differences in chromatin architecture and fundamental biology processes. The G4-miner can be used to detect individualized G4 structures, which will accelerate the discovery of the relation among SNVs, G4s and biology processes.
In summary, our work provides G4-miner, a friendly and convenient method to identify G4 structure genome-wide. Our nding that some quality uctuations ascribed to the unstable formation of G4, describes a map of G4s in the whole human genome under standard sequencing protocol. Importantly, through G4-miner, we have also con rmed that some of the SNVs can in uence the formation of G4 structures. The given approach is available to identify and characterize G4s genome-wide for speci c individuals. This approach could also be extended to pro le other DNA secondary structures which exist in standard sequencing buffer.

PG4 identi cation and region classi cation
PG4s were predicted from hg19 using g4predict (https://github.com/mparker2/g4predict.git) according to the Quadparser algorithm by searching for the '(G{3,}[ATCG]{1,7}){3}G{3,}' pattern. For parameter determining step, the start site of 'positive region' is de ned as the sequence up to 12 bases (approximate footprint of DNA polymerase) up-stream of the PG4 start site, and the end site was 75 bases downstream of the start site. After removing the positive regions and 300 bases upstream and downstream of them, the remaining region was split into 75-base windows, which were de ned as 'negative regions'.

Low-quality region scanner
For each sequencing read, the aligned sequence information was extracted from BAM le. Reads were separated into forward strands and reverse strands according to the FLAG value. After the quality score was translated into decimal numbers, the median quality of every locus of different strands was calculated. The low-quality region scanner algorithm checked the median quality value of every locus sequentially. The parameter 'M' is designed to describe the degree of low quality, and parameter 'N' limits the number of low-quality bases to set the lower bound of the low-quality region. If the difference of median quality scores between one base and the just base up-stream of it (the orientation of 'up' or 'down' is consistent with the orientation of the strand) is no less than M, the base with higher quality value is 'the ag site' whose quality score is de ned as 'the ag value', and the other is 'low-quality site', also 'the start site of low-quality region'. From the start site on, if the number of low-quality sites is greater than or equal to N in a range of 75 bp, a 75nt-low-quality region was marked as a 'hit'.

Parameter determining and MG4 detection
Positive regions (with PG4s) and negative regions (without PG4s) from 'PG4 identi cation and region classi cation' step were used to calculate the hit ratio according to the previous low-quality region scanning algorithm. The parameters M and N were set as 1~12 and 1~20 respectively. Under each M-N set, a positive rate and a false positive rate were calculated. For both strands of two replicates, parameter sets which can obtain the max positive rate while keeping the false positive rate lower than 1% would be used in the following steps. To eliminate the in uence of other non-G-quadruplex secondary structures, the ratio of G/C bases in detection regions should be at least 28% (99% of positive regions contain 28% or more G/C bases). For the whole genomic analysis, low-quality regions from two strands of two replicates were scanned independently under their most suitable parameter sets. For the more complete detection of positive regions, the low-quality regions in MG4 detection step extended 35 bases (the longest length of canonical G4s) up-stream of the start site. Low-quality regions with overlapped fragments were merged into a single one. The merged regions considered as MG4s.

Genomic regions analysis
Gene annotation les were downloaded from the UCSC genome browser website (https://genome.ucsc.edu/), genome version hg19. Genomic regions contain 5'UTRs, 3'-UTRs, exons, introns, CDSs, promoter regions, TSS (translational start site) regions and splice regions. TSS regions are de ned as the 1,000 bp up-and down-stream the TSS. Promoters are 1,000 bases up-stream TSS. Splice regions are 50 bases up-and down-stream splice sites. For these regions, the total number and size of them were calculated. The number of MG4s, including total MG4s and canonical MG4s, and predicted PG4s overlapping to the region intervals were calculated. The number of MG4s were all the intersection of two replications. For each gene annotated in the human genome (hg19), the number of predicted PG4s of MG4s (PG4inMG4s) in two replications were counted. The density of G4s was calculated by dividing the respective counts by the length of genes or regions and multiplying by 1,000.