Improved Gossypium raimondii Genome Using a Hi-C-based Proximity-Guided Assembly

Background: Genome sequence plays an important role in both the basic and applied studies. Gossypium raimondii, the putative contributor of the D subgenome of Upland cotton (Gossypium. hirsutum), highlights the need to improve the genome quality in a rapid and ecient way. Methods: we performed Hi-C sequencing of Gossypium raimondii and reassembled its genome based on a set of new Hi-C data and previously published scaffolds. We identied and corrected errors of initial scaffolds before reassembled into chromosomes. Result: A total of 98.42% of sequence was clustered successfully, among which 99.72% of the clustered sequence was ordered and 99.92% of the ordered sequence was oriented with high-quality. Further evaluation of results by heat-map and collinearity analysis revealed that the current reassembled genome is signicantly improved than previous one. Conclusion: This improvement in Gossypium raimondii genome not only provides a better reference genome to increase study eciency, but also offers a new way to assemble cotton genomes. Furthermore, Hi-C data of Gossypium raimondii may be used for 3D structure research or regulating analysis. sequenced and assembled genome of Gossypium raimondii (Wang et al., 2012) was relatively low due to higher numbers of repeat elements and low numbers of genetic markers. In the present study, we conducted a de novo Hi-C sequencing of Gossypium raimondii genome, and incorporated the new Hi-C data with the existing Gossypium raimondii scaffolds (Wang et al., 2012) to improve the quality of the D-genome.


Introduction
Over the last decade, next generation sequencing (NGS) technologies has brought immense improvements in plant genome sequencing throughput and cost, and many plant genomes have been  (Li et al., 2014) and Gossypium hirsutum (Li et al., 2015;Zhang et al., 2015). However, de novo assembly of large eukaryotic genomes has remained a great challenge with the NGS platform due to signi cant amount of repeat contents (Rounsley et al., 2009). As a result, de novo assembly of chromosome-scale scaffolds has become a major constraint to the completion of highquality genome sequence. Compared with the traditional method, high-throughput chromosome conformation capture (Hi-C) technology has assisted in the assembly of long scaffolds to produce chromosome-scale genome assemblies (Lightfoot et al., 2017). The Hi-C technique is based on existing scaffolds and restriction enzyme cutting sites, which are more evenly distribution and have much higher density.
The Hi-C-based proximity-guided assembly was initially developed to study the three-dimensional (3-D) conformation of chromosomes of yeast gene expression (van Berkum et al. 2010). There are two regular models with Hi-C data: the rst one is that the rate of Hi-C interaction is inversely proportional to the genomic distance between the pairs of loci, the second one is that the rate of Hi-C interaction of pairs of loci within a chromosome is signi cantly higher than that in different chromosomes (Xie et al. 2015).
Based on these two models, Hi-C-based proximity-guided assembly was applied for de novo assembling of human, and subsequently for the assembling of mouse and Drosophila genomes which reported good results or improvement (Burton et al. 2013). With the success of testing and verifying this method in Arabidopsis thaliana (Xie et al. 2015), Hi-C based proximity-guide assembly has been reported as an effective and e cient method which subsequently has been used in many other plants.
Besides cotton's commercial value, it also serves as a perfect model system for studying cell wall The Gossypium genus comprises of more than 50 species including at least 5 tetraploid species and 45 diploid species. Diploid cottons are divided into 8 sub-genomes, denoted A-G and K based on chromosome pairing relationships (Wendel et al., 1992). Tetraploid cottons, such as cultivated Gossypium hirsutum (AD 1 ) and Gossypium barbadense (AD 2 ), had formed by an allopolyploidy event about 1-2 million years ago (Paterson et al. 2012). These tetraploid cotton species share common ancestors with the modern New World species Gossypium raimondii (D 5 ) and the Old World A-genome species Gossypium herbaceum (A 1 ) or Gossypium arboreum (A 2 ). Previously, genomes of different cotton species sequenced and assembled including Gossypium raimondii (Wang et al. 2012), Gossypium arboreum (Li et al. 2014), and Gossypium hirsutum (Li et al., 2015).
Among these cotton species, the genome of Gossypium raimondii has the lowest complexity which has been sequenced and assembled using the next-generation Illumina paired-end sequencing strategy The seeds of Gossypium raimondii D 5-1 were planted in an incubator at constant environmental condition having 27 C temperature, 60% relative humidity, 16/8-h light/dark photoperiod, and 100% uorescent light. When sixth euphylla came out, these seedlings were transplanted into big pots. Approximately 3gram young leaves from Gossypium raimondii plants were collected and immediately treated with formaldehyde.

Hi-C pipeline
During this study, we have used the same Hi-C pipeline as in Arabidopsis thaliana (Xie, Zheng et al. 2015). Before start this experiment, we have tested the integrity of DNA from the formaldehyde-treated tissue, and then the DNA was isolated and digested by MboI instead of HindIII because of the shorter recognition site (only four bases of MboI). The resulting sticky ends were lled with nucleotides in which cytosine is biotinylated, and ligated the adjacent blunt ends to a chimeric circle under extremely dilute conditions. Subsequently, DNA was puri ed and broken into 300-500 base pairs using ultrasonic, pull-down the biotin labeled DNA and performed the PCR reaction (10 cycles). After DNA puri cation, the nished Hi-C library was sequenced with an Illumina Hiseq (PE150). A total of 570,412,361 read-pairs were obtained.

Genome assembly based on Hi-C data
Assembling of Gossypium raimondii genome involved three steps. First, valid Hi-C paired-end reads and contact matrix with a resolution of 100 kb were generated by HiC-Pro (Servant, Varoquaux et al. 2015).
The raw sequence data with low quality, unmapped and invalid mapped paired reads were ltered out by HiC-Pro and contact matrix based on interaction frequency was created. At the second step, the Gossypium raimondii genome was assembled with the Hi-C data by Lachesis (Burton, Adey et al. 2013),which contained clustering, ordering and orienting. Lastly, the assembled Gossypium raimondii genome was assessed by Mummer and Python scripts, resulting heat-map and collinear analysis.

Mapping and ltering
Initially, the low-quality and invalid paired-reads were ltered out which was caused by sequencing errors from the raw Hi-C reads of Gossypium raimondii (Table 1). Results revealed that 95.6% of sequence is clean Q30 bases, showing a good quality of sequence data. Then, clean Hi-C reads were mapped to the previously sequenced genome of Gossypium raimondii (Wang et al., 2012) using Bowtie2 (Langmead and Salzberg 2012), and unique mapped paired-end reads were retained ( Fig.1 and Table 2). Subsequent analysis was performed to remove the invalid paired-reads. The reference genome was broken into restriction fragments by cutting them at the MobI restriction enzyme site "GTAC". Approximately 54.7% uniquely mapped paired-read was aligned to single restriction fragments. Among uniquely mapped reads, valid paired-end reads were present in different restriction fragments, but non-ligation, self-ligation and dangling end paired-end reads were recognized by mapping orientation information in the same restriction fragment (Belton, McCord et al. 2012) (Table 3, Fig. 2). After the Hi-C data were ltered out, results showed that 81.95% of uniquely mapped sequences are valid paired-end reads. Thus, the valid paired-end reads (223,304,666) were used for further analysis.

Creating contact matrix
The genome was into non-overlapping 100kb windows, and the number of valid paired-end reads in the 100kb windows was referred as the contact count. The Hi-C contact matrix was built and normalized by its restriction sites because the Hi-C signal was in linear proportion to the number of restriction sites.

Identi cation and correction of errors within scaffolds
Errors in scaffolds of the initial draft assembly were identi ed and corrected following the Aedes aegypti's de novo assembly procedure (Dudchenko, Batra et al. 2017). Brie y, the errors were corrected by identifying the bins where a scaffold's long-range contact pattern changes abruptly, which is unlikely for a correct scaffold. We cut out the error bins as a new scaffold. There are 259 errors within scaffolds. The list of error bins is presented in Supplemental.1.
2 Genome assembly by Lachesis

Clustering
Lachesis is a computational method that exploits Hi-C data sets for de novo genome assemblies (Burton, Adey et al. 2013). Hi-C data has two classical models, Hi-C interactions within one chromosome are distinctly more than it between two chromosomes, and Hi-C interactions between two loci are inversely proportional to their distance. Based on the rst model, Lachesis clustered the scaffolds into 13 groups by agglomerative hierarchical clustering (Table 4). A total of 2,883 scaffolds were clustered successfully.

Ordering and orientation
In each clustered group, an acyclic spanning tree was built with vertexes corresponding to the scaffolds, while edge weights representing the normalized Hi-C interactions between pairs of scaffolds. A total of 1,328 scaffolds (744,578,885 bp, representing 99.72% of the total length) were ordered by Lachesis, among which 697 scaffolds (724,960,878 bp, representing 97.37% of the total length) are "trunk" (Table  5). For each ordered group, the acyclic spanning tree was built to represent all of the possible ways to orient the scaffolds. Lachesis has built a scoring function based on the difference between forward and backward interaction. Highest score represented the maximum likelihood through predicting orientations for each of the scaffolds. Among 1,328 ordered scaffolds, 1,129 scaffolds (743,948,690 bp, representing 99.92% of the ordered length) were of high scores ( Table 6). All of the "trunk" scaffolds were of high score.

Assembling results
The genome of Gossypium raimondii were reassembled using Hi-C data (Supplemental.2). About 98.42% of total sequence length was clustered successfully, among which 99.72% and 97.37% of the clustered sequence were ordered and high-quality ordered, respectively. And approximately 99.92% of the ordered sequence was oriented with high-quality. The statistics of pseudo-chromosome length is shown in Table   7,while the indicator statistics of the initial draft genome and reassembly results are shown in Table 8 and 9, respectively. From the parameters like scaffolds number, N50 and N60 of previously draft genome and reassemble genome, we found that the Gossypium raimondii genome using a Hi-C-based proximityguided assembly is clearly much better than the reported draft genome (Wang et al., 2012).
Further, the results were also veri ed by heat-map (Fig. 3) and collinear-analysis (Fig.4). The heat-map directly proves the validity of the processing methods. Based on the two regular models as we mentioned above, the heat-map of the reassembled results revealed that the diagonal interaction is much higher. The boundaries of each pseudo-chromosome are relatively clear and it is under low background noise that shows good reassembling results (Fig. 3). In addition, the collinear relationship between the draft (Wang et al., 2012) and reassembled genomes (Fig. 4) showed that the current reassembled genome is quite different from the previous one. Further when we compared our results with another version of Gossypium raimondii genome (Paterson et al., 2012), results showed that current reassembled genome is improved with respect to both quality and integrity (Fig. 5).

Conclusion And Discussion
With the development of sequencing techniques and bioinformatics tools, a high-quality genome sequence is the basis for cotton molecular breeding. In several previous studies, Illumina short sequencing reads are extensively used for de novo genome sequencing and assembling of different organisms ( Results from the comparative analysis of different parameters between the draft genome (Wang et al., 2012) and the current reassembled genome showed a signi cantly improved quality as compared to previous one. Such as we increased the rate of clustering from 73% of the previous draft assembly to 98.42% of current reassembly. Similarly, the rates of ordering and orienting were also from 52.4 % (previous draft assembly) to 98.07% (current reassembly), con rming the better quality of current reassembled Gossypium raimondii genome.
Previously, Paterson et al. (2012) also sequenced and assembled the Gossypium raimondii genome with good results and abundant markers. However, these markers are not evenly distributed across the genome which might indicate some errors in its genome assembly. Thus in the current study, we also compare the reassembly genome with this version of Gossypium raimondii genome (Paterson et al., 2012) by the collinear analysis. Our results showed that despite of fewer differences, the current reassembled genome is improved with respect to both quality and integrity. However, based on the differences between the current reassembled genome and the genome reported by Paterson et al. (2012), further work may be necessary for integrating the two versions of Gossypium raimondii genome into a best version.

Authors' contributions
Yang QH and Song GL conceived and designed the experiments; all authors performed data analysis and interpretation; Yang QH, Javaria Ashraf and Song GL wrote the manuscript; All authors read and approved the nal manuscript.  Clean paired-end reads are the high-quality reads after ltering. Ns paired-end reads having the more than 5% N's percentage. Q30 bases rate is the ratio of base's sequencing quality which is higher than 30, it means the base's sequencing error percentage is less than 0.1%. Multi mapped paired-end read means Hi-C sequencing read mapped to more than one loci of reference sequence. Dangling end paired-end read means the biotin labeled base is at the end of read. Dumped paired-end reads do not contain any biotin labeled base or it's inter size is out of range. Interaction paired-end reads were mapped to different restriction fragment. Valid paired-end reads are interaction paired-end which taken out repeat paired-end reads caused by PCR.    We named the reassemble chromosome by its length order. The length of chromosome contained the 100 bp N between neighboring scaffolds. All of the statistics got rid of the short scaffolds (scaffold length < 1,000bp). All of the statistics got rid of the short scaffolds (scaffold length < 1,000bp).
Supplemental le 2 was not provided with this version of the manuscript.