1. Hi-C data analysis by HiC-Pro
1.1 Mapping and filtering
Initially, the low-quality and invalid paired-reads were filtered out which was caused by sequencing errors from the raw Hi-C reads of Gossypium raimondii (Table 1). Results revealed that 95.6% of sequence is clean Q30 bases, showing a good quality of sequence data. Then, clean Hi-C reads were mapped to the previously sequenced genome of Gossypium raimondii (Wang et al., 2012) using Bowtie2 (Langmead and Salzberg 2012), and unique mapped paired-end reads were retained (Fig.1 and Table 2). Subsequent analysis was performed to remove the invalid paired-reads. The reference genome was broken into restriction fragments by cutting them at the MobI restriction enzyme site “GTAC”. Approximately 54.7% uniquely mapped paired-read was aligned to single restriction fragments. Among uniquely mapped reads, valid paired-end reads were present in different restriction fragments, but non-ligation, self-ligation and dangling end paired-end reads were recognized by mapping orientation information in the same restriction fragment (Belton, McCord et al. 2012) (Table 3, Fig. 2). After the Hi-C data were filtered out, results showed that 81.95% of uniquely mapped sequences are valid paired-end reads. Thus, the valid paired-end reads (223,304,666) were used for further analysis.
1.2 Creating contact matrix
The genome was into non-overlapping 100kb windows, and the number of valid paired-end reads in the 100kb windows was referred as the contact count. The Hi-C contact matrix was built and normalized by its restriction sites because the Hi-C signal was in linear proportion to the number of restriction sites.
1.3 Identification and correction of errors within scaffolds
Errors in scaffolds of the initial draft assembly were identified and corrected following the Aedes aegypti’s de novo assembly procedure (Dudchenko, Batra et al. 2017). Briefly, the errors were corrected by identifying the bins where a scaffold’s long-range contact pattern changes abruptly, which is unlikely for a correct scaffold. We cut out the error bins as a new scaffold. There are 259 errors within scaffolds. The list of error bins is presented in Supplemental.1.
2 Genome assembly by Lachesis
2.1 Clustering
Lachesis is a computational method that exploits Hi-C data sets for de novo genome assemblies (Burton, Adey et al. 2013). Hi-C data has two classical models, Hi-C interactions within one chromosome are distinctly more than it between two chromosomes, and Hi-C interactions between two loci are inversely proportional to their distance. Based on the first model, Lachesis clustered the scaffolds into 13 groups by agglomerative hierarchical clustering (Table 4). A total of 2,883 scaffolds were clustered successfully.
2.2 Ordering and orientation
In each clustered group, an acyclic spanning tree was built with vertexes corresponding to the scaffolds, while edge weights representing the normalized Hi-C interactions between pairs of scaffolds. A total of 1,328 scaffolds (744,578,885 bp, representing 99.72% of the total length) were ordered by Lachesis, among which 697 scaffolds (724,960,878 bp, representing 97.37% of the total length) are “trunk” (Table 5). For each ordered group, the acyclic spanning tree was built to represent all of the possible ways to orient the scaffolds. Lachesis has built a scoring function based on the difference between forward and backward interaction. Highest score represented the maximum likelihood through predicting orientations for each of the scaffolds. Among 1,328 ordered scaffolds, 1,129 scaffolds (743,948,690 bp, representing 99.92% of the ordered length) were of high scores (Table 6). All of the “trunk” scaffolds were of high score.
3 Assembling results
The genome of Gossypium raimondii were reassembled using Hi-C data (Supplemental.2). About 98.42% of total sequence length was clustered successfully, among which 99.72% and 97.37% of the clustered sequence were ordered and high-quality ordered, respectively. And approximately 99.92% of the ordered sequence was oriented with high-quality. The statistics of pseudo-chromosome length is shown in Table 7,while the indicator statistics of the initial draft genome and reassembly results are shown in Table 8 and 9, respectively. From the parameters like scaffolds number, N50 and N60 of previously draft genome and reassemble genome, we found that the Gossypium raimondii genome using a Hi-C-based proximity-guided assembly is clearly much better than the reported draft genome (Wang et al., 2012).
Further, the results were also verified by heat-map (Fig. 3) and collinear-analysis (Fig.4). The heat-map directly proves the validity of the processing methods. Based on the two regular models as we mentioned above, the heat-map of the reassembled results revealed that the diagonal interaction is much higher. The boundaries of each pseudo-chromosome are relatively clear and it is under low background noise that shows good reassembling results (Fig. 3). In addition, the collinear relationship between the draft (Wang et al., 2012) and reassembled genomes (Fig. 4) showed that the current reassembled genome is quite different from the previous one. Further when we compared our results with another version of Gossypium raimondii genome (Paterson et al., 2012), results showed that current reassembled genome is improved with respect to both quality and integrity (Fig. 5).