BAC libraries remain a powerful resource for genome assembly, functional genomics research, and long-term storage of genetic resources for endangered species. BAC sequences, especially paired BESs, are commonly used to detect assembly errors or to assist assembly [12]. The most conventional and convenient way to obtain BESs is Sanger sequencing. However, its clone-by-clone workflow is laborious, time-consuming and expensive. Fortunately, several efficient approaches based on next-generation sequencing technologies have been developed, such as pBACode and BAC-anchor [13, 14]. pBACode determines paired BESs by means of a pair of random barcodes flanking the cloning site. BAC-anchor determines paired BESs by using specific restriction enzyme sites and searching for ultra-long paired-end subreads containing large internal gaps. We also previously developed a high-throughput approach for long, accurate BES profiling with PacBio sequencing technology [15]. However, these approaches focus on paired BES profiles and cannot readily trace the BESs of specific clones back to their wells in 384-well plates. In this study, we generated BESs and assigned them to physical wells in 384-well plates by exploiting the row-by-column intersections of two-dimensional arrays and the cost-effective Illumina platform. Because they were assembled with the SPAdes assembler [16], these BESs are on average longer and more accurate than those generated by the Sanger method. After alignment, 89.65% of clones were successfully associated with the broomcorn millet Longmi 4 genome. These clones covered 308 of the 829 gaps remaining in the genome assembly and can be used to close them.
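The row-by-column idea can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function name, the pool labels and the BES identifiers are hypothetical, and a BES is simply treated as an identifier that was detected in one or more row pools and column pools. A BES seen in exactly one row pool and one column pool is assigned to the well where they cross; anything else is reported as ambiguous.

```python
from collections import defaultdict
from itertools import product


def assign_bes_to_wells(row_pools, col_pools):
    """Assign each BES to the well where its row pool and column pool cross.

    row_pools / col_pools map a pool label (e.g. "A" or 7) to the set of
    BES identifiers detected in that pool. The layout is hypothetical.
    """
    rows_of = defaultdict(set)
    cols_of = defaultdict(set)
    for row, bes_ids in row_pools.items():
        for bes in bes_ids:
            rows_of[bes].add(row)
    for col, bes_ids in col_pools.items():
        for bes in bes_ids:
            cols_of[bes].add(col)

    assigned, ambiguous = {}, {}
    # Only BESs seen in at least one row pool AND one column pool can be placed.
    for bes in rows_of.keys() & cols_of.keys():
        candidates = sorted(product(rows_of[bes], cols_of[bes]))
        if len(candidates) == 1:
            assigned[bes] = candidates[0]   # unique (row, column) intersection
        else:
            ambiguous[bes] = candidates     # several candidate wells
    return assigned, ambiguous


# Hypothetical toy data: two row pools and three column pools.
row_pools = {"A": {"bes1", "bes2"}, "B": {"bes2", "bes3"}}
col_pools = {1: {"bes1"}, 2: {"bes2"}, 3: {"bes2", "bes3"}}
assigned, ambiguous = assign_bes_to_wells(row_pools, col_pools)
```

In this toy input, `bes2` occurs in two row pools and two column pools, so it has four candidate wells and is left unassigned; a sequence shared by overlapping clones would look like this.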
In a conventional whole-genome sequencing project, sequencing coverage is usually at least 30x and no more than 200x. Higher coverage yields more reads carrying PCR-induced mutations or sequencing errors, which leads to contigs with a shorter N50. To greatly reduce the sequencing cost, we added an index to each secondary pool and combined 96 secondary pools into a single sequencing library that was run on one lane of an Illumina flow cell. After removal of contaminating E. coli genomic DNA reads, the average valid sequencing depths of the row pools and column pools were 45x and 78x, respectively.
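As a rough illustration of how a per-pool depth of this kind can be estimated, the sketch below divides the bases in the valid reads (those remaining after E. coli reads are removed) by the total insert length of the clones in the pool. The function name and all input values are hypothetical and are not the figures from this study; the calculation also ignores the vector backbone.

```python
def pool_depth(n_valid_reads, read_length, n_clones, insert_size):
    """Mean sequencing depth over the BAC inserts of one pool.

    n_valid_reads: reads left after removing host (E. coli) reads
    read_length:   read length in bases
    n_clones:      number of BAC clones combined in the pool
    insert_size:   average insert size in bases
    """
    total_bases = n_valid_reads * read_length
    target_bases = n_clones * insert_size
    return total_bases / target_bases


# Hypothetical example: 900,000 valid 150-bp reads for a pool of 24 clones
# with an average insert size of 120 kb.
example_depth = pool_depth(900_000, 150, 24, 120_000)
```

With the same number of valid reads, a pool containing fewer clones reaches a proportionally higher depth.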
For BES extraction, we designed two pathways: a short BES pathway and a long BES pathway. The short BES pathway searched for all reads overhanging the vector end sequences before assembly with Cap3; consequently, it recovered all potential BESs. The long BES pathway identified all contigs overhanging the vector end sequences after assembly with SPAdes; consequently, it produced longer but fewer BESs than the short BES pathway. The assignment of BESs to intersection wells is affected by many factors, such as the degree of overlap among BAC clones and the accuracy of the sequences. The overlap among BAC clones within a superpool is the most important factor. If overlapping BAC clones occur in the same row or column pool, we can assign them to wells easily and correctly. If two overlapping BAC clones occur in different row or column pools, we can resolve them with our previous method [17]. However, if more than two overlapping BAC clones occur in different row or column pools, our method filters out the potential BESs, so the BESs of the intersection well will be missing. During BES assignment, the forward and reverse BESs are processed completely independently. When more than one forward and/or reverse BES is assigned to a well, we cannot determine which BESs form a pair. If a high-quality genome is available, it is easy to determine by mapping which BESs derive from the same BAC. However, if the variety used for BAC library construction differs from that of the reference genome, alignments that do not satisfy the positional requirements for BES pairs will be discarded. Clones that may contain a large structural variation can then be picked from the 384-well plates for further analysis.
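The pairing step by mapping can be sketched as follows. This is a simplified illustration, not the exact criteria used in this study: the `Hit` record, the function name and the span thresholds are hypothetical. A forward and a reverse BES assigned to the same well are accepted as a pair only if they map to the same chromosome, on opposite strands, at a distance compatible with a BAC insert.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    bes_id: str
    chrom: str
    pos: int      # leftmost aligned coordinate on the reference
    strand: str   # "+" or "-"


def consistent_pairs(forward_hits, reverse_hits,
                     min_span=50_000, max_span=250_000):
    """Return (forward_id, reverse_id, span) for plausible BES pairs.

    min_span / max_span are assumed bounds on the insert size, in bases.
    """
    pairs = []
    for f in forward_hits:
        for r in reverse_hits:
            # Both ends of one clone must lie on the same chromosome
            # and map to opposite strands.
            if f.chrom != r.chrom or f.strand == r.strand:
                continue
            span = abs(r.pos - f.pos)
            if min_span <= span <= max_span:
                pairs.append((f.bes_id, r.bes_id, span))
    return pairs


# Hypothetical well with two forward and two reverse BESs.
forward = [Hit("F1", "chr1", 1_000_000, "+"), Hit("F2", "chr3", 500_000, "+")]
reverse = [Hit("R1", "chr1", 1_120_000, "-"), Hit("R2", "chr1", 5_000_000, "-")]
well_pairs = consistent_pairs(forward, reverse)
```

Combinations that fail these checks are dropped here; in practice they could either be mis-assigned BESs or clones spanning a structural variation relative to the reference.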
Plant cells contain abundant chloroplasts and mitochondria. The chloroplast genome is generally around 150 kb, whereas the mitochondrial genome varies widely in size, typically between 200 kb and 750 kb [18]. Although nuclei are isolated for BAC library construction, trace contamination with organelle DNA is inevitable. In this study, the BESs revealed a small number of clones derived from the chloroplast genome, whereas clones derived from the mitochondrial genome were almost absent.
By high-throughput sequencing and mapping of the clones in the secondary pools of the BAC libraries, we can determine the genomic coordinates of the clones in the broomcorn millet BAC libraries. Therefore, when a gene of biological importance is identified, we can quickly locate the BAC clones containing it and obtain experimental material for further research and analysis. In addition, because a BAC carries a long DNA fragment (about 120 kb), it is also convenient for rapidly analyzing the DNA elements upstream and/or downstream of genes of interest or of adjacent genes.
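Such a lookup can be sketched in a few lines of Python, assuming that each mapped clone is stored with its well and its genomic interval. The `Clone` record, the well labels and the coordinates below are hypothetical. The optional `flank` argument requires the clone to also cover a given amount of upstream and downstream sequence around the gene.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    well: str     # plate and well label, e.g. "P01_A01" (hypothetical format)
    chrom: str
    start: int    # genomic start of the insert, from the mapped BES pair
    end: int      # genomic end of the insert


def clones_containing(clones, chrom, gene_start, gene_end, flank=0):
    """Return wells of clones that fully cover a gene plus optional flanks."""
    lo, hi = gene_start - flank, gene_end + flank
    return [c.well for c in clones
            if c.chrom == chrom and c.start <= lo and c.end >= hi]


# Hypothetical clone table and gene interval.
clones = [
    Clone("P01_A01", "chr1", 100_000, 220_000),
    Clone("P01_B07", "chr1", 180_000, 300_000),
    Clone("P02_C03", "chr2", 150_000, 270_000),
]
hits = clones_containing(clones, "chr1", 190_000, 195_000)
hits_with_flank = clones_containing(clones, "chr1", 190_000, 195_000,
                                    flank=20_000)
```

For a library with many thousands of clones, an interval index would be faster than this linear scan, but the selection logic stays the same.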