A Python script to merge Sanger sequences

Merging Sanger sequences is frequently needed during the gene cloning process. In this study, we provide a Python script that is able to assemble multiple overlapping Sanger sequences. The script utilizes the overlapping regions within the tandem Sanger sequences to merge the Sanger sequences. The results demonstrate that the script can produce the merged sequence from the input Sanger sequences in a single run. The script offers a simple and free method for merging Sanger sequences and is useful for gene cloning.


Introduction
Gene cloning is routinely required to study gene function in vivo and in vitro. In most cases, the gene of interest is ligated into a vector either by the canonical method of restriction enzyme digestion and ligation or by the newer technique of seamless ligation [1]. To ensure that the gene sequence within the constructed plasmid is correct, Sanger sequencing is used to sequence the gene, and the results are aligned and checked [2].
For genes only several hundred base pairs long, the whole sequence can be covered by a single Sanger sequencing reaction, and the result can be aligned directly to the correct gene file. In other cases, however, the gene exceeds one thousand base pairs, which is beyond the range of a single Sanger sequencing reaction. To obtain the sequence of the full-length gene, it is necessary to carry out DNA walking sequencing using a new primer based on the previously sequenced region. To sequence a large gene, several reactions might be required in both forward and reverse directions. These results have to be aligned to the correct gene file for confirmation.
It is preferable to merge all the walking results before alignment with the correct gene file, rather than aligning each walking result with the target gene file separately. Several commercial software packages can merge multiple walking results, such as DNASTAR [3], DNAMAN [4] and Vector NTI [5]; they are powerful but expensive. In terms of free software, there is an online tool that merges overlapping long sequence fragments based on the program merger in the EMBOSS suite [6,7]. However, the web tool depends on web access and server status. Here, we introduce a Python script that is capable of merging multiple overlapping Sanger sequencing files by employing the alignment module of Biopython.

Running environments and input files
The input files are in the seq format. Our script requires that the first four characters of the file name consist of three digits (000, 001, 002, 003,…) plus one capital letter, F or R (F for forward Sanger sequencing and R for reverse Sanger sequencing). The user adds these four characters to the front of the original file name provided by the sequencing company, according to the position of the Sanger sequencing result, with smaller numbers upstream and larger numbers downstream along the sense strand of the full-length sequence.
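The file naming convention can be checked programmatically before merging. The following is a minimal sketch and not the code of Merge_Sanger_v2.py: the helper names `parse_prefix`, `order_reads` and `reverse_complement` are hypothetical, and reverse-complementing the R reads is an assumption about how reverse reads are brought to the 5'-3' direction of the sense strand.

```python
import re
from pathlib import Path

# Hypothetical pattern for the required prefix: three digits plus F or R.
PREFIX_PATTERN = re.compile(r"^(\d{3})([FR])")

# Translation table for complementing unambiguous bases and N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_prefix(filename):
    """Return (position, direction) from a name such as '001F_sample.seq'."""
    match = PREFIX_PATTERN.match(Path(filename).name)
    if match is None:
        raise ValueError(f"{filename!r} lacks the required prefix, e.g. '000F'")
    return int(match.group(1)), match.group(2)


def order_reads(filenames):
    """Sort file names from upstream to downstream by their numeric prefix."""
    return sorted(filenames, key=lambda name: parse_prefix(name)[0])


def reverse_complement(sequence):
    """Assumed preprocessing for R reads: reverse-complement to the sense strand."""
    return sequence.translate(COMPLEMENT)[::-1]
```

Validating the prefix up front gives an early, readable error for a misnamed file instead of a confusing alignment failure later in the run.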

Running script
To run the script, open the macOS terminal, cd to the directory containing the script (S2, see Supplementary Materials) and the Sanger sequences, and enter the following bash command.
1) Make a new directory and copy the sequencing result files of the above section (Running environments and input files) to the newly made directory. Also, copy the script 'Merge_Sanger_v2.py' to the directory.
2) Open a new terminal and cd to the directory. Run the following command.
The Sanger sequencing result files must be given in ascending order of the first three digits of the file name. It is convenient to type the first three digits and then use the tab key to fill in the rest of the file name automatically. Our script is able to merge dozens of Sanger sequencing result files, which is more than enough for typical lab needs.
3) After the run finishes, there is a folder named 'merged_sequence', which contains the merged sequence file with the name 'merged.seq'.
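The input and output handling in the steps above can be sketched as follows. This is an illustrative assumption rather than the code of Merge_Sanger_v2.py: the function names `read_seq` and `write_merged` are hypothetical, and the .seq files are assumed to be plain text containing only the base calls.

```python
from pathlib import Path


def read_seq(path):
    """Read a plain-text .seq file and return the bases as one uppercase string."""
    text = Path(path).read_text()
    # Drop line breaks and spaces that may be present in the file.
    return "".join(text.split()).upper()


def write_merged(sequence, directory="."):
    """Write the merged sequence to merged_sequence/merged.seq under directory."""
    out_dir = Path(directory) / "merged_sequence"
    out_dir.mkdir(exist_ok=True)
    out_file = out_dir / "merged.seq"
    out_file.write_text(sequence + "\n")
    return out_file
```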

Algorithm
Our algorithm takes advantage of the overlapping regions between the tandem Sanger sequences, which are commonly no less than 200 bp (Fig. 2). The alignment module of Biopython is called to conduct the needle alignment of two consecutive Sanger results. The script then parses the output file of the needle alignment to find the first line with full consensus, which is 50 bp long, and obtains from it the critical coordinates in the corresponding Sanger sequences. These critical coordinates are then used to extract the bases from each Sanger sequence that are joined into the full-length sequence. After a run, the merged Sanger sequence 'merged.seq' is stored in a new folder named 'merged_sequence' in the working directory (Fig. 3).
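The junction search can be illustrated with a short sketch. It is not the code of Merge_Sanger_v2.py: the function names `first_consensus_line` and `merge_pair` are hypothetical, the inputs are assumed to be the two gapped rows of a global alignment that has already been computed, and placing the junction at the first column of the consensus line is an assumption, since only the use of that line to derive the coordinates is specified.

```python
def first_consensus_line(aligned_a, aligned_b, width=50):
    """Return the start column of the first gap-free, fully matching line.

    The two arguments are the gapped rows of one pairwise alignment. Columns
    are scanned in blocks of `width`, mirroring the 50-column lines of a
    needle report. Returns None if no such line exists.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    for start in range(0, len(aligned_a) - width + 1, width):
        block_a = aligned_a[start:start + width].upper()
        block_b = aligned_b[start:start + width].upper()
        if "-" not in block_a and block_a == block_b:
            return start
    return None


def merge_pair(aligned_a, aligned_b, width=50):
    """Join the upstream read (a) and downstream read (b) at the consensus line.

    Assumption: the junction is placed at the first column of the consensus
    line, so bases before it come from a and bases from it onward come from b.
    """
    start = first_consensus_line(aligned_a, aligned_b, width)
    if start is None:
        raise ValueError("no fully matching line found in the alignment")
    # Convert the alignment column into coordinates on the ungapped reads.
    coord_a = start - aligned_a[:start].count("-")
    coord_b = start - aligned_b[:start].count("-")
    read_a = aligned_a.replace("-", "")
    read_b = aligned_b.replace("-", "")
    return read_a[:coord_a] + read_b[coord_b:]
```

Because the junction lies inside a line where both reads agree at every position, no disputed base has to be resolved when the two fragments are joined.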

Results
In addition, the screen output shows the needle alignment and the critical junction of the consensus region between two adjacent Sanger sequences (Fig. 4), which clarifies the working process of merging. The merged sequence is also shown on the screen. Our results showed that the script runs quickly and that the output is clear.

Discussion
Merging Sanger sequences is routinely required during gene cloning. The EMBOSS command merger is capable of generating a merged sequence based on the global alignment of Needleman and Wunsch [7]. When two overlapping Sanger sequences are merged, the global alignment might contain mismatches because the signal-to-noise ratio decreases during the later stage of the Sanger sequencing reaction and gel running. The EMBOSS merger resolves the disputed bases according to the local sequence quality score, which is rather complicated and often introduces unwanted bases.
To address this issue, we look for the first full consensus line within the output of the needle alignment and pick the fragments to join into the merged sequence according to the variation of the signal-to-noise ratio of the Sanger sequencing reaction, which avoids having to resolve the mismatched bases.

Conclusion
In sum, we provide a simple and direct method to merge Sanger sequences via Python programming, which can be run on a local computer and satisfies the needs of the gene cloning process.

Declarations
Compliance with Ethical Standards. Conflict of Interest: The authors declare no conflict of interest.

Supplementary Materials
The example input files and the Python script Merge_Sanger_v2.py can be found in Supplementary Materials.

Figure 1
The test files used for running the script.

Figure 2
A diagrammatic view of merging multiple Sanger sequences. The thick blue lines show the multiple (no less than two) Sanger sequences to be merged, with the color depth indicating the sequencing quality, which is high in the middle part and low at both ends. The thick orange lines represent the first line of full consensus (50 bp long) appearing in the needle output file for two consecutive Sanger sequences. The thin blue arrow above the thick blue line indicates the part to be extracted for joining to the full-length sequence, which is shown by the long thin blue arrow at the bottom. All the Sanger sequences are preprocessed to the 5'-3' direction, aligned and merged; thus the direction of all thin arrows is 5'-3'.

Figure 3
The directory showing the output file after merging the Sanger sequences.

Figure 4
The screenshots of the script run showing the partial output of the needle alignment (left panel) and the first consensus line and the merged DNA sequence (right panel).
