Tandem repeats structure of gel-forming mucin domains could be revealed by SMRT sequencing data

Mucins are large glycoproteins that cover and protect epithelial surface of the body. Mucin domains of gel-forming mucins are rich in proline, threonine, and serine that are heavily glycosylated. These domains show great complexity with tandem repeats, thus make it difficult to study the sequences. With the coming of single molecule real-time (SMRT) sequencing technologies, we manage to present sequence structure of mucin domains via SMRT long reads for gel-forming mucins MUC2, MUC5AC, MUC5B and MUC6. Our study shows that for different individuals, single nucleotide polymorphisms could be found in mucin domains of MUC2, MUC5AC, MUC5B and MUC6, while different number of tandem repeats could be found in mucin domains of MUC2 and MUC6. Furthermore, we get the sequence of MUC2, MUC5AC, and MUC5B mucin domain in a Chinese individual for each nucleotide at accuracy of possibly 99.98–99.99%, 99.93–99.99%, and 99.76–99.99%, respectively. We report a new method to obtain DNA sequence of gel-forming mucin domains. This method will provided new insights on getting the sequence for Tandem Repeat parts which locate in coding region. With the sequences we obtained through this method, we can give more information for people to study the sequences of gel-forming mucin domains.

Downloading of whole human genome sequence read archive (SRA) data of a Chinese individual (HX1) and converting of SRA data to fasta sequence data. We downloaded whole human genome SRA data of a Chinese individual (HX1) from NCBI. There are two types of SRA data. One is PACBIO_ SMRT data from PacBio RS II, and the other is ILLUMINA data from Illumina Hiseq 2000. HX1 human genome assembly was made with Illumina Hiseq data which has 2.8 billion reads (N50 length 151) as well as PacBio SMRT data which has 44.2 million reads (N50 length 12.1 k). We only downloaded PACBIO_SMRT data and the size is 2773.6Gbp.
For the genome data of HX1 downloaded, we used perl and python scripts to convert SRA data to fasta sequence data (Supplementary S1 and S2). In the converting process, we only kept the reads which were longer than 5 k bases. The number of reads kept was 24,290,526. The number of bases kept was 284,581,254,229, and that was about 94.86X coverage. Assembly of mucin domain of MUC2, MUC5AC, MUC5B and MUC6 in HX1 with SMRT reads. Due to the TR character, it is hard for CLUSTALW to make a good alignment if only PTS exon was used. Therefore, it is essential to include the introns up-stream and down-stream.

Obtaining of mucin domain of
For MUC2, in all the SMRT reads downloaded, 4 reads could be found to cover introns both up-stream and down-stream of mucin domain exon. In human MUC2 mucin domain, there is a CysD domain in the middle of a domain which is full of proline, threonine and serine (PTS). Although TR structure of PTS domain makes it impossible to be identified accurately with similarity search, CysD domain could be precisely found. Therefore, we took use of the CysD domain in the middle of PTS exon. We regarded the last "cysteine" in CysD domain as delimiter. The first part could contain the intron up-stream, PTS TRs, and the CysD island, and the second part could contain the CysD island, PTS TRs, and the intron down-stream. Thus we searched for two types of read: I. For CLUSTALW, we set gap_opening_penalty and gap_extension_penalty to 0. The reason is that if you consider to align two sequences with different TRs numbers, the default numbers of gap_opening_penalty and gap_extension_penalty in CLUSTALW will cause the interruption of one complete repeat unit. However, the alignment of two common sequences (none TR) has no such problem. The length of mucin domains of MUC5AC of Refseq, Korean and American assembly are 10,371 bases, 10,576 bases and 11,196 bases, respectively. We regard the mucin domain of MUC5AC in Refseq as the most accurate assembly among all the three assemblies by similarity search.

Results
The length of mucin domains of MUC5B of Refseq, Korean and American assembly are 10,893 bases, 11,589 bases and 10,772 bases, respectively. We regard the mucin domain of MUC5B in Refseq as the most accurate assembly among all the three assemblies by similarity search.
The lengths of mucin domains in MUC6 of Refseq, Korean and American assembly are 3009 bases, 13,299 bases and 8727 bases, respectively. For the American mucin domain of MUC6, the beginning sequence goes into gap region. We regard the mucin domain of MUC6 in Korean individual as the most complete assembly among the three assemblies by similarity search.
We could see that even by length, these domains in each of the MUC2, MUC5AC, MUC5B and MUC6 make some differences (Table 1).
Programing pipelines to get consensus sequence with SMRT reads. The PacBio SMRT data of HX1 human genome was used to get the mucin domains. Since the whole data set has 44.2 million reads, the data with a size of several tens of Terabytes was dealt with. Therefore, it is essential to make an efficient pipeline to obtain the reads which could be used to assemble the mucin domains.
The SMRT reads downloaded were bax.h5 files and bas.h5 files. We transferred them into fastq files for further treatments. We used quality control and length control methods on those fastq files with programming scripts (Supplementary S3 and S4). By similarity search with BLAST to the Refseq sequences of MUC2, MUC5AC, MUC5B and MUC6 which could be downloaded from NCBI in both DNA and protein level, we managed to find reads which could cover each mucin domain 17 . Then we used multiple alignment method CLUSTALW to get an alignment 18 (Fig. 1).
In the alignment, for each column we took the nucleotide (including insertion, i.e. A/T/C/G/-) which appears maximum number of times. Then we removed all the insertions and got the consensus sequence (Supplementary S5). Next we aligned back the consensus to all reads, and corrected the errors which were caused by several continuous insertions or same nucleotides (Fig. 1). For example, in a fragment with three '-'s and one ' A' , some reads will give '---A' and some will give ' A ---' . Then ' A' will be replaced by '-' and later be removed in the consensus, and this will cause a missing of a nucleotide. Same principle, several continuous same nucleotides Table 1. Length comparison of mucin domain exons of MUC2, MUC5AC, MUC5B and MUC6 in Refseq, Korean assembly, American assembly, HX1 own assembly and HX1 assembly with our pipeline. We can see that the length of the domains in each mucin make differences, and the domain constructed from our pipeline has the more length than HX1 own assembly for each mucin. www.nature.com/scientificreports/ with an insertion might cause a redundant nucleotide in the consensus (Fig. 2B). This problem could cause a frame shift, and we managed to correct it according to the translation result. If we regard the error rate of SMRT as average 15-30% 20 , for each position, the average error rate is 0.15-0.3 to the power of the number of reads which could be aligned in this position.
MUC2 mucin domain assembly result and TR structure. Altogether 31 reads could be used to build MUC2 mucin domain DNA sequence ( Fig. 2A). In each position at least 7 reads could be aligned, thus for each nucleotide the average error rate is 0.15-0.3 to the power of 7 and the accuracy is 99.98-99.99%. The whole mucin domain exon of MUC2 in HX1 has 8994 bases. The protein sequence of up-stream part of MUC2 mucin domain in HX1 has 2 CysD domains at the beginning and the end ( Fig. 2A). They have 95 and 97 amino acids, respectively. Between them is the PTS TR up-stream part (Fig. 3A). The TR lengths vary a lot. We define each TR with a symbol "PS" at the start of it. Therefore, the PTS TR up-stream part has 28 TRs. The shortest TR has 7 amino acids and the longest TR has 26 amino acids. The protein sequences of PTS TR up-stream part as well as two CysD domains of MUC2 mucin domain in HX Figure 1. Programing pipeline to get consensus with SMRT reads. For SMRT reads from HX1, original files downloaded were bax.h5 files and bas.h5 files. These files were transferred into fastq files. Then they were trimmed by quality control and length control methods. Meanwhile, DNA sequence of mucin domain exons of MUC2, MUC5AC, MUC5B and MUC6 were extracted from Refseq by comparing with their protein sequences which were also downloaded from NCBI. Next the similarity search were performed using both SMRT reads and extracted Refseq to get reads which could be used to build the mucin domain. After multiple sequence alignment and maximum number count, original consensus sequences could be obtained. After frame adjustment, final consensus sequences were constructed. TRs. Comparing with that in HX1, it has 4 more TR CNVs directly after 89th repeat and 18 SNPs at 15th, 18th, 25th, 25th, 38th, 38th, 39th, 39th, 39th, 40th, 40th, 40th, 41st, 41st, 42nd, 42nd, 60th, and 104th repeat, respectively (Fig. 3C,D).
MUC5AC mucin domain assembly result and TR structure. Altogether 10 reads could be used to build MUC5AC mucin domain DNA sequence (Fig. 4A). In each position at least 6 reads could be aligned, thus for each nucleotide the average error rate is 0.15-0.3 to the power 6 and the accuracy is 99.93-99.99%. The whole mucin domain exon of MUC5AC in HX1 has 10,371 bases.
The protein sequence of MUC5AC mucin domain in HX1 has one main head, one main tail, 6 CysD domains, 2 Long Tandem Repeat (LTR) groups, 4 Short Tandem Repeat (STR) groups and 1 unique short piece (Fig. 4A). The main head is composed of a PTS domain of 45 amino acids long and a CysD like domain of 99 amino acids long (Fig. 4B). The main tail is composed of a PTS domain of 50 amino acids long and a short piece of 12 amino acids long (Fig. 4C). For all 6 CysD domains, each has 105 amino acids and locates after each LTR/STR group (Fig. 4L). For 2 LTR groups, each is composed of one PTS domain of 95 amino acids long, one CysD like domain of 101 amino acids long, and one PTS domain of 65 amino acids long (Fig. 4E). Other than one CysD domain, there is one small PTS piece of 7 amino acids long between the 2 LTR groups (Fig. 4D). For 4 STR groups, each has one PTS head of 36 amino acids long and one PTS tail of 13 amino acids long (Fig. 4J,K). 1st, 2nd, 3rd and 4th STR group have 119, 18, 35, and 65 STRs, respectively (Fig. 4F-H,J). Each repeat has 8 amino acids except that 14th repeat of 3rd STR group has only 7 amino acids (Fig. 4F-I). As the delimiters of LTR/STR groups, 6 CysD domains are nearly identical (Fig. 4L).
The protein sequence of MUC5AC mucin domain in NCBI (Nucleotide accession number NM_001304359.1; Protein accession number NP_001291288.1) has same length and TR structure as that in HX1. There are only 3 SNPs. One is in 99th repeat in 1st STR group; another two are in 3rd and 4th CysD, respectively (Fig. 4M). www.nature.com/scientificreports/  SNPs could be found in PTS3 tail, one SNP could be found in 11th repeat of PTS5, one SNP could be found in 4th repeat of PTS6, one SNP could be found in 7th repeat of PTS7, one SNP could be found in 8th repeat of PTS7 and one SNP could be found in 9th repeat of PTS7. www.nature.com/scientificreports/ MUC5B mucin domain assembly result and TR structure. Altogether 9 reads could be used to build MUC5B mucin domain DNA sequence (Fig. 5A). In each position at least 5 reads could be aligned, thus for each nucleotide the average error rate is 0.15-0.3 to the power 5 and the accuracy is 99.76-99.99%. The whole mucin domain exon of MUC5B in HX1 has 10,893 bases. The protein sequence of MUC5B mucin domain in HX1 has one main head, one main tail, 1 CysD-similar domain, 6 CysD domains, and 7 PTS domains (Fig. 5A). The main head is composed of a small piece of 8 amino acids long (Fig. 5B). The main tail is composed of a small piece of 12 amino acids long (Fig. 5C). The CysDsimilar domain has 100 amino acids (Fig. 5M). 2nd CysD domain has 102 amino acids (Fig. 5L). For other 5 CysD domains, each has 101 amino acids (Fig. 5L). For all 7 CysD and CysD-similar domains, each locates before one PTS domain (Fig. 5A). The first 2 PTS domains have no repeats, but a long piece of 70 and 180 amino acids, respectively (Fig. 5D,E). Each of the last 5 PTS domains has several STRs and one PTS tail (Fig. 5F-K,N). The number of STRs of the bodies of 3rd, 4th, 5th, 6th, and 7th PTS domain are 10, 11, 16, 11, and 22, respectively ( Fig. 5F-J). For all the STRs in the bodies of last 5 PTS domains, 5 have 24 amino acids, respectively; 8 have 26 amino acids, respectively; 8 have 28 amino acids, respectively; 48 have 29 amino acids, respectively; one has 34 amino acids (Fig. 5F-J). The PTS tails of 3rd, 4th, 5th and 6th PTS domains are homologous and they all have 147 amino acids, respectively (Fig. 5K). The PTS tail of 7th PTS domain has 87 amino acids (Fig. 5N). As the delimiters of 7 PTS domains, 2nd, 3rd, 4th, 5th, and 6th CysD domains are nearly identical (Fig. 5L).
The protein sequence of MUC5B mucin domain in NCBI (Nucleotide accession number NM_002458.2; Protein accession number NP_002449.2) has same length and TR structure as that in HX1. There are only 7 SNPs. Three are in PTS tail of 3rd PTS domain, one is in 11th repeat of 5th PTS domain, one is in 4th repeat of 6th PTS domain, and three are in 7th, 8th and 9th repeats of 7th PTS domain, respectively (Fig. 5O).

MUC6 mucin domain assembly result and TR structure.
Altogether only 3 reads could be used to build MUC5B mucin domain DNA sequence (Fig. 5A). Therefore, it is impossible to get the exactly correct nucleotide in each position. However, due to the TR structure, the number of TRs and the lengths of each TR could be obtained. The MUC6 Refseq has all the non-TR part of mucin domain, thus we can use this as the template to get the whole mucin domain exon of MUC6 in HX1 which has been identified to have 13,470 bases by us.
The protein sequence of MUC6 mucin domain inHX1 has one head, one tail and 27 TRs (Fig. 6A). The head has 60 amino acids and the tail has 265 amino acids (Fig. 6B,C). 27 TRs could be found between the head and the tail. 1st, 2nd, 3rd, 4th, 5th, 7th, 8th, 9th, 12th, 13th, 14th, 18th, 22nd, and 26th TRs have 169 amino acids, respectively. This number is most of the case among all TRs, thus we call this type of TRs "typical TR". 6th TR has 171 amino acids, and there is a "TG" insertion comparing with the typical TR. 10th, 11th, 15th, 19th, and 23rd TRs have 168 amino acids, respectively, and there is a deletion comparing with the typical TR. 16th, 20th, and 24th TRs have 150 amino acids, respectively, and they are first 150 amino acids of the typical TR. 17th, 21st, and 25th TRs have 74 amino acids, respectively, and they are last 74 amino acids of the typical TR. 27th TR has 115 amino acids, and it is the first 115 amino acids of the typical TR (Fig. 6D).
The protein sequence of MUC6 mucin domain in NCBI (Nucleotide accession number NM_005961.2; Protein accession number NP_005952.2) only has the head, 1st TR, first 33 amino acids of 2nd TR, last 117 amino acids of 24th TR, 25th TR, 26th TR, 27th TR, and the tail (Fig. 6E). Since we cannot be sure the exact nucleotide in each position due to only 3 reads available, we cannot say SNP information.

Estimation of number of TRs in down-stream part of mucin domain of MUC2 for another individual.
In the result from the pipeline, all frameshifts are caused by several continuous same nucleotides. In HX1, in the DNA sequence of TRs in down-stream part of mucin domain of MUC2 mucin domain, no continuous multiple "T"s could be found other than two SNPs at 46th and 96th TR, respectively, which cause two continuous "T"s ( Fig. 7). Therefore, the number of "T"s in the TR part from pipeline consensus could be used to estimate the number of repeats without arranging each frameshift (Table 2).
In down-stream part of mucin domain of MUC2 for HX1, the number of "T"s keeps same after adjustment. For each repeat, in most cases "T" appears 2 or 3 or 4 or 5 times, in some less number of cases "T" appears 6 times, and only once "T" appears 7 times. Therefore, for any individual, after checking "T" number we could get repeat number roughly by comparing with that in HX1. In HX1, the down-stream part has 101 TRs and 364 "T"s. 46th and 96th TR have one "TT", respectively. Since in common repeats we cannot find "TT", we regard such cases as SNPs and count two continuous "T"s as one. Therefore "T" number of 362 shall correspond to repeat number of 101, and on average each repeat has 3.58 "T"s (Fig. 7). If we divide "T" number difference by 3.58, we can roughly get repeat number difference. For instance, DNA sequence of most complete MUC2 protein in NCBI (accession number NP_002448.4) has 13 more "T"s and 4 more repeats. However, some SNPs (T/A, T/C, or T/G) might affect T number difference. Anyway a roughly estimation of repeat number could be obtained in this way (Fig. 8).

Discussion
The DNA sequences of mucin domains of human gel-forming mucin genes MUC2, MUC5AC, MUC5B and MUC6 consist of long tandem repeats that are difficult to read and measure. This lack of essential sequence information has left these genes mysterious with huge amount of gaps, which has hindered people for understanding gene function in health and disease 21 . To fill in these gaps, it is essential to use the correct way of assembly with SMRT long sequence reads which could cover the mucin domains. www.nature.com/scientificreports/ Other than 2 "TT" SNPs at 17th and 96th TR, no continuous "T"s could be found in this part. The number difference of "T"s between two MUC2 mucin domain DNA sequences could be used to estimate the number difference of TRs between these two domains. www.nature.com/scientificreports/ In HX1 own assembly, mucin domains of MUC2, MUC5AC, MUC5B and MUC6 could not be found in one cluster 13 . MUC2 mucin domain could not be found at all. MUC5AC mucin domain and MUC5B mucin domain could be found in one contig with 10,368 bases and 10,789 bases, respectively. MUC6 mucin domain could be found in another contig with 11,992 bases 13 . However, with our pipeline, we got the DNA lengths of mucin domains of MUC2, MUC5AC, MUC5B and MUC6 at 8994 bases, 10,371 bases (same as Refseq), 10,893 bases (same as Refseq), and 13,470 bases, respectively (Table 1). Moreover, for each of the gel-forming mucins, mucin domain exon takes big proportion of whole coding sequence part, and the amino acids translated are highly glycosylated to fulfill protection function, thus make this exon much more important (Supplementary S6-S9). Therefore, the accuracy of deciphering mucin domain exons is of great significance, and our result could provide more precise information than HX1 own assembly (Fig. 9).
Through the method we developed, we managed to generate full and reliable sequences which have verified to be extremely hard to obtain in the past 5,6,11,21 . With this we could reveal the exact number of TRs in gel-forming mucin domains in a rather quick way. Therefore, people could try to seek the link between mucin domain TR numbers and the corresponding human traits in large populations, and this will be helpful to study genetic associations with many mucus related diseases such as pancreatic cancer, breast cancer, lung cancer, egg groove cancer, colon cancer, asthma, bronchitis, chronic obstructive pulmonary disease, pulmonary cyst fibrosis, gastric ulcer, edema, etc. Table 2. Nucleotide statistics of MUC2 mucin domain down-stream TR part. We can see that the number of single letter "T" keeps same after adjustment. Therefore that number could be used to estimate the number of tandem repeats. www.nature.com/scientificreports/