Complete Genome Sequencing and Evolutionary Analysis of HCV Subtype 6a Including Strains from Guangdong Province, China

In recent years, Hepatitis C virus 6a was not only the predominant subtype in Guangdong but also has the potential of increasing in non-Guangdong in China over time. To understand the virus evolution and transmission mode, the representative HCV 6a specimens in Guangdong province were chosen for deep sequencing and performed evolutionary analysis with all the 6a whole genome reference sequences. Our results showed that less than 5% of the genome was found to be under positive selection. The protein with the highest proportion of sites under positive selection was E1 and E2; A positive association between positive selection sites and the presence of CD8 epitopes was found in non-Guangdong genomes (χ 2 = 9.168, P < 0.05). The evolutionary rate of 6a whole genome was 9.59×10 − 4 s/s/y. E2 has the fastest evolutionary rate (9.40×10 − 3 s/s/y), followed by E1, NS3 and NS5A. Spatio-temporal evolutionary analysis showed that Vietnam and other Southeast Asian countries were the origins of global HCV6a. Further, there were two important transmission modes which were all originated from Vietnam and other Southeast Asian countries in Guangdong. One (Group B) indicated that HCV6a was then transmitted to Hong Kong, and then to Guangdong and cross-dissemination between these two places; Another (Group C) revealed that HCV6a was directly disseminate to Guangdong and formed a regional epidemic. Our study rstly revealed the transmission mode of HCV6a by whole genome sequence, indicating that the impact of selective constraints in Guangdong and non-Guangdong, it can be helpful to plan for future prevention and management of HCV6a infection in Guangdong, China. basic the population in similar, which resulting in a certain of infected virus Our study showed that E2 has the fastest evolutionary rate, roughly ten times than the majority partitioned genes of HCV 6a. It is well known that E2 contains three hypervariable regions (HVR), known as HVR1-3 23 , which are under continuous selection pressure exerted by the host immune system. Therefore, it is reasonable that E2 evolve fastest among all the partitional genes. The evolutionary rate of E1, NS3, NS5A and NS4A was the same as a rate typical of RNA viruses -3 s/s/y, almost equal to whole genome evolutionary rate. E1, an HCV gene, usually represent whole-genome for HCV genotyping, as the variability of E1 can approximately replace HCV whole-genome 22 . Our result also supported this insight. NS3 is the HCV protease and NTPase/helicase, NS4A acts as a cofactor for the NS3 protease 23 , which can explain the NS3 and NS4A have the same evolutionary rate. The evolutionary rate of NS2 was 9.94×10 -4 s/s/y, NS5B and NS4A slightly slower than NS2 gene. Core and 5’UTR had the slowest evolutionary rate among all the HCV 6a partitional genes. Study also showed that HCV core protein was the lowest variability and have most pan-genotypic consensus positions 24 . Combination with these two results, we speculated that was because HCV core protein forms viral capsid interacts with multiple cellular proteins, may a high level of conserved residues.


Introduction
HCV is classi ed into eight genotypes (1)(2)(3)(4)(5)(6)(7)(8) and subdivided into more than 90 subtypes named in alphabetical order from a to z due to a lack of proof reading activity by its viral RNA polymerase (https://talk.ictvonline.org/ictv_wikis/ aviviridae/w/sg_ avi/634/table-1). Among them, HCV genotype 6 was the most divergent genotype and predominantly distribute in South China and South-East Asian countries, the latter carrying the greatest burden of HCV gt6 in Asia [1][2][3] . Several of HCV gt6 subtypes were identi ed and classi ed only recently, suggesting that a considerable proportion of HCV diversity remains undescribed. HCV 6a, the rst subtype to be discovered, was endemic in Vietnam, Taiwan, Hongkong and South China [3][4][5] , especially in Guangdong Province, which accounts for more than half. Guangdong Province located in the southern part of China, is the center of cultural and commercial exchanges in domestic and international crossroads. It is the most populous province of China (126 million people), 32 million are migrants who are on business or temporarily hired as laborers in the myriad of factories with lack good hygiene conditions, leading to a frequently travel between Guangdong and their rural hometowns (http://www.stats.gov.cn/tjsj/pcsj/). Collectively, these factors created an environment to increase HCV prevalence not only in Guangdong but also in non-Guangdong. Our recently study showed that HCV 6a was not only most common in Guangdong Province, but also increased by labor exchange outside Guangdong Province in China 6 , indicating that HCV 6a was transmitted rapidly all over China through some possible propagation networks. In addition, another study also unraveled that the HCV subtype 6a could rapidly spread across China 7 .
It has been reported that transmission was an important force driving HCV viral evolution as the virus infected a novel host immune system, which alters the selection pressure on the viral genome 8, 9 . According to previous study, HCV genome evolves at ~ 0.001 substitutions/site/year (s/s/y), a rate typical of RNA viruses 10,11 . Studies have revealed that HCV genetic variability of the same gene in different genotype is different 10 . Furthermore, HCV genetic variability is not evenly distributed across the viral genome 7,12 . HCV whole genomes could maximize the evolutionary resolution 11 , detect positively selected sites along the genomes. Our aim in this study was to reveal HCV evolutionary history using the representative HCV 6a samples in Guangdong Province and archived reference sequences worldwide, acquire detailed view of the action of selection and how it may differentially affect variants of this virus in Guangdong and non-Guangdong in China.

Sample collection
Previously, a BDs Blood donors, BDs cohort of 100 HCV 6a were genotyped by E1 region, which revealed that HCV 6a was predominant in Guangdong Province 5 . Among them, 14 representative HCV 6a samples were selected according to the topology structure of maximum likelihood phylogenetic tree (supplementary Figure 1) by mega X. They were all treatmentnaïve. This study was approved by the Institutional Review Board at the Guangzhou Blood Center and the guidelines set by this board were strictly followed. Study protocols followed the ethical guidelines set in place by the 1975 Declaration of Helsinki and were approved by the Medical Ethics Committee of Guangzhou Blood Center.

Next generation sequencing
Total RNA was extracted using the AgencourtRNAdvance blood kit (Beckman Coulter) and then reverse transcribed using Superscript III (Invitrogen). Metagenomic libraries were prepared using the KAPA Library Prep kit (KAPA Biosystems) with index tagging by 16 cycles of PCR using KAPAHiFi Hot Start (KAPA Biosystems) and NEBNext Multiplex Oligos (oligonucleotides). Libraries were quanti ed by Qubit (ThermoFisher) and TapeStation (Agilent) and then pooled at equimolar concentrations for sequencing on the Illumina HiSeq platform.
Finally, the sam les were generated using Tanoti and a majority consensus sequence was generated for sites with >5 reads. The assemblies were individually inspected in UGene (http://ugene.net/).

HCV genotype and sequence datasets
The consensus sequences were placed within the context of previously known genotypes(https://talk.ictvonline.org/ictv_wikis/ aviviridae/w/sg_ avi/56/hcv-classi cation) using MEGA X software package using the maximum likelihood method with 500 bootstrap replicates. RDP4 13 and GARD 14 were used to test for recombination. All available (n=108) HCV 6a whole genome reference sequences were downloaded from the HCV-GLUE website (http://hcv.glue.cvr.ac.uk/) and NCBI (https://www.ncbi.nlm.nih.gov/gene). The sequences lacked location and sampling date information were excluded (n=5). Ultimately, 117 sequences (103 6a reference sequences with 14 6a sequences in this study) were used to construct the MCMC tree.
Among reference sequences, 15 were from Vietnam, 10 were Asian_immigrants_Canada (collected from Canada, but they are Asian immigrants), 15 were from Hongkong, 21 were from a community in China(no detailed province information), 28 were from Guangdong Province in China, others countries were less than 10 sequences, including United States, Denmark, Hunan province, Hainan province and Yunnan province in China. 6. Positive selection sites for HCV 6a genomes between Guangdong and Non-Guangdong

Spatio-temporal phylogenetic analysis
The total of HCV-6a whole genome sequences were divided into two groups (Guangdong and non-Guangdong) according to their geographic location. we ensured that all HCV sequences derived from DAA-naïve patients, excluding the reference sequences from DAA-treated patients described in literature. Mixed Effects Model of Evolution (MEME), implemented in datamonkey package (http://www.datamonkey.org/), has more superior performance over older models such as the tworate Fixed Effects likelihood (FEL and Single-Likelihood Ancestor Counting (SLAC) under a broad range of scenarios. Therefore, MEME was used to detect adaptive evolution in the codon alignment between Guangdong genomes and non-Guangdong genomes. CD4 and CD8 T-cell epitope positions were retrieved from the Los Alamos National Laboratory website (http://hcv.lanl.gov/content/immuno/immuno-main.html) and the Immune Epitope Database and Analysis resource (http://www.iedb.org).

Resistance-associated substitutions (RAS) prevalence in Guangdong sequences
The prevalence of HCV 6a RASs was determined for NS3, NS5A and NS5B in patients at baseline. RASs were detected with 95%, 5% and 1% threshold in a patient using phdrSamReporter module within HCV-GLUE project 15 . The majority and minority variants for each subtype were reported relative to the amino acid position of GT1a H77 reference.

Accession number
The nucleotide sequences reported in this study were deposited in Genbank with the following accession numbers:

Consensus sequences were obtained
In order to get reliable consensus sequences for each sample, we de ned clear criteria for consensus calling based on quality score and depth. The criteria were: 1) Phred quality score should be above 32; 2) the minimum depth at a position of the HCV genome should be greater or equal to 5. After trimming the sequences using above criteria, all the 14 samples had complete open-reading-frames (Table 1). The average depth of each position was greater than 1000 (mean:4399, mixmax:1143-7399). After removing human reads and rRNA reads, the percentage of mapped reads to total reads was signi cantly increased from 3.05%±1.58% to 43.11%±14.82%. The reads unmapped to HCV contain bacterial and fungi, no other virus was found. Neither RDP4 nor GARD showed evidence of recombination. The genotyping results of the whole genome sequences were the same as those of E1 gene in previous study 6 . 3. The evolutionary rate for HCV 6a partitioned gene For HCV 6a partitioned genes, E2 has the fastest evolutionary rate (9.40×10 -3 s/s/y) among all the partition genes,

Polymorphisms associated with resistance to NS3 Protease, NS5A and NS5B Polymerase inhibitors in whole genome sequences of HCV 6a
We characterized the prevalence of baseline NS3, NS5A and NS5B RASs with different sensitivities using 95%, 5% and 1% cut-offs for the frequency of a RAS in intra-host viral population sequences. Baseline RAS frequencies are summarized in Table 2. Mutations Q80K and 158V/I in NS3 region that confer resistance to siméprévir and boceprevir were found in 100.0% (14/14) and 7.14% (1/14) of the isolates, respectively. Even mutation V36G conferring low-level resistance to telaprevir and asunaprevir were found in 28.57% (4/14) of the patients, the cutoff for each patient was less than 5%. In NS5A region, M28F, Q30R and Y93T/S that confer resistance to daclatasvir, ombitasvir, ledipasvir and ledipasvir were found in 42.9% (6/14), 100.0% (14/14) and 64.3% (9/14) with 95% as a cutoff value, respectively. Mutation K24R was found in 14.3% (2/14) of the patients, the cutoff for each patient was less than 5%. In NS5B, only one mutation C445F that confer resistance to dasabuvir was found in 71.4% (10/14) with 95% as a cutoff value.

Discussion
With the development of next-generate sequencing (NGS) technology and bioinformation analysis methods, wholegenome sequencing of HCV has until now been a convenient, easy and accurate procedure that has now been adopted in scienti c research and clinical applications 16,17 . The NGS methodology we used in this study was able to generate HCV whole genomes from blood donor's samples, furthermore, it could accurately and reliably de ne the HCV subtype as the high average depths and strict access standards.
The molecular evolution of HCV 6a has previously only been investigated using single genome region or two concatenated partial genome regions 18 . Utilizing Bayesian MCMC phylogenetic methods, we analyzed 107 whole genome sequences and found that HCV 6a worldwide evolved from a most recent common ancestor (tMRCA) in 1905 (95%CI:1031-1943), which was younger than any other subtypes of HCV gt6 19 . The topology of the MCMC tree showed that Vietnam and other Asian countries could be the origin of 6a all over the world, which was consistent with hypothesis of previous studies 3, 12 .
Even though there is no spatio-temporal phylogenetic analysis, they provided considerable evidence to support their hypothesis 20 . In addition, we found that HCV 6a was introduced into Guangdong around 1960-1970 by two important transmission events. Originated from Vietnam, HCV 6a was transmitted to Hongkong. There were some evidences for the association between Hong Kong and Vietnamese Immigration 21 , our study also showed that Hongkong 6a was originated from Vietnam. One of the important transmission events was that HCV 6a in Guangdong were originated from Hongkong in 1966 (1941-1975), and then formed a local endemic in Guangdong province. We suspected that was because of the exchange between Guangdong Province and Hongkong in the early founding of the country 22 . Another transmission event showed that HCV6a was directly disseminate to Guangdong and formed a regional epidemic. From the topology of tree, we found that majority of sequences in group C was from Guangdong province not only in this study but also in reference sequences. Therefore, we conclude that this clade is clustered together, because the basic characteristics or lifestyles of the population in this area are similar, which resulting in a certain concentration of infected virus sequences.
Our study showed that E2 has the fastest evolutionary rate, roughly ten times than the majority partitioned genes of HCV 6a. It is well known that E2 contains three hypervariable regions (HVR), known as HVR1-3 23 , which are under continuous selection pressure exerted by the host immune system. Therefore, it is reasonable that E2 evolve fastest among all the partitional genes. The evolutionary rate of E1, NS3, NS5A and NS4A was the same as a rate typical of RNA viruses 1.00×10 -3 s/s/y, almost equal to whole genome evolutionary rate. E1, an HCV gene, usually represent whole-genome for HCV genotyping, as the variability of E1 can approximately replace HCV whole-genome 22 . Our result also supported this insight. NS3 is the HCV protease and NTPase/helicase, NS4A acts as a cofactor for the NS3 protease 23 , which can explain why the NS3 and NS4A have the same evolutionary rate. The evolutionary rate of NS2 was 9.94×10 -4 s/s/y, NS5B and NS4A was slightly slower than NS2 gene. Core and 5'UTR had the slowest evolutionary rate among all the HCV 6a partitional genes. Study also showed that HCV core protein was the lowest variability and have most pan-genotypic consensus positions 24 . Combination with these two results, we speculated that was because HCV core protein forms the viral capsid and interacts with multiple cellular proteins, which may require a high level of conserved residues. 5'UTR was so a conversed gene that do not distinguish some of the HCV-6 subtypes prevalent in Southeast Asia, and that these subtypes should instead be classi ed as genotype 1 or 1b 25 when based on primers from the 5'UTR. Using core sequencing data as well as 5' UTR data, it shown a remarkable improvement as it in genotyping accuracy and differentiation between HCV-1 and HCV-6 variants 26 .
Most published studies detecting positively selected sites in this virus have analyzed for different major subtypes such as HCV 1a and HCV 1b, no considering the region of the sequence source 27,28 . In this study, we found that the proportion of positively selected sites along HCV genomes was difference between Guangdong and non-Guangdong sequences, which indicating that geographical location is related viral adaptation between HCV and the cellular immune response. Analyses at the genomic level also revealed that Guangdong and non-Guangdong HCV 6a evolve under difference forces and constraints (CD8 T cell epitopes favoring selection in non-Guangdong HCV 6a). We speculate that the virus transmission pattern together with host genetic background in different regions leading to the different positive sites. The highest proportion of positive selection sites was found in E1 and E2 in both geographical locations, which is consistent with previous published studies 24,29,30 . These regions were reported to have the high genetic variability along the HCV genomes 31 , which re ect the generation of escape mutants whereby the interaction between positive selection and host immune response. As we know, E1 and E2 glycoproteins determines binding and fusion properties of HCV during cell entry 32 . Consequently, the positive site was the consequence of the human immune response and virus mutation when infected a new host. On the contrary, the proteins with the lowest proportion of sites under positive selection were NS3 which is a potential target for direct-acting antivirals that function to inhibit virus replication 33 . In addition, NS3 was also one of the most frequently HCV protein targets of the CD4 and CD8 T cell response 34,35 . The lowest positive selection sites in this gene would make DAA treatment more effective and facilitate viral escape from the immune system whereby CD4 and CD8 T mediated cell immune response.
The frequency of RASs varied by both HCV subtype and geographic regions 33 , therefore, we analyzed the frequency and distribution of naturally occurring RASs of HCV 6a in Guangdong regions using NGS sequence data. The RASs of HCV 6a has previously only been investigated using sanger sequencing, which can only detect more than 15%-25% of HCV variations, while deep sequencing can detect as low as 1% of HCV variation 36, 37 . In this study, mutations V36G in NS3 region and K24R in NS5A region were not found when the cutoff >95%, but after adjusting the cutoff to 5%, the prevalence was as high as 28.57% (4/14) and 14.3% (2/14) of the patients. Other RASs reported in this study were also found in other region's 6a sequences 38 . Even HCV 6a RASs could compromise treatment in robust infectious cell culture models 39 , but 100% SVR rates have been reported in a few reports on the outcomes of DAA therapy in gt6-infected patients 40,41 . More functional and clinical trials are needed to validate the RASs found in this study.
In conclusion, this study rstly analyzed the propagation mode and evolution rate of HCV 6a by whole genome sequence, revealed the transmission mode of HCV 6a in Guangdong. We have shown that, Guangdong and non-Guangdong HCV 6a share different selective pressures along their genomes, which give information about the effect of some of the interactions between HCV and its host on HCV variability. It may be useful for antiviral research against this virus. Overall, the results from this study provided data support for the formulation of HCV 6a prevention strategy in Guangdong, China.

Con icts of Interest
The authors declare no con ict of interest.

Availability of data and material
Genome sequences have been submitted to the GenBank database.
Authors' Contributions RX, JTH, MW, QL, ZGS and HSZ participated in the study design, wrote the draft, and collected the documentation materials. RX, XR and YSF participated in the study design and helped revise the draft. All authors read and approved the manuscript. Figure 1 A MCMC tree estimated from the whole genome sequences of HCV 6a. Coloured branches represent the geographic distribution of the sequences (reference sequences and 14 sequences in this study). HCV 6a isolates identi ed in this study are labeled in red solid circle. The posterior value and tMRCA were labeled in the main clusters.

Figure 2
The distribution of positively selected sites along the different regions of the H77 reference polyprotein for the Guangdong and non-Guangdong subsets. PSC represented positively selected sites.

Figure 3
Map of the HCV-6a Guangdong (Green), non-Guangdong (Yellow) genomes, indicating the location of totally positively selected sites, CD8 T cell epitopes (Blue) and CD4 T cell epitopes (Purple).