Exploring the zoonotic potential of Mycobacterium bovis using variant calling approaches

Abstract


Background
The genus Mycobacterium is known to cause the disease tuberculosis, which infects ~ 10 million people world-wide annually, according to Global Tuberculosis Report published by World Health Organization [1].Within this genus, Mycobacterium tuberculosis complex (MTBC) is known to cause tuberculosis in humans as well as animals [2].The members of the pathogenic MTBC family are known to be host-speci c.Mycobacterium tuberculosis var.bovis (M.bovis) infects cattle and many other species, whereas Mycobacterium tuberculosis H37Rv (M.tuberculosis) and its related substrains are mainly known to infect humans [3,4].M.bovis is also zoonotic and can spread from animals to humans, is an identi ed public health problem globally and is also known to have intrinsic drug resistance towards several drug molecules [5][6][7][8].M.bovis infection in humans is identical with that of M.tuberculosis infection, differing only in being non-transmissible among immunocompetent hosts [7,9].Infection due to M.bovis, causing Bovine Tuberculosis (BTB) is recognized as a One Health problem in certain parts of the globe where cattle-oriented domestication is practised.Close contacts with livestock and consumption of unpasteurized milk, raw or uncooked meat is the key source of zoonosis [6].According to WHO reports, although BTB cases make up only a small portion of human tuberculosis disease burden, efforts to curb global TB by 2030 seems to be hindered by incidences of zoonosis [10].Though TB remains to be an escalating problem worldwide, it is curable (non-treatable in cases of total drug resistance) and preventable, only if all efforts to cure it are intensi ed to reduce mortality and morbidity [11].
Continuous efforts of controlling global TB is delayed mainly due to less sensitive diagnostic tests, lack of effective vaccines and drugs, along with a rise in multi-drug resistant (MDR), extensively-drug resistant (XDR) and total-drug resistant (TDR) strains of TB [2].
The standard method of diagnosing BTB is the regular tuberculin skin test, which cannot distinguish between M.tuberculosis and M.bovis infections, moreover its results can be affected by cross-reactivity to BCG and other environmental mycobacteria [12].Conventional diagnosis using culture tests takes time for con rmation, hence diagnosis gets considerably delayed [13,14].Several non-sequence based molecular typing techniques [15][16][17] have been used independently or in combination for genotyping.These methods have variable speci city, and are suitable to be used only in clinics with well-equipped microbiological laboratories [13].Hence, a robust methodology which would help identify M.bovis isolates likely to infect hosts other than cattle would be useful in controlling BTB infection.
M.bovis AF2122/97 genome is 99.95% similar to M.tuberculosis H37Rv [18,19] differing predominantly due to Insertion-deletions(indels) and Single Nucleotide Polymorphisms (SNP)s [2,19], which in turn attribute towards their host-speci city [20].MTBC members are also known for their low mutation rates and limited genomic diversity [21], which makes studying their variation pro le important for identi cation as biomarkers.These polymorphisms obtained from analyzing whole genome sequence (WGS) data are capable enough to differentiate amongst populations and predict their host-speci city [22].SNP pro les of M.bovis isolates capable of causing zoonosis may give us clue towards their changing host-associations [23].Repositories of SNP data for TB community has been generated and is on the rise [24][25][26][27][28][29], but, all predictions till date have been done using individual samples, which have their own limitations and are known to include false-positives, due to low coverage, small read lengths and sequencing errors [30].An effort to enlist the variation pro le present within a cohort is still lacking as it becomes computationally challenging to predict SNP/Indel present across samples within a cohort during multisample variant prediction [31].
In the light of these facts, for effective control and treatment, a sequence-based improved and targeted rapid diagnostics for BTB is desirable to provide accurate identi cation.With the advent of massively-parallel next generation sequencing (NGS) techniques, bacterial populations can be sequenced to study population dynamics using WGS.Although heterogeneity analysis has been a regular phenomenon with respect to (w.r.t.) viral genomics, its application in bacterial genomes seems to be tricky.Heterogeneity refers to the genetic differences present within certain isolates of a genetically similar homogeneous population.Heterogeneity w.r.t variation pro le in prokaryotic populations forms the basis of survival in stressed environment like drug resistance, or change in metabolic requirements, etc. and this provides seed to microevolution [32,33].Microevolution can be de ned as gradual acquisition of mutations within a population to give rise to variations leading to speciation [34].The mutations arising in a prokaryotic heterogeneous or a homogeneous population needs to be studied carefully keeping in mind the gene information, annotation and their functionality.In order to identify polymorphisms responsible for causing BTB, a cross-infecting M.bovis population, i.e., a heterogeneous population needs to be identi ed and compared against a M.bovis cohort capable of infecting only cattle and has low divergence ratio amongst isolates, i.e. a homogeneous population.Clustering approaches capable of distinguishing such features have been implemented in this study to identify individual populations from the global M.bovis isolates.The initial approach uses Principal Component Analysis (PCA), which help us identify individuals with similar variation pro le [35].The second approach for clustering of isolates was performed using distance-based UPGMA and maximum-likelihood (ML) based methods for the SNP data [36].Based on PCA clustering and distance-based clustering, a homogenous population and a heterogenous population was identi ed to study the variant distribution using different variant calling approaches and their signi cance on each population type.A study of the variant distribution of the homogeneous and heterogeneous populations of M.bovis gives us an insight into the SNPs under selection and their distribution for each population type.
Approaches like Joint Variant Calling (JVC), (concept used for the rst time on prokaryotes in this study) which predict variants present in a cohort, promises to overcome the shortcomings proposed by single sample variant calling (SVC) methods, as variants are analyzed simultaneously across all samples in a population [31,37,38].JVC can predict variants for low coverage data in cohorts with high sensitivity.This approach works well for both homogeneous as well as heterogeneous population, wherein, population-speci c biomarkers can be identi ed for each cohort.JVC in diverging population can also help identify SNPs which are under selection pressure and may give rise to phenotypic variations within the population.Hence, for non-model organisms like Mycobacterium, polymorphisms can be detected with more con dence within populations using different variant calling approaches like, Bayesian or heuristic [37].Variant callers capable of handling cohort data, like FreeBayes and GATK use Bayesian approach to predict variants, whereas, VarScan2 and BCFtools uses pileup results along with heuristic approaches for variant detection [39][40][41][42][43][44].JVC was performed on a heterogenous as well as a homogenous population using a combination of various tools with different methodologies, also capable of handling prokaryotic population data for best results.Variant annotation along with functional distribution analysis of the same was done to identify the high con dent variants and their contribution towards gene functionality.The TB community has a catalogue of studies related to SNPs and Indels in MTBC genomes [26], but efforts to enlist and map all to their respective chromosomal position along with their Reference SNP cluster id (rsid) is lacking.Hence, rsids were assigned to the variants mapping them to their speci c chromosomal position.These polymorphisms could be used by other TB researchers for further addition and future reference of SNPs based on their unique rsids.The current study is the rst report of a comparative analysis of JVC approach versus single isolate variant detection in prokaryotes using homogenous and heterogenous cohort data, namely, M.bovis United Kingdom, henceforth abbreviated as UK [45,46], and New Zealand (NZ) isolates [47] respectively.JVC approach promises to improve consistency with fewer artefacts, and hence, more accurate variants were detected for homogenous as well as heterogeneous distribution of population [37,48], that have the potential to be used as biomarkers for diagnostics and treatment purposes apart from aiding in improvising our understanding of the pathogen in each population.The SNPs identi ed across a heterogeneous population as compared to the homogeneous population also throws light on the differential metabolic capabilities of the isolates which may explain certain aspects of their zoonosis.
We also aim to enlist/catalogue the distinct polymorphisms present between M.tuberculosis and M.bovis by performing JVC on the global M.bovis population to detect SNPs which occur across all samples to identify a set of "core SNP" of M.bovis, which may help in understanding host-tropism in bovine hosts apart from adding onto existing list of known polymorphisms.These core-SNPs in addition to Regions of Difference (RD), may be used for lineage identi cation [28] in M.bovis, in turn aiding in identi cation of speci c biomarkers.

Population distribution and identi cation of SNPs
A total of 705 M.bovis isolates were downloaded from public domain databases and later ltered as described in the Methods section.To understand the variant distribution across M.bovis population, SVC was performed across all 705 M.bovis isolates using GATK-HC and clustering was carried out using PCA and distance-based UPGMA methods using the vcf les generated for each isolate.Clustering was also performed using ML-based methods which showed similar results like UPGMA method (Fig. 1a,b&c).Hence, results pertaining to PCA and UPGMA have been discussed further.In the PCA plot, The NZ and Mexico populations are seen to be spread across both the principal axis and form heterogeneous clusters, whereas, UK, USA and Eritrea populations are seen to form single independent clusters on the PCA plot (Fig. 1a).As is apparent from Fig. 1b&c, the M.bovis isolates from NZ (pink) do not seem to form a de nite cluster and are found to span across the phylogenetic tree (polyphyletic).Whereas, the M.bovis isolates from UK (green) are found to be isolated in a single clade (monophyletic), though sharing similar variation pro le with other M.bovis isolates from UK, USA and others.The Mexico population is also seen to cooccur with other populations like NZ and "mixed", it was found to be almost missing in the UK clade.Hence, in terms of membership to other clusters, NZ was ranked the highest, as its presence was observed across all clades, followed by Mexico; therefore, NZ was selected for further study as a heterogeneous cluster.Isolates from USA were also found to be monophyletic and co-occurring with UK, NZ and other M.bovis isolates, but were not selected further for study because of their lower number of sample counts.Hence, the results obtained from both clustering methods indicate NZ to be a heterogeneous or a fast evolving population, and the UK to be a slow evolving homogeneous population.Based on the results obtained from the clustering methods, M.bovis isolates obtained from NZ and UK were identi ed as heterogeneous and homogeneous population respectively (Fig. 1a&b).

Variant Identi cation in M.bovis populations
In order to understand the variation between the SNP pro le of homogeneous and heterogeneous populations, JVC was implemented using four variant callers, viz., BCFtools, GATK, VarScan and FreeBayes for both NZ and UK populations independently (Table 1).SNPs present in all samples predicted by individual variant caller for both populations were extracted and named as "core-UK-(tool-name)" and "core-NZ-(tool-name)" respectively.An intersection of variants predicted by all tools across samples in homogeneous as well as heterogeneous clusters, is also reported and was found to reduce false positives.These were named as "total-UK" and "total-NZ" SNPs respectively (Fig. 2a&b).~50% of the core-SNPs in both the populations mapped onto existing rsids as compared to ~ 20% in NZ total-SNP and ~ 35% in UK total-SNP datasets.With the implementation of JVC approach for NZ population, a higher number of SNPs were reported by all tools when compared with the UK population (Table 1).Analysis of total-SNP datasets between both populations was carried out to identify common and unique SNPs.Paralogous genes corresponding to unique and common SNP datasets were identi ed for both populations in order to evaluate the number of spontaneous mutations occurring in single copy genes of each population.
To identify a core set of high con dence SNPs for UK and NZ populations, a consensus of all four variant callers was used, which led to 1866 SNPs ("core-UK") in UK and 1953 SNPs ("core-NZ") in NZ populations (Fig. 3a&b).As observed in Table 1, the total number of SNPs predicted by each tool for NZ when compared against the respective tool outcome for UK was found to be almost twice as high.This observation is in contrast to that obtained for SNPs in core-UK and core-NZ that were found to be comparable.The total-UK SNPs are similar in count to that of core-UK SNPs, whereas, in the case of heterogeneous NZ population, the total-NZ SNPs predicted are found to be high as compared to the core-NZ dataset.A comparative analysis of the single sample variant call was also performed using all four variant callers respectively for each UK sample.As observed in Table 1, the number of merged SNPs obtained using SVC is high when compared to JVC approach for each of the individual variant callers.This may be due to the artefacts related to single sample variant prediction [38].Strikingly, GATK showed comparable results in single sample calls as well as JVC predictions.
This is attributed to the fact that GATK-HC uses a scalable variant calling approach, wherein, per sample runs are done to generate an intermediate genomic vcf (gVCF) output that is further processed for joint genotyping [41], a step absent in other variant callers.
A functional distribution of the total-UK and core-UK SNPs revealed maximum number of variations to be present across Intermediatory metabolism and respiration (IMR) genes (596 and 433 respectively), followed by cell wall & cellular processes (576 and 429 respectively), conserved hypothetical (528 and 387 respectively), virulence, detoxi cation & adaptation and insertion sequences & phages (Table 2).Largest number of SNPs were found to be varied in PE/PPE genes (187) and least variation was observed in virulence, detoxi cation and adaptation (11) between total-UK and core-UK datasets.The functional distribution of total-NZ and core-NZ SNPs also followed a similar trend as seen in the case of the UK population (Table 2).Notable SNP differences were observed for all functional distribution groups between total-NZ and core-NZ SNPs, with highest SNP differences belonging to IMR, conserved hypotheticals and cell wall & cellular processes.A detailed list of the functional distribution of the SNPs for total-NZ, core-NZ, total-UK and core-UK can be found in supplementary table S1(a-d).

Discussion
SNP shapes the population structure of prokaryotes, wherein the SNPs are responsible for important bacterial phenotypes and may aid in their survival with the ability to infect and thrive on different metabolites [7].M.bovis isolates available in public domain repositories were collected and single SVC was performed to identify SNPs.Population distribution analysis of the M.bovis isolates using their SNP occurrences through clustering techniques identi ed NZ to be a more heterogeneous population as compared to the UK population.PCA clusters samples and variables with similar pro les together.In our study, the It is known that in heterogeneous populations, the frequency of SNPs due to spontaneous mutations is high as compared to homogeneous populations [49].A comparative study between the total-UK and total-NZ SNPs revealed 1617 genes having 2313 SNPs to be common.SNPs associated with 73 and 934 genes belonging to the UK and NZ population respectively were found to be unique.A comparative study of SNP distribution between these populations aided in the identi cation of the frequency of SNPs present in paralogs and the frequency of SNPs occurring spontaneously.Eight paralogues were found to carry SNPs in the NZ population and the other 828 genes harboured spontaneous mutations.No paralogous genes were observed to carry SNPs in the UK population.Hence, the proportion of SNPs carrying spontaneous mutation in single copy genes in the heterogeneous NZ population was found to be more as expected.A further analysis of these spontaneous mutations would help identify genes under positive selection, if any.

Identi cation of M.bovis speci c SNPs
A total of 351 SNPs (known as core-SNP) were found to be common across all 705 M.bovis isolates, which may aid in identi cation of a set of SNPs for distinguishing M.bovis isolates.Of these, 170 were non-synonymous, 104 were synonymous, 69 were present as upstream and eight SNPs were graded as "high" effect mutations by SNPeff annotations, which hampered protein function [50].Of these, pstA1 and mmpL9, both belonging to cell wall and cellular components category, were found to be truncated genes with altered functionality.A functional distribution of these 351 SNPs revealed highest number of missense-variants belonging to the IMR category.A complete list of the functional distribution of the core-SNPs can be found in Supplementary table S2.To test the utility of 351 core SNPs, M.bovis isolates available in NCBI SRA were chosen randomly, which are not a part of the 705 isolates mentioned above and analysed using GATK-HC.100% prediction accuracy for the SNPs was obtained, irrespective of their geographic origin.
SNP distribution of the NZ population was found to be highly varied as it co-occurred with all other populations.This indicated a rapidly evolving population, that is usually responsible for underlying subpopulations giving rise to heterogeneity within the population.This heterogeneous population is found to be polyphyletic in the distance-based tree and also has the capability of infecting hosts other than cattle as reported by Crispell et al. [47].UK isolates although were found to be sharing SNPs with other population M.bovis isolates, their occurrence was sporadic and were found as a monophyletic clade in the distancebased tree.The UK isolates were present as a single cluster on the PCA plot and accordingly, were found to be a homogeneous cluster having low divergence ratio.This hypothesis was further con rmed through variant calling based on individual populations.JVC was chosen to be the best method for distinguishing SNPs for homogeneous and heterogeneous populations.This method has the ability to identify variants with high con dence value and was preferred over SVC of multiple samples, which is underpowered due to several reasons.First, JVC performs variant analysis simultaneously across all samples, whereas, in SVC, samples are analyzed individually.Secondly, JVC can clearly distinguish between homozygous reference sites and sites with missing variant data, as it emits information of the variant site present in the population.This feature is missing when SVC is performed and hence reduces the chance of false positives in case of JVC.The read quality across M.bovis samples varied drastically.JVC also helped to overcome this issue as it uses information from high coverage samples present within the study to fetch con dent variant site information for a sample which has lower coverage at that location.JVC allows to gather variant information across all samples, thereby reducing false positives.SVC methods lack the capability of handling multiple samples at a time and hence, merging results from single sample calling would result in false discoveries.Moreover, the differences observed in the total-SNPs was found to be responsible for the overall population dynamics, which gets enlisted as high con dence variants in case of JVC as compared to SVC.Hence, variant calling using JVC method was preferred which helped us to overcome such artefacts and predict variants by joint analysis of multiple samples.
Through JVC, signi cant differences could be observed in the distribution pattern for SNP calls between UK and NZ isolates.The core-SNP for each population revealed the " xed" mutations present within the population, whereas the total-SNP for respective population indicated presence of clonal subpopulations.The higher the difference between the total-SNP and core-SNP for a population, the higher the heterogeneity quotient in the population.This result reveals microevolution to be predominant in NZ population, which is in cognizance to the heterogeneous SNP distribution of a population, whereas the SNPs present in the UK M.bovis isolates are likely to be xed, as the total and the core-SNP were comparable in terms of numbers.This phenomenon was also reported while mapping rsids to the SNPs identi ed in homogeneous as well as heterogeneous populations.Rsids were mapped to the chromosomal positions of SNPs which indicate that these SNP positions have already been reported earlier.Rsids when mapped onto the core-SNP for each population, were found to be consistent, where, ~ 50% SNPs mapped onto known SNPs in each population.These SNPs may be considered as " xed" SNPs within the population.While comparing the percentage of SNPs having rsids in total-NZ vs total-UK, NZ was found to have lesser number of rsid mapping when compared with total-UK SNPs irrespective of the variant caller.In the case of UK isolates, it is a slowly evolving homogeneous population and hence, the SNPs occurring in the population are the ones which have already been reported in the past and the total number of new SNPs is less.Whereas, in the case of heterogeneous NZ population, most of the SNPs were found to be "recent" and hence have not been reported earlier.Hence total-NZ SNPs with rsid mapping is comparatively low irrespective of the variant caller.SNPs present in the NZ population were also evaluated for their presence in paralogous genes, because, presence of SNPs in the paralogous gene groups helps the bacteria to adapt to the various changing environments [51].Moreover, the presence of SNPs in the single copy genes due to spontaneous mutations generate heterologous bacterial populations as reported earlier by Yu [32].Hence, in order to adapt to different hosts, NZ being a fast evolving heterogeneous population, SNPs were found to occur in the paralogous genes as well as single-copy genes, whereas, no SNPs were found to occur in the paralogs of the homogeneous UK population.However, as observed, the SNPs occurring in NZ outnumbered the SNPs occurring in the single copy genes of UK population.
Majority of SNPs associated with the IMR genes were found to be present in the NZ population.This phenomenon can be explained by the fact that Mycobacteria are faced with constant challenge of rerouting their metabolic activity with respect to their current environment, either in replicative growth or in non-replicative latent phase [52].A lack of functional pykA gene in M.bovis due to a SNP "E220D" makes it incapable of utilizing glycerol as the sole carbon source for energy generation [53].Searches from the UniProtKB suggests E220 to be a Magnesium (metal) binding site.Recent studies by Snášel and Pichová [54] also suggested the growth of Mycobacteria to be directly proportional to the divalent dependency of Mg 2+ and Mn 2+ ions for pykA gene.Hence, any isolate carrying a substitution E220D may not be capable of binding to Mg 2+ , thereby making it incapable of utilizing glycerol as a carbon source.It was observed that in our study, all UK isolates had the E220D substitution in the pykA gene, whereas, none of the NZ isolates had the same (Supplementary table S1).This feature may be attributed to the fact that as NZ is a heterogeneous evolving population capable of cross infecting other hosts.The NZ isolates evolve themselves to utilize other sources of energy like glycerol, which may not be a case with the UK isolates.A fundamental feature of bacterial adaptation is also their ability to respire and sustain metabolism and generate ATP via oxidative phosphorylation [52].An organism with reduced genome, such as M.leprae is limited in performing only anaerobic respiration by utilization of limited substrates [55].Whereas, early studies have proved that M.tuberculosis grown in vitro were capable of utilizing exogenous substrates for respiration and survival [52,56].This feature is observed in a heterogeneously diversifying population, like NZ M.bovis isolates wherein, maximum number of SNPs are observed within the genes responsible for metabolism and respiration, which in turn supports their ability to infect hosts other than cattle, having wider availability of substrates.For instance, Mycobacteria have evolved highly specialized systems to extract nutrients from their hosts, in the form of lipids.One such important transporter needed for scavenging lipids from hosts is the mce1 transporter complex [57].A comparative study of the SNPs involved in mce1 transporter complex in both populations revealed more number of SNPs to be present in the NZ heterogeneous population as compared to the UK homogeneous population (Supplementary Table S1).These additional SNPs may help evade the host immune responses to be able to infect more number of hosts.Hence, these observations can be attributed to the fact that M.bovis isolates belonging to NZ population are capable of zoonoses, which is in agreement with BTB infection in New Zealand reported by Crispell et al. [47].
The cell wall components constitute an important part of virulence and host-speci city in Mycobacterium [58-60].The SNP distribution in the cell wall components contribute to pathogenesis and virulence and also forms the interface for host-pathogen interaction.The NZ isolates show a larger variation in their cell wall genes as compared to the UK isolates.This also may be attributed to the fact that this population has the capability to infect a wider range of hosts by evading immune system, from cattle to wildlife [47].
By using JVC methods, we could list the SNPs present in M.bovis isolates which may be responsible for zoonoses when compared against a cohort showing low divergence.A core-UK set for identifying UK isolates and a core-NZ set for identifying NZ isolates were also identi ed using this methodology.The mutations present in the pstA1 and mmpL9 genes along with other genes of the cell wall components of M.bovis may help accommodate the cell wall so as to increase its capability to infect multiple hosts.As reported in literature earlier, a mutant pstsA1 gene in M.tuberculosis renders the bacterium hypersensitive to various stress conditions, as this gene is responsible for phosphate uptake along with inducing expression of several PE and PPE genes in the wild type.
Most of the PE PPE genes are localized to the cell wall for maintaining the cell wall integrity.It was also observed by Ramakrishnan et al, that the permeability issues created by the mutant pstsA1 gene could be replaced by deleting pe19 gene [61].Hence, we hypothesize that the observed mutations pertaining to the cell wall and associated genes, like the several PE-PPE genes in NZ population may together contribute in modulating the M.bovis envelope to resist different stress conditions while infecting multiple hosts.Another biomarker gene, mmpL9 was found to be mutated across all M.bovis isolates, when compared against M.tuberculosis.This gene although is involved in inhibition of phagosomal maturation and oxidative stress management during infection, mutant strains were found to survive effectively in mouse infection models [62].Hence, an integrated system-level approach to study these SNPs would help us understand the mechanism of zoonoses amongst these isolates.The core-SNP set consisting of 351 SNPs is a robust genetic marker for identi cation of M.bovis isolates in addition to already known RDs, as these were capable of identifying M.bovis isolates globally.

Conclusions
To conclude, this study promises to be a robust strategy for identifying traits in the form of variants present across a diverging as well as a slowly evolving population.By using strategies like PCA and distance-based UPGMA methods for clustering variants, an overall idea about the population distribution helped in identi cation of most evolving bacterial populations undergoing rapid mutational changes.These highly evolving divergent populations may have the potential of infecting multiple hosts or may be undergoing mutational changes under selective pressure, like the multi and extensively drug resistant strains of Mycobacteria.Once these populations are identi ed, variant calling approaches like JVC would aid in identi cation of speci c variants in turn facilitating identi cation of speci c biomarkers, leading to better diagnostics and cure.

Data Collection and Processing
Whole genome sequencing data 705 M.bovis isolates were downloaded from NCBI Sequence Read Archive available as of March 2019 (Table 3) [45][46][47].M.bovis samples were downloaded from NCBI SRA for cattle as host.A total of 323, 155, 139, 39, 22 and 14 samples were isolated as M.bovis infecting NZ, Mexico, UK, USA, Uruguay and Eritrea cattle respectively.The rest 13 M.bovis isolates from Guatemala, Costa Rica, Panama, South Africa, Brazil and Canada respectively were combined together as "mixed" population (Table 3).Ten additional M.bovis samples which are not a part of the primary dataset were randomly downloaded from NCBI SRA for further testing.Reads for each M.bovis isolate was checked for quality using FastQC   [44].The common SNPs across all isolate for each tool was inferred using GATK CombineVariants [41] (Fig. 4).
JVC was performed for the global collection of 705 M.bovis isolates using four variant callers, viz., GATK-HC [41,40,43], FreeBayes [70], VarScan2 [39] and Samtools/BCftools [44].A proposed consensus of the variants predicted by above tools across all isolates is hypothesized to serve as a biomarker for global M.bovis isolates.Studies were carried out accordingly, but the results could not be reproduced when tested across other randomly chosen M.bovis isolates which were not a part of the present study (data not shown).Hence, an assessment of the population structure of M.bovis isolates was made by reconstructing a distance tree based on the UPGMA algorithm with 100 bootstrap replicates [36] using the vcf les obtained for all 705 samples.A maximumlikelihood tree of the 705 samples was also reconstructed using RAxML software with 100 bootstrap replicates [71].PCA was carried out across all vcf les using the genotype-by-sequencing (GBS) tool for population genetic analysis in R [72].Identi cation of homogeneous and heterogeneous population cluster was performed based on the results obtained from PCA, Maximim-likelihood and UPGMA methods.The distance-based tree was visualized using iTOL v4 tree viewer [73].
JVC was performed for both homogeneous as well as heterogeneous clusters with four different variant callers capable of handling haploid organism [37].
Heuristic tools, viz., VarScan2, BCFtools, Bayesian based tools viz., FreeBayes, Haplotype-based tools like GATK-HC were used for JVC, with ploidy "1".Multithreading option for GATK-HC enabled rapid and timely prediction of SNPs.BLAST package (version 2.6.0) was used to predict paralogues and assign rsids to the variants with dbSNP entries for Mycobacterium [74,75].Paralogs and their corresponding SNPs in each population were detected using BLASTn with a threshold of 80% "query coverage" and e-value of <e -20 .In-house custom scripts were used for data processing and data analysis.All variant calling tools were highly compute intensive and required long stretches of computational runs.
JVC was also performed for all 705 M.bovis isolates using GATK-HC to identify the common set of SNPs present across global M.bovis isolates used for this study.These SNPs were further tested for their presence in randomly selected M.bovis samples, which were not a part of the original 705 M.bovis isolates with accession numbers SRR1791699, SRR1791716, SRR1791718, SRR1791784, SRR1791808, SRR1791830, SRR1791840, SRR1791854, SRR1791885, SRR1791888.All predicted variations were annotated using SNPeff and SNPsift [50] packages.Variants were classi ed according to their functional category as reported in Tuberculist [66].

Assignment of rsid
Rsid accession number are available for Mycobacterium polymorphic sites in NCBI dbSNP available at (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/archive/mycobacterium_tuberculosis_1773/).Currently NCBI has stopped support for SNPs of non-human organisms, but data can be accessed through the above link (personal communication).Rsids for non-model organisms, such as M.tuberculosis lack genomic coordinates.Hence, a strategy to map rsids to speci c chromosomal position for M.tuberculosis was implemented (Fig. 5).M.tuberculosis SNP data entries (37388) along with anking sequences were downloaded from NCBI dbSNP.
Sequence alignment of these entries using standalone BLAST version version 2.6.0 [76] with M.tuberculosis H37Rv gene sequences as database, resulted in 45938 matches.As the number of matches were more than the query sequences, ltering was required for entries with more than one match.The BLAST output was ltered for mismatches more or less than 1 along with removal of entries with unequal subject and query length.Entries with unique and repeated rsids were analysed further.This resulted in 33565 unique rsids, along with 114 repeated rsids.It was found that the query sequences had anking regions of varying lengths(60, 120 and 255).Hence, high-scoring Segment Pairs matching with repeated rsids were further ltered based on length of anking sequence being 255.After ltration, 42 repeated rsids were retained and rsids corresponding to 33607 chromosomal positions were mapped.It was found that by changing the order of ltering, the number of genes being assigned rsids reduced.Two distinct rsids were assigned to the same chromosomal position in 282 entries, and hence were discarded and only unique rsid mapping onto single chromosomal position were retained.In total, we could assign 33325 unique rsids to corresponding positions of the M.tuberculosis genome and the vcf le has been provided as Supplementary le S3.Methodology to identify population speci c SNPs [63].Read length varied from 70-250 base pairs across samples.Read quality > 28 were retained.TrimGalore [64] was used to trim the reads such that > 80% of the read length were retained, as trimming of reads and retaining the good quality bases increases the accuracy of the analysis [65].

Figure 1 a 2 a
Figure 1 a: PCA plots for all 705 M.bovis isolates; b: Distance-based UPGMA (unrooted tree) for all 705 M.bovis isolates; Maximum likelihood based (unrooted tree) for all 705 M.bovis isolates

Figure 3 a
Figure 3 a: Core-UK SNPs predicted by all tools; b: Core-NZ SNPs predicted by all tools

Table 2
Functional distribution of SNPs in UK and NZ