We performed genomic analysis on 278 Mtb isolates (226 drug-resistant, 52 drug-sensitive, as determined by standard clinical drug susceptibility tests) from 190 TB patients. The majority of the isolates were from two spoligotype lineages in nearly equal distribution (41% H3 and 38% Beijing). All H3 spoligotypes had the 4.2.1 numeric SNP barcode and all Beijing spoligotypes had the 2.2.1 numeric SNP barcode. Not all 4.2.1 lineage samples had the H3 spoligotype. The remaining samples were distributed among nine other lineages, all present at a frequency of less than 10% (Figure 1A). This pattern in the complete sample set is driven by the drug-resistant samples in our study, as the H3/4.2.1 and Beijing/2.2.1 lineages are dominant among drug-resistant samples (Figure 1B). There is much greater lineage variation among drug-sensitive samples (Figure 1C).
Bayesian phylogenetic analysis of genomic SNPs demonstrated differing evolutionary structures between Beijing/2.2.1 and H3/4.2.1 lineage samples (Figure 2). The majority of the H3/4.2.1 lineage samples clustered into one shallow, strongly-supported (clade posterior probability > 0.95) group (mean Hamming distance 17.1 SNPs, range 0-40 SNPs). Beijing/2.2.1 lineage samples mostly clustered into four well-supported (P > 0.95) clades (Figure 2) more divergent than the main H3/4.2.1 clade (mean pairwise Hamming distances 40.2 (range 0-53), 30.0 (0-53), 35.3 (3-68), and 33.1 (21-31) SNPs). All H3/4.2.1 lineage samples had a smaller mean pairwise diversity (mean Hamming distance 74.4 SNPs) than the Beijing/2.2.1 lineage samples (mean Hamming distance 144.5 SNPs) but a greater range of individual pairwise diversity (Hamming distance range 0-495 for H3/4.2.1, 0-233 for Beijing/2.2.1).
Whole genome variant analysis (WGVA) was performed to reveal molecular patterns of drug resistance within our sample set. The presence or absence of drug-resistance SNPs, as determined by data in the ReSeqTB database (Table 1) , were mapped onto the tips of this phylogeny and are presented as a heat map to the right of the tree in Figure 2. Our analysis shows the rpoB S450L (rifampicin resistance), inhA promoter c(-15)t (isoniazid resistance), and katG S315T (isoniazid resistance) mutations broadly distributed across DR samples in both the H3/4.2.1 and Beijing/2.2.1 clades. The lineages in the H3/4.2.1 clade do not appear to have any mutations in the 16S ribosomal RNA gene rrs that would lead to resistance to second-line injectable drugs (amikacin, kanamycin, or capreomycin). Several widely phylogenetically distributed Beijing/2.2.1 lineage samples have the rrs a1401g mutation. This distribution implies the rrs a1401g mutation has arisen independently multiple times among the divergent Beijing/2.2.1 sublineages.
Susceptibility to first and second-line drugs was determined using phenotypic drug susceptibility testing (pDST) assays. Standard line tests of drug resistance (Hain MTBDRplus and GenoType MTBDRsl and Cepheid GeneXpert) also were performed on many of these samples. All samples with a line test also had a phenotypic test, and no line test results were more severe than phenotypic DST results. There were 12 samples in which pDST-determined drug resistance did not match what was predicted by WGVA (Table 2). In 10 of these cases the WGVA drug resistance profile was “susceptible” while that from pDST was mono-resistant, MDR, or XDR. Two patients with drug susceptible Mtb infection, as determined by pDST, had WGVA profiles concordant with mono-resistance.
This discrepancy between pDST and WGVA results could be an indication of mixed infection when there is evidence of low-level DR variants. For the 10 samples with a WGVA drug susceptible profile and a pDST resistant profile, the full genomic read data were examined to determine if low-frequency variants were present at the ReSeqTB DR SNP sites. Analysis with LoFreq found no statistically significant low-frequency variation in these samples. Four of these samples (SRR3743495, SRR5153913, SRR3743493, and SRR3743482) had low level variation (greater than 1% of reads but not statistically significant) at several ReSeqTB DR loci.
The sample with the most low-level variation was SRR5153913. This sample was classified as MDR by the BACTEC and Hain DSTs but it had no SNP calls at ReSeqTB DR SNP sites. This sample had low-frequency variation greater than 1% of reads at rpoB S450L (4/107 reads), katG S315T (4/142), inhH promoter c(-15)t (4/158), rpsL K88R(4/195), and pncA G97D (2/139). These variants would allow subpopulations of the pathogen to grow in cultures containing rifampicin, isoniazid, streptomycin, and pyrazinamide (respectively), corresponding to the MDR profile.
We also examined depth of coverage as a possible explanation for the discrepancy between pDST and WGVA results. Three of the 10 samples with WGVA susceptible/pDST resistant profiles had low coverage (average number <100 reads) across all of the ReSeqTB DR SNP sites. Due to sampling effects low coverage at DR SNP sites could lead to under-reporting of low-frequency DR SNPs that may be present in the patient. The low frequency variation found at ReSeqTB DR SNP sites in these three samples was on the order of 1% of reads or less.
Two samples (patient/isolate IDs 1242/SRR5153868 and 1248/SRR5153901) were identified as being WGVA-resistant and pDST sensitive (Table 2). Isolate SRR5153868 had the rpsL K43R variant for streptomycin resistance and isolate SRR5153901 had the gyrA A90V variant for ofloxacin resistance. Additionally, SRR5153868 had three DR variants present at very low frequency: rpoB L452P (2/177), pncA Q10P (2/159), and pncA H57D (2/192). Isolate SRR5153901 had one low frequency variant in the rpoB RRDR that was not a recognized DR variant. There are several potential sources for this disagreement, including differences in diversity between WGVA samples and pDST samples  and localized failure of individual DSTs.
Longitudinal sampling. Eighty-nine patients had two samples collected at different times (Supplemental Table 1). The time span of the paired samples ranged from 46 days to 7.38 years. Forty-six pairs had fewer than 10 SNPs different between their genomes (range 0-9); 7 of those pairs contained identical genomes (Figure 3). Thirty pairs had different spoligotype lineages, and 27 of these differed by more than 400 SNPs (range: 448-1298).
Five sample pairs had the same spoligotype and SNP barcode lineage within the pair but differed by an elevated number of SNPs (patients 559, 1250, 1605, 1611, and 1614, range 39-437 SNPs) (Figure 3). The second sample for these patients were labelled “relapse” (except in the case where a patient died and was labelled “failure”). However, in all of these cases the second sample is most likely a reinfection due to the elevated number of SNPs between the two samples. Also, these pairs of samples typically clustered into separate subclades within their main clades, further supporting that they represent multiple infections by independent strains within the H3/4.2.1 or Beijing/2.2.1 lineages.
We found instances in which samples from other patients were more similar to one of the paired samples than the paired samples were to each other (Figure 4). These pairs and closely-related samples were all in the H3/4.2.1 clade. This clade contains a large subgroup of samples too similar for reliable determination of phylogenetic relationships, as would be expected for a population of very closely related clonal lineages. To accommodate these ambiguities we performed a network analysis. We found two significant networks: one small network of five samples (two were identical) and a second much larger network of 61 samples (Figure 4). The larger network was very complex with three main interior nodes from which most other samples were derived. Two of these interior nodes were samples present in the data set (SRR6807719 and SRR3743486) and the third was an unsampled hypothetical genotype. These three nodes were very similar, differing only by two or four SNPs. Only eight of the connections in the 61-sample network were greater than 10 SNPs. Most samples were connected by very few genomic differences, indicating that many of the H3/4.2.1 TB lineages infecting patients in Moldova are very closely related.
Mapping the paired sample relationships onto these networks showed that most paired samples were each other’s closest relatives, which would be consistent with a possible reactivation in these patients. Four paired samples were not each other’s closest relatives, even though they differed by relatively few SNPs (7, 8, 8, and 10 SNPs). While bacterial evolution during drug treatment might account for the minimal SNP differentiation in these four pairs , their lack of a direct relationship in the network indicates that these may be cases of reinfection by a closely-related lineage.