We studied genetic relationships among 5069 Mycobacterium tuberculosis strains recovered from patients enrolled in 4 population-based studies in the United States and Europe, by analysis of 36 synonymous single-nucleotide polymorphisms (SNPs). All strains were assigned to 1 of 9 major genetic clusters based on sSNP profile. The same 9 genetic clusters were revealed by analysis of 227 nonsynonymous SNPs, 121 intergenic SNPs, and concatenated profiles of 578 SNPs available for a subset of 48 representative strains. IS6110 profiles, spoligotypes, and mycobacterial interspersed repetitive unit patterns were nonrandomly associated with SNP-based phylogenetic lineages, together indicating a strongly clonal population structure. Isolates of the 9 genetic clusters were not distributed with equal frequency in all localities, reflecting geographic subdivision. The SNP-based phylogenetic framework provides new insight into the worldwide evolution of M. tuberculosis and a gateway for investigating genotype-disease phenotype relationships in large samples of strains
Mycobacterium tuberculosis is the most successful human pathogen worldwide, responsible for 3 million deaths each year and for extensive morbidity and mortality [1]. The global threat of M. tuberculosis to human health emphasizes the need to understand the nature and extent of genetic variation in the species and the phylogenetic relationships among all strains. Historically, population-based studies of M. tuberculosis epidemiology have been hindered because methods commonly used to analyze isolates [2–6] do not yield estimates of overall levels of chromosomal relationships among all isolates. This problem was recently solved by the development of genomewide analysis of synonymous single-nucleotide polymorphisms (sSNPs—i.e., those that do not produce amino acid changes), made possible by the availability of genome sequence data for 4 M. tuberculosis complex strains [7–10]. Because sSNPs are evolutionarily neutral or nearly so [11] and are easy to assay, they provide a powerful strategy for large-scale molecular population-genetic studies examining phylogenetic relationships among bacterial strains
In the present work, we used sSNP analysis to study genetic relationships among 5069 M. tuberculosis isolates recovered from patients enrolled in 4 population-based studies in the United States and Europe. We also analyzed nonsynonymous SNPs (nsSNPs—i.e., those that produce amino acid replacements) and SNPs located in intergenic regions (iSNPs) in a representative genetic subset of these 5069 organisms
Bacterial strains M. tuberculosis isolates (n=5069) were cultured from patients living in 4 areas with ongoing population-based tuberculosis surveillance studies (supplementary table 1, available for download as a PDF in the online edition of the Journal). The sample included strains from New York City (n=1114; year 2001); New Jersey (n=775; years 2000–2001); Houston, Texas (n=2328; years 1995–2000); and Finland (n=852; years 2000–2001). Organisms representing ∼95% of all culture-positive tuberculosis cases in these localities were studied. All of the organisms have been characterized by IS6110 profiling, and most have been analyzed by spoligotyping [6, 12, 13]
Ninety-two M. bovis strains recovered from 11 different host species worldwide, as well as 1 M. africanum strain, 4 M. microti strains, 1 M. tuberculosis complex strain from a seal, and 1 M. canettii isolate, described in a recent publication [7], also were studied. All isolates were assigned to species on the basis of conventional biochemical test results [14]
Identification, selection, and verification of SNPs SNPs were identified by aligning genome sequence information available for 4 M. tuberculosis strains [7]. Briefly, the genome sequences of M. tuberculosis strain H37Rv [8] and strain CDC1551 ([9]; http://www.tigr.org/) were aligned using CrossMatch (P. Green; http://www.genome.washington.edu/UWGC/analysistools/Swat.cfm) to identify polymorphic nucleotide sites. In addition, these 2 genomes were aligned with partial genome sequence data available for M. tuberculosis strain 210 (http://www.tigr.org/) and M. bovis strain AF2122/97 (http://www.sanger.ac.uk/). Although the complete genome sequence of strain AF2122/97 was published recently [10], our analysis was begun when only partial genome data were available. iSNPs were defined as SNPs not present in open reading frames, using the genome annotation available for strain H37Rv and CDC1551
Thirty-six sSNPs were selected from among the 230 sSNPs studied previously [7]. These 36 sSNPs include 27 sSNPs described previously and reproduce the overall topology of the phylogenetic tree generated by the set of 230 sSNPs [7]. Each sSNP proportionally represents 10–32 of these 230 sSNPs, on the basis of the number of sSNPs needed to separate one branch of the tree from the other
Two hundred twenty-seven sequence-confirmed nsSNPs in 194 distinct genes were selected for analysis on the basis of 3 major criteria. First, inasmuch as a gain or loss of a proline amino acid residue can produce a major structural change in proteins, virtually all nsSNPs resulting in the replacement of an amino acid residue by a proline residue, or vice versa, were analyzed [15, 16]. Second, multiple nsSNPs present in the same gene were analyzed on the basis of the premise that the encoded protein may be under elevated selective pressure. Third, nsSNPs located in genes encoding proteins of known function or implicated in a particular disease or strain phenotype, as determined on the basis of a survey of the published tuberculosis literature, also were analyzed. Eighty-six percent of all putative nsSNPs were confirmed to be true SNPs. Sixty-four iSNPs were randomly selected and used for initial analysis. All SNPs were verified by sequencing the relevant gene regions in strains TN587, CDC1551, and H37Rv, which are representatives of principal genetic groups 1, 2, and 3, respectively [17], as well as in the M. bovis strain TN10130
SNP analysis sSNPs and nsSNPs were analyzed by the SNaPshot primer extension method (Applied Biosystems), as described elsewhere [7]. The primers used are listed in supplementary table 2 (available for download as a PDF in the online edition of the Journal). iSNPs were studied by DNA sequence analysis of the entire relevant intergenic region (i.e., the full-length intergenic region was sequenced). The PCR and sequencing primers used are listed in supplementary table 2. sSNPs and nsSNPs were identified using ABI Prism Genotyper software (version 3.7NT; Applied Biosystems), and iSNPs were identified using Sequencher (version 4.0.5; Gene Codes Corporation) and DNASTAR (DNASTAR) software
IS 6110 restriction fragment–length polymorphism (RFLP) profiling, spoligotyping, and mycobacterial interspersed repetitive unit (MIRU)–variable-number tandem repeat (VNTR) analysis IS6110 RFLP profiling, spoligotyping, and MIRU-VNTR analysis were performed by procedures described elsewhere [2, 6, 17–19]
SNP data analysis SNP data were concatenated, resulting in a single character string (nucleotide sequence) for each strain. Phylogenetic and molecular evolutionary analyses were conducted using MEGA (version 2.1; available at: http://www.megasoftware.net/), using the neighbor-joining method with 1000 bootstrap replicates, with distance calculated using the number of different SNPs
Statistical analysis Associations were tested for statistical significance by use of the χ2 or Fisher’s exact test. A P value <.05 was considered to be statistically significant
Nine clusters of genetically related organisms identified by sSNP analysis Recent analysis of a convenience sample of 316 M. tuberculosis sensu stricto organisms identified 8 distinct clusters of genetically related organisms on the basis of phylogenetic analysis of 230 sSNPs [7]. Because analysis of hundreds of SNPs in thousands of isolates currently is time consuming and not economically realistic, we identified a sample of 36 sSNPs that recapitulated the overall topology of the phylogenetic tree based on the 230 sSNPs studied previously [7]. A sample of 5069 M. tuberculosis isolates collected in population-based studies conducted in New York City, New Jersey, Houston, and Finland was analyzed for these 36 sSNPs
Nine genetic clusters were identified, including 8 that were found in our recent analysis [7]. One new cluster (designated II.A), located phylogenetically between clusters II and III (figure 1), was identified. Our results confirmed a nonrandom relationship between genetic cluster and 3 principal genetic groups of the species [20]. Principal genetic group 1 organisms were assigned to clusters I and II, principal genetic group 2 organisms were assigned to clusters III–VI, and principal genetic group 3 isolates were assigned to clusters VII and VIII (figure 1). Importantly, cluster II.A included principal genetic group 1 (44% of cluster II.A isolates) and principal genetic group 2 (56% of cluster II.A isolates) organisms, which is additional evidence in support of the hypothesis [7, 20] that principal genetic group 1 and 2 organisms are phylogenetically allied
Phylogenetic trees showing estimates of genetic relationships among 9 genetic clusters of Mycobacterium tuberculosis strains identified by single-nucleotide polymorphism (SNP) analysis. For simplicity, the trees were truncated to show only the major genetic clusters (subclusters of 2–10 strains differing by 1–2 SNPs from the main branches were omitted for reasons of space). The clusters are numbered as described elsewhere [7]. A Synonymous SNP (sSNP) tree. B Nonsynonymous SNP (nsSNP) tree. C Intergenic SNP (iSNP) tree. D Tree generated using concatenated sSNP (including the above-mentioned 230 sSNPs), nsSNP, and iSNP data (n=578 SNPs for each of 48 representative strains). It is noteworthy that, to analyze sSNPs in 5069 strains (A) a subset of 36 representative sSNPs was selected from the 230 sSNPs studied previously [7]. These 36 sSNPs reproduce the overall topology of the phylogenetic tree generated by the set of 230 sSNPs. Therefore, each of the 36 sSNPs is proportionally representative of 10–32 of the total 230 sSNPs analyzed previously, and the scale of 2 indicated is representative of ∼10 sSNPs. Furthermore, it has to be noted that, with 5069 strains analyzed by 36 sSNPs, the computational limits were exceeded. Therefore, the bootstrap values in panel A are a summary of multiple bootstrap analyses using subsets of the 5069 strains. Groups designated 1–3 in panel A refer to principal genetic groups described elsewhere [20]. The reliability of these branches was assessed by bootstrapping with 1000 (neighbor joining) pseudoreplicates. MTB cplx, M. tuberculosis complex (includes M. bovis, M. africanum, M. microti, M. tuberculosis [from a seal], and M. canettii strains)
Analysis of nsSNPs We recently reported that the abundance of nsSNPs exceeded that of sSNPs by a ratio of roughly 2:1, on the basis of comparison of the strain H37Rv and CDC1551 genomes [7, 9]. A similar abundance of nsSNPs was reported on the basis of comparison of an M. bovis genome sequence and strain H37Rv and CDC1551 genomes [10]. This abundance of nsSNPs relative to sSNPs is unusual for bacteria characterized thus far. sSNPs generally outnumber nsSNPs, in part because the latter are, it is believed, purged from the population by purifying selection [11]. One explanation for this unusual finding postulates very rapid expansion of the population size of M. tuberculosis sensu stricto, such that sufficient evolutionary time has not elapsed to permit the loss of deleterious mutations [7]. The relatively restricted allelic variation in structural genes in the M. tuberculosis complex is consistent with the hypothesis of recent population expansion. If this is the case, we anticipate that the overall topology of a phylogenetic tree based on nsSNPs will be identical or closely similar to that of the tree based on sSNPs
To test this tree-topology hypothesis, we characterized 227 nsSNPs in the 48 strains representing the 9 genetic clusters. Consistent with the hypothesis, the overall topology of the nsSNP-based tree was very similar to that of the sSNP-based tree (figure 1)
Analysis of iSNPs iSNPs—SNPs located between coding sequences—potentially can alter the level of gene expression and, hence, may confer a selective advantage. Genomewide analysis of iSNPs in natural populations of M. tuberculosis has not been conducted, nor has it been studied extensively in other pathogens
Comparison of the genomes of strains H37Rv and CDC1551 revealed 1075 SNPs, of which ∼15% (n=161) were located in noncoding regions [9]. Sixty-four sequence-confirmed iSNPs present in 63 intergenic regions were studied by comparative DNA sequencing of the entire intergenic region containing the target iSNP in 48 strains representing the 9 genetic clusters. Forty of the 63 intergenic regions had 57 additional iSNPs that were not present in the 4 genomes we compared. Phylogenetic analysis of the entire iSNP data set (a total of 121 iSNPs/strain) identified the same 9 genetic clusters found by study of the sSNPs and nsSNPs (figure 1), indicating that, on average, iSNPs have accumulated in a fashion phylogenetically similar to that of sSNPs
One group of iSNPs warrants special comment, because their pattern of distribution is not readily explained by sequential accumulation along phylogenetic lineages. The intergenic region between Rv0980c (PE-PGRS family protein gene) and Rv0981 (2-component response regulator gene) [21] has 5 polymorphic nucleotides (figure 2 and supplementary table 3, available for download as a PDF in the online edition of the Journal). The segment of this intergenic region containing iSNPs is present at 2 other sites in the genome of strain H37Rv, CDC1551, and AF2122/97. Sequence analysis of these 2 additional intergenic regions revealed that they are only nominally polymorphic in the 48 strains studied, and the distribution of polymorphisms in them is readily explainable by the accumulation of occasional mutations. To obtain additional information about the distribution of the Rv0980c-Rv0981 iSNPs in natural populations, we sequenced this intergenic region in an additional 251 M. tuberculosis complex isolates (including M. tuberculosis, M. bovis, M. microti, M. africanum and M. canettii strains). We found a complicated pattern of distribution of the iSNPs (figure 2), suggesting that recombination has helped to shape polymorphism at this locus
Schematic showing intergenic single-nucleotide polymorphisms (iSNPs) located in the intergenic region between Rv0980c (PE-PGRS family protein gene) and Rv0981 (2-component response regulator gene). A Location of relevant intergenic regions. Pink bars indicate segments that are repeated at 3 sites in the genome and that contain intergenic polymorphic nucleotide sites (iSNPs, denoted by vertical arrows). B Distribution of Rv0980c-Rv0981–region iSNP profiles in the phylogenetic tree of Mycobacterium tuberculosis. The five 5-nt profiles are shown in the colored ovals. The 91 MTB cplx oval includes 4 M. microti 1 M. africanum 1 M. canettii and 85 M. bovis strains
Estimates of genetic relationships based on an aggregate data set of 578 SNPs We next concatenated data for the 578 SNPs available for all 48 strains (230 sSNPs, 227 nSNPs, and 121 iSNPs) and estimated overall genetic relationships. The results (figure 1) confirmed that 9 major genetic clusters were present among the strains analyzed on the basis of these 578 SNPs
Variation in IS 6110 profile among the 9 genetic clusters M. tuberculosis isolates have a species-specific insertion sequence, IS6110 which is highly polymorphic in copy number and position in the chromosome [2, 4, 6]. Despite the widespread application of IS6110 RFLP fingerprinting, population-based studies of M. tuberculosis epidemiology and genetic relationships have been hindered by the fact that 50% of strains have unique IS6110 profiles or low IS6110 copy numbers [4, 6, 12, 22, 23]. We have recently reported that analysis of sSNPs can overcome some of the inherent limitations associated with IS6110 profiling [7]
In the present study, SNP analysis provides a phylogenetic framework to enhance understanding of the evolution of IS6110 profiles, as illustrated by an example. Strains with an IS6110 profile designated as BE (1 IS6110 copy), H6 (2 IS6110 copies), or C28 (3 IS6110 copies) have IS6110 inserted in the mmpS1 gene [12]. All strains with this IS6110 insertion were genetically allied in cluster IV, indicating that these strains are related by descent
There is controversy regarding the amount of phylogenetic signal present in IS6110 copy number [17, 20, 22, 23]. The distribution of IS6110 copies in the 9 genetic clusters showed that isolates with <6 copies of IS6110 can be represented among very different phylogenetic lineages (figure 3). This observation, together with the distribution of strains with 1 IS6110 copy in clusters I, II.A, and IV, definitively rules out the overly simplistic subdivision of M. tuberculosis sensu stricto into only 2 major lineages defined by “low” and “high” IS6110 copy numbers [17, 22]
Distribution of IS6110 copy number in 9 primary genetic clusters. The color figure is available in its entirety in the online edition of the Journal of Infectious Diseases
Variation in spoligotype among the 9 genetic clusters Spoligotyping, together with IS6110 RFLP profiling, also has been used to differentiate among M. tuberculosis isolates [6, 24–26]. Therefore, we next studied the relationship between phylogenetic lineage and spoligotype. There was no sharing of spoligotypes among isolates assigned to genetic clusters I–II (representing principal genetic group 1 organisms) and isolates assigned to genetic clusters III–VIII (representing principal genetic group 2/3 organisms). In contrast, 11 spoligotypes were shared among isolates assigned to SNP clusters containing principal genetic group 2/3 organisms (supplementary table 1). Cluster II.A was the only cluster with principal genetic group 1/2 strains. Similarly, 7 spoligotypes were shared between organisms of cluster II.A and cluster II, clusters III–V, and clusters VI–VIII (supplementary table 1). Thus, spoligotyping provided additional evidence that principal genetic group 2/3 organisms are, as a population, genetically related but differentiated from principal genetic group 1 organisms [6, 7, 20]
Variation in MIRU polymorphisms among the 9 genetic clusters Analysis of MIRUs also has been used to infer relationships among M. tuberculosis isolates [17–19, 27], and 12 MIRUs are commonly used for this purpose. MIRU regions are characterized by VNTRs and are dispersed around the chromosome, often in intergenic regions [5, 18]
As observed for the spoligotype data, there was no sharing of MIRU profile between strains of principal genetic group 1 and principal genetic group 2/3. In contrast, MIRU profiles were shared among principal genetic group 1 organisms and within and between principal genetic group 2/3 isolates (supplementary table 1). These results support our thesis that principal genetic group 1 strains are genetically distinct from strains of principal genetic group 2/3 [7, 20]
Genetic relationships among strains isolated from distinct geographic areas The SNP, spoligotype, IS6110 profile, and MIRU data together indicate that each of the 9 genetic clusters is composed of strains that have shared a common ancestor—that is, they are clonally related by descent. Inasmuch as M. tuberculosis strains cannot spread worldwide instantaneously, we hypothesized that isolates of the 9 clusters are not equally distributed in all localities. The results (supplementary table 1) are consistent with the hypothesis. For example, strains from Finland were found to be significantly (P⩽.001) more abundant in clusters III and VII than in other clusters and were almost nonexistent in cluster II (P⩽.001) (figure 4), a cluster containing strains (W and Beijing) that are abundant in many areas of Asia [28–30]. In contrast, strains from patients in New York City, New Jersey, and Houston were present at similar high rates in all clusters. This result was not unexpected, considering that Finland has a far more modest immigration history than the 3 metropolitan sites studied in the United States
Nonrandom geographic subdivision of Mycobacterium tuberculosis clone families. Asterisks indicate examples of significant (nonrandom) distribution of genotypes: the abundance of cluster III strains and the low proportion of cluster II strains in Finland
To test the hypothesis that the wide array of distinct M. tuberculosis genotypes present at reasonably high frequencies in the United States sites is due to population admixture, we investigated the relationship between genetic clusters and patient place of birth. The vast majority of the strains assigned to clusters I, II, and II.A (mainly principal genetic group 1 organisms) were cultured from patients born in Asia. For example, 79% of the strains isolated from patients born in Pakistan, India, and Bangladesh; 78% of the strains isolated from patients born in Vietnam, Cambodia, and the Philippines; and 64% of the strains isolated from patients born in China and South Korea were in these 3 genetic clusters (figure 5). Analogous results were obtained for patients born in other geographic areas, consistent with the hypothesis of inflation of strain population heterogeneity caused by population admixture (figure 5 and supplementary table 1)
Distribution of Mycobacterium tuberculosis clone families, based on patient birth location. CEP, Colombia, Ecuador, and Peru; FIN, Finland; GEH, Guatemala, El Salvador, and Honduras; HDP, Haiti, Dominican Republic, and Puerto Rico; MEX, Mexico; PIB, Pakistan, India, and Bangladesh; SC, South Korea and China; USA, United States; VCP, Vietnam, Cambodia, and the Philippines
Phylogenetic framework: 9 major M. tuberculosis clusters The observation that IS6110 RFLP, spoligotype, and MIRU profiles were not shared between principal genetic group 1 and 2/3 strains suggests that organisms in these genetic groups diverged from a common ancestor relatively early during M. tuberculosis evolution. The hypothesis that a common ancestor had some characteristics of a cluster II.A strain is supported by the central position of this cluster in the phylogenetic tree and by the observation that a specific spoligotype, S29 (also referred to as “type 53” [31]), was found in strains of all clusters from II.A through VIII. Moreover, this spoligotype has been identified in strains originating from diverse geographic locations, was the second-most-frequent spoligotype in our strain collection, and is very prevalent globally [6, 24–26, 31]. Spoligotype S29 has all spacer regions except direct variable repeats (DVRs) 33–36 (figure 6) and is the spoligotype with one of the larger numbers of DVRs. This fact, together with the absence of DVRs 33–36 in all genetic group 2/3 (clusters II.A–VIII) organisms, suggests a loss of these DVRs from an ancestral cluster II.A strain. Furthermore, some cluster II.A/spoligotype S29 strains have only 1 copy of IS6110 Inasmuch as evolution of the direct repeat (DR) region is dominated by loss of DVRs over time [32], and because evidence suggests that the DR region is the ancestral IS6110 insertion locus from which IS6110 radiated to other genomic locations [33], a cluster II.A/spoligotype S29 isolate with 1 IS6110 copy is a possible ancestor of principal genetic group 2/3 strains. On the basis of this line of thought, one might expect to identify a strain with very similar or identical genetic characteristics in ancient human remains. The finding of spoligotype S29 DNA in ancient Egyptian mummies (∼4500 b.c) supports this hypothesis [34]
Widespread distribution of spoligotype S29 in the Mycobacterium tuberculosis phylogenetic tree. The color figure and legend are available in their entirety in the online edition of the Journal of Infectious Diseases
Strongly clonal population structure of M. tuberculosis. Nonrandom associations of alleles over loci, nonrandom distribution of IS6110 profiles and spoligotypes with a phylogenetic structure based on analysis of 230 sSNPs, and greatly restricted allelic variation in structural genes relative to those of many other bacterial species led to the conclusion that the population structure of M. tuberculosis is clonal [7]. This idea was supported by the finding of highly significant linkage disequilibrium among 12 MIRU-VNTR loci and among MIRU-VNTR loci and IS6110 profiles [35]. More recently, Tsolaki et al. [36] reported that distinct genomic deletions were nonrandomly associated with closely related M. tuberculosis isolates, which provides additional evidence for a clonal population structure. Notwithstanding the possibility of very rare horizontal gene-transfer events in natural populations, it is clear from all lines of evidence that the population structure of M. tuberculosis is clonal [20, 37, 38]. One implication of this finding is that strains of each phylogenetic lineage are evolving essentially in isolation from other M. tuberculosis cell lines, raising the possibility that distinct clones differ in medically relevant phenotypes such as virulence, transmissibility, or ability to reactivate. Consistent with this idea, isolates of genetic cluster II have been reported to be far more virulent in mice than strains CDC1551 and H37Rv (genetic clusters VI and VIII, respectively) [39, 40]
Geographic distribution of M. tuberculosis clones Isolates assigned to the 9 genetic clusters have a nonrandom geographic distribution (figure 4). The most likely explanation for the observed geographic variation in the frequencies of different strain genotypes is human population migration history coupled with clonal expansion. This phenomenon was first noted in the study of the global population genetics of serotype b Haemophilus influenzae and was subsequently reported in studies of Helicobacter pylori [41, 42]. Importantly, we identified a strong correlation between country of birth and genotype of the infecting strain, thereby confirming the results obtained in a study of genomic deletions and spoligotypes [43]. As discussed elsewhere for serotype b H. influenzae [41], the observed pattern of geographic variation in the clonal composition of a pathogen may simply reflect a pattern of differentiation that evolved in relative geographic isolation before the Age of Exploration and that has not yet been obscured by recent population migration, human admixture, and strain transmission. Importantly, we cannot yet reject the alternative hypothesis that contemporary patterns of variation in relative clone abundance reflect differential susceptibility on the part of distinct human populations. It has also been speculated that widespread use of bacille Calmette-Guérin vaccination has contributed to local variation in the frequencies of certain M. tuberculosis strains [27], although little evidence exists to support this notion [29]. It will be important to differentiate among these hypotheses in future studies
We thank G. Adams, T. Bowland, Y.-X. Fu, S. Hua, X. Liu, C. A. Lux Migliaccio, S. M. Ricklefs, J. Smoot, D. Sturdevant, G. Sylva, L. Teeter, and Nathalie Williams-Bouyer, for their critical contributions to the success of this project
Potential conflicts of interest: none reported
Financial support: Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health
Present affiliation: Instituto Cantonale di Microbiologia, Bellinzona, Switzerland
IDSA Members: For your free access to this journal, log in via the IDSA members area.
Open access options for authors visit Oxford Open
This journal enables compliance with the NIH Public Access Policy