Insights into Breast Cancer through mitochondrial genomics

Breast cancer has a predominant incidence and prevalence in Caucasian women. Although alterations in the mitochondrial genome probably play an important role in carcinogenesis, the current evidence is ambiguous and inconclusive. The purpose of the present work was to determine the frequency of polymorphisms associated with breast cancer by exploring mitochondrial sequences of clinical cases of different origins, reusing information available in the public, free-access database GenBank. We identified 110 mtDNA sequences associated with breast cancer cases, of which 72 were complete mitochondrial chromosomes and 38 were partial sequences. Of these 110 sequences, 85 belonged to patients with a confirmed diagnosis of breast cancer and 25 complete sequences were obtained from healthy mammary tissue of the same patients. In addition, we obtained 49 complete mtDNA sequences from Eskimo and Inuit individuals, a population with a low prevalence of breast cancer, to use as controls. T16519C was found in 60% of the breast cancer sequences and in less than 20% of the control sequences (P = 0.427). Two changes were found in D310 in all sequences analyzed: the first was the insertion of a C at position 315, and the second was the replacement of the cytosine at position 309 by at least two cytosines and one thymine, always followed by a thymine-to-cytosine transition at position 310.

of patients with breast cancer are polymorphic and not associated with a few unique mutations.

Background
Breast cancer has ranked first for more than a decade as the leading cause of death from malignant neoplasms in women, mainly affecting women between the fourth and sixth decades of life and all socioeconomic levels (1,2). Mortality rates for this neoplastic disease have increased significantly in the last five decades. Between 1955 and 1960, the rate was around two to four deaths per 100,000 women, and it has risen steadily in adult women of all age groups, with a greater impact from 30 years of age (3).
Current evidence indicates that breast cancer is a multifactorial disease with modifiable and nonmodifiable risk factors. Obesity, postmenopausal state, smoking, and physical inactivity are the most important modifiable factors. Inherited autosomal mutations have been implicated in ~10% of all breast cancer cases (4).
In Latin America, there are important regional differences in incidence and prevalence, with a higher frequency of breast cancer in areas with the highest socioeconomic incomes than in indigenous areas where the socioeconomic level is lower. This evidence is compatible with the epidemiology of breast cancer worldwide, where the disease weighs most heavily on developed countries and has lower incidence and prevalence rates in third world populations (5,6). Beyond a socioeconomic or environmental component, several authors have pointed to a relationship with the genetics of different human populations, with special attention to the different mitochondrial haplogroups. However, there is a lack of evidence associating mitochondrial haplogroups with sporadic (7) or familial hereditary breast cancer (8).
Some authors have suggested that mitochondrial DNA (mtDNA) alterations play an important role in carcinogenesis (9-17), involving the D-loop control region of the mtDNA because it contains sequences essential for transcription and replication. Previous studies have suggested that polymorphisms in these non-coding regions of the control area could play an important role in the pathogenesis of breast cancer (18-22). These sequence changes can be associated with a particular phenotype and serve as markers for the development of various malignancies (23). However, some studies suggest that mitochondrial polymorphisms vary from one population to another. In addition, some polymorphisms are more closely related to the presence of breast cancer than others (24-26).
Many works have focused on polymorphisms in regions related to the synthesis of proteins and RNA for mitochondrial metabolism. Among the most cited in the literature, the A10398G polymorphism, related to the synthesis of the protein NADH-ubiquinone oxidoreductase 3 (ND3), has been identified as a biomarker in different populations such as Polish (27), Indian (23,28), Chinese (29,30), and mainly Afro-descendant groups (23,31-33). This polymorphism has also been associated with metabolic syndrome and mental disorders in populations of Asiatic origin (34). However, a meta-analysis found no association when this polymorphism was analyzed individually, without correlating it with other mitochondrial polymorphisms in women affected by this malignancy (35). Although the A10398G substitution confers an increased risk of breast cancer in Caucasian Americans of European descent, other polymorphisms with greater statistical significance have been identified, such as T16519C, located in the control region (36-41).
Other studies have focused on T16519C (19,31,32,34,36), a D-loop mutation linked to an increased breast cancer risk, occurring either alone or in association with alterations in mitochondrial protein-coding genes such as A10398G, G13368A or C14766T (34,36). Moreover, the association of several variants resulted in a significant predictive factor for breast cancer. Indeed, A10398G, together with other mutations such as T4216C, G9055A, A12308G or T16519C, is considered to increase the risk of developing breast cancer in women (34,38).
Two novel polymorphisms in the D-loop, one at position 16290Tins and the other at position 16293Adel, were significantly more prevalent in breast cancer patients than in control subjects. That study suggests that these two novel D-loop polymorphisms may be candidate biomarkers for breast cancer diagnosis in Bangladeshi women (24).
Despite these promising results, and although most of these mtDNA polymorphisms have functional consequences, associations between specific polymorphisms and cancer risk have been subject to intense debate. Several studies associating specific polymorphisms with cancer risk have been scrutinized for erroneous experimental design, interpretation and low-quality data (42). Many of these mtDNA variants may not be conclusive because of artifacts related to genotyping errors or inadequate experimental design (7,43,44). However, given its potential usefulness as a diagnostic tool, the study of mitochondrial DNA and its relation to cancer must remain an important focus of oncological biomarker research, with adequate study design, population stratification and independent replication of the results.
The purpose of the present work was to determine the frequency of polymorphisms associated with breast cancer exploring mitochondrial sequences of clinical cases diagnosed from different origins reusing information available in the public free access database GenBank.

Methods
The study protocol was approved by the research and ethical committee of the Universitary Center of Tonala of the University of Guadalajara in Mexico. The search for complete and partial mtDNA sequences was performed in the NCBI GenBank Nucleotide database (National Center for Biotechnology Information) (https://www.ncbi.nlm.nih.gov/nucleotide). For the search of complete chromosomes, fragments smaller than 15,400 base pairs were not considered. Short sequences were searched considering fragments no greater than 1,500 bp.
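The two length thresholds above can be expressed as a simple filter. The following is a minimal sketch, assuming the sequences have already been downloaded as FASTA text; the function names `read_fasta` and `classify_by_length` are hypothetical and are not part of the original workflow.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}; the identifier is the
    first whitespace-delimited token after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def classify_by_length(length, complete_min=15400, partial_max=1500):
    """Apply the thresholds stated in the Methods: at least 15,400 bp counts
    as a complete chromosome, at most 1,500 bp as a partial sequence, and
    anything in between is excluded."""
    if length >= complete_min:
        return "complete"
    if length <= partial_max:
        return "partial"
    return "excluded"


if __name__ == "__main__":
    demo = ">seqA toy record\nACGT\nACGT\n>seqB\nGG\n"
    for name, seq in read_fasta(demo).items():
        print(name, len(seq), classify_by_length(len(seq)))
```

The toy records are far shorter than real mtDNA and are only there to show the call pattern.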
The search strategy was carried out using the keywords Homo sapiens, mitochondrion, and Breast Cancer, using Boolean operators and filters to select the results and excluding the term "ancient human remains" to facilitate the search in the database. As control sequences, we obtained samples from Eskimo and Inuit individuals, a population with a low prevalence of breast cancer.
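The exact query string is not given in the text, so the following is only an illustrative sketch of how the stated keywords, exclusion term and length limits might be combined into an Entrez-style query. The helper `build_query` is hypothetical; the field tags `[Organism]`, `[All Fields]` and `[SLEN]` are standard Entrez tags, but their use here is an assumption rather than the authors' actual search.

```python
def build_query(organism="Homo sapiens",
                keywords=("mitochondrion", "breast cancer"),
                exclude=("ancient human remains",),
                min_len=None, max_len=None):
    """Assemble a Boolean query string; no network request is made."""
    parts = [f'"{organism}"[Organism]']
    parts += [f'"{word}"[All Fields]' for word in keywords]
    if min_len is not None or max_len is not None:
        low = 1 if min_len is None else min_len
        high = 100000 if max_len is None else max_len  # arbitrary upper bound
        parts.append(f"{low}:{high}[SLEN]")
    query = " AND ".join(parts)
    for term in exclude:
        query += f' NOT "{term}"[All Fields]'
    return query


if __name__ == "__main__":
    # Complete chromosomes: nothing shorter than 15,400 bp.
    print(build_query(min_len=15400))
    # Short fragments: nothing longer than 1,500 bp.
    print(build_query(max_len=1500))
```

The resulting strings can be pasted into the Nucleotide search box; the records returned still need the manual metadata review described in the next paragraph.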
Once the sequences were identified in the database, the accompanying metadata were analyzed to validate the diagnosis of breast cancer and eliminate those that did not correspond to, or did not specify, the presence of the neoplastic disease. For the selected sequences, the citation of the originating study was retrieved to evaluate the purpose of the sampling and its objectives within the experimental design of each particular work.
Once this was done, the complete sequences were obtained as FASTA files using their GenBank sequence identification numbers. Subsequently, the FASTA files were used as input to the University of California Santa Cruz Genome Browser (http://genome.ucsc.edu/) (45), and the sequences were aligned against the revised Cambridge Reference Sequence (rCRS) in the latest available assembly (GRCh38/hg38). The FASTA files were aligned using BLAT, a BLAST-like alignment tool (46), and custom sequence tracks were created in the browser with the mtDNA obtained from breast cancer cases, allowing us to view the variants from a general, panoramic perspective across the mitochondrial genomic landscape.
The criteria for the classification of the different haplogroups can be found in the Phylotree database (http://www.phylotree.org/) (47). Once haplotyping was performed and the variants in the sequences were identified, a database was constructed to quantify the haplogroups, haplotypes and main subclades of the population analyzed with breast cancer.
To evaluate the presence of phantom polymorphisms and discordance between the different haplogroups, the Haplogrep package (available at http://haplogrep.uibk.ac.at/) was used with the latest version of PhyloTree 17. Furthermore, using as a reference the sequence NC_012920 deposited in GenBank as rCRS (28), and with the tool Haplogrep 2 (v. 2.1.19) accessible on the same platform, we constructed a graphical phylogenetic tree to explore the population structure and identify possible phantom polymorphisms not assigned to the resulting haplogroup. All variants with a score <3 that were found in at least two different sequences were considered, according to the criteria of Soares et al. (48). Only the complete mtDNA sequences were analyzed, and a phylogenetic tree was generated that graphically represents the polymorphisms associated with the various haplogroups, global and local private mutations, regressive mutations, and the expected loss of polymorphisms for each haplogroup.
For the analysis of the structure of the populations, a single file with all the sequences in FASTA format was used as input for multiple sequence alignment with CLUSTAL W in the MEGA (Molecular Evolutionary Genetics Analysis) package (https://www.megasoftware.net/). Once the sequences were aligned, the mitochondrial single nucleotide polymorphism (mtSNP) profile and frequency functions of the mtDNAprofiler package (http://mtprofiler.yonsei.ac.kr/) were used. With these tools, changes in the sequence were identified, as well as insertion, deletion, and heteroplasmy sites. In addition, all the sequence alignments were reviewed to detect polymorphic heteroplasmic sites.
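To make the variant-identification step concrete, the sketch below shows one way substitutions, deletions and insertions could be read off a pairwise alignment against the reference. It is an illustration only and does not reproduce the internals of mtDNAprofiler or MEGA; the function `call_variants` and the toy sequences are hypothetical. Labels follow the notation used in this paper (for example A263G, A249d, 315.1C).

```python
def call_variants(aligned_ref, aligned_sample, start=1):
    """Compare two equal-length aligned strings ('-' marks a gap) and return
    variant labels numbered on the reference coordinate system."""
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    pos = start - 1        # last reference position consumed so far
    inserted = 0           # bases inserted after the current position
    for ref_base, sample_base in zip(aligned_ref.upper(), aligned_sample.upper()):
        if ref_base == "-":
            # Gap in the reference: the sample carries an extra base.
            if sample_base != "-":
                inserted += 1
                variants.append(f"{pos}.{inserted}{sample_base}")
            continue
        pos += 1
        inserted = 0
        if sample_base == "-":
            variants.append(f"{ref_base}{pos}d")               # deletion
        elif sample_base not in (ref_base, "N"):
            variants.append(f"{ref_base}{pos}{sample_base}")   # substitution
    return variants


if __name__ == "__main__":
    # Toy alignment: one substitution and one insertion after position 3.
    print(call_variants("ACG-T", "ATGCT"))
```

Ambiguous bases (N) are skipped rather than reported; heteroplasmic sites would need the raw reads or IUPAC codes and are outside this sketch.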
The number of mtSNPs as total polymorphic loci, the number of fixed differences, the polymorphic and monomorphic mutations among the populations, the shared polymorphisms and the average number of different nucleotides between populations were also estimated. In this way, the identification of the most common polymorphisms in the control region and in the coding region of mtDNA was facilitated.
To evaluate the association, the frequencies of selected polymorphisms in the breast cancer mtDNA sequences were compared with those in the controls using the χ² test in SPSS software (IBM Corporation, Armonk, NY) version 21.
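As an illustration of this comparison, the sketch below computes a Pearson χ² statistic for a 2×2 table of carriers and non-carriers in cases and controls. It is not the SPSS procedure itself, and it assumes no continuity correction, which the text does not specify. The example counts are the 315.1C frequencies reported in the Results (36 of 72 breast cancer sequences, 12 of 25 controls); the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], where a/b are
    carriers/non-carriers among cases and c/d among controls.
    Returns (chi2, p_value) with 1 degree of freedom, no Yates correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = chi_square_2x2(36, 36, 12, 13)
    print(f"chi2 = {chi2:.4f}, p = {p:.3f}")
```

With counts this similar between groups the statistic is close to zero, in line with the non-significant result reported for 315.1C.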

Results
In the search of the NCBI GenBank database with the previously defined criteria, 110 mtDNA sequences were identified, of which 72 were complete (16.5 kb) and 38 were partial (Table 1). Of the 38 partial sequences smaller than 0.5 kb, 12 covered the HVR1 region and 26 the HVR2 region of the mitochondrial chromosome. The origin of each sequence was confirmed by reviewing the information contained in its metadata.
Of these 110 identified sequences, 85 belonged to patients with a confirmed diagnosis of breast cancer and 25 complete sequences were used as controls (Table 1). These control sequences fell into three categories: 6 distant normal tissue sequences and 7 para-cancerous normal tissue sequences obtained from Wang et al. 2007 (49), and 12 sequences obtained from cells identified as normal under laser capture microdissection, from biopsies of patients diagnosed with invasive breast carcinoma reported by Fendt et al. (50). It is important to note that 42 sequences (38%) were reported in Europe by Gasparre et al. (51) and Fendt et al. (50), and about half of the complete sequences (50.6%) and 28.9% of the partial sequences were related to Caucasian or Euro-Asiatic origin with haplogroup H (Tables 2 and 3 and Supplemental material Figure S1). Seven possible phantom polymorphisms (54), shown in Table 3, were identified in the 47 complete mtDNA sequences and were excluded from the subsequent analyses.
The most frequent haplogroup in the breast cancer sample was H, with 23 individuals representing about half of the population, followed by haplogroup B in 7 individuals (13.5%) and haplogroups C, N and J, with 3 individuals each, together representing 11.4% of the population. Haplogroups A, B, and C, usually associated with Asiatic and Amerindian populations, were found in 12 individuals from the breast cancer group, which together represented almost a quarter of the population (23.1%) (Table 2 and Supplemental material Figure S1).
After haplotyping and navigation in the UCSC Genome Browser, we found 214 different polymorphisms shared with the sequences used as controls and 241 unique polymorphisms found only in breast cancer; only one polymorphism, A3480G, was found exclusively in a control sample (Figure 1 and Supplemental material Figure S2).
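The partition into shared, cancer-only and control-only polymorphisms amounts to set operations over the per-sequence variant lists. The following is a minimal sketch with invented toy data; the function names and sample identifiers are hypothetical, and the real counts are those reported above.

```python
from collections import Counter


def compare_groups(cancer, controls):
    """Each argument maps a sequence identifier to its set of variant labels.
    Returns the variants shared by both groups and those exclusive to each."""
    cancer_all = set().union(*cancer.values())
    control_all = set().union(*controls.values())
    return {
        "shared": cancer_all & control_all,
        "cancer_only": cancer_all - control_all,
        "control_only": control_all - cancer_all,
    }


def carrier_counts(group):
    """Count in how many sequences of a group each variant appears."""
    counts = Counter()
    for variants in group.values():
        counts.update(set(variants))
    return counts


if __name__ == "__main__":
    # Toy data for illustration only.
    cancer = {"bc1": {"A263G", "T16519C"}, "bc2": {"A263G", "G3010A"}}
    controls = {"ct1": {"A263G", "A3480G"}}
    print(compare_groups(cancer, controls))
    print(carrier_counts(cancer).most_common(1))
```

Dividing each carrier count by the group size gives the relative frequencies compared between breast cancer and control sequences.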
The cumulative analysis shows that most of the variation in the breast cancer sequences consists of unique polymorphisms, and only 13 mutations are found in almost 50% of the sequences studied. In contrast, in the control group, up to 22 different polymorphisms were observed in half of the sequences, which is proportionally less frequent than in the samples associated with breast cancer (Figures 1 and 2). Most of the variation was found in unique polymorphisms distributed along the mitochondrial genome (Supplemental material Figure S5), but the polymorphisms most often repeated across sequences are concentrated in the control region, between positions 16024 and 576 of the mtDNA (Supplemental material Figure S4).
The polymorphisms most frequently shared between breast cancer and controls were A16183C in 7 sequences, T16217C in 5 sequences, T16298C in 4 sequences, and A249d, T401d, C16184A, T16092C, A16235G, A263G, A15326G, A1438G and A4769G in 3 sequences each (Table 4 and Figure 2). Moreover, 32 polymorphisms were present in two sequences, and 60 sequences had a single polymorphism each. At position 16147, there were two different polymorphisms: a C/T transition in 3 sequences and a C/A change in one.
When we compare the relative frequencies of each sequence, we observe a significant proportion of polymorphisms exclusive to breast cancer that are practically absent in the control samples (Figure 3). By comparing the absolute frequencies of breast cancer polymorphisms, we could define the variants that were less frequent and those that appeared more often than expected, taking the proportions in the controls as a reference (Supplemental material Figure S5).
Although the variants most frequent in breast cancer and least frequent in controls were G3010A, T16311C, T16189C, and T16519C, only T16519C reached statistical significance in the Pearson χ² test (Table 6). T16519C, the most important polymorphism, represented only 2.6% of the total shared sequences analyzed (Figure 2). Moreover, it was found in nearly 60% of the breast cancer sequences and in only 20% of the controls (Figure 1 and Supplemental material Figure S3).
Among the 72 mtDNA sequences obtained from patients diagnosed with breast cancer, 6 common polymorphisms were identified: A263G, A750G, A1438G, A4769G, A8860G and A15326G. These were present in 98% of the analyzed sequences obtained from malignant tissue. However, they were also found in most of the control samples.
The polymorphism A10398G, usually associated with breast cancer in other studies, was found in only 17 (24%) of the mtDNA sequences analyzed, without statistical significance (Table 5). The frequencies of 315.1C and T1659C, the most recurrent polymorphisms in our analysis, occurring alone or together in the same sequence, were likewise without statistical significance (Table 5 and Supplemental material Figure S5).
The insertion of base C at position 315 (315.1C) was found in 36 (50%) of the cancer samples and in 12 (48%) of the samples used as controls. When this polymorphism was absent, the remaining 36 sequences (50%) presented the T310C polymorphism together with a change at position 309. Although the most frequent polymorphism at position 309 was C309CCT, our analysis found C309CCCT in three sequences (4%) and C309CCCCT in one sequence (1.4%). In the χ² test, C309CCT-T310C and 315.1C were not significantly different between groups, and we also ruled out the possibility that they were artifacts associated with ghost mutations (Tables 5 and 6).

Discussion
protection against breast cancer. However, when we analyzed the sequence information, it was possible to observe polymorphisms commonly related to breast cancer. It is important to note that most of the samples have a European origin, and these are similar to half of the population with breast cancer (44.9%) and to the sequence used as the rCRS reference genome in the database and genomic browsers.
Moreover, of the 241 polymorphisms exclusive to neoplastic tissue, only 9 were repeated in more than 3 different sequences in our analysis: A16183C was repeated in 11 different sequences, T16217C in 5, and T16298C in 4 sequences. Some of these changes in the mtDNA sequences are related to other malignant neoplastic diseases (49).
These variants in D310 were present in 98% of the analyzed sequences. Although most of these polymorphisms are commonly associated with specific haplogroups, the 315.1C polymorphism is evolutionarily associated with haplogroup H of Caucasian origin and is infrequent in Indo-European haplogroups, with a frequency lower than 1.5%. In our analysis, it was found in 99% of the sequences of non-European origin with breast cancer.
It is important to note that T16189C and T16311C are located in HVR1, whereas T16519C lies at a non-coding position in the control region. In our analysis we found a characteristic pattern related to breast cancer, defined by the presence of 315.1C and the absence of C309CCT and T310C; in the same way, when the polymorphisms C309CCT and T310C were present, 315.1C was invariably absent. When analyzed individually, each of these changes has no statistical significance (Tables 5 and 6). Nevertheless, although the 315.1C polymorphism is related in the literature mainly to haplogroup H of Caucasian origin and is infrequent in Indo-European haplogroups, with a frequency lower than 1.5%, in our analysis it was found in 99% of the non-European sequences with breast cancer.
In a previous study, we identified that the polymorphism 315.1C is not associated with breast cancer. However, that study used as a reference two samples from populations with medium and very low prevalence of neoplastic disease (55). This change in the sequence is written 315.1C, where .1C means that a copy of the base C has been inserted at position 315 compared to the rCRS, so sequences carrying it are one base longer than the rCRS at this site.
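The notation just described can be decoded mechanically. The sketch below is a hypothetical helper, not part of any tool used in this work; it parses the three simple label forms that appear in this paper: substitutions such as A263G, deletions such as A249d, and insertions such as 315.1C. Compound labels such as C309CCT are deliberately not handled and raise an error.

```python
import re

# Patterns for the simple label forms used in the text.
_INSERTION = re.compile(r"^(\d+)\.(\d+)([ACGT]+)$")    # e.g. 315.1C
_DELETION = re.compile(r"^([ACGT]?)(\d+)d$")           # e.g. A249d or 3106d
_SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")  # e.g. A263G


def parse_variant(label):
    """Return a dictionary describing a single variant label."""
    match = _INSERTION.match(label)
    if match:
        return {"type": "insertion", "position": int(match.group(1)),
                "copy": int(match.group(2)), "bases": match.group(3)}
    match = _DELETION.match(label)
    if match:
        return {"type": "deletion", "position": int(match.group(2)),
                "ref": match.group(1)}
    match = _SUBSTITUTION.match(label)
    if match:
        return {"type": "substitution", "position": int(match.group(2)),
                "ref": match.group(1), "alt": match.group(3)}
    raise ValueError(f"unrecognized variant label: {label}")


if __name__ == "__main__":
    for label in ("315.1C", "A263G", "A249d"):
        print(label, parse_variant(label))
```

In 315.1C the position refers to the reference base after which the extra C is placed, and the number after the dot counts inserted bases at that site.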
The polymorphism 315.1C is considered to be one of the most recent mutations, having occurred in the last 60,000 years. This type of mutation, now well accepted and detailed in many phylogenetic trees, is related to haplogroup H (56). Although this haplogroup is related to populations of ethnic origin, it has been found to have a high prevalence in Latin American mestizos and in other populations with different haplogroups (55-61).
The polymorphisms 315.1C and C309CCT-T310C, which are found in the non-coding HVR2 fragment of the D-loop in the mtDNA control region, are considered a hotspot because of the high frequency of changes. Furthermore, modifications at these sites may have implications for proper transcription and for the regulation of mitochondrial genome expression. This poly-C hotspot is located in the mitochondrial genome between positions 303 and 315, is defined as D310, and is considered highly polymorphic; it can differ even between direct relatives in the maternal line (48,62-64).
Because of its high prevalence in Caucasian groups, 315.1C is usually not considered during bioinformatic analysis and is not used in the construction of phylogenetic trees, which is why it usually goes unnoticed. However, there are reports in the literature where the 315.1C polymorphism is associated with various forms of cancer and other chronic-degenerative diseases (53,63). This poly-C tract in the mitochondrial D-loop, commonly located between nucleotides 303 and 315, has been identified as a frequent mutation hotspot in human neoplasia, including breast cancer (56-58), suggesting that mtDNA instability at this site may be a common characteristic of this malignant disease.
In Mexico, there is a high prevalence and incidence of breast cancer; however, it is lower than in the United States of America or Western Europe, due to both populations have
It is important to remark that in our analysis T16519C was found in nearly 60% of the breast cancer sequences and in less than 20% of the control sequences. Other studies have focused on this polymorphism, apparently without relating it to other polymorphisms previously described in the available literature (20,32-35,38). Moreover, the association of several variants resulted in a significant predictive factor for breast cancer. Indeed, A10398G, usually associated with breast cancer in other studies (28,29,31-34,52), was found in only 24% of the mtDNA sequences analyzed, without statistical significance; in association with other mutations such as T4216C, G9055A, A12308G or T16519C, it has been found to increase the risk of developing breast cancer (36).
supporting the theory of the first human migration out of the African continent (56).
Once again, the high number of mutations suggests that there was a significant bottleneck in human evolution at the time, perhaps around 120,000 years ago, which might have lasted for many thousands of years (56).
Finally, the molecular mechanisms underlying the increased risk of cancer attributed to these specific mtDNA polymorphisms are still unclear. The control region is important for the regulation of mitochondrial genome replication and expression. Polymorphisms in this region might affect mtDNA replication and lead to alteration of the electron transport chain, resulting in compensatory increases in glycolytic ATP production (66). However, one of the inevitable products of these alterations is an increased release of highly reactive oxygen species, which may lead to mitochondrial abnormalities. These abnormalities invoke a mitochondria-to-nucleus retrograde response and finally result in nuclear genome damage, which contributes to the initial events of carcinogenesis (67-69). The regulation of mitochondrial genome replication from the control region might also lead to mtDNA damage (70-72) and, with a critical number of mitochondrial genome changes, to cellular apoptosis (73), which could finally induce cancer development.
The evidence so far suggests that the changes previously described in the literature and the findings of our analysis are probably the result of oxidative stress damage to the mitochondrial genome rather than the origin of the changes associated with carcinogenesis.
Further analysis is required to evaluate more sequences and to calculate the correlation with the risk of developing malignancy. This approach will give us a general perspective on the importance of consolidating the evidence held in specialized repositories that has not yet been cross-referenced. Finally, it is important to include populations of origins other than Caucasian in order to show the genetic differences related to breast cancer, regardless of high or low prevalence. With this focus, our understanding of these malignant diseases will improve through the interpretation of the complete mitogenome.

3106d    47  0
3107C    47  0
310.1C   39  0
514d     9   0
515d     9   0
398d     8   0
248d     6   0
430A     4   0
49T      4   0
16011d   4   0
8279d    3   0
8272d    3   0
8271d    3   0
8274d    3   0
8273d    3   0
8276d    3   0
8275d    3   0
8278d    3   0
*Statistical significance P = 0.0

Figure 1 Comparative cumulative proportions of breast cancer and control polymorphisms.
