As a rule, projects that assess genome-wide methylation profiles with the IHM 450K BeadChip target a cohort representing a specific cross-section of the population of a particular region or ethnicity. Researchers aim to find associations between the DNA methylation profile and a disease in a specific country or geographic region. We carried out comparative studies and characterized the correlation coefficients for the 16 CpG-dinucleotides listed above depending on the ethnoregional affiliation and sex of individuals, as well as on a history of chronic disease (rheumatoid arthritis, HIV, multiple sclerosis, depressive disorders, oncological diseases) or harmful habits (nicotine addiction).
Correlation coefficients of DNA methylation with biological age depending on the ethnogeographic factor. Correlation coefficients (R) for 16 CpG-dinucleotides, calculated within 8 GEO projects from countries of the European (UK, Italy, Sweden, Germany) and North American (USA) regions, are presented in Table 2. Only data for healthy individuals were used, taking into account ethnogeographic status and without regard to sex. The European region was represented by 3579 individuals (Great Britain: 2614, Italy: 362, Sweden: 430, Germany: 173) and the North American region by 672.
For three CpG-dinucleotides the R-values were the most reproducible, as evidenced by low standard deviations: cg19283806 (-0.571 ± 0.068), cg25410668 (0.492 ± 0.069) and cg16867657 (0.810 ± 0.073). Two of them, cg19283806 and cg16867657, also show the largest absolute values of R. The largest variation in R-values was observed for cg18473521 (standard deviation 0.151), cg11807280 (0.146) and cg24079702 (0.135).
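The "mean R ± standard deviation" summaries above can be reproduced along the following lines. This is a minimal sketch, assuming that R is the Pearson correlation between chronological age and the methylation (beta) value of one CpG site within each project, and that the spread is the sample standard deviation of R across projects. The function names and the toy data are hypothetical and are not taken from any GEO project.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def summarize_r(projects):
    """Return (mean R, sample SD of R) for one CpG site across projects.

    `projects` maps a project label to a pair (ages, beta_values).
    """
    r_values = [pearson_r(ages, betas) for ages, betas in projects.values()]
    return statistics.mean(r_values), statistics.stdev(r_values)


# Hypothetical toy data for a single CpG site (illustration only).
toy_projects = {
    "project_A": ([20, 30, 40, 50, 60], [0.30, 0.38, 0.45, 0.55, 0.61]),
    "project_B": ([25, 35, 45, 55, 65], [0.28, 0.41, 0.44, 0.58, 0.60]),
    "project_C": ([18, 33, 47, 59, 70], [0.25, 0.40, 0.49, 0.52, 0.66]),
}

mean_r, sd_r = summarize_r(toy_projects)
print(f"R = {mean_r:.3f} ± {sd_r:.3f}")
```

A small standard deviation across projects then indicates that the age correlation of the site is reproducible between cohorts.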
Correlation coefficients of DNA methylation with biological age depending on sex. The R coefficients for 16 CpG-dinucleotides were calculated within 6 GEO projects and are presented in Table 3. The male sample ("M") comprised 2247 individuals and the female sample ("F") 1777 individuals. The most reproducible R-values for males were obtained for cg25410668 (0.448 ± 0.082), cg08128734 (-0.453 ± 0.091), cg16867657 (0.749 ± 0.095), cg16054275 (-0.462 ± 0.098) and cg08468401 (-0.399 ± 0.099); for females, for cg19283806 (-0.565 ± 0.066), cg25410668 (0.519 ± 0.077), cg02872426 (-0.374 ± 0.077), cg16867657 (0.819 ± 0.081) and cg16054275 (-0.458 ± 0.091).
Differences between the R-values of the two sexes ranged from 0.002 to 0.071. The smallest differences were observed for the CpG-dinucleotides cg22454769 (0.002), cg16054275 (0.004), cg18473521 (0.011), cg14361627 (0.014) and cg19283806 (0.016).
Calculation of determination coefficients (R²), MAD and RMSE for regression models predicting biological age. Based on the methylation levels of the 16 CpG-dinucleotides, we calculated adjusted determination coefficients of multiple linear regression for the GEO projects (listed in Table 1). As Fig. 1 shows, the narrower the range of the indicator "Chronological age, number of years" in a study (for example, in projects GSE51032 or GSE50660), the lower the adjusted R². A regression model can adequately predict the dependent variable (biological age), with the calculated level of accuracy, only within the range of values analyzed during modeling; therefore, widening the range of the dependent variable can stabilize the model.
Thus, the adjusted R² values were in the range 0.675-0.911, and for the GEO projects with the widest age range (GSE125105, GSE40279 and GSE55763) the percentage of explained variation in the dependent variable was at least 82.6%.
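The link between the width of the age range and the adjusted R² can be illustrated with a short sketch. It uses the standard formula for the adjusted coefficient of determination and, as a simplifying assumption not made in the text, treats ages as uniformly distributed over the study range with a fixed residual standard deviation. All numbers and function names are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("n_samples must exceed n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


def r2_for_uniform_age_range(age_min, age_max, residual_sd):
    """R² = 1 - residual variance / age variance.

    Assumes ages uniform on [age_min, age_max], whose variance is
    (age_max - age_min)² / 12, and a fixed residual SD in years.
    """
    age_variance = (age_max - age_min) ** 2 / 12.0
    return 1.0 - residual_sd ** 2 / age_variance


# Hypothetical example: same residual SD (4 years), two age ranges.
wide = r2_for_uniform_age_range(18, 90, residual_sd=4.0)
narrow = r2_for_uniform_age_range(35, 72, residual_sd=4.0)
print(f"wide range R² = {wide:.3f}, narrow range R² = {narrow:.3f}")

# Hypothetical example: penalty for 16 predictors with 200 samples.
print(f"adjusted R² = {adjusted_r2(0.85, n_samples=200, n_predictors=16):.3f}")
```

With the same prediction error in years, the narrower cohort yields a lower R² simply because there is less age variance to explain, which is consistent with the pattern described for Fig. 1.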
Across the 8 GEO projects, the CpG-dinucleotides contributed differently to the coefficient of determination R² (Table 4). The largest contributions to the percentage of explained variance of the dependent variable in the regression equation came from cg16867657 (mean R² = 0.669), cg14361627 (0.056) and cg19283806 (0.044). The predictive potential of cg19283806 proved comparable to that of cg14361627. The high predictive potential of this CpG dinucleotide was also shown in the study [Chao Pan et al. 2020].
When building the models for predicting biological age, we treated the dependence of the DNA methylation level on the age of individuals as linear. In our view, this approach is justified when the study includes a relatively large number of individuals and contrasting age samples are analyzed.
We found that the proportion of explained variance R² in multiple linear regression with stepwise selection (inclusion at a probability of F < 0.05, exclusion at a probability of F > 0.10) varied in the range 0.676-0.911. MAD values were in the range of 1.92-3.26 years (Table 4). Each model for predicting biological age included a different number of CpG-dinucleotides, from 5 for GSE77696 to 10 for GSE55763.
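For reference, the two error measures can be computed as in the sketch below. It assumes that MAD denotes the mean absolute deviation between predicted and chronological age (the text does not expand the abbreviation) and that RMSE is the root mean squared error. The sketch does not implement the stepwise selection itself, and the ages are hypothetical.

```python
import math


def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted ages (years)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ages (years)."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical chronological and predicted ages (illustration only).
actual_age = [30, 45, 52, 61]
predicted_age = [33, 43, 52, 66]

print(f"MAD  = {mad(actual_age, predicted_age):.2f} years")
print(f"RMSE = {rmse(actual_age, predicted_age):.2f} years")
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAD on the same data; a large gap between the two points to a few badly predicted samples.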
It is known that the R², MAD and RMSE indices reflect the overall accuracy of a model and allow models to be compared with each other, but they poorly characterize the accuracy of predicting the dependent variable (biological age) for a particular sample. Fig. 2 shows the number of individuals, expressed as a percentage (%) within each GEO project, whose biological age was predicted by the regression model (Table 4) within a given error: "≤2 years", ">2 and ≤4 years", ">4 and ≤6 years", ">6 and ≤8 years", ">8 and ≤10 years" and ">10 years".
Thus, the percentage of predicted biological age values with an error of ≤4 years ranged from 58.6% (for the GEO project GSE55763) to 80.3% (for the GEO project GSE125105), and with an error of ≤6 years from 76.8% to 96.1%. Averaged over the eight GEO projects, fewer than 5.0% of cases had an error in predicting biological age of more than 8 years (Fig. 2).
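The per-sample error distribution described for Fig. 2 can be tabulated as follows. This is a minimal sketch that assumes the error is the absolute difference between predicted and chronological age; the bin edges follow the text, while the function name and the ages are hypothetical.

```python
def error_bin_percentages(actual, predicted, edges=(2, 4, 6, 8, 10)):
    """Percentage of samples per absolute-error bin.

    Bins are "<=2", ">2 and <=4", ..., ">10" for the default edges (years).
    """
    labels = [f"<={edges[0]}"]
    labels += [f">{lo} and <={hi}" for lo, hi in zip(edges, edges[1:])]
    labels.append(f">{edges[-1]}")

    counts = [0] * len(labels)
    for a, p in zip(actual, predicted):
        error = abs(a - p)
        # Index of the bin = number of edges the error strictly exceeds.
        counts[sum(error > e for e in edges)] += 1

    n = len(actual)
    return {label: 100.0 * c / n for label, c in zip(labels, counts)}


# Hypothetical ages (illustration only).
actual_age = [30, 45, 52, 61, 70]
predicted_age = [31, 42, 52, 68, 59]

shares = error_bin_percentages(actual_age, predicted_age)
for label, share in shares.items():
    print(f"{label:>12} years: {share:5.1f}%")
```

Summing the first two bins gives the share of predictions within 4 years, and the first three the share within 6 years, which are the cumulative figures quoted in the text.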
As can be seen from Fig. 3, in the three age groups "≤40 years", ">40 and ≤60 years" and ">60 years" the percentage of predicted biological age values with an error within ±6 years was 81.9 ± 12.2%, 90.6 ± 5.6% and 83.9 ± 10.7%, respectively. The highest percentage of correct predictions (±6 years) falls in the age range ">40 and ≤60 years". In the ">60 years" sample, the error in predicting biological age gradually increases. This may be due to the variance in the methylation level of the analyzed CpG sites increasing with age, which reflects the wide range of responses of the human body in normal and pathological gerontological processes.
Calculation of the coefficients of determination (R²), MAD and RMSE for regression models predicting biological age, depending on the anamnesis. An important question is how pathological processes in the body affect the methylation level of the analyzed CpG-dinucleotides when determining the biological age of an individual. To develop a method for determining the age of an unknown individual that can be used in forensic practice, it is necessary to use CpG-dinucleotides whose methylation level does not differ critically between healthy and sick individuals. The key characteristic of a CpG dinucleotide for assessing its predictive potential in determining biological age is the value of the determination coefficient (R²), and differences in this value between groups of sick and healthy individuals must be identified.
In this regard, we analyzed open-source data on the DNA methylation level of the 16 CpG-dinucleotides in the following pathological conditions: rheumatoid arthritis (GSE42861, n = 306, age range 22.0-69.0 years); HIV (GSE77696, n = 229, 25.0-70.0 years); multiple sclerosis (GSE106648, n = 130, 18.0-66.0 years); depressive disorders (GSE125105, n = 420, 17.0-87.0 years); oncological diseases (GSE51032: breast cancer, n = 191; colorectal cancer, n = 68; other primary tumors, n = 101; 35.0-72.0 years), as well as in individuals with nicotine addiction (quit smoking after a prolonged period: GSE50660, n = 221; continuing to smoke: GSE50660, n = 19; 44.0-65.0 years).
Fig. 4 presents the regression models for predicting biological age and their characteristics for the indicated pathological conditions.
In decreasing order, the calculated MAD values for the studied pathological conditions were: HIV, 3.9 years; depressive disorders, 3.3 years; rheumatoid arthritis, 2.7 years; oncological diseases, 2.5 years; multiple sclerosis, 1.9 years. For individuals with nicotine addiction, the accuracy of predicting biological age was 3 years.
Only for patients with HIV (MAD of 3.9 years) did the difference in MAD between sick and healthy individuals exceed one year. For the other pathological conditions, the difference in MAD values between healthy individuals and patients was less than one year.
Thus, pathological conditions do not have a critical impact on determining the biological age of a person by the methylation level of the studied CpG-dinucleotides.
Calculation of the probability of attributing an unknown sample to a specific age group based on DNA methylation data. Often, for forensic practice, when determining the estimated age of an unknown individual, the question is not about a specific age, but about the assignment of a given subject to a certain age group: "under 20" or "over 20", "under 30" or "over 30" etc. In this case, the accuracy of assigning an unknown individual to a specific group based on the results of DNA methylation analysis will be higher than when answering the question about the true value of the biological age. At the same time, to clarify the predicted age, a two-stage scheme can be used: 1) assigning an unknown sample to a certain age group (with a level of accuracy acceptable for specific tasks of forensic science); 2) predicting the value of biological age in years (with a level of accuracy within the predictive model) already within the age group.
Depending on how the array of samples is divided into age categories, the accuracy of assigning a particular sample varies widely (Fig. 5). With a probability of 99.21 ± 0.86% it can be concluded that the age of an unknown individual, established using 5-10 CpG dinucleotides, is more than 30 years; with a probability of 97.61 ± 1.74%, more than 40 years; with a probability of 91.56 ± 5.19%, more than 50 years, etc. The average classification accuracy over the boundary age points from "30" to "60" was 87.05 ± 3.82%.
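One plausible way to obtain such assignment accuracies is sketched below. It assumes, as a reading not confirmed by the text, that a sample counts as correctly assigned when its predicted age falls on the same side of the boundary age as its chronological age. The function name and the ages are hypothetical.

```python
def threshold_accuracy(actual, predicted, threshold):
    """Percentage of samples whose predicted age lies on the same side
    of `threshold` (years) as the chronological age."""
    hits = sum(
        (a > threshold) == (p > threshold) for a, p in zip(actual, predicted)
    )
    return 100.0 * hits / len(actual)


# Hypothetical ages (illustration only).
actual_age = [25, 28, 35, 42, 58, 63]
predicted_age = [27, 32, 33, 45, 55, 61]

for boundary in (30, 40, 50, 60):
    share = threshold_accuracy(actual_age, predicted_age, boundary)
    print(f"boundary {boundary}: {share:.1f}% assigned correctly")
```

Misassignments come only from samples whose true age lies close to the boundary, which is why a group assignment can be more reliable than a point estimate of age from the same model.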
Thus, the conducted bioinformatics and statistical analysis of GEO projects allows us to draw a number of conclusions. First, of the 16 analyzed CpG-dinucleotides, cg16867657, cg14361627, and cg19283806 have the highest predictive potential. Secondly, for all eight regression models within the GEO projects, comparable accuracy in predicting biological age was shown based on the values of MAD (1.92-3.26) and RMSE (1.94-3.29). At the same time, all 3 CpG-dinucleotides with the highest predictive potential are involved in the models for seven of the eight GEO projects. Thirdly, concomitant factors (sex, ethnogeographic affiliation, the presence of pathological conditions) do not significantly affect the accuracy of predicting biological age when using the analyzed CpG-dinucleotides.
However, it should be noted that the results obtained have a number of limitations on interpretation and extrapolation. It is known that results obtained with the IHM 450K BeadChip technology (Illumina, USA) may not coincide with results obtained with the SNaPshot technology (Applied Biosystems, USA); thus, CpG-dinucleotides identified by bioinformatic analysis as highly informative (R > 0.5) may not prove informative when specific groups are studied with the SNaPshot microsequencing technology. In this regard, for individuals from the Republic of Belarus we determined the methylation levels of the 7 CpG-dinucleotides whose predictive potential according to the results of the analysis (Table 4) was maximal: cg07553761, cg14361627, cg16054275, cg16867657, cg19283806, cg24079702 and cg25410668.
In general, our data on the DNA methylation level of the 7 CpG-dinucleotides in the Belarus sample are comparable to those for the largest GEO project, GSE55763, despite statistically significant differences (Fig. 6). By the value of the correlation coefficient R with biological age, the CpG-dinucleotides were ranked as follows (in decreasing order of the absolute value of R): cg19283806 (R = -0.739, p = 5.57E-42), cg16867657 (0.687, 2.37E-34), cg07553761 (0.654, 3.87E-30), cg14361627 (0.642, 8.25E-29), cg25410668 (0.559, 8.34E-21), cg16054275 (-0.378, 2.02E-09) and cg24079702 (0.170, 8.95E-03).
By analogy with the previous analysis, statistical preprocessing of the data was carried out and a regression model was calculated; it is presented graphically in Fig. 7. The largest contribution to the variance of the variable "Biological age" is made by the CpG dinucleotide cg19283806 (CCDC102B gene): no less than 62.9%. The remaining CpG-dinucleotides follow in order of decreasing influence on the variable "Biological age" in the regression model: cg14361627 (KLF14 gene), +13.3%; cg16867657 (ELOVL2 gene), +6.1%; cg07553761 (TRIM59 gene), +1.0%; cg25410668 (RPA2 gene), +0.7%; cg24079702 (FHL2 gene), +0.7%; cg16054275 (F5 gene), +0.5%.
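The running total of explained variance implied by these contributions can be checked with a few lines of arithmetic. The sketch assumes that the listed percentages are successive increments in explained variance as each CpG-dinucleotide enters the model, and it takes the "no less than 62.9%" of the first site at face value, so the total is a lower bound. The function name is hypothetical.

```python
# Incremental contributions (percentage points of explained variance)
# as listed in the text, in order of decreasing influence.
increments = [
    ("cg19283806", 62.9),
    ("cg14361627", 13.3),
    ("cg16867657", 6.1),
    ("cg07553761", 1.0),
    ("cg25410668", 0.7),
    ("cg24079702", 0.7),
    ("cg16054275", 0.5),
]


def cumulative_explained(steps):
    """Return (CpG id, running total of explained variance in %) per step."""
    total = 0.0
    rows = []
    for cpg, gain in steps:
        total += gain
        rows.append((cpg, round(total, 1)))
    return rows


running_totals = cumulative_explained(increments)
for cpg, total in running_totals:
    print(f"{cpg}: {total:.1f}%")
```

Under these assumptions the seven sites together account for about 85.2% of the variance, with the first three contributing 82.3%.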