Specifics of Determination of Human Biological Age by Blood Samples Using Epigenetic Markers

Our research focused on the selection of already known markers, as well as the search for other informative markers based on data made publicly available on the GEO NCBI platform (genome-wide DNA methylation projects using the Infinium Human Methylation 450K BeadChip (Illumina ©)). The main objective of the study was to demonstrate that the accuracy of determining the biological age of a person in the presence of chronic diseases using linear-dependent methylation markers is comparable to the accuracy of determining the biological age of a healthy person. Criminologists, as a rule, do not have information about the chronic diseases of a person who has left a biological trace (blood, for example) at the scene. However, as we have shown for a number of diseases, the lack of this information does not play a critical role in the precise determination of biological age. Additionally, an obstacle was removed when transferring the information content of markers from Infinium Human Methylation 450K BeadChip chips to SNaPshot technology. The analysis was carried out on a sample of 236 Belarusians, for whom the methylation profile for 7 CpG markers is presented. It is shown that the information content of the markers is preserved. Our analysis shows the possibility of creating a universal test system for predicting biological age from marker methylation. Such a system could be used by forensic specialists worldwide who face the same task. Based on in silico bioinformatics analysis of data from GEO projects, we independently determined a high prognostic potential for CpG-dinucleotides including cg05213896 (IL4I1 gene), cg08128734 (RASSF5 gene), cg08468401, and cg19283806. The MAD values for the studied pathological conditions were arranged in decreasing order in the following sequence: HIV, 3.9 years; depressive disorders, 3.3 years; rheumatoid arthritis, 2.7 years; oncological diseases, 2.5 years; multiple sclerosis, 1.9 years. For individuals with nicotine addiction, the accuracy of predicting biological age was 3 years.


Introduction
Determination of biological age based on samples of biological fluids and tissue fragments plays an important role in forensic practice. It helps to limit the range of searches when identifying remains and to narrow the circle of suspects, saving time, which is often a limiting factor in the investigation process. The most sensitive, reproducible and economically justified approach to determining the biological age of a person is based on identifying the level of DNA methylation in specific CpG-dinucleotides [Smeers 2018]. Biological age, which reflects the degree of morphological and physiological development of the organism, in the context of DNA methylation follows a trend that is not linear but is as close to linear as possible. This is due to the hyper- or hypofunctional expression of genes during the intensive growth of the body in the pre-pubertal and early pubertal periods, the presence of chronic diseases (bronchial asthma, multiple sclerosis, epilepsy, diabetes mellitus, cancer, etc.), normal gerontological processes, or the presence of alcohol or nicotine dependence, among others [Horvath 2013; Rana 2018; Freire-Aradas 2018]. Deviations of the methylation profile from the linear trend for biological age associated with the growth and aging of the organism are most pronounced before 25 and after 60 years. The discrepancies between chronological and biological age, which make it possible to assess the intensity of aging and the functional capabilities of an individual, are ambiguous in different phases of the development of the human body. In addition, the methylation level of specific CpG-dinucleotides may differ depending on the ethnogeographic origin of the individuals [Fleckhaus 2017].
Modern methods for studying DNA methylation at the genome level rely on one of two technological platforms for high-throughput analysis of nucleotide sequences: DNA hybridization on microarrays, or parallel clonal DNA sequencing (massively parallel sequencing, MPS, or next-generation sequencing, NGS). Illumina © hybridization microarrays remain the most popular platform for genomic DNA methylation analysis. Their relatively low cost compared to whole genome sequencing has positioned microarrays as a convenient tool for studying differentially methylated regions based on analysis of the methylation status of known CpG-sites in the human genome. For the Infinium HumanMethylation450 BeadChip (IHM 450K BeadChip), the largest array of primary data has been accumulated (in the form of the methylation level, expressed in % or fractions of a unit) for various types of biological samples (blood, individual blood cell fractions, buccal epithelium, sperm, etc.), and for different ethnic groups or patients with a history of chronic diseases. The data are located in the Gene Expression Omnibus (GEO) Database repository (https://www.ncbi.nlm.nih.gov/geo/). Statistical analysis of raw data sets of the full genome DNA methylation profile will not only assess the accuracy of determining the biological age according to existing predictive models for independent samples differing in age, sex, or geography of residence of the studied groups within the framework of GEO projects, but will also make it possible to identify previously characterized CpG-dinucleotides with high predictive potential.
The purpose of this work is to assess the influence of ethnoregional, sex, and other factors on the determination of biological age from blood samples using methylation data of CpG-dinucleotides. It is based on the analysis of the primary data of the whole genome DNA methylation profile from GEO DataSets NCBI. A further aim is to check the revealed patterns in the contribution of highly informative CpG-dinucleotides to the accuracy of determining the biological age of individuals from the Republic of Belarus.

Materials And Methods
DNA samples in silico. Information on the DNA methylation level for blood samples is available on the NCBI GEO datasets platform for 8 projects: GSE40279, GSE42861, GSE51032, GSE50660, GSE55763, GSE77696, GSE106648, GSE125105. The main criterion for selecting projects was the availability of information on the DNA methylation profile for at least 250 people. After a two-stage mathematical preparation of the primary data, the number of healthy individuals of various ethnogeographic origin was 4251, and the number of individuals with a history of acute or chronic diseases was 1685. The biological age range was from 17 to 93 years. The number of blood samples from men was 3169 and from women 2766; for 3 samples there was no information on sex.
DNA samples of individuals from the Republic of Belarus. Blood samples from 236 individuals aged 18 to 93 years were obtained after the signing of an informed consent form approved by the Bioethics Committee of the Institute of Genetics and Cytology of the National Academy of Sciences of Belarus. BD Vacutainer K2E tubes were used to collect venous blood. DNA was isolated using the MagMAX™ DNA Multi-Sample Kit (ThermoFisher, USA) according to the manufacturer's recommendations. The quality and quantity of DNA were analyzed using a NanoPhotometer® N50 spectrophotometer (IMPLEN, USA).
We independently determined a high prognostic potential in silico based on bioinformatics analysis of data from GEO projects for 6 CpG-dinucleotides: cg05213896 (IL4I1 gene), cg08128734 (RASSF5 gene), cg08468401, cg19283806 (CCDC102B gene), cg22454769 (FHL2 gene), cg24079702 (FHL2 gene). Information on the methylation level of CpG-dinucleotides and the characteristics of individuals included in the in silico analysis is presented in "Supplementary materials.docx / Sheet 1".
Determination of methylation level using SNaPshot. Analysis of the methylation level for CpG-dinucleotides was performed using SNaPshot technology (Applied Biosystems™, USA). Primers and SBE-oligonucleotides (single-base extension, SBE) for CpG-dinucleotides are presented in Table 1. Primers for amplification of bisulfite-converted genomic DNA were developed using the BiSearch program (http://bisearch.enzim.hu/). Then 10 μl of each SBE product was purified using 1 μl FastAP Thermosensitive Alkaline Phosphatase (ThermoFisher, USA). SBE products were analyzed using an ABI PRISM 3500 genetic analyzer and GeneMapper® 5.0 software (Applied Biosystems, USA). The percentage methylation value (0-100%) for each CpG-dinucleotide was calculated by dividing the fluorescence intensity value for C/G nucleotides (detection of unconverted methylated DNA) by the sum of the fluorescence intensity values for C/G nucleotides and T/A nucleotides (detection of converted unmethylated DNA).
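The ratio described above can be illustrated with a minimal Python sketch. The function name and the peak heights are hypothetical and serve only as an illustration; the actual values in this study were read from GeneMapper output.

```python
def methylation_percent(methylated_signal: float, unmethylated_signal: float) -> float:
    """Return the methylation level (0-100%) from two SBE peak intensities.

    methylated_signal: fluorescence for C/G (unconverted, methylated DNA).
    unmethylated_signal: fluorescence for T/A (converted, unmethylated DNA).
    """
    if methylated_signal < 0 or unmethylated_signal < 0:
        raise ValueError("peak intensities must be non-negative")
    total = methylated_signal + unmethylated_signal
    if total == 0:
        raise ValueError("no signal detected for either nucleotide")
    return 100.0 * methylated_signal / total


# Hypothetical peak heights (relative fluorescence units), for illustration only.
print(methylation_percent(3000.0, 1000.0))  # 75.0
```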
Statistical data analysis.
The first stage in preparing GEO project data for mathematical analysis is the exclusion of values outside the range calculated by the formula $[X_{25} - 1.5\,(X_{75} - X_{25}),\; X_{75} + 1.5\,(X_{75} - X_{25})]$, where $X_{25}$ and $X_{75}$ are the 25th and 75th percentiles. This range is calculated separately for each GEO project.
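A minimal Python sketch of this first-stage filter is given below. It assumes the "inclusive" quartile definition of the standard library, which may differ slightly from the percentile method used in SPSS; the function names and the sample values are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) = (X25 - k*(X75 - X25), X75 + k*(X75 - X25))."""
    # Assumption: "inclusive" quartiles; other percentile definitions shift the bounds slightly.
    x25, _median, x75 = statistics.quantiles(values, n=4, method="inclusive")
    spread = x75 - x25
    return x25 - k * spread, x75 + k * spread


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the range from iqr_bounds."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical methylation levels (fractions of a unit) for one CpG within one project.
levels = [0.41, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.95]
print(drop_outliers(levels))  # the extreme value 0.95 is excluded
```

In line with the text, the filter would be applied to each GEO project separately rather than to the pooled data.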
The second stage is normalization of the data remaining after the first stage using a nonlinear transformation within [-1, 1]. Thus, the two-stage data preparation made it possible to minimize the contribution of extreme values.
We used the same data preparation scheme for statistical analysis to establish the DNA methylation level values from 16 CpG-dinucleotides of blood samples from Belarusian individuals.
Using the SPSS v.20.0 program (IBM, USA), we calculated rank correlation coefficients (R) via the bootstrap function for 1000 samples (with bias correction and acceleration), together with 95% confidence intervals. We also calculated adjusted coefficients of determination (R^2), equal to the proportion of the variance of the dependent variable "biological age" explained by the independent variables (the methylation levels of CpG-dinucleotides), as well as the mean absolute deviation (MAD) and root mean square error (RMSE) for the regression models.
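The three model-quality indicators can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of the SPSS output: MAD is taken here as the mean absolute difference between true and predicted age, RMSE divides by the number of samples, and the function name and ages are hypothetical.

```python
import math


def regression_metrics(actual, predicted, n_predictors):
    """Return adjusted R^2, MAD and RMSE for a fitted regression model."""
    n = len(actual)
    if n != len(predicted) or n <= n_predictors + 1:
        raise ValueError("need matching lists with more samples than predictors + 1")
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalizes R^2 for the number of predictors in the model.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    mad = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(ss_res / n)
    return {"adjusted_r2": adjusted_r2, "mad": mad, "rmse": rmse}


# Hypothetical true and predicted ages (years) for a one-predictor model.
true_age = [20, 30, 40, 50, 60]
predicted_age = [22, 29, 41, 47, 60]
print(regression_metrics(true_age, predicted_age, n_predictors=1))
```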

Results And Discussion
As a rule, projects to assess the genome-wide methylation profile using the IHM 450K BeadChip target a cohort of people who represent a specific cross-section of the population of a particular region or ethnicity. Researchers aim to find relations between the DNA methylation profile and a disease as applied to a specific country or geographic region. We carried out comparative studies and characterized the correlation coefficients for the 16 CpG-dinucleotides listed above, depending on the ethnoregional and sex identity of individuals, as well as on a history of chronic diseases (rheumatoid arthritis, HIV, multiple sclerosis, depressive disorders, oncological diseases) or bad habits (nicotine addiction).
Correlation coefficients of DNA methylation with biological age depending on the ethnogeographic factor. Correlation coefficients (R) for 16 CpG-dinucleotides, calculated within 8 GEO projects for countries of the European (UK, Italy, Sweden, Germany) and North American (USA) regions, are presented in Table 2. Only data for healthy individuals were used, taking into account ethnogeographic status and without regard to sex. The number of individuals for the European region was 3579 (Great Britain, 2614; Italy, 362; Sweden, 430; Germany, 173), and for the North American region, 672.
Correlation coefficients of DNA methylation with biological age depending on sex. The R coefficients for 16 CpG-dinucleotides were calculated within 6 GEO projects and are presented in Table 3. Differences between R-values depending on sex ranged from 0.002 to 0.071. The smallest fluctuation in R-values is shown for CpG-dinucleotides cg22454769 (difference of 0.002), cg16054275 (0.004), cg18473521 (0.011), cg14361627 (0.014), and cg19283806 (0.016).
Calculation of determination coefficients (R^2), MAD and RMSE for regression models of predicting biological age. Based on the data on the methylation level of 16 CpG-dinucleotides, we calculated adjusted determination coefficients for multiple linear regression for GEO projects (according to Table 1). According to the data presented in Fig. 1, the narrower the range of the indicator "Chronological age, number of years" in a study (for example, for projects GSE51032 or GSE50660), the lower the adjusted R^2 was. As is known, a regression model is able to adequately (with the calculated level of accuracy) predict the dependent variable (biological age) only within the range of values analyzed during modeling; therefore, expanding the range of the dependent variable is able to stabilize the model. Thus, the adjusted R^2 values were in the range 0.675-0.911, and for the GEO projects with the widest age range (GSE125105, GSE40279 and GSE55763) the percentage of explained variation for the dependent variable was at least 82.6%.
According to 8 GEO projects, CpG-dinucleotides had a different effect on the change in the coefficient of determination R^2 (Table 4). The largest contribution to the percentage of explained variance of the dependent variable in the regression model equation belonged to the CpG-dinucleotides cg16867657 (mean value R^2 = 0.669), cg14361627 (0.056) and cg19283806 (0.044). The predictive potential of the CpG-dinucleotide cg19283806 proved comparable to the value for cg14361627. The high predictive potential of this CpG-dinucleotide was also shown in the study [Chao Pan et al. 2020].
When models for predicting biological age were created, we used an approach according to which the dependence of the level of DNA methylation on the age of individuals was considered linear. In our view, its use is justified under the condition of a relatively large number of individuals in the study when analyzing contrasting age samples.
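The linear assumption can be illustrated with a deliberately simplified Python sketch that fits age against the methylation level of a single CpG by ordinary least squares. The models in this study are multiple regressions built in SPSS; the single-predictor form, the function name and the data below are assumptions made for illustration only.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical methylation levels (fractions of a unit) and ages (years).
methylation = [0.30, 0.40, 0.50, 0.60, 0.70]
age = [20, 35, 50, 65, 80]

slope, intercept = fit_line(methylation, age)
# Predict the age for a new sample with a methylation level of 0.55.
print(round(slope * 0.55 + intercept, 1))  # 57.5
```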
We found that the percentage of the explained variance R^2 when modeling multiple linear regression using the stepwise selection function (inclusion at a probability of F < 0.05, exclusion at a probability of F > 0.10) varied in the range 0.676-0.911. MAD values were in the range of 1.92-3.26 years (Table 4). Each model for predicting biological age included a different number of CpG-dinucleotides, from 5 for GSE77696 to 10 for GSE55763.
It is known that the R^2, MAD, and RMSE indices reflect the overall accuracy of the model and make it possible to compare the models with each other, but they poorly characterize the predictive accuracy of the dependent variable (biological age) for a particular sample. Fig. 2 provides information on the number of individuals, expressed as a percentage (%) within each GEO project, for which the predicted values of biological age were calculated using the regression model (Table 4) within a given error: "≤2 years", ">2 and ≤4 years", ">4 and ≤6 years", ">6 and ≤8 years", ">8 and ≤10 years" and ">10 years".
Thus, the percentage of predicted biological age values with an error of ≤4 years ranged from 58.6% (for the GEO project GSE55763) to 80.3% (for the GEO project GSE125105), and with an error of ≤6 years, from 76.8% to 96.1%. The number of cases with an error in predicting the biological age of more than 8 years, on average for the eight GEO projects, was less than 5.0% (Fig. 2).
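The error categories used for Fig. 2 can be computed with a short Python sketch such as the one below. The function name and the ages are hypothetical; only the bin boundaries are taken from the text.

```python
def error_bin_percentages(actual, predicted):
    """Percentage of samples whose absolute prediction error falls in each bin."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    labels = ["<=2", ">2 and <=4", ">4 and <=6", ">6 and <=8", ">8 and <=10", ">10"]
    upper_limits = [2, 4, 6, 8, 10]
    counts = [0] * len(labels)
    for a, p in zip(actual, predicted):
        error = abs(a - p)
        for i, limit in enumerate(upper_limits):
            if error <= limit:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # error above 10 years
    return {label: 100.0 * c / len(actual) for label, c in zip(labels, counts)}


# Hypothetical true and predicted ages (years); absolute errors are 1, 3.5, 5 and 12.
print(error_bin_percentages([25, 40, 55, 70], [26, 43.5, 50, 82]))
```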
As can be seen from Fig. 3, in the three age groups "≤40 years old", ">40 and ≤60 years old" and ">60 years old", the percentage of predicted values of biological age with an error of ±6 years was 81.9 ± 12.2%, 90.6 ± 5.6% and 83.9 ± 10.7%, respectively. The highest percentage of correct calculations (±6 years) falls within the age range ">40 and ≤60 years". In the ">60 years old" sample, the error in predicting biological age gradually increases. This may be due to an increase with age in the variance of the methylation level of the analyzed CpG sites, which reflects the wide range of reactions of the human body in normal and pathological gerontological processes.
Calculation of the coefficients of determination (R^2), MAD and RMSE for regression models for predicting biological age, depending on the anamnesis. The influence of pathological processes in the body on changes in the methylation level of the analyzed CpG-dinucleotides is an important question in determining the biological age of an individual. To develop a method for determining the age of an unknown individual that can be used in forensic practice, it is necessary to use those CpG-dinucleotides whose methylation level does not differ critically between healthy and sick individuals. The key characteristic of a CpG-dinucleotide for assessing its predictive potential in determining biological age is the value of the determination coefficient R, and differences in this value between the groups of sick and healthy individuals must be identified. Fig. 4 provides information on regression models for predicting biological age and their characteristics for the indicated pathological conditions.
The calculated MAD values for the studied pathological conditions were arranged in decreasing order in the following sequence: HIV, 3.9 years; depressive disorders, 3.3 years; rheumatoid arthritis, 2.7 years; oncological diseases, 2.5 years; multiple sclerosis, 1.9 years. For individuals with nicotine addiction, the accuracy of predicting biological age was 3 years.
Only for patients with HIV, where the MAD value was 3.9 years, was the difference between sick and healthy individuals more than one year. For the other pathological conditions, the difference in MAD values between healthy individuals and patients was less than one year.
Thus, pathological conditions do not have a critical impact on determining the biological age of a person by the methylation level of the studied CpG-dinucleotides.
Calculation of the probability of attributing an unknown sample to a specific age group based on DNA methylation data. Often, in forensic practice, when determining the estimated age of an unknown individual, the question is not about a specific age but about the assignment of the subject to a certain age group: "under 20" or "over 20", "under 30" or "over 30", etc. In this case, the accuracy of assigning an unknown individual to a specific group based on the results of DNA methylation analysis will be higher than when answering the question about the true value of the biological age. At the same time, to clarify the predicted age, a two-stage scheme can be used: 1) assigning an unknown sample to a certain age group (with a level of accuracy acceptable for specific tasks of forensic science); 2) predicting the value of biological age in years (with a level of accuracy within the predictive model) within that age group. Depending on how the sample array is divided into age categories, the accuracy of assigning a particular sample varies over a wide range (Fig. 5). With a probability of 99.21 ± 0.86%, it can be concluded that the age of the unknown individual, established using 5-10 CpG-dinucleotides, is more than 30 years; with a probability of 97.61 ± 1.74%, more than 40 years; with a probability of 91.56 ± 5.19%, more than 50 years, etc. The average classification accuracy across the boundary age points from "30" to "60" was 87.05 ± 3.82%.
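The first stage of this scheme can be sketched in Python as a comparison of true and predicted ages against a boundary age. The exact definition of classification accuracy behind Fig. 5 is not given in the text, so the definition below (share of samples whose predicted age falls on the same side of the boundary as the true age) is an assumption, as are the function name and the ages.

```python
def boundary_accuracy(actual, predicted, boundary):
    """Percentage of samples placed on the correct side of a boundary age."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum((a > boundary) == (p > boundary) for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


# Hypothetical true and predicted ages (years).
true_age = [22, 28, 33, 45, 61]
predicted_age = [24, 31, 35, 43, 58]

for boundary in (30, 40, 60):
    print(boundary, boundary_accuracy(true_age, predicted_age, boundary))
```

Samples whose true age lies close to the boundary are the ones most likely to be misassigned, which is one reason the accuracy depends on where the boundary is drawn.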
Thus, the conducted bioinformatics and statistical analysis of GEO projects allows us to draw a number of conclusions. First, of the 16 analyzed CpG-dinucleotides, cg16867657, cg14361627, and cg19283806 have the highest predictive potential. Second, for all eight regression models within the GEO projects, comparable accuracy in predicting biological age was shown based on the values of MAD (1.92-3.26) and RMSE (1.94-3.29). At the same time, all 3 CpG-dinucleotides with the highest predictive potential are involved in the models for seven of the eight GEO projects. Third, concomitant factors (sex, ethnogeographic affiliation, the presence of pathological conditions) do not significantly affect the accuracy of predicting biological age when using the analyzed CpG-dinucleotides.
However, it should be noted that the results obtained have a number of limitations on interpretation and extrapolation. It is known that the results obtained using the IHM 450K BeadChip technology (Illumina, USA) may not coincide with the results obtained using the SNaPshot technology (Applied Biosystems, USA). As a consequence, CpG-dinucleotides identified as highly informative (R > 0.5) on the basis of bioinformatic analysis may not perform as well when specific groups are studied using the SNaPshot microsequencing technology. In this regard, for individuals from the Republic of Belarus, we determined the methylation levels of the 7 CpG-dinucleotides whose predictive potential according to the results of the analysis (Table 4) was maximum: cg07553761, cg14361627, cg16054275, cg16867657, cg19283806, cg24079702 and cg25410668.
In general, our data on the level of DNA methylation for 7 CpG-dinucleotides for the Belarus sample are comparable to those for the largest GEO project, GSE55763, despite the statistically significant differences (Fig. 6). According to the value of the correlation coefficients R with biological age, CpG-dinucleotides were arranged in the following sequence. By analogy with the previous analysis, statistical data preprocessing was carried out and the regression model was calculated, which is graphically presented in Fig. 7. The largest contribution to the variance of the variable "Biological age" is made by the CpG-dinucleotide cg19283806 (CCDC102B gene): no less than 62.9%. Next are the CpG-dinucleotides in order of decreasing influence on the variable "Biological age" in the regression model: cg14361627 (KLF14 gene), +13.3%; cg16867657 (ELOVL2 gene), +6.1%; cg07553761 (TRIM59 gene), +1.0%; cg25410668 (RPA2 gene), +0.7%; cg24079702 (FHL2 gene), +0.7%; cg16054275 (F5 gene), +0.5%.

Conclusion
Based on the data presented in the public domain on the GEO NCBI platform for 8 projects to determine the full genome DNA methylation profile using the Infinium Human Methylation 450K BeadChip (Illumina ©) (GSE40279, GSE42861, GSE51032, GSE50660, GSE55763, GSE77696, GSE106648, GSE125105), with a total of more than 4 thousand individuals without a history of chronic and acute diseases, we calculated the correlation coefficients (R) with biological age for 16 CpG-dinucleotides. We also calculated the adjusted coefficients of determination (R^2), MAD and RMSE to compare and characterize the multivariate linear regression equations.
Based on bioinformatics and statistical analysis, we have shown that for individuals without a history of chronic or acute diseases, regardless of ethnogeographic origin and sex, the CpG-dinucleotides cg14361627 (KLF14 gene), cg16867657 (ELOVL2 gene) and cg19283806 (CCDC102B gene) are able to explain, on average, 35.6 ± 10.4%, 65.0 ± 11.8%, and 33.0 ± 8.7% of the variance of the variable "Biological age", respectively. For individuals from the Republic of Belarus, the percentage of the explained variance of the variable "Biological age" was highest for the CpG-dinucleotide cg19283806 (CCDC102B gene), at 62.7%; cg14361627 (KLF14 gene) and cg16867657 (ELOVL2 gene) accounted for +13.3% and +6.1%, respectively. In total, these three CpG-dinucleotides can explain at least 80% of the variation in the biological age of a person.
The methodology for determining biological age by establishing a DNA methylation profile based on a limited number of CpG-dinucleotides (5-10), the prognostic potential of which has been confirmed in a number of studies and demonstrated by us on samples of Belarusian individuals, is universal. It can provide sufficiently accurate information about the estimated age of an individual or about belonging to a particular age group, regardless of the ethnogeographic status of an unknown person, sex, or the presence of a number of chronic diseases.

Declarations
Funding: The
Figure 1 Characteristics of the GEO-projects analyzed by age of the individuals included in the study (for projects from Table 2). The values of the adjusted coefficients of determination R^2 for biological age, calculated from the methylation data of all 16 CpG-dinucleotides
Figure 2 Accuracy of prediction of biological age for GEO-projects (the percentage of values with a given prediction accuracy is shown)
Figure 3 Accuracy of prediction of biological age (the average percentage of values with a given prediction accuracy within all GEO projects is shown)
Figure 4 Prediction of biological age depending on the presence of a pathological process in anamnesis (regression models were used for calculations for each GEO-project separately according to Table 4)
Figure 5 The accuracy of the classification of samples (based on the predicted values of biological age) depending on the belonging of the samples to a specific age group (based on the true values of the age of individuals)
Figure 6 Methylation level of 7 CpG-dinucleotides for Belarus samples (n = 236) and GSE55763 (n = 2458)