The relatively recent emergence of FDP and its debated implementation [304–308] complicate the harmonization of its methodology, as is clear from inspecting all the SNP panels included in this review. It can be concluded that the factors influencing prediction accuracy are the genetic heritability of the trait, the method of SNP selection and genotyping, the informativeness of the SNPs, the reference dataset, and the mathematical approach [7,9,12,103]. Thus, before FDP methods can be used in forensic investigations, they need to be standardized and forensically validated according to the Scientific Working Group on DNA Analysis Methods (SWGDAM) guidelines [309], so that they provide reliable and reproducible results. To do so, all the technical advantages and limitations of FDP must be considered. In addition, a consensus between researchers and field experts is needed to prepare protocols and directives that meet all ethical, social, and legal requirements (reviewed in [310]).
4.1 Terminology and reporting
The first and most important issue is the terminology employed to identify FDP research. Although the term ‘FDP’ was introduced as early as 2008 [10,47], not all articles on BGA or EVC inference identify it as such, some simply referring to it as an intelligence tool. For instance, only 78 articles included in this review explicitly identify FDP (two of them as molecular photofitting). Correct use of the term, as a keyword and in the text, would allow a more consistent literature search.
Similarly, the second issue is the definition, categorization, and measurement of traits. On the one hand, considering the nature of the traits (i.e., quantitative, like height and BMI; or qualitative, such as pigmentation traits and BGA), forcing the latter into categories may lead to oversimplification [24], irreproducible results, and incomparable studies [287]. This becomes especially challenging when analysing data from multiple sources. Moreover, these categories are often conflated with stereotypes or a sense of nationality [33,287]. Although categorization is preferred in forensics [39,205,214,311], since application in casework implies human interpretation (i.e., by investigators), some researchers recommend using a continuous, quantitative spectrum instead [9,188,195]. On the other hand, measurements tend to be quite subjective, with most studies based on EVC data self-reported via questionnaires or recorded through simple observation by a medical or non-medical observer. For example, even when pigmentation traits are recorded via digital photographs, they are later interpreted and sorted into categories by researchers. To avoid errors due to differing perceptions of a trait [24,197], several studies suggested applying specialized equipment and reflectance, bioimaging, and biochemical technologies [194,210] to find stronger genotype-phenotype associations [214]. In the case of BGA, familial ancestry information up to the third degree is usually reported, accompanied by a family pedigree.
The same issue arises when FDP results are reported. For instance, Atwood et al. [34] compared different service providers in terms of prediction accuracy, clarity of reporting and consistency of terminology, limitations, cost, and time. The authors concluded that guidelines are imperative for a shared methodology, clear reporting, and easy interpretation of the analysis by non-experts. Interestingly, results were presented in many ways: from a simple verbal “likely/not likely”, to highlighting (or not) the highest probability for each trait variation or ancestry, to a visual map representing where the individual falls among the represented population clusters.
4.2 Development of panels
Before developing a panel for a certain trait or combination of traits, researchers concentrate on finding the most informative set of markers for each trait. Usually, discovery is performed via GWAS and later confirmed by association studies [9,13,287]. This helps avoid false positives and uncovers genes with weaker effects that may otherwise have been ignored [10,14]. Even so, these studies are usually carried out with small sample sizes and are not extensively replicated, creating some scepticism about the validity of the reported associations [14]. Ideally, a worldwide population scan would be key to finding candidate genes [312], considering that sub-populations are normally under-represented in exploratory panels [266]. Other studies find SNPs by comparing allele frequencies from genetic population databases (e.g., HapMap, 1000 Genomes, CEPH Human Genome Diversity Panel, Complete Genomics) with specialized tools (e.g., SPSmart, FROG-kb [313,314]).
In the case of BGA inference, it is important to select variants with extreme allele frequency differences between populations [65,69,89,102,315] and to combine markers so that levels of differentiation are equivalent across populations [95]. In contrast, the genetic complexity of EVCs, due to pleiotropy (i.e., a single SNP influencing multiple traits) [11,14], epistasis (i.e., interacting SNPs influencing a single trait) [11,197,245], allelic heterogeneity [151,232,257], phenotypic variability, and gene-environment interactions, needs to be assessed before selecting candidate markers. However, these genetic mechanisms are still not fully understood [9], and it is possible that many other implicated and more informative genes are being overlooked [15].
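As a minimal illustration of this selection criterion, candidate AISNPs can be ranked by the absolute allele frequency difference (δ) between two reference populations. The SNP identifiers, frequencies, and threshold below are hypothetical and not taken from any of the cited panels.

```python
# Hypothetical reference allele frequencies per SNP: (population 1, population 2)
candidate_snps = {
    "rs0001": (0.95, 0.10),
    "rs0002": (0.50, 0.45),
    "rs0003": (0.80, 0.15),
}

def delta(snp):
    """Absolute allele frequency difference between the two populations."""
    p1, p2 = candidate_snps[snp]
    return abs(p1 - p2)

# Rank markers and keep only those above a differentiation threshold
ranked = sorted(candidate_snps, key=delta, reverse=True)
informative = [snp for snp in ranked if delta(snp) >= 0.5]
# informative -> ['rs0001', 'rs0003']; rs0002 barely differentiates the populations
```

In practice, multi-population measures of informativeness (rather than a pairwise δ) are used to balance differentiation across all population pairs, but the ranking logic is analogous.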
One of the first debates centres on the number of SNPs needed in a panel to obtain reliable predictions. On one hand, small SNP panels must contain the most informative and differentiating markers; they are ideal for currently available SNaPshot™ technologies and yield fewer partial profiles when typing low-template DNA samples [90,112,291]. On the other hand, increasing the number of SNPs improves accuracy, especially with missing data [80,136,287,315]. However, the number of SNPs will also depend on the purpose of the analysis and the genetic complexity of the trait. For instance, the four or five main continental populations can be distinguished with ease using fewer than 40 markers [39,291], and eye colour can be inferred with only six SNPs [151]. Conversely, even though the heritability of height and eye pigmentation is similar, the number of SNPs needed to infer stature is growing by the hundreds as its molecular mechanisms are discovered [262,263]. In this sense, several authors believe it is better to use markers with a strong influence [110,312], given the scarce amount of DNA in forensic samples, while others suggest finding genes with weak effects to complement the inference [254,258,316]. Also, in the case of BGA, researchers recommend a two-tier approach: first, a panel with a maximum of 100 markers to infer at least 12 global populations, and later other panels to refine sub-population inference [39,58,121,131]. That is the case of the SNPforID 34-plex [90] and its EurasiaPlex [105], Pacifiplex [125] and PIMA [122] sub-panels.
Even though some researchers have evaluated the capacity of EVC-associated variants to be used as AISNPs [100,104,110,165–167,315,317,318], making indirect inferences of EVC from BGA, or vice versa, is a highly debated practice. Indeed, some authors made assumptions about individuals’ appearance using only BGA data [34,129], or the reverse [16,36,67,148,319]. Nonetheless, most researchers discourage this practice, especially given increasing population admixture [9,33,43,312] and the fact that some shared alleles may be related not to ancestry but to environmental exposures common to different populations [33]. Despite this, it is still important to infer BGA, as well as biological age and sex, together with EVCs, especially if a trait is restricted to a population, sex, or age group [10,165,320], to avoid misleading interpretations.
Extensive lists of markers associated with EVCs are available [8,15,54] and have been combined in multiple ways, yet the overlap of unique markers between panels is minimal. Soundararajan et al. [39] reported the same for BGA panels and emphasised the need for collaboration among researchers to find the “best” markers and test them on a large dataset representative of all global populations. Therefore, validation and inter-laboratory testing of panels is important to meet the specific quality requirements of forensic DNA analysis. Only a few systems have been validated for forensic use [6,7,9,11,27]. Furthermore, panels are commonly developed using homogeneous, European reference data and then validated in other populations; and they are replicated and validated with different methodologies, complicating comparisons [39]. The best outcome would be to adapt the panel to each individual population [321] or to obtain complete allele frequency data for all existing populations and subpopulations [131].
Another factor influencing marker choice is whether SNPs lie in ‘coding’ or ‘non-coding’ regions, and whether they are informative of other health-related phenotypes. The first distinction follows the legal regulations applied to STR identification and, although scientists have argued that these categories do not reflect reality, it is still used as a reason to include or discard markers. However, FDP implies the use of ‘associative’ markers that can be found in both coding and non-coding regions. The reluctance to include coding markers stems from their higher potential to provide health information [9,148,197], although non-coding variants can provide similar information if they are in linkage disequilibrium with the implicated coding genes [10] or regulatory regions [322]. Moreover, many disease- or trait-related candidate genes are first discovered when researching pathological or extreme variation, and other mutations within them are later found to be associated with normal variation instead [15,188]. For example, OCA2 gene mutations are associated with both eye colour and oculocutaneous albinism [15,24,164]. Regarding these off-target phenotypes, Bradbury et al. [323] studied the possibility of revealing health information while predicting EVCs: only 27 out of 1766 FDP-related markers were associated with cancer risk, induced asthma, or risk of alcoholism. However, these associations do not mean that an individual suffers from these diseases, and a single marker cannot be used to predict or confirm such risks.
Finally, there is a continuous debate on using commercial versus non-commercial panels. While commercial suppliers’ strongest point is their constant supply of ready-to-use kits, they claim the kits’ technical information (e.g., markers, accuracy, statistical model, etc.) as intellectual property. Hence, researchers cannot truly validate the kits or ensure their reproducibility. Consequently, some companies have been discontinued, like DNAPrint [33], while others, such as Parabon Nanolabs, have been criticised by many FDP experts [53].
4.3 Genotyping technology
All available SNP typing methodologies have already been evaluated for forensic application [18–20,22,27]. These techniques are known to be very versatile, allowing the combination of different chemical reactions, assay formats, and detection methods [19,20]. Then again, not all techniques are suitable, as FDP faces problems similar to those of STR identification when analysing forensic samples (i.e., low-quantity and degraded DNA, and often mixtures). The selection of a methodology will be based on its accuracy, multiplexing and automation capacity, throughput, cost, and time, as well as the purpose of the analysis (e.g., the number of traits and markers to be included).
A great number of genetic techniques have been used to infer BGA or EVCs [13,40,196,242,324]: PCR assays (e.g., PCR-RFLP [171], PCR-REBA [82], and, most commonly, the TaqMan® SNP genotyping assay), microarrays (e.g., GeneChip™ [102,287]), minisequencing (e.g., SNPlex™), MALDI-TOF (matrix-assisted laser desorption/ionization – time-of-flight) with mass spectrometry (MS) detection (e.g., Sequenom® MassARRAY®) [107,116,120], and high-resolution melting (HRM) [65,196,324]. While some techniques, like Sequenom® MassARRAY® or HRM, do not reach the sensitivity required for forensic samples [107,115,325], others have been developed but discontinued, such as Genomelab™ SNPstream® [66,156,159,214] and Genplex®. Nonetheless, the gold standard is still SNaPshot™ (an SBE-CE assay) due to its robustness, simplicity, and efficiency, but more precisely because the instrument is already present in forensic laboratories and great efforts have been invested in its standardization [2,13,40,196,325].
Despite this, SNaPshot™-CE has a higher risk of contamination and error and, more importantly, is limited to analysing a single trait, inferred with 30 to 40 markers, at a time; hence, it cannot keep up with the increasing number of markers needed for FDP [98,99,238,287]. For this reason, researchers are shifting to NGS techniques, in particular Ion Torrent™ (Thermofisher Scientific) and Illumina® [41,98,326]. These allow higher throughput, multiplexing capacity, and sequencing accuracy [15], as well as the possibility of automating and sequencing different markers in the same run (e.g., STRs, SNPs, InDels, microhaplotypes) [23,143]. However, this implies longer preparation, sequencing, and analysis times [271]. As a result, the current focus is on testing SNaPshot™ panels on NGS instruments [98,99,237,238,243,320], applying single-cell sequencing and NGS to analyse mixtures and touch DNA samples [136,142,241,297], and automating analysis and result interpretation to reduce analysis time [23]. The latter would allow better sample handling, increased sample sizes, and reduced costs and time.
All these techniques have their advantages and limitations, making it harder to choose one for standardization. Moreover, the methodology will be chosen depending on the requirements and purpose of the investigation [8,14,19,98], and any new one, such as MPS, needs to be extensively validated on larger datasets and optimized before being incorporated [35,112,271]. Other factors restraining technological advancement in the field are the costs of renovating workspaces, training staff, and increasing bioinformatic support and storage capacity [13,44,45,291].
4.4 Prediction models and algorithms
Prediction models are created to support and understand the relationship between genotype and phenotype [14,15]. There are two types of algorithms that can be used to predict BGA or EVC outcomes: statistical and machine-learning (ML). Statistical algorithms, such as MLR, work better when the predictors are dependent on each other, while ML algorithms usually assume independence among predictors [15] and detect, in linear or more complex ways, the dependencies between variables and attributes [217]. Both methods may provide similar accuracy when the same SNP panel is used [15], although ML methods require higher computational costs and more expertise. Indeed, several articles have compared and introduced different classifiers for FDP analysis [12,48,67,68,70,117,205,206,217,227,252,299,327].
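As an illustration of the statistical approach, the sketch below mimics the structure of a multinomial logistic regression (MLR) phenotype predictor: per-SNP effect sizes are summed for each category and converted into probabilities via a softmax over the linear predictors. All intercepts, effect sizes, SNP identifiers, and the genotype are invented for illustration; they are not the published parameters of any FDP model.

```python
import math

# Hypothetical per-category intercepts and per-minor-allele effect sizes;
# 'brown' is the baseline category, whose linear predictor is fixed at 0.
intercepts = {"blue": 1.0, "intermediate": 0.2}
betas = {
    "rs0001": {"blue": -2.0, "intermediate": -0.8},
    "rs0002": {"blue": 0.9, "intermediate": 0.3},
}

def predict(genotype):
    """genotype maps SNP id -> minor-allele count (0, 1, or 2)."""
    linear = {cat: intercepts[cat] + sum(betas[snp][cat] * n
                                         for snp, n in genotype.items())
              for cat in intercepts}
    exps = {cat: math.exp(v) for cat, v in linear.items()}
    exps["brown"] = 1.0  # baseline category
    total = sum(exps.values())
    return {cat: v / total for cat, v in exps.items()}

probs = predict({"rs0001": 0, "rs0002": 2})
# With these illustrative parameters, 'blue' receives the highest probability
```

Published MLR-based systems follow this same scheme, with intercepts and betas estimated from large genotype-phenotype reference datasets.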
Two of the most used programs, STRUCTURE and Snipper, are based on the NB algorithm. This algorithm calculates how likely a profile is to belong to a class by comparing it with the allele frequencies observed in each cluster, and can thereby classify unknown profiles [68,90]. It is also capable of handling missing data [68]. The gold standard for BGA inference is the STRUCTURE software (and its updated version, ADMIXTURE), because of its “efficient clustering based on similarities or dissimilarities with the other samples” [48,49,95] and thus its good inference of admixture proportions, but only if the populations are well differentiated in the reference data [90,117]. Its main disadvantages are the assumption of HWE, which is compatible with neither BGA nor EVC inference [68], and its long, computationally intensive run times when classifying single profiles against large datasets, since the parental data and the unknown profile must be analysed simultaneously and missing data must be imputed. Snipper solves some of the issues STRUCTURE presents, providing a faster analysis [89,90], allowing the incorporation of one’s own reference dataset [105], and classifying single profiles in real time [105]. The latter has been used for both BGA and EVC inference.
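The NB principle behind a Snipper-style classification can be sketched as follows: per-SNP genotype likelihoods, derived from reference allele frequencies under HWE, are multiplied (summed in log space) for each candidate population and normalized into posterior probabilities, assuming equal priors. All population labels, frequencies, and the test profile are illustrative, not real reference data.

```python
import math

# Hypothetical reference frequencies of allele 'A' per SNP, per population
ref_freqs = {
    "POP1": {"rs0001": 0.90, "rs0002": 0.20},
    "POP2": {"rs0001": 0.10, "rs0002": 0.80},
}

def genotype_likelihood(p, n_a):
    """HWE likelihood of carrying n_a copies of allele 'A' (p = its frequency)."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}[n_a]

def classify(profile):
    """Posterior probabilities assuming equal priors and independent SNPs."""
    log_lik = {pop: sum(math.log(genotype_likelihood(freqs[snp], n_a))
                        for snp, n_a in profile.items())
               for pop, freqs in ref_freqs.items()}
    total = sum(math.exp(v) for v in log_lik.values())
    return {pop: math.exp(v) / total for pop, v in log_lik.items()}

# A profile homozygous for 'A' at rs0001 and lacking 'A' at rs0002
# is strongly POP1-like under these illustrative frequencies
post = classify({"rs0001": 2, "rs0002": 0})
```

Note that this toy version assumes HWE within each population; the naive-independence assumption across SNPs is what makes the calculation fast enough for real-time classification of single profiles.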
Other alternative methods have also been tested. For example, GDA provides continuous clustering by evaluating the informative proportions of each component; it does not assume HWE, and its output can be used as input for hierarchical clustering, such as neighbour-joining trees. Although it is highly sensitive to noise [48], it has proven better for admixture classification [67,117]. On another note, visual representations of individual and population structure, like principal component analysis (PCA), discriminant analysis of principal components (DAPC) [290], or multidimensional scaling (MDS), are helpful for interpreting the outcome [68]. However, since they reduce the data to the two or three most important components, they may lead to misclassification [48]. In addition, logistic regression (bi- or multinomial LR) is well suited to categorical outcomes, even though it tends to misclassify partial profiles [68]; it has traditionally been applied to infer pigmentation colours [151,219]. Also, multifactor dimensionality reduction (MDR) is used in small-sample studies to better detect epistatic effects [233,245,328]. Other available and tested ML methods are linear discriminant analysis (LDA), support vector machines (SVM) [110,217,316], partial least squares regression (PLSR) [156], extreme gradient boosting (XGB) [217,246], classification and regression trees (CRT) [204,217,218,254], multivariate adaptive regression splines (MARS) [217], bootstrapped response-based imputation modelling (BRIM), ordinal and stepwise regressions (OR and SR) [209,246], neural networks (NN), and random forest (RF) [67,117,155,183,217,246,252,316]. NNs have been proposed as an alternative to LR, as they recognise the patterns of complex data typical of EVC inference [156,254].
Hence, not all algorithms are appropriate, and they will need to be selected based on several aspects. First, the amount and type of data [217], as well as the impact of missing/partial profiles on classification performance [68,89]. Second, the reference populations, which affect not only the selection of SNPs but also the training of the classifiers; these must be representative of all variations and ancestries, especially when estimating admixed individuals [35,67,68,117,329]. Third, given the inability to incorporate environmental factors into the prediction, only sex and age can be included as covariates. In the same way, the accuracy of the model will increase when considering both BGA and EVC if there is population dependency [188,189]. Some researchers hold that “when all the causing factors of a trait will be accounted for in the model, then the accuracy will be the same in all populations” [330].
Lastly, there are many options for interpreting the results obtained from the prediction model. It is key that field and legal experts easily understand and apply the findings. Logically, one may recommend continuing to use likelihood ratios (LR), since they are already used in STR identification [133,188,195,272]. Nonetheless, as Caliebe et al. observed [321], FDP does not apply the same principle of comparing two hypotheses (i.e., the sample belonging to a random individual vs. the suspect), and the highest value may not represent the correct category [35]. Hence, it may be more appropriate to use statistical probability, represented as posterior odds (PO), although statistics are often harder for a lay audience to understand. Other ways to represent accuracy have been adopted: the area under the curve (AUC) for categorical predictions, which varies from 0.5 (random phenotype) to 1 (exact phenotype) [7,11,12,15,17]; and correlation (R or R2) or mean squared error (MSE) for quantitative measurements [15].
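For the AUC in particular, its rank-based definition (the probability that a randomly chosen individual showing the phenotype receives a higher predicted probability than a randomly chosen individual who does not) can be computed directly from a set of predictions. The scores and labels below are illustrative only.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative score count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative predicted probabilities of a phenotype vs. observed state (1/0)
scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]
print(auc(scores, labels))  # 5 of 6 pairs ranked correctly -> 0.8333...
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, matching the range quoted above for categorical FDP predictions.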