Background
Linking independent sources of data on the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process for linking the databases of two independent, ongoing national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on identifying genetic factors that modify cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors.
Methods
To identify as many as possible of the individuals who participate in both studies but may not be registered under a common number, we combined Probabilistic Record Linkage (PRL) and supervised Machine Learning (ML). This combined linkage was named “PRL+ML”. We built the ML model using a first version of the two databases as a training dataset, in which matching status was assigned by PRL followed by manual review.
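As an illustration of the PRL step, the sketch below scores a candidate record pair with Fellegi-Sunter agreement weights and sorts it into match, non-match, or manual review. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not specify the PRL model, and the field names, the m/u probabilities, and the two thresholds are hypothetical values chosen for the example.

```python
import math

# Hypothetical per-field probabilities (NOT taken from the study):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "postcode": (0.80, 0.05),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum the log2 agreement or disagreement weight of each compared field.

    Simplification: a missing value is treated as a disagreement.
    """
    total = 0.0
    for field, (m, u) in params.items():
        value_a = rec_a.get(field)
        if value_a is not None and value_a == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Three-zone decision; the thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non_match"
    return "review"  # uncertain pairs are sent to manual review


if __name__ == "__main__":
    a = {"surname": "martin", "birth_date": "1970-01-01", "postcode": "75005"}
    b = {"surname": "martin", "birth_date": "1970-01-10", "postcode": "75005"}
    print(classify(match_weight(a, b)))
```

Pairs labelled this way (after manual review of the uncertain zone) are the kind of training data on which a supervised classifier such as a Random Forest could then be fitted.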
Results
The Random Forest (RF) algorithm showed the highest sensitivity (0.985) among six widely used ML algorithms, including RF, Bagged trees, AdaBoost, Support Vector Machine, and Neural Network.
Therefore, RF was selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL+ML showed a higher sensitivity (range 0.988-0.992) than either PRL (range 0.916-0.991) or ML (0.981) alone. It identified 2,068 individuals participating in both GEMO (6,375 participants) and GENEPSO (4,925 participants).
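For readers less familiar with the metric, sensitivity here is the proportion of true matches that a linkage method recovers, TP / (TP + FN). The sketch below computes it on toy data and shows why pooling two methods can raise it. The pair sets are invented for the example, and combining the two outputs by set union is an assumption made for illustration: the abstract does not state how the PRL and ML results are merged.

```python
def sensitivity(predicted_pairs, true_pairs):
    """Proportion of true matches recovered: TP / (TP + FN)."""
    if not true_pairs:
        raise ValueError("true_pairs must not be empty")
    true_positives = len(predicted_pairs & true_pairs)
    return true_positives / len(true_pairs)


# Toy data (invented): 10 true matches, identified as (id_in_A, id_in_B).
true_pairs = {(i, i) for i in range(10)}

# Hypothetical outputs of each method on the same candidate pairs.
prl_pairs = {(i, i) for i in range(8)}                # recovers pairs 0..7
ml_pairs = {(i, i) for i in range(3, 9)} | {(0, 9)}   # pairs 3..8 plus one false positive

# Assumed combination rule: accept a pair if either method accepts it.
combined_pairs = prl_pairs | ml_pairs

if __name__ == "__main__":
    for name, pairs in [("PRL", prl_pairs), ("ML", ml_pairs), ("PRL+ML", combined_pairs)]:
        print(name, sensitivity(pairs, true_pairs))
```

Under a union rule the combined sensitivity can never fall below that of either method alone, but false positives from both methods are pooled as well, so specificity or precision would need to be checked separately.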
Conclusions
Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.