This section describes the methodology used to integrate two major nationwide databases, the Live Birth Information System (SINASC) and the Unified Registry for Social Programmes (CadUnico), covering the period from 2001 to 2015.
Datasets
- SINASC (Sistema de Informação Sobre Nascidos Vivos/ Live Birth Information System)
The Brazilian Ministry of Health defines a live birth as the complete expulsion or extraction from the body of the pregnant woman of a product of conception, irrespective of the duration of pregnancy, which, after separation, breathes or shows any other sign of life, such as heartbeat, umbilical cord pulsation, or definite movement of voluntary muscles, whether or not the cord has been cut and whether or not the placenta is attached. SINASC records live births in Brazil and is fed by the live birth registration form, a compulsory document completed by the health professional who assisted the delivery. The form is divided into eight blocks: (I) characteristics of the newborn; (II) identification of the place of birth; (III) characteristics of the mother; (IV) identification of the father; (V) characteristics of pregnancy and delivery; (VI) characteristics of congenital anomalies, filled in using the ICD-10 code when congenital anomalies are identified at birth; (VII) identification of the professional completing the notification; and (VIII) identification of the registry office 12.
- The baseline of the 100 Million Brazilian Cohort
The Cadastro Único (CadUnico, also abbreviated CADU) has become the main instrument used by the Brazilian government to assess whether potential beneficiaries meet the inclusion criteria of social programs. To be enrolled in CADU, one person in the family must provide an interviewer with the information and required documents for all family members. This person must be at least 16 years old and is preferably a woman. The information is renewed periodically for as long as the person remains a candidate for one of the Brazilian government's social protection programs 13. The Centre for Data and Knowledge Integration for Health (CIDACS) has custody of several snapshots of CADU. Each snapshot file is a yearly backup, and together they cover 2001 to 2015. Building the 100 Million Brazilian Cohort involved three main steps. First, we harmonized attributes whose schema or meaning diverged across three different versions of CADU. Second, we cleansed the data to standardize the categories. Third, we identified the first appearance of each record across the separate CADU backup files. This single register for social programs identifies and characterizes low-income families applying for any social protection program, and it also improves understanding of the social reality of this population group. It contains information on social, environmental, and economic characteristics of named individuals grouped into families.
The process of linking
Data pre-processing
During the data pre-processing phase, we first searched automatically for invalid names (e.g., "unknown" or "newborn") by comparing each recorded name with a standardized list of possible Brazilian names. All names flagged as invalid were submitted to clerical review to confirm that they could not be used in the linkage process, after which the attribute was excluded for those records. We then removed punctuation and collapsed consecutive spaces. Middle initials, prefixes, and suffixes were kept as recorded to retain the discriminatory power of the name variable.
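The cleaning rules described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names, the placeholder set (limited to the two examples given in the text), the conversion to upper case, and the choice to replace punctuation with a space rather than delete it are all assumptions made for illustration.

```python
import re
import string

# Hypothetical placeholder set built only from the two examples in the text;
# the study compared names against a standardized list of Brazilian names,
# which is not reproduced here.
INVALID_PLACEHOLDERS = {"UNKNOWN", "NEWBORN"}

# Map every ASCII punctuation character to a space (assumption: replacing,
# rather than deleting, keeps hyphenated name parts separate).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_name(raw):
    """Remove punctuation and collapse consecutive spaces.

    Middle initials, prefixes, and suffixes are left untouched.
    Upper-casing is an added assumption so that comparisons ignore case.
    """
    without_punct = raw.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", without_punct).strip().upper()


def needs_clerical_review(raw):
    """Flag a name that matches a known placeholder after cleaning."""
    return clean_name(raw) in INVALID_PLACEHOLDERS


print(clean_name("  Maria  da  S. Silva, "))
print(needs_clerical_review("newborn."))
```

In this sketch a flagged name is only marked for clerical review; the decision to exclude it remains manual, as in the procedure described above.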
Blocking/Indexing
The complexity of the record linkage task is quadratic: finding the best candidate in database B for each record in database A requires |A| × |B| comparisons. To link massive datasets in a reasonable time, we need methods that avoid unnecessary comparisons while preserving accuracy, because the total number of pairwise comparisons between SINASC and CadUnico would be prohibitively high (44,485,267 × 114,007,705 ≈ 5.07 × 10^15). To meet this challenge, we used CIDACS-RL 14, a record linkage tool developed at CIDACS to link large administrative datasets.
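The scale of the problem can be checked with a few lines of Python. The record counts are those reported above; the figure of 1000 candidates per record is the number returned by the search step of this linkage, and the variable names are ours.

```python
# Record counts reported in the text.
n_sinasc = 44_485_267
n_baseline = 114_007_705

# Exhaustive linkage compares every SINASC record with every baseline record.
full_comparisons = n_sinasc * n_baseline

# With indexing, each SINASC record is compared with at most 1000 candidates.
candidates_per_record = 1000
indexed_comparisons = n_sinasc * candidates_per_record

print(f"exhaustive: {full_comparisons:.5e}")
print(f"indexed:    {indexed_comparisons:.5e}")
print(f"reduction:  {full_comparisons / indexed_comparisons:,.0f}x")
```

The reduction factor is simply the baseline size divided by the number of candidates retained per record, so it grows with the size of the indexed dataset.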
CIDACS-RL uses the indexing and searching algorithms implemented in Apache Lucene as its blocking strategy to reduce the number of comparisons during the linkage. The index allows CIDACS-RL to search the indexed baseline of the 100 Million Brazilian Cohort for the records most similar to each SINASC record and to submit only those to the pairwise comparison step, rather than restricting the comparison group as an ordinary blocking step would. The search was performed in two ways: (i) using the mother's name, municipality, and mother's date of birth for records from 2011 to 2015; and (ii) using the mother's name and municipality for records from 2001 to 2010, because the mother's date of birth was not registered before 2011. The search strategy uses a mixture of exact, semi-fuzzy, and fuzzy queries to return the 1000 best candidates from the indexed baseline of the 100 Million Brazilian Cohort. Exact queries return only records in which every queried attribute is equal, while the semi-fuzzy and fuzzy approaches are more flexible and retrieve candidates in which one attribute (semi-fuzzy) or more than one attribute (fuzzy) differs. Where the name variable carries some uncertainty, the Damerau-Levenshtein distance is used as the string comparator, and values above 0.5 are considered 14.
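The distinction between the three query types can be made concrete with a small Python sketch. It only labels a candidate by the number of queried attributes that differ from the SINASC record. It does not reproduce how Lucene or CIDACS-RL execute queries, and the function name, field names, and example values are hypothetical.

```python
def query_tier(record, candidate, attributes):
    """Label a candidate by how many queried attributes differ from the record."""
    differing = sum(1 for a in attributes if record.get(a) != candidate.get(a))
    if differing == 0:
        return "exact"
    if differing == 1:
        return "semi-fuzzy"
    return "fuzzy"


# Hypothetical field names and values, for illustration only.
attributes = ["mother_name", "municipality", "mother_dob"]
sinasc_record = {
    "mother_name": "MARIA DA SILVA",
    "municipality": "SALVADOR",
    "mother_dob": "19800615",
}
candidate = {
    "mother_name": "MARIA DA SILVA",
    "municipality": "SALVADOR",
    "mother_dob": "19800618",
}

print(query_tier(sinasc_record, candidate, attributes))
```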
Pairwise Comparison
The most discriminating variables available in the live birth database to identify a child are the mother's name, municipality, and age. For records from 2011 to 2015, the mother's date of birth becomes available, and its completeness increases gradually over the years. For 2001 to 2010, when the mother's date of birth is not available, we ran the search using only two attributes (mother's name and municipality). We then created a new variable, the mother's age at delivery, calculated as the difference between the child's date of birth recorded in SINASC and the mother's date of birth recorded in the baseline of the 100 Million Brazilian Cohort. This value was compared with the mother's age registered in SINASC, and only candidates with exactly the same value were kept and submitted to the pairwise comparison step. The same procedure was applied to records from 2011 to 2015 with a missing mother's date of birth.
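The age-consistency filter described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's code: we assume that age is counted in completed years on the child's date of birth, and the function names and the `mother_dob` field are hypothetical.

```python
from datetime import date


def age_in_completed_years(born, on):
    """Age in completed years on a given date (assumed definition of maternal age)."""
    years = on.year - born.year
    # Subtract one year if the birthday has not yet occurred in the year of `on`.
    if (on.month, on.day) < (born.month, born.day):
        years -= 1
    return years


def filter_by_maternal_age(candidates, child_dob, recorded_age):
    """Keep baseline candidates whose derived maternal age equals the SINASC value."""
    return [
        c for c in candidates
        if age_in_completed_years(c["mother_dob"], child_dob) == recorded_age
    ]


# Hypothetical candidates returned by the name and municipality search.
candidates = [
    {"id": "a", "mother_dob": date(1980, 6, 15)},
    {"id": "b", "mother_dob": date(1975, 1, 2)},
]
kept = filter_by_maternal_age(candidates, child_dob=date(2005, 6, 14), recorded_age=24)
print([c["id"] for c in kept])
```

Because the match must be exact, the result is sensitive to how age is defined; a different convention (for example, age at registration) would change which candidates are kept.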
Figure 1 describes the two approaches, one for each set of available variables. CIDACS-RL then sets weights according to the discriminatory power of the attributes (name of the mother: 1; maternal age or date of birth: 1; state of birth: 0.008; municipality of birth: 0.16). At this stage, the combined scoring and query modules are used to perform the record linkage.
The names recorded in SINASC were compared with those of the 1000 best candidates from the baseline of the 100 Million Brazilian Cohort using the Jaro-Winkler string comparator 15. This comparator counts the number of characters common to the two strings and the number of transpositions among these common characters, producing a similarity value between 0 and 1 (identical strings). To compare the date attributes, we applied the Hamming distance 14, which is the number of substitutions required to change one string into another of the same length. A linkage score is then generated, and the function returns all matched pairs along with their scores.
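A minimal Python sketch of the comparators and of one possible way to combine them is given below. The Jaro-Winkler and Hamming functions follow their standard definitions. The text reports the attribute weights but not the combination rule, so the weighted average used here, the exact (0 or 1) comparison of state and municipality, the conversion of the Hamming distance into a similarity, and all field names are assumptions.

```python
def jaro(s1, s2):
    """Standard Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a common prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def hamming_similarity(d1, d2):
    """1 minus the share of differing positions (equal-length strings only)."""
    if len(d1) != len(d2):
        raise ValueError("Hamming distance requires strings of equal length")
    distance = sum(a != b for a, b in zip(d1, d2))
    return 1 - distance / len(d1)


# Weights reported in the text; the keys are hypothetical field names.
WEIGHTS = {"mother_name": 1.0, "mother_dob": 1.0, "state": 0.008, "municipality": 0.16}


def linkage_score(rec, cand):
    """Weighted average of attribute similarities (assumed combination rule)."""
    sims = {
        "mother_name": jaro_winkler(rec["mother_name"], cand["mother_name"]),
        "mother_dob": hamming_similarity(rec["mother_dob"], cand["mother_dob"]),
        "state": float(rec["state"] == cand["state"]),
        "municipality": float(rec["municipality"] == cand["municipality"]),
    }
    return sum(WEIGHTS[k] * sims[k] for k in WEIGHTS) / sum(WEIGHTS.values())


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(hamming_similarity("19800615", "19800618"))
```

Dividing by the sum of the weights keeps the score between 0 and 1, which makes it comparable with the score categories used to select the threshold.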
Selection of the threshold
Candidate linking records were ordered by score, and only the comparison pair with the highest score was retained as a potential link. All remaining candidate records were discarded. A sample of 2000 pairs, stratified into three categories of linkage score (high: above 0.95; intermediate: between 0.90 and 0.95; low: below 0.90), was then evaluated manually, and the record pairs were classified as likely true pairs or likely false pairs. Based on this manually reviewed sample of 2000 pairs, a receiver operating characteristic (ROC) curve was built and the area under the curve (AUC) was calculated; the cut-off point was chosen to balance sensitivity and specificity. Records were therefore classified as links or non-links based on a single threshold. These analyses were performed in R.
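The threshold selection can be illustrated with a short Python sketch (the study itself used R). The text states that the cut-off balances sensitivity and specificity but does not name the criterion; maximizing Youden's J (sensitivity + specificity - 1) is our assumption. The scores and labels below are made-up toy values, not study data.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each observed score.

    A pair is classified as a link when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def best_threshold(scores, labels):
    """Pick the threshold that maximizes Youden's J (assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)


def auc(scores, labels):
    """AUC as the probability that a true pair outscores a false pair (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy manually reviewed sample: 1 = likely true pair, 0 = likely false pair.
scores = [0.99, 0.97, 0.96, 0.93, 0.92, 0.91, 0.88, 0.85]
labels = [1, 1, 1, 1, 0, 1, 0, 0]

threshold, sensitivity, specificity = best_threshold(scores, labels)
print(threshold, sensitivity, specificity)
print(round(auc(scores, labels), 4))
```

Other criteria, such as fixing a minimum specificity, would lead to a different cut-off on the same curve.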
Evaluation of the linkage error
Since we expected all births registered in the baseline of the 100 Million Brazilian Cohort to be present in the SINASC database, we were able to identify the number of missed matches (records from the same mother-baby pair that failed to link) and to estimate the sensitivity of the linkage (the proportion of true matches that were linked). We then examined which characteristics were associated with missed matches: race, sex, place of residence, sewage treatment, water supply, and garbage collection.