Type of study
This is a retrospective cohort study including all patients who initiated treatment for TB in Brazil from January 1, 2008 to December 31, 2011, with follow-up until December 31, 2013.
The study area comprises the entire Brazilian territory, which covers 8.5 million km² and represents 47% of the area of South America. Brazil had an estimated population of 214,190,002 in 2018, making it the fifth most populous country in the world. The country is divided into five macro-regions (North, Northeast, Central-West, Southeast, and South).
Data source and study population
Two national data sources were used: the Notifiable Diseases Information System (SINAN, acronym in Portuguese) and the Mortality Information System (SIM). The SINAN database contained cases reported between January 1, 2008, and December 31, 2011, and was extracted on 09/20/2013. The SIM database contained deaths reported between January 1, 2008, and December 31, 2013, and was extracted on 04/01/2016. Both databases contained nominal (personally identifying) information.
The study population consisted of all new cases of tuberculosis reported in SINAN that began treatment between January 1, 2008 (date of first entry) and December 31, 2011 (date of last entry). Through record linkage with the SIM database, the cases were followed up until death or until December 31, 2013, when administrative censoring was applied (end of the follow-up period).
These two information systems are fed continuously, have good coverage throughout the country, and are decentralized to all municipalities [15, 16]. In the particular case of TB, notification follows established case definition guidelines.
SIM is the oldest health information system in the country, created in 1975 by the Ministry of Health to address civil registry failures. It is a system with high population coverage that aims to record data on mortality in Brazil comprehensively and reliably. Currently, SIM’s coverage is estimated at more than 95% in Brazil. The data available in SIM are essential for understanding the mortality profile of a population. They are used to calculate health indicators, to perform trend analyses, and to establish investment priorities in the health sector.
Death certificates are the fundamental source for SIM. Adequate completion of the death certificate, which must be performed by a physician, is an essential condition for the quality of SIM data. When well completed, the death certificate provides adequate knowledge of the causes of death of an individual. The term underlying cause, as defined by WHO in successive revisions of the International Statistical Classification of Diseases and Related Health Problems (ICD), refers to the "cause of death" that initiated the sequence of morbid events that led the individual to die. In addition to the underlying cause, the death certificate also records the associated causes, which include the terminal and intermediate causes resulting from the underlying cause, as well as the causes that contributed to death without a direct relation to the pathological process responsible for it.
The variable that identifies the color or race of individuals, according to the categories adopted by the Brazilian Institute of Geography and Statistics (IBGE, acronym in Portuguese), was introduced in the information systems managed by the Ministry of Health in 2000. In practice, the color/race variable in the SINAN notification form is recorded from the patient’s self-declaration, based on the color of their skin, according to the five terms used by the IBGE: branca/white, preta/black, parda/brown, amarela/yellow, and indígena/indigenous [17, 18].
Record linkage procedures and study groups definition
In accordance with the Brazilian legislation regulating access to secondary data, we obtained authorization from the Ministry of Health’s dedicated department (Coordenação Geral de Informação e Análise Epidemiológica, CGIEA) to use nominal identification data and could therefore perform record linkage between the SINAN and SIM databases for the period above.
The linkage was performed in three steps. The first was conducted within the SINAN database using a deterministic algorithm for semi-automatic record linkage, similar to those validated by Pacheco et al. and Oliveira et al., adapted to the STATA statistical software. The first task was to pre-process the data so that all variables had the same format. For names, all letters were converted to upper case, and doubled letters, accents, and special characters were removed. Suffixes such as Junior and Filho were also removed, as were terms indicating that the name of the patient or of the patient's mother was not known (ignored, unknown).
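The name pre-processing described above can be illustrated with a minimal Python sketch. It is not the STATA routine used in the study: the function name `normalize_name`, the suffix list, and the Portuguese placeholder terms are assumptions made for illustration only.

```python
import re
import unicodedata

# Hypothetical lists: the text names only "Junior" and "Filho" as suffixes and
# "ignored"/"unknown" as placeholder terms; the other entries are assumed.
SUFFIXES = {"JUNIOR", "JR", "FILHO"}
PLACEHOLDERS = {"IGNORADO", "IGNORADA", "DESCONHECIDO", "DESCONHECIDA",
                "IGNORED", "UNKNOWN"}


def normalize_name(raw):
    """Return a standardized name, or "" when the name is a placeholder."""
    # Upper-case, then strip accents by dropping combining marks after NFKD.
    text = unicodedata.normalize("NFKD", raw.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace every character that is not a letter or a space with a space.
    text = re.sub(r"[^A-Z ]", " ", text)
    # Collapse runs of the same letter (for example "NN" becomes "N").
    text = re.sub(r"([A-Z])\1+", r"\1", text)
    # Drop suffixes, then blank out names that only signal missing information.
    tokens = [tok for tok in text.split() if tok not in SUFFIXES]
    if any(tok in PLACEHOLDERS for tok in tokens):
        return ""
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_name("José Vianna Júnior"))
```

Applying the same function to the patient's name and the mother's name in both databases gives comparison keys in a single format, which is the property the deterministic comparison relies on.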
The second task was the removal of duplicate records in SINAN: (1) exact duplicates, which are records belonging to the same individual that relate to the same episode of illness and were reported by the same health unit; and (2) transfers, which are records belonging to the same patient and related to the same episode but reported by different health facilities. Transfers occur because a patient may pass through several health units during follow-up, in search of clinical or laboratory diagnosis and of standard or specialized treatment. In addition, hospitalization may be required at some point during follow-up. These transfers between health units can be official or spontaneous.
The second step was the linkage between the SINAN and SIM databases. We used a probabilistic record linkage procedure based on a method commonly used for the encrypted encoding of data, known as the Bloom filter, implemented in the free software R 3.1.2 with the package “PPRL”. The following fields were used for this linkage: patient's name, mother's name, date of birth, and code of the municipality of residence. Each pair suggested in the linkage step received a score, and pairs scoring from 8,600 to 10,000 were retained. Pairs near the lower limit of 8,600 were less likely to be correct, and those close to 10,000 were more likely to belong to the same individual. Some of the pairs identified with the Bloom filters did not belong to the same person; these were concentrated in the lower score range, between 8,600 and 9,200.
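To make the scoring idea concrete, the following Python sketch encodes strings as Bloom filters built from character bigrams and compares them with a Dice coefficient. It does not reproduce the “PPRL” package. The filter length, the number of hash functions, the choice of hash algorithms, and the rescaling of the similarity to a 0 to 10,000 range are all assumptions, chosen only so that the 8,600 cut-off mentioned above has something to apply to.

```python
import hashlib

FILTER_BITS = 1000   # assumed filter length
NUM_HASHES = 20      # assumed number of hash functions per bigram
MIN_SCORE = 8600     # lower score limit reported in the text


def bigrams(text):
    """Return the set of character bigrams of a padded string."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text):
    """Return the set of bit positions switched on by the bigrams of `text`."""
    bits = set()
    for gram in bigrams(text):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha256(data).hexdigest(), 16)
        h2 = int(hashlib.blake2b(data).hexdigest(), 16)
        # Double hashing: derive NUM_HASHES positions from two base hashes.
        for i in range(NUM_HASHES):
            bits.add((h1 + i * h2) % FILTER_BITS)
    return bits


def dice_score(bits_a, bits_b):
    """Dice coefficient of two filters, rescaled to the range 0 to 10,000."""
    total = len(bits_a) + len(bits_b)
    if total == 0:
        return 0
    return round(10000 * 2 * len(bits_a & bits_b) / total)


def is_candidate(score):
    """Keep a pair for review only if it reaches the lower score limit."""
    return score >= MIN_SCORE


if __name__ == "__main__":
    a = bloom_encode("MARIA SILVA")
    for other in ("MARIA SILVA", "MARIA SILVO", "JOAO PEREIRA"):
        score = dice_score(a, bloom_encode(other))
        print(other, score, is_candidate(score))
```

Because similar strings share most of their bigrams, they switch on mostly the same bits and score close to 10,000, while unrelated strings overlap only through chance hash collisions. Pairs that score just above the lower limit are therefore the ones most in need of a further check.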
Finally, in the third step, another deterministic linkage procedure, similar to that used in the first step, was applied only to the groups of records found in the probabilistic linkage between the SIM and SINAN databases. Its purpose was to remove false positives from these groups (that is, records not belonging to the same individual), thus increasing the specificity of the pairs found.
After the record linkage process, three analysis groups were created for the causes of death according to ICD-10 codes: i) death due to TB, deaths whose underlying cause had ICD-10 codes A15 to A19; ii) associated TB deaths, deaths in which TB (ICD-10 codes A15-A19) was not the underlying cause but was mentioned in any line of Part I of the death certificate; iii) deaths with no mention of TB, deaths in which TB (ICD-10 codes A15-A19) was not mentioned in any part of the death certificate.
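The grouping rule can be written as a short Python sketch. It assumes that an associated TB death is one in which an A15-A19 code appears on the certificate without being the underlying cause; the function names and the representation of the certificate as an underlying code plus a list of other codes are hypothetical.

```python
import re

# ICD-10 codes A15 to A19, with or without a decimal subdivision.
TB_CODE = re.compile(r"^A1[5-9]")


def is_tb(code):
    """True when an ICD-10 code falls in the A15-A19 block."""
    return bool(TB_CODE.match(code.strip().upper()))


def classify_death(underlying, other_codes):
    """Assign a linked death to one of the three analysis groups."""
    if is_tb(underlying):
        return "death due to TB"
    # Assumption: `other_codes` holds the remaining codes on the certificate.
    if any(is_tb(code) for code in other_codes):
        return "associated TB death"
    return "death with no mention of TB"


if __name__ == "__main__":
    print(classify_death("A162", ["J189"]))
    print(classify_death("B207", ["A169", "J960"]))
    print(classify_death("I219", ["E149"]))
```

The order of the checks matters: the underlying cause is tested first, so a certificate that lists TB both as the underlying cause and on another line is counted only once, as a death due to TB.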
Inclusion and exclusion criteria
To guarantee the quality of information on TB treatment episodes, we applied an automatic surveillance routine adopted by Bierrenbach et al., which seeks to eliminate duplicates and correct classification errors among different treatment episodes of the same patient. As shown in Figure 1, we excluded true duplicates (records of the same patient reported by the same health unit with the same date of treatment initiation), keeping only the oldest record or, if both had the same notification date, the most complete one. Cases classified as transfers in the type-of-entry variable and cases with missing information for this variable were corrected. When the first entry was classified as “do not know”, it was corrected to new case. We excluded records of cases closed as “change in diagnosis” (i.e., not TB). Because the analysis was restricted to new cases in their first treatment entry, cases classified as return after default, relapse, and transfer were also excluded. Finally, we excluded records with inconsistent treatment start and outcome dates (i.e., treatment start date after the date of outcome), as well as records with missing dates (Figure 1).
[INSERT FIGURE 1 HERE]
Variables of Interest
Based on a literature review of factors associated with death among tuberculosis cases, the covariates considered in this article were: sex (female/male); schooling (illiterate, less than 8 years of schooling, more than 8 years of schooling, and ignored); age group (0 to 19 years, 20 to 39 years, 40 to 59 years, and 60 years or more); color or race (white, black, brown, yellow, indigenous, and ignored); macro-region (North, Northeast, Southeast, South, and Central-West); clinical form (pulmonary, extrapulmonary, and mixed); number of treatments (1; 2 to 3; 4 or more); anti-HIV serology (positive, negative, in progress, and not performed); alcoholism (yes and no); and diabetes (yes and no).
The four study groups (death due to TB, associated TB death, death with no mention of TB, and no death reported until December 31, 2013) were compared in a descriptive analysis of the variables of interest.
Survival analysis was used to identify factors associated with death due to TB (TB as the underlying cause) in the presence of competing events, defined here as the other two groups of deaths (TB-associated deaths and deaths with no mention of TB). The Fine & Gray subdistribution model, based on the cumulative incidence function (CIF), was used as a reference, considering the probability of an event occurring before a specific time. This is a proportional hazards model for the subdistribution of a competing risk, in which the covariates directly affect the CIF. Accordingly, individuals who experience a competing event are kept in the risk set. That is, individuals in our study who died from causes other than TB as the underlying cause remain in the risk set, but with a decreasing weight that accounts for the reduction in the number of observations.
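For reference, the quantities involved can be written in the standard notation of the competing-risks literature. The symbols below are not taken from the original text: $T$ denotes the time to the first event, $D$ the type of event ($k = 1$ for death due to TB), and $X$ the covariate vector.

```latex
% Cumulative incidence function (CIF) for event type k
F_k(t) = P(T \le t,\; D = k)

% Subdistribution hazard: subjects who already had a competing event
% (T < t and D \ne k) remain in the risk set
\lambda_k(t \mid X) = \lim_{\Delta t \to 0}
  \frac{P\bigl(t \le T < t + \Delta t,\; D = k \,\bigm|\,
        \{T \ge t\} \cup \{T < t,\; D \ne k\},\; X\bigr)}{\Delta t}

% Fine & Gray proportional subdistribution hazards model
\lambda_k(t \mid X) = \lambda_{k,0}(t)\,\exp\bigl(\beta^{\top} X\bigr)

% Link between the model and the CIF
F_k(t \mid X) = 1 - \exp\!\Bigl(-\int_0^t \lambda_k(s \mid X)\,ds\Bigr)

% Subdistribution hazard ratio for covariate j
\mathrm{sHR}_j = \exp(\beta_j)
```

Under this formulation, a covariate with $\mathrm{sHR}_j > 1$ is associated with a higher cumulative incidence of the event of interest, which is why the covariates are said to act directly on the CIF.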
Survival time was measured in days, from the date of treatment start (entry) to the date of the event of interest (death with TB as the underlying cause, TB-associated death, or death with no mention of TB) or of censoring (end of follow-up on December 31, 2013). Following the deterministic linkage of the data, the fatal outcomes were divided into three analysis groups for competing events according to the ICD-10 codes listed in Part I of the death certificates and made available in the SIM.
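A minimal Python sketch of how follow-up time and event status could be derived for each linked case is shown here. The numeric event codes and the function name `survival_record` are hypothetical; only the censoring date and the group labels come from the text.

```python
from datetime import date

END_OF_FOLLOW_UP = date(2013, 12, 31)  # administrative censoring date

# Hypothetical numeric coding of the outcome for a competing-risks analysis.
EVENT_CODES = {
    "censored": 0,
    "death due to TB": 1,
    "associated TB death": 2,
    "death with no mention of TB": 3,
}


def survival_record(treatment_start, death_date=None, death_group=None):
    """Return (time in days, event code) for one case."""
    if death_date is not None and death_date <= END_OF_FOLLOW_UP:
        # A death found in the linked data ends follow-up on the date of death.
        return (death_date - treatment_start).days, EVENT_CODES[death_group]
    # No linked death by the end of follow-up: the case is censored.
    return (END_OF_FOLLOW_UP - treatment_start).days, EVENT_CODES["censored"]


if __name__ == "__main__":
    print(survival_record(date(2010, 1, 1), date(2010, 1, 31), "death due to TB"))
    print(survival_record(date(2011, 12, 31)))
```

Coding censoring as 0 and each cause of death as a distinct positive integer is a common convention for competing-risks software, but the specific codes used in the study are not stated.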
The cumulative incidence function was used to describe the probability of TB mortality in the presence of competing events, and the Gray test was used to compare differences between groups. The Fine-Gray subdistribution model was used to identify factors associated with mortality among TB cases. First, a simple (single-covariate) subdistribution hazard model was fitted for each variable selected in this study, and variables with a p-value > 0.20 in the Wald test were removed. Multiple models were then fitted with all variables retained from the simple models. The final model contained the variables that remained significant (p-value ≤ 0.05) in the multiple models. The measure of association was the subdistribution hazard ratio (sHR) with its 95% confidence interval. The proportionality assumption of the Fine-Gray model was checked using the CIF and Schoenfeld residual tests.
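As an illustration of the descriptive part of this analysis, the sketch below computes a nonparametric estimate of the cumulative incidence function in plain Python, using the usual product-limit construction for competing risks. It is not the R code used in the study and does not implement the Gray test or the Fine-Gray regression; the function name and the event coding (0 for censored, positive integers for causes) are assumptions.

```python
def cumulative_incidence(times, events, cause):
    """Nonparametric CIF estimate for one cause under competing risks.

    times  -- follow-up time of each subject
    events -- 0 if censored, otherwise the integer code of the cause
    cause  -- code of the event of interest
    Returns a list of (time, CIF) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    overall_surv = 1.0  # probability of being free of any event so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_cause = d_any = n_here = 0
        # Count all subjects whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            n_here += 1
            if data[i][1] != 0:
                d_any += 1
                if data[i][1] == cause:
                    d_cause += 1
            i += 1
        if d_any > 0:
            # Increment: event-free probability just before t times the
            # cause-specific hazard at t.
            cif += overall_surv * d_cause / n_at_risk
            overall_surv *= 1 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= n_here
    return curve


if __name__ == "__main__":
    for t, value in cumulative_incidence([1, 2, 3, 4], [1, 2, 0, 1], cause=1):
        print(t, round(value, 4))
```

Unlike one minus a Kaplan-Meier estimate that treats competing deaths as censored, this estimator multiplies each cause-specific increment by the probability of having survived all causes, so the cumulative incidences of the different causes never sum to more than one.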
Microsoft Excel 2016 spreadsheets (Microsoft Corp., Redmond, WA, USA) were used to structure the data. Statistical analyses were conducted in STATA software (College Station, TX, USA) and in the free software R version 3.3.2 (R Foundation for Statistical Computing, Vienna, Austria), using the "survival" and "riskRegression" packages.
This study was approved by the Research Ethics Committee of the National School of Public Health/FIOCRUZ, under the protocol: CAAE: 14643713.0.0000.5240. The nominal identifiers were removed from the database after the data linkage, ensuring the privacy of the subjects involved in the study.
Informed consent was not obtained because only secondary notification data were analyzed.