Machine learning algorithms using national registry data to predict loss to follow-up during tuberculosis treatment

Background: Identifying patients at increased risk of loss to follow-up (LTFU) is key to developing strategies to optimize the clinical management of tuberculosis (TB). Prediction models built on national registry data may help inform healthcare workers about the risk of LTFU. Here we developed a score to predict the risk of LTFU during anti-TB treatment (ATT) in a nationwide cohort of cases using clinical data reported to the Brazilian Notifiable Disease Information System (SINAN).
Methods: We performed a retrospective study of all TB cases reported to SINAN between 2015 and 2022, excluding children (<18 years old), vulnerable groups, and drug-resistant TB cases. Only data available before treatment initiation were used for the score. We trained and internally validated three prediction scoring systems, based on Logistic Regression, Random Forest, and Light Gradient Boosting. Before fitting the models, we split the data into training (~80%) and test (~20%) sets, and then compared model metrics on the test set.
Results: Of the 243,726 cases included, 41,373 experienced LTFU, whereas 202,353 were successfully treated and cured. The groups differed in several clinical and sociodemographic characteristics. Directly observed treatment (DOT) was unevenly distributed between the groups, with lower prevalence among those who were LTFU. Three models were developed to predict LTFU using 8 features (prior TB, alcohol use, drug use, tobacco use, age, sex, HIV infection, and schooling level) with different score composition approaches. These prediction scoring systems exhibited an area under the curve (AUC) ranging between 0.71 and 0.72. The Light Gradient Boosting technique gave the best prediction performance when weighing specificity and sensitivity. A user-friendly web calculator app was created (https://tbprediction.herokuapp.com/) to facilitate implementation.
Conclusions: Our nationwide risk score predicts the risk of LTFU during ATT in Brazilian adults prior to treatment commencement. This is a potential tool to assist in decision-making strategies to guide resource allocation, DOT indications, and improve TB treatment adherence.


INTRODUCTION
Despite the widespread availability of curative treatment for tuberculosis (TB), the disease remains a major plague of humanity, accounting for more than one million deaths annually [1]. Global treatment success is still below the targets established by the World Health Organization (WHO) [2,3], especially in low- and middle-income countries (LMIC) such as Brazil [4].
Current WHO treatment recommendations for drug-susceptible TB include six months of a combination of antibiotics [1]. Such long treatment is associated with an increased risk of LTFU and may lead to adverse drug reactions [2]. Early identification, at the moment of diagnosis, of patients at high risk of LTFU using clinical and sociodemographic characteristics is key to providing personalized care, which may involve directly observed treatment (DOT), and to supporting decision-making strategies that mitigate losses in the cascade of care. To do so, reliable and accurate prediction tools [4] are necessary, especially in low-resource settings with a high-middle TB disease burden.
Brazil is among the countries with the highest number of TB cases in the world, despite following the WHO's standardized TB treatment recommendations. Importantly, the cascade of treatment care in Brazil is composed of 3 steps: 1) mandatory reporting of TB cases to the Notifiable Diseases Information System (SINAN) [5,6]; 2) a six-month treatment regimen, usually in fixed-dose combination (FDC) [7]; and 3) reporting of treatment-associated outcomes to the SINAN database. Thus, SINAN is a significant source of data that could be explored to develop prediction models for LTFU during ATT.
Therefore, we developed a prediction model for LTFU among pulmonary TB treatment cases in Brazil. We used publicly available data from SINAN-TB and applied machine-learning methods to choose the most accurate technique. We aimed to test the ability of models to predict LTFU at the baseline consultation using only clinical and sociodemographic data that are easily obtained at diagnosis. Importantly, we developed a model that could be used by both the Brazilian government and clinicians as a readily available web-based tool for decision-making to achieve higher rates of TB treatment success.

Ethics Statement
All data accessed in this study were obtained from a publicly available platform and pre-processed by the Brazilian Ministry of Health (https://datasus.saude.gov.br). This processing verified the data for consistency, duplicate registration, and completeness, following the instructions set by Resolution Number 466/12 on Research Ethics of the National Health Council, Brazil. There was no identifiable information in the databases, and thus the study was exempt from approval by ethics committees.

Study Population
We performed a retrospective analysis of de-identified data from pulmonary TB cases reported to the Brazilian Notifiable Diseases Information System (SINAN).
SINAN is a centralized system for the notification of transmissible diseases, including TB. Data stored in SINAN are maintained by the Brazilian Ministry of Health, specifically by DATASUS (the Information Technology Department of the Brazilian Unified Health System), and can be accessed through a file transfer protocol [6].
We included in our study all individuals 18 years old or older notified in SINAN with pulmonary TB from 2015 through 2022.
We excluded from our study any patient who: (i) was diagnosed with TB postmortem; (ii) belonged to any special population (i.e., people experiencing homelessness, people deprived of liberty, pregnant women, immigrants, and health workers); (iii) had resistance to any drug (rifampin, isoniazid, pyrazinamide, or ethambutol); or (iv) had an outcome other than cure or LTFU (Fig. 1).
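The inclusion and exclusion rules above can be expressed as a single eligibility check. The following is a minimal Python sketch, not the authors' code: the `Case` record and all of its field names are hypothetical stand-ins, since the paper does not name the underlying SINAN variables.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # All field names are hypothetical; they are not SINAN variable names.
    age: int
    notification_year: int
    pulmonary: bool
    postmortem_diagnosis: bool
    special_population: bool   # homeless, deprived of liberty, pregnant, immigrant, health worker
    drug_resistant: bool       # resistance to rifampin, isoniazid, pyrazinamide, or ethambutol
    outcome: str               # e.g. "cure", "ltfu", "death", "transfer"


def is_eligible(case: Case) -> bool:
    """Return True when a case meets the inclusion rules and no exclusion rule."""
    included = (
        case.age >= 18
        and case.pulmonary
        and 2015 <= case.notification_year <= 2022
    )
    excluded = (
        case.postmortem_diagnosis
        or case.special_population
        or case.drug_resistant
        or case.outcome not in {"cure", "ltfu"}
    )
    return included and not excluded


# Made-up records, for illustration only.
cases = [
    Case(34, 2018, True, False, False, False, "ltfu"),
    Case(16, 2018, True, False, False, False, "cure"),   # under 18
    Case(52, 2020, True, False, False, True, "cure"),    # drug-resistant
    Case(41, 2016, True, False, False, False, "death"),  # other outcome
]
cohort = [c for c in cases if is_eligible(c)]
```

Keeping inclusion and exclusion as separate boolean expressions makes each criterion easy to audit against the flow diagram (Fig. 1).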

Data analyses
We divided our data analysis into seven steps: (i) descriptive analyses, (ii) undersampling, (iii) data splitting, (iv) feature elimination, (v) hyperparameter tuning, (vi) model evaluation, and (vii) model building. For the descriptive analysis, we used the median and interquartile range (IQR) to describe continuous variables and absolute and relative frequencies to describe categorical variables. Because our data were imbalanced (i.e., ~3 cures for 1 LTFU), we undersampled the most frequent class. The resulting data set had equal proportions of each outcome (i.e., 1 cure for 1 LTFU) and was then split into training and test sets. The training set comprised 70% of the total data, whereas 30% was kept for model evaluation. To reduce data dimensionality, we used Recursive Feature Elimination with Cross-Validation (RFECV). We selected Random Forest (RF) as the estimator in a 10-fold stratified cross-validation, and then selected the minimum number of variables that led to the highest model accuracy following the elbow rule. To find the best set of hyperparameters, we used a grid search: for each model (i.e., Logistic Regression, Random Forest, and Light Gradient Boosting), we created a grid of parameters and evaluated the combinations on the training set. To select the best algorithm, we applied each model with its best combination of parameters to the test set and evaluated AUC, accuracy, sensitivity, and specificity. To understand feature importance and the contribution of each feature to each outcome at the global and local levels, we used Shapley values. The last step consisted of retraining the model using the whole data set [8-19].
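The undersampling and splitting steps described above can be illustrated with a short, standard-library Python sketch. It is an illustration under assumptions rather than the study's implementation: the function names, the `outcome` key, the random seeds, and the toy class counts are all hypothetical; only the 1:1 balancing and the 70%/30% split come from the text.

```python
import random


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced


def stratified_split(rows, label_key, test_fraction=0.3, seed=0):
    """Split rows into train/test sets while keeping class proportions equal in both."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data with a 3:1 imbalance (made-up counts, not the study cohort).
rows = [{"id": i, "outcome": "cure"} for i in range(300)]
rows += [{"id": 300 + i, "outcome": "ltfu"} for i in range(100)]

balanced = undersample(rows, "outcome")
train, test = stratified_split(balanced, "outcome", test_fraction=0.3)
```

Undersampling before the split, as described in the text, means that both the training and the test sets are balanced; metrics such as accuracy should therefore be read against a 50% baseline.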

Comparing machine learning algorithms to predict LTFU
We initiated our model development with 15 variables, of which 8 were selected as the most informative by our RFECV approach (Fig. 2): (i) schooling, (ii) sex, (iii) prior TB, (iv) HIV infection, (v) alcohol use, (vi) drug use, (vii) tobacco use, and (viii) age. To predict which patients were more likely to experience LTFU, we proposed three different models using these variables, each with its own optimal hyperparameters. The logistic regression model performed best with strong regularization (C = 0.01), underscoring the role of regularization strength in balancing model complexity and generalization. The Random Forest model achieved its optimal performance with a max depth of 8 and 500 decision trees (no. of estimators) in the ensemble. The Light Gradient Boosting model achieved its optimal performance with trees of max depth 4, 500 decision trees (no. of estimators), and a learning rate of 0.01, reflecting the interplay between tree complexity, ensemble size, and learning rate. These model-specific hyperparameter configurations illustrate that tuning must be tailored to each algorithm when machine learning approaches are applied to healthcare data.
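For reference, the selected features and tuned hyperparameters reported above can be collected in one configuration fragment. This is a sketch: the values are taken from the text, but the dictionary layout and the parameter names (`C`, `max_depth`, `n_estimators`, `learning_rate`) are assumptions that follow common scikit-learn and LightGBM naming, since the paper does not state which libraries were used.

```python
# The 8 features retained by RFECV, in the order listed in the text.
# The identifiers themselves are hypothetical labels.
SELECTED_FEATURES = [
    "schooling",
    "sex",
    "prior_tb",
    "hiv_infection",
    "alcohol_use",
    "drug_use",
    "tobacco_use",
    "age",
]

# Best hyperparameters per model, as reported in the text.
# Parameter names are assumed, not confirmed by the source.
BEST_PARAMS = {
    "logistic_regression": {"C": 0.01},
    "random_forest": {"max_depth": 8, "n_estimators": 500},
    "light_gradient_boosting": {
        "max_depth": 4,
        "n_estimators": 500,
        "learning_rate": 0.01,
    },
}
```

Under the scikit-learn convention, `C` is the inverse of the regularization strength, which is consistent with the text describing C = 0.01 as strong regularization.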
The next phase consisted of evaluating the three models (using the parameters described above) on the test set. The classifiers presented similar results (Table 2): logistic regression presented an AUC of 0.72 (95% CI = 0.71-0.72), whereas both Random Forest and Light Gradient Boosting presented an AUC of 0.72 (95% CI = 0.72-0.73) (Fig. 3). All models achieved the same accuracy, 0.67 (Table 2). However, the models differed in sensitivity and specificity. The Logistic Regression model presented the highest specificity (0.75) and the lowest sensitivity (0.58), whereas Light Gradient Boosting and Random Forest presented the same sensitivity (0.62), with Light Gradient Boosting presenting a higher specificity (0.72) than Random Forest (0.70). According to our calibration plot, Light Gradient Boosting presented the best result, since its predicted probability of LTFU corresponded most closely to the observed likelihood of the positive class (Fig. 4). The Random Forest presented the worst result.
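The evaluation metrics named above (AUC, accuracy, sensitivity, specificity) and the idea behind a calibration plot can be sketched in plain Python. This is an illustrative sketch, not the study's evaluation code: the function names, the 0.5 decision threshold, the number of bins, and the toy labels and scores are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = LTFU and 0 = cure."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_bins(y_true, scores, n_bins=10):
    """Per non-empty bin: (mean predicted probability, observed event rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for t, s in zip(y_true, scores):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((t, s))
    summary = []
    for members in bins:
        if members:
            mean_pred = sum(s for _, s in members) / len(members)
            observed = sum(t for t, _ in members) / len(members)
            summary.append((mean_pred, observed, len(members)))
    return summary


# Toy labels and predicted probabilities (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # share of LTFU cases correctly flagged
specificity = tn / (tn + fp)   # share of cured cases correctly not flagged
accuracy = (tp + tn) / len(y_true)
auc = roc_auc(y_true, scores)
calibration = calibration_bins(y_true, scores, n_bins=2)
```

A well-calibrated model has bins in which the mean predicted probability is close to the observed event rate; a model that underestimates risk shows observed rates above the predicted means.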
For the Random Forest, the model probability underestimated the real likelihood of the positive class (Fig. 4). Thus, based on all the results, we decided to use Light Gradient Boosting to construct our predictive model. We used SHAP values to quantify the contribution of each feature to the model's predictions, which offers insight into feature importance and interactions and helps interpret complex models. According to our model, previous TB was the most important feature: patients with prior TB had an increased likelihood of LTFU. Another important feature was drug use: patients who reported using drugs had an increased probability of LTFU during ATT (Fig. 5).
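To make the idea behind Shapley-based attributions concrete, the sketch below computes exact Shapley values by brute force for a toy risk function. Everything in it is hypothetical: the `toy_risk` function, its coefficients, and the three-feature setup are invented for illustration and are not estimates from the study. It also simplifies matters by treating an absent feature as being at a baseline value, and brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: the weighted average marginal contribution of each
    feature over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = frozenset(subset)
                with_feature = without | {feature}
                total += weight * (value_fn(with_feature) - value_fn(without))
        phi[feature] = total
    return phi


def toy_risk(present):
    """Hypothetical LTFU risk for a patient given which risk factors are 'switched on'.
    The numbers are made up and are not results from the paper."""
    risk = 0.10                                   # baseline risk
    if "prior_tb" in present:
        risk += 0.20
    if "drug_use" in present:
        risk += 0.10
    if "prior_tb" in present and "drug_use" in present:
        risk += 0.04                              # interaction term
    return risk


features = ["prior_tb", "drug_use", "hiv"]
phi = shapley_values(toy_risk, features)
```

In this toy setup the interaction term is shared equally between the two interacting features, the unused feature receives zero, and the attributions sum to the difference between the full prediction and the baseline, which is the additivity property that makes Shapley values convenient for explaining individual predictions.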

DISCUSSION
In this study of pulmonary TB cases reported to SINAN in Brazil, we developed a risk score that, before treatment initiation, effectively stratified TB cases at higher risk of LTFU during ATT. Our score used 8 features, all of which came from the case notification form and were publicly available. These features included clinical and epidemiologic information that can be collected by health professionals before treatment initiation and that predicted LTFU independent of other characteristics. The use of this risk score could provide crucial information to target specific patients from the time of diagnosis and improve successful ATT completion, potentially facilitating the achievement of the WHO target of 90% of patients with treatment success [20].
Importantly, in our study, 14.5% of the total population experienced LTFU, which represents an important public health problem because of the risk of M. tuberculosis transmission and the potential emergence of drug-resistant strains [21]. Moreover, the rates of DOT in the group that experienced LTFU were significantly lower than in the cure group. Detecting these patients at the beginning of TB treatment might help clinicians prioritize DOT and help the Brazilian national TB program define target populations.
Our probabilistic score was developed using clinical and sociodemographic data readily collected in most clinical care settings, even in resource-limited settings. Among the variables selected, prior TB, consumption habits (alcohol, tobacco, or drug use), age (adult and elderly), biological sex, HIV infection, and schooling level were the risk factors that most contributed to LTFU during TB treatment. Some of these characteristics have been explored and linked to unfavorable TB treatment outcomes through their relationship with poor therapy adherence, LTFU, and treatment discontinuation [11,22-28].
In a previous study, a similar score was developed to predict unfavorable anti-TB treatment outcomes in people living with diabetes in China, although it relied on clinical and radiologic data [24]. Another study from Mexico developed an algorithm to predict mortality, failure, and drug resistance in newly diagnosed TB patients using clinical features and laboratory tests [28]. In contrast, our score can be applied to patients with or without diabetes using only clinical information, without the need for laboratory data or radiographic exams.
While exploring data from the RePORT-Brazil consortium, we previously reported a clinical prediction model for unfavorable pulmonary TB treatment outcomes [11]. That score used information that is not readily available in SINAN, which made it difficult to translate to the nationwide TB program in Brazil. The present study intended to create a score that could be employed in all settings, especially those with limited resources, and that could help guide interventions at the moment of diagnosis, before treatment starts, in a large country such as Brazil.
Our risk model had several limitations. First, the study used nationwide public data, in which several features had missing data and were subject to a wide range of demographic and regional discrepancies. Second, most comorbidities and clinical characteristics were self-reported, which may introduce misclassification bias. Third, the study included only pulmonary TB cases, and consequently the score may not apply to extrapulmonary or disseminated TB.

Table 1
Characteristics of the overall population of the study.
Definition of alcohol use: past or current consumption of any alcohol. Definition of smoking: past or current smoking of tobacco. Definition of non-white race: combination of black, mixed, pardo, yellow, and indigenous. Definition of drug use: past or current drug use (marijuana, cocaine, heroin, or crack).
Table note: Data represent no. (%), except for age, which is presented as median and interquartile range (IQR).