Data collection
We retrospectively analyzed the VCS parameters of neutrophils, monocytes, and lymphocytes from 97 active tuberculosis patients (Group ATB) and 113 latent tuberculosis patients (Group LTB) using a hematology analyzer with VCS (volume, conductivity, light scatter) technology, from January 2018 to July 2018, at Chongqing Infectious Disease Medical Center and Army Medical Center (Daping Hospital), Army Medical University, Chongqing, China. The inclusion criteria for each group were as follows. ATB was diagnosed based on typical clinical symptoms together with positivity on at least one of the following tests: acid-fast bacilli smear, bacterial culture, a molecular test (Xpert MTB/RIF, Cepheid, Sunnyvale, CA, USA), or chest X-ray findings consistent with tuberculosis lesions. These individuals had no previous history of TB disease or treatment. LTB cases were defined as individuals who had a history of TB exposure, a positive interferon-gamma release assay (T-SPOT; Oxford Immunotec, Oxfordshire, UK), negative sputum smear and MTB culture, and no clinical or radiographic signs of ATB. The inclusion criteria for the healthy control group were: no history of contact with tuberculosis, a normal chest X-ray, a negative T-SPOT result, no signs of infection, and total and differential leukocyte counts within the reference intervals. As we reported previously, the VCS parameters of neutrophils (NE), monocytes (MO), and lymphocytes (LY) comprised mean volume (MV) and its standard deviation (MV-SD), mean conductivity (MC) and its standard deviation (MC-SD), and multiple light-scatter measurements, namely median angle light scatter (MALS), upper median angle light scatter (UMALS), lower median angle light scatter (LMALS), low-angle light scatter (LALS), and axial light loss (AL2), together with their standard deviations (MALS-SD, UMALS-SD, LMALS-SD, LALS-SD, AL2-SD). These were collected and analyzed along with routine indicators such as the total leukocyte count (WBC) and the percentages of neutrophils, monocytes, and lymphocytes (NE%, MO%, LY%). Thus a total of 46 parameters were obtained.
Classification
In this study, four machine-learning classification algorithms were used, namely LR (logistic regression), RF (random forest), KNN (K-nearest neighbor), and SVM (support vector machine), implemented in Python 3 running on Linux. A brief description of each method is provided in the following paragraphs. Data obtained from participants in the discovery cohort were randomly divided at an 8:2 ratio; the larger portion (80%) was used for modeling (training set), whereas the smaller portion (20%) was used as the test set.
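The 8:2 random split described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the feature matrix here is synthetic stand-in data, not the actual cohort.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 46-parameter feature matrix (real cohort data not shown)
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 46))     # 210 subjects, 46 parameters
y = rng.integers(0, 2, size=210)   # illustrative labels: 0 = LTB, 1 = ATB

# Random 8:2 split into training and test sets, stratified by class label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # → (168, 46) (42, 46)
```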
LR (logistic regression)
Logistic regression is a common machine learning algorithm for binary classification. It is a linear model for classification that can fit binary or multinomial logistic regression.
RF (random forest)
Random forest (RF), proposed by Breiman, is a supervised, non-parametric classification method based on decision trees. It is an ensemble classifier used for data mining, composed of numerous decision trees, each relying on the values of a random vector sampled independently. By using random subsets of the training data for each tree and considering random features at each decision point, random forest reduces overfitting.
KNN (K-nearest neighbor)
K-nearest neighbor (Cover and Hart, 1967) is a simple classification algorithm, also called the reference sample plot method. Given a sample of unknown class, a KNN classifier searches the feature space for the k training samples closest to it and predicts the class of the unknown sample from the classes of these k nearest neighbors.
SVM (support vector machine)
SVM was developed by Corinna Cortes and Vapnik; its core is the structural risk minimization principle, constructed from the empirical risk minimization principle and a confidence interval term.
Analysis of baseline features and peripheral blood routine parameters
All data were described as mean ± SD. Comparisons between two means were performed with the Wilcoxon rank-sum test. Gender differences between groups were compared using the chi-squared test.
Model building, model performance evaluation and validation
In the training set, k-fold cross-validation (k = 5) was used: all 210 subjects were randomly divided into 5 equally sized subsets. Four of the subsets were used in turn for training and the remaining one for validation; this procedure was repeated 5 times, with each partition serving once as the validation set. In the end, the predictive performance averaged over the k validation steps was regarded as the predictive performance of a classification algorithm (Fig. 1). We evaluated the diagnostic ability of each model based on the following parameters: accuracy, precision, recall, F1 score, Matthews correlation coefficient (MCC), specificity, and negative predictive value (NPV). We plotted the precision-recall curve with its area under the curve (AUPRC) and the receiver operating characteristic curve with its area under the curve (AUROC) to compare the performance of the machine learning classification models. In the test set, we again plotted the AUPRC and AUROC to estimate each model's predictive performance.
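The cross-validation and evaluation steps above can be sketched as follows. This assumes scikit-learn and synthetic data; logistic regression stands in for any of the four models, and, for brevity, the metrics are computed on training predictions rather than on the held-out test set used in the study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, confusion_matrix,
                             roc_auc_score, average_precision_score)

# Synthetic stand-in for the 46-parameter dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(210, 46))
y = rng.integers(0, 2, size=210)

# 5-fold cross-validation: accuracy averaged over the five held-out folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)
cv_accuracy = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()

# Evaluation metrics from a fitted model's predictions
clf.fit(X, y)
y_pred = clf.predict(X)
y_score = clf.predict_proba(X)[:, 1]
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y, y_pred),
    "precision": precision_score(y, y_pred),
    "recall": recall_score(y, y_pred),
    "F1": f1_score(y, y_pred),
    "MCC": matthews_corrcoef(y, y_pred),
    "specificity": tn / (tn + fp),
    "NPV": tn / (tn + fn),
    "AUROC": roc_auc_score(y, y_score),       # area under the ROC curve
    "AUPRC": average_precision_score(y, y_score),  # area under the PR curve
}
```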