Overview of the study
With standard terminologies for describing endoscopic images already established [10-12], we developed several classifiers based on different algorithms to automatically distinguish UC, CD, and ITB using free-text endoscopic descriptions as input.
Study population
Electronic health records (EHRs) of 6399 consecutive patients who underwent colonoscopy at Peking Union Medical College Hospital (PUMCH) between January 2008 and November 2018 and were clinically diagnosed with UC (n=5128), CD (n=875), or ITB (n=396) were collected. This research was approved by the Ethics Committee of Peking Union Medical College Hospital on September 31st, 2018 (IRB S-K894).
The clinical diagnoses of UC and CD were made by IBD specialists at this hospital through a combination of medical history, endoscopic features, pathological features, and treatment follow-up, based on the Chinese consensus on IBD (2018). A diagnosis of ITB required any of the following: (1) positive acid-fast bacilli on histological examination or a positive M. tuberculosis culture; (2) radiological, colonoscopic, and/or other evidence of proven TB; or (3) a full response to anti-TB therapy. Colonoscopies were performed by well-trained endoscopists at PUMCH using Olympus CF-Q260 or H260 colonoscopes.
Based on the well-established terminology used by endoscopists to describe colonoscopic images, we extracted descriptions of colonoscopic images of the patients’ index colonoscopy in the form of free text. Clinically confirmed diagnoses extracted from the hospital information system (HIS) were used as labels.
Data processing
Figure 1 shows the data-processing workflow. An example of the input data can be found in the supplementary material.
The image descriptions were preprocessed with natural language processing (NLP) techniques to extract linguistic features before being input into the classifiers. First, Chinese word segmentation was applied to tokenize the input text, using the Python package ‘jieba’ enhanced with the Xiangya Professional Medical Dictionary. Punctuation and words without clinical meaning, such as ‘patients’ and ‘prepare for the examination’, were then removed.
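As a minimal illustration of this preprocessing step, the sketch below tokenizes a description and drops punctuation and non-clinical words. The stopword list is hypothetical (the paper does not publish its list), and a plain regular-expression tokenizer stands in for jieba so the example stays self-contained.

```python
import re

# Hypothetical stop list standing in for terms like 'patients' and
# 'prepare for the examination'; the study's actual list is not given.
STOPWORDS = {"patients", "prepare", "for", "the", "examination"}

def preprocess(text):
    """Tokenize a description, stripping punctuation and stopwords.

    The study segments Chinese text with jieba plus the Xiangya medical
    dictionary; a whitespace/word-character tokenizer stands in here.
    """
    tokens = re.findall(r"\w+", text.lower())         # drop punctuation
    return [t for t in tokens if t not in STOPWORDS]  # drop stopwords

# e.g. preprocess("Patients prepare for the examination; multiple ulcers seen.")
# keeps only the clinically meaningful tokens.
```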
The second step of NLP was keyword filtering, which aimed to identify informative keywords in the description. Term frequency-inverse document frequency (TF-IDF) was applied to filter keywords. The TF-IDF value of a term t in a document d was defined as TF-IDF(t, d) = TF(t, d) × IDF(t), where TF(t, d) is the frequency of t in d and IDF(t) = log(N / n_t), with N the total number of documents and n_t the number of documents containing t.
Terms whose TF-IDF values for a given document fell outside the range [0.3, 0.7] were removed from that document's input.
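The filtering step can be sketched as follows, assuming plain term frequency (term count divided by document length) and IDF = log(N / document frequency); the exact TF-IDF variant used in the study is not specified.

```python
import math

def tfidf_filter(docs, low=0.3, high=0.7):
    """Keep, per document, only terms whose TF-IDF falls in [low, high]."""
    n_docs = len(docs)
    # document frequency of each term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    filtered = []
    for doc in docs:
        kept = []
        for term in doc:
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            if low <= tf * idf <= high:
                kept.append(term)
        filtered.append(kept)
    return filtered

docs = [["ulcer", "erosion", "ulcer"],
        ["ulcer", "polyp"],
        ["erosion", "polyp", "stricture"]]
# Only 'stricture' (rare in the corpus, hence a high IDF) lands in the
# band for the last document; common terms score too low and are removed.
```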
The last step of NLP was dimensionality reduction with non-negative matrix factorization (NMF) [13], which improves the interpretability of the extracted features and allows clinical doctors to understand the results better.
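As an illustration of NMF on a small document-term matrix, the sketch below uses the classic multiplicative-update rules; the study presumably relied on an off-the-shelf implementation, and the toy matrix, component count, and iteration budget here are arbitrary.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (docs x terms) into W (docs x k)
    and H (k x terms) with multiplicative updates, so that V ≈ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # multiplicative updates preserve non-negativity
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy TF-IDF-like matrix: 4 documents x 5 terms, reduced to k=2 features.
V = np.array([[1., 0., 2., 0., 1.],
              [0., 1., 0., 2., 0.],
              [2., 0., 1., 0., 1.],
              [0., 2., 0., 1., 0.]])
W, H = nmf(V, k=2)
# Each row of W gives a document's weights on the two latent topics,
# and each row of H shows which terms define a topic.
```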
In addition to single words, N-grams and L1 regularization (also known as the least absolute shrinkage and selection operator, LASSO) were applied in the above process. For conciseness, details are provided in the Supplementary Materials.
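The N-gram step can be sketched as follows; the maximum N and the token-joining convention are assumptions, since the details are in the supplement.

```python
def ngrams(tokens, n_max=2):
    """Return all contiguous 1..n_max-grams of a token list, joined with '_'.

    N-grams let multi-word endoscopic phrases (e.g. 'deep ulcer') act
    as single features alongside the individual words.
    """
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams

# e.g. ngrams(["deep", "ulcer", "seen"]) yields the three unigrams
# followed by the two bigrams "deep_ulcer" and "ulcer_seen".
```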
Development of classifiers
Random forest (RF) was chosen for its ability to weight the importance of features. Convolutional neural networks (CNNs) were selected for their capability to extract features automatically. In addition, all of the applied algorithms can analyze free text directly, requiring little manual work.
RF was applied to the two-class classifications (UC vs. CD, UC vs. ITB, and CD vs. ITB) and the three-class classification of UC, CD, and ITB, while the CNN was applied to the three-class classification and the CD/ITB classification. The CNN was not applied to the two-class classifications involving UC because of the unbalanced sample sizes.
The labeled dataset was randomly split into a training set (70%) and a testing set (30%) for both RF and the CNN.
(1) Random forest
RF was applied to the data processed by NMF. The RF parameters can be found in Supplementary Materials Table 1. Because the data were unbalanced, cost-sensitive learning was employed to assign different weights to the diseases. This approach improved the performance of the model on CD and ITB, which had smaller sample sizes.
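The cost-sensitive weights themselves are not reported; a common inverse-frequency scheme (the 'balanced' heuristic popularized by scikit-learn, sketched here in plain Python) assigns each class c the weight N / (K × n_c):

```python
def balanced_class_weights(labels):
    """Inverse-frequency class weights: w_c = N / (K * n_c).

    Rare classes (here CD and ITB) receive larger weights, so
    misclassifying them costs more during training."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Class ratios mirroring the study cohort (UC 5128, CD 875, ITB 396).
weights = balanced_class_weights(["UC"] * 5128 + ["CD"] * 875 + ["ITB"] * 396)
# ITB, the rarest class, gets the largest weight.
```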
To choose the RF hyperparameters, ten-fold cross-validation was applied to the training set (70% of the total data). The training set was randomly split into ten equal subsets; nine subsets were used to tune the hyperparameters, while the remaining subset served as the validation set. The chosen hyperparameters were then used for the final training and testing.
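The fold construction can be sketched as follows; the fixed seed and the sample count are illustrative, not taken from the study.

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Randomly partition sample indices into k near-equal folds.

    Each fold serves once as the validation set while the other
    k-1 folds are used for training, as in the cross-validation
    procedure described above."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # seeded for reproducibility
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(700, k=10)  # 700 samples -> ten folds of 70
```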
Regarding feature extraction, we first extracted 50 features sorted by variable importance of RF, which comprised phrases produced by segmentation. All the features were then reviewed by two experienced clinical doctors to combine similar features and omit meaningless or duplicated features.
(2) Convolutional neural network
To train the CNN models, we extracted approximately 110,000 unlabeled descriptions from the endoscopic center of PUMCH. These data were used to train a GloVe [14] model for word embedding, which was used to initialize the CNN embedding layer.
The CNN model used a structure similar to the Text-CNN model proposed by Yoon Kim [15]. A word list was built, and an integer was allocated to each word. The input sentences were segmented into words and represented by the corresponding integer sequence. The integer sequence was then embedded into 100-dimensional vectors, which served as the input for the CNN. The parameters of the CNN can be found in Supplementary Materials Table 2.
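The word-list and integer-encoding steps can be sketched as below; the padding length, the use of 0 for both padding and unseen words, and the toy corpus are assumptions for illustration.

```python
def build_vocab(corpus):
    """Map each word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for tokens in corpus:
        for w in tokens:
            vocab.setdefault(w, len(vocab) + 1)
    return vocab

def encode(tokens, vocab, max_len=8):
    """Replace words with their integers, then pad/truncate to max_len.

    The resulting fixed-length integer sequence is what the embedding
    layer turns into 100-dimensional vectors."""
    seq = [vocab.get(w, 0) for w in tokens][:max_len]  # 0 for unseen words
    return seq + [0] * (max_len - len(seq))

corpus = [["multiple", "deep", "ulcers"], ["mucosa", "smooth"]]
vocab = build_vocab(corpus)
```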
The model was optimized with the Adam algorithm [16]. Focal loss was used as the loss function to handle the imbalanced sample sizes and accelerate convergence [17]. The model was trained for 100 iterations to allow it to converge.
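Focal loss down-weights well-classified examples via FL(p_t) = -(1 - p_t)^γ log(p_t) [17]; a per-sample sketch in plain Python, omitting the optional α class weighting:

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class.

    gamma=0 recovers ordinary cross-entropy; larger gamma shrinks
    the contribution of easy, confidently classified examples, which
    helps when one class (here UC) dominates the data."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# A confident correct prediction (p_t = 0.9) contributes far less to
# the loss than a poor one (p_t = 0.3).
easy, hard = focal_loss(0.9), focal_loss(0.3)
```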
Result visualization
T-distributed stochastic neighbor embedding (t-SNE) [18] was applied to reduce the dimensionality of the features to visualize the results.
Statistical analysis
The receiver operating characteristic (ROC) [19] curve was used to evaluate the performance of the two-class classifiers, and the sensitivity, specificity, and area under the curve (AUC) were calculated. The AUC evaluates a classifier globally and is not sensitive to the ratio of positive to negative samples. The precision (also known as the positive predictive value), recall (also known as sensitivity), and F1 score were used to evaluate the performance of the three-class classifier. These statistics were defined as follows: sensitivity (recall) = TP/(TP + FN); specificity = TN/(TN + FP); precision = TP/(TP + FP); F1 = 2 × precision × recall/(precision + recall), where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
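These metrics can be computed from a binary confusion matrix as sketched below; the counts are illustrative only.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard evaluation metrics from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)          # a.k.a. recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical counts for one two-class task (e.g. CD vs. ITB).
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)
```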
All statistical analyses were performed with R 3.5.1 and Python 3.7.