All methods were carried out in accordance with relevant guidelines and regulations of the Human Research Ethics Committee (Medical) at the University of the Witwatersrand (Faculty of Health Sciences).
Text mining algorithm development
We used the Python Spyder integrated development environment (IDE) to develop the text mining algorithm because of its robust support for advanced editing, debugging, profiling, data exploration and interactive execution [14, 15]. An IDE is software used to build and develop applications. The Python code for this algorithm has been uploaded to GitHub (https://github.com/VictorO2/text-mining-gleason-score). The following Python modules were imported: (i) os, (ii) pandas, (iii) time, (iv) matplotlib, (v) seaborn, (vi) wordcloud, and (vii) the Natural Language Toolkit (NLTK). We followed the text mining pipeline depicted in the flowchart below (Fig. 1). The logical steps of the text mining algorithm were as follows: (i) data acquisition, (ii) pre-processing, (iii) feature extraction, (iv) feature value representation, (v) feature selection, (vi) information extraction, (vii) classification, and (viii) discovered knowledge.
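For reference, the corresponding imports can be sketched as follows (aliases are illustrative rather than the authors' verbatim code):

```python
# Minimal sketch of the imported modules; aliases are illustrative.
import os
import time

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from wordcloud import WordCloud
```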
Data acquisition
We extracted all prostate biopsies performed for men aged ≥ 30 years between 1 January 2006 and 31 December 2016 that were referred to the NHLS for pathology evaluation in the Gauteng province, South Africa. Two datasets were extracted from the national laboratory data repository that houses the LIS-collated patient laboratory reports. The narrative prostate biopsy reports are captured as free text in the LIS and stored in the national laboratory data repository. The Systematised Nomenclature of Medicine Clinical Terms (SNOMED CT) dataset was used to develop lookup tables to identify biopsies with an adenocarcinoma histological finding (n = 8,201) [16]. Once the biopsies with PCa were identified (adenocarcinoma histological findings with a reported GS), we extracted a random sample of 1,000 biopsies using Microsoft Excel (Redmond, Washington, USA) [17]. We chose a random sample to avoid over-representing biopsies reported in a similar style by a single laboratory.
To evaluate the text mining algorithm, we also randomly extracted 1,000 prostate biopsy narrative reports with a PCa diagnosis that were submitted from private sector laboratories to the National Cancer Registry (NCR) (referred to as the validation dataset). These narrative reports are generated by various private sector pathology practices and could be used to validate the algorithm. We received only the narrative biopsy reports.
For both datasets, the GS was manually coded by two experts. Manual coding was required because the GS is not extracted by the NCR and is embedded within the narrative report. Following this, a random sample of 369 biopsies was independently verified to validate the manual coding.
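The random samples above were drawn in Microsoft Excel; an equivalent, repeatable draw can be sketched in pandas (the file and column contents here are hypothetical):

```python
import pandas as pd

# Hypothetical extract of adenocarcinoma biopsies with a reported GS.
biopsies = pd.read_csv("adenocarcinoma_biopsies.csv")  # n = 8,201

# Draw 1,000 biopsies at random; a fixed seed makes the draw repeatable.
sample = biopsies.sample(n=1000, random_state=42)
```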
Pre-processing
We used pre-processing to ensure that the narrative biopsy reports were in a machine-readable format. The first step was to convert the narrative reports into a document collection (also referred to as a corpus), i.e., a large body of unstructured text. This step converts the narrative reports into the structured format required for text mining [14, 18, 19]. Next, the data cleaning process applied the NLP techniques of tokenization, stopword removal and stemming [15, 19]. We used tokenization to split the streams of text into smaller meaningful elements (called tokens) comprising words, phrases and symbols. For example, the phrase 'do not stop' would result in three tokens (do-not-stop). We employed stemming to reduce variant forms of a word to a common representation known as the stem, i.e., the root form of the word; for example, the root of "gleasen" is "gleason". We also standardised words such as Gleason, major, minor and score. Finally, we used the NLTK stopword list to filter out irrelevant words before text processing (e.g., 'the', 'is', 'at'), removing all possible English stopwords. We also converted the text to lowercase for standardisation.
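A minimal sketch of these pre-processing steps using NLTK is shown below; the function name is illustrative, and the authors' spelling standardisation step is omitted:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # English stopword list

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(report):
    """Lowercase, tokenize, remove English stopwords and stem a narrative report."""
    tokens = word_tokenize(report.lower())               # split text into tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop 'the', 'is', 'at', ...
    return [STEMMER.stem(t) for t in tokens]             # reduce tokens to their stems
```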
Feature extraction
We extracted features of interest from the narrative prostate biopsy reports. For feature extraction, we used regular expressions representative of the GS target feature, such as "gleason", "Gleason", "GLEASON" and "Gleeson". Regular expressions define a sequence of characters associated with a feature, and each of these text patterns can be used as a rule to extract that feature. Similar approaches have been described by Napolitano et al. and Spacic et al. [9, 20]. Next, we used N-grams as our feature extraction strategy to extract the major and minor Gleason scores, creating unigrams, bigrams, trigrams and quadgrams that generated these scores. An N-gram is a sequence of n co-occurring words reported in a sentence or paragraph of the corpus [21, 22]. For example, when n = 1 (unigram) this represents single words in a sentence; when n equals 2 (bigram), 3 (trigram) or 4 (quadgram), this represents sequences of two, three and four words respectively [22]. From the N-grams generated, we extracted the GS feature for each biopsy. The N-gram feature extraction output is provided for a sample of biopsies (Table 2); a hedged code sketch of this step follows the table.
Table 2
N-grams feature extraction output for a sample of biopsies.
['major 4 minor 5', '4 + 5']
['4 + 4', '4 + 4']
['3 + 3', '3 + 3']
['4 + 4', '4 + 4']
['major 4 + minor 3']
['4 + 5']
['major 4 minor 5', '4 + 5']
['major 4 minor 4']
['4 + 4', '4 + 4']
['4 + 4', '4 + 4']
['major 5 minor 4']
['major 3 minor 5']
['3 + 3', '3 + 3']
['2 + 2']
['3 + 4', '3 + 4']
['3 + 2', '3 + 3', '3 + 3']
['major 4 + minor 5']
['major 3 minor 4']
['3 + 5']
['3 + 3', '3 + 3']
['2 + 2', '2 + 2', '2 + 2']
['3 + 5', '3 + 5']
['4 + 3']
['3 + 5', 'major 4 + minor 5']
['major 4 minor 5']
['4 + 3', '4 + 3']
['3 + 2', '3 + 5', '3 + 5']
['3 + 3']
['major 5 minor 4']
['major 3 minor 5']
['4 + 3']
['2 + 3', '2 + 3']
['2 + 2']
['3 + 2', '3 + 2']
['major 5 + minor 4']
['major 5 + minor 4']
['major 4 minor 3']
['3 + 3', '3 + 3']
['4 + 4', '4 + 4']
['3 + 4', '3 + 4']
['major 3 minor 4']
['major 4 minor 5']
['3 + 2', '3 + 2']
['4 + 3', '4 + 3']
['2 + 3', '3 + 4', '3 + 4']
['4 + 3', '4 + 3']
['major 5 minor 5']
['major 3 minor 4']
['major 4 + minor 3']
['3 + 3', '3 + 3']
['4 + 4', '4 + 4']
['3 + 3', '3 + 3']
['major 5 minor 4']
['major 4 minor 5']
['5 + 5', '5 + 5']
['3 + 3', '3 + 3']
['5 + 4', '5 + 4']
['4 + 5']
['major 3 minor 3']
['major 5 minor 3']
['major 3 minor 4']
['major 5 + minor 5']
['major 5 minor 4']
['major 3 minor 5']
['major 4 minor 5']
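A hedged sketch of this feature extraction step is shown below: a regular expression covers the observed spellings of "Gleason", and NLTK's ngrams utility generates the unigram to quadgram sequences (the helper names are illustrative):

```python
import re
from nltk.util import ngrams

# Covers observed spellings such as 'Gleason', 'GLEASON', 'Gleeson' and 'gleasen'.
GLEASON_PATTERN = re.compile(r"gle[ae]s[oe]n", re.IGNORECASE)

def extract_ngrams(tokens, n_max=4):
    """Generate unigrams, bigrams, trigrams and quadgrams from a tokenized report."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(ngrams(tokens, n))
    return grams
```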
Feature value representation
For feature representation, we created a document-term matrix using term frequency. This was used to transform the corpus into a numeric feature vector space. We reported the twenty most frequently occurring unigrams, bigrams, trigrams and quadgrams as horizontal bar graphs (Fig. 2).
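A minimal sketch of building such a term-frequency document-term matrix with pandas and collections.Counter follows; the authors' exact construction may differ:

```python
from collections import Counter

import pandas as pd

def document_term_matrix(tokenized_reports):
    """One row per report, one column per term, values are term frequencies."""
    counts = [Counter(tokens) for tokens in tokenized_reports]
    return pd.DataFrame(counts).fillna(0).astype(int)
```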
Feature selection
For feature selection, pathologists (domain experts) identified the features of interest in the narrative prostate biopsy reports. As part of this expert-driven feature selection, we manually selected the following features: (i) episode number, (ii) major score, (iii) minor score, (iv) total score and (v) combined score. Because feature selection was expert driven, we chose only relevant features and reduced the feature space without applying dimensionality reduction. Reducing the number of selected features improves model performance, and even though the feature space was reduced, there was no loss of information [23].
Information extraction
Information extraction is used to select specific entities and relationships of interest [9]. For information extraction, we manipulated the N-gram output to extract the numerical values of the major and minor scores. Next, we calculated the total score and reported the GS in a standardised format, e.g., 4 + 4 = 8.
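A hedged sketch of this parsing step is shown below; the pattern and function names are illustrative and handle n-gram strings such as 'major 4 minor 5', 'major 4 + minor 3' and '4 + 5':

```python
import re

# Matches 'major 4 minor 5', 'major 4 + minor 5' and '4 + 5' style n-grams.
SCORE_PATTERN = re.compile(r"(?:major\s*)?(\d)\s*(?:\+\s*minor|\+|minor)\s*(\d)")

def standardise_gs(ngram):
    """Return the GS as 'major + minor = total', e.g. '4 + 4 = 8'."""
    match = SCORE_PATTERN.search(ngram)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    return f"{major} + {minor} = {major + minor}"
```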
Classification
We classified biopsies into three risk categories based on local guidelines [5]: (i) low (GS ≤ 6), (ii) intermediate (GS = 7) and (iii) high-risk (GS ≥ 8). The classification process was automated using a rule-based approach implemented within the algorithm.
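A minimal sketch of this rule, using the guideline cut-offs above (the function name is illustrative):

```python
def risk_category(total_score):
    """Map a total GS to the guideline risk categories."""
    if total_score <= 6:
        return "low"
    if total_score == 7:
        return "intermediate"
    return "high"  # total_score >= 8
```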
Discovered knowledge
The discovered knowledge included the episode number, major score, minor score, total score, standardised GS and risk category. For each biopsy, the algorithm extracted a single row of structured data. From the narrative biopsy report depicted in Table 1, the following discovered knowledge was reported: (i) ABC1234, (ii) 4, (iii) 3, (iv) 7, (v) 4 + 3 = 7 and (vi) intermediate.
Text mining algorithm evaluation
A confusion matrix (also known as a sensitivity/specificity analysis) was used to compare the values extracted by the text mining algorithm against the manually coded values [24]. The confusion matrix consists of four values: (i) true positives (TP): correctly extracting the GS, (ii) true negatives (TN): correctly identifying a biopsy without a GS, (iii) false positives (FP): incorrectly extracting a GS and (iv) false negatives (FN): failing to extract the manually coded GS [24]. Precision and recall are calculated from these values as \(\frac{TP}{TP+FP}\) and \(\frac{TP}{TP+FN}\) respectively; they are analogous to the positive predictive value (PPV) and sensitivity. The F-score is the harmonic mean of precision and recall, calculated as \(\frac{2\times\left(Precision\times Recall\right)}{Precision+Recall}\). As the manually coded values were assumed to be the gold standard without any incorrect values, there was no need to report the zero values; we therefore removed the 'Actual: No' column. However, the zero values were still used for the F-score calculation.
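These calculations reduce to a few lines of code (a sketch; the true negatives are not needed for precision, recall or the F-score):

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean (the F-score)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```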
Statistical analysis
We reported the top ten GS alpha, numeric and alphanumeric reporting formats as a table, i.e., how the scores were captured in the narrative prostate biopsy reports. We also reported the five most frequent GS, with the remaining scores categorised as 'Others', and indicated the percentage of each top-five GS categorised as high-risk (≥ 8). As we analysed a multi-class problem, we reported the frequencies of the predicted and manually coded values for the low, intermediate and high-risk GS categories as a table. Finally, we calculated the macro-averaged F-score (the F-scores for each GS risk category summed and divided by the number of categories) [25].
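As a worked formula, with \(C = 3\) risk categories the macro-averaged F-score is \(F_{macro} = \frac{1}{C}\sum_{i=1}^{C} F_{i}\), where \(F_{i}\) is the F-score for risk category \(i\).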