Deep Active Learning for Classifying Cancer Pathology Reports
Background: Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of eleven active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network (CNN) as the text classification model.
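The pool-based active learning setup described here can be sketched as a generic loop: train on the current labelled set, score the unlabelled pool with an acquisition function, move the most informative samples into the labelled set, and repeat. This is a minimal illustration, not the authors' implementation; the function and parameter names are our own, and any scikit-learn-style classifier with `fit`/`predict_proba` is assumed to stand in for the CNN.

```python
import numpy as np

def active_learning_loop(model, X_pool, y_pool, X_init, y_init,
                         acquire, batch_size, n_iters):
    """Pool-based active learning sketch.

    `acquire` maps an (n_samples, n_classes) array of predicted class
    probabilities to per-sample scores, where LOWER scores mark more
    informative (more uncertain) samples.
    """
    X_lab, y_lab = X_init, y_init
    pool_idx = np.arange(len(X_pool))        # indices still unlabelled
    for _ in range(n_iters):
        model.fit(X_lab, y_lab)              # retrain on current labels
        probs = model.predict_proba(X_pool[pool_idx])
        scores = acquire(probs)              # e.g. margin of top-2 probs
        pick = pool_idx[np.argsort(scores)[:batch_size]]
        # Simulate the human annotator supplying labels for `pick`.
        X_lab = np.concatenate([X_lab, X_pool[pick]])
        y_lab = np.concatenate([y_lab, y_pool[pick]])
        pool_idx = np.setdiff1d(pool_idx, pick)
    return model, X_lab, y_lab
```

In the study's large-dataset setting this would correspond to `batch_size=15000`; in the small-dataset setting, `batch_size=1000`.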
Results: We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity sampling outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, with 15K additional labelled samples added at each iteration of active learning), there was no clear winner among the active learning strategies. On our small dataset (1K initial labelled samples, with 1K additional labelled samples added at each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that, compared to random sampling, active learning strongly improves performance on rare classes by focusing on underrepresented classes.
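The two best-performing strategies on the small dataset, marginal (margin-based) and ratio uncertainty sampling, both compare the top-two predicted class probabilities per sample. The sketch below shows one common formulation of these scores; it is an illustration under that standard definition, not code taken from the study.

```python
import numpy as np

def margin_score(probs):
    """Margin uncertainty: difference between the two largest class
    probabilities per row. A SMALLER margin means a more uncertain
    (more informative) sample."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def ratio_score(probs):
    """Ratio uncertainty: ratio of the largest to the second-largest
    class probability per row. A value closer to 1 means a more
    uncertain (more informative) sample."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] / top2[:, 0]

# A confident prediction vs. an ambiguous one: both scores flag the
# second sample as the better candidate for annotation.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
print(margin_score(probs))  # smaller margin for the ambiguous sample
print(ratio_score(probs))   # ratio nearer 1 for the ambiguous sample
```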
Conclusions: Active learning can reduce annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
Posted 24 Sep, 2020