Machine learning for predictions of cervical cancer identification – preliminary investigation based on refractive index

Cervical cancer is one of the most common cancers, and its early diagnosis is of the greatest importance. Unfortunately, many diagnoses are based on the subjective opinions of doctors; to date, there is no general measurement method with a calibrated standard. The problem can be solved with a measurement system that fuses an optoelectronic sensor with a machine learning algorithm to provide reliable assistance for doctors at the early diagnosis stage of cervical cancer. We demonstrate preliminary research on cervical cancer assessment utilizing an optical sensor and a prediction algorithm. Since every material is characterized by its refractive index, measuring its value and detecting changes gives information about the state of the tissue. The optical measurements provided datasets for training and validating the analysis software. We present data preprocessing, machine learning results obtained with three algorithms (Random Forest, eXtreme Gradient Boosting, Naïve Bayes), and an assessment of their performance in classifying tissue as healthy or sick. All of them achieved high values (>89%) of the measures describing their performance. Our solution allows for rapid sample measurement and automatic classification of the results, constituting a potential support tool for doctors.

In this study, we propose a method of preliminary cervical cancer identification based on a prediction algorithm taught on data obtained from low-coherence measurements of certified refractive index liquids. We measured and analyzed samples within the range of actual refractive index for lesions. The acquisition and preparation of learning process and results date, no peculiar Our approach which


Cervical cancer is one of the most common cancers worldwide 1,2. Every year around the world cervical cancer is diagnosed in about half a million women, including about 2.5 thousand in Poland 3. The incidence and mortality of cervical cancer have been dramatically reduced by screening programs 4. However, in many cases

"halo" and with enlarged, irregular nucleus, which suggested an intensive mitosis process. The solution to this issue may be the fusion of two of the most dynamically developing technologies: optical sensing and machine learning techniques 9-11. With a fast, reliable and non-destructive optical method, we can investigate the biological sample and then analyze the acquired data with dedicated software 12,13, allowing for auto-identification of neoplastic cervical lesions, which will be invaluable support for doctors at the stage of initial diagnosis 14,15. The identification will be based on the refractive index values of the measured tissues.

The refractive index is one of the most important physical properties characterizing materials. In the case of biological tissues, it is highly correlated with the morphological features including the cell density and the nuclear-

In this study, we propose a method of preliminary cervical cancer identification based on a prediction algorithm, taught on data obtained from low-coherence measurements of certified refractive index liquids. We have

The main goal of the cervical cancer identification method is to detect neoplastic lesions according to the

It should be noted that the cancer is diagnosed when the basal membrane is invaded due to differences in treatment. However, the evaluation of the refractive index should be correlated with the identification of the basal layer. Therefore, an essential element of the elaborated method is its sensitivity to changes in the Fabry-Perot interferometer length. This parameter corresponds to the depth of the cervical epithelium in the measured sample, which determines the grade of dysplasia.
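The dependence of the interference signal on both the refractive index and the cavity length can be illustrated with a minimal sketch. It assumes the standard low-finesse (two-beam) approximation of a Fabry-Perot cavity, in which the round-trip phase is 4πnL/λ and adjacent fringes are separated by approximately λ²/(2nL); reflection phase shifts and dispersion are ignored. The function names, the 1550 nm wavelength, the 280 µm cavity length and the refractive index values are illustrative assumptions, not the authors' processing code.

```python
import math


def fringe_spacing_nm(n, cavity_um, wavelength_nm=1550.0):
    """Approximate spectral spacing of adjacent fringes: FSR ≈ λ² / (2 n L)."""
    cavity_nm = cavity_um * 1000.0
    return wavelength_nm ** 2 / (2.0 * n * cavity_nm)


def two_beam_intensity(wavelength_nm, n, cavity_um, visibility=1.0):
    """Normalized reflected intensity in the two-beam approximation."""
    cavity_nm = cavity_um * 1000.0
    phase = 4.0 * math.pi * n * cavity_nm / wavelength_nm  # round-trip phase
    return 0.5 * (1.0 + visibility * math.cos(phase))


if __name__ == "__main__":
    # Illustrative values only: a higher refractive index or a longer cavity
    # gives a longer optical path and therefore denser fringes.
    for n in (1.33, 1.40):
        print(f"n = {n}: fringe spacing = {fringe_spacing_nm(n, 280.0):.3f} nm")
```

In this approximation the spectrum depends only on the product nL, so a change in cavity length shifts the fringes in the same way as a change in refractive index, which is why the two quantities have to be considered together.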

Dataset acquisition
The optical determination of the refractive indices of the investigated liquids was performed in a Fabry-Perot interferometer. The measurement setup was built in a reflective configuration using fiber-optic technology. The components of the system were a superluminescent diode (SLD-1550-13-, FiberLabs Inc., Fujimino, Japan), an

The highest signal contrast of V = 0.9956 was obtained for the cavity length equal to 280 µm. The reference signal was acquired to control the intact cavity setting. Next, 30 µL of the liquid sample with a known refractive index was introduced into the cavity. The optical spectra were recorded and the cavity was cleaned. The whole procedure was then repeated for all liquids (a total of 10 spectra for each sample).
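As an illustration of how a signal contrast value can be obtained from a recorded spectrum, the sketch below assumes that V denotes the usual fringe visibility, V = (I_max − I_min) / (I_max + I_min). The function name and the sample intensities are hypothetical, and the simple global maximum and minimum used here would be sensitive to noise and to the source envelope in real data.

```python
def fringe_visibility(intensities):
    """Fringe visibility V = (I_max - I_min) / (I_max + I_min)."""
    i_max = max(intensities)
    i_min = min(intensities)
    if i_max + i_min == 0:
        raise ValueError("visibility is undefined for an all-zero signal")
    return (i_max - i_min) / (i_max + i_min)


if __name__ == "__main__":
    # Hypothetical intensity samples (arbitrary units), not measured data.
    spectrum = [0.02, 0.40, 0.98, 0.55, 0.03, 0.47, 0.97]
    print(f"V = {fringe_visibility(spectrum):.4f}")
```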

Dataset preparation
Interferograms obtained in accordance with the adopted methodology were the basis for further analyses.

The prepared dataset made it possible to build a machine learning model based on a selected supervised learning algorithm.

The following formulas were introduced into the preprocessing procedure in order to estimate the distortion of

They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels 41.
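The principle behind such a classifier can be shown with a minimal from-scratch sketch. It assumes Gaussian class-conditional densities (not the kernel density variant mentioned above) and class priors estimated from the training data; the class name, the neutral labels and the single-feature values are hypothetical and are not taken from the study's dataset.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes for small numeric datasets."""

    def fit(self, X, y):
        grouped = defaultdict(list)
        for row, label in zip(X, y):
            grouped[label].append(row)
        self.priors_ = {}
        self.stats_ = {}
        for label, rows in grouped.items():
            # Prior estimated as the class frequency in the training data.
            self.priors_[label] = len(rows) / len(X)
            stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                stats.append((mean, var + 1e-9))  # small term avoids zero variance
            self.stats_[label] = stats
        return self

    def _log_posterior(self, row, label):
        # Features are treated as independent given the class (the "naive" step).
        log_p = math.log(self.priors_[label])
        for value, (mean, var) in zip(row, self.stats_[label]):
            log_p += -0.5 * math.log(2.0 * math.pi * var)
            log_p -= (value - mean) ** 2 / (2.0 * var)
        return log_p

    def predict(self, X):
        return [
            max(self.priors_, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]


# Hypothetical one-feature training data (refractive-index-like values).
X_train = [[1.33], [1.34], [1.35], [1.38], [1.39], [1.40]]
y_train = ["class_a"] * 3 + ["class_b"] * 3
model = GaussianNaiveBayes().fit(X_train, y_train)
predicted = model.predict([[1.335], [1.395]])
```

Working with log-probabilities instead of multiplying densities keeps the computation numerically stable when several features are combined.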

The XGBoost algorithm was used and the following parameters were selected: booster = gbtree, learning_rate = 0.3, min_split_loss = 0, max_depth = 6 and sampling_method = uniform. As a part of the application of a different approach to classification, an NB algorithm was used. The following parameters were selected: priors = None,

Table 3.

Table 3. Classification results

The obtained results are also presented as confusion matrices in Figure 5.
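The relation between a binary confusion matrix and the reported measures (accuracy, precision, recall and F1-score) can be sketched as follows. The function name and the counts in the example are hypothetical and are not the values from Figure 5; the sketch assumes that one class is treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in binary_metrics(tp=45, fp=5, fn=5, tn=45).items():
        print(f"{name}: {value:.3f}")
```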

Additionally, to extend the model evaluation, the time needed for learning from the training data and for making predictions was measured for each algorithm. The results are presented in Table 4.
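One simple way to take such timings is sketched below, assuming wall-clock time measured with Python's `time.perf_counter`. The helper `time_call` and the threshold "model" are hypothetical stand-ins for the actual training and prediction routines, and the data values are illustrative.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fit_threshold(values, labels):
    """Stand-in for training: midpoint between the two class means."""
    zeros = [v for v, lab in zip(values, labels) if lab == 0]
    ones = [v for v, lab in zip(values, labels) if lab == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0


def predict_threshold(threshold, values):
    """Stand-in for prediction: label 1 above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical data; real measurements would be repeated and averaged.
values = [1.33, 1.34, 1.35, 1.38, 1.39, 1.40]
labels = [0, 0, 0, 1, 1, 1]
threshold, train_s = time_call(fit_threshold, values, labels)
predictions, predict_s = time_call(predict_threshold, threshold, values)
print(f"training: {train_s:.6f} s, prediction: {predict_s:.6f} s")
```

Single-run timings of very fast operations are noisy, so in practice each measurement would typically be repeated several times and summarized.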

It can be noted that the Naïve Bayes method not only gives the best results for the validation test, but is also the fastest in both the training and prediction phases.

In this study, we presented a novel approach to the analysis of data acquired by a low-coherence classification of healthy and sick tissues. For the training datasets, the tested classifiers achieved accuracy, precision, recall and F1-score all above 95%; for validation, accuracy was above 89%, precision above 90%, recall above 90% and F1-score above 89%. The method we reported can be of great assistance for doctors in early cervical cancer diagnosis.

The authors declare no conflict of interest.