Fast Characteristic of Skin Lesions by Machine-Learning of Raman Spectrum

Background: The traditional diagnosis of skin lesions relies mainly on dermoscopy and pathological biopsy; the former is non-objective and the latter is invasive and time-consuming. An objective and non-invasive inspection method is needed for the diagnosis of skin cancer, which is the most common malignant tumor. Herein, we aimed to rapidly identify skin cancers on ultrathin frozen fresh tissue sections by combining Raman spectroscopy detection and machine learning technology.

Methods and material: 22 fresh frozen tissue sections, including 3 squamous cell carcinomas, basal cell carcinomas, 2 malignant melanomas, 3 seborrheic keratoses, and 3 melanocytic nevi, were included and subjected to Raman detection. To prevent the dispersed distribution of the Raman data from affecting the generalization ability of the learning model, a series of adaptive preprocessing algorithms were first applied to standardize the raw Raman data of the five skin lesion types. Visualized cluster analysis of the processed Raman data was performed by principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). Using K-nearest neighbor (KNN) and support vector machine (SVM) classifiers, two predictive models for diagnosis were established and evaluated on the training set and test set by confusion matrices and receiver operating characteristic (ROC) curves.

Results: The mean-variance Raman spectra of the 5 skin lesion types were acquired after standardization, and 4 peak positions with large differences were found. After dimensionality reduction by PCA and t-SNE, the visual clustering results of the Raman data showed intra-cluster homogeneity and inter-cluster dispersion. The test accuracies reached 94.56% and 98.94% for the KNN and SVM classifiers, respectively. The areas under the ROC curves of the two classifiers, in both the category dimension and the sample dimension, were all greater than 0.99, which is close to a perfect classification.

Conclusions: Raman spectroscopy is a competitive candidate for the fast and accurate diagnosis of skin lesions, and the molecular information it provides may be used in pathological classification, prediction of immunotherapy responsiveness and stratification of prognostic risk. Furthermore, the combination of Raman spectroscopy and machine learning methods showed great diagnostic capability with high accuracy and is a promising tool for the diagnosis of skin lesions.

approximately 3 million patients with NMSC are expected to be treated each year, while ~10 thousand new cases and almost 7 thousand deaths arise. [4,5] Moreover, ~40% of patients will relapse within 2 years. [6]

The diagnosis of skin lesions relies mainly on dermoscopy and pathological biopsy. Dermoscopy is a non-invasive in situ diagnostic tool based on visual and morphological recognition. However, it is non-objective and highly dependent on the experience of doctors. Many studies showed that the accuracy of melanoma diagnosis by dermatologists varies between 56% and 82.8%, while up to one third of melanomas were misdiagnosed as benign lesions. [7,8] Although pathological biopsy is the gold standard for diagnosis, it is invasive and time-consuming, burdens patients and doctors, and leads to a number of unnecessary biopsies. The number needed to treat for the resection of one malignant skin lesion was reported to be 20 to 59. [9,10] Therefore, it is necessary to find a non-invasive, objective and highly efficient screening and diagnosis method.

part) and remove the high-frequency component (the noise part) from the Raman spectrum. Finally, we used the adaptive iteratively reweighted penalized least squares (airPLS) baseline correction algorithm to remove the background introduced by fluorescence, the chip and the tissue slide itself. As an example, the Raman spectrum standardization results for BCC are shown in Figure 2, and the preprocessing schemes and corresponding algorithms are listed in Table 1.

Unsupervised learning
The Raman spectra of skin lesions have 701 dimensions, many of which are noise and redundant information that contribute nothing to classification. To compress the spectral dimensions and reduce overfitting, a feature dimension reduction method is required. Without requiring prior knowledge of each feature's contribution, unsupervised learning applied to data compression can improve the usability of the algorithms and their performance in high dimensions, and is helpful for the visualization of the data.

(1) PCA (principal component analysis) is an unsupervised linear transformation technique and the most widely used data compression algorithm [27,28]. It carries out an orthogonal transformation (a kind of linear transformation in which the inner product of two vectors remains unchanged) according to the data characteristics to eliminate the correlation between the components of the original vectors. The transformation yields eigenvectors ordered by decreasing eigenvalue. After the orthogonal transformation, the high-dimensional Raman spectrum can be expressed in a low-dimensional space.

(2) The t-SNE (t-distributed stochastic neighbor embedding) algorithm is a non-linear dimension reduction method. First, it converts the Euclidean distance between two data points in the high-dimensional space into a similarity probability. Then, the joint probability of the high-dimensional data point and the corresponding low-dimensional map point is used to replace the conditional probability in the stochastic neighbor embedding (SNE) algorithm. Because the similarities in the low-dimensional space follow a heavy-tailed t-distribution, points that are moderately far apart in the high-dimensional space are mapped to larger distances, so that points in the same cluster are gathered more closely and points in different clusters lie farther apart, which effectively alleviates the data crowding problem in the low-dimensional space.
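
The PCA step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes NumPy is available, uses synthetic random data as a stand-in for the 701-dimensional Raman spectra, and the function name `pca_reduce` and the choice of three components are hypothetical.

```python
import numpy as np


def pca_reduce(spectra, n_components=3):
    """Project spectra (n_samples x n_features) onto the leading principal axes."""
    X = np.asarray(spectra, dtype=float)
    # Centre each spectral channel so the decomposition describes variance.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal eigenvectors of the covariance matrix,
    # ordered by decreasing singular value (hence decreasing eigenvalue).
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    eigenvalues = S ** 2 / (X.shape[0] - 1)
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, components, explained_ratio


# Synthetic stand-in data (assumption): 40 spectra with 701 channels each.
rng = np.random.default_rng(0)
demo_spectra = rng.normal(size=(40, 701))
scores, components, explained_ratio = pca_reduce(demo_spectra, n_components=3)
print(scores.shape, explained_ratio)
```

The returned `scores` are the low-dimensional coordinates that would be plotted as PC1, PC2 and PC3; `explained_ratio` indicates how much of the total variance each retained component carries.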

Supervised learning
In contrast to unsupervised learning, the training data in supervised learning have both feature values and label values. By studying the training data, the learning model independently establishes the connection between feature values and label values, and predicts label values from data features. The methods we used were KNN (K-nearest neighbor) and SVM (support vector machine).

(1) KNN, also known as the nearest neighbor algorithm, is based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. For each new data point, the K closest points are found among the training tuples, and the new point is assigned to the category held by the majority of these K neighbors. In this paper, the Euclidean distance was adopted. Suppose two points or tuples $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$; their Euclidean distance is

$$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

(2) SVM is a method for classifying linear and nonlinear data and has been widely used in many clinical predictions [29,30]. It maps the training data to a new, higher dimension and searches for the best classification plane (decision boundary) that separates the data into different classes. Two classes of data can always be separated by a decision boundary, provided the non-linear mapping reaches a sufficiently high dimension. For RS data with many features, the direct calculation of inner products in the high-dimensional space is too costly, so the kernel function is the core of the SVM. In this article, the SVM uses a polynomial kernel: the kernel function of samples i
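
As a rough sketch of the two ideas in this section, the code below implements KNN voting with the Euclidean distance and a generic polynomial kernel. It is illustrative only: the value of K, the kernel parameters `degree`, `gamma` and `coef0`, the toy two-dimensional points and all function names are assumptions, since the source does not give them here.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the query to the majority label among its k nearest training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: euclidean_distance(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    # On a tie, the label met first (i.e. the closer neighbour) wins.
    return votes.most_common(1)[0][0]


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """Generic polynomial kernel (gamma * <x, z> + coef0) ** degree.

    The parameter values are placeholders, not those used in the study.
    """
    inner = sum(a * b for a, b in zip(x, z))
    return (gamma * inner + coef0) ** degree


# Toy two-dimensional data (assumption) standing in for reduced spectra.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
print(polynomial_kernel((1, 2), (3, 4), degree=2))
```

The kernel evaluates the inner product in the mapped space directly from the original feature vectors, which is what keeps the high-dimensional mapping tractable.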

Liquid chromatography-mass spectrometry (LC-MS)
According to our inclusion criteria and the single pathology principle, another three lesion tissues (Tables 5 and 6) were obtained for LC-MS. After equal amounts of tissue were taken from the center of the lesions, chromatographic grade methanol (mass-to-volume ratio 1 g : 2.5 mL) was immediately added and the samples were vortexed for 1 min. The tissues were then homogenized for 3 min with 2 to 3 zirconium dioxide grinding beads. After grinding, the homogenates were centrifuged at 14,000 rpm for 10 min at 4 °C, and the upper aqueous layers were used for LC-MS analysis. The phenylalanine and tryptophan standards were dissolved in pure methanol to obtain 2.00 mg/mL stock solutions. Before sample detection and analysis, the stock solutions were diluted with pure methanol and made into mixed standards (2000 ng/mL).

High-resolution mass spectrometry (MS) (Q Exactive™, Thermo Fisher Scientific (China) Co., Ltd.) coupled with electrospray ionization (ESI) was performed in the positive and negative ion switching scan mode. Parallel reaction monitoring (PRM) was selected for detection. The resolution was 17,500 and the scan range was 50.0-500.0 m/z. During detection, the spray voltage was set at 3.2 kV in positive ionization mode, the capillary temperature at 300 °C and nitrogen at 40 Arb. The data collection time was 8.00 min. Analyte information is shown in Table 2. Liquid chromatography (LC) (UltiMate 3000 RS, Thermo Fisher Scientific (China) Co., Ltd.) used a T3 column (2.1 × 150 mm, 3 µm, Waters) maintained at 35 °C with a flow rate of 0.30 mL/min. The aqueous phase was 10 mM ammonium formate solution adjusted to pH 3.0 with formic acid, and the organic phase was acetonitrile. The elution gradients are shown in Table 3. The injection volume was 5 μL for each sample.

Chromatogram acquisition and integration were performed with Xcalibur 3.0 (Thermo Fisher), and linear regression with 1/X² as the weighting coefficient was used to obtain the standard curves of phenylalanine and tryptophan (Tables 5 and 6).
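
The 1/X² weighted linear regression used for the standard curves can be sketched as follows. This is a hedged illustration of the general technique, not the Xcalibur procedure itself: the concentration levels and responses are synthetic values invented for the example, and the function names are hypothetical.

```python
def weighted_linear_fit(concentrations, responses):
    """Fit response = slope * concentration + intercept with weights 1 / X**2."""
    weights = [1.0 / (x * x) for x in concentrations]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, concentrations))
    swy = sum(w * y for w, y in zip(weights, responses))
    swxx = sum(w * x * x for w, x in zip(weights, concentrations))
    swxy = sum(w * x * y for w, x, y in zip(weights, concentrations, responses))
    # Closed-form solution of the weighted least-squares normal equations.
    denominator = sw * swxx - swx ** 2
    slope = (sw * swxy - swx * swy) / denominator
    intercept = (swy - slope * swx) / sw
    return slope, intercept


def concentration_from_response(response, slope, intercept):
    """Back-calculate a concentration from a measured response."""
    return (response - intercept) / slope


# Synthetic calibration levels in ng/mL and noise-free responses (assumption).
levels = [10.0, 50.0, 100.0, 500.0, 1000.0, 2000.0]
signal = [2.0 * x + 5.0 for x in levels]
slope, intercept = weighted_linear_fit(levels, signal)
print(slope, intercept)
print(concentration_from_response(205.0, slope, intercept))
```

Weighting by 1/X² gives the low-concentration standards more influence on the fit, which is a common choice when the calibration range spans several orders of magnitude.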

Raman characteristics and molecular information of five skin lesion types
In order to avoid an uneven distribution of categories, 50 Raman spectra were collected for each lesion sample, giving a total of 1100 Raman spectra. The 1100 spectra were sequentially standardized and batched. After processing, the data were used to train and test the identification model. Figure 3 shows the mean-variance map of the preprocessed Raman spectra of the 5 skin lesion types, where the solid lines are the mean spectra and the shaded bands represent the standard deviations within groups.

From the mean-variance graph of the Raman spectra of the 5 skin lesion types, 4 peak positions (720 cm⁻¹, 752 cm⁻¹, 853 cm⁻¹, 1002 cm⁻¹) with significant differences were noticed. Their physical origins and peak intensity disparities were summarized in Table 4 and (Table 4) in SCC and MM were significantly higher than in other lesions (Fig. 4B, D). In order to verify the reliability of our test results, we used LC-MS to measure the contents of phenylalanine and tryptophan in SCC, SK and MN. The contents of phenylalanine (Table 5) and tryptophan (Table 6) in SCC were indeed higher than in SK and MN. In addition, the peak intensity at 853 cm⁻¹ (the C-C stretching of the collagen proline ring) [35] (Table 4) in MN was much higher than in other lesions, followed by SK (Fig. 4C), suggesting that proline decreases as the malignancy of skin lesions increases.
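
A mean spectrum with a within-group standard deviation band, as plotted in a mean-variance map, could be computed along these lines. This is only a sketch: the three-channel toy spectra are invented for illustration, the use of the sample standard deviation is an assumption (the source does not say which estimator was used), and `mean_and_sd_spectra` is a hypothetical name.

```python
import statistics


def mean_and_sd_spectra(spectra_by_class):
    """Return {lesion: (mean spectrum, SD spectrum)} computed channel by channel."""
    summary = {}
    for lesion, spectra in spectra_by_class.items():
        # Transpose so that each tuple holds one channel across all spectra.
        channels = list(zip(*spectra))
        mean_spectrum = [statistics.fmean(channel) for channel in channels]
        # Sample standard deviation (assumption); needs at least 2 spectra.
        sd_spectrum = [statistics.stdev(channel) for channel in channels]
        summary[lesion] = (mean_spectrum, sd_spectrum)
    return summary


# Toy intensities with 3 channels per spectrum (assumption, not real data).
toy_spectra = {
    "SCC": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
    "MN": [[2.0, 2.0, 2.0], [2.0, 4.0, 6.0], [2.0, 6.0, 10.0]],
}
summary = mean_and_sd_spectra(toy_spectra)
for lesion, (mean_spectrum, sd_spectrum) in summary.items():
    print(lesion, mean_spectrum, sd_spectrum)
```

Plotting the mean as a solid line and the mean plus or minus the standard deviation as a shaded band gives the kind of display described for Figure 3.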

Visualized clustering results in t-SNE and PCA
Using the standardized Raman data of the 5 skin lesion types, cluster analysis was carried out in two ways. Figure 5 showed the clustering results visualized by nonlinear t-SNE dimensionality reduction and by linear PCA. Figure 5A showed the t-SNE results, in which the two embedding dimensions, t-SNE 1 and t-SNE 2, made the 5 types of skin lesions highly distinguishable in a nonlinear manner. Figure 5B showed the three-dimensional PCA visualization, in which the three largest principal components, PC1, PC2 and PC3, were used to achieve linear separability of the 5 types of skin lesions. Both unsupervised learning methods showed intra-cluster homogeneity and inter-cluster dispersion.

Confusion matrices and validations of SVM and KNN models
Next, the RS data of the five skin lesion types were learned and analyzed by two common supervised classifiers, SVM and KNN. 20% of the RS data were used for testing, and the confusion matrices of the recognition results are shown in Figure 6 (A, B). In the confusion matrices, the horizontal direction represented the true category label, the vertical direction represented the predicted category label (the category label corresponding to the highest predicted probability), and the diagonal values indicated the recognition accuracy for the test data of the corresponding category. Taking the mean of the diagonal values of the confusion matrices, the KNN and SVM test accuracies were 94.56% and 98.94%, respectively. In KNN, 11.1% of SCCs were misjudged as BCCs, 5.6% of SKs were confused with MMs and 10.5% of MNs were misdiagnosed as SCCs (Fig. 6A). In SVM, 5.3% of MNs were misjudged as SKs (Fig. 6B).

With the false positive rate (FPR) as the horizontal axis and the true positive rate (TPR) as the vertical axis, ROC curves of the five skin lesion categories were drawn for KNN and SVM, and the area under the curve (AUC) was used to measure the quality of the prediction models (Fig. 6C, D). Macro-average ROC curves were drawn using the mean of the ROC curves of the 5 categories, indicating prediction in the category dimension. Micro-average ROC curves were drawn using the mean of the ROC curves of all test samples, indicating prediction in the sample dimension.

The AUCs of the five categories in the KNN classifier were all greater than 0.97, and the mean AUCs in the category dimension and the sample dimension were both 0.99 (Fig. 6C). The AUCs over all test samples and over all categories were both 1 in the SVM classifier (Fig. 6D). These data indicated that KNN and SVM were both close to perfect classifiers.
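
The evaluation quantities in this section can be illustrated with a small self-contained sketch. It assumes that the mean of the diagonal is taken over a row-normalized confusion matrix, that the macro-average AUC is the mean of the one-vs-rest AUCs, and that the micro-average pools every (sample, category) decision; these conventions, the toy probabilities and all function names are assumptions rather than details confirmed by the source.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows = true category and columns = predicted category."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def mean_diagonal_accuracy(matrix):
    """Mean of the per-category recognition rates (row-normalized diagonal)."""
    rates = [row[i] / sum(row) for i, row in enumerate(matrix) if sum(row) > 0]
    return sum(rates) / len(rates)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_micro_auc(true_labels, probabilities, classes):
    """One-vs-rest AUC averaged per category (macro) and pooled (micro)."""
    per_class, pooled_labels, pooled_scores = [], [], []
    for j, c in enumerate(classes):
        one_hot = [1 if t == c else 0 for t in true_labels]
        scores = [row[j] for row in probabilities]
        per_class.append(binary_auc(one_hot, scores))
        pooled_labels += one_hot
        pooled_scores += scores
    return sum(per_class) / len(per_class), binary_auc(pooled_labels, pooled_scores)


# Toy predicted probabilities for 6 test spectra (assumption, not study data).
classes = ["SCC", "BCC", "MN"]
true_labels = ["SCC", "SCC", "BCC", "BCC", "MN", "MN"]
probabilities = [
    [0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7],
]
# Predicted label = category with the highest predicted probability.
predicted = [classes[row.index(max(row))] for row in probabilities]
matrix = confusion_matrix(true_labels, predicted, classes)
accuracy = mean_diagonal_accuracy(matrix)
macro_auc, micro_auc = macro_micro_auc(true_labels, probabilities, classes)
print(matrix, accuracy, macro_auc, micro_auc)
```

In this toy case one SCC spectrum is assigned to BCC, so the mean diagonal accuracy drops below 1 while the per-category ranking stays perfect; the pooled micro-average is slightly lower because scores from different categories are compared with each other.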

Discussion
Since the specimens in pathology departments are all preserved by formalin fixation and paraffin embedding, many human tissue specimens were tested directly within the wax blocks [17,18], and some underwent gentle dewaxing treatment [19] or digital dewaxing of the RS signal [46]. Although some studies demonstrated that detection in wax or formalin-fixed blocks has no effect on the Raman spectra of specimens and that paraffin tissues can be almost completely dewaxed [21,47,48]

SNE both show better clustering based on categories (Fig. 5). With its lower coupling feature, t-SNE showed its superiority in the dimensionality reduction and visualization of high-dimensional Raman spectral data (Fig. 5A). The two classifiers, KNN and SVM, showed high test accuracies in the RS identification of the 5 skin lesion types. In KNN, 11.1% of SCCs were misjudged as BCC and 5.6% of SKs were confused with MM (Fig. 6A). For SCC, BCC, SK and MM, further surgical resection and pathological biopsy are necessary, so these misjudgments are acceptable in the clinic. 10.5% of MNs were misjudged as SCC (Fig. 6A), which may increase unnecessary biopsies. Compared with less experienced physicians, KNN is the more experienced diagnostician and deserves the title of "senior physician". SVM, with only 5.3% of MNs misjudged as SK (Fig. 6B), showed an almost ideal classification effect. Moreover, the AUCs of the ROC curves in the category dimension and the sample dimension were both 0.99 in KNN and both 1 in SVM (Fig. 6C, D), indicating that both classifiers are close to perfect classifiers of RS in the identification of the 5 skin lesion types. All in all, the application of machine learning to RS can better identify skin lesion types and provides a better bridging method for the application of RS in AI diagnostics.

How to achieve individualized treatment is still one of the ten challenges facing tumor immunotherapy. [58] Catabolism of the essential amino acid tryptophan is recognized as an important microenvironmental factor that suppresses antitumor immune responses in cancer and regulates T cell proliferation, activation and anti-tumor effects [36-38]. Phenylalanine is involved in regulating cell cycle progression, [42] modulating invasion-related signaling and function proteins, [43] and promoting tumor cell adhesion and spread. [44] A lack of phenylalanine can induce focal adhesion kinase-dependent apoptosis and mitochondria-initiated apoptosis. [43,45] Interestingly, different contents of tryptophan and phenylalanine were detected by RS in the 5 skin lesion types we studied and were highly consistent with the LC-MS results (Fig. 4B, D; Tables 5 and 6). In addition, since proline is the main component of collagen, a decrease in proline indicates tumor metastasis and poor prognosis [39-41]. Similarly, a weaker proline signal was found in malignant lesions than in benign lesions in our study (Fig. 4C). These results indicate that RS provides reliable molecular information related to tumor therapy and progression. Furthermore, AI-aided RS may be a reliable screening method for immunotherapy responsiveness and individualized therapy. These conjectures will be examined in our following research.

Conclusion
In summary, RS is a competitive candidate for the fast and accurate diagnosis of skin lesions, with ultrathin frozen fresh sections providing high-quality Raman spectra. The application of machine learning methods to Raman spectrum classification showed excellent diagnostic capabilities for the 5 skin lesion types. The KNN and SVM predictive models diagnosed the 5 skin lesion types with almost perfect accuracy. The significant differences in tryptophan, phenylalanine and proline indicated by RS may imply different progression and treatment responsiveness among the 5 skin lesion types. These results indicate that ML-aided RS is a potential tool for clinical diagnosis and for the screening of tumor immunotherapy, progression and prognosis.

laser. Raman spectra were processed step by step and analyzed by machine learning methods. Diagnostic information was output at the end.