Grade diagnosis of human glioma based on fingerprint and artificial neural network

Wenyu Peng, Shuo Chen, Dongsheng Kong, Xiaojie Zhou, Xiaoyun Lu, and Chao Chang
Key Laboratory of Biomedical Information Engineering of Ministry of Education, School of Life Science, Xi’an Jiaotong University, Xi’an 710049, China
Innovation Laboratory of Terahertz Biophysics, National Innovation Institute of Defense Technology, Beijing 100071, China
Department of Neurosurgery, Chinese People’s Liberation Army (PLA) General Hospital, Beijing, P.R. China
National Facility for Protein Science in Shanghai, Shanghai Advanced Research Institute, Chinese Academy of Science, Shanghai 201210, China


Background
Glioma accounts for almost 30% of all primary brain tumors and 80% of all malignant brain tumors and is responsible for the majority of deaths from primary brain tumors. Gliomas can be classified as astrocytomas, oligodendrogliomas, mixed oligoastrocytic gliomas or ependymomas and are assigned WHO grades from I to IV on the basis of their histological appearance [1]. The lower WHO grades (I/II) indicate the least malignant behavior. Low-grade cells look more like normal cells and usually grow slowly, but they can grow into nearby brain tissue. Surgery is usually the only treatment for a low-grade tumor, but the tumor is likely to return after surgery and tends to develop into a malignant tumor. In high-grade (III/IV) glioma, the cells look very abnormal and grow very fast. High-grade tumors often return soon after treatment and sometimes spread to other parts of the brain and spinal cord, so treatment involving radiotherapy and chemotherapy is necessary [2]. Historically, histopathology has been the gold standard for classification diagnosis: sections deposited on microscopy slides are examined against key criteria such as cellular density, nuclear atypia, mitotic activity, necrosis and microvascular proliferation [3]. However, this traditional method has limitations, such as a tedious experimental process, delayed diagnostic results and a degree of subjectivity on the part of the pathologist. Therefore, a complementary approach based on a rapid, accurate, objective and quantitative analysis tool could provide clinicians with classification advice for aggressive diseases.
Spectroscopic techniques, including FTIR, Raman and terahertz spectroscopy, are valuable tools for studying changes in molecular components such as lipids, proteins and nucleic acids in biological samples, including biological fluids, tissues and cancer cell lines [4][5][6][7][8][9][10][11][12]. The commonly used FTIR method relies mainly on the resonant frequencies of molecular bonds, which give rise to absorption peaks when the transmitted electromagnetic waves are collected by the detector of the interferometer [13]. In contrast to procedures involving dyes and other histological approaches, the FTIR method is rapid and nondestructive and does not require reagents [14].
Multifactorial statistical analysis methods have been widely used with FTIR to identify changes in lipids, proteins, nucleic acids and carbohydrates. Examples include principal component analysis (PCA) [15][16][17] and partial least squares (PLS) [18,19] combined with discriminant analysis (DA), hierarchical cluster analysis (HCA) [20,21], support vector machines (SVMs) [22,23] and random forest (RF) [24]. Smith et al. [25] used the supervised machine learning algorithm RF as a classifier to separate patients into cancer and noncancer categories based on the intensities at the wavenumbers present in their spectra and achieved a sensitivity and specificity of up to 92.8% and 91.5%, respectively. Cameron et al. [26] assessed patients with various brain tumors by using their serum and applied a PLS-DA model to the spectral signatures collected by attenuated-total-reflection FTIR spectroscopy, achieving a sensitivity and specificity greater than 92% in the classification of brain tumor and control patients. Moreover, for metastasis vs. glioblastoma, a linear SVM yielded an 84.3% sensitivity, a 96.2% specificity and a receiver operating characteristic (ROC) curve with an area under the curve (AUC) of 0.9, suggesting a high diagnostic capability [26]. As a pattern-recognition-based approach, the artificial neural network (ANN) has proven effective in the analysis of biological specimens [18]. An ANN consists of many neurons arranged in separate layers and transfers the input signal to the classification output layer through an activation function. A. D. Surowka combined the ANN with synchrotron-radiation-based infrared spectroscopy to study the protein composition of human glial tumors. After the network was optimized and tested, the standard error of prediction (SEP) was found to be lower than 5% [27]. By using FTIR spectroscopy and the ANN, Argov et al. [28] reported that the method could separate an adenomatous polyp from a malignant cell, with classification percentages of 89%, 81% and 83% for normal, adenomatous polyp, and malignant cells, respectively.
To truly reflect the molecular changes that accompany grading, this paper applies FTIR spectroscopy to tissues instead of serum [29][30][31]. Additionally, the samples were collected from different patients diagnosed with either low-grade or high-grade glioma, and the spectroscopic results are statistically significant. A comparison of two supervised machine learning algorithms, PCA-LDA and the ANN, demonstrates that the FTIR-ANN method performs better than PCA-LDA. Thus, FTIR-ANN can be a promising clinical diagnostic alternative to histopathology.

Methods
A total of 9360 spectra were collected from 77 patients with different grades of glioma. The subtypes of low-grade glioma (WHO II) are oligodendroglioma (n = 6) and diffuse astrocytoma (n = 15), with 14 males and 7 females, aged 10 to 63 years old, with an average age of 38.3 years old. High-grade glioma (WHO III/IV) includes anaplastic astrocytoma (n = 15), anaplastic oligodendroglioma (n = 10) and glioblastoma (n = 31), covering 36 males and 20 females with ages ranging from 14 to 69 years old and an average age of 48. 8 (Fig. 1b). In the subsequent dewaxing step, the tissue slides were immersed in xylene at room temperature for 5 min; this step was repeated twice with fresh xylene. Then, the tissue slides were washed and cleared by immersing them in 100% ethanol for 5 min, which was repeated twice with fresh ethanol. In the last step, these tissue slides were allowed to air dry before the IR spectra were collected [32].
The spectra were acquired by FTIR microscopy (Nicolet 6700) at the BL01B beamline of the Shanghai Synchrotron Radiation Facility (SSRF). The absorbance spectra were obtained in transmission mode (Fig. 1a) in the wavenumber range of 800-4000 cm⁻¹ at a resolution of 4 cm⁻¹ with 16 coadded scans. The aperture size was set to 80 × 80 µm with a step size of 80 µm. The background spectrum was obtained on the blank area of a 1 mm thick barium fluoride substrate.
Each patient contributed approximately 140 spectra to the dataset, and the data were collected and processed with the OMNIC 9.2 software. Data preprocessing included automatic baseline correction and amide I (1649 cm⁻¹) normalization (Fig. 1c and d).
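The amide I normalization step can be illustrated with a minimal sketch. It assumes that normalization means dividing each spectrum by its absorbance at the data point nearest 1649 cm⁻¹; the function name `normalize_to_amide_i` and the toy values are hypothetical, and the automatic baseline correction performed in OMNIC is not reproduced here.

```python
def normalize_to_amide_i(wavenumbers, absorbance, reference=1649.0):
    """Scale a spectrum so the point nearest `reference` (cm^-1) equals 1."""
    # Index of the sampled wavenumber closest to the amide I reference band.
    idx = min(range(len(wavenumbers)), key=lambda i: abs(wavenumbers[i] - reference))
    ref_value = absorbance[idx]
    if ref_value == 0:
        raise ValueError("absorbance at the reference band is zero")
    return [a / ref_value for a in absorbance]


# Toy three-point spectrum (hypothetical values, not measured data).
wavenumbers = [1539.0, 1649.0, 1741.0]
absorbance = [0.30, 0.40, 0.10]
normalized = normalize_to_amide_i(wavenumbers, absorbance)
print(normalized)
```

After this scaling, band intensities from different spectra are expressed relative to amide I, which is what makes the later band-ratio comparisons between grades meaningful.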

Data Analysis
Both the lipid range of 2800-3000 cm⁻¹ and the fingerprint region of 800-1800 cm⁻¹ were extracted from the whole spectra, which had been preprocessed with baseline correction and amide I normalization. All of the spectra, covering both high- and low-grade glioma, were randomly divided into a training set (70%) and a test set (30%). The spectral data analysis was performed using the Classification Toolbox (version 5.4) of the Milano Chemometrics and QSAR Research Group in the MATLAB R2020a environment (MathWorks, Natick, USA). In this research, the PCA-LDA and ANN methods were used.
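As a rough illustration of the random 70/30 division, the sketch below shuffles spectrum indices and cuts the list at 70%. The analysis in this work was done in MATLAB; the Python function `split_indices` and the seed are hypothetical stand-ins, and the split is assumed to be made per spectrum, as the text describes.

```python
import random


def split_indices(n_spectra, train_fraction=0.7, seed=0):
    """Randomly assign spectrum indices to a training and a test set."""
    indices = list(range(n_spectra))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_spectra * train_fraction)
    return indices[:n_train], indices[n_train:]


# 9360 spectra in total, as reported for this dataset.
train_idx, test_idx = split_indices(9360)
print(len(train_idx), len(test_idx))
```

With 9360 spectra, this division gives 6552 training and 2808 test spectra, which matches the set sizes reported for the ANN.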
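The PCA-LDA algorithm is a supervised machine learning method that is commonly employed to process IR spectral data. The goal of PCA is to reduce the dimensionality of the data while retaining as much as possible of the variation present in the dataset. In the standard formulation, assume the following original space representation of a mean-centered sample $x$:

$$x = \sum_{i=1}^{N} a_i m_i$$

where $m_1, m_2, m_3, \ldots, m_N$ is the basis of the original $N$-dimensional space and $a_i$ are the coordinates of $x$ in that basis. When only the first $K$ basis vectors are retained, so that $\hat{x} = \sum_{i=1}^{K} a_i m_i$, the information loss is shown below:

$$\varepsilon = E\left[\lVert x - \hat{x} \rVert^{2}\right] = \sum_{i=K+1}^{N} \lambda_i$$

where $\lambda_i$ is the $i$-th largest eigenvalue of the data covariance matrix and $m_i$ is the corresponding eigenvector. Then, $K$ is chosen according to the following criterion, in which $\theta$ is a preset fraction of the variance to be retained:

$$\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \geq \theta$$

The goal of LDA is to find directions along which the classes are best separated, taking into consideration both the within-class and the between-class scatter:

$$J(U) = \frac{\lvert U^{T} S_b U \rvert}{\lvert U^{T} S_w U \rvert}$$

where $U$ is the projection matrix, $S_w$ is the within-class scatter matrix, and $S_b$ is the between-class scatter matrix. Thus, the following can be concluded:

$$S_b u = \gamma S_w u$$

that is, the columns $u$ of $U$ are the generalized eigenvectors of $S_b$ with respect to $S_w$ that have the largest eigenvalues $\gamma$.

In this model, dimension reduction was used for the spectral ranges of 800-1800 cm⁻¹ (2076 dimensions) and 2800-3000 cm⁻¹ (417 dimensions) and a combination of the two ranges (2493 dimensions). Subsequently, the first 16 PCs for LDA were chosen according to the minimum error rate of fivefold Venetian blind cross-validation. Then, the first two principal components (PCs) were displayed in a scatter plot for visualization.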
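The LDA step can be illustrated for the two-class case, where the best-separating direction has the closed form $w = S_w^{-1}(\mu_1 - \mu_0)$. The sketch below is a minimal pure-Python version on two-dimensional toy scores (standing in for the first two PCs); the data, the function names and the midpoint threshold are hypothetical, and the PCA step itself is not reproduced.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter2(points, mu):
    """2x2 scatter matrix of points around their mean."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def fisher_direction(class0, class1):
    """Return the Fisher direction w and a midpoint decision threshold."""
    mu0, mu1 = mean2(class0), mean2(class1)
    s0, s1 = scatter2(class0, mu0), scatter2(class1, mu1)
    # Within-class scatter S_w is the sum of the per-class scatter matrices.
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [dot(inv[0], diff), dot(inv[1], diff)]  # w = S_w^{-1} (mu1 - mu0)
    threshold = 0.5 * (dot(w, mu0) + dot(w, mu1))
    return w, threshold


# Hypothetical PC scores for two well-separated classes.
low_grade = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
high_grade = [(4.0, 4.5), (4.5, 5.0), (5.0, 4.8), (4.2, 5.2)]
w, threshold = fisher_direction(low_grade, high_grade)
predictions = [int(dot(w, p) > threshold) for p in low_grade + high_grade]
print(predictions)
```

Projecting each sample onto `w` reduces the classification to a comparison with a single threshold, which is why LDA is usually applied after PCA has reduced the spectra to a small number of scores.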
As a sophisticated computational model based on the nonlinear processing of neurons (nodes), the ANN has proven effective in the analysis of biological specimens [18]. A general ANN consists of two main operation phases: forward propagation for producing the output results and backward propagation for minimizing the cost value. During error backpropagation, the weights are adjusted repeatedly with an appropriate learning rate until the cost value reaches its minimum (Fig. 2). The optimized network can then be used for prediction. Assume that the number of layers in the ANN is $K$ ($K > 1$) and that the dimensions of the input and output layers are $m_0$ and $m_K$, respectively. In the standard formulation, the output of each layer of the network and the weight update are expressed as follows:

$$a^{(k)} = f^{(k)}\left(W^{(k)} a^{(k-1)}\right), \quad k = 1, \ldots, K$$

$$W^{(k)} \leftarrow W^{(k)} - \eta \frac{\partial L}{\partial W^{(k)}}$$

where $a^{(0)}$ is the input vector, $W^{(k)}$ is the weight matrix of layer $k$, $f^{(k)}$ is an activation function, $\eta$ is the learning rate and $\partial L / \partial W^{(k)}$ is the gradient of the cost $L$ with respect to the weights.
In this model, a four-layer perceptron was used for the classification of high- and low-grade glioma. The number of nodes in the input layer matched the number of data points in the wavenumber range used: 2800-3000 cm⁻¹ (417 nodes), 800-1800 cm⁻¹ (2076 nodes) or the combination of the two ranges (2493 nodes). The first hidden layer consisted of 50 neurons, each of which received all of the nodes from the input layer. The second hidden layer included 5 neurons that were fully connected to the first hidden layer and to the output layer of one neuron. The activation function was a sigmoidal function, and the learning rate was 0.001. The momentum term alpha was set to 0.5 to cancel the opposing components and enhance the reinforcing components at successive positions.
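The architecture described above can be sketched as a forward pass through a 2076-50-5-1 network with sigmoid activations, together with a classical momentum update using the reported learning rate (0.001) and momentum (0.5). This is a minimal illustration, not the toolbox implementation: the bias terms, the weight initialization range, the exact form of the momentum rule and all function names are assumptions, and the input is random rather than a measured spectrum.

```python
import math
import random


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights (one row per neuron) and zero biases (assumed)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Propagate one spectrum through every fully connected layer."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation


def momentum_step(weight, grad, velocity, lr=0.001, alpha=0.5):
    """Classical momentum update for a single weight (assumed form)."""
    velocity = alpha * velocity - lr * grad
    return weight + velocity, velocity


rng = random.Random(0)
sizes = [2076, 50, 5, 1]  # fingerprint input, two hidden layers, one output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

spectrum = [rng.random() for _ in range(sizes[0])]  # stand-in for a real spectrum
output = forward(spectrum, layers)
print(n_params, output)
```

Under these assumptions the fingerprint-range network has 104,111 trainable parameters, almost all of them in the connections between the input layer and the first hidden layer.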
Venetian blind cross-validation (CV) was adopted, and the number of CV groups was five. The model calibration was terminated after 5000 epochs on the training set.
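In Venetian blind cross-validation, every n-th sample is assigned to the same group, so with five groups the first group holds samples 1, 6, 11 and so on. The sketch below shows this interleaved assignment for five groups; the function names are hypothetical and the toolbox's own implementation may order or index the groups differently.

```python
def venetian_blind_folds(n_samples, n_groups=5):
    """Assign sample i to group i mod n_groups (interleaved 'blinds')."""
    return [list(range(g, n_samples, n_groups)) for g in range(n_groups)]


def cv_splits(n_samples, n_groups=5):
    """Yield (training indices, validation indices) for each held-out group."""
    folds = venetian_blind_folds(n_samples, n_groups)
    for held_out in range(n_groups):
        validation = folds[held_out]
        training = sorted(
            i for g, fold in enumerate(folds) if g != held_out for i in fold
        )
        yield training, validation


small_folds = venetian_blind_folds(10, 5)
fold_sizes = [len(val) for _, val in cv_splits(6552, 5)]
print(small_folds, fold_sizes)
```

Because the assignment depends on sample order, the result is sensitive to how the spectra are sorted in the training matrix; consecutive spectra from the same patient would land in different groups.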
To determine the performance of the models, accuracy, specificity and sensitivity were used as the evaluation metrics. Accuracy represents the ratio of correctly assigned samples.
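The three metrics follow directly from the counts of true and false positives and negatives. The sketch below uses the usual definitions (sensitivity as the fraction of positive samples correctly identified, specificity as the fraction of negative samples correctly identified); treating high-grade glioma as the positive class is an assumption, and the labels are toy values.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Toy labels: 1 = high grade (assumed positive class), 0 = low grade.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```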

Results
The high- and low-grade tissues were predefined based on the histopathologic results before the spectra were collected. Fig. 3 shows the H&E staining of a case of glioblastoma tissue diagnosed as WHO grade IV (a-c) and a case of oligodendroglioma diagnosed as WHO grade II (d-f).
Specifically, the left panels present the H&E staining of a tissue slide at 10x magnification (a: high grade, d: low grade), the middle panels show the unstained tissue morphology at 32x magnification (b: high grade, e: low grade), and the right panels show the corresponding IR maps at 1539 cm⁻¹ (c: high grade, f: low grade). In addition, Fig. 3(a) presents numerous necrotic foci and blood vessel proliferation, while Fig. 3(b) shows that the cells are abundant in some areas, the area surrounding the nucleus is hollow and the cytoplasm is transparent, with no clear mitosis.

A total of 4610 and 4750 spectra were collected from 21 low-grade and 56 high-grade glioma patients, respectively. The spectra of low- and high-grade gliomas in the ranges of 2800-3000 cm⁻¹ and 800-1800 cm⁻¹ are shown in Fig. 4(a) and (b). The bands at 2800-3000 cm⁻¹ were attributed to lipid absorbance, and the major bands at 2957 cm⁻¹, 2917 cm⁻¹ and 2849 cm⁻¹ were identified. In detail, 2957 cm⁻¹, 2917 cm⁻¹ and 2849 cm⁻¹ correspond to the CH3 asymmetric stretching vibration and the CH2 asymmetric and symmetric stretching vibrations from lipids, respectively.
The range of 800-1800 cm⁻¹ represents the fingerprint range, and the bands at 1741 cm⁻¹ and 1453 cm⁻¹ are assigned to the carbonyl C=O stretch and the CH2 bending vibration from lipids, respectively. The peak at 1741 cm⁻¹ was observed only in low-grade tissue and was absent in high-grade tissue. Thus, the 1741 cm⁻¹ peak may be a potential marker of disease progression. The band at 1649 cm⁻¹, which served as the internal reference for normalization during preprocessing, is the stretching vibration of the C=O groups of the peptide chains from amide I. Amide II at 1539 cm⁻¹ arises from N-H bending and C-N stretching, and 1390 cm⁻¹ is attributed to the symmetric stretching of COO⁻. Moreover, 1234 cm⁻¹ and 1061 cm⁻¹ are assigned to the asymmetric and symmetric stretching, respectively, of the PO3 2− groups from DNA, RNA and phospholipids. The IR bands and the corresponding assignments [33] are summarized in Table 2. Statistical analysis was applied to the relative intensity of the bands to obtain semiquantitative information on the graded tissues. Fig. 4(c) shows that there is a considerable difference between the IR spectra of the low-grade and high-grade tissues. The intensity of the lipid band at 2917 cm⁻¹ decreased significantly for the high grade relative to the low grade, with a ratio of approximately 0.91. According to Student's t-test analysis, the P value was
Table 2. Absorbance bands observed in the spectra and the corresponding assignments

PCA-LDA
The PCA-LDA method was applied to three wavenumber ranges, which are denoted as ranges 1, 2, and 3. Ranges 1 and 2 correspond to the fingerprint range of 800-1800 cm⁻¹ and the lipid range of 2800-3000 cm⁻¹, respectively, while range 3 combines ranges 1 and 2. In this model, the first 16 PCs, which together accounted for 100% of the variance, were used for the linear discriminant analysis. Fig. 5(a)

ANN
A total of 6552 spectra, including high- and low-grade spectra, were used as the training set to train the ANN with Venetian blind cross-validation, while the remaining 2808 spectra were used as the test set to evaluate the model. The classification outputs based on ranges 1, 2, and 3 are shown in Fig. 6(a)-(c), respectively. As in Fig. 5, dots and stars indicate the training and test sets, respectively. As shown in Fig. 6(a) and (c), the grades could be separated clearly in ranges 1 and 3, and the related accuracy, sensitivity and specificity were all greater than 0.98 (f, j). However, for range 2, the ANN exhibited lower performance, with an accuracy of 0.90, a sensitivity of 0.91, and a specificity of 0.91 (g). As a graph of the true positive rate vs. the false positive rate, the ROC curve represents the performance of the classification model at all classification thresholds. The AUC is the area under the ROC curve integrated from (0, 0) to (1, 1). It is an attractive indicator because it is invariant to scale and to the classification threshold. The AUC ranges from 0 to 1, with higher values indicating better model performance. As shown in Fig. 4(d) and (h), the AUC values for ranges 1 and 3 reach 1, while the value is 0.98 for range 2 (Fig. 4(f)).
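The AUC described above can also be computed without drawing the curve: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise formulation with ties counted as one half; the function name and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy network outputs: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

Because only the ordering of the scores matters, rescaling the network output or moving the decision threshold leaves the AUC unchanged, which is the invariance property noted in the text.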
cancer presents a significant increase in the intensity ratio of amide I, amide II, and nucleic acid [28], and the difference may be attributed to a greater metabolic activity of cancer cells in disease progression [36]. The ratios of the intensities of the 2849 (CH2) and 2917 (CH3) bands are diminished, indicating a large number of methylene groups in malignant tissue. Moreover, the band at 1741 cm⁻¹ could be observed in the low-grade tissue but not in the spectra of malignant tissue. Thus, the 1741 cm⁻¹ band could serve as a marker that distinguishes the grades. In research on the chemical changes in healthy brain tissues and glioblastoma tumor tissues, Depciuch et al. [7] found that, compared with control brain tissue, the FTIR spectra of brain cancer tissue showed a significant difference in chemical composition; hence, they assumed that lipids could be a spectroscopic marker for brain tumors. When the multifactorial statistical analyses of PCA-LDA and the ANN were applied to the ranges of 800-1800 cm⁻¹ and 2800-3000 cm⁻¹ and the combination of the two ranges, the results (Table 3) demonstrated that the ANN algorithm operating within 800-1800 cm⁻¹ achieved the best performance, with accuracy, specificity and sensitivity all reaching 99% on the training set and all above 99% on the test set. This is much superior to PCA-LDA, for which the prediction accuracy, specificity and sensitivity were only 87%, 89% and 86%, respectively.
Therefore, it can be concluded that the infrared range of 800-1800 cm⁻¹ is the major indicator of cancer progression, and the ANN-based method could be established as a promising diagnostic tool in the clinic. Although IR spectroscopy is a promising method for detecting cancer progression, some weaknesses remain, such as the low signal obtained from aqueous samples, in which water absorbs strongly. Furthermore, contamination in a sample will affect the spectral data and lead to incorrect interpretation [32], and the data preprocessing procedure should be standardized to avoid different interpretations of the results. Once the process is standardized and unified, the method can be used as an alternative approach for the clinical grade diagnosis of human glioma.

Conclusions
In this study, we report an alternative workflow that combines Fourier transform infrared (FTIR) spectroscopy and an artificial neural network (ANN) to diagnose the grade of human glioma in a way that is fast (within several minutes, an almost 500-fold gain in efficiency), accurate (overall accuracy, specificity and sensitivity above 99%) and reagent-free. This method is much superior to the common classification method of

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare no conflict of interest for this article.

Funding
There was no funding support.