Phenylalanylphenylalanine as a Diagnostic Biomarker for Lung Cancer and Tuberculosis.

Background: Lung cancer has the highest mortality rate of any cancer worldwide, and pulmonary tuberculosis has a high incidence in China. The two diseases are frequently mistaken for each other because of their similar clinical presentation and atypical imaging findings. Metabolomics can detect diagnostic biomarkers that distinguish lung cancer from other pulmonary diseases and thereby help avoid non-essential treatment. Methods: This cohort study employed non-targeted and targeted metabolomic analysis in participants enrolled from three independent centers. Multivariate statistics, the variable importance in the projection parameter, and receiver operating characteristic (ROC) curves were used to build a model of potential key diagnostic biomarkers of lung cancer, and these biomarkers were subsequently analyzed using targeted metabolomics in the test set. Differences in biomarker levels were analyzed quantitatively, and a support vector machine (SVM) classifier was used to determine the prediction rate of the diagnostic biomarker model. Results: Phenylalanylphenylalanine showed opposite trends in lung cancer and tuberculosis. The areas under the curve in the training and identification sets, 0.8887 (95% CI 0.8064–0.9710, p<0.001, sensitivity 85.45%, specificity 84%) and 0.8149 (95% CI 0.7419–0.8878, p<0.001, sensitivity 73.26%, specificity 78.43%), together with the SVM results (prediction rate 77.94%), showed the feasibility of using phenylalanylphenylalanine as a marker for the differential diagnosis of lung cancer and tuberculosis. Conclusions: Changes in the levels of phenylalanylphenylalanine facilitate differential diagnosis between lung cancer and tuberculosis, thereby potentially reducing the damage caused by misdiagnosis in the clinical setting and enabling early treatment of lung cancer patients.


Background
Lung cancer (LC) is a malignant tumor of the respiratory system and has been the commonest cause of cancer mortality worldwide for nearly half a century [1]. According to the International Agency for Research on Cancer, LC has the highest incidence in the world (1.8 million, 13% of the total) and also ranks first in mortality (1.6 million, 19.4% of the total) [2]. The 5-year survival rate of LC is only 18% [3]. In the past few decades, scientists and medical researchers have made great efforts to control the occurrence of LC [4]. If LC is detected at an early stage, the 5-year survival rate can be increased by 70% [5]. Because the onset of LC is occult, its early clinical symptoms lack specificity; it is therefore difficult to distinguish LC from other lung diseases, and most patients present with advanced-stage LC at diagnosis, which adversely affects treatment and prognosis [6]. LC and atypical tuberculosis (TB) have similar clinical symptoms and examination findings and often show the phenomenon of "different diseases and same shadows" on imaging, as some TB lesions resemble cancer. In some non-oncology hospitals in China, LC is often misdiagnosed, which further increases the difficulty of accurate diagnosis by clinicians [7][8][9][10][11]. 18F-FDG-PET/CT for molecular/atomic imaging is an important tool for the detection, identification, and staging of LC and has broad clinical application, but it is expensive and difficult to popularize widely. Furthermore, in areas where TB is endemic, its false-positive rate is high and its specificity is low [12]. The gold standard for the clinical diagnosis of LC is histopathology of biopsy specimens, but this method is invasive and poses a risk to middle-aged and elderly people. Therefore, safe methods for the early diagnosis of LC are needed for clinical application [13].
To understand LC pathogenesis more comprehensively, improve diagnostic accuracy, and reduce LC mortality, researchers have studied molecular biomarkers in genomics, epigenomics, proteomics, and metabolomics to facilitate the early detection of LC [14][15][16][17]. In recent years, the identification of efficient tumor markers for clinical diagnosis has become a research hotspot. Tumor markers that have been found to be effective and are widely applied clinically include carcinoembryonic antigen, cytokeratin-19 fragment, and neuron-specific enolase [18]; however, their lack of accuracy limits their usefulness for the early diagnosis of LC. Metabolites can reflect human physiological functions and pathological characteristics in great detail, and metabolomics is a powerful tool for analyzing the differences in metabolites between healthy and diseased populations.
The rise of bioinformatics has resulted in the rapid progress of metabolomics in the past decade.
Metabolomics enables the qualitative or quantitative analysis of metabolites and facilitates the identification of unknown metabolites [19,20]. Using metabolomics to find metabolic biomarkers for supplementary diagnosis is an effective way to detect disease onset and may provide new options for clinical diagnosis. Furthermore, metabolomics can monitor changes in key biomolecules in patients, and noninvasive methods based on metabolic markers have been widely used to identify tumor markers for pancreatic [21] and ovarian [22] cancer, which has important implications for establishing early diagnostic methods for LC. This study aimed to identify metabolic markers that differ significantly between LC and TB patients through non-targeted metabonomic analysis, and then to use quantitative targeted metabonomic analysis to find markers that can be applied in actual clinical practice, so as to facilitate a more accurate clinical diagnosis of LC, prevent misdiagnosis, and enable timely and effective treatment of LC patients.

Study Design
In the training set and identification set, we prospectively enrolled TB patients, LC patients, and non-cancer patients as participants for serum-based non-targeted metabolomic analysis to detect specific markers of both diseases, and the identified markers were screened with a binary logistic regression model. In the test set, targeted metabolomics was used to analyze differences among TB, LC, and pneumonia patients and non-cancer controls. An SVM model was used to evaluate the prediction rate of the potential biomarkers in LC and TB and to provide data for clinical research.

Participants
The study participants were patients treated at three centers: LC patients (n=262) from the Second Hospital of Tianjin Medical University; TB patients (n=182) from Tianjin Haihe Hospital; and pneumonia patients (n=30) from the First Affiliated Hospital of the Tianjin University of Chinese Medicine. The non-cancer control group (n=218) comprised patients with noncancerous diseases who were treated at the Second Hospital of Tianjin Medical University during the same period and were confirmed to have no malignant disease based on a review of their records in the hospital's medical database. The prespecified blood sampling and sample preparation were undertaken by well-trained researchers. The study was reviewed and approved by the ethics committee of the Second Hospital of Tianjin Medical University. The age and clinical manifestations of each participant were ascertained, and the basic information of the matched participants is presented in Table 1. This study is registered in the China Clinical Trial Registration Center (registration number ChiCTR2000040666, registered 07 December 2020, http://www.chictr.org.cn/index.aspx), and the registration unit is the Second Hospital of Tianjin Medical University. Inclusion criteria: Patients were diagnosed according to the diagnostic criteria for TB and LC. LC was confirmed pathologically in biopsy specimens or by a clinician on the basis of radiological and clinical findings. According to the 7th edition of the TNM classification, all LC participants had primary lung cancer. TB was diagnosed on the basis of positive sputum examination (sputum smear or culture) and evidence of pulmonary TB on chest X-ray or computed tomography scanning. Eligible patients were 18 to 80 years old, of either sex, with an unimpaired level of consciousness.
This study was conducted in accordance with the Declaration of Helsinki and the ethical rules of good clinical practice and was approved (approval number KY2020K089) by the ethics committee of the Second Hospital of Tianjin Medical University. The overall sample was randomly subdivided into training, identification, and test sets. Non-targeted metabonomics was used in the training and identification sets, and targeted metabonomics was used in the test set (Figure 1).

Sample Preparation
Samples that had been frozen and stored at −80℃ were thawed completely in a refrigerator at 4℃. An aliquot of each sample was pipetted into a centrifuge tube, and the pooled aliquots were mixed and vortexed for 1 min to prepare a quality control (QC) sample. The QC sample contained the biological information of all samples and can reflect the overall sample status [23]; it was used for the methodological investigation with the same preprocessing method as that used for the samples.

Mass Spectrometric Analysis
Using an electrospray ionization (ESI) source, mass spectrometric analysis was performed in positive ionization mode under the following conditions: capillary voltage 2.0 kV, ionization source temperature 100℃, dry gas flow rate 10 mL/min, desolvation flow rate 600 L/D, desolvation temperature 450℃, cone air flow rate 50 L/D, and quadrupole scanning range m/z 50-1000.

Procedures of Methodological Investigation
Instrument precision test: The same QC sample was injected six times consecutively. The data were exported as peak areas, missing values were filled, and the relative standard deviation (RSD) of each ion feature was calculated; features with RSD <30% accounted for >70% of all features.
Method precision test: Six QC samples were prepared in parallel and injected consecutively for analysis. The data were exported as peak areas, missing values were filled, and the RSD of each ion feature was calculated; features with RSD <30% accounted for >70% of all features.
Sample stability test: The same QC sample was injected at six time points across the whole injection sequence. The data were exported as peak areas, missing values were filled, and the RSD of each ion feature was calculated; features with RSD <30% accounted for >70% of all features.
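The RSD check shared by the three tests above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the feature names and peak areas are hypothetical, the sample standard deviation (n − 1 denominator) is assumed, and missing values are assumed to have been filled beforehand.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent: sample SD / mean * 100."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; RSD is undefined")
    return stdev(values) / m * 100.0


def share_below(feature_areas, threshold=30.0):
    """Percentage of ion features whose RSD across injections is below threshold."""
    rsds = [rsd_percent(areas) for areas in feature_areas.values()]
    passing = sum(1 for r in rsds if r < threshold)
    return 100.0 * passing / len(rsds)


# Hypothetical peak areas for four ion features over six QC injections.
qc_peak_areas = {
    "feature_a": [100, 102, 98, 101, 99, 100],
    "feature_b": [50, 80, 20, 65, 35, 90],   # deliberately unstable feature
    "feature_c": [200, 210, 190, 205, 195, 200],
    "feature_d": [10, 11, 9, 10, 10, 10],
}

share = share_below(qc_peak_areas)
print(f"Features with RSD <30%: {share:.2f}%")
```

With real data, the dictionary would hold one list of six peak areas per detected ion feature, and the resulting share would be compared with the acceptance level stated in the text.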
Statistical Analysis
UPLC-Q-TOF/MS was used to analyze the metabolomic profile of clinical serum samples in the non-cancer control (NC), TB, and LC groups. Data collected by the instrument were processed with the MassLynx (version 4.1) data processing system (software parameters: mass error 0.01 Da; retention time error 0.5 min). The intensity of each ion was normalized to the total number of ions, and the resulting data comprised retention time, m/z value, and peak area. These data were imported into SIMCA-P version 14.1 (Umetrics, Sweden) for multivariate statistical analysis. Unsupervised principal component analysis (PCA) and supervised partial least squares discriminant analysis (PLS-DA) models were established to identify potential discriminant variables [24]. The PCA model was used to eliminate outlier samples. Metabolic ions with a variable importance in the projection (VIP) value >1 in the PLS-DA model were then subjected to a t-test, and those with p<0.05 were retained as differential small-molecule metabolites. We searched HMDB (http://www.hmdb.ca/) by m/z value and preliminarily identified the differential small-molecule metabolites based on fragment information.
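Two of the steps in this paragraph, normalization to the total ion signal and screening by VIP >1 and p<0.05, reduce to simple arithmetic and filtering. The sketch below assumes that VIP values and p-values have already been exported from the modeling software; the feature labels and all numbers are hypothetical and serve only to show the logic.

```python
def normalize_to_total(peak_areas):
    """Divide each ion's peak area by the summed peak area of the sample."""
    total = sum(peak_areas.values())
    if total <= 0:
        raise ValueError("total ion signal must be positive")
    return {feature: area / total for feature, area in peak_areas.items()}


def select_candidates(features, vip_cutoff=1.0, alpha=0.05):
    """Keep features with VIP > vip_cutoff and t-test p-value < alpha."""
    return [
        f["name"]
        for f in features
        if f["vip"] > vip_cutoff and f["p"] < alpha
    ]


# Hypothetical sample: feature label (m/z@retention time) -> peak area.
sample = {"313.15@4.2": 500.0, "166.09@1.8": 1500.0, "104.11@0.9": 2000.0}
normalized = normalize_to_total(sample)

# Hypothetical VIP and p-values, as they might be exported after PLS-DA.
feature_table = [
    {"name": "313.15@4.2", "vip": 1.8, "p": 0.001},
    {"name": "166.09@1.8", "vip": 1.3, "p": 0.20},   # VIP passes, p fails
    {"name": "104.11@0.9", "vip": 0.7, "p": 0.01},   # p passes, VIP fails
]
candidates = select_candidates(feature_table)
print(normalized, candidates)
```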

Targeted Metabolomic Analysis
Sample preparation: Samples frozen and stored at −80℃ were thawed completely in a refrigerator at 4℃. A 5 μL aliquot of sample was transferred to an EP tube, 995 μL of methanol was added, and the mixture was vortexed for 1 min, allowed to stand for 5 min, and centrifuged at 13,000 rpm for 10 min; the supernatant was collected for injection analysis. For the standard solutions, an appropriate amount of methanol was added to the weighed mixture to prepare a dilution series of 250, 100, 50, 25, 10, 5, 2.5, 1, and 0.5 ng/mL.
Conditions of mass spectrometry analysis: The source voltage was ES+ 3.00 kV, the source temperature was 400℃, the gas flow rate was 700 L/h, and the cone gas flow rate was 50 L/h. The ion pair of phenylalanylphenylalanine was 313.1/119.9, with a cone voltage of 30 V and a collision energy of 18 V.
Methodology investigation: The standard curve was measured every day to determine the linearity and the lower limit of quantitation.
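The way a daily standard curve can be turned into a concentration is sketched below. The concentrations are the standard series listed above; the instrument responses are hypothetical (generated from an exact line), an unweighted least-squares fit is assumed, and the 200-fold factor follows from diluting 5 μL of serum with 995 μL of methanol. Whether the authors weighted the regression or reported extract or serum concentrations is not stated, so these choices are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Standard series from the text (ng/mL); responses are hypothetical.
standards = [250, 100, 50, 25, 10, 5, 2.5, 1, 0.5]
responses = [2.0 * c + 1.0 for c in standards]

slope, intercept, r_squared = fit_line(standards, responses)

LLOQ = 0.5                           # ng/mL, lowest standard in the series
DILUTION_FACTOR = (5 + 995) / 5      # 5 uL serum + 995 uL methanol -> 200-fold


def extract_concentration(response):
    """Back-calculate the extract concentration; None if below the LLOQ."""
    conc = (response - intercept) / slope
    return conc if conc >= LLOQ else None


def serum_concentration(response):
    """Scale the extract concentration back to undiluted serum (assumed)."""
    conc = extract_concentration(response)
    return None if conc is None else conc * DILUTION_FACTOR


print(slope, intercept, r_squared, serum_concentration(41.0))
```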

Non-targeted Metabonomics Methodology
Instrument precision test: Features with RSD <30% accounted for 86.27% (>80%). Method precision test: Features with RSD <30% accounted for 80.88% (>80%). Sample stability test: Features with RSD <30% accounted for 73.88% (>70%). These verification results show that the instrument precision, method repeatability, and sample stability all met the requirements of metabolomic research. The base peak intensity diagram of serum samples in positive-ion mode is shown in Figure 2.

Metabolic Global Analysis
First, we used Origin 2021 to conduct a global metabolomic analysis with a volcano map, which is well suited to describing the overall distribution of the substances. The volcano map was created with the log2 fold-change as the abscissa and −log10(p-value) as the ordinate; the threshold was set to 1.2, and each point represents a substance. The larger the ordinate, the smaller the p-value and the more significant the difference. The farther the abscissa lies toward either side, the greater the upregulation or downregulation of the substance (Figure 3).
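The coordinates of a point on such a volcano map can be computed as in the following sketch. It assumes that the threshold of 1.2 refers to the fold change (applied symmetrically as ≥1.2 or ≤1/1.2) and that p<0.05 is the significance cut-off; neither assumption is confirmed by the text, and the input values are hypothetical.

```python
import math


def volcano_point(mean_case, mean_control, p_value):
    """Return (log2 fold-change, -log10 p-value) for one substance."""
    fold_change = mean_case / mean_control
    return math.log2(fold_change), -math.log10(p_value)


def classify(mean_case, mean_control, p_value, fc_threshold=1.2, alpha=0.05):
    """Label a substance as up, down, or not significant (assumed cut-offs)."""
    fold_change = mean_case / mean_control
    if p_value >= alpha:
        return "not significant"
    if fold_change >= fc_threshold:
        return "up"
    if fold_change <= 1.0 / fc_threshold:
        return "down"
    return "not significant"


# Hypothetical group means and p-values for three substances.
x, y = volcano_point(2.0, 1.0, 0.01)
labels = [
    classify(2.92, 1.0, 0.001),
    classify(0.80, 1.0, 0.001),
    classify(1.10, 1.0, 0.001),
]
print(x, y, labels)
```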

Multivariate Statistical Analysis
Serum sample information was collected by UPLC-Q-TOF/MS, and multivariate pattern recognition was used to process the data; SIMCA-P was used to perform multivariate statistical analysis of the complex experimental data through dimensionality reduction. First, we performed unsupervised principal component analysis (PCA) on the positive-ion mode data for the LC, NC, and TB groups. PCA was used to assess the separation of samples and to eliminate outliers, and the differential metabolites between groups were determined by partial least squares discriminant analysis (PLS-DA). The LC group, the TB group, and the control group each showed clear classification and aggregation on the scatter diagram (Figure 4), indicating that the metabolic pattern of the disease groups differs considerably from that of the control group. In this study, the PLS-DA models for NC-TB and NC-LC showed no overfitting and were therefore reliable and predictive (Figure 4).
Markers were found in the training set and verified in the identification set; the validation criterion was that a marker must be reproduced in the identification set with a change trend consistent with that in the training set. We finally determined 30 biomarkers, which are listed in Table S1. Among them, L-phenylalanine, phenylalanylphenylalanine, and 1-methylinosine were compared with standard products. Phenylalanylphenylalanine (98%) and 1-methylcreatinine (≥99%) were purchased from Ron reagent (Shanghai Yien Chemical Technology Co., Ltd.), and L-phenylalanine (≥98%) was purchased from Shanghai Yuanye Biotechnology Co., Ltd. MetPA (https://www.metaboanalyst.ca/) is a free online tool for metabolic pathway analysis that is often used to analyze the pathways related to differential metabolic biomarkers [25]. The results of the MetPA analysis are shown in Figure 5 and mainly involve porphyrin and chlorophyll metabolism; glycerophospholipid metabolism; phenylalanine, tyrosine, and tryptophan biosynthesis; and phenylalanine metabolism, among others. Based on these results, we drew a network diagram (Figure 6) to visualize the interactions among the metabolites.

Biomarker Analysis
Cluster analysis
To observe the relative changes in biomarker levels across groups more intuitively and concretely, we used a hierarchical cluster analysis heat map to analyze the key biomarkers (eight differential markers with common significant changes in the LC and TB groups). In Figure 7, the horizontal axis represents the samples and the vertical axis represents the biomarkers. The color blocks reflect the value of the variable: the higher the content, the darker the color block. The closer the bifurcation on the left vertical axis, the higher the similarity of the substances; that is, these metabolites may originate from the same substance. Figure 7 shows that, compared with the non-cancer controls, the contents of these metabolites differed significantly in the LC and TB groups. Among them, L-phenylalanine and phenylalanylphenylalanine showed a significant upward trend in the LC group and a significant downward trend in the TB group, and their fold changes were significant.
In the identification set, both L-phenylalanine and phenylalanylphenylalanine showed change trends similar to those in the training set. The fold change of L-phenylalanine in the identification set differed slightly from that in the training set, which may be related to the larger sample size. Nevertheless, the clustering shows that L-phenylalanine and phenylalanylphenylalanine are highly similar. Therefore, we carried L-phenylalanine and phenylalanylphenylalanine forward as potential biomarkers for further analysis to determine the final biomarker.

Evaluation of Clinical Performance
To explore whether these markers have diagnostic significance, we used GraphPad 8.3.0 to conduct receiver operating characteristic (ROC) curve analysis, verify the diagnostic ability of the screened substances, and establish an effective diagnostic model. Sensitivity was plotted on the ordinate and 1 − specificity on the abscissa, so that the relationship between the sensitivity and specificity of each marker can be observed visually and its diagnostic efficacy is summarized by the area under the curve (AUC). The metabolite information of the two biomarkers screened in LC and TB was used to establish binary logistic regression models, and the AUC of each biomarker was obtained. In the training and identification sets (Figure 8), the AUC of phenylalanylphenylalanine was 0.8887 (95% CI 0.8064-0.9710, p<0.001, sensitivity 85.45%, specificity 84%) and 0.8149 (95% CI 0.7419-0.8878, p<0.001, sensitivity 73.26%, specificity 78.43%), respectively. Both values exceed 0.8, indicating that phenylalanylphenylalanine has the greater diagnostic significance. The AUC of L-phenylalanine was 0.8615 (95% CI 0.7771-0.9459, p<0.001, sensitivity 81.48%, specificity 88%) and 0.5889 (95% CI 0.4927-0.6851, p=0.50, sensitivity 19.77%, specificity 98.04%), respectively. Based on this analysis, we selected phenylalanylphenylalanine for the diagnostic model and as the potential diagnostic biomarker for the subsequent analysis.
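For a single marker, the AUC and a sensitivity/specificity pair can be obtained directly from the measured values, as the following sketch shows. It is not the GraphPad procedure: the AUC is computed with the rank-based (Mann-Whitney) formulation, the cut-off is chosen by maximizing the Youden index, which the text does not specify, and the marker levels are hypothetical. For one marker with a positive coefficient, a logistic regression score is a monotonic function of the marker level, so the AUC is the same.

```python
def roc_auc(positives, negatives):
    """AUC as the probability that a positive scores above a negative."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_cutoff(positives, negatives):
    """Cut-off maximizing Youden's J = sensitivity + specificity - 1.

    A sample is called positive when its value is >= the cut-off.
    """
    best = None
    for cutoff in sorted(set(positives) | set(negatives)):
        sensitivity = sum(p >= cutoff for p in positives) / len(positives)
        specificity = sum(n < cutoff for n in negatives) / len(negatives)
        j = sensitivity + specificity - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sensitivity, specificity)
    return best[1], best[2], best[3]


# Hypothetical marker levels (arbitrary units); LC is the positive class.
lc_levels = [3.1, 2.8, 2.5, 1.9, 3.4, 2.2]
tb_levels = [1.0, 1.4, 2.0, 0.8, 1.2]

auc = roc_auc(lc_levels, tb_levels)
cutoff, sensitivity, specificity = best_cutoff(lc_levels, tb_levels)
print(auc, cutoff, sensitivity, specificity)
```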

Targeted Metabolomics Analysis
To analyze the marker level in blood and its trend among groups more accurately, 358 patients were analyzed using targeted metabolomics, which offers higher specificity and sensitivity. The test set included 89 TB patients, 119 LC patients, 30 pneumonia (P) patients, and 120 non-cancer controls. The 30 pneumonia patients were added to test the specificity of the marker in the differential diagnosis of LC.
The results for standard and serum samples are shown in Figure 9. The lower limit of quantification (LLOQ) of phenylalanylphenylalanine was 0.5 ng/mL. The results of the standard curve analysis are shown in Table 2. The marker levels in each group are shown in Table 3 and Figure 10.

Support Vector Machine Model Prediction
A support vector machine (SVM) prediction model was established in Matlab R2013a to judge the feasibility of the prediction. The samples of the test set were divided into two groups: Group 0 (TB) and Group 1 (LC). The marker content in the blood samples of each group was used as the input variable of the SVM to verify the accuracy and specificity of the marker. Two-thirds of the samples were used as the model's training set, and the remaining one-third was used as the test set to determine the model's predictive accuracy. The SVM model parameters obtained by cross-validation for phenylalanylphenylalanine are shown in Figure 11. The accuracy of the model reached 83.57%, which indicates that the model is stable and reliable. The prediction rate reached 77.94%, which indicates that the biomarker diagnostic model has good predictive accuracy and stability.
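The workflow described here (one input variable, a two-thirds/one-third split, training, and scoring on the held-out third) can be illustrated with a small self-contained sketch. It is not the Matlab model: the kernel and parameters used by the authors are not stated, so a linear SVM trained by subgradient descent on the hinge loss is swapped in, the split is a fixed every-third-sample rule rather than a random one, and the marker levels are hypothetical and cleanly separated. Labels −1 and +1 stand for Group 0 (TB) and Group 1 (LC).

```python
def split_two_thirds(samples):
    """Deterministic split: every third sample is held out for testing."""
    train = [s for i, s in enumerate(samples) if i % 3 != 2]
    test = [s for i, s in enumerate(samples) if i % 3 == 2]
    return train, test


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on hinge loss with L2 regularization."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = lam * w, 0.0
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:      # sample inside the margin
                grad_w -= y * x / n
                grad_b -= y / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (LC) or -1 (TB) for a centered marker level x."""
    return 1 if w * x + b >= 0 else -1


# Hypothetical marker levels: (value, label), -1 = TB (Group 0), +1 = LC (Group 1).
samples = [
    (1.0, -1), (3.0, 1), (1.1, -1),
    (1.25, -1), (3.25, 1), (2.9, 1),
    (0.75, -1), (2.75, 1), (0.9, -1),
    (1.0, -1), (3.0, 1), (3.1, 1),
]

train, test = split_two_thirds(samples)
center = sum(x for x, _ in train) / len(train)   # center on the training mean
w, b = train_linear_svm([x - center for x, _ in train], [y for _, y in train])

correct = sum(predict(w, b, x - center) == y for x, y in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2%}")
```

On real, overlapping marker distributions the held-out accuracy would be below 100%, as with the prediction rate reported in the text.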

Discussion
This study describes a comprehensive metabolomic assessment of LC and TB patients from three independent centers, who were divided into training, identification, and test sets. A non-targeted metabolomic evaluation based on UPLC-Q-TOF/MS first explored the small-molecule metabolites in samples from LC and TB patients and found significant differences in the metabolic phenotype between the two groups. Through discovery in the training set and verification in the identification set, eight metabolic markers with significant differences were screened out. Among them, L-phenylalanine and phenylalanylphenylalanine showed opposite trends in LC (upward) and TB (downward). Therefore, these two may be ideal biomarkers for distinguishing LC from TB. In the training set and identification set, the AUCs of phenylalanylphenylalanine were 0.8887 and 0.8149 and those of L-phenylalanine were 0.8615 and 0.5889, respectively. Through the binary logistic regression model of the identification set, we selected phenylalanylphenylalanine as a potential diagnostic marker. The test set showed a downward trend for phenylalanylphenylalanine in TB patients (fold change 0.8) and an upward trend in LC patients (fold change 2.92). Pneumonia patients showed the same trend as TB patients and the opposite trend to LC patients (fold change 0.69), indicating good specificity for LC. SVM model analysis showed that the prediction rate for TB and LC was 77.94%, which suggests that phenylalanylphenylalanine has good predictive value for the differential diagnosis between LC and pulmonary TB.
At present, the diagnosis of malignant tumors relies mainly on invasive biopsy. With the continuous in-depth development of modern bioinformatics technology, "noninvasive blood tests" have emerged as the need of the hour, and researchers have successively explored and discovered various tumor markers. Metabolic markers are specifically expressed in the blood of cancer patients, often before the onset of clinical symptoms and imaging manifestations [26]. However, there is currently no specific marker for LC, and the development of noninvasive, accurate biomarkers of LC risk could facilitate early diagnosis and prolong survival.
Amino acids, the structural units of protein synthesis in cells, provide energy for cell growth while maintaining normal body metabolism and have long been considered potential biomarkers of malignant diseases. Although glutamine is a major nutrient for tumor cell growth, studies have found that its ability to replenish the intermediates of the tricarboxylic acid cycle is used mainly for the biosynthesis of amino acids. For example, aspartic acid and serine not only contribute a large number of carbon atoms to the formation of new cells but also serve as constituents of proteins. Therefore, amino acids are considered the most important source of nutrition for tumor cells [27]. Phenylalanine metabolism is one of the metabolic pathways that is abnormal in various cancers, and changes in blood phenylalanine levels have attracted attention. Some clinical studies have shown that, compared with healthy controls, patients with non-small cell lung cancer have higher levels of phenylalanine and ornithine [29]. One study found that phenylalanine concentrations decreased in patients with oral cancer, whereas another found an increasing trend [30,31]. Phenylalanine hydroxylase catalyzes the conversion of phenylalanine to tyrosine, and the activity of this enzyme is impaired in inflammatory states and in malignant diseases [32,33]. A search of the Kyoto Encyclopedia of Genes and Genomes database showed that phenylalanine may be transformed into phenylpyruvate by aspartate transaminase, histidine phosphate transaminase, phenylalanine (histidine) transaminase, tyrosine transaminase, aromatic amino acid transaminase, phenylalanine dehydrogenase, and other enzymes; phenylpyruvate is closely related to the tricarboxylic acid cycle [34].
As a glycogenic amino acid, phenylalanine can yield glucose through the tricarboxylic acid cycle and gluconeogenesis, which can serve as an energy source during the rapid proliferation of cancer cells. We found that L-phenylalanine and phenylalanylphenylalanine levels increased in LC patients; we therefore speculate that these metabolites might be related to the tricarboxylic acid cycle and may provide the energy needed for the growth of cancer cells. A dipeptide is a short-lived intermediate produced when protein is hydrolyzed through an amino acid degradation pathway; it is also the product of the dehydration and condensation of amino acids and has strong physiological activity. Dipeptide metabolites are products of incomplete breakdown in protein catabolism that may be related to the removal of nutrients as well as to intracellular and extracellular protein catabolism; thus, they may contribute to tumor growth [35]. Glutamylleucine, a dipeptide, plays a role in the development of the metabolic syndrome and in colorectal cancer [36] and reflects the oxidative stress and inflammatory state of the metabolic syndrome. Phenylalanine has been reported in thyroid cancer [37], where it showed a downward trend; it was also found in this study, but with an upward trend in the LC group, which supports the specificity for LC of the markers we found. In this study, serum phenylalanylphenylalanine levels increased significantly in LC patients but decreased significantly in TB patients, possibly because of the inflammatory reaction of LC and the resultant stress in the body. However, the role of phenylalanylphenylalanine in LC and its related mechanisms remain to be explored.
An advantage of this study is the large number of samples. The markers were analyzed qualitatively and quantitatively with non-targeted and targeted metabonomics, and the biomarkers were found to be stable and reliable. However, the specificity and sensitivity of these biomarkers need to be determined in long-term follow-up studies, and more large-scale, multicenter, and even multi-region studies are needed before clinical application.

Conclusions
Using non-targeted metabolomics based on UPLC-Q/TOF-MS, we found and verified the metabolic marker phenylalanylphenylalanine, which distinguishes LC from TB patients and was selected as the optimal diagnostic biomarker by a binary logistic regression model. It showed opposite trends in LC and TB patients. In the test set, the quantitative results of targeted metabolomics were consistent with those of the previous experimental stage, and the fold changes indicated a large and significant difference. The SVM machine-learning algorithm showed that phenylalanylphenylalanine facilitates differential diagnosis between LC and TB with a fairly good classification prediction rate, which may reduce the irreversible injury caused by misdiagnosis in the clinical setting and enable early treatment of LC patients.

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analysed during this study are included in this published article and its supplementary information files.

Competing interests
The authors declare that they have no competing interests.

[Figure: The quality control sample positive-ion BPI chart.]