Based on network pharmacology, using methods such as molecular docking and gene difference analysis, small molecule compounds for the treatment of lung adenocarcinoma and lung squamous cell carcinoma were screened in saussurea involucrata, and new therapeutic targets were discovered

Purpose: Based on systemic pharmacology and network pharmacology, to evaluate the therapeutic effect of saussurea involucrata (SAIN) on lung adenocarcinoma and lung squamous cell carcinoma (LUAD&LUSC) through clinical sample genetic difference analysis and compound-target molecular docking, and discover new The target of prevention or treatment of LUAD&LUSC. Materials and methods: Using the TCM System Pharmacology Database (TCMSP) as the starting point for preliminary selection of ingredients and targets (OB ≥ 30%, DL ≥ 0.18, n=9), with the GDC database as the end point, using Cytoscape 3.8, TBtools 1.082, AutoDock 4.2.6, R 4.0.4, PyMol and other tools have conducted a preliminary screening of the ingredients and targets of SAIN. In order to further screen the effective ingredients and targets, we used clinical samples from LUAD and LUSC from TCGA and GEPIA to perform genetic difference analysis (n=6), and perform biological process (BP) analysis (FDR) on these targets. ≤ 0.05, n=6), KEGG pathway analysis (FDR ≤ 0.05, n=6), protein interaction network (PPI) analysis (n=6) and compounds-targets-pathways network analysis (n=6), obtain biological processes, disease pathways and various compounds regulated by targets-the relationship between targets and pathways. Through the precise molecular docking of ingredients and targets, we further screened the targets of the effective ingredients of SAIN (anity ≤ -7.0 kcal/mol, H-Bond dist ≤ 3.0, n=6) and visualized the data, Then these targets were veried by using PSORT(cid:0), CELLO and BUSCA databases for subcellular localization prediction (n=6). Finally, use the large amount of TCGA clinical data provided by Cbiportal for the prognostic survival analysis of LUAD and LUSC for the genes obtained through the screening. , And consult a large number of documents to verify the results. Results: After screening, comparing, analyzing and verifying a series of data, it is nally conrmed that there are three main active ingredients in SAIN. They are Quercetin, Luteolin and Kaempferol, which mainly act on 6 protein targets. The target mainly regulates 20 signal pathways including Pathways in cancer, Transcriptional misregulation in cancer, EGFR tyrosine kinase inhibitor resistance, Adherens junction, IL-17 signaling pathway, Melanoma, Non-small cell lung cancer and MicroRNAs in cancer. The preventive or therapeutic effects of LUSC. Conclusion: There are three active compounds of Q, L and K in SAIN, which play a role in the treatment and prevention of NSCLC by directly or indirectly regulating the expression of genes such as MMP1, MMP3 and EGFR.


Screening of active ingredients with drug-like properties in SAIN
TCMSP (https://tcmsp-e.com/) database is one of the most commonly used databases for screening of active ingredients in traditional Chinese medicine. Its advantage is that the database provides oral bioavailability (OB) and drug similarity (DL) Parameters. These two parameters play an important role in the evaluation of drug e cacy. Only when OB exceeds a certain value (OB≥30%) and DL is within a certain range (DL≥0.18), can it effectively re ect the effect of a certain compound drug-like properties.
Among them, the calculation of the DL value of the system follows the formula (1). In order to obtain the target drug, the DL value will be only when the lead compound is chemically easy to be synthesized and has the properties of ADME (absorption, distribution, metabolism, excretion) kick in. The x in the formula represents the descriptive index of all ingredients in SAIN, and y represents the average drug similarity index of the ingredient from the DrugBank database (https://www.drugbank.ca/).
The fat-water partition coe cient logP_(O⁄W) refers to the partition coe cient of the drug in the noctanol-water system. It is widely used as a measure of the hydrophobicity of chemical compounds. The main driving force of the bio lm composed of layers controls the compounds-targets binding effect, logP_(O⁄W) follows the following formula (2): In the formula, CO represents the equilibrium concentration of the drug in the oil phase, and CW represents the equilibrium concentration of the drug in the water phase. The value of logP_ (O⁄W) indicates the hydrophobicity of the solute. The larger the logP_(O⁄W), the stronger the hydrophobicity, and vice versa, the stronger the hydrophilicity. Therefore, we use logP_(O⁄W)≤5 as a screening criterion.

Target prediction and construction of active compounds-targets network
The structural formula of the SAIN active ingredient obtained above is drawn on the STP database (http://www.swisstargetprediction.ch) for target prediction, and the prediction result is combined with the NSCLC target retrieved in the TCMSP database to obtain the nal required Disease target information. Use Cytoscape 3.8.0 software (7) to visually construct a compounds-targets network (C-T network) for the above active compounds and targets to obtain the C-T network relationship diagram we need.
So far, we have completed the preliminary screening of the active ingredients related to lung cancer in SAIN.

Gene difference analysis and LUAD&LUSC gene TPM data analysis
In order to further screen the active ingredients that we have already screened and their targets, we obtained a large number of clinical case samples of LUAD and LUSC from the GEO database (https://www.ncbi.nlm.nih.gov/geo/), through the analysis of gene differences of tens of thousands of samples, samples are randomly selected to draw a heat map of related gene differences, and a preliminary screening of targets is carried out. In order to screen the target for a second time to make the result more clear, we used the gene ID of the target and used the TCGA data provided by the GEPIA database (http://gepia.cancer-pku.cn/) as the basis to draw Quantitative scatter plots of transcripts per million (TPM) of the screened genes further screened out the effective compounds and targets of SAIN for the treatment of LUAD and LUSC.

Construction of protein interaction network, compounds-targets-pathways network and gene enrichment analysis
In order to evaluate the interaction between the targets screened above, the protein interaction network and the active compounds-targets-pathways network were constructed. In order to explore the biological processes and pathways that each target participates in the body, the STRING database (https://www.string-db.org) retrieved the target gene data, we performed biological process (BP) analysis in GO (gene ontology) analysis and KEGG (kyoto encyclopedia of genes and genomes) enrichment analysis. Among them, the FDR of GO analysis is less than or equal to 0.05, and the FDR of KEGG enrichment analysis is less than or equal to 0.05, both of which meet the requirements and statistical signi cance of signi cant gene enrichment in vivo.
The FDR value is to correct the value of P, and the results obtained by using it are more accurate. Therefore, this step aims to obtain the biological process and in vivo pathways of the target action, which provides a basis for subsequent research.

Molecular docking and subcellular localization prediction
Through AutoDock 4.2.6 and PyMol and other tools for molecular docking, the binding energy of the compound and the target is rst used to verify whether the effective compounds of SAIN are reliable in the treatment of LUAD&LUSC, and further exclude the compounds of SAIN that have poor effects on LUAD&LUSC.
Molecular docking (8) is a method of drug design based on the characteristics of the receptor and the interaction between the receptor and the drug molecule. A theoretical simulation method that mainly studies the interaction between molecules (such as ligands and receptors), and predicts its binding mode and a nity. In recent years, molecular docking methods have become an important technology in the eld of computer-aided drug research (9). Subcellular localization prediction is a popular subcellular localization method in recent years. It uses existing data to create a database of the sequence relationship between various genes and their regulatory target sequences and subcellular structures, which can accurately predict the target protein The location of various organelles and cell membranes has brought great convenience to scienti c researchers. Currently commonly used subcellular location prediction tools are (1) PSORT (https://psort.hgc.jp/form2.html), the database uses k-Nearest Neighbor (K-NN) algorithm, K-NN is data mining Compared with the commonly used learning algorithms in machine learning, K-NN has a wide range of applications. PSORT II can identify a classic nuclear localization signal (cNLS) sequence, and its accuracy is very high when the sample size is large enough (10,11 ,12); (2) CELLO (http://cello.life.nctu.edu.tw/), the database uses the support vector machine recursive feature elimination algorithm (SVM-RFE), and the SVM-RFE algorithm is trained on the basis of SVM The weight vector w generated at time is used to construct the sorting coe cient, and each iteration removes a characteristic attribute with the smallest sorting coe cient, and nally obtains the descending order of all the characteristic attributes (13,14); (3) BUSCA (http://busca. biocomp.unibo.it/), the database uses betaware algorithms to solve the detection of transmembrane beta-barrels (TMBB) in the proteome and the prediction of its topology (15,16).
Combined with the analysis of biological processes, PSORT , CELLO and BUSCA databases were used to predict subcellular localization, and the prediction results of the three tools were compared. We selected the overlapping parts to obtain the main location of the target protein in the cell.

Survival analysis of patient prognosis
In medical research, in order to evaluate the e cacy of a certain drug and understand survival data such as survival time of patients after surgery, the analysis of these survival data is called survival analysis.
After a series of screening and analysis of the compounds and targets of SAIN, we performed prognostic overall survival analysis (17) on the gene IDs of the targets of the effective compounds of SAIN (17) to explore when the drugs act on these targets, the length of time the patient can survive over time.

Screening results of SAIN active ingredients
We collected a total of 55 known active ingredients of SAIN from the TCMSP database, including alkaloids, lipids, avonoids and avonoids, etc. We selected 55 from 55 under the screening conditions of OB ≥ 30%&DL ≥ 0.18&logP o/w ≤5, six effective drug-like ingredients were screened out of the SAIN, namely azin, quercetin, kaempferol, luteolin, alloisoimperatorin and hispidulin. See (Table 1) for the screening details of these six ingredients.

Target screening and compounds-targets network analysis results
By merging the 1023 targets screened from the STP database and the 764 LUAD and LUSC targets screened from the TCMSP database, we obtained a total of 9 targets with six compounds effects. These 9 protein targets were combined with The six SAIN compounds interact, and Cytoscape is used to construct a compounds-targets network, and a C-T network with 15 nodes and 24 edges as the characteristics is obtained (Fig. 2).
Through the C-T network, it can be found that the six compounds of SAIN that have strong and weak effects on the target are ranked quercetin (Q), luteolin (L), kaempferol (K) and hispidulin (H), azin (F), alloisoimperatorin (A), The size order of the 9 targets and compounds is prostaglandin-endoperoxide synthase 2 (PTGS2), 90kDa heat shock protein ΑA1 (HSP90AA1), Matrix metalloproteinase 1 (MMP1), Epidermal growth factor receptor (EGFR), etc.

Gene difference analysis and TPM data analysis results
In order to ensure the accuracy of the experimental data, we further screened the compounds and targets. Through the GEO database and R 4.0.4, we obtained a total of 22,589 LUAD and 22697 LUSC-related disease genes through the GDC database. (Https://portal.gdc.cancer.gov/) TCGA data, we obtained 535 LUAD samples and 59 LUAD control samples, 502 LUSC samples and 49 LUSC control samples. It is undeniable that these data all have high reference value. However, in order to make the experiment proceed smoothly, we aim at the 9 target points studied. Through the P value range (P ≤ 0.05) of these data and patient samples, the target points are determined by R. Combined with clinical samples, the data was screened, and the gene expression matrices of LUAD&Control and LUSC&Control corresponding to 9 targets were obtained. We randomly selected 15 samples from each of the two groups in the two groups, namely the LUAD&Control group, the LUSC&Control group Thirty clinical samples were randomly selected, and TBtools v1.082 (18) software was used to draw genetic difference heat maps of LUAD ( Fig. 3b) and LUSC ( Fig. 3a The direction of the ordinate is "←"). In the gure, since 9 targets can act on both types of lung cancer, we cluster genes. Since our samples are divided into two groups (experimental group and control group), in order to make the samples evenly distributed, and on the premise of not affecting the experimental results, we do not allow the samples to cluster.
According to the results of genetic difference analysis, we can nd that HSP90AA1 has the least obvious effect on LUAD and LUSC, so we discard it. Through the GEPIA database, input the gene ID, and select the analysis of variance (ANOVA) for the difference method. The Log 2 FC value and Q value are 1 and 0.01 respectively. The gene IDs of the remaining 8 targets are analyzed by TPM. The total number of samples involved is LUAD (T = 483, N = 347), LUSC (T = 486, N = 338), where T stands for tumor patients and N stands for normal control group (Fig. 4).
Combining the corresponding relationship between the binding compounds and the target, the results of gene difference analysis and the results of TPM analysis, we found that the expression and changes of MMP2 and PTGS2 in the T group were consistent with those in the N group, so they were discarded.
Therefore, the SAIN compounds that play a major role in LUAD and LUSC are Q, L, and K, and the genes that play a major role in LUAD and LUSC are MMP1, MMP3, EGFR, MET, NQO1 and MPO.
For the analysis of six genes, we came to the conclusion that the expression of EGFR and MMP3 genes in LUSC far exceeds that in LUAD, while MET is the opposite. Although the other three genes are consistent in the occurrence and development of the two diseases, MPO is worthy of attention. This gene showed high expression in all stages in the N group of cases, and low expression in the T group. Expression status, therefore, we speculate that excessive activation of MPO expression in tissues may be effective in inhibiting the development of LUAD and LUSC.

GO (BP), KEGG enrichment analysis, PPI analysis and C-T pathway analysis results
According to the above analysis, we have excluded the three targets of HSP90AA1, PTGS2 and MMP2. Studies have shown that HSP90AA1 and the same type of HSP90AB1 may be required for the treatment of the two subtypes of breast cancer (19). PTGS2 is used in colorectal cancer. It is needed in other diseases (20), and MMP2 mostly plays a role in in ammation and immune response (21,22).
In order to achieve the credibility and statistical signi cance of the data, we performed GO (BP) analysis and KEGG enrichment analysis with FDR ≤ 0.05. The results are more rigorous and reliable than the results obtained by using P values. Therefore, the results are very great reference value, see the attachment (Table 2 and Table 3) for detailed information.
With reference to the results of KEGG enrichment analysis, in order to make it easy to understand, we use Cytoscape to link the 3 compounds, 6 targets and 20 pathways, and draw a compounds-targetspathways diagram (Fig. 7) to make the compounds The relationship with the target and the relationship between the target and the pathway are visualized. A total of 20 metabolic regulation pathways including 29 nodes and 53 edges are obtained.
First, from the C-T-P network diagram, we sort the targets into EGFR, MET, MMP1, MMP3, NQO1, and MPO according to the number of target regulatory pathways and the number of corresponding compounds, and sort the pathways into Pathways in cancer, Transcriptional misregulation in cancer, Hepatocellular carcinoma, Bladder cancer, EGFR tyrosine kinase inhibitor resistance, Adherens junction, IL-17 signaling pathway, Melanoma, Non-small cell lung cancer, Relaxin signaling pathway and MicroRNAs in cancer, etc.
However, despite these pathways both LUAD and LUSC have contributed, but the strength of the contribution is still not easy to distinguish. Therefore, further analysis and veri cation are needed.

Analysis of molecular docking results
After the above series of analysis, screening and veri cation work, we have determined the target sites of the three compounds respectively. These three compounds can act on multiple targets. Among them, Q can act on MMP1, MMP3, EGFR, MPO and NQO1, L can act on MMP1, EGFR and MET, K only acts on MMP1.
After obtaining this information, we used Chemdraw 19.0 software to draw the structure of the 6 compounds, saved as a mol2 le, and used the PDB database (https://www.rcsb.org/) to select protein tertiary structure of ligand-containing and ligand according to the structure of the compounds and download the pdb le, process the pdb protein le through Discovery Studio 4.5 Client software (remove water molecules and redundant structures, etc.), use AutoDock 1.5.6 software to save the two le formats as pdbqt les, select the con guration Body and coordinate position (x, y, z), the compounds are screened and sorted by the distance of the hydrogen bond. These tertiary structures are all predicted using the X-RAY DIFFRACTION method, and the resolution of the six proteins is 1.90 Å, and the highest is 2.40 Å. Then, we use AutoDock 1.5.6 and AutoDock Vina to do the molecular docking of the compound and the target, and nally use PyMol to process the docking result (Fig. 8).
In the gure, the coordinates of each compound in the tertiary structure are EGFR (12,12,14), MET (16, 14, 12), MMP1 (12,12,12), MMP3 (16, 16, 16), MPO (16, 14, 10), NQO1 (22,22,22), where the greater the value in this coordinate, the greater the a nity for docking, and the greater the a nity, which proves that the binding of the compound to the target is tighter, and the compound is also The more effective. However, in order to ensure the availability and reference value of the data, we set the coordinates (x, y, z) according to the structural size of the compound itself. Therefore, the coordinate values we use are just suitable for the compound itself. The results show that the binding energy (kcal/mol) of the compound and the target is the best is -10.1 kcal/mol, and the worst is -7.9 kcal/mol.
According to the results of molecular docking, we selected items with a nity within a certain range (a nity≤-7.0 kcal/mol). The speci c molecular docking results of 3 SAIN compounds and 6 targets can be found in the attachment (Table 4).
In order to visualize the results, we use heat maps to show the results in detail below (Fig. 9).

Analysis of subcellular location prediction results
In order to explore the speci c positions of the 6 proteins we screened out in the cell, we used the PSORT , CELLO and BUSCA databases to predict the subcellular location, and obtained the prediction results of the three databases PSORT (Table 5), CELLO (see attachment for Table 5) and BUSCA (see attachment for Table 5).
Comparing the results of the three subcellular location prediction databases, we found that PSORT has higher compatibility and reliability than the other two databases, so we chose to use the prediction results of PSORT .
The database uses the k-NN algorithm (k = 23), which is currently a relatively accurate algorithm for subcellular location prediction. 3.7 Analysis of prognosis and survival of patients with 6 target genes Survival analysis in the traditional sense, its purpose is to analyze the genetic data related to the disease before the patient undergoes surgery or other treatment. In this sense, it provides patients with a choice of whether to continue treatment (23).
After analyzing the survival of these 6 targets, we found that the 6 genes all showed statistical signi cance in the regulation of LUAD (P < 0.05), while in the regulation of LUSC, only EGFR and MMP3 had statistical signi cance (P < 0.05), which is consistent with our previous TPM analysis results (Fig. 11).

Current status of LUSC treatment targets and MMP3
Compared with LUAD, LUSC has a unique pathological morphology (the growth position, direction and variation of cancer cells) in NSCLC. Therefore, because there are few carcinogenic aberrations that can target LUSC, no targeted therapy method for LUSC has been developed. We already know that insulin enhances the proliferation, migration and drug resistance of non-small cell lung cancer cells by activating the PI3K/Akt pathway. Studies have shown that insulin can up-regulate the expression of MMP3 genes in NSCLC cells, but its expression will be affected by PI3K/Akt. Inhibited by pathway inhibitors (29).
Studies have shown that FGFR-1 and MMP3 genes and proteins are highly expressed in esophageal squamous cell carcinoma, and both may be related to the invasion and metastasis of esophageal cancer (30

Application of EGFR and MET in LUAD targeted therapy
The latest NSCLC guidelines 2021 V2 released by the National Comprehensive Cancer Network (NCCN) show that the detected genes related to NSCLC targeted therapy include EGFR, KRAS, ALK, ROS1, MET, BRAF, RET, and NTRK (32). However, these genes are mainly for LUAD.
Receptor tyrosine kinase (RTK) is a high-a nity cell surface receptor for many hormones, cytokines and polypeptide growth factors. Among the 90 unique tyrosine kinase genes identi ed in the human genome, 58 encode receptor tyrosine kinase proteins. Receptor tyrosine kinases have not only been shown to be key regulators of normal cellular processes, but also play a key role in the occurrence and development of many types of cancer (33).
Epidermal growth factor receptor (EGFR) and hepatocyte growth factor (HGF) are members of the RTK family, and MET is the tyrosine kinase transmembrane receptor of HGF, a protein encoded by the c-MET proto-oncogene The product, as a potential therapeutic target for many types of tumors, has attracted attention in the medical eld.
At present, the world has developed a variety of targeted drugs for NSCLC against EGFR mutations and MET exon14 changes. Among them, drugs for EGFR target mutations are mainly divided into small molecule tyrosine kinase inhibitors (TKI) and monoclonal antibodies. Major categories, the TKI category mainly includes Ge tinib, Erlotinib, Afatinib, Osimertinib, Dacomitinib, Lcotinib and Brigatinib, etc.
Monoclonal antibodies are relatively few, mainly including Necitumumab, Amivantamab and Ramucirumab, of which Amivantamab targets EGFR exon 20 insertion.
The development of these drugs directly shows that our analysis results are correct, that is, the expression of the MET gene in LUAD patients is far more than that of LUSC patients, and the expression of EGFR gene in the two types of patients is roughly the same. Occasionally, the expression of LUSC patients may be even higher. Much phenomenon. The results of molecular docking show that the compound L and Q of SAIN can bind to the EGFR protein stably with an a nity of -8.8 kcal/mol, while the MET protein can only bind to the compound L with an a nity of -7.9 kcal/mol. Through survival analysis, we know that among LUAD patients, EGFR mutation patients (P < 0.001) have a signi cantly shorter prognostic survival period than MET patients (P < 0.02).

The role of MPO and NQO1 in NSCLC patients
We all know that smoking increases the risk of lung cancer. In the early years, researchers found that in a large number of patients with lung adenocarcinoma combined with smoking and non-smoking, Nad(p)h: quinone oxidoreductase 1 ( NQO1) there was no signi cant difference in genotypes, and smokers with myeloperoxidase (MPO) genotype had a lower prevalence of lung cancer than non-smokers (34).
MPO is a member of the heme peroxidase superfamily and is mainly expressed in neutrophils and monocytes. Studies have shown that MPO targeted therapy can show a good prognostic survival status in LUAD patients (35), which is consistent with the results of our survival analysis (P < 0.001). The results of molecular docking show that the compound Q of SAIN can speci cally bind to MPO protein (a nity = -8.8 kcal/mol). Although there is evidence that MPO has an effect on NSCLC patients, the results of genetic difference analysis and TPM analysis show that it is different from other targets. The expression of MPO in tumor tissues of LUAD and LUSC is much lower than its expression in normal tissues. The amount (TPM < 2), that is, MPO is basically not expressed in NSCLC patients. According to the results of KEGG enrichment analysis, we speculate that MPO may play a certain role in LUAD and LUSC patients through the transcriptional misregulation in cancer pathway.
The results of PPI analysis showed that MPO and NQO1 may have similarities in function, and NQO1 has been con rmed to be overexpressed in a variety of solid tumors, including lung (36, 37), which is consistent with our analysis results. Currently, β-lap-dC3 prodrug micelles have been developed to target NSCLC overexpressing NQO1. Studies have shown that the drug can signi cantly prolong the prognostic survival time of NSCLC patients (38), which is consistent with the results of our NQO1 survival analysis. It is consistent, that is, when NQO1 is overexpressed, the patient's survival time is greatly reduced. The molecular docking results showed that the a nity of compound Q when binding to NQO1 protein was − 8.5 kcal/mol, which proved the effectiveness of this compound as a drug.

The potential of high expression of MMP1 in the treatment of NSCLC
The results of the compounds-targets molecular docking show that compound Q, L, and K can bind to the MMP1 protein, and the binding a nity is better than any other compounds, respectively Q (a nity = -10.0 kcal/mol), L (a nity = -10.1 kcal/mol) and K (a nity = -9.6 kcal/mol). We analyzed the survival of LUAD and LUSC patients after MMP1 overexpression and found that compared with the survival time of patients after EGFR mutation, MMP1 overexpression showed a worse prognostic survival. The PPI network analysis results show that MMP1 forms three triangular structures with EGFR, MMP3, NQO1 and MPO (Fig. 12), which suggests that when MMP1 functions, it may form connections with four other targets.

Conclusion
We know that the etiology of NSCLC is very complicated, and we cannot explain it with a single gene change. Therefore, the treatment of lung cancer cannot be achieved by one method. It also requires a combination of multiple methods and multiple targets, although SAIN The three compounds of Q, L and K can prevent and treat NSCLC through the targets of MMP1, MMP3, EGFR, MPO, MET and NQO1, but none of us can guarantee that the cause of NSCLC is the result of these genes. Therefore, the complete cure of NSCLC still requires the joint efforts of researchers in various elds to achieve.  Figure 1 Work ow of experiment to predict the mechanism of SAIN in the treatment of LUAD&LUSC.

Figures
Page 23/32 Eight protein interaction networks mediated by six compounds of SAIN, the thickness of the line represents the strength and weakness relationship between the proteins.