2.1 Screening of active ingredients with drug-like properties in SAIN
The TCMSP database (https://tcmsp-e.com/) is one of the most commonly used databases for screening active ingredients in traditional Chinese medicine. Its advantage is that it provides oral bioavailability (OB) and drug-likeness (DL) parameters, both of which play an important role in evaluating drug efficacy. A compound was considered to have drug-like properties only when both parameters met their thresholds (OB ≥ 30% and DL ≥ 0.18).
The DL value in this system is calculated according to formula (1). For the purpose of obtaining a target drug, the DL value is meaningful only when the lead compound is easy to synthesize chemically and has suitable ADME (absorption, distribution, metabolism, excretion) properties.
In formula (1), x represents the descriptor index of each ingredient in SAIN, and y represents the average drug-likeness index of the drugs in the DrugBank database (https://www.drugbank.ca/).
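As an illustration of how such a similarity score behaves, the following minimal sketch assumes that formula (1) takes the Tanimoto form usually described for TCMSP, T(x, y) = x·y / (|x|² + |y|² − x·y). The function name and the descriptor vectors are hypothetical and are not SAIN or DrugBank data.

```python
def tanimoto(x, y):
    """Tanimoto coefficient between two equal-length descriptor vectors.

    Assumed form: T = x.y / (|x|^2 + |y|^2 - x.y)
    """
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denom == 0:
        raise ValueError("both vectors are zero; similarity is undefined")
    return dot / denom


# Hypothetical descriptor vectors, for illustration only.
ingredient = [1.0, 0.0, 1.0]
drugbank_average = [1.0, 1.0, 1.0]
dl_score = tanimoto(ingredient, drugbank_average)
```

With these invented vectors, the score is 2/3; identical vectors give a score of 1.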
The lipid-water partition coefficient logP_(O⁄W) is the partition coefficient of the drug in the n-octanol-water system. It is widely used as a measure of the hydrophobicity of chemical compounds. Hydrophobicity is a main driving force both for the passage of compounds through biological membranes, which are composed of lipid bilayers, and for compound-target binding. logP_(O⁄W) is defined by formula (2):
logP_(O⁄W) = log10(C_O ⁄ C_W)    (2)
In the formula, C_O represents the equilibrium concentration of the drug in the oil phase, and C_W represents the equilibrium concentration of the drug in the water phase. The value of logP_(O⁄W) indicates the hydrophobicity of the solute: the larger the value, the stronger the hydrophobicity, and the smaller the value, the stronger the hydrophilicity. Therefore, we used logP_(O⁄W) ≤ 5 as a screening criterion.
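The three screening criteria of this section (OB ≥ 30%, DL ≥ 0.18, logP_(O⁄W) ≤ 5) can be sketched as a simple filter. This is only an illustration: the function names, the record layout, and the compound values below are hypothetical, and a base-10 logarithm is assumed for the partition coefficient.

```python
import math

OB_MIN = 30.0    # oral bioavailability threshold, in percent
DL_MIN = 0.18    # drug-likeness threshold
LOGP_MAX = 5.0   # upper limit for logP(O/W)


def log_p(c_oil, c_water):
    """logP(O/W) from equilibrium concentrations (base-10 log assumed)."""
    if c_oil <= 0 or c_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_oil / c_water)


def passes_screen(ob, dl, logp):
    """True if a compound meets all three screening criteria."""
    return ob >= OB_MIN and dl >= DL_MIN and logp <= LOGP_MAX


# Hypothetical records, not taken from TCMSP.
compounds = [
    {"name": "compound_A", "ob": 45.2, "dl": 0.25, "logp": 3.1},
    {"name": "compound_B", "ob": 28.0, "dl": 0.40, "logp": 2.0},  # fails OB
    {"name": "compound_C", "ob": 60.0, "dl": 0.30, "logp": 6.4},  # fails logP
]
kept = [c["name"] for c in compounds
        if passes_screen(c["ob"], c["dl"], c["logp"])]
```

In this invented example only compound_A is retained, because the other two each violate one threshold.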
2.2 Target prediction and construction of active compounds-targets network
The structural formulas of the SAIN active ingredients obtained above were drawn in the SwissTargetPrediction (STP) database (http://www.swisstargetprediction.ch) for target prediction, and the predicted targets were combined with the NSCLC targets retrieved from the TCMSP database to obtain the final disease target information. Cytoscape 3.8.0 software (7) was then used to construct and visualize a compounds-targets network (C-T network) of these active compounds and targets.
This completed the preliminary screening of the lung cancer-related active ingredients in SAIN.
2.3 Differential gene expression analysis and TPM analysis of LUAD and LUSC genes
To further screen the active ingredients and targets identified above, we obtained a large number of clinical samples of LUAD and LUSC from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). After differential gene expression analysis of tens of thousands of samples, samples were randomly selected to draw a heat map of the differentially expressed genes, and a preliminary screening of targets was carried out. To screen the targets a second time and make the result clearer, we used the gene IDs of the targets and the TCGA data provided by the GEPIA database (http://gepia.cancer-pku.cn/) to draw quantitative scatter plots of transcripts per million (TPM) for the screened genes. This step further narrowed down the effective compounds and targets of SAIN for the treatment of LUAD and LUSC.
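For readers unfamiliar with the TPM unit, the sketch below shows the standard way TPM values are derived from read counts and gene lengths. GEPIA supplies precomputed values, so this is not part of the workflow described here; the function name and the numbers are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from read counts and gene lengths (in kb).

    Each count is first divided by the gene length, then the resulting
    rates are rescaled so that they sum to one million.
    """
    if len(counts) != len(lengths_kb):
        raise ValueError("counts and lengths must have the same length")
    if any(length <= 0 for length in lengths_kb):
        raise ValueError("gene lengths must be positive")
    rates = [c / length for c, length in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    return [r / total * 1e6 for r in rates]


# Hypothetical counts for three genes of length 1, 2 and 3 kb.
values = tpm([10, 20, 30], [1.0, 2.0, 3.0])
```

Because the counts in this example are proportional to gene length, all three genes receive the same TPM value, which illustrates the length normalization.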
2.4 Construction of protein interaction network, compounds-targets-pathways network and gene enrichment analysis
To evaluate the interactions between the targets screened above, the protein interaction network and the active compounds-targets-pathways network were constructed. To explore the biological processes and pathways in which each target participates in the body, we retrieved the target gene data from the STRING database (https://www.string-db.org) and performed biological process (BP) analysis within gene ontology (GO) analysis, together with Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. A false discovery rate (FDR) ≤ 0.05 was required for both the GO analysis and the KEGG enrichment analysis, which meets the criterion for statistically significant gene enrichment.
The FDR is a P value corrected for multiple testing, so results filtered by it are more reliable than those based on uncorrected P values. This step therefore aims to identify the biological processes and in vivo pathways on which the targets act, which provides a basis for subsequent research.
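The text does not name the correction procedure behind the FDR values. As a minimal sketch, assuming the widely used Benjamini-Hochberg procedure, the adjustment and the FDR ≤ 0.05 filter could look as follows. The function name and the P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the adjusted
    # values never decrease as the raw P values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four enrichment terms.
raw_p = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(fdr) if q <= 0.05]
```

In this invented example three raw P values lie below 0.05, but only the first term remains significant after correction, which is why filtering on FDR is the stricter criterion.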
2.5 Molecular docking and subcellular localization prediction
Molecular docking was performed with AutoDock 4.2.6, PyMOL, and other tools. The binding energy between each compound and its target was used to verify whether the effective compounds of SAIN are reliable for the treatment of LUAD and LUSC, and to exclude the compounds of SAIN that have poor effects on LUAD and LUSC.
Molecular docking (8) is a method of drug design based on the characteristics of the receptor and the interaction between the receptor and the drug molecule. It is a theoretical simulation method that mainly studies the interaction between molecules (such as ligands and receptors) and predicts their binding modes and affinities. In recent years, molecular docking has become an important technology in the field of computer-aided drug research (9).
Subcellular localization prediction has become a popular method in recent years. It uses existing data to build a database relating gene sequences and their targeting sequences to subcellular structures, from which the location of a target protein among the organelles and cell membranes can be predicted; this has brought great convenience to researchers. The commonly used subcellular localization prediction tools are: (1) PSORT II (https://psort.hgc.jp/form2.html), which uses the k-nearest neighbor (k-NN) algorithm, a learning algorithm widely applied in data mining and machine learning. PSORT II can identify classic nuclear localization signal (cNLS) sequences, and its accuracy is very high when the sample size is large enough (10,11,12); (2) CELLO (http://cello.life.nctu.edu.tw/), which uses the support vector machine recursive feature elimination (SVM-RFE) algorithm. SVM-RFE uses the weight vector w generated during SVM training to construct a ranking coefficient; each iteration removes the feature with the smallest ranking coefficient, finally yielding a descending ranking of all features (13,14); (3) BUSCA (http://busca.biocomp.unibo.it/), which uses the BetAware algorithm to detect transmembrane beta-barrels (TMBB) in the proteome and to predict their topology (15,16).
Combined with the analysis of biological processes, PSORT II, CELLO and BUSCA were used to predict subcellular localization, and the prediction results of the three tools were compared. We took the overlapping results as the main location of each target protein in the cell.
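Selecting the overlapping predictions amounts to a set intersection across the three tools. The sketch below illustrates this under the assumption that each tool's output has already been reduced to a set of location labels with a shared vocabulary; the function name and the example predictions are hypothetical.

```python
def consensus_location(predictions):
    """Locations predicted by every tool, returned in sorted order.

    predictions maps a tool name to the set of locations it predicts
    for one target protein.
    """
    location_sets = [set(locations) for locations in predictions.values()]
    if not location_sets:
        return []
    return sorted(set.intersection(*location_sets))


# Hypothetical predictions for a single target protein.
predictions = {
    "PSORT II": {"nucleus", "cytoplasm"},
    "CELLO": {"nucleus", "plasma membrane"},
    "BUSCA": {"nucleus"},
}
main_location = consensus_location(predictions)
```

An empty result would indicate that the three tools do not agree on any location for that protein.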
2.6 Survival analysis of patient prognosis
In medical research, survival analysis refers to the analysis of survival data, such as the survival time of patients after surgery, in order to evaluate the efficacy of a drug.
After a series of screening and analysis steps on the compounds and targets of SAIN, we performed prognostic overall survival analysis (17) on the gene IDs of the targets of the effective compounds of SAIN, to explore how long patients survive when the drugs act on these targets.
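The text does not state which estimator underlies the overall survival analysis. As a minimal sketch, assuming the Kaplan-Meier estimator that is commonly used for this purpose, the survival curve can be computed as follows. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.

    times[i] is the follow-up time of patient i; events[i] is True if
    death was observed at that time and False if the patient was censored.
    Returns a list of (time, survival_probability) pairs.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and death indicators.
curve = kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])
```

For these invented data the estimated survival probability drops to 0.8 at month 5, 0.6 at month 8, and 0.3 at month 12; censored patients reduce the number at risk without lowering the curve.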