Data collection from electronic medical record information and structuring with natural language processing
The overall flow of data collection is shown in Fig. 1a, and Fig. 1b outlines the procedure from the collection of clinical data through electronic medical record entries (medical records and initial medical questionnaires, CT imaging interpretation reports, and blood test data) to the generation of input data. Medical records were either entered at the time of consultation using a format with predefined items (referred to as a template), or the information was extracted manually into the template from free-text entries and the initial medical history questionnaire; structured data were then generated as input data. For the CT imaging interpretation reports, natural language processing was used to pair each lesion-related entity with the site where it was observed, and to annotate whether the lesion was observed (positive), not observed (negative), or suspected (suspected). The features were then manually modified and expressed as one-hot vectors for subsequent analysis. Blood test results were collected as structured data by manually extracting the test values for pre-selected items. All of the above clinical information was collected at or near the date of blood collection for proteome data acquisition. For the proteome data, phosphatidylserine-positive extracellular vesicles were separated from serum, and the proteins they contained were comprehensively measured by mass spectrometry (Fig. 1c). Each missing value was imputed with a representative value from healthy people, yielding a medical information matrix of 6,506 attributes (6,282 from CT image interpretation reports, 171 from blood test results, and 53 from medical records) × 602 cases (with overlap, from 403 patients and 39 controls) and a proteome matrix of 2,445 protein IDs × 602 cases.
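As an illustration only, the one-hot encoding of CT findings and the imputation of missing values described above can be sketched as follows. This is a minimal sketch, not the pipeline used in this study: the function names, the `entity|site|status` column naming, and the choice of the median among healthy controls as the "representative value" are assumptions made for the example, since the text does not specify them.

```python
from statistics import median


def one_hot_findings(reports):
    """Encode CT findings as one-hot rows.

    reports: one list per case, each holding (entity, site, status) tuples,
    where status is "positive", "negative", or "suspected".
    Column naming "entity|site|status" is a hypothetical convention.
    """
    columns = sorted({f"{e}|{s}|{st}" for case in reports for (e, s, st) in case})
    index = {name: j for j, name in enumerate(columns)}
    matrix = []
    for case in reports:
        row = [0] * len(columns)
        for e, s, st in case:
            row[index[f"{e}|{s}|{st}"]] = 1
        matrix.append(row)
    return columns, matrix


def impute_with_healthy_median(values, is_healthy):
    """Replace missing values (None) with the median of healthy cases.

    Assumption: the median stands in for the "representative value";
    at least one healthy case must have an observed value.
    """
    healthy = [v for v, h in zip(values, is_healthy) if h and v is not None]
    fill = median(healthy)
    return [fill if v is None else v for v in values]
```

Each distinct combination of entity, site, and status becomes its own binary attribute, which is one plausible reason the CT reports would contribute far more attributes than the other sources.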
Basic patient characteristics and clinical items
The number of patients for whom clinical data (medical information and blood samples) were collected in this study and their basic characteristics are shown in Table I. The collected medical information is listed in Sup Table I, and the 2,388 protein groups identified in the proteome analysis (of the 2,445 proteins detected, 2,388 were mapped to known protein IDs) are shown in Sup Table II.
Concept of subset binding and analysis workflow
The composition of the cohort dataset used for the analysis in this study is shown in Fig. 2a, and the analysis workflow is shown in Fig. 2b. Subset binding (SB), an algorithm newly developed in this study, was used to detect patient stratification rules from structured medical information and proteome data. SB outputs patient stratification rules (e.g., patients with high expression of biomolecules A, B, and C tend to show reticular shadows and traction bronchiectasis) by detecting associations between phenotypic information, such as medical information, and biomolecular data, such as omics data. SB uses fuzzy association rule mining as its underlying technology. It accepts two input matrices (e.g., proteome data and structured medical information; the number of rows must be the same, but the number of columns may differ). For each matrix, membership values for the “Low” and “High” classes are calculated for each attribute using a membership function, and association rules are generated so that the frequent itemsets from the two matrices are linked (Fig. 2b; see Supplementary methods for the details of the algorithm). With this algorithm, data containing a mixture of continuous and discrete values, as is common in medical information, can be handled without any special preprocessing or prior knowledge. There are six possible combinations for SB analysis, as shown in Fig. 2c. We selected proteins that were included at least once in the IPF characteristic-related association rules output by i) medical record (mixture of binary and numerical values) – protein (numerical values) association, ii) CT interpretation report (binary) – protein association, and/or iii) blood test (mixture of binary and numerical values) – protein association. The IPF characteristics used to select association rules include the known biomarker KL-6 (sialylated carbohydrate antigen) in blood tests and respiratory difficulty during exertion in medical records.
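To make the idea of linking fuzzy itemsets across two matrices concrete, a toy sketch is given below. It is not the SB algorithm itself, whose details are in the Supplementary methods; the min-max membership function, the use of the minimum as the fuzzy conjunction, and all function names are assumptions chosen for the example.

```python
def high_low_memberships(column):
    """Map one attribute to fuzzy "High" and "Low" memberships in [0, 1].

    Assumption: a linear min-max membership function. A binary column
    (0/1) passes through the same function and keeps its 0/1 values.
    """
    lo, hi = min(column), max(column)
    span = hi - lo
    high = [0.5 if span == 0 else (x - lo) / span for x in column]
    low = [1.0 - h for h in high]
    return {"High": high, "Low": low}


def fuzzy_support(items):
    """Mean over cases of the minimum membership across all items."""
    n_cases = len(items[0])
    return sum(min(item[i] for item in items) for i in range(n_cases)) / n_cases


def rule_confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    antecedent and consequent are lists of membership vectors, e.g. taken
    from the biomolecular matrix and the medical information matrix.
    """
    support_a = fuzzy_support(antecedent)
    if support_a == 0:
        return 0.0
    return fuzzy_support(antecedent + consequent) / support_a


# Hypothetical toy data: one protein (continuous) and one finding (binary),
# measured on the same four cases (rows must align across both matrices).
protein = high_low_memberships([0.0, 2.0, 4.0, 4.0])["High"]
finding = high_low_memberships([0, 0, 1, 1])["High"]
confidence = rule_confidence([protein], [finding])
```

Because the continuous and the binary column go through the same membership function, the sketch also shows how mixed data types can be treated uniformly without separate preprocessing.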
Clustering of proteome data is not suitable for patient stratification
To investigate whether the global similarities among cases in the serum EV proteomic data reflect the diagnosis, we visualized their quantitative patterns using a heatmap with hierarchical clustering (Fig. 3a), t-SNE (Fig. 3b), and UMAP (Fig. 3c). The heatmap shows that the global similarities among cases did not match their diagnoses, which implies that canonical approaches such as clustering are not suitable for patient stratification. This indicates that the proteome data contained many proteins that were not directly linked to phenotypes such as diagnosis. Fig. 3b and 3c support this tendency: the IIP subtypes (UIP, probable UIP, indeterminate UIP, alternative, and others) did not co-localize, whereas HC (healthy controls) showed a weak tendency to co-localize.
Canonical machine learning techniques assume that the global similarities among cases are high when the diagnosis and/or phenotypic characteristics are similar. Because this assumption did not hold for the present data, we instead used SB to search for proteins linked to IPF characteristics, as shown in Fig. 2b, which identified 20 proteins.
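One simple way to quantify whether global similarity tracks diagnosis is to ask how often a case's nearest neighbor carries the same label. The sketch below uses this nearest-neighbor agreement as a stand-in for illustration; it is not the hierarchical clustering, t-SNE, or UMAP analysis shown in Fig. 3, and the function name and toy data are hypothetical.

```python
import math


def nearest_neighbor_agreement(profiles, labels):
    """Fraction of cases whose nearest other case has the same label.

    profiles: one numeric vector per case (e.g. protein abundances).
    labels:   one diagnosis label per case.
    Uses Euclidean distance; requires at least two cases.
    """
    hits = 0
    for i, p in enumerate(profiles):
        best_j, best_d = None, math.inf
        for j, q in enumerate(profiles):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        hits += labels[i] == labels[best_j]
    return hits / len(profiles)


# Hypothetical toy profiles forming two tight groups.
toy_profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
matched = nearest_neighbor_agreement(toy_profiles, ["HC", "HC", "IPF", "IPF"])
mismatched = nearest_neighbor_agreement(toy_profiles, ["HC", "IPF", "HC", "IPF"])
```

A value near the agreement expected by chance would be consistent with the observation that global proteome similarity does not reflect diagnosis.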
The top 20 proteins that co-occurred with characteristic findings of IPF by SB and their relationship to IPF
The 20 IPF-related proteins found by SB are shown in Table II. The protein-protein interrelationships among the 20 molecules were searched using TargetMine (Chen, Tripathi et al. 2011) (Chen, Tripathi et al. 2016) (Chen, Tripathi et al. 2019). LYN (Tyrosine-protein kinase Lyn), PTPN6 (Tyrosine-protein phosphatase non-receptor type 6), MIF (Macrophage migration inhibitory factor), and RAN (GTP-binding nuclear protein Ran) were found to be the hub molecules in the protein-protein interactions among these 20 molecules (Fig. 4a). In addition, the presence or absence of a relationship between each of the 20 molecules and IPF was explored using TargetMine and IPA (Ingenuity Pathway Analysis, QIAGEN), and the results are shown in Fig. 4b and Sup Table III. The molecules with no previously reported association were MRPS17 (28S ribosomal protein S17, mitochondrial) and PEF1 (Peflin), whereas the molecules found to be associated with IPF through many other molecules were LYN, PTPN6, MIF, and RAN.
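For readers unfamiliar with hub molecules, one common working definition is a node with many interaction partners. The sketch below ranks nodes by degree in an undirected interaction list. It is illustrative only: it does not call TargetMine or IPA, the criterion actually used to designate hubs is not stated in the text, and the node names `P1` to `P4` and their edges are hypothetical.

```python
from collections import Counter


def hub_nodes(edges, top_n):
    """Rank nodes by number of distinct interaction partners.

    edges: iterable of (node_a, node_b) pairs, treated as undirected;
    duplicate pairs and self-interactions are ignored.
    Returns the top_n (node, degree) pairs, ties broken alphabetically.
    """
    unique_pairs = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for pair in unique_pairs:
        for node in pair:
            degree[node] += 1
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical interaction list, not taken from the study.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P3", "P2")]
top_hubs = hub_nodes(toy_edges, 2)
```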
Network analysis and search for upstream control factors
In addition, we searched for molecular networks composed of the seven core molecules and found pathways, such as Carbohydrate Metabolism, Small Molecule Biochemistry, and Cellular Assembly and Organization, to which all seven molecules are mapped. The regulatory relationships among the molecules in this network, including the seven core molecules, are depicted in Fig. 4c.
Moreover, the upstream regulatory relationships governing the expression of the seven core molecules were explored using IPA causal network analysis. As shown in Fig. 4d, these molecules are regulated by molecules such as ESR1 (Estrogen receptor 1), CCND1 (Cyclin D1), CCR2 (C-C chemokine receptor type 2), NOS2 (Nitric oxide synthase 2), and MMP14 (Matrix metalloproteinase-14), which are in turn regulated by the SRC (Proto-oncogene tyrosine-protein kinase Src) family, ERK1/2 (Extracellular signal-regulated kinase 1/2), and ABL1 (ABL proto-oncogene 1). Finally, ponatinib was identified as an upstream regulator.
LYN and PTPN6 knockout mice were reported to have abnormal phenotypes in the lung
The MGI database (http://www.informatics.jax.org/) and the JAXKO mouse phenotype database (https://www.jax.org/jax-mice-and-services) were used to search for KO mice and phenotypes of the core and hub molecules, which are summarized in Table III. KO mice for LYN and PTPN6 were reported to have phenotypes such as inflammation in the lung. For the other molecules, either no data were available or only effects on other organs have been reported.
Immunohistochemical staining reveals that many of the proteins are strongly upregulated in fibrotic areas, especially in epithelial cells and inflammatory cells
Of the 20 molecules identified by SB, the 7 core proteins and 4 hub proteins were examined for expression in patient lungs and for increased expression in fibrotic areas. Fibrotic and normal areas of the lungs of two IPF patients who had concomitant cancer and were eligible for surgery, from a cohort different from that used for the proteome analysis, were immunostained using antibodies against each of the proteins. As a representative result, clear enhancement of Lyn expression was observed in tissues with obvious fibrosis confirmed by Masson’s trichrome staining (Fig. 5a). The results of immunohistochemical staining are summarized in Table IV, which shows that almost all of the proteins, except ITIH, are upregulated in fibrotic areas, especially in epithelial cells and inflammatory cells.
Ponatinib suppressed EMT
Epithelial-mesenchymal transition (EMT) has been suggested to be important in the mechanism of pulmonary fibrosis in IPF. In this study, we established a test system in which EMT is induced by TGF-β in BEAS-2B normal human airway epithelial cells, and confirmed the EMT inhibitory effect of ponatinib (Fig. 6).