Integrated Analysis of CNV, Gene Expression and Disease State Data in Prostate Cancer

Background: Copy number variation (CNV) may contribute to development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene and disease label data provide us with an opportunity to design a new machine learning framework to predict potential disease - related CNVs. Results: In this paper, we developed a novel machine learning approach, namely IHI - BMLLR ( I ntegrating H eterogeneous I nformation sources with B iweight M id - correlation and L 1 - regularized L ogistic R egression under stability selection), to predict the CNV - disease path associations by using a data set containing CNV, disease state labels and gene data. CNVs, genes, and diseases are connected through edges, and then constitute a biological association network. To construct a biological network, we first used a self - adaptive biweight mid - correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with L1 penalty (LLR) function to detect genes related to disease. We added stability selection strategy, which can effectively reduce false positives, when using self - adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find top D path associations and important CNVs. Conclusions: Compared with state - of - the - art methods, IHI - BMLLR discovers CNVs - disease path associations by integrating analysis of CNV, gene expression and disease label data combined with stability selection strategy and weighted path search algorithm, thereby mining more information in the data sets, and improving the accuracy of obtained CNVs. The experimental results on both simulation and prostate cancer data show that IHI - BMLLR is significantly better than two state - of - the - art CNV detection methods (i.e., CCRET and DPtest) under false positive control. Furthermore, we applied IHI - BMLLR to prostate cancer data and found significant path associations. Three new cancer - related genes were discovered in the paths and these genes need to be verified by biological research in the future.


Integrated Analysis of CNV, Gene Expression and Disease State Data in Prostate Cancer
Lin Yuan 1 , Tao Sun 1 , Jing Zhao 1 , and Zhen Shen 2 * Background Copy number variations (CNVs) contribute to a substantial fraction of human genetic variation and are increasingly involved in disease associations and genome evolution [1]. Many evidences reveal the causal relationship between CNVs and many human disease phenotypes, including scores of known genomic diseases and hundreds of complex disease traits [2][3][4]. One of the essential issues in CNV research is to understand how CNVs affect the occurrence of diseases [5,6].
With the increase in the number of verified CNV-disease associations, several databases have been published, such as DGV [7], DGVa [8], dbVar [9], CNVD [10] and DECIPHER [11]. However, known CNV-disease associations include only a small fraction of CNVs and diseases. Calculation models and methods have been developed to predict the potential CNV-disease associations, which can be used as candidates for biological experimental verifications. Calculation models and methods would greatly reduce the experiment cost and save time in finding new CNV-disease associations.
The calculation methods can mainly be categorized into statistical classification-based and machine learning-based methods. People use statistical classification methods to develop innovative solutions to identify disease-related CNVs. For example, Fan et al. [12] found that CNV is highly correlated with differential gene expression by counting the correlation between CNV and gene expression in a large number of cell lines and disease samples. Cai et al. [13] proposed a calculation method that integrates monte carlo feature selection and incremental feature selection to identify discriminative core CNVs in different breast cancer subtypes. Furey et al. [14] proposed a single statistical framework, GSAA, which simultaneously measures genetic variation and gene expression variation across the entire genome to identify gene sets that are differentially expressed and thus can be used as ----------------markers related to studied traits. Sellers et al. [15] performed a genome-wide association study of common (>1%) CNV regions (CNVRs) with EOC (epithelial ovarian cancer) and HGSOC (high-grade serous) risk, and performed in silico analyses of tumor-gene expression. Hurles et al. [16] presented a statistical framework for case-control CNV association study, which uses likelihood ratio to test difference between case and control samples.
Recently, many researchers are committed to using machine learning-based methods to study the complex mechanisms between CNVs and diseases. Hsiao et al. [17] used Pearson correlation coefficient and pathway analysis to perform concurrent genome-wide analyses of CNVs and gene expression to identify gene reproducibly associated with tumorigenesis and survival in non-smoking female lung adenocarcinoma. Sheng et al. [18] proposed a support vector machine (SVM) classifier based on arm-level CNV data to detect early colorectal cancer. Onsongo et al. [19] applied random forest to next-generation sequencing to detect CNVs. CCRET [20] collectively modelled the effects of multiple CNV features by measuring variants on a multi-categorical scale to find disease-related CNVs. Chung et al. [21] introduced CNVRuler for CNV-association studies. CNVRuler supports chi-squared and Fisher's exact tests in addition to logistic and linear regression analyses using defined CNVRs and clinical information. DPtest [22] used a double penalty model to capture CNVs' association with both the intensities and the disease traits. Hao et al. [23] proposed an ensemble learning framework ensembleCNV. ensembleCNV combines multiple individual CNVs with complementary strengths into CNV regions (CNVRs) by using heuristic algorithm, and then performs disease-related analysis on each CNVR through global likelihood model.
Overall, the results of existing machine learning-based methods show that integrating diverse CNV-related information, disease-related information and machine learning methods can boost the prediction accuracy of the CNV-disease association. However, most existing methods are limited to CNV and disease data. The prediction results contain many false positive results (i.e., CNV not related to disease is identified as disease-related CNV) due to lack of consideration of the role of gene in CNV-disease association mechanism. In addition, most methods calculate the CNV/disease similarities only on those that have at least one known CNV-disease association.
To address the aforementioned issues (or limitations), based on our previous work [24], we put forward a novel machine learning approach, namely IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels and gene data. IHI-BMLLR uses the three kinds of data to discover paths from CNV to disease. It should be noted that path means an association from a CNV to a gene and from the gene to disease. There is a biological association network where nodes represent CNVs, diseases, or genes and edges with scores represent the correlation between a pair of nodes. CNVs, genes, and diseases are connected through edges, and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Although the Pearson correlation coefficient (PCC) is a widely used correlation coefficient calculation method. PCC is strongly affected while the biweight mid-correlation remains practically the same as without the outliers [25][26][27]. Meanwhile, we used logistic regression with L1 penalty function (LLR) [28] to detect genes related to disease. We added stability selection (SS) strategy [29,30], which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find top D path associations and important CNVs. Fig.  7 illustrates the structure of IHI-BMLLR method.
Compared with the traditional CNV-disease association analysis methods, our proposed approach has the following advantages. Firstly, compared with single CNV analysis, IHI-BMLLR can detect weighted associations, and consider all CNVs and genes simultaneously. Secondly, IHI-BMLLR uses three kinds of data (CNV, gene expression and disease state label data), which can use more information about CNV-disease mechanisms and provide insight into CNV-disease complex association mechanism. Thirdly, the self-adaptive BM and weighted path search algorithm can help IHI-BMLLR accurately identify disease-related CNVs. Finally, due to IHI-BMLLR does not require prior information. Thus IHI-BMLLR is more suitable for large-scale data lacking complete prior information.
In the experiment section, we first compared the receiver operating characteristic (ROC) performance of IHI-BMLLR with two state-of-the-art CNVs detection methods (CCRET and DPtest) using four kinds of simulation data. The experiment results show that IHI-BMLLR can significantly improve the detection performance of disease-related CNVs using gene data. From the results of the boxplots, we can see that the stability of IHI-BMLLR is better than CCRET and DPtest. Prostate Adenocarcinoma (PRAD) is the most common cancer for males and the second death rate caused by cancer in men [31]. IHI-BMLLR was applied to PRAD data and obtained many CNV-disease path associations on the PRAD data from The Cancer Genome Atlas (TCGA) [32]. IHI-BMLLR identified 212 significant paths, among which we analyzed top 10 path associations and calculated statistical significance of CNVs and genes in the paths. We used real and fake data test to calculate statistical significance of the top 10 CNV-disease path associations. The software and data of IHI-BMLLR is available at  https://github.com/nathanyl/IHI-BMLLR.

Results
Our study firstly compared performance of IHI-BMLLR with two state-of-the-art methods (i.e., CCRET and DPtest) using AUC in the simulation data. The results show that IHI-BMLLR performs clearly better than other methods. Then we compared the stability performance of these three methods using boxplots. The results show that the stability of IHI-BMLLR is better than the other two methods. IHI-BMLLR also achieved a better performance in test for real and fake data. In order to find the path-related information in cancer, we applied IHI-BMLLR to PRAD data from TCGA. The results contained 212 path associations. We used disease-related CNV true labels test to calculate the statistical significance of the top 10 high-scoring path associations, and further analyzed the statistical significance of CNVs and genes in the top 10 paths.

Comparison of methods on simulation data
For the parameters in method IHI-BMLLR, we set  0.7 and 0.8, M=100, and D=2000. Fig. 1 and Fig. 2  ). Supplementary Fig. S1 and Fig.S2 show the corresponding AUC values of Fig. 1 and Fig. 2, respectively.
It can be seen that in the ROC and AUC results the IHI-BMLLR achieved higher AUC values regardless under both values of the  parameter. The results suggest that when the pathogenesis mechanism of the disease is complex, for example, when CNVs affect the disease through a complex transmission mechanism, two-biological factors association (i.e., CNVs and disease or CNVs and genes) analysis may not accurately find the causal CNVs that affects the disease. Each kind of simulation data set was randomly generated 100  times. We calculated AUC value of each method 100 times and then generated corresponding boxplot. Fig. 3 and Fig. 4. show the boxplots. It can be seen that the IHI-BMLLR has better stability.

Comparison of methods on PRAD data
PRAD is the most common cancer for males and the second death rate caused by cancer in men. Research shows that CNV makes an important contribution to the proliferation of PRAD malignant cells [33]. CNV-disease path associations (i.e., the association between CNVs and diseases through genes) can provide biological information for in-depth understanding of the complex mechanisms of cancer. Therefore IHI-BMLLR, CCRET, and DPtest were applied to the PRAD data from TCGA. PRAD data contains CNV and gene expression profiles of 490 samples. The data set includes 24776 CNVs and 20530 DNA probe expression values from the same sample which includes known and predicted genes.
Binary labels (i.e., 1 denotes disease state and 0 denotes normal state) were used to indicate the sample state.
The main limitation of using real data sets to test disease-related CNV analysis methods is the lack of experimentally verified CNV data. The lack of verification data makes it difficult to effectively evaluate the performance of a method. In order to effectively compare the performance of various methods, true labels test was applied to three methods. Firstly, FaST-LMM-EWASher was used to identify the PRAD-related CNVs, then these CNVs are defined as true labels. Secondly, we selected top 100 CNVs from the PRAD-related CNVs. Because these 100 CNVs are closely related to the development of PRAD, the method should discover as many as CNVs as possible. Finally, we compared the performance of methods in discovering these true labels. The experiment results are shown in the Fig. 5. Supplementary Table S1 contains the detail information of results.
We also used real data and fake data test to evaluate  and compare the performance of three disease-related CNV detection methods. When the ground truth is unknown, the test method is widely used in bioinformatics research [34][35][36][37]. As shown in Fig. 6, IHI-BMLLR outperforms the two state-of-art methods. For IHI-BMLLR, when 3.7% of CNVs were found on the "real" data set, at the same time 1% of CNVs were found on the "fake" data set. Supplementary Table S2 contains the numerical information of Fig. 6.

Finding path associations from PRAD data
We used IHI-BMLLR with 10-fold cross-validation to find the path associations. We set =0.7  and M=100.
The parameter  is set to 0.7 to ensure that as many biologically meaningful paths as possible are included in result. A path contains a CNV, a gene, and the disease. We found 212 paths in the PRAD data results. It should be noticed that the maximum path score is 2. We studied and analyzed the top 10 high-score paths which contain three prostate oncogenes PCGEM1 [38], ERG [39] and MXI1 [40]. The paths and corresponding statistical analysis value are shown in Table 1.

Significant analysis of CNVs and gene in independent data
In order to verify whether CNV has a specific function, we used independent data GSE79402 to calculate the Student's t-test P-values and T-scores. As the results shown in Table 2, the P-values of 10 CNVs are all less than 1E-10. The P-values indicate that we can reject the null hypothesis and consider that the biological functions of these 10 CNVs are significantly different under normal and disease states. In order to verify whether 3 oncogenes have different functions between prostate cancer cases and controls, we used Student's t test and Wilcoxon rank sum test to calculate the statistical significance of genes from independent data GSE60329. Table 3 contains the results of Wilcoxon rank sum test and Student's t test. In the result of Student's t test, the P-values of 3 oncogenes are all less than 1E-04. Meanwhile, in the result of Wilcoxon rank sum test, the P-values of 3 genes are all less than 1E-04. These two test results indicate that the 3 genes are significantly differentially expressed between prostate cancer cases and controls.
To compare the ability of methods to find differentially expressed genes. Firstly, we used a widely used genome-wide differential expression analysis method edgeR [46] to find differentially expressed genes.
Then, we applied the methods (IHI-BMLLR/CCRET/DPtest) to the data and calculated the number of differentially expressed genes found by each method. edgeR identified 100 differentially expressed genes, IHI-BMLLR, CCRET and DPtest found 63 genes, 45 genes, and 27 genes respectively.

Discussion
We identified 4 paths that contains PCGEM1. PCGEM1 produces a long non-coding RNA that is overexpressed in prostate cancer and may act a marker for tumor progression [41] [42]. Further biological research is needed to confirm the path associations found by method IHI-BMLLR. In other path associations not discussed in the paper, we detected three genes MAPK13, MCM4, and CCNB2 that are not related to PRAD. These genes are reported to be related to bladder cancer [43][44][45]. It is necessary to study the relationship between these genes and PRAD in the future. The real biological regulation mechanism in the human body is much more complicated than what we assumed. For example, the relationship between genes and genes has received extensive attention in disease research. In this article, IHI-BMLLR is dedicated to discovering paths from CNV to gene and from the gene to disease. In the future, we will study and try to propose a method for studying gene-gene associations and optimal association numbers in CNV-disease research.
Finally, the real biological regulation mechanism in the human body is much more complicated than what we assumed. For example, mechanism includes many factors such as circRNA and histone. In the future, we study how to design a machine learning framework that simultaneously studies interaction information (i.e., CNV interaction and gene interaction information) and other biological information. It is also necessary to verify the proposed paths through biological experiment.

Conclusions
In this article, we designed a novel disease-related CNV detection method IHI-BMLLR, which uses CNV, gene and disease data to find path associations. The method consists of two parts. The first part of the method is association search method. It contains adaptive BM correlation coefficient formula and LLR. The second part contains stability selection strategy and weighted search path algorithm. These two methods were used to control false positives and identify paths, respectively. The result of simulation data experiment proves that IHI-BMLLR is significantly better than two state-of-the-art methods CCRET and DPtest. The result of the boxplots indicates that the stability of IHI-BMLLR outperforms the two methods. In the results of PRAD data experiment, IHI-BMLLR identified 212 important paths. Disease-related CNV true labels test and real data and fake data test were used to calculate the statistical significance of top 10 high-score path associations. We also discovered 3 potential PRAD-related genes.

Methods and materials
In this paragraph, we introduce the notations used in this article. We used boldface uppercase to represent matrices, boldface lowercase for vectors, and lowercase letters for scalars. We denote the CNV genotype data matrix by NP   XR , where N is the sample size and P is the CNV number, j x denotes the j-th column of CNV data matrix, i x denotes the i-th row of CNV data matrix, and i j x is the (i, j) element of CNV data matrix. Gene expression matrix is represented by NQ   YR with N sample size and Q gene traits, and disease state label matrix is represented by In the following paragraphs, we introduce the methods in the machine learning framework of finding the CNV-disease path associations. We also introduce how to discover CNVs affecting genes, identify genes affecting diseases, construct a biological network, and define a mathematical formula to calculate the scores of the path associations. Finally, we show how to use a weighted path search algorithm to discover top D path associations and important CNVs. Fig. 7 illustrates the structure of IHI-BMLLR method.

Discovering paths in a biological association network
We constructed a biological association network, the nodes in the network are used to represent CNV, genes and diseases. We describe how to establish a connection between two nodes using self-adaptive BM coefficient and LLR under stability selection strategy. Self-adaptive BM and LLR are powerful techniques to find correlations between CNVs and genes or correlations between genes and diseases, and the stability selection method is used to effectively control the number of false positive results.
Based on the self-adaptive biweight mid-correlation coefficient, we computed correlation coefficients between CNVs and genes: where x and y represent CNV x genotypes and gene y expression profiles, respectively. xi is the i-th value in x vector. med(x) is used to represent the median value of the vector x. mad(x) represents the median absolute deviation of vector x. mad(x) is used to represent the median value of the newly generated numeric vector, where each element is the absolute difference between the value of corresponding element in the original vector and med(x). The parameter α is often set to 9 empirically. However, this setting does not consider the characteristic of the data. In this paper, we set α to the data-driven parameter (mad(x)+med(x))/2. The weight wi in the formula of biweight correlation coefficient can be defined as follows: where x y x xy y x y (6) where n is the number of entries in vector. The range of self-adaptive BM value is from -1 to 1. If the correlation between a pair of elements is stronger, the absolute value of BM is larger.
Next, equation (7) and (8)  In practice, our proposed method IHI-BMLLR is used to study a class of diseases and the disease state matrix is denoted by Regularization parameter can affect the performance of model, the regularization parameter λ is determined by cross validation technique. However, we tend to get many false positive results when only using cross-validation technique [29,46]. We combine the stability selection strategy when using self-adaptive BM and LLR algorithms. We will introduce the stability selection strategy later.
Calculating association score using stability selection strategy In this paper, IHI-BMLLR use self-adaptive BM and LLR with stability selection strategy to find connections in a biological network. We summarized IHI-BMLLR under stability selection in Algorithm 1. Stability selection strategy uses the resampling technique. Specifically, the steps of the stability selection strategy used in IHI-BMLLR are as follows. Firstly, half of the sample is randomly selected M times from the overall sample, for each randomly selected data, self-adaptive BM and LLR are applied to the corresponding selected data set (i.e., self-adaptive BM is applied to data set containing CNV and gene expression data, and LLR is applied to data set containing gene expression and disease state data). Secondly, in M times experiments, if the number of non-zero absolute value of coefficient between CNV and gene or gene and disease is greater than or equal to M   times, then the CNV or gene will be retained.  is a predefined parameter used to show that when M is greater than or equal to 100 times, it is sufficient to control false positives [29]. In practical application, researcher often set  in the range from 0.5 to 1. The larger the value of  , the better the false positives control at the cost of a reduced true positive rate. Meinshausen proved that, the relationship between the number of false positives and the parameter  can be established by mathematical formula under certain conditions. When detecting the potential associations between CNVs and the q-th gene, the mathematical formula between the number of false positive results and  is defined as follows.
( ) where ( ) q EV represents the expected value of false positive CNVs associated with the q-th gene, and parameter c represents the number of non-zero associations found by the IHI-BMLLR method. From formula (9) we can see that the upper limit of false positive results is inversely proportional to the parameter  . When we apply IHI-BMLLR to detect the association between genes and diseases, the same situation exists for the relationship between false positives and parameter  .
After self-adaptive BM and LLR combined with the stability selection method is used to the data set, and obtain the result, we can calculate the significance scores of the connections in the biological association network. When detecting the relationship between the q-th gene trait and the p-th CNV, the significance score of the association can be defined as follows: Based on the association score between two nodes in the biological network, we can calculate the score of path which contains a CNV, a gene and a disease. The association path composed of important connections can be regarded as a significant biological association path. In order to find important paths efficiently, we use a weighted path search algorithm, the details of the algorithm will be described in the next section.

Detecting path associations using weighted path search algorithm
There are a large number of associated paths in a complex heterogeneous biological network. In this paper, a path represents a continuous biological path which a CNV connected to a disease through a gene. In order to accurately find important path associations, a weighted path search algorithm was used to calculate the significance scores of paths and find important biological paths (i.e., high score association paths) [30]. In the biological network, significant paths tend to have larger scores.
In a biological association network, a weighted path search algorithm can be defined as follows. Firstly, in the biological network, we choose genes that are simultaneously associated with CNV and disease. We can find all existing association paths by selecting these genes. Secondly, we can obtain the score of path by summing the weighted scores (equation (10)) of each part of the path. Finally, we sort all paths in descending order of scores and select the top D high-score path associations. The weighted path score formula can be defined as follows: Selecting N/2 samples from N samples using random sampling without replacement 7.
for all selected l 9.
for all selected l 11. ,

12.
In the result section, we compare IHI-BMLLR with CCRET and DPtest on simulation data and prostate cancer data. In this section, we will introduce simulation data set and prostate cancer data. Simulation data set contains four kinds of datasets with same number of CNV-gene true associations, same number of features and different sample sizes (i.e., 1000 CNVs, 100 genes, and 1 disease state). In simulation data set, the number of CNV-gene true associations is 100 CNVs-10 genes associations, the number of samples   200, 500, 800, 1100 N  . The simulation data was generated as follows. Firstly, we generated 100 causal CNVs that are related to disease. The state label of the data is an equal number of diseases or normal states. (i.e., the same number of 0s and 1s, 0 means normal state, 1 means disease state). Secondly, we used a three-layer fully-connected neural network to generate 10 gene expression data. In the three-layer fully-connected neural network, the input layer contains 100 nodes, the middle layer contains 10 nodes, and the output layer contains 1 node. The nodes in the input layer represent CNVs, the nodes in the middle layder represent genes, and the nodes in the output layer represent disease. Thirdly, we used the TensorFlow with the back propagation (BP) algorithm to train the three-layer fully-connected neural network until the neural network can correctly predict more than 95% of disease state label nodes [47]. We used the values of the middle layer nodes as gene expression values, and the values in the input layer nodes were mapped to [-2, -1, 0, 1, 2]. Finally, we added 900 CNV values and 90 gene values to the sample. The CNV values were randomly selected from [-2, -1, 0, 1, 2], and the gene values were from Gaussian distribution. The 900 CNVs and 90 genes represent CNVs/genes that is not related to the diseases. Meanwhile, we also added noise data from Gaussian to the data. The PRAD data were downloaded from https://xenabrowser.net/ [48]. CNV profile of PRAD was measured experimentally using genome microarray and the GISTIC2 method [49,50]. The Illumina HiSeq 2000 RNA Sequencing platform was used to measure the gene expression profile [51]. The corresponding disease state label data was downloaded from TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) [52,53]. We ran simulation and prostate cancer data experiments on a computer with Intel Xeon W-3175X CPU and 256G RAM.

Criteria for evaluating method performance
In order to evaluate the performance of methods fairly, we used the receiver operating characteristic (ROC) curve to observe the performance of methods and compared the performance of methods using the area under receiver operating characteristic (AUROC). FPR (false positive rate) and TPR (true positive rate) are required to draw ROC and calculate AUROC value. We calculated TPR and FPR based on the confusion matrix (Fig. 8).

Disease-related CNV true labels test
In the disease-related CNV true labels test, FaST-LMM-EWASher [54], which is a conventional linear regression method, was used to predefine disease-related CNVs. The pre-defined CNVs are treated as true labels. Then, we compared performance of IHI-BMLLR, CCRET and DPtest in accurately identify the true labels. An excellent disease-related CNV detection method should report as many true labels as possible in the resulting CNVs.

Real data and fake data test
In the designed real data and fake data test, three methods (i.e., IHI-BMLLR, CCRET and DPtest) were applied to real CNV-disease data and obtained a set of "real" results; then these three CNV detection methods were applied to a "fake" data that exchanges samples between two category conditions and obtained a set of "fake" results. Compared with the expected biologically significant "real" results, the "fake" results have no biological significance. An excellent CNV detection method should find as many as CNVs as possible in the "real" results, while reporting as few CNVs as possible in the "fake" results. In addition, when methods find the same amount of CNVs on the "fake" data set, the method that detects more CNVs on the "real" data set has better performance.

The statistical significance calculation method of paths
In order to measure the statistical significance of the 10 paths. We used a method of calculating statistical significance by comparing the original path score and the random path score. In bioinformatics research, this is a widely used statistical significance calculation method [55][56][57][58]. Firstly, we randomly rearranged CNV samples, gene samples and disease labels, and then calculated the scores of the paths. Secondly, we repeated the previous steps 5000 times, calculated the score of the path in each randomly generated sample, and then constructed a histogram of the scores. Thirdly, we calculated the p-value of the path by calculating the proportion of the path score in 5000 random data that is less than the path score in the original data. The null hypothesis in this paper is that path scores in random data are randomly distributed scores. The alternative hypothesis is that the path score is related to the structure of the sample data. Assuming that the p-value of score is 0.001, this means that there are random path scores less than the original path score under null hypothesis.

Consent for publication
Not applicable.

Availability of data and materials
The software of IHI-BMLLR is available at https://github.com/nathanyl/IHI-BMLLR, and the datasets used are available from the corresponding references.