We developed MDLCN, a multimodal deep learning model that predicts cell-type-specific FGNs by integrating single-cell gene expression data with a global protein interaction network (Fig. 1). The expression signatures of a gene pair were first transformed into a co-expression matrix that captures the joint density of the pair's co-expression pattern across the cells of a particular cell type. We also computed a set of proximity features for each gene pair based on a global protein interaction network assembled from protein physical interaction evidence. The co-expression matrix and global proximity features of each gene pair were then integrated by the MDLCN model to predict its functional relationship status. The MDLCN model was trained separately for each cell type using cell-type-specific gold standards.
Training dataset
We assembled a gold standard of gene pairs for building cell-type-specific FGNs following the approach for building tissue-specific FGNs in a previous study [4]. We first constructed a cell-type-naive functional relationship gold standard from 564 expert-selected gene ontology (GO) terms and experimentally derived gene annotations. Gene pairs co-annotated to expert-selected terms were treated as positive examples (i.e. functionally related), and pairs not co-annotated to any of these terms were considered negative examples. Next, we identified cell-type-specific genes, defined as the top-ranked genes by a specificity score computed as the average expression level of the gene in each cell type divided by the total expression of the gene across all cell types [5]. The higher the specificity score of a gene in a cell type, the more specific the gene was to that cell type. Then, we combined the cell-type-naive gold standard with the cell-type-specific genes to construct the cell-type-specific gold standard.
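As a sketch, the specificity score described above can be computed from a genes × cell-types matrix of average expression values; the function and variable names here are illustrative, and the exact normalization used in [5] may differ in detail:

```python
import numpy as np

def specificity_scores(mean_expr):
    """Specificity score per gene and cell type: the gene's average
    expression in a cell type divided by the sum of its average
    expression across all cell types (rows = genes, cols = cell types)."""
    totals = mean_expr.sum(axis=1, keepdims=True)
    # Guard against genes with zero expression in every cell type.
    totals[totals == 0] = 1.0
    return mean_expr / totals

# Toy example: 3 genes x 2 cell types.
expr = np.array([[9.0, 1.0],
                 [5.0, 5.0],
                 [0.0, 4.0]])
scores = specificity_scores(expr)
```

Genes are then ranked within each cell type (column) by this score to pick the cell-type-specific set.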
Ultimately, our cell-type-naive gold standard included 3,619,063 positive gene pairs and 49,095,410 negative gene pairs. Table S1 presents the positive and negative class construction procedure (see Additional file 1). The positive gene pairs in our cell-type-specific gold standard were a subset of the positive examples in the cell-type-naive gold standard in which both genes were cell-type-specific, or one was cell-type-specific and the other was a housekeeping gene as defined previously [6]. The negative gene pairs in our cell-type-specific gold standard were either: 1) positive examples in the cell-type-naive gold standard in which one gene was specific to the corresponding cell type and the other was specific to a different cell type; 2) positive examples in the cell-type-naive gold standard in which one gene was specific to a different cell type and the other was a housekeeping gene; 3) negative examples in the cell-type-naive gold standard in which both genes were cell-type-specific, or one was cell-type-specific and the other was a housekeeping gene; 4) negative examples in the cell-type-naive gold standard in which one gene was specific to the corresponding cell type and the other was specific to a different cell type; or 5) negative examples in the cell-type-naive gold standard in which one gene was specific to a different cell type and the other was a housekeeping gene.
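The labeling rules above can be sketched as a small decision function; the category labels ('this', 'other', 'hk') and the function name are illustrative shorthand, not names from the study:

```python
def ct_label(naive_positive, cat_a, cat_b):
    """Assign a cell-type-specific gold-standard label to a gene pair.

    naive_positive: label of the pair in the cell-type-naive gold standard.
    cat_a, cat_b: one of 'this' (specific to the cell type of interest),
    'other' (specific to a different cell type), 'hk' (housekeeping),
    'none' (neither). Returns 1 (positive), 0 (negative), or None (excluded).
    """
    cats = {cat_a, cat_b}
    pair_in_ct = cats == {'this'} or cats == {'this', 'hk'}  # both in this cell type
    cross_ct = cats == {'this', 'other'}                     # split across cell types
    other_hk = cats == {'other', 'hk'}                       # other cell type + housekeeping
    if naive_positive:
        if pair_in_ct:
            return 1
        if cross_ct or other_hk:
            return 0
    else:
        if pair_in_ct or cross_ct or other_hk:
            return 0
    return None  # pair not used for this cell type
```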
The single-nuclei gene expression data used in this study included 14,873 nuclei from the human brain that were clustered and annotated to major brain cell types in a previous study [7], including 3,400 excitatory neurons, 1,715 inhibitory neurons, 1,897 astrocytes, 245 endothelial cells, 386 microglia, 2,963 oligodendrocytes, and 682 oligodendrocyte precursor cells. We normalized the single-nuclei count data with the "LogNormalize" function of Seurat, which divides each feature count by the total counts of its cell, multiplies by a scale factor (10,000), and log-transforms the result [8]. Cell-type-specific genes were those ranked in the top 5% by specificity score for all cell types except excitatory neurons, for which the top 10% of genes were used so that a sufficient number of labeled gene pairs could be collected for model training. To balance the number of positive and negative gene pairs in each training set, the negative class was randomly downsampled [9]. The number of positive and negative gene pairs in each cell type is presented in Table S2 (see Additional file 1).
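A minimal sketch of the "LogNormalize" step, assuming a genes × cells count matrix; Seurat applies the natural log of (1 + x) after scaling:

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Seurat-style "LogNormalize": divide each gene count by the cell's
    total counts, multiply by a scale factor, then apply log1p.
    `counts` is a genes x cells matrix."""
    totals = counts.sum(axis=0, keepdims=True)
    # Avoid dividing by zero for empty cells.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * scale_factor)

# Toy example: 2 genes x 2 cells, each cell with 10 total counts.
norm = log_normalize(np.array([[1.0, 3.0],
                               [9.0, 7.0]]))
```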
Transforming single-nuclei gene expression data to 2D co-expression matrices
We encoded the single-nuclei gene expression data of each gene pair as an image-like 2D co-expression matrix. Specifically, we first imputed and smoothed the single-nuclei expression data from the human brain described above using MAGIC, a Markov affinity-based graph imputation method [10]. The range of the expression values of each gene was then divided into equal bins, and for each pair of genes a 2D co-expression matrix was constructed by counting the number of cells whose expression values of the two genes fell into each pair of bins [3]. As the number of bins played an important role in model performance in our experiments, it was tuned and set to 10, at which the model achieved the best prediction accuracy.
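The binning step can be sketched with NumPy's `histogram2d`, assuming equal-width bins over each gene's observed expression range; any further transformation of the counts (e.g. log-scaling) is not shown:

```python
import numpy as np

def coexpression_matrix(x, y, n_bins=10):
    """2D co-expression matrix for one gene pair: bin each gene's
    (imputed) expression range into `n_bins` equal-width bins and count
    the cells falling into each (bin_x, bin_y) cell of the grid."""
    hist, _, _ = np.histogram2d(x, y, bins=n_bins)
    return hist

# Toy example: two genes measured in 5 cells, with 2 bins per gene.
x = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
y = np.array([1.0, 0.9, 0.5, 0.1, 0.0])
m = coexpression_matrix(x, y, n_bins=2)
```

Each matrix is then fed to the convolutional branch of the model as a single-channel image.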
Gene proximity features from global protein interaction network
Global protein interaction networks contain protein structural and functional information that is informative for cell-type-specific gene functional relationships. We assembled a global protein interaction network based on experimentally validated protein physical interaction evidence from multiple resources, including BioGRID [11], IntAct [12, 13], APID [14] and InWeb [15].
After overlapping with genes in the single-nuclei gene expression dataset, the global protein interaction network contained 16,873 genes and 142,340,628 pairs of physical interactions. We used five metrics to measure the degree of similarity or proximity between a protein pair in the global protein interaction network: Common Neighbors (CN), Jaccard's Coefficient (JC), Preferential Attachment (PA), Adamic-Adar Coefficient (AA), and Path Distance (PD). These metrics capture different topological relationships between two proteins in the network. In particular, CN counts the number of common neighbors of the two proteins, JC quantifies the overlap between their neighborhoods relative to their union, PA estimates the likelihood of a link by the product of the two proteins' degrees (their "hubness"), AA sums the inverse logarithm of the degrees of their shared neighbors, down-weighting common neighbors that are themselves hubs, and PD measures the length of the shortest path between the two proteins [16].
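The five proximity metrics can be computed with NetworkX, as sketched below on a toy graph; the feature ordering and the handling of disconnected pairs are assumptions:

```python
import networkx as nx

def proximity_features(G, u, v):
    """Five topological proximity features for a protein pair (u, v):
    common neighbors (CN), Jaccard's coefficient (JC), preferential
    attachment (PA), Adamic-Adar (AA), and shortest-path distance (PD)."""
    cn = len(list(nx.common_neighbors(G, u, v)))
    jc = next(iter(nx.jaccard_coefficient(G, [(u, v)])))[2]
    pa = next(iter(nx.preferential_attachment(G, [(u, v)])))[2]
    aa = next(iter(nx.adamic_adar_index(G, [(u, v)])))[2]
    try:
        pd = nx.shortest_path_length(G, u, v)
    except nx.NetworkXNoPath:
        pd = float('inf')  # disconnected pair
    return [cn, jc, pa, aa, pd]

# Toy network: a square A-B-C-D with one diagonal A-C.
G = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A'), ('A', 'C')])
feats = proximity_features(G, 'B', 'D')
```

Here B and D share both neighbors (A and C), so CN = 2 and JC = 1, while the shortest B-D path has length 2.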
Predicting cell-type-specific FGNs
We developed a multimodal deep learning model to predict cell-type-specific gene functional relationships from co-expression matrices and proximity features between two proteins in the global protein interaction network. For each pair of genes, the co-expression matrix and the vector of proximity features were treated as two modalities in our model: a co-expression-processor modality that extracts representations from the co-expression matrix and a proximity-processor modality that extracts representations from the proximity features, as shown in Fig. 2. In the co-expression-processor modality, the input layer is the co-expression matrix of each gene pair. The modality consists of three convolutional layers, which map local conjunctions of features from previous layers to a feature map. Immediately after each convolutional layer, a max pooling layer downsamples the convolutional output by taking the maximum value over an input window. At the end of this modality, a flatten layer converts the 2D feature maps extracted by the convolutional layers into 1D features while retaining their order, and a densely connected layer compiles the extracted features into the modality's representation. The proximity-processor modality consists of an input layer for the five proximity features of each gene pair, four densely connected layers, and a flatten layer. The representations output by the two modalities are concatenated into a high-dimensional feature vector in a fusion layer and transformed through three densely connected layers. Finally, the feature vector is used in the output layer to predict the probability of a cell-type-specific functional connection between the gene pair. We used the rectified linear unit (ReLU) as the activation function across the whole network except the output layer, where a sigmoid function was used for binary classification.
Dropout regularization was used in the multimodal deep learning model to prevent overfitting. The model was implemented using the Keras library in Python. We chose binary cross-entropy as the loss function and the Adam optimizer to update weights. The hyperparameters of the model, including the number of filters in the convolutional layers, the kernel size of the convolutional layers, the kernel size of the max pooling, the size of the dense layer in the co-expression modality, the size of the dense layers in the proximity modality, the position and rate of dropout, and the optimizer type, were tuned on the validation set and are summarized in Table S3 in Additional file 1.
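A sketch of the two-modality architecture in Keras: the layer counts follow the description above, but the filter numbers, kernel sizes, layer widths, and dropout rate shown here are illustrative placeholders rather than the tuned values in Table S3:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mdlcn(n_bins=10, n_proximity=5):
    """Two-branch model: a CNN over the co-expression matrix and dense
    layers over the proximity features, fused and passed through three
    dense layers to a sigmoid output."""
    # Co-expression-processor modality: three conv + max-pool blocks.
    coexp_in = keras.Input(shape=(n_bins, n_bins, 1))
    x = coexp_in
    for filters in (16, 32, 64):  # illustrative filter counts
        x = layers.Conv2D(filters, 2, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2, padding='same')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation='relu')(x)

    # Proximity-processor modality: four stacked dense layers.
    prox_in = keras.Input(shape=(n_proximity,))
    y = prox_in
    for units in (32, 32, 16, 16):  # illustrative widths
        y = layers.Dense(units, activation='relu')(y)

    # Fusion layer: concatenate, then three dense layers with dropout.
    z = layers.concatenate([x, y])
    for units in (64, 32, 16):
        z = layers.Dense(units, activation='relu')(z)
        z = layers.Dropout(0.2)(z)
    out = layers.Dense(1, activation='sigmoid')(z)

    model = keras.Model([coexp_in, prox_in], out)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```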
Evaluation of model performance
We evaluated the performance of the MDLCN model in predicting functional relationships of gene pairs on three different test sets: 1) a dependent test set, containing gene pairs in which both genes appeared in the training set; 2) a partially dependent test set, containing gene pairs in which only one gene appeared in the training set; and 3) an independent test set, containing gene pairs in which neither gene appeared in the training set. We used the area under the ROC curve (AUC-ROC) and the area under the Precision-Recall curve (AUC-PRC) as evaluation metrics with five-fold cross-validation. To evaluate the contribution of the features from the global protein interaction network, we compared MDLCN with CNNC, which uses only the 2D co-expression matrix to predict interactions among genes. To assess the strength of the 2D co-expression matrix itself, we also compared MDLCN with a boosting tree model that uses the Pearson correlation to measure the co-expression between two genes.
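The two evaluation metrics for a single cross-validation fold can be computed with scikit-learn, as sketched below on toy labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

def evaluate(y_true, y_score):
    """AUC-ROC and AUC-PRC for one fold's predicted probabilities."""
    auroc = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    auprc = auc(recall, precision)
    return auroc, auprc

# Toy fold: perfectly separated scores give both areas equal to 1.
y_true = np.array([1, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.3, 0.1])
auroc, auprc = evaluate(y_true, y_score)
```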
Topological analysis of cell-type marker genes
To evaluate the constructed cell-type-specific networks, we further examined whether cell-type marker genes have a distinctive topological structure across the different cell-type-specific FGNs. Cell-type marker genes were taken from a previous study [17] and were defined as genes with at least a one-log-fold change in expression level when cells of a given cell type were compared against all other cells. The hubness of each marker gene was computed as the sum of the weights of the edges directly connected to that gene. We also computed a topological specificity score for each marker gene to test whether cell-type marker genes have distinctive localization compared to random networks, as done previously [18]. The topological specificity score represents the hubness of a gene in the predicted network normalized by its hubness distribution in random networks created by re-shuffling edge weights in the predicted network. The topological specificity score (topS) for each marker gene g was calculated as

topS(g) = (h_g − μ_rand(g)) / σ_rand(g),

where h_g is the hubness of gene g in the predicted network, and μ_rand(g) and σ_rand(g) are the mean and standard deviation of its hubness across the random networks.
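The normalization of hubness against weight-shuffled random networks can be sketched as follows; the edge-list representation, the z-score form of the normalization, and the function name are assumptions:

```python
import numpy as np

def topological_specificity(weights, edges, gene, n_rand=1000, seed=0):
    """Hubness (sum of incident edge weights) of `gene`, expressed as a
    z-score against `n_rand` networks in which the edge weights are
    re-shuffled over the same fixed set of edges."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    incident = np.array([gene in e for e in edges])  # edges touching the gene
    hub = weights[incident].sum()
    rand_hubs = np.empty(n_rand)
    for i in range(n_rand):
        shuffled = rng.permutation(weights)  # re-assign weights to edges
        rand_hubs[i] = shuffled[incident].sum()
    sd = rand_hubs.std()
    return (hub - rand_hubs.mean()) / sd if sd > 0 else 0.0
```

Because only the weights are permuted, the degree of each gene is preserved while its weighted hubness is randomized.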
Connectivity of disease genes in predicted cell-type-specific FGNs
To evaluate whether disease genes show cell-type-specific modularity in the constructed cell-type-specific networks, we assessed the connectivity strength among disease genes in each network. We considered two brain disorders: autism spectrum disorder (ASD) and Alzheimer's disease (AD). We collected 408 high-confidence ASD risk genes from the SFARI database [19] and 1,611 genes implicated in AD from the DisGeNET database [20]. We calculated the average connectivity over all pairs of disease genes for each network and compared it with a background distribution from 1,000 random gene sets matched to the disease genes in gene number and gene length. We used the Z-score to test the deviation of the connectivity of disease genes from the background distribution.
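This comparison can be sketched as a z-score of the mean pairwise edge weight among disease genes against matched random gene sets; the adjacency-matrix representation is an assumption, and matching by gene length is not shown:

```python
import numpy as np

def connectivity_zscore(W, disease_idx, matched_sets):
    """Average edge weight over all pairs of disease genes in a weighted
    network W (symmetric adjacency matrix), compared by z-score to the
    same statistic on random, size-matched gene sets."""
    def avg_connectivity(idx):
        sub = W[np.ix_(idx, idx)]
        n = len(idx)
        # Mean over the n*(n-1)/2 unordered pairs (diagonal excluded).
        return sub[np.triu_indices(n, k=1)].mean()

    obs = avg_connectivity(disease_idx)
    background = np.array([avg_connectivity(s) for s in matched_sets])
    return (obs - background.mean()) / background.std()
```

A large positive z-score indicates that the disease genes are more tightly connected in that cell type's network than matched random gene sets.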