A joint optimization framework integrated with biological knowledge for clustering incomplete gene expression data

Clustering algorithms have been successfully applied to identify co-expressed gene groups from gene expression data. Missing values often occur in gene expression data, which presents a challenge for gene clustering. When partitioning incomplete gene expression data into co-expressed gene groups, missing value imputation and clustering are generally performed as two separate processes. These two-stage methods are likely to result in unsuitable imputation values for clustering task and unsatisfying clustering performance. This paper proposes a multi-objective joint optimization framework for clustering incomplete gene expression data that addresses this problem. The proposed framework can impute the missing expression values under the guidance of clustering, and therefore realize the synergistic improvement of imputation and clustering. In addition, gene expression similarity and gene semantic similarity extracted from the Gene Ontology are combined, as the form of functional neighbor interval for each missing expression value, to provide reasonable constraints for the joint optimization framework. The experiments are carried out on several benchmark data sets. In terms of the average improvement rate over the data sets and different missing rates, our framework can reduce the imputation error by 6.4–14.7% and increase the clustering accuracy by 4.0–10.1% compared with six popular and promising methods. Furthermore, biological significance of the identified gene clusters is reported to evaluate the effectiveness of the proposed framework.


Introduction
The recent development of biological experiments has generated vast amounts of gene expression data. Thus, extracting the intrinsic patterns from the enormous number of genes has become a significant challenge (Chung et al. 2006;Maâtouk et al. 2019). As an essential unsupervised data mining method, clustering has become one of the most promising tools to analyze gene expression data. The major task in gene expression data clustering is to identify coexpression gene groups, which can provide a useful basis for the further investigation of gene function and gene regulation in the field of functional genomics (Chen et al. 2019b;Chung et al. 2006;Giri and Saha 2020;Maâtouk et al. 2019;Maulik et al. 2009;Saha et al. 2013;Stegmayer et al. 2012).
One underlying assumption of the conventional clustering algorithms is that all data are complete. However, biological experiments inevitably generate data with missing values, the presence of incomplete data makes it challenging to perform clustering analysis on such gene expression data (Moorthy et al. 2014;Yu et al. 2017). This problem has attracted the attention of researchers in recent years. A straightforward way is to discard the genes with missing components and perform clustering on the remained complete matrix. In such algorithms, the discarded genes cannot be partitioned for further analysis and would obviously cause information loss. For this reason, various imputation strategies have been proposed for incomplete microarray data. The typical approaches are to use the global or local information from within the expression data to fill up the missing values (Acuna and Rodriguez 2004;Buuren and Oudshoorn 2011;de Souto et al. 2015;Kim et al. 2005;Oba et al. 2003;Troyanskaya et al. 2001;Yu et al. 2017). Though demonstrating good imputation performance, we observe that the aforementioned algorithms treat imputation as an independent task, and the suitability of the imputation values for subsequent analysis is ignored. When identifying co-expressed genes from incomplete gene expression data, a common way is first to fill the missing values with an imputation approach and then apply a clustering algorithm to these imputed data (de Souto et al. 2015;Kim et al. 2005;Moorthy et al. 2014;Saha et al. 2013;Yu et al. 2017). These ''two-stage'' methods are quite popular and simple to implement, but they also suffer from the drawback that the processes of imputation and clustering are disconnected. Obviously, these ''two-stage'' methods prevent the collaborative optimization of the two learning processes, thus leading to imputation values that are unsuitable for clustering and unsatisfying clustering performance. This problem has been observed in other applications, such as incomplete multi-view clustering (Liu et al. 2019) and multiple kernel clustering with incomplete kernels (Liu et al. 2020). In the problem of clustering incomplete gene expression data, missing value imputation and gene clustering are inherently related, as both tasks use the correlation information within the data. Therefore, investigating the joint optimization on imputation and clustering of incomplete gene expression data improves both imputation and clustering accuracy. To the best of our knowledge, no work has been reported on this topic. On the other hand, with the unremitting generation of biological knowledge, publicly available knowledge bases, e.g., Gene Ontology (GO) (Ashburner et al. 2000), have been successfully applied to missing values imputation (Tuikkala et al. 2006), feature selection (Mitra and Ghosh 2012), biomarker detection (Cindy 2020), and multi-view gene clustering (Giri and Saha 2020). Compared with the purely data-driven approaches, the analysis accuracy of gene expression data can be significantly improved by incorporating biological knowledge (Cindy 2020;Giri and Saha 2020;Moorthy et al. 2014;Tuikkala et al. 2006). Therefore, it is predictable that the imputation and clustering accuracy can be further improved if biological knowledge is involved in guiding our joint optimization framework.
With the motivations stated above, this study propose a novel joint optimization framework called MOC-FNI (multi-objective clustering algorithm based on functional neighbor interval) for the problem of clustering incomplete gene expression data. In our algorithm, functional neighbor intervals for missing expression values are constructed based on the combination of gene expression similarity and GO semantic similarity, which can act as interval constraints in the joint optimization. The contributions of the proposed method can be summarized as follows: (i) The proposed functional neighbor intervals can sufficiently integrate multi-source information, including gene expression information and GO semantic information, which are beneficial to provide reasonable constraints to guide the optimization process and make the results more biologically meaningful. (ii) MOC-FNI can realize synergistic improvement of imputation and clustering in the framework of nondominated sorting genetic algorithm-II (NSGA-II) (Deb et al. 2002). Therefore, with the constraints of functional neighbor intervals, MOC-FNI can obtain imputations guided by both multi-source information and cluster validity, as well as gene clusters based on meaningful and reasonable imputations.
The rest of this paper is organized as follows. Section 2 reviews the related works on the imputation and multiobjective clustering of gene expression data. Brief background on imputation methods for gene expression data and multi-objective optimization is demonstrated in Sect. 3. Section 4 presents the proposed MOC-FNI method. Experiments on several bench-mark data sets are shown in Sect. 5. Conclusion and further work are summarized in the last section.

Related works
Microarray data often contain missing values. The leading causes include hybridization failures, artifacts on the microarray, image noise and corruption, etc. (Song et al. 2014;Tuikkala et al. 2006;Yu et al. 2017). Owing to the economic and experimental limitations, it is not always practical to repeat the experiments. Thus, many efforts have been devoted to exploring the missing value imputation, which makes it possible to realize a reliable analysis of incomplete gene expression data.
Generally speaking, imputation methods for gene expression data can be categorized into global-based, localbased and knowledge-assisted methods. The global-based methods assume a correlation structure in the gene data set and focus on such information to impute the missing values. Representative methods include the singular value decomposition (SVD) imputation (SVDimpute) (Troyanskaya et al. 2001), mean imputation (MEANimpute) (Acuna and Rodriguez 2004), and Bayesian principal component analysis (BPCA) (de Souto et al. 2015;Oba et al. 2003). In contrast, the local-based methods exploit local similarity structures among genes. Representative methods include the k-nearest neighbors (KNN) imputation (KNNimpute) (Troyanskaya et al. 2001), local least squares imputation (LLSimpute) (Kim et al. 2005), and the multivariate imputation by chained equations (MICE) (Buuren and Oudshoorn 2011). Knowledge-assisted methods integrate biological knowledge, such as annotation information from Gene Ontology (GO) (Ashburner et al. 2000), into the imputation process. Table 1 shows a summary of several baseline imputation methods for incomplete gene expression data. These methods are still widely used in practical applications in bioinformatics analysis (Chen et al. 2019a;Ghosh et al. 2021;Mohammed et al. 2021;Paul et al 2017;Rahmatbakhsh et al. 2021;Tu et al. 2016;Yang et al. 2015).
Once the missing values are imputed, machine learning techniques, including clustering, can be applied to analyze the complete gene expression data sets. The purpose of clustering gene expression data is to identify co-expression gene groups that indicate co-function and co-regulation. In such problems, partitional clustering techniques, such as K-means and Fuzzy C-means (Bezdek 1981), are often applied to optimize the compactness of the clusters. However, these single-objective clustering algorithms cannot cover different characteristics of the gene expression data and may lead to improper gene clusters (Mukhopadhyay et al. 2013). In recent works, various multi-objective clustering algorithms are proposed to deal with the gene clustering problem. Generally, multi-objective clustering algorithms simultaneously optimize two conflicting objective functions, such as intra-cluster compactness and inter-cluster separation, overall deviation and connectivity, symmetry and separation. In these newly proposed multi-objective clustering algorithms, NSGA-II (Deb et al. 2002) and archived multi-objective simulated annealing (AMOSA) (Bandyopadhyay et al. 2008) are the two most commonly used optimization frameworks. Table 2 shows a summary of multi-objective clustering algorithms for gene expression data.
When identifying co-expressed genes from incomplete gene expression data, the aforementioned methods often disconnect the imputation and clustering tasks. In these two-stage scheme, the imputation is first implemented as a preprocessing step and clustering techniques are consequently applied to the recovered gene expression data (Kim et al. 2005;Saha et al. 2013;de Souto et al. 2015;Yu et al. 2017). These two-stage methods are quite popular for their simplicity and easy implementation. However, recent literature have pointed out that these two-stage methods hinder the collaborative improvement of the two learning processes and thus affect the clustering performance (Liu et al. 2019(Liu et al. , 2020. Besides, existing imputation methods generally filled in each missing value separately and ignored the overall impact of imputation values on data analysis, such as clustering. Therefore, our goal in this paper is to develop a multi-objective joint optimization framework for clustering incomplete gene expression data, with a view to improving imputation and clustering accuracy synergistically. In addition, biological knowledge is integrated into our framework.

Background
In this section, background information on imputation algorithms for incomplete gene expression data and multiobjective optimization is presented, respectively.

Imputation algorithms for incomplete gene expression data
In most cases, the missing values in incomplete gene expression data are imputed based on the global or local statistical information of gene expression data. A simple and commonly used method is the mean imputation (MEANimpute) (Acuna and Rodriguez 2004), where each missing value is filled in with the average of the corresponding attribute values of all genes with complete expression values. Troyanskaya et al. (2001) applied the knearest neighbors rule to the imputation problem, considering the similarity between neighboring genes. Here, we denote $ G ¼ ½g ij NÂM as an incomplete gene expression data set with N genes and M samples (time points in many cases), and $ g b as a target gene which retains at least one missing component. In KNNimpute, for each target gene $ g b , k complete neighboring genes are first found and the missing value is filled in with the weighted average of the corresponding attribute values of these genes. The weight w i of i-th neighboring gene $ g i is calculated by Eq. 1.
where d $ g b ; $ g i is the distance between $ g i and the target gene $ g b . In the singular value decomposition (SVD) imputation (SVDimpute) (Troyanskaya et al. 2001), the SVD is employed to obtain a set of mutually orthogonal expression patterns, called eigengenes. Then, several most significant eigengenes are selected and their linear combination are used to reconstruct the missing values. It is claimed that KNNimpute is superior to SVDimpute in accuracy and robustness (Troyanskaya et al. 2001). Bayesian principal component analysis (BPCA) (de Souto et al. 2015;Oba et al. 2003) is another widely used global imputation method, which involves principal component regression to estimate the missing part in the expression vector, Bayesian estimation and expectationmaximization like repetitive algorithm to estimate the posterior distributions for the model parameter. In BPCA, a gene expression vector g is views as a linear combination of l principle axis vectors, as given by Eq. 2.
where y j represents the factor score, s j ¼ ffiffiffiffi k j p l j is the j th principle axis vector with k j and l j being the j th principle eigenvalue and the eigenvector of the covariance matrix for expression data, e is the residual error. To impute the missing part $ g mis b of the target gene $ g b , factor scores y ¼ y 1 ; y 2 ; . . .; y l ð Þcan be obtained by minimizing the residual and K obs are the observed part of $ g b and principle axis vectors, respectively. Then $ g mis b can be estimated by Eq. 3.
where K mis represents the missing part of principle axis vectors. BPCA adopts probabilistic PCA to obtain K ¼ K obs ; K mis À Á before the aforementioned procedure. For the data with a global covariance structure, BPCA always outperformed KNNimpute and SVDimpute methods (Oba et al. 2003;de Souto et al. 2015;Yu et al. 2017).
Besides, local least squares imputation (LLSimpute) (Kim et al. 2005) introduces a multiple linear regression model to impute each missing value based on neighboring genes. Specifically, the regression model is expressed as Eq. 4.
where z is the regression coefficient vector, A is the matrix composed of similar genes of the target gene $ g b , A obs and A mis represent the observed and missing parts of A that correspond to the observed and missing samples of $ g b , respectively. After minimizing Eq. 4 and obtaining z, the missing part $ g mis b can be estimated by Eq. 5.
Another notable approach is the multivariate imputation by chained equations (MICE) (Buuren and Oudshoorn 2011), whose aim is to produce multiple imputations and integrate these results to fill in the missing values. Especially, MICE is freely available as an R package (Buuren and Oudshoorn 2011).
Knowledge-assisted methods have gradually become a research hotspot in gene analysis in recent years. Gene Ontology (GO) (Ashburner et al. 2000) is one of the most widely used publicly available knowledge bases, which describes biological annotations in the form of directed acyclic graphs. GO contains three dynamic sub-ontologies, and consists of a set of terms related to biological process (BP), cellular component (CC), and molecular function (MF) along with relations between terms. For the imputation problem of gene expression data, a popular way to apply GO annotation information is to calculate the semantic similarity of GO terms and that of genes. Tuikkala et al. (2006) proposed to integrate GO annotations into the KNN imputation algorithm (GOimpute). The experimental results verify the positive effect of GO in improving imputation accuracy. The details of GOimpute will be discussed in the next section.

Multi-objective optimization
There has been an increasing interest in applying metaheuristic algorithms to optimization problems in recent years. The genetic algorithm (GA) and simulated annealing (SA), as two representatives, have received considerable attention and are widely used in various applications, including biclustering (Maâtouk et al. 2019), instance selection (Chen et al. 2015), airline crew pairing problem (Demirel and Deveci 2017;Deveci and Demirel 2018), traveling salesman problem (Doumi et al. 2021), and site selection of electric charging stations (Türk et al. 2021). Due to the presence of multiple conflicting objectives in many real applications, various multi-objective evolutionary algorithms are proposed and shown to be more effective and reliable than single-objective optimization (Bandyopadhyay et al. 2008;Deb et al. 2002;Erdogan et al. 2021).
A multi-objective optimization problem can be generally formulated as Eq. 6.
where K is the number of objective functions, vector x is an element in the decision space, and f 1 x ð Þ; f 2 x ð Þ; Á Á Á ; f K x ð Þ are K objectives to be minimized. Unlike simple-objective optimization, multi-objective optimization provides concurrently achieved trade-off between the multiple objective functions (Erdogan et al. 2021), and generally yields a set of optimal solutions called Pareto optimal set (PS). The concept of Pareto optimality can be given as follows: for two feasible solutions x and x Ã in the decision space, The objective function values corresponding to the solutions in PS compose Pareto optimal front (PF).
As one of the most popular multi-objective evolutionary algorithms, NSGA-II (Deb et al. 2002) is characterized as non-dominated sorting and crowding distance. The main idea of non-dominated sorting is to distribute individuals into different non-dominated fronts based on the Pareto dominate relationship. Specifically, the first front F 1 is composed of solutions which are not dominated by any other ones. Then, these solutions are temporarily removed from the population. The solutions that are now nondominated are assigned to the second front F 2 . The above process is repeated until all the solutions have been assigned to a front. In NSGA-II, crowding distance is used to maintain the population diversity. First, all the solutions in the same front are sorted in an ascend order according to each of the objective function values. Then, the crowding distance of the k-th solution x k can be expressed as Eq. 7.
where f i x kÀ1 ð Þ and f i x kþ1 ð Þ are the i-th objective function value of x kÀ1 and x kþ1 , respectively. Especially, the crowding distances of the first and last solutions (solutions with the smallest and largest objective function values) is set to infinite. The procedure of the proposed NSGA-II can be described as follows: Step 1: Set the genetic population size F, maximal number of generations L max , and the genetic parameters. Initialize the genetic population P 1 ð Þ.
Step 2: When the genetic generation index is l l ¼ 1; 2; . . .; L max ð Þ , generate F offspring Q l ð Þ by applying genetic operators on P l ð Þ.
Step 4: Set the parent population of the next generation P l þ 1 ð Þ¼£, front counter i ¼ 1.
Step 5: If the number of individuals in P

and repeat
Step 5; Otherwise, go to Step 6.
Step 6: Sort F i according to crowding distance, set Step 7: If genetic generation index l ¼ L max , stop and get the set of Pareto optimal solutions P s ; otherwise set l ¼ l þ 1 and return to Step 2.

Proposed MOC-FNI for incomplete gene expression data
To identify the co-expressed gene groups from incomplete gene expression data, we propose a novel clustering method, called multi-objective clustering algorithm based on the functional neighbor interval (MOC-FNI). In MOC-FNI, the imputation of missing expression values and clustering are optimized synergistically in the framework of NSGA-II. To provide reasonable constraints for the optimization, functional neighbor intervals are constructed for the missing expression values based on the combination of gene expression similarity and GO semantic similarity. Our method makes full use of multi-source information, including gene expression information and GO semantic information. The proposed multi-objective joint optimization can promote the synergistic improvement of imputation and clustering. Thus, MOC-FNI can obtain imputation values guided by both multi-source information and cluster validity, as well as clusters based on meaningful and reasonable imputation results. For a set of genes to be analyzed, each gene can be annotated with several GO terms. Thus, the functional similarity between genes can be deduced based on the term similarity. In MOC-FNI, an aggregate information content (AIC) (Song et al. 2014) based approach is adopted to measure the semantic similarity of GO terms.

Determination of functional neighbor intervals
For the given GO terms t 1 and t 2 , the AIC semantic similarity (Song et al. 2014) is defined as Eq. 8.
SVðtÞ ¼ X with T t being the set of ancestors of term t including t itself in the GO graph. p t ð Þ is the probability of t occurring in the GO database and IC t ð Þ ¼ ÀlogpðtÞ reflects the information content of t. Therefore, based on the knowledge represented by 1=IC t ð Þ, SW t ð Þ measures the semantic weight and SV t ð Þ is the semantic value of GO term t by adding the semantic weights of all its ancestors.
When deducing the functional similarity between genes from the term similarities, the AVE method (Azuaje et al. 2005) is one of the most widely used schemes. The AVE method adopts the average inter-set similarity between terms that annotate the gens to measure the gene semantic similarity. Given a pair of genes g a and g b ða; b ¼ 1; 2; . . .; NÞ, let ann g a ð Þ, ann g b ð Þ be the sets of GO terms that annotate the two genes respectively. Then the semantic similarity of g a and g b can be determined by Eq. 11.
where jann g a ð Þj, jann g b ð Þj are the cardinalities of ann g a ð Þ, ann g b ð Þ, respectively. Note that Eq. 11 extracts the semantic similarity based on gene annotations available from GO. Therefore, there is no need to consider the incompleteness of gene expression data.
For the incomplete expression data set $ G ¼ ½g ij NÂM , partial distance (Hathaway and Bezdek 2001;Li et al. 2013) can be used to extract the expression-based dissimilarity of $ g a and $ g b ða; b ¼ 1; 2; . . .; NÞ, as given by Eq. 12. where In MOC-FNI, we take into account both the semanticbased and expression-based dissimilarity, the combined distance (Tuikkala et al. 2006) is defined as Eq. 14.
where the positive weight parameter h [ 0 balances the contribution of two dissimilarities.
In this paper, we consider the case that gene expression values are missing completely at random (MCAR). For each target gene $ g b with missing values, the combined distance (14) is applied to guide the selection of functional neighbors. Then, the functional neighbor interval of its missing valueg bj can be determined. Specifically, we search for q functional neighbors of $ g b with non-missing feature j, where g À bj and g þ bj are the minimum and maximum of the neighbors' j-th expression values, respectively. Therefore,g bj can get its functional neighbor interval as

Objective functions
MOC-FNI is a multi-objective joint optimization method, where the imputation and clustering results are optimized simultaneously in the framework of NSGA-II. In multiobjective clustering problems, the objective functions should conflict with each other and represent different aspects of clustering performance. The cluster validity indices J m (Bezdek 1981) and XB (Xie and Beni 1991) measure the intra-cluster compactness and inter-cluster separation, respectively. These two indices are commonly used as objective functions and have achieved satisfying clustering performance in various multi-objective clustering algorithms (Bandyopadhyay et al. 2007;Maulik et al. 2009;Mukhopadhyay et al. 2013). MOC-FNI simultaneously optimize these objectives, which are formulated as Eqs. 15 and 16, respectively.
A joint optimization framework integrated with biological knowledge… 13645 where represents the degree of g k in the i-th cluster with u ik 2 ½0; 1 8i; k ð Þ and P C i¼1 u ik ¼ 1 8k ð Þ. g k ¼ g 1k ; g 2k ; . . .; g Mk ½ T is a gene expression vector in the recovered complete data matrix G. C is the total number of clusters. v i 2 R M is the ith cluster prototype. The parameter m 2 1; 1 ð Þ influences the fuzziness of the partition. k:k 2 stands for the Euclidean norm. Lower values of J m and XB imply better compactness and separation of the yielded clusters.

Chromosome encoding and population initialization
To optimize the imputation and clustering results simultaneously, a mixed chromosome encoding strategy is adopted in MOC-FNI. Each chromosome includes the cluster prototypes and the imputation values of missing expression values. Let E be the population and F be the population size. For an incomplete gene expression data set $ G ¼ ½g ij NÂM with h missing values, we first sort and renumber the h functional neighbor intervals of missing expression values by their appearance order in $ G . Specifically, for each missing valueg bj , its functional neighbor interval g À bj ; g þ  Figure 1 shows the mixed chromosome encoding strategy. For each chromosome E f l ð Þ with 1 f F at generation l, let the cluster prototype part (with C Â M components) be Then, select C genes with higher local density and having large distance from other density maxima points as the initial prototypes. The DPC algorithm, which has been proven to be effective in analyzing gene expression data (Mehmood et al. 2017), can identify reliable prototypes. The functional neighbor intervals can provide reasonable constraint on the imputation values. These advantages will contribute to improve the convergence speed and optimization ability of genetic search involved in the NSGA-II framework.

Genetic operations
In the genetic process of MOC-FNI, we use the roulette wheel for implementing the selection scheme. As NSGA-II prefers solutions with lower non-domination rank (Deb et al. 2002), a lower-rank chromosome should be selected with a higher probability. Thus, for each chromosome E f l ð Þ with rank f rank , we calculate its selection probability by the rank-based evaluation function (Zhou and Zhu 2018), as shown in Eq. 18.
where a 2 ½0; 1 is a parameter that controls the selective pressure.
Because the imputation values of missing expression values should evolve within the corresponding functional neighbor intervals, a crossover operator based on competition and optimal selection (Ren and San 2007) is adopted. For the parent chromosomes E f 1 l ð Þ and E f 2 l ð Þ with 1 f 1 ; f 2 F at generation l, four offspring are first generated by using Eqs. 19-22.
where crossover factor b 2 ½0; 1 denotes the weight determined by users. Both e max and e min have C Â M þ h components, e max ¼ 1; 1; . . .; 1; e þ 1 ; e þ 2 ; . . .; e þ h Â Ã , e min ¼ 0; 0; . . .; 0; e À 1 ; e À 2 ; . . .; e À h Â Ã , where 1 and 0 are the Vectors and min E f 1 l ð Þ; E f 2 l ð Þ À Á are formed by the maximum and minimum of corresponding components in E f 1 l ð Þ and E f 2 l ð Þ, respectively. Note that the imputation values and prototype components may spread all over the functional neighbor intervals and 0; 1 ½ , respectively. The above crossover operator can generate superior offspring than arithmetic crossover or heuristic crossover (Li et al. 2013;Ren and San 2007). Then, from the two parent chromosomes and four offspring, two chromosomes with lower rank values and lesser crowding distances can be chosen to substitute the parent chromosomes.
Mutation operator generates richer solution and can ensure a better convergence rate and diversity (Doumi et al. 2021). In the mutation process, we adopt the classic uniform mutation scheme. As all expression values have been normalized and the evolution of imputation values should satisfy the interval constraints, each component of the prototype part and the imputation part of a chromosome is replaced by a random value in ½0; 1 and the corresponding functional neighbor interval, respectively, with a small probability P m .

The procedure of the proposed method
For an incomplete gene expression data set $ G ¼ ½g ij NÂM with h missing values, the procedure of the proposed MOC-FNI can be described as follows: Step 1: For each missing expression valuẽ g bj 1 b N; 1 j M ð Þ , find q functional neighbors of the target gene $ g b with non-missing feature j using Eq. 14, then determine the functional neighbor interval g À bj ; g þ bj h i .
Step 3: When the genetic generation index is l l ¼ 1; 2; . . .; L max ð Þ , for each chromosome E f l ð Þ Step 4: Perform roulette wheel selection, the crossover based on competition and optimal selection, and uniform mutation.
Step 5: Combine the parent and offspring population, select the best F solutions for the next iteration with respect to non-domination rank and crowding distance.
Step 6: If genetic generation index l ¼ L max , stop and get the set of Pareto optimal solutions P s ; otherwise set l ¼ l þ 1 and return to Step 3.

Selection of the final solution
After running the proposed MOC-FNI, a set of Pareto optimal solutions P s will be achieved, from which the final imputation results and clustering partition can be extracted. We employ the projection similarity validity index (PSVIndex) (Xia et al. 2013;Zhou and Zhu 2018) to extract the best solution from P s , as given by Eq. 23.
C and M are the numbers of clusters and attributes, respectively. G ¼ ½g ij NÂM is the complete matrix imputed by the imputation part of solutions in P s . n i 1 i C ð Þ denotes the number of genes of the i-th cluster. project aj is the projection coordinates of expression value g aj that represents the projected interval of gene g a on the j-th dimension. If g a and g b belong to the same cluster, they should have the same or quite similar projection coordinates on the M dimensions. That is, SPDis g a ; g b ð Þ should results in a small value. Therefore, we can select the solution with the smallest PSVIndex value to get final imputation results and cluster partition.

Experimental results
In this section, the performance of the proposed MOC-FNI algorithm is evaluated on four benchmark data sets. First, we compare the imputation performance of MOC-FNI with several popular and promising imputation methods in terms of normalized root mean square error (NRMSE). Secondly, the clustering results on these imputed data are compared based on Silhouette index. Finally, we perform biological relevance test on the obtained gene clusters.

Data sets
We apply MOC-FNI to the following four benchmark data sets to prove its performance.
Arabidopsis Thaliana: This data set consists of 138 Arabidopsis Thaliana genes. Each gene has 8 expression values that correspond to 8 time points. The number of clusters C Arabidopsis ¼ 4. To measure the gene semantic similarity, these genes are mapped to GO terms in the three sub-ontologies (BP, MF and CC) of GO. For the Arabidopsis Thaliana data set, the number of GO terms is 2189, with 1365 terms under biological process, 597 terms under molecular function, and 227 terms under the cellular component.
Yeast Cell cycle_384: This data set contains the expression levels of 384 genes involved in yeast cell cycle regulation at 17 time points. These data are related with five phases of cell cycle. Thus, the number of clusters C Yeast 384 ¼ 5. The genes in Yeast Cell cycle_384 also mapped to GO terms in BP, MF, and CC. Consequently, the number of GO terms is 4380, with 2872 terms under biological process, 962 terms under molecular function, and 546 terms under the cellular component.
Yeast Cell cycle_237: This data set consists of 237 genes, whose functions fall into four categories in the MIPS database, i.e., C Yeast 237 ¼ 4. For Yeast Cell cycle_237 data set, the number of GO terms is 3204, with 2139 terms under biological process, 637 terms under molecular function, and 428 terms under cellular component.
Human Fibroblasts Serum: This data set contains the expression levels of 517 human genes. The data set has 13 dimensions and C Serum ¼ 6. For this data set, the number of GO terms is 8469, with 5284 terms under biological process, 2026 terms under molecular function, and 1159 terms under cellular component.
To simulate MCAR, we randomly discard a specified percentage of components in the original data set G ori ð Þ ¼ ½g ori ð Þ ij NÂM and generate the incomplete data set

Evaluation criteria
To evaluate the imputation performance of MOC-FNI, we use normalized root mean square error (NRMSE) (Cheng et al. 2012), as given by Eq. 25.
where h is the number of missing values in $ G , b e o 1 o h ð Þis the imputation value of the o-th missing value, e ori ð Þ o is its original (true) expression value in G ori ð Þ , and e is the average of all missing values. NRMSE is the most commonly used evaluation criterion that measure the imputation accuracy. A smaller NRMSE indicates a better imputation result. An internal cluster validity measure, called Silhouette index, is utilized to evaluate the clustering performance. Let aðiÞ be the average distance between gene g i and other genes in the same cluster, bðiÞ be the minimum average distance between gene g i and genes in other clusters. Then, the silhouette width (Rousseeuw 1987) of g i is defined as Eq. 26.
It can be seen that, for each gene g i , S i ð Þ 2 À1; 1 ½ . The Silhouette index of the gene expression matrix is computed as the average value of the Silhouette width of all genes. A large Silhouette index indicates a good clustering result.
The settings of MOC-FNI are as follows: population size F ¼ 80, maximal number of generations L max ¼ 60, selection factor a ¼ 0:3, crossover factor b ¼ 0:3, mutation probability P m ¼ 0:1, fuzzification parameter m ¼ 2, the number of functional neighbors q ¼ 10. The above parameter configurations are commonly adopted in relevant literatures (Li et al. 2013;Moorthy et al. 2014;Saha et al 2013;Troyanskaya et al. 2001;Zhou and Zhu 2018). As suggested by Tuikkala et al (2006), the parameter h in Eq. 14 is selected adaptively. First, we randomly designate 5% of non-missing expression values as missing artificially from the data set to be imputed. Then, these values are estimated separately for each h value using GOimpute (Tuikkala et al. 2006). The h that products the smallest NRMSE is selected. Figure 2 shows the NRMSE values when h varies from 0.4 to 2.0 with grid 0.2. From  Fig. 2, we observe that for the Arabidopsis Thaliana data and the Yeast Cell cycle_237 data, NRMSE can get minimum when 0:8 h 1:0 and 1:0 h 1:2, respectively. Therefore, we set h Arabidopsis ¼ 0:9 and h Yeast 237 ¼ 1:1. Similarly, the optimal values for the other two data sets are h Yeast 384 ¼ 1:4, h Serum ¼ 1:2. The parameters of the compared methods are set according to the optimal parameters suggested in the original papers (Acuna and Rodriguez 2004;Buuren and Oudshoorn 2011;Kim et al. 2005;Oba et al. 2003;de Souto et al. 2015;Troyanskaya et al. 2001;Tuikkala et al. 2006). As imputation methods are recommended when the missing rate is less than or equal to 30% (Draghici et al. 2006;Yu et al. 2017), various missing rates are randomly generated in the original matrices: 1%, 5%, 10%, 15%, 20%, and 30%, respectively. We present the average results obtained over 10 trials with the same incomplete data sets for each approach.
The experiments are evaluated in a PC computer by running Windows 10 system, with a 3.1 GHz Intel Xeon E3-1535 CPU, 32 GB RAM, and a 2TGB hard disk. In our experiments, R package ''mice'' is used to obtain the imputation results of MICE. GO semantic similarity in GOimpute and MOC-FNI are obtained using website: http://bioinformatics.clemson.edu/G-SESAME, other codes are written in Matlab2019b. Figure 3 shows the average NRMSE values of MOC-FNI on four data sets at different missing rates in comparison with other six methods.

Comparison of imputation performance
From Fig. 3, we observe an obvious trend that NRMSE values increase along with the missing rates for all imputation methods on all data sets. Overall, the proposed MOC-FNI has best performance in all data sets and all missing rates. We also observe that MOC-FNI and GOimpute always achieved the first two smallest NRMSE values, suggesting that GO information improves the imputation accuracy. Taking the Yeast Cell cycle_384 data set as an example, compared to the MEANimpute, KNNimpute, BPCA, MICE, LLSimpute, and GOimpute, MOC-FNI reduces the NRMSE values by 26.1%, 23.5%, 25.1%, 21.7%, 17.0%, and 8.5% at 1% missing rate, respectively. These percentages, along with the increase in missing rates, reduce to 4.2%-10.5% at 30% missing rate, the superiority of MOC-FNI becomes less significant, but it is still effective. Similar observations can also be obtained from the other data sets. In terms of the average reduction rate over the four data sets and six missing rates, MOC-FNI reduces the NRMSE values by 14. 7%, 13.3%, 13.1%, 12.9%, 9.4%, and 6.4%  (c) Yeast Cell cycle_237 (d) Serum   Fig. 3 Average NRMSE values of seven methods at different missing rates for four data sets integrate both GO annotation information and gene expression information, which help to generate biologically meaningful imputation values; (ii) In the proposed framework, the joint optimization can promote the synergistic improvement of imputation and clustering.

Comparison of clustering results
Tables 3, 4, 5, and 6 show the average Silhouette index values obtained by MOC-FNI and six two-stage methods on four data sets. The optimal solutions in each column are highlighted in bold and the suboptimal solutions in italics.
To give an overall comparison, we sort the average Silhouette index values in each column in descending order and obtain the sort index. Therefore, the method with sort index 1 has the largest Silhouette index value, the best clustering result. We gather statistics on the sort indexes of the methods when clustering the four data sets with six missing rates (24 cases in total). Figure 4 gives the average sort indexes of the seven methods, and a smaller average sort index indicates a better clustering result. Take the proposed MOC-FNI as an example, MOC-FNI achieves the largest Silhouettes index values for 23 out of the 24 cases and gives a suboptimal solution in the other case. Therefore, the average sort index of MOC-FNI reaches 1.04 The statistics of other methods can be obtained in similar ways.
The computational time of different methods on the four data sets is shown in Fig. 5. As we examine the average imputation and clustering results over 10 trials, the computational time of 10 trials is reported. The time for calculating GO semantic similarity using website is not included. From Fig. 5, it can be seen that the computational time increases with the data size. The computational time of the compared two-stage methods only changes in small ranges under different missing rate in the same expression data set. The reason is that in these two-stage methods, the multi-objective clustering is much more time-consuming than the imputation task, and the chromosomes only encode cluster prototypes in the multi-objective clustering. Thus, the number of decision variables does not charge with the missing rate. It can also be seen that the computational time of MOC-FNI is higher than other methods. In addition, the computational time of MOC-FNI increases with the missing rates in the same data set. This comes from the fact that MOC-FNI encodes both imputation values and cluster prototypes in mixed chromosome. Thus, the number of decision variables increases along with the missing rates.
To illustrate the coherence of the gene clusters obtained by MOC-FNI, the Eisen plots and cluster profile plots on the four data sets are also presented in Fig. 6. In the Eisen plots, each row corresponds to a gene, each column to a time point (sample), and each entry of the plot represents the expression level of a gene at a specific time point by coloring the corresponding cell. To illustrate more clearly the gene clusters obtained by MOC-FNI, the genes partitioned into the same cluster are placed together. From the Esion plots presented in Fig. 6, it is evident that the color pattern (expression profiles) of the genes within a cluster are similar to each other, whereas the genes from different clusters illustrate different color patterns. In the cluster profile plots, the X-and Y-axis represent the time points and gene expression values, respectively. The expression values of genes partitioned into the same cluster are plotted in the same subplot. In the subplots, each green line indicates the normalized expression values of a gene over all time points, and the black line represents the mean expression level of the genes in the corresponding cluster. From the cluster profile plots presented in Fig. 6, it can be seen that the cluster profiles from different clusters are The clustering superiority of our MOC-FNI can be attribute to the joint optimization on imputation and clustering. On one hand, the imputation is guided by the cluster validities, making the imputation more suitable for the clustering task. On the other hand, this reasonable imputation improves the clustering performance. In contrast, the compared two-stage methods disconnect the imputation and clustering tasks. It is worth noting that the results shown in Tables 3, 4, 5 and 6 and Fig. 4 conform to the imputation accuracy shown in Fig. 3, which illustrate the positive impact of high imputation accuracy on clustering. These experimental results also validate the mutual promotion between imputation and clustering and the rationality of joint optimization.

Biological significance
To test the functional enrichment of gene clusters generated by MOC-FNI and the compared two-stage methods, we also perform biological relevance test with the help of GOTermFinder tool (https://www.yeastgenome.org/ goTermFinder). The network analysis tool can find the significant shared GO terms that describe the genes in each cluster from a GO sub-ontologies (BP, CC, MF) and provide the corresponding p values based on the hypergeometric distribution. The closer the obtained p value is to 0, the more biologically significant the clustering result saha et al. 2018).
Taking the 5% case of incomplete Yeast Cell cycle_237 data set as an example, the biological significance test is conducted at the 1% significance level. We focus on the three most significant GO terms (with the first three smallest p values) for each of the four clusters obtained by different methods. Figure 7 shows the plot of the average p values. To illustrate the difference significantly, the p values are negative log-transformed and the clusters are sorted in descending order according to the transformed values. Table 7 reports the three most significant GO terms and the corresponding p values in each cluster obtained by MOC-FNI.
From Fig. 7, it is clear that the curve corresponding to MOC-FNI is above all the other curves, that is, MOC-FNI can always get the smallest average p values in the four clusters. This indicates that the gene clusters obtained by MOC-FNI possess more biological significance. Moreover, all the p values of the significant GO terms listed in Table 7 are far less than 0.01. This establishes that the proposed  MOC-FNI is able to identify biologically relevant and functionally enriched gene clusters. MOC-FNI is superior to all the other methods due to its ability to the synergistic optimization of imputation and clustering. Unlike the other two-stage methods that fill in each missing value separately and ignore the connection of imputation and clustering, MOC-FNI treats the imputation values as a whole in the optimization process. MOC-FNI focuses on the overall impact of imputation values on clustering and optimizes the imputation values to serve clustering best. In addition, the proposed functional neighbor intervals introduce biological knowledge to the joint optimization framework. All these contribute to the superior performance of MOC-FNI in Biological significance.

Arabidopsis Thaliana (a) Arabidopsis Thaliana (b)
Yeast Cell cycle_384 (a) Yeast Cell cycle_384 (  This paper focused on the imputation and cluster tasks of incomplete gene expression data and presented a multiobjective clustering algorithm guided by biological information. In the proposed multi-objective joint optimization framework, we impute all the missing expression values as a whole rather than separately to serve the clustering task and realize the synergistic optimization of imputation and clustering. Besides, the GO annotation information integrated in our framework helps to provide reasonable constraints for the optimization process. Experimental results indicate that MOC-FNI outperforms the compared methods. In terms of the average improvement rate over the data sets and different missing rates, our framework can reduce the imputation error by 14.7, 13.3, 13.1, 12.9, 9.4, 6.4%, and increase the clustering accuracy by 10.1, 8.6, 6.9, 9.1, 5.7, 4.0% in comparison with six popular and promising methods. Besides, the biological significance test indicates that the gene clusters identified by the proposed framework are more significantly enriched. The advantages of the proposed method are as follows: (i) the proposed multi-objective joint optimization framework fully considers the inherent relation between the imputation and clustering, and synergistically improves the performance of both tasks; (ii) the proposed functional neighbor intervals integrate biological knowledge into the joint optimization framework so as to further improve the imputation and clustering accuracy. In real applications, the original gene expression matrix often contains missing values, and the proposed MOC-FNI can act as a powerful tool for generating reasonable imputation values and revealing the nature data structures. These identified clusters can further provide a useful basis for further investigation of gene function, gene regulation, and subtypes of cells. Besides, the proposed joint optimization on imputation and clustering can be applied to practical applications in bioinformatics analysis, including identification of negative correlations in gene expression data (Tu et al. 2016), cancer subtype discovery (Chen et al. 2019a), bioinformatic analysis of proteome alternations (Rahmatbakhsh et al. 2021).
The effectiveness of the proposed MOC-FNI has been evaluated on small and medium size gene expression data sets. When the data size becomes large, the time required for multi-objective joint optimization may increase drastically. In future, we would like to adopt decision variable analysis (Zhang et al. 2018) to reduce the search space so as to make our algorithm applicable to the case of large incomplete gene expression data sets. In addition, the proposed joint optimization framework will be applied to other tasks, such as semi-supervised clustering and biclustering on incomplete gene expression data.