Modified Firefly Algorithm for Optimizing Biomedical Breast Cancer Queries


Querying and retrieving Semantic Web data is challenging because of its growing volume. Many query languages have been designed to retrieve Semantic Web data, and SPARQL is a popular one among them. These query languages were designed with some optimization strategies, but the literature reports that they cannot handle large volumes of data efficiently. In this research, a Modified Firefly Algorithm (MFA) is applied to optimize SPARQL queries so that data can be retrieved efficiently from a large Semantic Web repository by reducing query execution time. Every query has multiple query plans, each generated with a different cost value. The challenge is to choose the best query plan, that is, the one that reduces both query cost and query execution time. The proposed algorithm uses the best query plan of the previous iteration to calculate the distance between two query plans using the radius parameter. The proposed algorithm generates a query plan that is a global optimal solution. MFA is evaluated using the BioPortal dataset with triples containing breast cancer. Experimental analysis is conducted to assess the improvement in performance of the proposed work over existing nature-inspired query optimization algorithms, with query execution time as the basis of comparison.


Introduction
In the biomedical field, biology- and medical-related information is combined to provide information about different types of diseases. Biomedical data from knowledge bases are well represented using Resource Description Framework (RDF) triples. Large-scale health-related datasets from various heterogeneous domains are linked together and represented as Semantic Web data. Querying such vast amounts of data must provide accurate results to patients. Several research initiatives have been undertaken to query these bulk data using technologies such as natural language processing [4], evolutionary optimization algorithms [7] and heuristic algorithms [9]. Owing to the size and complexity of the data, efficient retrieval mechanisms are necessary, as biomedical queries are challenging for search engines [1].
RDF is one of the common ways to represent Semantic Web data, and SPARQL is the popular language for querying such data. The results of SPARQL queries can be retrieved in diverse formats. Novel architectures are required to manage bulk RDF data and queries [2]. Federated query engines have been proposed in the literature to access biomedical data sources [3]. Several researchers have been working on access to the decentralized contents of biomedical data.
Query optimization is an important step in processing a query. Every query issued may have multiple query plans, and each query plan executes with a different query cost and a different query execution time. The issue with processing large volumes of Semantic Web data is that query execution time increases as the volume of data grows. Traditional query optimization methods are usually applied to generate the optimal query plans. A newer way of selecting the optimal query plan among the available plans is to apply nature-inspired and evolutionary methods.
Nature-inspired methods apply the behaviour of natural phenomena to the problem of interest to produce a result. Many nature-inspired and evolutionary algorithms are available and applicable to this problem. A suitable nature-inspired algorithm is chosen for query optimization based on the parameters it depends on.
The article is arranged as follows. Section 2 reviews the most recent work on query optimization. Section 3 discusses the methodology of the Modified Firefly Algorithm (MFA) to optimize breast cancer queries. Section 4 describes the query optimization, problem statement, proposed encoding representations of query plans, and fitness functions. In Section 5, the implementation results for the optimization of biomedical queries using the breast cancer dataset are discussed and compared with those of the other algorithms, and their efficiencies are analysed. Section 6 presents the conclusions.

A Review Of Literature
Hamon et al. [4] proposed a method for converting questions in natural language into SPARQL queries. It was a rule-based method that depends on linguistic and semantic annotation of questions using Semantic Web resources and RDF triples. Oren et al. [5] proposed a technique for querying RDF data through an evolutionary algorithm. The solutions were evaluated using fingerprint and Bloom filters. The research addressed the trade-off between computation time and the results of the query. Hees et al. [6] devised an evolutionary pattern learner to identify patterns in a dataset using an evolutionary algorithm.
Hogenboom et al. [7] adapted an efficient method to optimize chain queries in a distributed environment. The research compared various evolutionary optimization algorithms for optimizing chain queries, such as Ant Colony Optimization (ACO), two-phase optimization, and Genetic Algorithm (GA).
Saharan et al. [8] devised a solution for querying scattered data using Differential Evolution. The algorithm reorganizes the order of triple patterns. The results were compared with other heuristic algorithms and analysed.
Tsialiamanis et al. [9] proposed a new Heuristic SPARQL Planner to take advantage of the syntax and structural variations in triples. The proposed algorithm chooses an optimized query plan without depending on a cost model. The planner was implemented using an open source column-store and the quality of the query plan was compared with benchmarks. Abbas et al. [10] proposed an optimization algorithm to optimize the Conjunctive SPARQL queries, which is based on considering the ShEx constraints.
Wang et al. [11] proposed a framework to optimize the execution of distributed SPARQL queries. The proposed algorithm is based on graph traversal, and its performance is evaluated by comparison with non-graph-traversal algorithms. Ibragimov et al. [12] proposed MARVEL (Materialized Rdf Views with Entailment and incompleteness), a method that depends on the RDF cost model. This algorithm rewrites SPARQL queries based on materialized RDF views and view definition syntax. The experimental analysis showed that the MARVEL system improved the query response time efficiently.
A comparison of evolutionary algorithms applied to the problem of query optimization in distributed databases has been discussed in the literature [15]. The research focussed on bio-inspired computational algorithms used in optimizing distributed database queries. A new genetic-algorithm-based query optimizer was proposed, and its performance was compared with that of existing algorithms and random algorithms. The analysis indicates that the genetic algorithm is a feasible approach to handling large-scale distributed systems. A multi-colony ant algorithm based on the MIN-MAX ant system was proposed to optimize distributed database queries. The proposed algorithm handles scalability and modularity and is reported to take less computation time to optimize the queries. A hybrid of particle swarm optimization algorithms was developed that depends on the probability distribution of the movement of particles. Memetic algorithms, which include a local search process to refine the search for the optimal query plan, were also proposed. The literature study shows that a significant number of bio-inspired algorithms are available for generating optimal query plans.
The different cost models associated with big data systems were studied in [16]. The currently available cost models, which depend on the cardinality measure, and the learned cost models, which depend on operators and common sub-expressions, were also discussed. The basic and derived features of the cost model were studied, and feature selection mechanisms were proposed.
Table 1 compares [17] the existing evolutionary algorithms and their corresponding features. Nature-inspired algorithms are among the most popular algorithms for solving optimization problems. Metaheuristic algorithms are designed to generate good solutions to an optimization problem while making few assumptions. The firefly algorithm is one such metaheuristic algorithm, developed by Yang [13] for solving optimization problems. The algorithm is modelled on the behaviour of fireflies, which includes the following three behaviours:
1. Every firefly attracts other fireflies using its brightness.
2. Fireflies with superior brightness are more attractive to the other fireflies.
3. Fireflies with inferior brightness move towards fireflies with superior brightness.
The second behaviour is analogous to the fact that newly produced solutions are based on previous solutions that have a better fitness function value. Theoretically, for each query, there may be one or more new possible query plans. This depends on how the query cost of a solution compares with that of the other solutions in the existing population of query plans.
In the standard firefly algorithm, the best query plan generated is the global optimal solution. If this firefly moves, as it does in the standard firefly algorithm, its brightness decreases, which degrades the ability to generate a query plan with low execution time. In the conventional firefly algorithm, the distance is calculated from the current solution and another, better solution. MFA instead uses the best solution value of the previous iteration to calculate the radius.
This modification is made so that the firefly moves towards the global optimal solution.
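For reference, the distance and attractiveness terms of the standard firefly algorithm are usually written as below. This is the textbook formulation attributed to Yang, given here only as background; the symbols $r_{ij}$, $\beta_0$ and $\gamma$ follow that formulation and are not taken from this paper's own equations.

```latex
r_{ij} = \lVert X_i - X_j \rVert ,
\qquad
\beta(r_{ij}) = \beta_0 \, e^{-\gamma \, r_{ij}^{2}}
```

When $\gamma = 0$ the exponential factor equals 1, so the attractiveness is the constant $\beta_0$ regardless of distance, which means that every firefly is fully visible to every other firefly.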
Suppose that, for every solution $i$, $X_i$ is the location of the $i$-th firefly at the present iteration. When the objective function value of solution $i$ is superior to that of solution $j$, the distance between the two fireflies $i$ and $j$ is obtained using Equation (1). The updated distance is then used in Equation (2):

$$X_{ij}^{new} = X_i + \beta \cdot rand \cdot \Delta X_{ij} + rand \qquad (2)$$

where $rand$ is a random number for solution $i$; $\beta_0$ is the attractiveness at zero distance, normally set to 1; $\gamma$ is set to 0, which indicates high visibility of query plans in this case; $X_j$ is a solution having a lower fitness function value than $X_i$; and $\Delta X_{ij}$ is a revised step size, which is calculated by a separate equation.

Figure 1 shows the flowchart of the proposed MFA, and Figure 2 gives the algorithm for the proposed MFA. The steps for the proposed MFA are as follows.

RDF Query Plans
Semantic Web data stored in RDF format can be retrieved using the popular SPARQL language. In the context of query optimization, every SPARQL query can be represented as a query tree. The leaf nodes of the query tree stand for the triples, and the internal nodes join the triples. Query trees can take different shapes, including bushy trees, left-deep trees, and right-deep trees. The nodes of a single query tree can be restructured in many different ways that produce the same results.
The sequence in which operations are executed to retrieve the requested data is known as a query plan.
In this work, the trees used to represent the query paths are left-deep trees [14]. Left-deep trees are chosen because they can process the joins by applying cost-reducing pipelining methods.
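The left-deep shape described above can be sketched in a few lines of Python. The class names `Leaf` and `Join` and the triple labels are hypothetical and chosen only for illustration; the sketch shows that in a left-deep tree the right child of every join is a single triple, whereas a bushy tree allows joins on both sides.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """A leaf node holding one triple pattern (label is illustrative)."""
    triple: str


@dataclass(frozen=True)
class Join:
    """An internal node joining the results of two subtrees."""
    left: "Node"
    right: "Node"


Node = Union[Leaf, Join]


def build_left_deep(triples):
    """Join the triples one at a time, always extending the left subtree."""
    tree = Leaf(triples[0])
    for triple in triples[1:]:
        tree = Join(tree, Leaf(triple))
    return tree


def is_left_deep(node):
    """A tree is left-deep when every join has a leaf as its right child."""
    while isinstance(node, Join):
        if not isinstance(node.right, Leaf):
            return False
        node = node.left
    return True


left_deep = build_left_deep(["t1", "t2", "t3", "t4"])
bushy = Join(Join(Leaf("t1"), Leaf("t2")), Join(Leaf("t3"), Leaf("t4")))
print(is_left_deep(left_deep), is_left_deep(bushy))
```

Because each join consumes one new triple and the running result of the joins so far, intermediate results can be streamed from one join to the next, which is the pipelining property the text refers to.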

Solution Space
Generally, the solution space consists of the set of all processing trees that can produce a result for a given query. In the context of the query optimization problem, the solution space contains query execution plans as solutions. There are n! feasible ways to assign n triples to the leaves of the tree. The leaves of the tree consist of triples, and the inner nodes are used to join these triples. The n! solutions are obtained by applying the transformation rules.
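As a small illustration of the n! count, the orderings of the leaves can be enumerated directly. The triple labels below are hypothetical, and exhaustive enumeration is shown only to make the size of the solution space concrete, not as a search strategy.

```python
from itertools import permutations
from math import factorial


def left_deep_orderings(triples):
    """Return every ordering of the triples; each ordering is one left-deep plan."""
    return list(permutations(triples))


triples = ["t1", "t2", "t3", "t4"]  # hypothetical triple-pattern labels
plans = left_deep_orderings(triples)
print(len(plans), factorial(len(triples)))
```

The count grows very quickly (10 triples already give 3,628,800 orderings), which is why a metaheuristic is used rather than exhaustive enumeration.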

Encoding Methodology
Every optimization algorithm requires an encoding methodology, that is, a suitable encoding for the solutions in the solution space. In this research, the solutions are query plans, and they are represented using left-deep trees. To encode left-deep trees, we choose an ordered list [14] of leaves. After all possible query plans have been generated for the given SPARQL query, the trees are encoded using the ordered list format. For example, the query tree with nodes and joins in the order ((((R1 ⋈ R2) ⋈ R3) ⋈ R4) ⋈ R5) is encoded as "12345".
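The ordered-list encoding in the example can be sketched as a pair of helper functions. The names `encode_plan` and `decode_plan` are hypothetical, and the sketch assumes single-digit relation indices, as in the "12345" example, so it applies only to plans with fewer than ten triples.

```python
import re


def encode_plan(join_expression: str) -> str:
    """Read the relation indices left to right, e.g. '((R1 ⋈ R2) ⋈ R3)' -> '123'."""
    return "".join(re.findall(r"R(\d+)", join_expression))


def decode_plan(code: str) -> str:
    """Rebuild the left-deep join expression from an ordered-list code."""
    expression = f"R{code[0]}"
    for digit in code[1:]:
        expression = f"({expression} ⋈ R{digit})"
    return expression


print(encode_plan("((((R1 ⋈ R2) ⋈ R3) ⋈ R4) ⋈ R5)"))
print(decode_plan("31254"))
```

Because a left-deep tree is fully determined by the order of its leaves, the ordered list loses no information, and swapping two positions in the list always yields another valid plan.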

Objective Function
To determine the RDF query path, we must first decide on the fitness function. In this research work, the fitness function refers to the cost model of the left-deep tree. The following formula is used to define the cost model of the query tree:

Cost of single query tree = Sum of the costs of each operator involved in the tree

The cost of a particular operator depends on the cardinality and selectivity estimation.
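The cost formula above can be illustrated with a minimal sketch. The source does not give the per-operator cost, so the sketch assumes, purely for illustration, that the cost of each join is its estimated output size (left rows × right rows × selectivity); the function name and the cardinality and selectivity values are hypothetical.

```python
def left_deep_cost(order, cardinality, selectivity):
    """Sum the estimated cost of every join in a left-deep plan.

    Assumption for illustration: the cost of a join is its estimated
    output size, i.e. rows so far * rows of the new triple * selectivity.
    """
    rows = cardinality[order[0]]
    total = 0.0
    for triple in order[1:]:
        rows = rows * cardinality[triple] * selectivity[triple]
        total += rows
    return total


# Hypothetical estimates for three triple patterns.
cardinality = {1: 100, 2: 10, 3: 1000}
selectivity = {1: 0.1, 2: 0.1, 3: 0.01}

print(left_deep_cost((2, 1, 3), cardinality, selectivity))
print(left_deep_cost((3, 1, 2), cardinality, selectivity))
```

Under these assumed estimates, starting with the smallest triple pattern keeps the intermediate results small, so the first ordering is far cheaper than the second even though both return the same answer.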
The intention of this research is to reduce the query execution time needed to retrieve data from a large volume of Semantic Web data. Once the encoding methodology, cost model, and solution space have been decided, they are given as input to MFA. The algorithm takes different query paths as input, converts the query plans to the encoding format, and applies the cost function to determine the cost of the query plans. After executing with all the inputs defined, MFA returns the best query plan, which reduces the execution time.
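The flow described above (encoded plans in, cost function applied, best plan out) can be sketched as a toy search loop. This is not the authors' MFA: it is a simplified, firefly-style local search in which every plan is pulled towards the best plan of the previous iteration by position swaps. The swap-based move, the perturbation rate, the toy cost model, and all names and values are assumptions made for illustration.

```python
import random

# Hypothetical cardinality and selectivity estimates for five triple patterns.
CARD = {1: 1000, 2: 50, 3: 200, 4: 10, 5: 500}
SEL = {1: 0.01, 2: 0.1, 3: 0.05, 4: 0.2, 5: 0.02}


def plan_cost(order):
    """Toy left-deep cost: sum of the estimated join output sizes."""
    rows, total = CARD[order[0]], 0.0
    for triple in order[1:]:
        rows = rows * CARD[triple] * SEL[triple]
        total += rows
    return total


def move_toward(plan, target, rng):
    """Swap two elements so the plan matches the target in at least one more position."""
    plan = list(plan)
    mismatched = [k for k, (a, b) in enumerate(zip(plan, target)) if a != b]
    if mismatched:
        k = rng.choice(mismatched)
        j = plan.index(target[k])
        plan[k], plan[j] = plan[j], plan[k]
    return tuple(plan)


def random_swap(plan, rng):
    """Random perturbation: exchange two randomly chosen positions."""
    plan = list(plan)
    a, b = rng.sample(range(len(plan)), 2)
    plan[a], plan[b] = plan[b], plan[a]
    return tuple(plan)


def firefly_style_search(triples, cost, pop_size=6, iterations=20, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.sample(triples, len(triples))) for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(iterations):
        prev_best = best  # best plan of the previous iteration
        moved = []
        for plan in population:
            if plan == prev_best:
                moved.append(plan)  # the brightest plan is not moved
                continue
            candidate = move_toward(plan, prev_best, rng)
            if rng.random() < 0.3:  # assumed perturbation rate
                candidate = random_swap(candidate, rng)
            moved.append(candidate)
        population = moved
        best = min(population + [best], key=cost)  # best never gets worse
    return best


best_plan = firefly_style_search([1, 2, 3, 4, 5], plan_cost)
print("".join(map(str, best_plan)), plan_cost(best_plan))
```

Keeping the previous best plan fixed while the others move mirrors the idea in the text that the brightest firefly should not be displaced; the random swap is what lets the search occasionally find a plan better than the current best.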

Experimental Results With Breast Cancer Queries
This section examines the performance of the proposed work on breast cancer datasets and compares the performance of the proposed and existing algorithms.

Datasets
BioPortal [4] is a dataset that consists of a repository of biomedical ontologies in many different formats, including RDF. It contains 190M triples as well as manually and automatically generated cross-ontology mappings. The American Cancer Society states in its report Breast Cancer Facts and Figures that "Breast cancer affects one in eight women during their lives". The proposed algorithm is applied to sample SPARQL queries with 100,000 breast cancer triples for testing. The RDF data can be queried using a SPARQL endpoint.

The main performance metric used to assess the proposed algorithm is the query execution time. The query execution times obtained for different queries are compared with those of algorithms such as Particle Swarm Optimization (PSO) and GA. Table 2 shows the comparison of sample query execution times of the proposed and existing algorithms; the proposed MFA reduces query execution time. The advantage of the Modified Firefly Algorithm (MFA) is that it increases the visibility when searching for the optimal query plan among the available plans. This mechanism increases the possibility of choosing the best plan, with minimum execution cost as well as minimum execution time. The choice of the left-deep tree representation of the query plan and of the ordered list encoding methodology also contributes to the proposed algorithm producing the best results.
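As a minimal sketch of how a query execution time could be measured, the helper below averages the wall-clock time of several runs. The source does not describe its measurement procedure, so the function `time_query`, the repeat count, and the stand-in workload `fake_query` are all assumptions; in practice `run_query` would be a callable that submits one SPARQL query to the endpoint.

```python
import time
from statistics import mean


def time_query(run_query, repeats=5):
    """Return the mean wall-clock time, in seconds, of calling run_query."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_query():
    """Stand-in workload; a real run would execute a SPARQL query instead."""
    sum(range(100_000))


print(f"{time_query(fake_query):.6f} s")
```

Averaging over several runs reduces the effect of caching and network jitter on a single measurement, which matters when the differences between algorithms are small.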

Performance Comparison
The comparison of the query execution times is shown graphically in Figure 6.

Conclusions
1. In this research, MFA, a Modified Firefly Algorithm, is used to optimize SPARQL query plans.
2. The algorithm takes query plans as input and uses the cost of the query plans as the fitness function. The proposed algorithm applies a modified version of the firefly algorithm, called MFA, which uses the best query plan of the previous iteration to calculate the radius value. This modification optimizes the query and computes the best query plan.
3. Sample queries on the BioPortal datasets are executed and the query execution times are recorded. The query execution times are compared with those of existing conventional algorithms such as PSO and GA. The proposed algorithm is found to reduce the query execution time.
4. The proposed algorithm is designed to handle large semantic web datasets and its corresponding queries. The advantage of the proposed system resides in generating the optimal query plan among