Users rely on traditional search engines such as Google, Bing, and Yahoo to find information on the massive Web. However, some information sources are not fully accessible through these traditional search engines for commercial or security reasons [1]. As a result, users who rely only on search engines miss valuable collections of information sources that may be relevant to their query but are not fully accessible from where they posed it. Federated search, or distributed information retrieval (DIR) [2], connects users directly to those information sources through a unified search interface. This interface can perform multiple searches across distributed collections and combine their results into a single list [3]. However, with multiple collections involved, users cannot know beforehand which collections are most likely to be relevant to their query. This is where the central broker acts as an intermediary between the component collections and the users [2]. As illustrated in Fig. 1, a user submits a search query through the user interface, and the query is transmitted directly to the broker. The broker picks a few collections that may hold relevant documents and routes the query to them. Each collection processes the received query and returns a ranked result list, which the broker merges into a single ranked list before presenting it to the user.
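To make this workflow concrete, the following minimal Python sketch mimics the broker's three steps (selection, routing, merging) with toy collections and a toy term-overlap score; all names, data, and the scoring function are illustrative only and do not correspond to any particular federated search system.

```python
# A minimal, illustrative sketch of the broker workflow (Fig. 1).
# Collections, scoring, and selection here are toy stand-ins, not any specific system.

def score(query, document):
    """Toy relevance score: count of query-term occurrences in the document."""
    terms = query.lower().split()
    return sum(document.lower().count(t) for t in terms)

def broker_search(query, collections, n_select=2, top_k=10):
    """Select a few collections, route the query, and merge the ranked lists."""
    # 1. Collection selection: rank collections by a crude aggregate score.
    selected = sorted(
        collections.items(),
        key=lambda kv: sum(score(query, d) for d in kv[1]),
        reverse=True,
    )[:n_select]

    # 2. Query routing: each selected collection returns its own ranked list.
    result_lists = {
        name: sorted(docs, key=lambda d: score(query, d), reverse=True)
        for name, docs in selected
    }

    # 3. Results merging: combine per-collection lists into one ranked list.
    merged = [(name, doc, score(query, doc))
              for name, docs in result_lists.items() for doc in docs]
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged[:top_k]

if __name__ == "__main__":
    collections = {
        "news":   ["ant colony optimization news", "sports results"],
        "papers": ["federated search and results merging", "ant colony optimization survey"],
        "videos": ["cooking video", "travel video"],
    }
    for name, doc, s in broker_search("ant colony optimization", collections):
        print(name, doc, s)
```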
The literature divides the domain of federated search into three separate research areas: resource description, collection selection, and results merging [2]. Resource description [4], also called collection representation [2], aims to obtain information about each collection's content and size. Collection selection [5, 6] uses the sampled information to select the few collections most likely to contain relevant documents for a search query. Finally, results merging [7, 8] combines the results from the component collections into a single ordered list.
This paper studies results merging because it plays a key role in meeting users' information needs: even if the most relevant collections are selected, user satisfaction is compromised when their result lists are merged poorly and the most relevant documents do not reach the top of the merged list. In addition, merging multiple result lists is a challenging task, especially in non-cooperative environments [2], owing to differences in the collections' corpus statistics, the use of varying indexing schemes and retrieval models, and the unavailability of the documents' full text at merging time. Moreover, some collections may contain non-text content, such as images and videos.
Learning to rank (L2R) techniques have been considered an alternative to traditional document-ranking approaches in recent years. Several research works [7, 9-13] have used L2R for various retrieval tasks. In these techniques, a set of queries, each associated with a ranked list of documents and their relevance judgments, is exploited as training data [14]. This training data is used to extract query-document relevance features and then to learn a ranking function that predicts the documents' relevance for each given query. For example, in federated search, Ghansah et al. [13] treated the results merging problem as a classification problem: they extracted features from the documents returned by the collections and learned a ranking function that merged the multiple result lists. The drawback of this method is that learning the ranking function requires a large amount of training data.
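The following small sketch illustrates this general L2R idea, casting relevance prediction as classification with scikit-learn's LogisticRegression; the features and data are hypothetical, and the sketch is not a reconstruction of the method in [13].

```python
# Illustrative learning-to-rank setup: relevance prediction as classification.
# Feature values and data are hypothetical; this is not the method of [13].
from sklearn.linear_model import LogisticRegression

# Each row is a query-document pair described by features such as
# (rank in the returned list, result-snippet length, title/query term overlap).
X_train = [
    [1, 120, 3],   # top-ranked, long snippet, strong title overlap
    [8,  40, 0],   # low-ranked, short snippet, no overlap
    [2,  90, 2],
    [15, 30, 0],
]
y_train = [1, 0, 1, 0]  # relevance judgments from the training queries

# Learn a ranking function that predicts relevance for unseen query-document pairs.
ranker = LogisticRegression().fit(X_train, y_train)

# At merging time, score the documents returned for a new query
# and order them by the predicted probability of relevance.
X_new = [[3, 80, 1], [1, 60, 0], [5, 110, 2]]
scores = ranker.predict_proba(X_new)[:, 1]
merged_order = sorted(range(len(X_new)), key=lambda i: scores[i], reverse=True)
print(merged_order)
```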
Swarm intelligence and genetic programming are widely used in various domains to solve optimization problems. In applying these techniques, multiple iterative rounds of selection, variation, and mutation are performed on the extracted features to obtain the optimum solution. These techniques have been reported [15] to provide better solutions than algorithms defined by experts. A recent study [7] used genetic programming to solve the results merging problem: the documents returned by the collections are downloaded, candidate relevance features are extracted from them at each iteration, and the documents are re-ranked based on their relevance scores. However, this approach increases bandwidth usage, computational cost, and latency.
The proposed work addresses these issues by extracting features similar to those in [13], but with a few modifications. First, the mentioned studies consider only the document's features when computing its relevance score, whereas we also consider the quality of the collection as an important factor. Second, we depart from learning a ranking function because it requires a large amount of training data. Instead, we use ant colony optimization (ACO), a swarm intelligence algorithm, to generate the merged result list (a generic sketch of the ACO mechanism follows the contribution list below). The novelty of the proposed method lies in using only the information available at query time, in a way that neither overwhelms the broker nor increases its computational cost or response time. To summarize, we make the following contributions to the body of knowledge.
- We propose a method that uses only the information provided by the component collections at query time.
- The proposed method uses the ACO algorithm and considers collection quality and document features to compute the results merging scores of documents.
- We test the efficacy of the proposed approach by conducting experiments on the TREC 2013 FedWeb dataset. The experimental results demonstrate its robustness: it outperforms both the learning and non-learning models used as baselines.
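As referenced above, the following generic ACO loop sketches how pheromone trails and per-document heuristic scores can interact to produce a ranking. It is a toy illustration only; the actual scoring formulation of the proposed method, including the collection-quality factor, is defined in Section 3, and the heuristic values and quality function below are hypothetical.

```python
# Generic ant colony optimization (ACO) loop for ordering documents.
# Toy illustration of pheromone-based ranking, not the exact formulation
# of the proposed method; heuristic values and the quality function are hypothetical.
import random

def aco_rank(heuristic, n_ants=20, n_iter=50, alpha=1.0, beta=2.0, rho=0.1):
    """Return a document ordering built from pheromone and heuristic scores."""
    docs = list(heuristic)
    pheromone = {d: 1.0 for d in docs}

    def build_ranking():
        # Each ant places documents one by one, sampling in proportion to
        # pheromone^alpha * heuristic^beta.
        remaining, ranking = docs[:], []
        while remaining:
            weights = [pheromone[d] ** alpha * heuristic[d] ** beta for d in remaining]
            choice = random.choices(remaining, weights=weights, k=1)[0]
            ranking.append(choice)
            remaining.remove(choice)
        return ranking

    def quality(ranking):
        # Toy objective: reward placing high-heuristic documents near the top.
        return sum(heuristic[d] / (rank + 1) for rank, d in enumerate(ranking))

    best = None
    for _ in range(n_iter):
        rankings = [build_ranking() for _ in range(n_ants)]
        iteration_best = max(rankings, key=quality)
        if best is None or quality(iteration_best) > quality(best):
            best = iteration_best
        # Evaporate pheromone, then reinforce documents ranked highly by the best ant.
        for d in docs:
            pheromone[d] *= (1.0 - rho)
        for rank, d in enumerate(best):
            pheromone[d] += 1.0 / (rank + 1)
    return best

if __name__ == "__main__":
    # Hypothetical per-document heuristic scores (e.g., feature score x collection quality).
    heuristic = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.2}
    print(aco_rank(heuristic))
```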
The remainder of the paper is organized as follows. Section 2 presents the related work. Section 3 presents the research methodology. Section 4 presents the experimental and evaluation setup. Section 5 presents and discusses the experimental results. Section 6 presents the conclusions, followed by the references.