A new Passage Retrieval Method in Arabic Question Answering Systems

In this paper, we present our approach to improving the performance of open-domain Arabic Question Answering systems. We focus on the passage retrieval phase, which aims to retrieve the passages most related to the correct answer. To extract passages related to the question, the system passes through three phases: Question Analysis, Document Retrieval, and Passage Retrieval. We define a passage as a sentence that ends with a dot ".". In the Question Analysis phase, we applied the traditional NLP steps of tokenization, removal of stopwords and unrelated symbols, and replacement of the question words with their stems. We also applied Query Expansion by adding synonyms of the question words. In the Document Retrieval phase, we used the Vector Space Model (VSM) with a TF-IDF vectorizer and cosine similarity. For the Passage Retrieval phase, which is the core of our system, we measured the similarity between passages and the question using a combination of the BM25 ranker and a Word Embedding approach. We tested our system on the ARCD dataset, which contains 1395 questions from different domains; the system achieved a precision of 92.2% and a recall of 79.9% in finding the top-3 passages related to the query.

concerned with answering questions of a specific type ("why", "when", …) (Azmi et al., 2017). Other systems, called open-domain Question Answering systems, try to answer many types of questions within different domains, and they are the most general systems (Al-Smadi et al., 2017; Zhou et al., 2020). Such systems consist of the following basic modules: (1) a Question Analysis Module, including question tokenization, stemming of its words, diacritics removal (in Arabic), and question reformulation and classification; (2) a Document Retrieval Module, to extract the documents most related to the question; (3) a Passage Retrieval Module, concerned with extracting the short text passages that are most related to the question and expected to contain an answer to it; and finally (4) an Answer Extraction Module, which extracts the exact answer from the passages returned by the previous phase, according to the type of the asked question (Sarrouti et al., 2020).
The Passage Retrieval phase is important for effective Question Answering results, as extracting the correct answer depends on its appearance in the passages produced by this phase. Retrieval is often achieved in open-domain Question Answering systems using traditional methods such as TF-IDF or BM25; such methods represent the question and the passage as weighted vectors in the dataset space and rank the passages according to the similarity of each passage vector to the query vector (Karpukhin et al., 2020). One of the most popular applications of the Word Embedding representation is Information Retrieval (IR), where it can be used in the passage retrieval module of Question Answering systems (Zuccon et al., 2015). Word Embedding is a family of language modelling and feature learning methods that has recently been used in many Natural Language Processing tasks. It maps textual words into a low-dimensional continuous space, so that each word is represented by a real-valued vector and semantically similar words have similar vector representations (Li et al., 2018). The advantage of this approach is that it captures the semantic similarity between words instead of settling for lexical similarity (Mitra et al., 2017).
In this paper, we propose a new model to implement the passage retrieval component in Arabic Question Answering systems. We consider a combination of lexical and semantic similarities, using BM25 together with a word embedding model for similarity measurement. In the rest of this paper, we review related work in the QAS domain in Section 2. In Section 3, we present our approach and its different steps in detail. In Section 4, we report our experimental results together with a comparison with similar works. Finally, insights for future work and a short summary are presented.

Related Works
Abdi et al. proposed in (Abdi et al., 2020) ASHLK, a question answering system for Hadith using linguistic knowledge. The system retrieves the hadith sentences most related to the user's query from a large set of hadiths that constitute the system's dataset. It finds the answer by measuring the similarity between the user's query and the hadith sentences using graph similarity, where each node in the graph represents a sentence and each edge carries the similarity value between the two sentences (nodes). This similarity combines the semantic similarity between the words of the two sentences with a syntactic similarity that takes the order of words in the sentence into account (on the assumption that the word order of a sentence reflects its meaning). The system returns, as the answer, a specific number of sentences that are most similar to each other and to the user's question, after eliminating repeated sentences. It was tested on a dataset consisting of 4000 hadiths and 3825 queries, split into 2678 training queries and 1147 testing queries, and obtained a precision of 83.5% and a recall of 63%.
LEMAZA (Azmi et al., 2017) is an open-domain Arabic Question Answering system concerned with answering questions of the type "why". After the Question Analysis and Document Preprocessing phases, the passages most related to the question are extracted and verified in the Document/Passage Retrieval phase using the Lemur retrieval engine. The Answer Extraction phase uses an RST-based algorithm implemented specifically for answering "why" questions. The system was tested on a manually collected dataset (110 "why"

Query Preprocessing
At this stage, the question is passed through traditional linguistic processing steps, which are: a) Tokenization, by separating the question text into smaller units (words); this is done by assuming the space as a delimiter. b) Normalization, by replacing characters that can be written in different ways with their normal form, e.g. replacing "أ", "إ", "آ" with "ا". c) Diacritics removal, as shown in Table 1.
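The preprocessing steps above (whitespace tokenization, Alif normalization, and diacritics removal) can be sketched as follows; the regular expressions and function names are our own illustration, not the paper's implementation:

```python
import re

# Arabic diacritics (tashkeel) and the tatweel character
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u0640]")
# Alif variants normalized to a bare Alif: أ إ آ -> ا
ALIF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")

def normalize(text: str) -> str:
    """Strip diacritics and unify Alif forms to their normal form."""
    text = DIACRITICS.sub("", text)
    return ALIF_VARIANTS.sub("\u0627", text)

def tokenize(question: str) -> list[str]:
    """Split the normalized question on whitespace (space as delimiter)."""
    return normalize(question).split()
```

A stemmer and stopword filter would follow these steps in the full pipeline described in the abstract.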

Query Expansion (QE)
To increase the accuracy of the system, the question is expanded by adding synonyms of each of the question words; Table 2 shows a question before and after the Query Expansion process. Through experimentation, it was found that the system gives the best results when we take one synonym (if found) for each word of the question, as shown in Table 3. To implement the question expansion phase, the approach of Word Embedding used in the
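One-synonym expansion via embedding nearest neighbours can be sketched as follows; the paper uses the pre-trained AraVec model, while the tiny vector table below is purely hypothetical:

```python
import math

# Hypothetical 3-d vectors standing in for AraVec embeddings
EMB = {
    "city":   [0.90, 0.10, 0.00],
    "town":   [0.85, 0.15, 0.05],   # close to "city"
    "poet":   [0.00, 0.90, 0.20],
    "writer": [0.10, 0.80, 0.30],   # close to "poet"
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand(tokens, k=1):
    """Append the k nearest embedding neighbours (synonyms) of each token."""
    out = list(tokens)
    for t in tokens:
        if t not in EMB:
            continue  # out-of-vocabulary words are left unexpanded
        neighbours = sorted((w for w in EMB if w != t),
                            key=lambda w: cosine(EMB[t], EMB[w]),
                            reverse=True)
        out.extend(neighbours[:k])
    return out
```

With k=1 this mirrors the one-synonym-per-word setting that the paper found to work best.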

Document Retrieval (DR)
In this stage, the documents are processed by applying the same linguistic processing steps implemented in the previous stage. The Vector Space Model (VSM) method was used to represent each document with a vector using a TF-IDF vectorizer. The question vector is also represented in the same way, so that the most relevant documents are retrieved by applying the cosine similarity between each document vector and question vector.
The five documents most similar to the question are selected as input for the next stage (PR).
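This stage can be sketched in pure Python as follows; the IDF smoothing is our own choice for illustration, and a real system would typically use a library vectorizer instead:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) over the vocabulary of `docs`."""
    toks = [d.split() for d in docs]
    df = Counter()
    for t in toks:
        df.update(set(t))
    # Smoothed IDF so terms present in every document still get weight
    idf = {w: math.log(len(docs) / df[w]) + 1 for w in df}
    return [{w: c * idf[w] for w, c in Counter(t).items()} for t in toks], idf

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_docs(question, docs, k=5):
    """Indices of the k documents most similar to the question."""
    vecs, idf = tfidf_vectors(docs)
    q = Counter(question.split())
    qv = {w: c * idf.get(w, 0.0) for w, c in q.items()}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(qv, vecs[i]), reverse=True)
    return ranked[:k]
```

The default k=5 mirrors the paper's choice of passing the five most similar documents on to the PR stage.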

Passage Retrieval (PR)
In this phase, we aim to extract the three passages that are most related to the asked question. This is done using the concept of Word Embedding, based on a pre-trained model (AraVec) on an Arabic dataset collected from Arabic Wikipedia articles (Soliman et al., 2017).
To obtain the similarity between the question and the candidate passages, we use the following algorithm, where SimAraVec denotes the similarity between two sentences under the AraVec model. We first split the documents resulting from the Document Retrieval phase using a sentence segmenter, considering each passage to be a sentence that ends with a dot ".". For each passage p, we calculate SimAraVec between the question and p, obtaining a vector of similarity values {v_p1, v_p2, ..., v_pn} over the n candidate passages. We normalize the resulting vector by dividing by its maximum value, then sort the values in descending order and choose the three passages that achieve the largest similarity values.
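The algorithm above can be sketched as follows, taking a sentence vector to be the average of its word vectors (a common choice; the paper does not spell out its sentence-composition method, and the toy embedding table stands in for AraVec):

```python
import math

# Hypothetical word vectors standing in for the pre-trained AraVec model
EMB = {"capital": [1.0, 0.0], "city": [0.9, 0.1],
       "france":  [0.8, 0.2], "river": [0.0, 1.0]}

def sent_vec(tokens):
    """Average of known word vectors; zero vector if all tokens are OOV."""
    vs = [EMB[t] for t in tokens if t in EMB]
    if not vs:
        return [0.0, 0.0]
    return [sum(c) / len(vs) for c in zip(*vs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top3_passages(question, documents):
    # A passage is a sentence ending with a dot "."
    passages = [p.strip() for d in documents for p in d.split(".") if p.strip()]
    qv = sent_vec(question.split())
    sims = [cosine(qv, sent_vec(p.split())) for p in passages]
    m = max(sims)
    sims = [s / m if m else s for s in sims]   # normalize by the max value
    ranked = sorted(zip(sims, passages), reverse=True)
    return [p for _, p in ranked[:3]]
```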
The previous measurement takes into account the similarity in terms of meaning only. To increase the accuracy of the model, we added another measure, which is BM25 function that takes into account the lexicon similarity.
The general form of the model used to achieve the passage retrieval module thus becomes:

Sim(q, p) = α · SimAraVec(q, p) + (1 − α) · SimBM25(q, p)    (Equation 1)

where α is a weighting parameter in the range [0, 1] that balances the significance of the semantic and lexical similarities, and SimBM25 is the similarity between two sentences using the BM25 method.
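The combined score can be sketched as below, pairing a minimal Okapi BM25 lexical score with a max-normalized semantic score; the k1 and b values are common defaults rather than the paper's settings, and the placement of α on the semantic term is our assumption:

```python
import math

def bm25_scores(query_tokens, passages_tokens, k1=1.5, b=0.75):
    """Minimal Okapi BM25 over pre-tokenized passages."""
    N = len(passages_tokens)
    avgdl = sum(len(p) for p in passages_tokens) / N
    df = {}
    for p in passages_tokens:
        for w in set(p):
            df[w] = df.get(w, 0) + 1
    scores = []
    for p in passages_tokens:
        s = 0.0
        for w in query_tokens:
            f = p.count(w)
            if f == 0:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(p) / avgdl))
        scores.append(s)
    return scores

def combined(sem, lex, alpha=0.3):
    """Equation 1: alpha * semantic + (1 - alpha) * lexical, each max-normalized."""
    def norm(xs):
        m = max(xs)
        return [x / m for x in xs] if m else list(xs)
    return [alpha * s + (1 - alpha) * l for s, l in zip(norm(sem), norm(lex))]
```

Both score lists are normalized before mixing so that α trades off comparable quantities.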

Results and discussion
We tested our system using the Arabic Reading Comprehension Dataset (ARCD), which consists of 1395 questions over Wikipedia articles. We randomly chose 100 questions for development and used the rest for testing.
To choose an appropriate value for the parameter α in Equation 1, we calculated the precision and recall for each value in the range [0.1, 0.9] with a step of 0.1. The system achieved its best performance for α = 0.3, as shown in Figure 2.
We compared the performance of our system with that of other systems: ASHLK, LEMAZA, and SQAL (see Table 4). We used Precision, Recall, and Sentence Match (SM) as comparison metrics. SQAL is the only other system that used the ARCD dataset, whereas ASHLK was tested on Hadith data and LEMAZA on its own dataset. Based on the SM metric, our system performed better than SQAL (+0.8%) on the same dataset.
Compared to the ASHLK and LEMAZA systems, and although different datasets were used for testing, the precision of our system is generally better (by 8.7% and 13%, respectively), and its recall is better by 16.9% and 7.2%, respectively.

Conclusion and future work
In this paper, we presented a new efficient method for passage retrieval in Arabic Question Answering systems.
It is based on combining the semantic similarity between words, measured using the pre-trained AraVec word embedding model, with the lexical similarity measured by the BM25 ranker.
The method was tested on the ARCD dataset, on which the system achieved a precision of 92.2% and a recall of 79.9%. The system could be improved by using a similarity measure that takes the word context into account in addition to semantic similarity; this can be achieved using a pre-trained context-dependent language model such as BERT.