In recent years, advances in deep learning with neural networks have improved performance on many natural language processing (NLP) tasks. In NLP, neural networks are used for machine translation, speech recognition, text generation, text mining, and named entity recognition.
An idiomatic expression is a group of words whose meaning differs from the meanings of its individual words. The meaning of an idiom cannot be derived directly from the meanings of the words that make it up [1]. Idiomatic expressions are an important part of all natural languages [2]. Detecting such expressions in Amharic text helps readers who are not familiar with the language. For example, the expression “ፊቱን ጣለዉ” translates literally as “he drops his face”, but its actual meaning is “he becomes sad”.
Recognizing idiomatic expressions in a given text plays an important role in tasks such as machine translation, speech recognition, sentiment analysis, and dialog systems for the respective language. Amharic, a member of the Semitic language family, has more than 4000 idiomatic expressions [3].
The paper by [4] uses Skip-Thought Vectors to create distributed representations that encode features predictive of idiom token classification. The authors show that classifiers using these representations perform competitively with the state of the art in idiom token classification. Their models use only the sentence containing the target phrase as input and are thus less dependent on a potentially inaccurate or incomplete model of discourse context. They further demonstrate the feasibility of using these representations to train a competitive general idiom token classifier.
The authors of [5] proposed an idiomatic expression detection method based on the assumption that idioms and their literal counterparts do not occur in the same contexts. Their model first computes the inner product of the context word vectors with the vector representing the target expression. Because literal vectors predict local contexts well, their inner product with the contexts should be greater than that of idiomatic vectors, which distinguishes literals from idioms. The model then computes literal and idiomatic scatter (covariance) matrices from the local contexts in word vector space. Because the scatter matrices represent context distributions, the authors use the Frobenius norm to measure the difference between the distributions.
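The two quantities used in [5], the inner product between a target vector and its context vectors and the Frobenius norm of the difference between two scatter matrices, can be illustrated with a minimal sketch. The two-dimensional vectors and the helper names (`mean_inner_product`, `scatter_matrix`, `frobenius_distance`) are illustrative assumptions, not the authors' implementation or data.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_inner_product(target, contexts):
    """Average inner product between a target vector and its context vectors."""
    return sum(dot(target, c) for c in contexts) / len(contexts)

def scatter_matrix(vectors):
    """Sum of outer products of the mean-centered vectors (unnormalized covariance)."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for v in vectors:
        centered = [x - m for x, m in zip(v, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += centered[i] * centered[j]
    return s

def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

# Hypothetical toy vectors standing in for trained word embeddings.
target = [1.0, 0.0]
literal_contexts = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]]
idiomatic_contexts = [[0.1, 0.9], [0.0, 1.0], [0.2, 0.7]]

literal_score = mean_inner_product(target, literal_contexts)
idiomatic_score = mean_inner_product(target, idiomatic_contexts)
distance = frobenius_distance(scatter_matrix(literal_contexts),
                              scatter_matrix(idiomatic_contexts))
```

In this toy setting the literal contexts align more closely with the target vector than the idiomatic ones, and `distance` grows as the two context distributions diverge.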
The work of [6] presents a generalized model for determining whether an idiom is used figuratively or literally, based on the concept of semantic compatibility. The authors examine the limitations of the continuous bag-of-words (CBOW) model in measuring semantic compatibility and propose a novel semantic compatibility model based on CBOW training for idiom usage recognition. Experiments on two benchmark idiom usage corpora show that the proposed generalized model outperformed the per-idiom models that were state of the art at the time.
The authors of [7] offer a model for detecting idiomatic phrases in written text. They treat idiom recognition both as outlier (anomaly) detection and as supervised sentence classification. For outlier detection, they use principal component analysis. Because detecting idioms as lexical outliers does not make use of class label information, the authors also use linear discriminant analysis to obtain a discriminant subspace and then apply a three-nearest-neighbor classifier to measure accuracy. They analyze the advantages and disadvantages of each technique. All of the techniques are more general than earlier idiom identification algorithms in that they do not rely on target idiom types, lexicons, or large manually annotated corpora, nor do they restrict the search to a particular type of linguistic construction.
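The final classification step described for [7] can be sketched as a plain three-nearest-neighbor vote. This is a minimal illustration that assumes the sentence vectors have already been projected into a discriminant subspace; the points, labels, and the function name `knn_predict` are hypothetical and are not taken from [7].

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-d points standing in for sentences projected by LDA.
train_points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                [1.0, 1.1], [0.9, 1.0], [1.2, 0.8]]
train_labels = ["literal", "literal", "literal",
                "idiom", "idiom", "idiom"]

prediction = knn_predict(train_points, train_labels, [1.0, 0.9])
```

With `k=3`, each query is assigned the label shared by most of its three closest training sentences, so no model parameters need to be fitted beyond the projection itself.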
Idiomatic expressions have a detrimental impact on the performance of NLP tasks [8]. However, to the best of our knowledge, there is no Amharic natural language processing model that takes idiomatic expressions into account. This motivated us to build an Amharic idiomatic expression identification system based on deep learning. This study focuses on constructing a convolutional neural network (CNN) with the FastText model to detect the presence of idiomatic expressions in Amharic text. The overall contributions of the study are summarized as follows:
- Prepare a general-purpose Amharic idiomatic expression dataset that can be used by other studies in the future.
- Propose a deep learning model that combines a CNN with FastText to recognize idioms in Amharic texts.
- Evaluate the performance of the proposed recognition model with various evaluation metrics.
The remainder of the paper is structured as follows. Section 2 presents the methodology of the proposed work in detail. Section 3 describes the experimental results. Section 4 presents the outcome and a discussion of it. Finally, Section 5 concludes the paper.