Sseveral evaluation criteria have been suggested In literature to evaluate the strength and accuracy of stemming algorithms (Sirsat, Chavan, and Mahalle 2013; Frakes and Fox 2003; Paice 1994). To measure the correctness and strengthen of UTS, Sirsat et al. (Sirsat, Chavan, and Mahalle 2013) evaluation metrics, as well as the precision, recall and F-measure metrics are used.
The chosen evaluation measures (Jabbar et al. 2018; S. Khan et al. 2015b) are described as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall = (TP) / (TP + FN)
Precision= (TP) / (TP + FP)
F1(recall, precision) = 2*(Precision*Recall) / (Precision + Recall)
Sirsat et al. (Sirsat, Chavan, and Mahalle 2013) criterion is very compelling for assessing the strength and accuracy of a stemming algorithm. The following parameters are used to evaluate the strength and accuracy of the stemmer (Sirsat, Chavan, and Mahalle 2013).
Index compression factor (ICF)
ICF indicate the percentage by which a collection of distinct words is reduced by stemming. The ICF is defined as
ICF = n - s
Where,
n = Total number of words before stemming
s = Number of words after stemming
Word Stemmed Factor (WSF)
It is an average number of words that a stemmer has stemmed. The threshold value is 50
WSF = WS / TW * 100
Where,
WS indicates the number of stem words,
TW stands for the total number of words
Sample correctly stemmed word factor (CSWF)
The value of CSWF shows the accuracy of the stemmer. The minimum threshold is 50
CSWF = (CSW / SW) * 100
Where,
CSW = correctly stem words, whereas
WS refers to the total number of stemmed words
Average Words Conflation Factor (AWCF)
AWCF is the average number of variant words of different conflation class that is stemmed correctly to the stem/root. To calculate AWCF, we first compute the number of unique words after conflation, which is calculated as follows
NWC = S - CW
Where,
S shows the number of distinct stems after stemming,
CW refers the number of incorrectly stemmed words.
Then, AWCF is computed as:
AWCF=(CSW-NWC)/(CSW)*100
To evaluate the performance of the proposed stemming algorithm, a series of experiments is conducted.
5.1 Corpus description
In order to evaluate UTS, we constructed the dataset which contains text fragments, including news articles (politics, literature, science, and technology) collected from BBC Urdu (W1, n.d.) and (W2, n.d.) containing 20000 words, including stop words, verbs, adverb, adjectives, nouns, proper nouns, punctuation marks, English words, numbers and special symbols. Words corpus consists of 56074 Urdu words containing uni-gram, bi- gram, tri-gram compound words, broken plural words and words with infixes. The dataset titled USED (Urdu Stemmer Evaluation Dataset) is collected form the following sources. (It will be online when the paper will be accepted):
-
Four Urdu grammar books (Haq 1996; Board 2010; Baloch, n.d.; UEP 2014).
-
Resources provided by (S. Hussain 2004) on Urdu morphology,
-
Online resources: Urdu online encyclopedia (W3, n.d.), CLE Urdu words list (W4, n.d., 4)
-
CLE Urdu high frequency words list (W5, n.d.).
The words corpus have been preprocessed, by the following steps:
-
Removal of stop words, punctuations, numbers, and symbols.
-
Deletion of English and French characters.
-
Elimination of Urdu diacritics.
The text documents are related to 4 topics and each topic is represented by 5 texts with different lengths (Table 8), which lead to a total 20000 words. The text corpus feed the stemmer without preprocessing and token the text using token marker mentioning appendix.
Table 8
Text corpus
|
Topic
|
No of articles
|
Total Words
|
1
|
Politics
|
5
|
6500
|
2
|
Literature
|
5
|
4600
|
3
|
Science
|
5
|
5300
|
4
|
Technology
|
5
|
3600
|
Total
|
20
|
20000
|
5.2 Results and discussion
To measure the performance of proposed stemmer we involved the human experts to annotate the words with the corresponding actual stem. They are native Urdu speakers with a relevant qualification. Obtained annotations are cross validated by each other and results are used for the rule extraction which leads to the development of stemmer. The stemmer is applied to the raw data. The results produced are compared with human expert annotations.
In this subsection, we compare the obtained results of UTS with existing Urdu stemmers Assas-Band stemmer (Akram, Naseer, and Hussain 2009) and MU stemmer (Jabbar et al. 2018). The selection of stemmers is based on multiple factors. Assas-Band stemmer (Akram, Naseer, and Hussain 2009) stemmer has high accuracy among the rule-based Urdu stemmer so selected as representative of Urdu rule-based stemmers. Similarly, MU stemmer (Jabbar et al. 2018) is better among infixes removal stemmers (Mubashir Ali, Khalid, and Aslam 2017; S. Khan et al. 2015b). Likewise(Akram, Naseer, and Hussain 2009) is a statistical stemmer that is tested on Urdu text, and Husain et al. (Z. Hussain et al., n.d.) is sole stem word dictionary based Urdu stemmer.
Table 9
UTS confusion matrix description
Characteristics
|
Words corpus
|
Text corpus
|
TP
|
51788
|
10347
|
FN
|
2391
|
674
|
FP
|
457
|
268
|
TN
|
1438
|
198
|
Confusion matrix statistics obtained after applying stemmer on both corpora is shown in Table 9.
Finally, Table 10 and Table 11 show the performance measures of UTS compared with state-of-the-art Urdu stemmers. Majority of the existing Urdu stemmers (e.g. (Akram, Naseer, and Hussain 2009; Husain, Ahamad, and Khalid 2013; S. Khan et al. 2015b) are evaluated on the word corpus. Usal stemmer (Gupta, Joshi, and Mathur 2015) also works on textual data. But it uses hard space to identify the words boundary. In Usal stemmer, compound words are wrongly split into two unigram words, therefore, compound word remains unstemmed. For instance, ہے گاہ عبادت یہ [yeh ibadat gaah hai/this is a place of worship] that is tokenized form of the compound word گاہ عبادت [ibadat gaah/ place of worship] into two unigram word عبادت [ibadat] is a root word, گاہ [gaah] is a suffix, consequently compound word گاہ عبادت [ibadat gaah/ place of worship] remains unstemmed.
The effectiveness of UTS is measured by using words corpus and text corpus as mentioned in Table 10. The major attributes to measure the performance is prediction of correctness of query word’s stem. We conducted two different experiments and performed comparisons in term of accuracy on word corpus. The experimental results are presented in Fig. 5 and Fig. 6. Husain et al. (Z. Hussain et al., n.d.) claimed a higher accuracy of 94.85% on 3600 Urdu words, and stem words dictionary size is not mentioned. Assas Band stemmer (Akram, Naseer, and Hussain 2009) tested their stemmer on corpus consisting of 21757 Urdu words and achieved 92.97% accuracy. Their stemmer removed prefixes and /or suffixes but did not handle the infixes. Husain et al. (Husain, Ahamad, and Khalid 2013) proposed N-gram stemmer to truncate suffixes and ignored the prefixes and infixes. This stemmer obtained 84.27% purity on test size of 1200 Urdu words that are extracted from the E-mail corpus. Khan et al. (S. Khan et al. 2015b) used template to handle the infixes, but no mechanism is described to handle the exception of defined rules and template less Urdu words. For instance, a pattern is defined by Khan et al (S. Khan et al. 2015b), if an Urdu word has four characters and the third letter is و [vao], the third letter is removed to extract the stem, but this rule is violated in case of جلوس [juloos/ procession] and حصول [husool/ acquisition] to obtain the stem. This stemmer is evaluated on corpus consisting of 19351 words and claimed precision and recall are 89.95% and 96.08 respectively. The MU stemmer (Jabbar et al. 2018) assessed on both words and textual data and obtained a recall value of 99% and 95.586% precision.
Table 10
Performance comparison of UTS with state-of-the-art stemmers as reported in relevant papers
Stemmer
|
Corpus
|
No.
of words
|
Used
method
|
In %age
|
Acc.
|
Recall
|
Prec.
|
F1
score
|
Text
Stemmer (UTS)
|
Words
corpus
|
56074
|
Hybrid
|
94.92
|
99
|
95.58
|
97.32
|
Text
corpus
|
20000
|
-
|
91.8
|
97.48
|
93.88
|
95.65
|
MU stemmer (Jabbar et al. 2018)
|
Words
corpus
|
56074
|
Multi-
step
|
92.97
|
99
|
93.57
|
96.26
|
Text
corpus
|
20000
|
-
|
90.33
|
97.43
|
92.35
|
94.82
|
Hussain et al. (Hussain et al.2017)
|
Words
corpus
|
3600
|
Dictionary
based
|
94.85
|
-
|
-
|
-
|
Assas- Band stemmer
(Akram, Naseer, and Hussain 2009)
|
Words
corpus
|
21757
|
rule-
based
|
91.2
|
-
|
-
|
-
|
Husain et
al. (Husain, Ahamad, and Khalid 2013)
|
Words
corpus
|
1200
|
Statistics
(N-
gram)
|
84.27
|
-
|
-
|
-
|
Table 11
Performance comparison using standard dataset
Sirsats’ evaluation Metrics
(Sirsat and Mahalle 2013)
|
Assas Band stemmer
(Akram and Hussain 2009)
|
MU stemmer (Jabbar et al. 2018)
|
Urdu
Text Stemmer (UTS)
|
Total words (TW)
|
56074
|
56074
|
56074
|
No. of distinct words
after stemming (S)
|
26597
|
25233
|
24970
|
Index Compression Factor (ICF)
|
53
|
55
|
55.47
|
No. of stemmed words
|
55128
|
54012
|
54179
|
Words Stem factor (WSF)
|
98.31
|
96.32
|
96.62
|
correctly stem words (CSW)
|
49238
|
50693
|
51788
|
correctly stem words Factor (CSWF)
|
89.32
|
93.86
|
95.59
|
No. of distinct words after conflation (NWC)
|
24079
|
23795
|
23532
|
Average conflation words factor (AWCF)
|
51.10
|
53.06
|
54.56
|
MU stemmer (Jabbar et al. 2018) extracted the bigram word from textual data to produce stem, for example, the Urdu sentence ہے گاہ عبادت یہ [yeh ibadat gaah hai/this is place of worship], after eliminating the یہ [yeh/this] and ہے [hai/is], the compound word عبادت گاہ [ibadat gaah/ place of worship] is extracted and the suffix گاہ [gaah] is removed to obtain the stem عبادت [ibadat]. However, their stemmer fails to deal with Mohmil words, and multilevel inflection and derivation. Whereas UTS achieved recall and precision of 99% and 93.95%, respectively. These results are shown in Fig. 4. On text data set, the MU stemmer (Jabbar et al. 2018) yields 90.33% accuracy, followed by UTS that achieved accuracy of 91.8% as mentioned in Fig. 4. The results obtained on the same data set, show that UTS achieved the better performance (see Fig. 4 to 6) than existing state of the art MU stemmer (Jabbar et al. 2018) and Assas-Band stemmer (Akram, Naseer, and Hussain 2009) specifically, existing Urdu stemmers have caused some under-stemming errors for certain groups of words that hold multi-level inflection and derivation, for examples: Bigram words having co-suffix دار تھانے [thaanaydaar/the officer of a police station], corresponds to the mistaken stem تھانے [police stations].
The Urdu bigram word یافتہ تعلیم [taleem Yafta/ educated], possessed suffix یافتہ [yafta] and some infixes letters, existing stemmer commit under stemming errors and produced تعلیم [taleem/education] as a stem. Bigram words having a Mohmil word as an affix سمجھابجھا [samjhabujha/understand] cannot be stemmed. Trigram words having prefixes and suffix along with infixes, such as یافتہ تعلیم غیر [gher taleem yafta/uneducated], possess prefix, suffix, and infix. The existing Urdu stemmers produce incorrect stem, i.e., تعلیم [taleem/education]. Trigram words having co-suffix جات خانہ جیل [jail khanah jaat/ the prisons] and produce the incorrect stem خانہ جیل [jail khanah/ the prison].
Assas-Band stemmer (Akram, Naseer, and Hussain 2009) stemmer cannot handle the infixe cases that is why its performance is lower as mentioned in Table 11 and Table 12 and Fig. 5. Whereas UTS extracts the correct stem and it’s obtained CSWF is significantly higher 95.59% from MU stemmer (Jabbar et al. 2018) obtained score 93.86% and Assas-Band stemmer (Akram, Naseer, and Hussain 2009) show lowest CSWF score 89.32% as reflected in Table 11 and Fig. 5.
In the second experiment, we use text corpus and compare the achieved accuracy with (Jabbar et al. 2018). Again, UTS outperform than MU stemmers (Jabbar et al. 2018) ( see Fig. 6). The reason is, that MU stemmer (Jabbar et al. 2018) does not deduct the affixes from the token of three words size such as the obtained token مالیت کاری سرمایہ [sarmaya kaari maliyat/ worth of investment] consists of a compound word کاری سرمایہ [sarmaya kaari/investment], in which کاری [kaari] is an affix and سرمایہ [sarmaya/capital] is a stem. The compared stemmers with the proposed one do not remove the Mohmil affixes and multi-level affixes as shown in Table 12 and the incorrect produced stem in the column is underlined. In the Table 12, we can notice that all the words having Mohmil suffix چکاری [chaakari], چیت [cheet] are not stemmed. Whereas, in case of multi-level affix وار مردانہ [mardana waar/manly] under stemming errors are committed by MU stemmer (Jabbar et al. 2018) and Assas-Band stemmer (Akram, Naseer, and Hussain 2009) UTS is the first Urdu stemmer that handles the multi-level inflections and Mohmil words reduction, as shown in Table 12. UTS has obtained 95.5 % CSWF while MU stemmer (Jabbar et al. 2018) achieved 93.8 % and Assas-Band stemmer (Akram, Naseer, and Hussain 2009)obtained lowest CSWF score 89.32%. Assas-Band stemmer (Akram, Naseer, and Hussain 2009) blindly removed the affixes and achieved highest WSF [98.31%] score than their competitor. The ICF achieved by UTS is 55.47% which is higher than the competitor with 55% value and 51%. as shown in table 13 and Fig. 6.
The performance of proposed UTS is comparatively higher with respect to CSWF and AWCF score as shown in Table 11. Therefore, from the obtained results using Sirsats (Sirsat, Chavan, and Mahalle 2013) evaluation method, we show that the UTS provides better results in terms of performance, strength, and accuracy.
Table 12
Example of produced stem by state of art Urdu stemmers
Query
Words
|
Actual
stem
|
Assas- Band stemmer
(Akram and Hussain 2009)
|
Multi-step stemmer
(Jabbar et al. 2018)
|
Urdu
Text Stemmer (UTS)
|
چوری
چکاری
|
چور
|
چوری
چکاری
|
چوری
چکاری
|
چور
|
نا تجربہ کار
|
تجربہ
|
تجربہ ک
|
تجربہ
|
تجربہ
|
امراض
|
مرض
|
امراض
|
مرض
|
مرض
|
بات چیت
|
بات
|
ت چیت
|
بات چیت
|
بات
|
مردانہ وار
|
مرد
|
مردانہ و
|
مردانہ
|
مرد
|
وجوہات
|
وجہ
|
وجوہ
|
وجہ
|
وجہ
|
The time complexity of the proposed algorithm is \(\varvec{O}\left(\varvec{n}\right)\). Because the execution time is directly proportional to the size of the input. Step 4 to 7 is executed in the nested loop which is based on the number of rules defined for the step. The fixed number of rules adds a constant factor in time complexity. Moreover, lower order terms and constants are ignored (Cormen et al. 2009).
Similarly, space complexity also remains linear. At the start of the algorithm, data is loaded, which is then reduced in coming steps. Rule lists used to stem the words are of a fixed size, which adds a constant space complexity that can be ignored in space complexity analysis (Cormen et al. 2009). The time and space complexity of the UTS is better than MU stemmer (Jabbar et al. 2018) which exhibits O(n2). Other exiting Urdu stemmer’s time and space complexity is not provided, so the comparison is not possible.
Although the better efficiency to produce stem has been achieved by UTS; however, there are some deficiencies such as it may mistakenly stem (False Positive) the proper noun, for instance, ارشاد [Irshad/Name of a person] is wrongly stemmed to رشد [rushad/ guidance]. The reason is that there is no mechanism in the Urdu language to identify the proper nouns. The Mohmil compound words that are not split by hard space such as کھاناوانا [khana wana/ the meal], patternless words and confusing words that may use as a root word or as an affix, cause the wrong stemming cases (False Negative). Although we handle such cases by table lookup approach and stem words list, it will depend upon the table lookup and stem words list entries. In case of text corpus, the accuracy is low as compared to words corpus (see Fig. 4) due to the improper tokenization and identification of proper noun, for instance, the Urdu sentence ہے لڑکا یافتہ تعلیم حسین حارث محمد [Mohammad Haris Hussain taleem Yafta larka hai /Mohammad Haris Hus- sain is an educated boy] in which compound wordتعلیم یافتہ[taleem Yafta/ educated] is not properly extracted due to the proper noun حسین حارث محمد [Mohammad Haris Hussain] and remains un-stemmed or wrong stem is produced. The reason is that in such case hard space is used as a delimiter, as a result, compound words such as یافتہ تعلیم [taleem Yafta/ educated] became two independent words تعلیم [taleem/education] and suffix یافتہ [yafta]. In this situation, the suffix یافتہ that becomes independent word is not removed but تعلیم [taleem/education] is stemmed to علم [ilam/knowledge].