Research in Pashto NLP is still in its infancy, and apart from POS tagging there are numerous open research areas where the corpus can be applied. Some of the most basic use-cases are discussed here.
11.1 PROOFING TOOL
Pashto is not a standardized language, and there are no rules governing the proper use of whitespace in its writing system. Although whitespace is used for word separation, it is not used consistently, and therefore it cannot be treated as an explicit word boundary as in English and other Western languages. On the other hand, some languages such as Chinese and Japanese omit whitespace entirely, as it is not considered part of the writing system. In Pashto, however, whitespace cannot be ignored either, because it is part of the writing, especially when typing on modern keyboards. The improper and inconsistent use of whitespace in Pashto leads to two typical spelling errors, commonly known as space-omission and space-insertion errors; a brief explanation with examples is given in Section 2. These two errors make the task of word segmentation challenging and unique, such that it cannot be performed using only the available baseline segmentation techniques, such as whitespace tokenization or the lexicon-based approach. Ignoring these errors negatively affects the performance of any NLP algorithm. The proposed corpus can be used to address space-omission and space-insertion errors: an AI model can be trained on character-level information that captures the context of each character to predict the proper position of whitespace in a sentence.
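The character-level training setup described above can be sketched as follows. This is a minimal illustration of the data-preparation step only (the function name and the binary labeling scheme are our assumptions, not the paper's implementation): each character of a correctly spaced sentence is paired with a label indicating whether a whitespace should follow it, and a sequence model can then be trained on these pairs.

```python
# Minimal sketch (illustrative, not the paper's implementation): turn a
# correctly spaced sentence into character-level whitespace labels.
# Label 1 means "a space should follow this character", 0 otherwise.

def char_level_labels(segmented_sentence: str):
    """Convert a correctly spaced sentence into (char, label) pairs."""
    chars, labels = [], []
    for token in segmented_sentence.split():
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append(1 if i == len(token) - 1 else 0)
    labels[-1] = 0  # no space is predicted after the final character
    return list(zip(chars, labels))

# Example with a Latin-script placeholder sentence:
pairs = char_level_labels("the cat")
# pairs -> [('t',0), ('h',0), ('e',1), ('c',0), ('a',0), ('t',0)]
```

A character-level classifier (e.g., a BiLSTM or CRF) trained on such pairs can then restore missing spaces and remove spurious ones, addressing both error types at once.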
11.2 WORD SEGMENTER
The whitespace tokenizer may be sufficient for many NLP applications, but some scenarios demand a word segmenter. For example, a machine translation system may need to capture a whole word, e.g., “New York” rather than “New” and “York”. The quality of word segmentation affects the performance of all downstream tasks, such as NER, homograph disambiguation, and machine translation. One such example is a recent study on sentiment analysis in Pashto, in which classic ML algorithms such as Random Forest and Naïve Bayes outperformed sophisticated deep learning (DL) algorithms, RNN and LSTM (Iqbal, Khan et al. 2022). One reason is that these DL models are based on word embeddings, i.e., Word2Vec and GloVe, and most of the available word embeddings are trained on noisy content and therefore fail to capture syntactic information efficiently. The capability of the whitespace tokenizer is limited to simple words; it cannot capture compound words. To overcome the limitations of the whitespace tokenizer and capture compound words, one traditional solution is the lexicon-based longest-matching approach. However, this approach has limitations as well: one is out-of-vocabulary (OOV) errors, and another is that it is hard to find a publicly available digital lexicon of the Pashto language. Using the proposed corpus, a specialized AI model can be trained for Pashto word segmentation. With this use-case in view, the corpus is labeled with word-boundary information, where simple whitespaces and word boundaries are explicitly marked with “S” and “B” tags respectively, as shown in Fig. 6.
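The lexicon-based longest-matching baseline mentioned above can be sketched in a few lines. The toy lexicon below is purely illustrative (a real system would load a Pashto lexicon), and the single-character fallback makes the OOV limitation visible: any unseen word is shattered into characters.

```python
# Minimal sketch of lexicon-based longest matching. At each position the
# longest lexicon entry is greedily matched; out-of-vocabulary (OOV)
# characters fall back to single-character tokens, which illustrates the
# OOV weakness discussed in the text. The lexicon here is a toy example.

def longest_match_segment(text: str, lexicon: set, max_len: int = 10):
    """Greedily segment `text` by matching the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            match = text[i]  # OOV fallback: emit a single character
        tokens.append(match)
        i += len(match)
    return tokens

lexicon = {"new", "york", "newyork", "in"}
print(longest_match_segment("innewyork", lexicon))  # ['in', 'newyork']
```

Note that the greedy strategy prefers the compound “newyork” over “new” + “york”, which is exactly the behavior needed for compound words such as “New York”, but it fails silently on any word missing from the lexicon.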
11.3 NER
NER is the process of identifying and classifying named entities (NEs) in text. NEs are words or phrases that refer to specific entities, such as locations, people, and organizations. NER is an important task in NLP, as it supports many downstream applications such as information extraction, machine translation, and question answering. NER is still an open research area for the Pashto language, and a rich corpus is a prerequisite for an efficient NER system. From an NER perspective, a rich corpus is one with many NEs, and NEs are usually proper nouns. In the proposed corpus, around 65% of the sentences contain proper nouns assigned the NNP tag, and this portion of the corpus can easily be extracted. For an NER system, the dataset is typically first labeled with POS tags, e.g., (Sang and De Meulder 2003); these tags can then be used as features for predicting NEs. In our corpus, 2M words are already assigned POS tags, and in addition we have developed a general-purpose automatic POS tagger that can be used to tag more words with state-of-the-art accuracy.
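The extraction step described above, i.e., pulling out the NNP-bearing portion of the corpus as NER training candidates, can be sketched as follows. The sentence representation (a list of word/tag pairs) and the example tags are assumptions for illustration; only the NNP tag itself is taken from the text.

```python
# Minimal sketch (assumed data format): keep only the sentences that
# contain at least one NNP-tagged word, since proper nouns are the
# usual named-entity candidates. Each sentence is a list of
# (word, tag) pairs, as a POS tagger would emit.

def sentences_with_proper_nouns(tagged_sentences):
    """Filter sentences containing at least one NNP-tagged word."""
    return [s for s in tagged_sentences
            if any(tag == "NNP" for _, tag in s)]

corpus = [
    [("Kabul", "NNP"), ("is", "VB"), ("large", "JJ")],
    [("the", "DT"), ("cat", "NNM")],
]
print(len(sentences_with_proper_nouns(corpus)))  # 1
```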
11.4 HOMOGRAPH DISAMBIGUATION
Homographs are words that have the same spelling but different meanings and, sometimes, different pronunciations. For example, the word "تور" can be an adjective meaning “black” or a common singular masculine noun (NNM) meaning “blame”; similarly, the word “ګډه” can be an adjective meaning “mixed” or a common singular feminine noun (NNF) meaning “sheep”. Homograph disambiguation is the process of determining the correct meaning of a homograph in a given context. It is an important task in NLP, as it helps to improve the accuracy of language models. However, it is a challenging task, and POS tags can play an important role in determining the correct meaning of a homograph.
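The role of POS tags in this task can be sketched with the two examples above: once a tagger has predicted a tag in context, the tag selects the sense. The sense table and the adjective tag ("JJ") are illustrative assumptions; only the NNM/NNF tags and the glosses come from the text.

```python
# Minimal sketch of POS-based homograph disambiguation, using the two
# homographs from the text. The sense table is illustrative and the
# adjective tag "JJ" is an assumption; NNM/NNF follow the text.

SENSES = {
    ("تور", "JJ"): "black",    # adjective sense
    ("تور", "NNM"): "blame",   # common singular masculine noun sense
    ("ګډه", "JJ"): "mixed",    # adjective sense
    ("ګډه", "NNF"): "sheep",   # common singular feminine noun sense
}

def disambiguate(word: str, pos_tag: str):
    """Resolve a homograph's sense from its context-predicted POS tag."""
    return SENSES.get((word, pos_tag))

print(disambiguate("تور", "NNM"))  # blame
```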
11.5 CONSTITUENCY AND DEPENDENCY PARSING
Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases (known as constituents). These sub-phrases belong to specific grammatical categories, such as VP (verb phrase) and NP (noun phrase). The output of a constituency parser is typically a parse tree that represents the hierarchical structure of the sentence; the parser uses a set of grammatical rules and a grammar model to analyze the sentence and construct the tree. Dependency parsing is a technique used to identify the syntactic structure of a sentence by analyzing the relationships between words, i.e., the subject, object, and verb of the sentence. The output of a dependency parser is typically a dependency tree/graph representing those relationships. Dependency parsing focuses on the linear structure of the sentence, while constituency parsing focuses on the hierarchical structure. Both techniques have their own advantages and are often used together to better understand a sentence. POS tagging is essential for both constituency and dependency parsing, as it provides important information about the grammatical function of a word in a sentence, e.g., whether it is a noun, verb, or adjective, and this information is used to determine its relationship with other words. For example, consider the sentence "پيشو پۍ اوخوړل" <the cat ate the milk>: to parse it, the parser needs to know that "پيشو" is a noun, "پۍ" is a noun, and "اوخوړل" is a verb. With this information, the parser can identify that "پيشو" is the subject of the sentence and "پۍ" is the object of the verb "اوخوړل". The proposed corpus can thus provide a foundation for these further types of parsing.
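The way POS tags feed into a dependency analysis for the example sentence can be sketched with a deliberately simple heuristic: in a noun–noun–verb (SOV) pattern, take the first noun as subject and the second as object of the verb. This rule and the example tags (NNF, VB) are toy assumptions for illustration, not a real parser.

```python
# Minimal sketch: derive (head, relation, dependent) triples for a
# simple SOV sentence from its POS tags. The "first noun = subject,
# second noun = object" rule is a toy heuristic, not a full parser;
# the tags NNF/VB are assumed for illustration.

def toy_sov_dependencies(tagged_sentence):
    """Extract subject/object dependency triples for a noun-noun-verb
    sentence; return an empty list if the pattern does not apply."""
    nouns = [w for w, t in tagged_sentence if t.startswith("NN")]
    verbs = [w for w, t in tagged_sentence if t.startswith("VB")]
    if len(nouns) < 2 or not verbs:
        return []
    verb = verbs[0]
    return [(verb, "subj", nouns[0]), (verb, "obj", nouns[1])]

# The example sentence from the text, with assumed tags:
sentence = [("پيشو", "NNF"), ("پۍ", "NNF"), ("اوخوړل", "VB")]
print(toy_sov_dependencies(sentence))
```

A real dependency parser replaces this heuristic with a learned model, but the sketch shows why the POS layer of the corpus is the prerequisite: without the noun/verb distinction, neither triple can be formed.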