In automated language processing, identifying Moroccan Arabic (Darija) in a multilingual environment remains a major challenge. This study addresses the detection of Moroccan Arabic in a binary-labeled dataset consisting of Standard Arabic and Darija expressions.
Our methodology combines several feature extraction and selection techniques. For extraction, we use well-known natural language processing (NLP) techniques such as TF-IDF, CBOW, and Word2Vec. For selection, we apply machine learning techniques, namely LASSO as well as decision and regression trees, and semantic methods that rely mostly on advanced encoders to capture what distinguishes the two varieties. We also examine statistical feature selection methods, such as ANOVA, the Pearson correlation coefficient, and mutual information, to add further layers of analysis.
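As a rough illustration of two of the techniques named above, TF-IDF weighting and mutual-information scoring, the following sketch uses only the Python standard library. The toy sentences, the whitespace tokenizer, and the helper names `tfidf` and `mutual_information` are assumptions made for illustration; they are not the study's data or implementation.

```python
import math
from collections import Counter

# Hand-written toy sentences (NOT the study's dataset).
# Label 0 = Standard Arabic, label 1 = Darija.
samples = [
    ("كيف حالك اليوم", 0),
    ("أريد أن أذهب إلى المنزل", 0),
    ("لا أعرف ماذا أريد", 0),
    ("كيفاش داير اليوم", 1),
    ("بغيت نمشي للدار", 1),
    ("ما عرفتش شنو بغيت", 1),
]
tokens = [text.split() for text, _ in samples]  # naive whitespace tokenizer
labels = [label for _, label in samples]


def tfidf(tokenized_docs):
    """Return one {token: weight} dict per document, where
    weight = (count / document length) * ln(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            tok: (c / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return weights


def mutual_information(tokenized_docs, labels):
    """Mutual information (in nats) between the presence of each token
    in a document and the document's class label."""
    n_docs = len(tokenized_docs)
    vocab = {tok for doc in tokenized_docs for tok in doc}
    scores = {}
    for tok in vocab:
        # Joint counts over (token present?, label).
        joint = Counter((tok in doc, y) for doc, y in zip(tokenized_docs, labels))
        present = Counter()
        label_count = Counter()
        for (x, y), c in joint.items():
            present[x] += c
            label_count[y] += c
        # Sum of p(x, y) * ln(p(x, y) / (p(x) * p(y))) over observed cells.
        scores[tok] = sum(
            (c / n_docs) * math.log(c * n_docs / (present[x] * label_count[y]))
            for (x, y), c in joint.items()
        )
    return scores


weights = tfidf(tokens)
mi = mutual_information(tokens, labels)
ranked = sorted(mi, key=mi.get, reverse=True)  # most class-informative tokens first
```

In this toy setting, a token that occurs equally often in both classes receives a mutual-information score of zero, whereas a token confined to one class scores higher, which is the property a filter-style selection step relies on.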
Finally, we emphasize dimensionality reduction through principal component analysis (PCA) and singular value decomposition (SVD), which preserves the essential structure of the high-dimensional linguistic space and supports the accurate detection of Moroccan Arabic. Stable results are achieved with the XGBoost algorithm using classical extraction methods and SVD, with a reasonable execution time.
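A minimal sketch of such a pipeline (classical extraction, SVD reduction, then a gradient-boosted tree classifier) might look as follows, assuming scikit-learn is available. The toy sentences, the choice of four SVD components, and the use of scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost are all assumptions made to keep the example self-contained; none of them reflects the study's actual configuration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hand-written toy sentences (NOT the study's dataset).
# Label 0 = Standard Arabic, label 1 = Darija.
texts = [
    "كيف حالك اليوم",
    "أريد أن أذهب إلى المنزل",
    "لا أعرف ماذا أريد",
    "هل يوجد شيء جديد",
    "كيفاش داير اليوم",
    "بغيت نمشي للدار",
    "ما عرفتش شنو بغيت",
    "واش كاين شي جديد",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = make_pipeline(
    # Classical extraction: sparse word-level TF-IDF features.
    TfidfVectorizer(),
    # Truncated SVD works directly on the sparse TF-IDF matrix.
    # Four components is an arbitrary choice for this toy data.
    TruncatedSVD(n_components=4, random_state=0),
    # Stand-in for XGBoost (assumption): a gradient-boosted tree ensemble.
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)
pipeline.fit(texts, labels)

# Inspect the reduced representation produced by the first two steps.
reduced = pipeline[:-1].transform(texts)
predictions = pipeline.predict(texts).tolist()
```

If the `xgboost` package is installed, its `XGBClassifier` follows the scikit-learn estimator interface and could replace the last step. Predictions on the training sentences only confirm that the pipeline runs; they say nothing about generalization.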
This research aims not only to reveal the specific subtleties of Arabic dialects, but also to open new directions for natural language processing in diverse, multilingual societies.